Elasticsearch phrase suggester problem with ngram-indexed data

How to solve the Elasticsearch phrase suggester problem with ngram-indexed data

I need to implement a phrase suggester for spell-checking queries. My data is indexed with an analyzer that uses an edge_ngram tokenizer.

"suggestion_tokenizer": {
  "type": "edge_ngram",
  "min_gram": 2,
  "max_gram": 10,
  "token_chars": ["letter", "digit", "symbol"]
}

I use the phrase suggester with this configuration:

"suggest": {
  "text": "helo worl",
  "custom_suggester": {
    "phrase": {
      "field": "item.title",
      "max_errors": 3,
      "size": 5,
      "direct_generator": [{
        "field": "item.title",
        "prefix_length": 0,
        "max_edits": 1,
        "min_word_length": 3
      }]
    }
  }
}

When I run the phrase suggester, it handles misspelled words well, for example:

"helo world" ---> "hello world"

The problem appears with a query like this:

"helo worl" ---> "hello worl"

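The phrase suggester correctly fixes "helo" to "hello", but it does not fix "worl", which is missing the letter "d". The reason is that "worl" already exists in the inverted index: the edge_ngram tokenizer generated it at index time as a prefix of "world", so ES treats it as a correctly spelled term instead of looking for a match such as "world".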
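
To see why "worl" counts as a known term, the tokenizer can be run through the _analyze API. The body below is a sketch for POST /_analyze, assuming the same tokenizer settings as in the question; the tokenizer is defined inline (rather than by its name, suggestion_tokenizer) only so the request does not depend on a particular index.

```json
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 10,
    "token_chars": ["letter", "digit", "symbol"]
  },
  "text": "world"
}
```

For "world" this should return the tokens wo, wor, worl and world. Since "worl" is therefore present in the index, a direct generator running with the default suggest_mode of missing produces no candidates for it.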

How can I solve this?

Solution

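I solved this by using a trigram analyzer with a shingle filter, as described here. The downside is that I had to reindex all the data so that ES could build an inverted index of word sequences (e.g. "hello world").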
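
For reference, a trigram analyzer with a shingle filter might be configured roughly as follows. This is a sketch modeled on the example in the Elasticsearch phrase suggester documentation, not necessarily the exact settings used here. The analyzer name trigram, the filter name shingle_2_3 and the subfield item.title.trigram are assumptions chosen for illustration, and the parent field's own edge_ngram analyzer is left out for brevity.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "shingle_2_3": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        }
      },
      "analyzer": {
        "trigram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "shingle_2_3"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "item": {
        "properties": {
          "title": {
            "type": "text",
            "fields": {
              "trigram": {
                "type": "text",
                "analyzer": "trigram"
              }
            }
          }
        }
      }
    }
  }
}
```

Because the shingle filter emits whole words and word sequences instead of prefixes, "worl" no longer appears in this subfield as an indexed term, which is what lets the suggester flag it.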

Another thing that improved the results was adding

"suggest_mode": "always"

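With this setting, the direct generator gives the phrase suggester more candidates for each term, which it then scores with the ngram language model. In my case, the results were better.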
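
Putting the pieces together, the search request body might look something like the sketch below. It assumes a hypothetical subfield item.title.trigram that is analyzed with the shingle-based trigram analyzer; that subfield name is not from the original configuration. The direct generator keeps the options from the question and only adds suggest_mode. Which field the generator should read from depends on the mapping, so treat that choice as an assumption too.

```json
{
  "suggest": {
    "text": "helo worl",
    "custom_suggester": {
      "phrase": {
        "field": "item.title.trigram",
        "max_errors": 3,
        "size": 5,
        "direct_generator": [
          {
            "field": "item.title",
            "suggest_mode": "always",
            "prefix_length": 0,
            "max_edits": 1,
            "min_word_length": 3
          }
        ]
      }
    }
  }
}
```

With suggest_mode set to always, candidates are generated even for terms that already exist in the index, so the shingle-based language model gets to decide which correction, if any, is the more likely phrase.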