
Whoosh: phrase frequency within a single document

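I am trying to find how often a phrase occurs in a text. However, when a single document contains the phrase several times, Whoosh still counts the whole document as one hit instead of counting each occurrence of the phrase. Example: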

from whoosh.analysis import StandardAnalyzer
from whoosh.fields import Schema, STORED, TEXT
from whoosh.index import create_in
from whoosh.qparser import QueryParser

# Fragment from inside a class; the "index" directory must already exist
self.analyzer = StandardAnalyzer(expression=r'([.,!?;:]+|\w+((\-|\'|\.)?\w+)*)', minsize=1, stoplist=[])
self.schema = Schema(tag=STORED, content=TEXT(analyzer=self.analyzer))
self.index = create_in("index", self.schema)
self.parser = QueryParser('content', self.index.schema)
writer = self.index.writer()
writer.add_document(tag=u"tag1", content=u"One two Search Phrase three four Search Phrase")
writer.add_document(tag=u"tag2", content=u"Foo bar Search Phrase foo bar")
writer.commit()
self.searcher = self.index.searcher()

query = self.parser.parse('"Search Phrase"')  # the phrase we need to find
results = self.searcher.search(query, limit=None)

# This yields only 2 hits, because each document containing the phrase counts once.
# How can we get 3, one per occurrence of the phrase?
res_count = len(results)

For single terms, frequency counts are available:

# Number of times content:wobble appears in all documents
freq = searcher.frequency("content","wobble")

# Number of documents containing content:wobble
docfreq = searcher.doc_frequency("content","wobble")

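But the code above does not work for phrases. Is there something similar for phrases? Am I missing something? I did not find anything useful in the documentation. Thanks a lot for your help!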
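To make the expected result concrete, here is a minimal sketch of the count I am after, written in plain Python rather than with any Whoosh API. It assumes the raw text of each document is still available (in the schema above, `content` is indexed but not stored) and that tokens are lowercased, which I believe is the `StandardAnalyzer` default. The names `tokenize` and `count_phrase` are made up for illustration.

```python
import re

# Same token pattern as the analyzer in the question, with non-capturing groups.
TOKEN_RE = re.compile(r"[.,!?;:]+|\w+(?:[-'.]?\w+)*")


def tokenize(text):
    """Split text into lowercased tokens (hypothetical helper, not Whoosh)."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def count_phrase(text, phrase):
    """Count positions where the phrase tokens appear consecutively in text."""
    tokens = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return 0
    # Slide a window of n tokens over the text and compare it to the phrase.
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)


docs = {
    "tag1": "One two Search Phrase three four Search Phrase",
    "tag2": "Foo bar Search Phrase foo bar",
}

# Per-document phrase counts and the total over all documents.
counts = {tag: count_phrase(text, "Search Phrase") for tag, text in docs.items()}
total = sum(counts.values())
print(counts, total)
```

For the two example documents this gives 2 occurrences for tag1 and 1 for tag2, so 3 in total, which is the number I would like to get from the index. The window comparison counts overlapping matches too, which only matters for phrases that repeat the same word.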
