Using my own stopword list in gensim.corpora.textcorpus.TextCorpus


In gensim 4.0, subclasses of gensim.corpora.textcorpus.TextCorpus apply default preprocessing, which includes remove_stopwords(). This function uses the stopword list stored in gensim.parsing.preprocessing.STOPWORDS.

How can I replace this list with my own? I can do the following:

import gensim 
gensim.parsing.preprocessing.STOPWORDS = frozenset({'aber','alle','allem','allen' })

This takes effect for gensim.parsing.preprocessing.remove_stopwords(s), so the following works as expected:

gensim.parsing.preprocessing.remove_stopwords("aber alle lachten")
> 'lachten'

But when I process my files with the class gensim.corpora.textcorpus.TextDirectoryCorpus (a subclass of TextCorpus), my list is not used. For example:

import os
os.mkdir('test123')
with open('test123/test.txt','w') as fout:
    fout.write('aber alle lachten \n allen gefallen \n')

corpus = gensim.corpora.textcorpus.TextDirectoryCorpus('test123')    

for text in corpus.get_texts():
    print(text)

> ['aber','lachten','allen','gefallen']

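I know I could write my own subclass and override the method that preprocesses the files, but that seems like overkill just to replace the stopword list.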

Solution

If you look at the source code of gensim.corpora.textcorpus.py...

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/corpora/textcorpus.py

...you can see:

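First, the module has its own remove_stopwords() function. It consults gensim.parsing.preprocessing.STOPWORDS, but (in the manner of Python default arguments) only at the moment the function is defined. (Also, somewhat confusingly, although this gensim.corpora.textcorpus.remove_stopwords() function shares its (module-unqualified) name with the other function, gensim.parsing.preprocessing.remove_stopwords(), it takes a list of tokens, whereas the other takes a whitespace-delimited string.)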

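This behavior could arguably be improved, to be easier to understand and customize, by respecting the current value of the STOPWORDS variable on every call. But as it is written now, at the moment the function is defined (when your code imports the module), it "captures" the then-current value of gensim.parsing.preprocessing.STOPWORDS. If you change that value afterwards, the function still refers to the old stopword set.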
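The capture effect described above can be reproduced without Gensim at all. The following is only a minimal sketch: the STOPWORDS name and the remove_stopwords() function here are local stand-ins written for illustration, not the library's own code.

```python
# Local stand-in for gensim.parsing.preprocessing.STOPWORDS (illustrative only).
STOPWORDS = frozenset({'the', 'and'})


def remove_stopwords(tokens, stopwords=STOPWORDS):
    # The default for `stopwords` was evaluated once, when this `def` ran.
    return [token for token in tokens if token not in stopwords]


# Rebinding the module-level name later does not change the captured default.
STOPWORDS = frozenset({'aber', 'alle'})

tokens = ['aber', 'alle', 'the', 'lachten']
captured_default = remove_stopwords(tokens)
explicit_argument = remove_stopwords(tokens, stopwords=STOPWORDS)

print(captured_default)   # ['aber', 'alle', 'lachten']
print(explicit_argument)  # ['the', 'lachten']
```

The first call still filters with the set that existed when the function was defined; only passing the new set explicitly makes it take effect.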

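Second, the TextCorpus class (which TextDirectoryCorpus uses), in the __init__() method that initializes new objects, will use this local remove_stopwords() (along with remove_short()) as the default value of token_filters if nothing else is specified. So simply specifying your own token_filters is enough to take full control of which stopwords are removed, with no new subclass required.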

So you have two possible ways to get the behavior you want:

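Replace the value of gensim.parsing.preprocessing.STOPWORDS before the gensim.corpora.textcorpus module defines its functions. (Given that various Gensim imports may pull in other modules automatically, this could be tricky, or it could be easy; I haven't checked.)
Specify your own token_filters when initializing your TextDirectoryCorpus, so that the previously defined function that captured the old STOPWORDS is never used. For example, something like this might be enough: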

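A hedged sketch of the second option follows. The names MY_STOPWORDS, drop_short, drop_my_stopwords and build_corpus are made up for this illustration. Both filters are written locally rather than imported, because the names of the built-in filter functions may differ between Gensim versions; drop_short is only assumed to mirror the default short-token filter (minimum length 3). The one Gensim feature relied on is the token_filters keyword argument.

```python
# Custom stopword set taken from the question.
MY_STOPWORDS = frozenset({'aber', 'alle', 'allem', 'allen'})


def drop_short(tokens, minsize=3):
    # Local stand-in for the default short-token filter (assumed minsize of 3).
    return [token for token in tokens if len(token) >= minsize]


def drop_my_stopwords(tokens, stopwords=MY_STOPWORDS):
    # A token filter receives a list of tokens and returns the filtered list.
    return [token for token in tokens if token not in stopwords]


def build_corpus(directory):
    # Imported lazily so the two filters above can be used without Gensim.
    from gensim.corpora.textcorpus import TextDirectoryCorpus

    # Passing token_filters replaces the default filters entirely, so the
    # function that captured the old STOPWORDS is never called.
    return TextDirectoryCorpus(
        directory,
        token_filters=[drop_short, drop_my_stopwords],
    )
```

Calling build_corpus('test123') and iterating over get_texts(), as in the question, should then apply the custom list instead of the captured one.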
