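Fastest way to remove/replace substrings from a long text
I have a very large corpus from which I want to remove certain words. It is similar to removing stop words from a text, except that I now want to remove bigrams from the corpus. I have my list of bigrams, but obviously the simple list-comprehension approach used for stop words won't cut it. I am thinking of using a regex: compile a pattern from the list of words and then substitute them. Here is some sample code: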
txt = 'He was the type of guy who liked Christmas lights on his house in the middle of July. He picked up trash in his spare time to dump in his neighbors yard. If eating three-egg omelets causes weight-gain, budgie eggs are a good substitute. We should play with legos at camp. She cried diamonds. She had some amazing news to share but nobody to share it with. He decided water-skiing on a frozen lake wasn’t a good idea. His eyes met mine on the street. When he asked her favorite number, she answered without hesitation that it was diamonds. She is never happy until she finds something to be unhappy about; then, she is overjoyed.'
--
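import re
words_to_remove = ['this is', 'We should', 'Christmas lights']
# Joining with ' | ' leaves literal spaces inside the alternatives:
# the pattern is 'this is | We should | Christmas lights'.
pattrn = re.compile(r' | '.join(words_to_remove))
pattrn.sub(' ', txt)
# %timeit is an IPython magic; run this line in IPython or Jupyter.
%timeit pattrn.sub(' ', txt)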
--
timeit 1: 9.18 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
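Is there a faster way to remove these bigrams? The actual corpus is 556,694,135 characters long and there are 3,205,182 bigrams, so this is really slow on the real dataset.
Solution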
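You can rewrite your regex so that it has the structure of a trie (instead of word|worse|wild, use w(or(d|se)|ild)), or, even better, ditch the regex and use the Aho–Corasick algorithm. Of course, you can use a library for this, for example FlashText (a slimmed-down version of Aho-Corasick that specializes in searching for and replacing whole words, as in your case).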
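As a rough illustration of the trie-shaped regex idea, here is a minimal sketch using only the standard library. The helper name build_trie_pattern is made up for this example, and the \b word boundaries are an assumption about what "whole word" should mean for your data. The helper is recursive, so very long phrases could hit Python's recursion limit, and I have not measured how a pattern built from millions of bigrams would behave.

```python
import re

def build_trie_pattern(phrases):
    """Return a regex string whose alternation is factored like a trie."""
    # Build a character trie; the empty-string key marks "a phrase ends here".
    trie = {}
    for phrase in phrases:
        node = trie
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[""] = True

    def to_regex(node):
        ends_here = "" in node
        branches = [re.escape(ch) + to_regex(node[ch])
                    for ch in sorted(k for k in node if k != "")]
        if not branches:
            return ""
        body = branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"
        if ends_here:
            # A shorter phrase ends at this node, so the remainder is optional.
            body = "(?:" + body + ")?"
        return body

    return to_regex(trie)

# The example from the answer: word|worse|wild shares its prefixes.
print(build_trie_pattern(["word", "worse", "wild"]))  # w(?:ild|or(?:d|se))

words_to_remove = ['this is', 'We should', 'Christmas lights']
pattern = re.compile(r"\b" + build_trie_pattern(words_to_remove) + r"\b")
cleaned = pattern.sub(" ", "We should play with legos at camp.")
```

The point of the factoring is that the regex engine no longer has to try every alternative at every position; shared prefixes are examined once.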
The author of FlashText claims: "Regex was taking 5 days to run. So I built a tool that did it in 15 minutes."
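To show why a whole-word approach can scale better, here is a simplified sketch of the underlying idea: a trie keyed on words, scanned once over the tokens. This is not FlashText's implementation or API, and it is not full Aho-Corasick either (there are no failure links, which matters little for two-word phrases). The function names are hypothetical. It assumes whitespace tokenization, so a phrase followed directly by punctuation will not match, and runs of whitespace are collapsed to single spaces.

```python
def build_word_trie(phrases):
    """Build a trie keyed on words; the None key marks the end of a phrase."""
    trie = {}
    for phrase in phrases:
        node = trie
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[None] = True
    return trie

def remove_phrases(text, trie):
    """Drop every longest phrase match found while scanning the tokens once."""
    tokens = text.split()
    kept = []
    i = 0
    while i < len(tokens):
        node = trie
        j = i
        match_end = None
        # Walk the trie as far as the tokens allow, remembering the last full match.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if None in node:
                match_end = j
        if match_end is None:
            kept.append(tokens[i])  # no phrase starts here; keep the token
            i += 1
        else:
            i = match_end           # skip the matched phrase
    return " ".join(kept)

trie = build_word_trie(['this is', 'We should', 'Christmas lights'])
cleaned = remove_phrases("We should play with legos at camp.", trie)
```

Each token costs at most a few dictionary lookups regardless of how many phrases are in the trie, which is what makes this kind of approach attractive for millions of bigrams.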