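How to fix slow reading and writing of a large text file in Python
This code walks through a large 5.1 GB text file and checks for words that occur fewer than 100 times. It then rewrites the 5.1 GB into an output text file, replacing those words with <unk>. The main problem is that creating output.txt takes a very long time. I suspect the write_text() method is causing the problem in the way it opens the dataset file and the output file.
The goal behind this script: I have a prebuilt vocabulary and a text. The text may contain new words that are not in my vocabulary, so I want to add them. But I only want to add the relevant new words (those that occur at least 100 times). New words that occur fewer than 100 times are one-offs and unimportant, so I want to change them to "<unk>".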
from collections import Counter

extra_words = []
new_words = []
add_words = []

def get_vocab():
    vocab = set()
    with open('vocab.txt', 'r', encoding='utf-8') as rd:
        lines = rd.readlines()
        for line in lines:
            tokens = line.split(' ')
            word = tokens[0]
            vocab.add(word)
    return vocab

def _count(text):
    vocab = get_vocab()
    with open(text, encoding='utf-8') as fd:
        for line in fd.readlines():
            for token in line.split():
                if token not in vocab:
                    extra_words.append(token)
    word_count = Counter(extra_words)
    # add del word_count[punctuation] to remove it from list
    #del word_count['"']
    for word in word_count:
        if word_count[word] < 100:
            new_words.append(word)
        else:
            add_words.append(word)
    write_text()
    #return len(new_words),word_count.most_common()[0]

def write_text():
    with open('dataset', encoding='utf-8') as fd:
        f = fd.readlines()
    with open('output.txt', 'w', encoding='utf-8') as rd:
        new_text = []
        for line in f:
            new_line = []
            for token in line.split():
                if token in new_words:
                    new_line.append('<unk>')
                else:
                    new_line.append(token)
            new_text.append(' '.join(new_line))
        print('\n'.join(new_text), file=rd)
        #print(' '.join(new_line),file=rd)

def add_vocab():
    ln = len(get_vocab())
    # Open in append mode so the new entries can actually be written.
    with open('vocab.txt', 'a', encoding='utf-8') as fd:
        # enumerate() supplies the index that the original loop tried to unpack.
        for idx, word in enumerate(add_words):
            print(f'{word} {ln + idx + 1}', file=fd)

_count('dataset')  # returns nothing, so there is no value to print
add_vocab()
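Solution
I tested this with the complete works of Shakespeare. You still have a lot of work to do related to case and punctuation. It ran against 100 copies of his works (about 500 MB) in roughly 15 seconds for me. If this still takes an unacceptable amount of time, you might want to look at profiling your code. Note that I used a simplified version of your vocabulary file, because I did not follow what you wanted to see in it. The version I used was just one word per line.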
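To follow up on the profiling suggestion, here is a minimal sketch using the standard library's cProfile and pstats modules. The functions replace_rare and profile_call are hypothetical helpers made up for this illustration; they are not part of the question or the answer.

```python
import cProfile
import io
import pstats


def replace_rare(tokens, rare_words):
    """Hypothetical stand-in for the rewrite step: swap rare tokens for <unk>."""
    return ["<unk>" if token in rare_words else token for token in tokens]


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep the 10 most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile_call(replace_rare, ["to", "be", "zounds"], {"zounds"})
print(report)
```

On a real run you would pass the function that processes the file instead of replace_rare, then read the report to see which call dominates the total time.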
import collections

def get_vocabulary(path):
    # One word per line; a set gives fast membership tests.
    with open(path, 'r', encoding='utf-8') as file_in:
        tokens = [line.strip("\n") for line in file_in]
    return set(tokens)

def get_interesting_word_counts(path, vocabulary):
    # Count only the tokens that are not already in the vocabulary.
    word_counts = collections.Counter()
    with open(path, encoding='utf-8') as file_in:
        for line in file_in:
            word_counts.update([token for token in line.split() if token not in vocabulary])
    return word_counts

def get_cleaned_text(path, vocabulary, uncommon_words):
    # Generator: yields one cleaned line at a time instead of building the whole text.
    with open(path, encoding='utf-8') as file_in:
        for line in file_in:
            #line_out = " ".join(["<unk>" if token in uncommon_words else token for token in line.strip("\n").split()])
            line_out = " ".join([
                token if token in vocabulary or token not in uncommon_words else "<unk>"
                for token in line.strip("\n").split()
            ])
            yield "{}\n".format(line_out)

vocabulary = get_vocabulary("vocabulary.txt")
word_counts = get_interesting_word_counts("shakespeare.txt", vocabulary)

## --------------------------------------
## Add frequent but missing words to vocabulary
## --------------------------------------
common_words = set([item[0] for item in word_counts.items() if item[1] >= 100])
with open('vocabulary.txt', 'a', encoding='utf-8') as file_out:
    for word in common_words:
        file_out.write("{}\n".format(word))
## --------------------------------------

## --------------------------------------
## Rewrite the text censoring uncommon words
## --------------------------------------
uncommon_words = set([item[0] for item in word_counts.items() if item[1] < 100])
# get_cleaned_text takes three arguments, so the vocabulary must be passed too.
cleaned_text = get_cleaned_text("shakespeare.txt", vocabulary, uncommon_words)
with open('shakespeare_out.txt', 'w', encoding='utf-8') as file_out:
    file_out.writelines(cleaned_text)
## --------------------------------------
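One likely reason for the difference in speed (an inference from the two scripts, not something the answer states) is that the question tests `token in new_words` against a list, which is scanned element by element, while the answer tests against sets. The sketch below compares the two on made-up data; the word list, the token list, and the censor function are assumptions for illustration only.

```python
import timeit

# Hypothetical data: 20,000 distinct rare words, stored both as a list and a set.
rare_list = [f"word{i}" for i in range(20_000)]
rare_set = set(rare_list)
tokens = ["word19999", "the", "missing"] * 10


def censor(tokens, rare_words):
    """Replace every token found in rare_words with <unk>."""
    return ["<unk>" if token in rare_words else token for token in tokens]


# Both containers give the same answer; only the lookup cost differs.
assert censor(tokens, rare_list) == censor(tokens, rare_set)

list_seconds = timeit.timeit(lambda: censor(tokens, rare_list), number=20)
set_seconds = timeit.timeit(lambda: censor(tokens, rare_set), number=20)
print(f"list: {list_seconds:.4f}s  set: {set_seconds:.4f}s")
```

With a list the cost of each lookup grows with the number of rare words, so on a multi-gigabyte file with many distinct rare words the rewrite step can dominate the run time. Reading line by line and writing through a generator, as the answer does, also avoids holding the whole file in memory.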
You can get the text I used here: http://www.gutenberg.org/ebooks/100
The source text begins:
The Project Gutenberg eBook of The Complete Works of William Shakespeare, by William Shakespeare
The generated file begins:
<unk> <unk> <unk> <unk> of The <unk> <unk> of <unk> <unk> by <unk> <unk>
The updated vocabulary file begins:
as
run
he’s
this.
there’s
like
you.
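Entries such as "this." and "you." show the case and punctuation work the answer says is still left to do. As a rough sketch of one possible approach (an assumption, not the answer's method), tokens can be lowercased and stripped of punctuation at their edges before counting. The names EDGE_PUNCTUATION, normalize, and normalized_tokens are hypothetical.

```python
import string

# Assumption: ASCII punctuation plus a few typographic quotes are stripped
# from the edges of a token; characters inside the token are left alone.
EDGE_PUNCTUATION = string.punctuation + "“”‘’"


def normalize(token):
    """Lowercase a token and strip punctuation from both ends."""
    return token.strip(EDGE_PUNCTUATION).lower()


def normalized_tokens(line):
    """Split a line on whitespace and drop tokens that were only punctuation."""
    return [word for word in (normalize(t) for t in line.split()) if word]


print(normalized_tokens("This. you. he’s -- Like"))
# → ['this', 'you', 'he’s', 'like']
```

If something like this is used, the same normalization would need to be applied when loading the vocabulary, when counting, and when rewriting the text, otherwise the lookups will not match.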