How to count trigram frequencies beside phrases in text using Python
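I'm trying to count the trigrams on either side of a phrase (one from a list) in text files using Python. For example, in a sentence like this:
"We know that, in many cases, people will from time to time place greater emphasis"
If "from time to time" is the phrase in question, then "in many cases" and "there is a" would be the word groups of interest. A version of this code works for single words in the list, but for phrases the output is completely wrong.
For the phrase "from time to time", the output should look like this:
from time to time {'there': 4, 'Could': 12, 'scotland': 12, 'imagine': 15}
Instead, it actually looks like this.
My code is as follows: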
```python
import csv
import glob
import os
from collections import Counter
from io import TextIOWrapper
from string import digits  # assumption: `digits` is not defined in the snippet as posted

# `path`, `word_list`, `stopsb` and `stopws` are defined elsewhere and are not shown in the question.

context_d = {}
for filename in glob.glob(os.path.join(path, '*.txt')):
    if filename.endswith('.txt'):
        f = open(filename)
        file = f.read()
        # txt = file.lower()
        txt = file.split()
        txt = [item.replace('May', '') for item in txt]  # locate and replace all months of May before lowering
        # txt = list([[word.lower() for word in line.split()] for line in txt])
        txt = (list(map(lambda x: x.lower(), txt)))
        for n in range(len(word_list)):
            for j in range(len(txt)):
                if (j + 3) < len(txt):
                    if txt[j] in word_list[n]:
                        if txt[j] in context_d:
                            context_d[txt[j]] += txt[(j - 3):j]
                            context_d[txt[j]] += txt[(j + 1):(j + 3)]
                        else:
                            context_d[txt[j]] = txt[(j - 3):j]
                            context_d[txt[j]] += txt[(j + 1):(j + 3)]
        print(filename)

# clean the symbols
for word in context_d:
    # print(word)
    for i in range(len(context_d[word])):
        for letter in context_d[word][i]:
            if letter in stopsb:
                # print(context_d[word][i])
                context_d[word][i] = context_d[word][i].replace(letter, '')

# count the frequency
context_freq = dict()
for word in context_d:
    # print(word)
    context_freq[word] = dict(Counter(context_d[word]))

# top 300
top300_context = dict()
for word in context_freq:
    # print(word)
    if len(context_freq[word]) > 300:
        top300_context[word] = dict(Counter(context_freq[word]).most_common(300))
    else:
        top300_context[word] = context_freq[word]

# clean stop words
cleaned_context = top300_context
for word in top300_context:
    # print(word)
    for context in list(top300_context[word]):
        if context in stopws or len(context) == 0 or len(context) == 1 or any(
                component in digits for component in context):
            del (cleaned_context[word][context])

# top 30
top30_context = dict()
for word in cleaned_context:
    # print(word)
    if len(cleaned_context[word]) > 30:
        top30_context[word] = dict(Counter(cleaned_context[word]).most_common(30))
    else:
        top30_context[word] = cleaned_context[word]

path = 'D:/Testing10'
os.chdir(path)
with open('Words_Context.csv', 'wb') as f, TextIOWrapper(f, encoding='utf-8', newline='') as wrapper:
    csvwriter = csv.writer(wrapper)
    sort_top30_context = sorted(top30_context.items(), key=lambda x: x[0], reverse=False)
    for i in sort_top30_context:
        print(i[0], i[1])
        csvwriter.writerow([i[0]] + list(i[1]))
        csvwriter.writerow([""] + list(i[1].values()))
```
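
One possible cause, assuming `word_list` holds plain strings such as `'from time to time'`: the test `txt[j] in word_list[n]` is then a substring check on a single token. It succeeds for fragments like `'time'` or `'to'`, but a single token can never equal the whole multi-word phrase, so the contexts end up keyed by individual words. A minimal sketch of an alternative is to compare a window of tokens against the tokenised phrase and then take three tokens on each side. The helper name `context_trigrams` and the sample sentence are hypothetical and only for illustration.

```python
from collections import Counter


def context_trigrams(tokens, phrase, width=3):
    """Count the `width`-token groups immediately before and after each
    occurrence of `phrase` in `tokens` (hypothetical helper)."""
    target = phrase.lower().split()
    n = len(target)
    counts = Counter()
    for j in range(len(tokens) - n + 1):
        # Compare a window of n tokens with the whole phrase,
        # instead of testing one token with `in`.
        if tokens[j:j + n] == target:
            before = tokens[max(0, j - width):j]  # max() avoids a negative slice start
            after = tokens[j + n:j + n + width]
            # Only count complete groups of `width` tokens.
            if len(before) == width:
                counts[" ".join(before)] += 1
            if len(after) == width:
                counts[" ".join(after)] += 1
    return counts


# Made-up sample text, lower-cased and split on whitespace as in the question.
sample = "in many cases we see from time to time that people do change".split()
result = context_trigrams(sample, "from time to time")
print(dict(result))
```

Two details in the original slices may also be worth checking: `txt[(j + 1):(j + 3)]` yields only two tokens, and `txt[(j - 3):j]` silently yields an empty list when `j < 3` because the start index goes negative. If counts of individual context words are wanted rather than whole trigrams, `counts.update(before)` and `counts.update(after)` could replace the joined keys.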