Counting trigram frequency next to phrases in text using Python

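I am trying to count the trigrams on either side of phrases (taken from a list) in text files using Python. For example, in a sentence such as this:

“We know that, in many cases, people from time to time attach greater weight”

If “from time to time” is the phrase in question, then “in many cases” and “there is a” are the trigrams of interest. A version of this code works for single words in the list, but for phrases the output is completely wrong.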
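
To make the target windows concrete, here is a minimal sketch of what I mean by the trigrams on either side of a phrase. It assumes plain whitespace tokenisation; `context_windows` and the toy sentence are made up for illustration and are not part of my script.

```python
def context_windows(tokens, phrase, size=3):
    """Yield (left, right) token windows around each occurrence of phrase."""
    target = phrase.lower().split()
    n = len(target)
    for i in range(len(tokens) - n + 1):
        # Compare a whole slice of tokens against the whole phrase
        if [t.lower() for t in tokens[i:i + n]] == target:
            left = tokens[max(0, i - size):i]      # up to `size` tokens before
            right = tokens[i + n:i + n + size]     # up to `size` tokens after
            yield left, right


# Toy sentence, invented purely for illustration
tokens = "we know that in many cases from time to time there is a delay".split()
windows = list(context_windows(tokens, "from time to time"))
print(windows)
```

The point is that the phrase is matched as a run of consecutive tokens, so the left window ends just before its first word and the right window starts just after its last word.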

For the phrase “from time to time”, the output should look like this:

from time to time {'there': 4, 'Could': 12, 'scotland': 12, 'imagine': 15}

The actual output looks like this.

My code is as follows:

```python
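import csv
import glob
import os
from collections import Counter
from io import TextIOWrapper
from string import digits

# word_list (target words/phrases), stopsb (symbols to strip) and stopws
# (stop words) are assumed to be defined earlier; they are not shown here.
path = 'D:/Testing10'  # folder holding the .txt files; reused below for the CSV

context_d = {}
for filename in glob.glob(os.path.join(path, '*.txt')):
    with open(filename) as f:
        txt = f.read().split()
    # Locate and remove the month "May" before lowering
    txt = [item.replace('May', '') for item in txt]
    # Lower-case, skipping tokens emptied by the replacement above
    txt = [item.lower() for item in txt if item]
    for target in word_list:
        for j in range(len(txt)):
            if (j + 3) < len(txt):
                # NOTE: this tests a single token against the target string, so
                # a multi-word phrase is never matched as a whole; any token
                # that is a substring of the phrase matches instead.
                if txt[j] in target:
                    # Up to three tokens before and three tokens after the match
                    context = txt[max(0, j - 3):j] + txt[(j + 1):(j + 4)]
                    context_d.setdefault(txt[j], []).extend(context)
    print(filename)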

# Clean the symbols: strip every character listed in stopsb from each context word
for word in context_d:
    for i in range(len(context_d[word])):
        for letter in context_d[word][i]:
            if letter in stopsb:
                context_d[word][i] = context_d[word][i].replace(letter, '')

# Count the frequency of each context word
context_freq = {word: Counter(words) for word, words in context_d.items()}

# Keep the 300 most frequent context words per target
# (most_common also orders them by count)
top300_context = {
    word: dict(freq.most_common(300))
    for word, freq in context_freq.items()
}

# Remove stop words, empty or one-character strings, and anything containing a digit
cleaned_context = {
    word: {
        context: count
        for context, count in freqs.items()
        if context not in stopws
        and len(context) > 1
        and not any(component in digits for component in context)
    }
    for word, freqs in top300_context.items()
}

# Keep the 30 most frequent remaining context words per target
top30_context = {
    word: dict(Counter(freqs).most_common(30))
    for word, freqs in cleaned_context.items()
}


path = 'D:/Testing10'
os.chdir(path)
with open('Words_Context.csv','wb') as f,TextIOWrapper(f,encoding='utf-8',newline='') as wrapper:
    csvwriter = csv.writer(wrapper)
    sort_top30_context = sorted(top30_context.items(),key=lambda x: x[0],reverse=False)

    for i in sort_top30_context:
        print(i[0],i[1])
        csvwriter.writerow([i[0]] + list(i[1]))
        csvwriter.writerow([""] + list(i[1].values()))
```
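
One possible way to make the counting work for multi-word phrases is to compare a slice of tokens against the tokenised phrase, rather than testing one token at a time. The following is only a sketch under stated assumptions: `count_phrase_context`, its parameters, the stop-word set, and the sample text are all hypothetical and not taken from the script above.

```python
import string
from collections import Counter


def count_phrase_context(tokens, phrases, size=3, stopwords=frozenset()):
    """Map each phrase to a Counter of words within `size` tokens on either side."""
    # Normalise once: lower-case and strip surrounding punctuation
    clean = [t.lower().strip(string.punctuation) for t in tokens]
    clean = [t for t in clean if t]  # drop tokens that were pure punctuation
    counts = {phrase: Counter() for phrase in phrases}
    for phrase in phrases:
        target = phrase.lower().split()
        n = len(target)
        for i in range(len(clean) - n + 1):
            if clean[i:i + n] == target:  # the whole phrase matches here
                neighbours = clean[max(0, i - size):i] + clean[i + n:i + n + size]
                counts[phrase].update(
                    w for w in neighbours
                    if w not in stopwords
                    and len(w) > 1
                    and not any(c.isdigit() for c in w)
                )
    return counts


# Sample text invented for illustration only
text = ("In many cases, from time to time there is rain. "
        "Could Scotland imagine, from time to time, there is sun?")
result = count_phrase_context(text.split(), ["from time to time"],
                              stopwords={"is", "in"})
print(result["from time to time"].most_common(5))
```

Because a single word is just a phrase of length one, the same function covers both cases, and the results are keyed by the phrase itself rather than by whichever token happened to match.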
