How to define a mutual information function in Python
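I am working with a corpus of 180 movie review documents written by two reviewers. Each document is one reviewer's review of one movie. The first 80 reviews were written by Berardinelli and the remaining 100 by Schwartz. I have already calculated the mutual information between a particular word and the two authors. Now I have to find the ten words that have the highest mutual information with the document author, along with their mutual information values. I also need to explain (in a Python comment) what it means for a word to have high mutual information with the document author. Can anyone help? The code below computes the mutual information between the word "director" and the two authors.
from nltk.corpus import PlaintextCorpusReader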
corpus_root = '/Users/xniu2/Desktop/PyData/MovieReviews'
filelists = PlaintextCorpusReader(corpus_root,'.*',encoding='latin-1')
filelists.fileids()
reviews = []
for fileid in filelists.fileids():
    reviews.append(filelists.raw(fileid))
import shorttext
preprocess = shorttext.utils.standard_text_preprocessor_1()
corpus = [preprocess(article).split(' ') for article in reviews]
dtm = shorttext.utils.DocumentTermMatrix(corpus,docids = filelists.fileids())
corpus
dtm.get_token_occurences('director')
import numpy as np
import math
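def entropy(p):
    # Shannon entropy (in bits) of a vector of non-negative counts
    p = np.asarray(p, dtype=float)  # also accepts plain lists
    if p.sum() == 0:
        return 0
    p = p / p.sum()
    p = p[p > 0]  # drop zeros so log2 is defined
    H = -np.sum(p * np.log2(p))
    return H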
dtm.get_token_occurences('director').values()
director_dis = list(dtm.get_token_occurences('director').values())
entropy(director_dis)
director_docs = list(dtm.get_token_occurences('director').keys())
director_docs
import re
count_B = 0
for item in director_docs:
    # file names with 4 digits are counted for Berardinelli
    m = re.search(r'^\d{4}\.txt$', item)
    if m:
        count_B += 1
print(count_B)
count_S = 0
for item in director_docs:
    # file names with 5 digits are counted for Schwartz
    m = re.search(r'^\d{5}\.txt$', item)
    if m:
        count_S += 1
print(count_S)
# Build a 2x2 array. Rows: "Berardinelli" and "Schwartz". Columns: number of reviews that contain the word "director" and number of reviews that do NOT contain it.
array = np.reshape((count_B,80-count_B,count_S,100-count_S),(2,2))
array
np.sum(array,axis = 0)
np.sum(array,axis = 1)
marginal_entropy = entropy(np.sum(array,axis = 1))
column_probs = np.sum(array,axis = 0)/180
column_probs
column_entropy = np.apply_along_axis(entropy, 0, array)
column_entropy
conditional_entropy = sum(column_probs*column_entropy)
conditional_entropy
#calculate the mutual information between the word "director" and the two authors
MI_director_authors = marginal_entropy - conditional_entropy
MI_director_authors
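One possible way to extend the "director" calculation to every word is sketched below. It uses the same identity as the code above, I(author; word) = H(author) - H(author | word present or absent), computed from a 2x2 table of document counts. The function names `mutual_information` and `top_words_by_mi`, and the `docs` / `authors` inputs, are hypothetical and not part of shorttext or NLTK; the sketch assumes each document is already a list of tokens and that there are exactly two author labels.

```python
from collections import Counter

import numpy as np


def entropy(counts):
    """Shannon entropy in bits of a vector of non-negative counts."""
    p = np.asarray(counts, dtype=float)
    total = p.sum()
    if total == 0:
        return 0.0
    p = p[p > 0] / total
    return float(-np.sum(p * np.log2(p)))


def mutual_information(table):
    """I(author; word) = H(author) - H(author | word present/absent).

    table is 2x2: rows = authors, columns = (reviews with word, without).
    """
    table = np.asarray(table, dtype=float)
    marginal = entropy(table.sum(axis=1))                    # H(author)
    column_probs = table.sum(axis=0) / table.sum()           # P(present), P(absent)
    column_entropy = np.apply_along_axis(entropy, 0, table)  # H(author | column)
    return marginal - float(np.sum(column_probs * column_entropy))


def top_words_by_mi(docs, authors, k=10):
    """Rank words by mutual information with the author label (hypothetical helper).

    docs: list of token lists; authors: parallel list of author labels.
    """
    labels = sorted(set(authors))
    totals = Counter(authors)
    # Per-author document frequency: a word counts once per review.
    doc_freq = {label: Counter() for label in labels}
    for tokens, author in zip(docs, authors):
        doc_freq[author].update(set(tokens))
    vocab = set().union(*doc_freq.values())
    scores = {}
    for word in vocab:
        table = [[doc_freq[a][word], totals[a] - doc_freq[a][word]]
                 for a in labels]
        scores[word] = mutual_information(table)
    # Highest MI first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

With the corpus from the question, a call might look like `top_words_by_mi(corpus, authors)`, where `authors` is an assumed list built from the file names with the same 4-digit / 5-digit pattern used above. As for interpretation: a word with high mutual information appears in a very different share of one author's reviews than of the other's, so knowing whether the word is present reduces the uncertainty about who wrote the review. A word that both authors use equally often scores close to zero.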