How do I concatenate the TF sentence representation with word-based (lexicon) features and use the result as input to different algorithms?
I have a dataset (Arabic tweets) and an emotion lexicon, and I want to detect emotion with machine-learning algorithms. I am a Python beginner. How do I do this step?
I have finished the preprocessing steps and the other functions, as shown in the code below. I just want to apply these steps:
- Compute a TF scheme to get the frequency with which each expression (term, word) appears in a document.
- To incorporate the emotion-lexicon features, check each sentence for the presence of lexicon terms and obtain one vector representing each emotion category (anger, fear, sadness, and joy).
- Finally, for classification, use the concatenation of the TF sentence representation and the word-based features as input to different algorithms (SVM, LR, MLP, MultinomialNB).
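The three steps above can be sketched as follows. This is a minimal, self-contained example, not your actual pipeline: the tiny `tweets` list and the `lexicon` dictionary (term → emotion category) are made-up stand-ins for your real data, so adapt the loading code to your own files.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# hypothetical preprocessed tweets and labels (replace with your data)
tweets = ["قلق خوف", "فرح سعادة"]
labels = ["fear", "joy"]

# hypothetical emotion lexicon: term -> emotion category
lexicon = {"خوف": "fear", "قلق": "fear", "فرح": "joy", "سعادة": "joy"}
emotions = ["anger", "fear", "sadness", "joy"]

# Step 1: TF representation (raw term counts per tweet)
vectorizer = CountVectorizer()
X_tf = vectorizer.fit_transform(tweets)

# Step 2: one 4-dimensional vector per tweet, counting how many
# lexicon terms of each emotion category occur in the sentence
def lexicon_vector(tweet):
    vec = np.zeros(len(emotions))
    for w in tweet.split():
        if w in lexicon:
            vec[emotions.index(lexicon[w])] += 1
    return vec

X_lex = csr_matrix(np.array([lexicon_vector(t) for t in tweets]))

# Step 3: concatenate both feature blocks column-wise and feed the
# result to any scikit-learn classifier (SVM, LR, MLP, MultinomialNB)
X = hstack([X_tf, X_lex])
clf = SVC().fit(X, labels)
print(X.shape)  # (n_tweets, n_tf_features + 4)
```

`scipy.sparse.hstack` is the usual way to join a sparse `CountVectorizer` matrix with extra dense features without converting everything to a dense array, which matters once the vocabulary gets large.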
import re
import pandas as pd
from pandas import DataFrame
from nltk.tokenize import word_tokenize
from nltk.stem.isri import ISRIStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("C:/Users/User/Desktop/Dataset with stopword.csv")
df.shape

def normalize(Tweet):
    Tweet = re.sub(r"[إأٱآا]", "ا", Tweet)
    Tweet = re.sub(r"ى", "ي", Tweet)
    Tweet = re.sub(r"ؤ", "ء", Tweet)
    Tweet = re.sub(r"ئ", "ء", Tweet)  # the replacement string was missing; ء is assumed here
    Tweet = re.sub(r'[^ا-ي ]', "", Tweet)
    noise = re.compile(""" ّ    | # Tashdid
                           َ    | # Fatha
                           ً    | # Tanwin Fath
                           ُ    | # Damma
                           ٌ    | # Tanwin Damm
                           ِ    | # Kasra
                           ٍ    | # Tanwin Kasr
                           ْ    | # Sukun
                           ـ     # Tatwil/Kashida
                       """, re.VERBOSE)
    Tweet = re.sub(noise, '', Tweet)
    return Tweet

def stopWordRemove(Tweet):
    with open("ar_stop_word_list.txt", "r", encoding="utf8") as ar_stop_list:
        stop_words = ar_stop_list.read().split('\n')
    needed_words = [w for w in word_tokenize(Tweet) if w not in stop_words]
    return " ".join(needed_words)

def stemming(Tweet):
    st = ISRIStemmer()  # note the capital S; ISRIstemmer is a NameError
    return " ".join(st.stem(w) for w in word_tokenize(Tweet))

def prepareDataSets(df):
    sentences = []
    for index, r in df.iterrows():
        # chain the steps; the original re-read r['Tweet'] each time,
        # discarding the previous step's result
        Tweet = normalize(r['Tweet'])
        Tweet = stopWordRemove(Tweet)
        Tweet = stemming(Tweet)
        if r['Affect Dimension'] in ('fear', 'anger', 'joy', 'sadness'):
            sentences.append([Tweet, r['Affect Dimension']])
    return DataFrame(sentences, columns=['Tweet', 'Affect Dimension'])

preprocessed_df = prepareDataSets(df)
preprocessed_df

def featureExtraction(data):
    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(data)

def learning(clf, X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=0)
    classifier = clf()
    classifier.fit(X_train, Y_train)
    # cross_val_predict and cross_val_score need both the features and the
    # labels; the original calls were missing the X argument
    predict = cross_val_predict(classifier, X_test, Y_test, cv=10)
    scores = cross_val_score(classifier, X_train, Y_train, cv=10)
    print(scores)
    print("Accuracy of %s: %0.2f (+/- %0.2f)" % (classifier, scores.mean(), scores.std() * 2))
    print(classification_report(Y_test, predict))

# the original called an undefined main(); build X and Y once, then run each classifier
X = featureExtraction(preprocessed_df['Tweet'])
Y = preprocessed_df['Affect Dimension']
for clf in [SVC, LogisticRegression, MultinomialNB, MLPClassifier]:
    learning(clf, X, Y)