How do I concatenate a TF sentence representation with word-based (lexicon) features and use them as input to different algorithms?

I have a dataset (Arabic tweets) and emotion lexicons, and I want to detect emotion with machine-learning algorithms. I am a Python beginner; how do I carry out this step?

I have completed the preprocessing steps and the other functions, as shown in the code below. I just want to apply these steps:

  1. Compute the TF scheme to get how frequently an expression (term, word) occurs in a document (a minimal TF sketch follows the code below).

  2. To incorporate the emotion-lexicon features, check each sentence for the presence of lexicon terms and obtain a vector representing each emotion category (anger, fear, sadness, and joy); see the lexicon sketch after the code below.

  3. Finally, for classification, use the concatenation of the TF sentence representation and the word-based features as input to different algorithms (SVM, LR, MLP, MultinomialNB); a concatenation sketch also follows the code below.

    import re
    import pandas as pd
    from pandas import DataFrame
    from nltk.tokenize import word_tokenize
    from nltk.stem.isri import ISRIStemmer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import classification_report
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neural_network import MLPClassifier

    df = pd.read_csv("C:/Users/User/Desktop/Dataset with stopword.csv")
    df.shape
    
    def noramlize(Tweet):
        # Unify alef variants and hamza carriers.
        Tweet = re.sub(r"[إأٱآا]", "ا", Tweet)
        Tweet = re.sub(r"ى", "ي", Tweet)
        Tweet = re.sub(r"ؤ", "ء", Tweet)
        Tweet = re.sub(r"ئ", "ء", Tweet)
        # Keep only Arabic letters and spaces (ء included so the
        # replacements above are not stripped out again).
        Tweet = re.sub(r'[^ءا-ي ]', "", Tweet)

        noise = re.compile(""" ّ    | # Tashdid
                              َ    | # Fatha
                              ً    | # Tanwin Fath
                              ُ    | # Damma
                              ٌ    | # Tanwin Damm
                              ِ    | # Kasra
                              ٍ    | # Tanwin Kasr
                              ْ    | # Sukun
                              ـ     # Tatwil/Kashida
                          """, re.VERBOSE)
        Tweet = re.sub(noise, '', Tweet)
        return Tweet
    
    def stopWordRmove(Tweet):
        # Load the stop-word list and drop those tokens from the tweet.
        with open("ar_stop_word_list.txt", "r", encoding="utf8") as ar_stop_list:
            stop_words = ar_stop_list.read().split('\n')
        needed_words = []
        words = word_tokenize(Tweet)
        for w in words:
            if w not in stop_words:
                needed_words.append(w)
        filtered_sentence = " ".join(needed_words)
        return filtered_sentence
    
    def stemming(Tweet):
        # Apply the ISRI stemmer to every token.
        st = ISRIStemmer()
        stemmed_words = []
        words = word_tokenize(Tweet)
        for w in words:
            stemmed_words.append(st.stem(w))
        stemmed_sentence = " ".join(stemmed_words)
        return stemmed_sentence
    
    
    def prepareDataSets(df):
        sentences = []
        for index, r in df.iterrows():
            # Chain the preprocessing steps instead of restarting
            # from the raw tweet each time.
            Tweet = noramlize(r['Tweet'])
            Tweet = stopWordRmove(Tweet)
            Tweet = stemming(Tweet)

            if r['Affect Dimension'] in ('fear', 'anger', 'joy', 'sadness'):
                sentences.append([Tweet, r['Affect Dimension']])

        df_sentences = DataFrame(sentences, columns=['Tweet', 'Affect Dimension'])
        return df_sentences
    
    preprocessed_df = prepareDataSets(df)
    preprocessed_df

    def featureExtraction(data):
        # TF representation: raw term counts per tweet.
        vectorizer = CountVectorizer()
        Count_data = vectorizer.fit_transform(data)
        return Count_data
    
    def learning(clf, X, Y):
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=0)
        classifier = clf()
        classifier.fit(X_train, Y_train)

        # Predictions on the held-out test split for the report.
        predict = classifier.predict(X_test)

        # 10-fold cross-validated accuracy over the whole dataset.
        scores = cross_val_score(classifier, X, Y, cv=10)

        print(scores)
        print("Accuracy of %s: %0.2f (+/- %0.2f)" % (clf.__name__, scores.mean(), scores.std() * 2))
        print(classification_report(Y_test, predict))

    # Build the TF matrix and the label vector, then run every classifier.
    X = featureExtraction(preprocessed_df['Tweet'])
    Y = preprocessed_df['Affect Dimension']

    learning(SVC, X, Y)

    clfs = [LogisticRegression, MultinomialNB, MLPClassifier]
    for clf in clfs:
        learning(clf, X, Y)
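
For step 1, the CountVectorizer used in featureExtraction already implements the TF scheme: each row of the resulting matrix is a tweet, each column a vocabulary term, and each cell the term's raw frequency. A minimal sketch of that representation on its own (the variable names are illustrative):

    from sklearn.feature_extraction.text import CountVectorizer

    # Each cell of tf_matrix is how often a term occurs in a tweet,
    # i.e. the TF representation of the sentence.
    tf_vectorizer = CountVectorizer()
    tf_matrix = tf_vectorizer.fit_transform(preprocessed_df['Tweet'])
    print(tf_matrix.shape)  # (number_of_tweets, vocabulary_size)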
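
For step 2, one possible way to build the lexicon features is to count, for each tweet, how many of its tokens appear in each emotion lexicon, giving one four-dimensional vector (anger, fear, sadness, joy) per tweet. This sketch assumes each lexicon is a plain-text file with one term per line; the file names are hypothetical:

    import numpy as np
    from nltk.tokenize import word_tokenize

    # Hypothetical file names: one lexicon file per emotion category.
    lexicon_files = {'anger': 'anger_lexicon.txt',
                     'fear': 'fear_lexicon.txt',
                     'sadness': 'sadness_lexicon.txt',
                     'joy': 'joy_lexicon.txt'}
    lexicons = {}
    for emotion, path in lexicon_files.items():
        with open(path, encoding='utf8') as f:
            lexicons[emotion] = set(f.read().split('\n'))

    def lexiconFeatures(tweet):
        # One count per emotion category: how many tokens of the
        # tweet occur in that category's lexicon.
        tokens = word_tokenize(tweet)
        return [sum(t in lexicons[e] for t in tokens)
                for e in ('anger', 'fear', 'sadness', 'joy')]

    lexicon_matrix = np.array([lexiconFeatures(t) for t in preprocessed_df['Tweet']])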
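
For step 3, scipy.sparse.hstack can concatenate the sparse TF matrix and the dense lexicon matrix column-wise, and the combined matrix is then fed to each classifier. A sketch reusing tf_matrix and lexicon_matrix from the two sketches above:

    from scipy.sparse import hstack, csr_matrix
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neural_network import MLPClassifier

    # Column-wise concatenation: TF columns first, then the 4 lexicon counts.
    X_combined = hstack([tf_matrix, csr_matrix(lexicon_matrix)])
    Y = preprocessed_df['Affect Dimension']

    for clf in (SVC(), LogisticRegression(max_iter=1000),
                MLPClassifier(max_iter=500), MultinomialNB()):
        scores = cross_val_score(clf, X_combined, Y, cv=10)
        print("%s: %0.2f (+/- %0.2f)" % (clf.__class__.__name__,
                                         scores.mean(), scores.std() * 2))

Note that MultinomialNB requires non-negative features, which holds here because both the term counts and the lexicon counts are non-negative.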
    
