
Overlapping mined conversations

How to resolve overlapping mined conversations

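I am experimenting with the Twitter API and trying to extract conversations. The data frame "trying" is sorted from the most recent tweet to the oldest. I now want to walk through the data frame and collect my conversations, which means going from the end of each conversation back to its beginning. The beginning of a conversation is defined as the tweet whose "in_reply_to_status_id_str" is null. I can extract conversations successfully, but the problem is that any conversation longer than 2 tweets also shows up again as subsets of itself.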

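For example, a conversation is identified as [10,4,5,6], but [4,6] and [5,6] are also identified as conversations, even though they are only subsets of the larger one. The goal is to get rid of these subsets and keep only the meaningful conversations. Here is another example: [100,50,30] is correctly identified as a conversation, but later in the run [50,30] is identified as a conversation too (it is a subset, and I don't want it). In some cases [60,30] is a correctly identified conversation, so I can't simply delete rows once a conversation has been identified, because in those cases 2 users are replying to "30". You will find more examples at the bottom of my code.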
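The subset rule described above can be sketched as a post-processing step. This is a minimal sketch rather than the asker's method: it assumes each conversation has already been reduced to a plain list of tweet ids, and the function name `drop_subset_conversations` is hypothetical. A conversation is dropped when every one of its ids also appears in a conversation that was kept, so [50,30] disappears while [60,30] survives.

```python
def drop_subset_conversations(conversations):
    """Keep only conversations that are not contained in another one.

    `conversations` is assumed to be a list of lists of tweet ids
    (an assumption for this sketch, not the question's exact structure).
    """
    kept = []  # pairs of (set of ids, original list)
    # Visit the longest conversations first, so that any superset is
    # already in `kept` by the time one of its subsets is examined.
    for convo in sorted(conversations, key=lambda c: len(set(c)), reverse=True):
        ids = frozenset(convo)
        if any(ids <= other for other, _ in kept):
            continue  # subset (or exact duplicate) of a kept conversation
        kept.append((ids, convo))
    return [convo for _, convo in kept]


examples = [[100, 50, 30], [50, 30], [60, 30], [10, 4, 5, 6], [4, 6], [5, 6]]
filtered = drop_subset_conversations(examples)
print(filtered)  # [[10, 4, 5, 6], [100, 50, 30], [60, 30]]
```

The comparison is quadratic in the number of conversations, which may be acceptable for a few thousand chains but would need a smarter index for much larger data.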


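import time

import pandas as pd
from IPython.display import display

start_time = time.time()  # used for the runtime report at the end

df = pd.read_csv('file.csv',dtype=str)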
df['created_at'] = pd.to_datetime(df['created_at'])
df_sorted = df.sort_values(by=['created_at'],ascending=False)
trying = df_sorted.copy()

'''Will take a bit above 3 hours to run

Trying to get conversation from most recent to start of conversation. Beginning of Conversation
is defined when "in_reply_to_status_id_str" is a null value.

Take for example row1: has reply_status = 9 and id_str = 10
find row in which the reply_status of row1 is equal to this row's id_str. 
Continue this process until reply_status is equal to null (have found start)'''

conversations = []  # every extracted conversation is collected here

for index,row in trying.iterrows():
    single_convo = []
    
    response_var = row['in_reply_to_status_id_str']
    
    if pd.notnull(trying.loc[index,'in_reply_to_status_id_str']):
        single_convo.append([index,row['user_id_str'],row['id_str']])
        
    cont = True #checks if you need to continue checking for whole conversation 
    
    while cont:
        for index2,row2 in trying.iterrows():
            if response_var == row2['id_str']:
                single_convo.append([index2,row2['user_id_str'],row2['id_str']])
                if pd.isnull(trying.loc[index2,'in_reply_to_status_id_str']):
                    single_convo.append('start of convo found')
                    display(single_convo)
                    cont = False 
                    break
                else:
                    response_var = row2['in_reply_to_status_id_str']
            if len(trying[trying['id_str'] == response_var]) == 0:
                cont = False
                break
                
    if len(single_convo) > 0: #single convo will be empty when row['in_reply_to_status_id_str'] = NaN
        conversations.append(single_convo)
    
display(conversations)

print("My program took",(time.time() - start_time)/60,"minutes to run")
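The nested `iterrows()` loops above rescan the whole data frame for every reply, which is likely the main reason for the multi-hour runtime. A minimal sketch of the same walk with a dictionary lookup follows. It assumes the id columns have been turned into a plain mapping from each tweet id to the id it replies to, with `None` for a tweet that starts a conversation; the names `parent_of` and `build_conversations` are hypothetical. With pandas, missing values would first need to be mapped to `None`. The sketch only speeds up the walk: it still yields the overlapping chains the question describes.

```python
def build_conversations(parent_of):
    """Walk each reply back to the start of its conversation.

    `parent_of` maps tweet id -> id of the tweet it replies to,
    or None when the tweet starts a conversation (assumed layout).
    """
    conversations = []
    for tweet_id, parent in parent_of.items():
        if parent is None:
            continue  # not a reply, so it cannot be the end of a conversation
        chain = [tweet_id]
        seen = {tweet_id}  # guards against accidental cycles in the data
        current = parent
        # Follow the reply links until the start is reached or the
        # parent tweet is missing from the data set.
        while current is not None and current in parent_of and current not in seen:
            chain.append(current)
            seen.add(current)
            current = parent_of[current]
        conversations.append(chain)
    return conversations


parent_of = {"100": "50", "50": "30", "60": "30", "30": None}
chains = build_conversations(parent_of)
print(chains)  # [['100', '50', '30'], ['50', '30'], ['60', '30']]
```

Each lookup is a constant-time dictionary access, so the total work grows with the summed chain lengths instead of with the square of the number of rows.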


'''[3493,3491,3487,3473] is found first; the next one is [3491,3473], and then
another is created for [3487,3473]. I only want to keep the longest one.

There will be instances such as:

[[1365,'1.1450879244148859e+18'],[1361,'1.1450866432647905e+18'],[1353,'1.1450837341458719e+18'],
 ['start of convo found']],.... (the subset does not always appear right after)

[[1361,'1.1450837341458719e+18'],['start of convo found']] (don't want)

[[1360,'1.1450864696206049e+18'],['start of convo found']] (want to keep)
'''
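The ids in the examples above appear in float notation (for instance '1.1450879244148859e+18'). A 64-bit float cannot represent every 19-digit integer, so if the ids passed through a float at some point before being written to the CSV (an assumption, the post does not say), distinct tweets could end up with the same id string and be matched as replies incorrectly. The sketch below only demonstrates the precision loss with made-up numbers of similar magnitude; it does not use real tweet ids.

```python
a = 2**60   # 1152921504606846976, exactly representable as a float
b = a + 1   # a different 19-digit integer

print(a == b)                # False: the integers differ
print(float(a) == float(b))  # True: both round to the same float
print(int(float(b)) == a)    # True: the trailing digit of b is lost
```

If this applies to the data, keeping the ids as strings from the moment they are fetched would avoid the collision.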

