Mining conversations with overlap - 编程之家

How to resolve overlapping mined conversations

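I'm playing around with the Twitter API and trying to extract conversations. The data frame "trying" is sorted from the most recent tweet to the oldest. I want to walk through the data frame and build my conversations, going from the end of a conversation back to its start. The start of a conversation is defined as the tweet whose "in_reply_to_status_id_str" is null. I can extract conversations successfully, but the problem is that any conversation longer than 2 also shows up again as subsets of itself.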

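For example, a conversation is identified as [10,4,5,6], but [4,6] and [5,6] are also identified as conversations even though they are only subsets of the larger one. The goal is to get rid of these subsets and keep only the meaningful conversations. Another example: [100,50,30] is correctly identified as a conversation, but later in the run [50,30] is also identified as a conversation (it is a subset, and I don't want it). In some cases [60,30] is a correctly identified conversation, so I can't simply delete rows once a conversation has been identified, because in those cases two users are replying to "30". You will find more examples at the bottom of my code.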
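One way to read the requirement above is "drop every conversation whose tweets are all contained in another conversation". The following is only a minimal sketch of that idea on the toy numbers from the description. The function name drop_subset_conversations is hypothetical, and it assumes each conversation is a plain list of hashable tweet ids. Exact duplicates are dropped as well, which may or may not be what you want.

```python
def drop_subset_conversations(conversations):
    """Keep only conversations that are not contained in another conversation."""
    # Longest first, so any superset is examined before its subsets.
    ordered = sorted(conversations, key=len, reverse=True)
    kept = []
    kept_sets = []
    for convo in ordered:
        ids = frozenset(convo)
        # Skip a conversation that is a subset (or duplicate) of one already kept.
        if any(ids <= other for other in kept_sets):
            continue
        kept.append(convo)
        kept_sets.append(ids)
    return kept


# Toy data taken from the examples in the question.
examples = [[10, 4, 5, 6], [4, 6], [5, 6], [100, 50, 30], [50, 30], [60, 30]]
print(drop_subset_conversations(examples))
# -> [[10, 4, 5, 6], [100, 50, 30], [60, 30]]
```

Note that [60,30] survives because 60 does not occur in any longer conversation. The pairwise comparison is quadratic in the number of conversations, so this is meant as a post-processing step, not a fix for the running time.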


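import time

import pandas as pd
from IPython.display import display  # available by default in Jupyter notebooks

start_time = time.time()  # used for the timing message at the end
conversations = []  # collects every conversation that is found

df = pd.read_csv('file.csv',dtype=str)
df['created_at'] = pd.to_datetime(df['created_at'])
# Most recent tweets first
df_sorted = df.sort_values(by=['created_at'],ascending=False)
trying = df_sorted.copy()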

'''Takes a bit over 3 hours to run.

Build each conversation from its most recent tweet back to its start. The start of a
conversation is the tweet whose "in_reply_to_status_id_str" is null.

Example: row1 has reply_status = 9 and id_str = 10.
Find the row whose id_str equals row1's reply_status.
Repeat until reply_status is null (the start has been found).'''

for index,row in trying.iterrows():
    single_convo = []
    
    response_var = row['in_reply_to_status_id_str']
    
    if pd.notnull(trying.loc[index,'in_reply_to_status_id_str']):
        single_convo.append([index,row['user_id_str'],row['id_str']])
        
    cont = True #checks if you need to continue checking for whole conversation 
    
    while cont:
        for index2,row2 in trying.iterrows():
            if response_var == row2['id_str']:
                single_convo.append([index2,row2['user_id_str'],row2['id_str']])
                if pd.isnull(trying.loc[index2,'in_reply_to_status_id_str']):
                    single_convo.append('start of convo found')
                    display(single_convo)
                    cont = False 
                    break
                else:
                    response_var = row2['in_reply_to_status_id_str']
            if len(trying[trying['id_str'] == response_var]) == 0:
                cont = False
                break
                
    if len(single_convo) > 0: #single convo will be empty when row['in_reply_to_status_id_str'] = NaN
        conversations.append(single_convo)
    
display(conversations)

print("My program took",(time.time() - start_time)/60,"minutes to run")


'''Example: after [3493,3491,3487,3473] the next result is [3491,3473], and another
one is created for [3487,3473]. I only want to keep the longest one.

There will also be cases like this:

[[1365,'1.1450879244148859e+18'],[1361,'1.1450866432647905e+18'],[1353,'1.1450837341458719e+18']
 ['start of convo found']],.... (subset does not always appear right after)

[[1361,'1.1450837341458719e+18'],['start of convo found']] (don't want)

[[1360,'1.1450864696206049e+18'],['start of convo found']] (want to keep)
'''
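An alternative to filtering afterwards is to avoid creating the subsets in the first place. This is a hedged sketch, not the original code: it assumes every tweet replies to at most one parent, so a chain can only be a complete conversation if its newest tweet was never replied to by another tweet in the data. The names is_missing and build_conversations are hypothetical, and the input is a plain dict mapping each id_str to its in_reply_to_status_id_str (None or NaN for a conversation start). A dictionary lookup also replaces the nested iterrows loops.

```python
def is_missing(value):
    """True for None or NaN (NaN is the only value not equal to itself)."""
    return value is None or value != value


def build_conversations(parent_of):
    """Return reply chains, newest tweet first, that start at an unanswered reply."""
    # Tweets that received a reply are inner links of a longer chain.
    replied_to = {p for p in parent_of.values() if not is_missing(p)}

    conversations = []
    for tweet_id, parent in parent_of.items():
        if is_missing(parent) or tweet_id in replied_to:
            continue  # not a reply, or part of a longer chain found elsewhere
        chain = [tweet_id]
        seen = {tweet_id}
        # Walk up until the start, a parent missing from the data, or a cycle.
        while not is_missing(parent) and parent in parent_of and parent not in seen:
            chain.append(parent)
            seen.add(parent)
            parent = parent_of[parent]
        conversations.append(chain)
    return conversations


# Toy data: 100 replies to 50, while 50 and 60 both reply to 30 (the start).
toy = {"100": "50", "60": "30", "50": "30", "30": None}
print(build_conversations(toy))
# -> [['100', '50', '30'], ['60', '30']]
```

With the data frame from the question, the mapping could presumably be built with dict(zip(trying['id_str'], trying['in_reply_to_status_id_str'])), assuming missing replies are read as None or NaN and id_str values are unique. Here [50,30] is never produced because 50 was replied to by 100, while [60,30] is kept because nobody replied to 60.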