How to match partial text against the full text and replace it
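I have a dataset of tweets, some of which are original tweets and others retweets. For some reason the retweets are truncated with "...", so their full text is missing. In my dataset the original tweet is (hopefully) always present, so I would like to find the original tweet and use it to replace the truncated one.
For example: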
library(tibble)

my_data <- tribble(
  ~user,    ~text,
  "Peter",  "Hello,I like ice cream!",
  "John",   "RT @Peter: Hello,I like ...",
  "Martha", "RT @Peter: Hello,I like ...",
  "Julia",  "Hi,I really like apples!",
  "Bjorn",  "RT @Julia: I really like ..."
)
# A tibble: 5 x 2
  user   text
  <chr>  <chr>
1 Peter  Hello,I like ice cream!
2 John   RT @Peter: Hello,I like ...
3 Martha RT @Peter: Hello,I like ...
4 Julia  Hi,I really like apples!
5 Bjorn  RT @Julia: I really like ...
I would like to find every instance of "RT @username: some text..." and replace it with the full tweet. Basically:
# A tibble: 5 x 2
  user   text
  <chr>  <chr>
1 Peter  Hello,I like ice cream!
2 John   RT @Peter: Hello,I like ice cream!
3 Martha RT @Peter: Hello,I like ice cream!
4 Julia  Hi,I really like apples!
5 Bjorn  RT @Julia: Hi,I really like apples!
I have already extracted the handle being retweeted, using capture groups as follows:
library(stringr)

# Group 1 captures the retweeted handle, group 2 the (truncated) text.
retweet_pattern <- "^RT @([a-zA-Z0-9_]*): (.*)"
str_match(my_data$text, retweet_pattern)
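However, I am not sure how to proceed from here. User/text pairs are not necessarily unique (one user may have several tweets that were retweeted), so simply looking up the retweeted handle and swapping in that user's text will not work. Maybe I need a string metric such as Levenshtein distance?
Thanks.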
Solution
Since the visible part of each retweet matches the text of the original tweet exactly, you can try this.
library(dplyr)
library(tidyr)

# Split the retweets into their own data frame and separate the
# "RT @user" prefix from the tweet text at the first colon.
rt_data <- my_data %>%
  filter(grepl('^RT @', text)) %>%
  separate(text, c('name', 'text'), sep = ':\\s*', extra = 'merge')

# Keep the original (non-retweet) tweets in a second data frame.
no_rt_data <- my_data %>% filter(!grepl('^RT @', text))

# Strip the trailing "..." from each retweet and look up the original
# tweet that contains the remaining fragment. fixed = TRUE treats the
# fragment as literal text; the first match wins, NA if there is none.
rt_data$text <- vapply(
  sub('\\s*\\.+$', '', rt_data$text),
  function(x) no_rt_data$text[grepl(x, no_rt_data$text, fixed = TRUE)][1],
  character(1),
  USE.NAMES = FALSE
)

# Re-attach the "RT @user" prefix to the restored text.
rt_data <- rt_data %>% unite(text, name, text, sep = ': ')

# Combine the two data frames (originals first, then retweets).
bind_rows(no_rt_data, rt_data)
#   user   text
#   <chr>  <chr>
# 1 Peter  Hello,I like ice cream!
# 2 Julia  Hi,I really like apples!
# 3 John   RT @Peter: Hello,I like ice cream!
# 4 Martha RT @Peter: Hello,I like ice cream!
# 5 Bjorn  RT @Julia: Hi,I really like apples!
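The approach above searches every original tweet for the truncated fragment. If one user has several retweeted tweets, as the question mentions, it may help to search only among the tweets written by the retweeted handle. The following is a minimal base R sketch of that idea, not a tested solution for real data: the `tweets` data frame (with a second tweet added for Peter) and the helper `restore_retweets()` are made up for illustration, and the sketch assumes the visible part of a retweet appears verbatim in the original.

```r
# Hypothetical example data: Peter has two original tweets, so matching
# on the handle alone would be ambiguous.
tweets <- data.frame(
  user = c("Peter", "Peter", "John", "Julia", "Bjorn"),
  text = c("Hello,I like ice cream!",
           "Goodbye,I like cake!",
           "RT @Peter: Hello,I like ...",
           "Hi,I really like apples!",
           "RT @Julia: I really like ..."),
  stringsAsFactors = FALSE
)

retweet_pattern <- "^RT @([a-zA-Z0-9_]*): (.*)"

# restore_retweets() is a hypothetical helper, not part of any package.
restore_retweets <- function(df) {
  # Each element of m is either empty (no match) or
  # c(full match, handle, truncated text).
  m <- regmatches(df$text, regexec(retweet_pattern, df$text))
  is_rt <- lengths(m) == 3
  originals <- df[!is_rt, ]
  for (i in which(is_rt)) {
    handle <- m[[i]][2]
    fragment <- sub("\\s*\\.+$", "", m[[i]][3])  # drop the trailing "..."
    # Candidates: original tweets written by the retweeted user only.
    candidates <- originals$text[originals$user == handle]
    hit <- candidates[grepl(fragment, candidates, fixed = TRUE)]
    # Replace only on an unambiguous match; otherwise keep the truncated text.
    if (length(hit) == 1) {
      df$text[i] <- paste0("RT @", handle, ": ", hit)
    }
  }
  df
}

restored <- restore_retweets(tweets)
print(restored)
```

Unlike the version above, this keeps the rows in their original order and leaves a retweet untouched when zero or several of the user's tweets contain the fragment. A string metric such as Levenshtein distance would only be needed if the truncated text could differ from the original.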