How to identify rows containing non-letter characters in R using tidyverse/regex
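I have a data frame containing strings that represent "full names". Some are complete, normal full names; others are not "complete" or "accurate" because they contain non-letter characters.
Example data frame: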
Full name
----------
Mikki Clancy
Hermsdorfer,Mark (retired)
CSP,PSECU Lan Unit (typo)
Clifton Gurlen
G�mez,Oscar Prieto
Sj�¶strand,Anders
Lisa Terry
Meloy,Wilson {old}
Gregory Stevens
Charles Gruenberg
df <- structure(list(Full_name = c("Jane Clancy","Hermsdorfer,Mark (retired)","CSP,PSECU Lan Unit (typo)","Clif Gurlen","G�mez,Oscar Prieto","Sj�¶strand,Anders","Liza Terry","Meloy,Will {old}","Garret Stevens","Charly Ruenberg"),Group = c("a","b","c","d","e","f","g","h","i","j")),class = "data.frame",row.names = c(NA,-10L))
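The requirement is to subset the full data frame to the rows whose strings contain non-alphabetic characters (for example, from the values above: '{}', '()', '&', '�').
The desired output is the name column for the rows containing these characters, followed by the total row count, so that I can calculate what percentage of the full data frame is "not complete" or "not accurate".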
Not Complete Full name
----------------------
Hermsdorfer,Mark (retired)
CSP,PSECU Lan Unit (typo)
G�mez,Oscar Prieto
Sj�¶strand,Anders
Meloy,Wilson {old}
Solution
For a more inclusive notion of "letters", I borrowed the regex from this question about matching letters.
library(dplyr)

# Flag rows whose name contains anything other than a Unicode letter or a space
df %>% mutate(
  has_non_letters = grepl("[^\\p{L} ]", names, perl = TRUE)
)
#                         names has_non_letters
#  1               Mikki Clancy           FALSE
#  2 Hermsdorfer,Mark (retired)            TRUE
#  3  CSP,PSECU Lan Unit (typo)            TRUE
#  4             Clifton Gurlen           FALSE
#  5  G<U+FFFD>mez,Oscar Prieto            TRUE
#  6          Sj�¶strand,Anders            TRUE
#  7                 Lisa Terry           FALSE
#  8         Meloy,Wilson {old}            TRUE
#  9            Gregory Stevens           FALSE
# 10          Charles Gruenberg           FALSE
I'll leave any further summarizing to you: apply sum or mean to the TRUE/FALSE values, as you prefer.
Using this data:
df = data.frame(names = c(
"Mikki Clancy","Hermsdorfer,Mark (retired)","CSP,PSECU Lan Unit (typo)","Clifton Gurlen","G�mez,Oscar Prieto","Sj�¶strand,Anders","Lisa Terry","Meloy,Wilson {old}","Gregory Stevens","Charles Gruenberg"
))
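The summary step mentioned above (sum or mean of the TRUE/FALSE flags) might look roughly like the following. This is a minimal base-R sketch, not the answerer's own code: the variable names n_flagged, pct_flagged and flagged are made up for illustration, and the mis-encoded characters are written as "\uFFFD" and "\u00B6" escapes on the assumption that this is what the sample data contains.

```r
# Same names as the sample data; the garbled characters are written as escapes
# (assumption: U+FFFD replacement character and U+00B6 pilcrow).
df <- data.frame(
  names = c(
    "Mikki Clancy", "Hermsdorfer,Mark (retired)", "CSP,PSECU Lan Unit (typo)",
    "Clifton Gurlen", "G\uFFFDmez,Oscar Prieto", "Sj\uFFFD\u00B6strand,Anders",
    "Lisa Terry", "Meloy,Wilson {old}", "Gregory Stevens", "Charles Gruenberg"
  ),
  stringsAsFactors = FALSE
)

# TRUE where the name contains anything other than a Unicode letter or a space
df$has_non_letters <- grepl("[^\\p{L} ]", df$names, perl = TRUE)

# sum() counts the TRUE values; mean() gives their share of all rows
n_flagged   <- sum(df$has_non_letters)
pct_flagged <- 100 * mean(df$has_non_letters)

# The "not complete" names on their own, kept as a one-column data frame
flagged <- df[df$has_non_letters, "names", drop = FALSE]

cat(n_flagged, "of", nrow(df), "rows flagged:", pct_flagged, "%\n")
```

Because mean() of a logical vector is the proportion of TRUE values, no separate division by nrow(df) is needed.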
We can use str_detect:
library(dplyr)
library(stringr)

# Keep rows containing any character that is not an ASCII letter, a comma, or a space
df %>%
  filter(str_detect(Full_name, "[^A-Za-z, ]+"))
                   Full_name Group
1 Hermsdorfer,Mark (retired)     b
2  CSP,PSECU Lan Unit (typo)     c
3         G�mez,Oscar Prieto     e
4          Sj�¶strand,Anders     f
5           Meloy,Will {old}     h
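The two answers use different character classes, and the difference matters once a name contains a correctly encoded accented letter. The sketch below is illustrative only: the three names are hypothetical, and it uses base R's grepl in place of str_detect so that it runs without extra packages. It contrasts an ASCII-only class with the Unicode letter property \p{L}.

```r
# Hypothetical names: a correctly encoded accented letter, an annotation in
# braces, and a plain ASCII name.
x <- c("G\u00f3mez,Oscar Prieto", "Meloy,Wilson {old}", "Lisa Terry")

# ASCII-only class: anything outside A-Z, a-z, comma and space is flagged,
# so a legitimate accented letter counts as a "non-letter".
ascii_only <- grepl("[^A-Za-z, ]", x)

# Unicode letter property (needs perl = TRUE): letters from any script are
# accepted, so only true non-letters such as braces are flagged.
any_letter <- grepl("[^\\p{L}, ]", x, perl = TRUE)

print(data.frame(name = x, ascii_only, any_letter))
```

If the data may contain genuine accented names, the \p{L} form is probably the safer choice; if only plain ASCII names are considered valid, the A-Za-z form is stricter.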