R: How to find selected files in a folder based on matching specific column headers
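Sorry for the generic question. I'm looking for pointers on tidying up a data folder that contains many .txt files. They all have different titles, and the vast majority have the same dimensions, i.e. the same number of columns. The trouble is that some files, despite having the same number of columns, have different column names. In those files, some other variables were measured.
I want to weed out these files, and I can't do that simply by comparing the number of columns. Is there a way to pass in a column name and check which files in the directory contain that column, so that I can move them to another folder?
Update:
I have created a dummy folder with files that reflect the problem. Please see the link below to access the files on my Google Drive. I have put 4 files that contain the problem columns in this folder.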
https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing
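The problem is that the code seems able to find the files that match the selection criterion, i.e. the actual names of the problem columns, but I can't extract the real indices of those files in the list. Any pointers?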
library(data.table)
#read in an example file that has the problem columns
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT",header = T,sep = "\t")
#read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT",sep = "\t")
#get the names of columns of each file
standar.names <- names(df_standard)
var.names <- names(df_var)
same.titles <- var.names %in% standar.names
dff.titles <- !var.names %in% standar.names
#confirm that the only 3 problem columns are columns 129, 130 and 131
mismatched.names <- colnames(df_var[129:131])
#visually check the names of the problematic columns
mismatched.names
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],sep = "\t",nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files,names)
# get unique names of files
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
#here "to_keep" is an integer vector that I don't understand:
#I thought the numbers would be the indices of the list elements (the files),
#but I have fewer than 10 files and the numbers in to_keep are around 1000.
#This is probably because they are positions in the unlisted vector of names,
#but to_keep <- which(column_names %in% unique_names[1]) returns an empty vector
to_keep <- which(unlist(column_names)%in% unique_names[1])
#now if I subset the file names with to_keep, files_to_keep is NA NA NA
files_to_keep <- files_in_wd[to_keep]
#once I have the list of target files, I can move them into a new folder with file.move
library(filesstrings)
file.move(files_to_keep,"C:/Users/mli/Desktop/weeding/need to reanalysis" )
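Solution
If you can tell the files you want to keep apart from the files you want to remove by their column names, you can use the following lines: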
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],sep = ';',header = TRUE,nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files,names)
# get the unique sets of column names (one entry per distinct header)
unique_names <- unique(column_names)
# decide which files to keep
to_keep <- which(column_names %in% unique_names[1])
files_to_keep <- files_in_wd[to_keep]
If there are many files, you should avoid the loop or read in only the header of each file.
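Reading only the header could be sketched as follows. This is a minimal illustration, not part of the original answer: `read_header` is a hypothetical helper, and the temporary demo files stand in for the real data. It assumes every file is non-empty, tab-separated, and has unquoted column names on its first line.

```r
# Hypothetical helper: return the column names of a delimited text file
# by reading only its first line instead of parsing any data rows.
read_header <- function(path, sep = "\t") {
  first_line <- readLines(path, n = 1, warn = FALSE)
  strsplit(first_line, split = sep, fixed = TRUE)[[1]]
}

# Demo on two small temporary files (stand-ins for the real data files).
demo_dir <- file.path(tempdir(), "header_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("a\tb\tc", "1\t2\t3"), file.path(demo_dir, "file1.txt"))
writeLines(c("a\tb\textra", "1\t2\t3"), file.path(demo_dir, "file2.txt"))

# One character vector of column names per file, named by file name.
files <- list.files(demo_dir, full.names = TRUE)
column_names <- lapply(files, read_header)
names(column_names) <- basename(files)
```

Unlike `read.delim`, this does not alter the names (no `check.names` conversion), so the results correspond to reading with `check.names = FALSE`.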
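Edit after comments:
- By adding nrows = 2, the code reads only the first 2 rows plus the header.
- I assume that the first file in the folder has the structure you want to keep, which is why column_names is checked against unique_names[1].
- files_to_keep contains the names of the files you want to keep.
- You could try running it on part of the data to see whether it works, and worry about efficiency later. I think a vectorized approach might work better.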
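A vectorized check for one specific column name might look like the sketch below. The file names, column names, and `target` are made-up stand-ins. It also shows why the question's `which(unlist(column_names) %in% ...)` returned numbers around 1000: `unlist` flattens all headers into one long vector, so the positions refer to that vector rather than to files. Testing each list element separately keeps one result per file.

```r
# Stand-in for the result of lapply(l_files, names): one character
# vector of column names per file, in the same order as files_in_wd.
files_in_wd <- c("file1.txt", "file2.txt", "file3.txt")
column_names <- list(
  c("id", "value"),
  c("id", "value", "extra_var"),
  c("id", "value")
)

# Hypothetical name of the problem column.
target <- "extra_var"

# One TRUE/FALSE per file: does this file's header contain the target?
has_target <- vapply(column_names, function(nms) target %in% nms, logical(1))

# The indices now refer to files, so they can subset the file names.
files_with_target <- files_in_wd[which(has_target)]
```

`files_with_target` can then be passed to `file.move()` in the same way as `files_to_keep` in the question.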
Edit: this code works for your dummy data.
library(filesstrings)
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
l_files[[i]] <- read.delim(file = files_in_wd[i],sep = "\t",nrows = 2,encoding = "UTF-8",check.names = FALSE)
}
# get column names of all files
column_names <- lapply(l_files,names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok
# check if the other files have the same header:
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),'filename' = files_in_wd,'keep' = NA)
for(i in seq_along(files_in_wd)[-1]){ # skips file 1; also safe if the folder holds a single file
df_filehelper$keep[i] <- identical(to_keep,column_names[[i]])
}
df_filehelper$keep[1] <- TRUE # keep the original file used for selecting the right columns
# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects file that are not to be kept
file.move(files_to_move,"C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
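Before moving anything, it may help to see which columns actually differ in each file. The sketch below is an illustration on made-up column names, not part of the original answer. It uses `setdiff`, which ignores column order, so it is a looser test than the `identical()` comparison above.

```r
# Made-up reference header and per-file headers (stand-ins for
# column_names[[1]] and the full column_names list).
reference <- c("id", "value")
column_names <- list(
  file1.txt = c("id", "value"),
  file2.txt = c("id", "value", "extra_var"),
  file3.txt = c("id", "other")
)

# Columns present in a file but absent from the reference header.
extra_cols <- lapply(column_names, setdiff, y = reference)

# Columns of the reference header that a file lacks.
missing_cols <- lapply(column_names, function(nms) setdiff(reference, nms))

# A file differs if it has any extra or any missing column.
differs <- lengths(extra_cols) > 0 | lengths(missing_cols) > 0
files_that_differ <- names(differs)[differs]
```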
Given the large number and size of the files, it may be worth looking for an alternative to R, for example in bash:
# Note: md5 is the macOS/BSD command; on Linux use md5sum instead
ref="ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt"
for f in ctrl*.txt
do
  if [[ "$(head -1 "$ref" | md5)" != "$(head -1 "$f" | md5)" ]]
  then echo "$f"
  fi
done
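This command compares the column names of the "good file" with the column names of each file and prints the names of the files that do not match.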