R: How to find select files in a folder based on matching specific column titles

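Sorry for the generic question. I'm looking for pointers on sorting out a data folder that contains many .txt files. They all have different file names, and the vast majority have the same dimensions, i.e. the same number of columns. The trouble is that some files, despite having the same number of columns, have different column names. In other words, some other variables were measured in those files.

I want to weed out these files, and I cannot do that simply by comparing the number of columns. Is there a way to pass in a column name and check which files in the directory contain that column, so that I can move them into another folder?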

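UPDATE:

I have created a dummy folder with files that reflect the problem. Please see the link below to access the files on my Google Drive. In this folder I have included 4 files that contain the problem columns.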

https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing

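The problem is that the code seems able to find the files that match the selection criterion, i.e. the actual names of the problem columns, but I cannot extract the real index of those files in the list. Any pointers?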

library(data.table)

#read in an example file that contains the problem columns
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT",header = T,sep = "\t")

#read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT",sep = "\t")

#get the names of columns of each file
standar.names <- names(df_standard)
var.names <- names(df_var)

same.titles <- var.names %in% standar.names

dff.titles <- !var.names %in% standar.names

#confirm that the only 3 problem columns are columns 129, 130 and 131
mismatched.names <- colnames(df_var[129:131])

#visual check the names of the problematic columns
mismatched.names


# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],sep = "\t",nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files,names)

# get unique names of files
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
#here "to_keep" returns an integer vector that I don't understand
#I thought the numbers should represent the ID/index of the list elements (files),
#but I have fewer than 10 files and the numbers in to_keep are around 1000
#this is probably because it matches the actual index within the unlisted list
#but if I use to_keep <- which(column_names %in% unique_names[1]) it returns an empty vector

to_keep <- which(unlist(column_names)%in% unique_names[1])


#Now if I want to slice the file using to_keep the files_to_keep returns NA NA NA
files_to_keep <- files_in_wd[to_keep]

#once I have a list of targeted files, I can move them into a new folder using file.move
library(filesstrings)
file.move(files_to_keep,"C:/Users/mli/Desktop/weeding/need to reanalysis" )

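Solution

If you can tell the files you want to keep apart from the ones you want to remove based on their column names, you can use the following lines: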

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],sep = ';',header = TRUE,nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files,names)
# get the unique sets of column names
unique_names <- unique(column_names)
# decide which files to keep
to_keep <- which(column_names %in% unique_names[1])

files_to_keep <- files_in_wd[to_keep]

If there are a lot of files, you should avoid the loop or read in only the header of each file.
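As a minimal sketch of the header-only idea: the helper below (the name `read_header` is hypothetical, and a tab separator with the header on the first line is assumed) reads just the first line of each file with `readLines(n = 1)` and splits it into column names, so no data rows are parsed. The demo files are made up.

```r
# Hypothetical helper: return the column names of a delimited text file
# by reading only its first line (assumes the header is on line 1).
read_header <- function(path, sep = "\t") {
  first_line <- readLines(path, n = 1, warn = FALSE)
  strsplit(first_line, split = sep, fixed = TRUE)[[1]]
}

# Self-contained demo on two small temporary files (made-up content)
demo_dir <- tempfile("headers_")
dir.create(demo_dir)
writeLines(c("a\tb\tc", "1\t2\t3"), file.path(demo_dir, "file1.txt"))
writeLines(c("a\tb\tx", "4\t5\t6"), file.path(demo_dir, "file2.txt"))

# One character vector of column names per file, named by file
files <- list.files(demo_dir, full.names = TRUE)
column_names <- lapply(files, read_header)
names(column_names) <- basename(files)
column_names
```

Note that this returns the raw header text: names are not adjusted the way `read.delim` does with `check.names = TRUE`, and any quotes around header fields are kept.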

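Edit after your comment:

By adding nrows = 2, the code reads only the first 2 rows plus the header.
I assume that the first file in the folder has the structure you want to keep, which is why column_names is checked against unique_names[1].
files_to_keep contains the names of the files you want to keep.
You could try running it on part of your data to see whether it works, and worry about efficiency later. I think a vectorized approach might work better.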
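One possible vectorized version, as a sketch rather than a definitive implementation: `column_names` is a list with one character vector per file, so testing each element separately gives one logical value per file. By contrast, `unlist(column_names)` flattens everything into a single long vector, and positions in that vector no longer correspond to files. The file names and the column name `extra_var` below are made up for illustration.

```r
# Made-up example: one character vector of column names per file
files_in_wd <- c("file1.txt", "file2.txt", "file3.txt")
column_names <- list(
  c("id", "value", "time"),
  c("id", "value", "extra_var"),
  c("id", "value", "time")
)

# Column name that marks a file as problematic (hypothetical)
target <- "extra_var"

# One TRUE/FALSE per file: does this file's header contain the target?
has_target <- vapply(column_names, function(nms) target %in% nms, logical(1))

# File-level indices and the matching file names
which(has_target)
files_with_target <- files_in_wd[has_target]
files_with_target
```

Because `has_target` has the same length as `files_in_wd`, it can be used directly to subset the file names.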

Edit: This code works with your dummy data.

library(filesstrings)

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],sep = "\t",nrows = 2,
                             encoding = "UTF-8",check.names = FALSE)
}

# get column names of all files
column_names <- lapply(l_files,names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok

# check if the other files have the same header:
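df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),'filename' = files_in_wd,'keep' = NA,
                            stringsAsFactors = FALSE) # keep file names as character (matters in R < 4.0)

for(i in seq_along(files_in_wd)[-1]){ # skips the loop safely if there is only one file
  df_filehelper$keep[i] <- identical(to_keep,column_names[[i]])
}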

df_filehelper$keep[1] <- TRUE # keep the original file used for selecting the right columns

# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects file that are not to be kept

file.move(files_to_move,"C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
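If installing filesstrings is not an option, the move step can also be sketched in base R with `file.rename()`, which moves a file when source and destination are on the same file system. The folder and file names below are made up. The `pattern` argument to `list.files()` is an added assumption here: it restricts the listing to .txt files, so a destination subfolder inside the working directory is not picked up as if it were a data file.

```r
# Self-contained demo in a temporary folder (all names are made up)
demo_dir <- tempfile("move_")
dir.create(demo_dir)
target_dir <- file.path(demo_dir, "testsubfolder")
dir.create(target_dir)
writeLines("a\tb", file.path(demo_dir, "keep.txt"))
writeLines("a\tx", file.path(demo_dir, "move_me.TXT"))

# List only text files; the subfolder itself is excluded by the pattern
txt_files <- list.files(demo_dir, pattern = "\\.txt$", ignore.case = TRUE)

# Suppose the header check flagged this file for moving
files_to_move <- file.path(demo_dir, "move_me.TXT")

# Move by renaming into the target folder; returns TRUE for each success
moved <- file.rename(from = files_to_move,
                     to = file.path(target_dir, basename(files_to_move)))
moved
```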

Since the files are numerous and large, it may be worth looking for an alternative to R, for example in bash:

# note: `md5` is the macOS tool; on Linux use `md5sum` instead
for f in ctrl*.txt
do
  if [[ "$(head -1 ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt | md5)" != "$(head -1 "$f" | md5)" ]]
    then echo "$f"
  fi
done

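This command compares the column names of the "good file" with the column names of each file, and prints the names of the files that do not match.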