
R: How to find selected files in a folder by matching specific column headers

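Sorry for the generic question. I am looking for pointers on tidying up a data folder that contains many .txt files. They all have different file names, and the vast majority of them have the same dimensions, i.e. the same number of columns. The trouble is that some files, despite having the same number of columns, have different column names. That is, some other variables were measured in those files.

I want to weed out these files, and I cannot do it simply by comparing the number of columns. Is there a way to pass in a column name and check which files in the directory contain that column, so that I can move them into another folder?

Update:

I have created a dummy folder with files that reflect the problem. Please see the link below to access the files on my Google Drive. In this folder, I have taken 4 files that contain the problem columns.

https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing

The problem is that the code seems able to find the files that match the selection criterion, i.e. the actual names of the problem columns, but I cannot extract the true index of those files in the list. Any pointers?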

library(data.table)

# read in the example file that contains the problem columns
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT",header = TRUE,sep = "\t")

# read in a file that I want to use as the reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT",sep = "\t")

# get the column names of each file
standar.names <- names(df_standard)
var.names <- names(df_var)

same.titles <- var.names %in% standar.names

dff.titles <- !var.names %in% standar.names

# confirm that the only 3 problem columns are columns 129, 130 and 131
mismatched.names <- colnames(df_var[129:131])

# visually check the names of the problem columns
mismatched.names


# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],sep = "\t",nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files,names)

# get the unique names of the problem columns
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to move
# here "to_keep" is an integer vector that I don't understand:
# I thought the numbers would be the indices of the files,
# but I have fewer than 10 files and the numbers in to_keep are around 1000.
# This is probably because they are positions in the unlisted (flattened) vector,
# but to_keep <- which(column_names %in% unique_names[1]) returns an empty vector

to_keep <- which(unlist(column_names)%in% unique_names[1])


# now if I subset the file list with to_keep, files_to_keep is NA NA NA
files_to_keep <- files_in_wd[to_keep]

# once I have the list of target files, I can move them into a new folder with file.move
library(filesstrings)
file.move(files_to_keep,"C:/Users/mli/Desktop/weeding/need to reanalysis" )

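Solution

If you can tell the files you want to keep apart from the ones you want to remove by their column names, you can use the following lines: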

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read the first rows of every file in wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],sep = ';',header = TRUE,nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files,names)
# get the unique sets of column names (one entry per distinct header)
unique_names <- unique(column_names)
# decide which files to keep: those whose header matches the first unique set
to_keep <- which(column_names %in% unique_names[1])

files_to_keep <- files_in_wd[to_keep]

If there are many files, you should avoid the loop, or read in only the header of each file.
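As a rough illustration of reading only the header, the sketch below takes the first line of each file with readLines() and splits it on the separator. The helper name read_header and the demo files are assumptions made up for this example, and it assumes tab-separated headers without quoted fields.

```r
# read_header() is a hypothetical helper, not part of any package.
# It reads only the first line of a file and splits it into column names.
read_header <- function(path, sep = "\t") {
  first_line <- readLines(path, n = 1, warn = FALSE)
  if (length(first_line) == 0) return(character(0))  # empty file
  strsplit(first_line, sep, fixed = TRUE)[[1]]
}

# Demo on two small temporary files (made-up data)
demo_dir <- file.path(tempdir(), "header_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("a\tb\tc", "1\t2\t3"), file.path(demo_dir, "good.txt"))
writeLines(c("a\tb\textra", "1\t2\t3"), file.path(demo_dir, "odd.txt"))

files <- file.path(demo_dir, c("good.txt", "odd.txt"))
headers <- lapply(files, read_header)
names(headers) <- basename(files)
```

Unlike read.delim(), this does not alter the names (no check.names conversion), so the names should be compared with headers obtained the same way.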

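Edit after the comments:

  • With nrows = 2, the code reads only the header plus the first 2 rows of each file.
  • I assume that the first file in the folder has the structure you want to keep, which is why column_names is checked against unique_names[1].
  • files_to_keep contains the names of the files you want to keep.
  • You could try running it on part of the data to see whether it works, and worry about efficiency later. I think a vectorised approach might be better.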
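One possible vectorised check, sketched under assumptions: it tests each file's column names separately, so the result has one entry per file and can index files_in_wd directly. The file names and column names below are made up for illustration.

```r
# Made-up column names for three files, in the same order as files_in_wd
files_in_wd <- c("file_a.txt", "file_b.txt", "file_c.txt")
column_names <- list(
  c("id", "x", "y"),
  c("id", "x", "y", "extra_1", "extra_2"),
  c("id", "x", "y")
)
problem_columns <- c("extra_1", "extra_2")  # hypothetical names

# One TRUE/FALSE per file: does this file contain any problem column?
has_problem <- vapply(column_names,
                      function(nms) any(problem_columns %in% nms),
                      logical(1))

files_to_move <- files_in_wd[has_problem]
files_to_keep <- files_in_wd[!has_problem]
```

By contrast, which(unlist(column_names) %in% "extra_1") gives 7 here, a position in the flattened vector of all names rather than a file index, which would explain indices far larger than the number of files.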

Edit: This code works with your dummy data.

library(filesstrings)

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],sep = "\t",nrows = 2,
                             encoding = "UTF-8",check.names = FALSE)
}

# get column names of all files
column_names <- lapply(l_files,names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok

# check whether every file has the same header as the reference file;
# stringsAsFactors = FALSE keeps the filenames as character in R < 4.0
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),'filename' = files_in_wd,'keep' = NA,stringsAsFactors = FALSE)

# vapply also handles a folder with a single file, where 2:length(files_in_wd) would count down from 2 to 1;
# file #1 is compared with itself, so it is always kept
df_filehelper$keep <- vapply(column_names,identical,logical(1),y = to_keep)

# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects file that are not to be kept

file.move(files_to_move,"C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
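If installing filesstrings is not an option, the move can be approximated with base R. This is only a sketch: move_files_base is a hypothetical helper name, the demo paths are made up, and file.rename() may fail when source and destination are on different file systems.

```r
# move_files_base() is a hypothetical helper built on base R only.
# It creates the destination folder if needed and renames each file into it.
move_files_base <- function(files, dest_dir) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  file.rename(files, file.path(dest_dir, basename(files)))
}

# Demo in a temporary folder (made-up files)
src_dir <- file.path(tempdir(), "move_demo")
dir.create(src_dir, showWarnings = FALSE)
src <- file.path(src_dir, c("one.txt", "two.txt"))
for (f in src) writeLines("header", f)

dest_dir <- file.path(src_dir, "need_reanalysis")
moved <- move_files_base(src, dest_dir)  # logical vector, one entry per file
```

The returned logical vector reports which renames succeeded, so it can be checked before assuming the folder is clean.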

Given the large number and size of the files, it may be worth looking at an alternative to R, for example in bash:

ref="ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt"
for f in ctrl*.txt
do
  # compare the header lines directly; quoting "$f" keeps filenames with spaces intact
  if [[ "$(head -n 1 "$ref")" != "$(head -n 1 "$f")" ]]
    then echo "$f"
  fi
done

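This command compares the column names of the "good" file with the column names of each file and prints the names of the files that do not match.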
