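R: matching combinations of nested list values against an index and returning values
Hi, I have two datasets. The first is a list of genes associated with a given cluster (0-7):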
# gene output
Cluster <- rep(0:7,each = 10)
Gene <- c("LMO3","NEUROD6","NFIB","SNAP25","RTN1","CPE","SOX11","CSRP2","VAMP2","ID2","EMX2","LHX5-AS1","PEG10","HES1","TRH","WLS","TPBG","RPS29","CRABP2","RSPO3","RPL17","RPL7","PTMA","RPL36A","HMGN2","H2AFZ","PABPC1","HNRNPH1","PTN","FABP7","IGFBP2","ID4","C1orf61","VIM","RPS27L","FABP5","SDCBP","BNIP3","TCF7L2","NEFL","HMGCS1","GAP43","GPM6A","SQLE","MSMO1","SCOC","BASP1","TTR","MEST","MDK","TMBIM6","RCN1","C8orf59","ID3","PKM","NCOR1","ELAVL4","NNAT","ETFB","STMN2","TUBA1A","GNG3","MALAT1","SOX4","TUBB2B","CRYAB","GFAP","CHCHD2","HOPX","LGALS1","SCRG1","ISG15","AC090498.1","B2M","CLU")
df <- data.frame(cbind(Cluster,Gene))
The second is an index that provides cell-type annotations for specific gene combinations:
# index
Type <- c("Radial Glia","Excitatory Neuron ","Inhibitory Neuron","IPC","Radial Glia","Microglia","Inhibitory Neuron")
Subtype <- c("early","Layer IV","SST-MGE1","IPC-div2","Parietal and Temporal","oRG/Astrocyte","IPC-new","MGE2")
Markers <- c("TOP2A AURK HMGB CTNNB1","PPP1R1B SCN2A RORB CRYM","DLX6-AS1 DLX1 SST DCX","ERBB4 SST DLX2 DLX5 DLX6-AS1","CCNB2 NEUROD4 KIF15 PENK HES6 ZFHX4 GLI3","MEF2C STMN2 FLT ROBO CRYM","AQP4 GFAP AGT DIO2 IL33","C1QB AIF1 CCL4 C1QC","CENPK EOMES","CCK LHX6 SCGN SST")
index <- data.frame(cbind(Type,Subtype,Markers))
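I am trying to find, within the gene lists in df, the specific combinations outlined in Markers. When such a match is found, the corresponding Type and Subtype should be returned. However, there are several caveats that I am struggling to get my head around.
- The gene list for each cluster may contain several marker combinations, so the function should iterate over every marker combination rather than stop at the first match.
- The index matching should be done on each cluster separately, i.e. check the genes in cluster 0 for marker matches and return the Type/Subtype, then repeat for cluster 1, and so on.
My project data contains dozens of df-like outputs, each made up of a varying number of clusters, and each cluster contains hundreds to thousands of genes. I have done my best to search online for a solution, but unfortunately I am drawing a blank here.
Any help, advice, or suggestions would be greatly appreciated.
Edit:
The output would look something like this: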
  Cluster    Gene        Type Subtype
1       0    LMO3 Radial Glia   early
2       0 NEUROD6        <NA>    <NA>
3       0    NFIB        <NA>    <NA>
4       0  SNAP25        <NA>    <NA>
5       0    RTN1        <NA>    <NA>
6       0     CPE        <NA>    <NA>
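A correct match would add a row to df with the corresponding Type and Subtype for each cluster, while the rest would be empty (NA).
Solution
There may be a simpler way to do this, but here is a loop: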
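# Start with an empty result and append the matching rows gene by gene
output = data.frame(Cluster=character(),Gene=character(),Type=character(),Subtype=character())
for(i in seq_len(nrow(df))){
  cluster = df[i,1]
  gene = df[i,2]
  # Rows of the index whose Markers string contains this gene (substring match)
  type = index[grep(gene,index$Markers,fixed=TRUE),]
  n_types = nrow(type)
  # One output row per matching index entry; a gene with no match adds no rows
  tmp = data.frame(Cluster=rep(cluster,n_types),Gene=rep(gene,n_types),Type=type[,1],Subtype=type[,2])
  output = rbind(output,tmp)
}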
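One caveat of the loop is that `grep` matches substrings, so a gene such as `ID2` would also hit a longer marker that merely contains it. Below is a minimal base R sketch of an exact-match variant, assuming the markers are separated by single spaces as in the question. The helper name `annotate_genes`, the toy data, and the marker `ID2B` are made up for illustration.

```r
# Toy data in the same shape as the question's df and index (hypothetical values)
df <- data.frame(Cluster = c("0", "0", "1"),
                 Gene    = c("GFAP", "ID2", "SST"),
                 stringsAsFactors = FALSE)
index <- data.frame(Type    = c("Radial Glia", "Inhibitory Neuron"),
                    Subtype = c("oRG/Astrocyte", "MGE2"),
                    Markers = c("AQP4 GFAP AGT", "CCK LHX6 SST ID2B"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: annotate each gene with every index entry that lists it
annotate_genes <- function(df, index) {
  # Split each Markers string into a vector of whole gene symbols
  marker_sets <- strsplit(index$Markers, " ", fixed = TRUE)
  # Long lookup table: one row per (index entry, marker) pair
  lookup <- data.frame(
    Gene    = unlist(marker_sets),
    Type    = rep(index$Type, lengths(marker_sets)),
    Subtype = rep(index$Subtype, lengths(marker_sets)),
    stringsAsFactors = FALSE
  )
  # Left join keeps unmatched genes, with NA for Type and Subtype
  merged <- merge(df, lookup, by = "Gene", all.x = TRUE)
  merged[, c("Cluster", "Gene", "Type", "Subtype")]
}

output <- annotate_genes(df, index)
output
```

Unlike the loop, this keeps unmatched genes as NA rows, which is closer to the output shown in the question. Note that `merge` sorts the result by `Gene` by default.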
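I'm assuming that you want to annotate each gene cluster with a type from the index whenever all of that type's markers are present in the cluster's gene pool.
I'll also use somewhat simplified datasets. First, a simplified index with two types: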
library(tidyverse)

index <- bind_rows(
  tibble(type = "AB", subtype = "X", markers = c("A", "B")),
  tibble(type = "BC", subtype = "Y", markers = c("B", "C")),
)
index
#> # A tibble: 4 x 3
#>   type  subtype markers
#>   <chr> <chr>   <chr>
#> 1 AB    X       A
#> 2 AB    X       B
#> 3 BC    Y       B
#> 4 BC    Y       C
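The index in the question stores each marker combination as one space-separated string, whereas the simplified index has one row per marker. One possible conversion is sketched here with `tidyr::separate_rows()`. The rows of `index_wide` are hypothetical values in the question's wide format, and the names `index_wide` and `index_long` are assumptions of this sketch.

```r
library(dplyr)
library(tidyr)

# Hypothetical entries in the wide format used in the question
index_wide <- tibble(
  Type    = c("Radial Glia", "IPC"),
  Subtype = c("oRG/Astrocyte", "IPC-new"),
  Markers = c("AQP4 GFAP AGT", "CENPK EOMES")
)

# One row per marker, with column names matching the simplified index
index_long <- index_wide %>%
  separate_rows(Markers, sep = " ") %>%
  rename(type = Type, subtype = Subtype, markers = Markers)

index_long
```

The result has the same three columns as the simplified index, so it should be usable in its place in the steps that follow.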
And three different clusters that illustrate the different matching scenarios:
clusters <- bind_rows(
  tibble(cluster = 0, genes = c("A", "B", "C")), # 2 matches
  tibble(cluster = 1, genes = c("B", "C", "D")), # 1 match
  tibble(cluster = 2, genes = c("C", "D", "E")), # No matches
)
clusters
#> # A tibble: 9 x 2
#>   cluster genes
#>     <dbl> <chr>
#> 1       0 A
#> 2       0 B
#> 3       0 C
#> 4       1 B
#> 5       1 C
#> 6       1 D
#> 7       2 C
#> 8       2 D
#> 9       2 E
I'd approach this by first creating a function that returns the matching types for a given gene pool:
match_index <- function(genes) {
  matches <- index %>%
    group_by(type, subtype) %>%
    filter(all(markers %in% genes)) %>%
    distinct(type, subtype)

  # If none matched, return a row of NAs
  if (nrow(matches)) matches else matches[NA_integer_, ]
}
You can then summarise each cluster with that function:
clusters %>%
  group_by(cluster) %>%
  summarise(match_index(genes))
#> `summarise()` regrouping output by 'cluster' (override with `.groups` argument)
#> # A tibble: 4 x 3
#> # Groups:   cluster [3]
#>   cluster type  subtype
#>     <dbl> <chr> <chr>
#> 1       0 AB    X
#> 2       0 BC    Y
#> 3       1 BC    Y
#> 4       2 <NA>  <NA>
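In dplyr 1.1.0 and later, `summarise()` warns when a group returns more than one row and points to `reframe()` instead. The following is a sketch of the same idea written for newer versions, not the original answer's code. It rebuilds the simplified data so that it runs on its own, and it builds the NA row explicitly, which is a choice made for this sketch.

```r
library(dplyr)

# Simplified index and clusters, as in the example data
index <- bind_rows(
  tibble(type = "AB", subtype = "X", markers = c("A", "B")),
  tibble(type = "BC", subtype = "Y", markers = c("B", "C"))
)
clusters <- bind_rows(
  tibble(cluster = 0, genes = c("A", "B", "C")),
  tibble(cluster = 1, genes = c("B", "C", "D")),
  tibble(cluster = 2, genes = c("C", "D", "E"))
)

# Return every (type, subtype) whose markers are all present in `genes`
match_index <- function(genes) {
  matches <- index %>%
    group_by(type, subtype) %>%
    filter(all(markers %in% genes)) %>%
    ungroup() %>%
    distinct(type, subtype)
  # If nothing matched, return a single row of NAs so the cluster is kept
  if (nrow(matches) > 0) {
    matches
  } else {
    tibble(type = NA_character_, subtype = NA_character_)
  }
}

# reframe() allows any number of rows per group and returns an ungrouped tibble
annotated <- clusters %>%
  group_by(cluster) %>%
  reframe(match_index(genes))

annotated
```

For the dozens of df-like outputs mentioned in the question, one option might be to stack them first, for example with `bind_rows(list_of_dfs, .id = "sample")` where `list_of_dfs` is a hypothetical named list, and then group by both `sample` and `cluster`.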