如何解决如何将基因组SNP转换为二进制格式? script.awk 运行脚本script.awk 运行单班轮awk脚本
我正在努力找出如何转换(SNP)文件,如下所示:
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,C,G,G
79115,A,T,C
79116,A
79124,C
79128,G
为这样的二进制格式:
pos,1,1
79115,0
79116,0
79124,0
79128,0
有人知道R或命令行工具awk代码中可用于此目的的软件包吗?
1097023,A
1097027,C
4363243,A
4363270,T
4363275,G
1097023,0
1097027,0
4363243,1
4363270,0
4363275,1
应为:
1097023,1
1097027,0
4363270,0
因此,我认为解决方案应该包括大多数样本(即> 7个样本相同,则为0,与多数不匹配的样本为1。
解决方法
这是一个简单的awk
(标准Linux awk/gawk
)脚本:
script.awk
BEGIN {FS = ","} # set field seperator to ","
NR>1{ # every line except the first line
zeroIndc=$2; # identify the zero indicator and save the variable
for (i = 2; i <= NF; i++) { # for each DNA letter
if ($i == zeroIndc) { # if DNA letter match zero indicator
$i = 0; # set DNA letter to 0
} else { # if DNA letter not match zero indicator
$i = 1; # set DNA letter to 1
}
}
print; # print new line at end
}
运行脚本script.awk
awk -f scirpt.awk input.txt
运行单班轮awk脚本
awk 'BEGIN{FS=","}NR>1{z=$2;for(i=2;i<=NF;i++)$i=($i==z)?0:1;print;}' input.txt
,
使用基数R,您可以将每个sample
列与第一个示例列进行比较,并在值不匹配的地方返回1。
cols <- grep('sample',names(df))
df[cols] <- +(df$sample1 != df[cols])
df
# pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9
#1 79107 0 0 0 0 0 0 0 0 0
#2 79115 0 0 0 0 0 0 0 1 0
#3 79116 0 0 0 0 0 0 0 0 0
#4 79124 0 0 1 1 0 0 0 0 0
#5 79128 0 0 0 0 1 0 0 0 0
# sample10 sample11 sample12 sample13 sample14
#1 0 0 0 1 1
#2 0 1 0 0 0
#3 0 0 0 0 0
#4 0 0 0 0 0
#5 1 0 0 0 0
尽管上面的方法在大型数据集上效率更高,但这是dplyr
库的替代方法。
library(dplyr)
df <- df %>% mutate(across(contains('sample'),~+(sample1 != .)))
数据
df <- structure(list(pos = c(79107L,79115L,79116L,79124L,79128L
),sample1 = c("C","C","A","G"),sample2 = c("C",sample3 = c("C","T",sample4 = c("C",sample5 = c("C","A"),sample6 = c("C",sample7 = c("C",sample8 = c("C",sample9 = c("C",sample10 = c("C","C"),sample11 = c("C",sample12 = c("C",sample13 = c("G",sample14 = c("G","G")),class = "data.frame",row.names = c(NA,-5L))
,
$ awk -F,'NR>1{gsub($2,0); gsub(/[ACTG]/,1)} 1' file
pos,sample1,sample2,sample3,sample4,sample5,sample6,sample7,sample8,sample9,sample10,sample11,sample12,sample13,sample14
79107,1,1
79115,0
79116,0
79124,0
79128,0
,
假设您想将“引用”等位基因编码为0
,其中“引用”是最常见的,这是使用Python 3(其pandas
数据处理库(使用pip3 install pandas
和Counter
specialized dictionary安装。
从包含示例数据的文件snps_letters.csv
开始:
$ cat snps_letters.csv
pos,C,G,G
79115,A,T,C
79116,A
79124,C
79128,G
我们将打开它,对其进行转换并将其保存在脚本binarize.py
中:
#!/usr/bin/env python3
from collections import Counter
import pandas as pd
# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv("snps_letters.csv",index_col="pos")
def binarize(column):
# Extracting the most common element as ref
# (using list and tuple unpacking,# because the most_common method returns
# a list of (element,count) pairs):
[(ref,_)] = Counter(column).most_common(1)
# We can now define an auxiliary function
# to recode individual elements in the column:
def to_bin(elem):
# int(True) -> 1,int(False) -> 0
return 1 - int(elem == ref)
# Transform the column by applying the recoding function to its elements
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
return column.apply(to_bin)
# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize,axis=0)
# Write the result to a file:
binarized.to_csv("snps_binary.csv")
在shell中执行脚本:
$ ./binarize.py
检查保存的结果:
$ cat snps_binary.csv
pos,0
79115,1
79116,1
79124,1
79128,0
要拥有更可重用的脚本,您可以将输入和输出文件路径作为参数传递给命令行,它们可以在python代码中以sys.argv[1]
和sys.argv[2]
(首先为您提供了import sys
)。
一个版本,它在命令行上获取输入和输出文件,并将参考核苷酸编码为1而不是0:
#!/usr/bin/env python3
import sys
from collections import Counter
import pandas as pd
# Load the comma-separated data in a pandas DataFrame object:
snps = pd.read_csv(sys.argv[1],int(False) -> 0
return int(elem == ref)
# Transform the column by applying the recoding function to its elements
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html#pandas.Series.apply):
return column.apply(to_bin)
# Now apply our column-transforming function to the columns (axis=0)
# (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html):
binarized = snps.apply(binarize,axis=0)
# Write the result to a file:
binarized.to_csv(sys.argv[2])
它的用法如下:
./binarize.py snps_letters.csv snps_binary.csv
,
使用data from Ronak's answer识别参考等位基因,在这种情况下,基于最频繁的等位基因,我们有5个SNP REF等位基因:
REF <- apply(df[,-1],function(i) names(which.max(table(i))))
# [1] "C" "C" "A" "C" "G"
现在,将每个样本的列值与 REF 进行比较,然后通过将逻辑TRUE / FALSE乘以1转换为整数,然后将其分配回原始的 dataframe :
df[,-1] <- (df[,-1] != REF) * 1
df
# pos sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 sample13 sample14
# 1 79107 0 0 0 0 0 0 0 0 0 0 0 0 1 1
# 2 79115 0 0 0 0 0 0 0 1 0 0 1 0 0 0
# 3 79116 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 4 79124 0 0 1 1 0 0 0 0 0 0 0 0 0 0
# 5 79128 0 0 0 0 1 0 0 0 0 1 0 0 0 0
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。