How can I split FASTA strings into multiple rows while keeping the number of columns unchanged in R?

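I am trying to read a FASTA file and display each sequence as individual amino acids in a data frame, with one sequence per column (1 seq = 1 column).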

This is what I have so far:

FASTA_test.txt contains:

>sp|Q9UER7|DAXX_HUMAN Death domain-associated protein 6 OS=Homo sapiens OX=9606 GN=DAXX PE=1 SV=2
MATANSIIVLDDDDEDEAAAQPGPSHPLPNAASPGAEAPSSSEPHGARGSSSSGGKKCYK
LENEKLFEEFLELCKMQTADHPEVVPFLYNRQQRAHSLFLASAEFCNILSRVLSRARSRP
AKLYVYINELCTVLKAHSAKKKLNLAPAATTSNEPSGNNPPTHLSLDPTNAENTASQSPR
TRGSRRQIQRLEQLLALYVAEIRRLQEKELDLSELDDPDSAYLQEARLKRKLIRLFGRLC
ELKDCSSLTGRVIEQRIPYRGTRYPEVNRRIERLINKPGPDTFPDYGDVLRAVEKAAARH
SLGLPRQQLQLMAQDAFRDVGIRLQERRHLDLIYNFGCHLTDDYRPGVDPALSDPVLARR
LRENRSLAMSRLDEVISKYAMLQDKSEEGERKKRRARLQGTSSHSADTPEASLDSGEGPS
GMASQGCPSASRAETDDEDDEESDEEEEEEEEEEEEEATDSEEEEDLEQMQEGQEDDEEE
DEEEEAAAGKDGDKSPMSSLQISNEKNLEPGKQISRSSGEQQNKGRIVSPSLLSEEPLAP
SSIDAESNGEQPEELTLEEESPVSQLFELEIEALPLDTPSSVETDISSSRKQSEEPFTTV
LENGAGMVSSTSFNGGVSPHNWGDSGPPCKKSRKEKKQTGSGPLGNSYVERQRSVHEKNG
KKICTLPSPPSPLASLAPVADSSTRVDSPSHGLVTSSLCIPSPARLSQTPHSQPPRPGTC
KTSVATQCDPEEIIVLSDSD
>sp|P29590|PML_HUMAN Protein PML OS=Homo sapiens OX=9606 GN=PML PE=1 SV=3
MEPAPARSPRPQQDPARPQEPTMPPPETPSEGRQPSPSPSPTERAPASEEEFQFLRCQQC
QAEAKCPKLLPCLHTLCSGCLEASGMQCPICQAPWPLGADTPALDNVFFESLQRRLSVYR
QIVDAQAVCTRCKESADFWCFECEQLLCAKCFEAHQWFLKHEARPLAELRNQSVREFLDG
TRKTNNIFCSNPNHRTPTLTSIYCRGCSKPLCCSCALLDSSHSELKCDISAEIQQRQEEL
DAMTQALQEQDSAFGAVHAQMHAAVGQLGRARAETEELIRERVRQVVAHVRAQERELLEA
VDARYQRDYEEMASRLGRLDAVLQRIRTGSALVQRMKCYASDQEVLDMHGFLRQALCRLR
QEEPQSLQAAVRTDGFDEFKVRLQDLSSCITQGKDAAVSKKASPEAASTPRDPIDVDLPE
EAERVKAQVQALGLAEAQPMAVVQSVPGAHPVPVYAFSIKGPSYGEDVSNTTTAQKRKCS
QTQCPRKVIKMESEEGKEARLARSSPEQPRPSTSKAVSPPHLDGPPSPRSPVIGSEVFLP
NSNHVASGAGEAEERVVVISSSEDSDAENSSSRELDDSSSESSDLQLEGPSTLRVLDENL
ADPQAEDRPLVFFDLKIDNETQKISQLAAVNRESKFRVVIQPEAFFSIYSKAVSLEVGLQ
HFLSFLSSMRRPILACYKLWGPGLPNFFRALEDINRLWEFQEAISGFLAALPLIRERVPG
ASSFKLKNLAQTYLARNMSERSAMAAVLAMRDLCRLLEVSPGPQLAQHVYPFSSLQCFAS
LQPLVQAAVLPRAEARLLALHNVSFMELLSAHRRDRQGGLKKYSRYLSLQTTTLPPAQPA
FNLQALGTYFEGLLEGPALARAEGVSTPLAGRGLAERASQQS

My code:

library("Biostrings")
fastaFile <- readAAStringSet("~/Desktop/FASTA_test.txt")
seq_name = names(fastaFile)
sequence = paste(fastaFile)
df <- data.frame(seq_name, sequence)
View(df)

# separate the aa into separate columns
df_splited_1 <- as.data.frame(do.call(cbind, apply(df, 1, function(x) {
  do.call(expand.grid, strsplit(df$sequence, ""))
})))

View(df_splited_1)

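The problem I am facing is that the script above does split the amino acids, but it puts them all into a single column instead of keeping one separate column per sequence.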

dput(fastaFile)
new("AAStringSet",pool = new("SharedRaw_Pool",xp_list = list(
    <pointer: 0x0>),.link_to_cached_object_list = list(<environment>)),ranges = new("GroupedIRanges",group = c(1L,1L),start = c(1L,741L),width = c(740L,882L),NAMES = c("sp|Q9UER7|DAXX_HUMAN Death domain-associated protein 6 OS=Homo sapiens OX=9606 GN=DAXX PE=1 SV=2","sp|P29590|PML_HUMAN Protein PML OS=Homo sapiens OX=9606 GN=PML PE=1 SV=3"
    ),elementType = "ANY",elementMetadata = NULL,Metadata = list()),elementType = "AAString",Metadata = list())

Thanks for your help!
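
For reference, one possible way to get the "1 seq = 1 column" layout is sketched below in base R. It is a minimal sketch, not a definitive solution. It assumes the sequences are already available as a named character vector (like `sequence` above, with `seq_name` as its names), and it pads the shorter sequence with `NA` because the two proteins have different lengths (740 and 882 residues). The toy sequences and the names `seq_A`, `seq_B`, `residues`, `padded` and `df_split` are hypothetical stand-ins introduced only for illustration.

```r
# Toy stand-ins for the real sequences (hypothetical values, shortened for clarity)
sequence <- c(seq_A = "MATANS", seq_B = "MEPAPARSP")

# Split every sequence into single characters:
# the result is a list with one character vector per sequence
residues <- strsplit(sequence, "")

# Pad shorter sequences with NA so that all columns have the same number of rows
max_len <- max(lengths(residues))
padded <- lapply(residues, function(x) {
  c(x, rep(NA_character_, max_len - length(x)))
})

# One column per sequence, one row per residue position
df_split <- as.data.frame(padded, stringsAsFactors = FALSE)

print(dim(df_split))
```

With the real data, the full FASTA headers contain spaces and `|` characters, so `as.data.frame()` would likely alter them when turning them into column names; shortening the names first (for example to the accession) may be preferable.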
