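I have some code that tries to determine the positions of the start and stop codons in a given DNA sequence.
We define the start codon as the sequence ATG and the stop codons as the sequences TGA, TAA, and TAG.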
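The problem I'm having is that the code below only works for the first two sequences (DM208659 and AF038953) and not for the rest.
What is wrong with my approach below?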
#!/usr/bin/perl -w
while (<DATA>) {
    chomp;
    print "$_\n";
    my ($id, $rna_sq) = split(/\s+/, $_);
    local $_ = $rna_sq;
    while (/atg/g) {
        my $start = pos() - 2;
        if (/tga|taa|tag/g) {
            my $stop    = pos();
            my $gene    = substr($_, $start - 1, $stop - $start + 1), $/;
            my $genelen = length($gene);
            my $ct      = "$id $start $stop $gene $genelen";
            print "\t$ct\n";
        }
    }
}
__DATA__
DM208659 gtgggcctcaaatgtggagcactattctgatgtccaagtggaaagtgctgcgacatttgagcgtcac
AF038953 gatcccagacctcggcttgcagtagtgttagactgaagataaagtaagtgctgtttgggctaacaggatctcctcttgcagtctgcagcccaggacgctgattccagcagcgccttaccgcgcagcccgaagattcactatggtgaaaatcgccttcaatacccctaccgccgtgcaaaaggaggaggcgcggcaagacgtggaggccctcctgagccgcacggtcagaactcagatactgaccggcaaggagctccgagttgccacccaggaaaaagagggctcctctgggagatgtatgcttactctcttaggcctttcattcatcttggcaggacttattgttggtggagcctgcatttacaagtacttcatgcccaagagcaccatttaccgtggagagatgtgcttttttgattctgaggatcctgcaaattcccttcgtggaggagagcctaacttcctgcctgtgactgaggaggctgacattcgtgaggatgacaacattgcaatcattgatgtgcctgtccccagtttctctgatagtgaccctgcagcaattattcatgactttgaaaagggaatgactgcttacctggacttgttgctggggaactgctatctgatgcccctcaatacttctattgttatgcctccaaaaaatctggtagagctctttggcaaactggcgagtggcagatatctgcctcaaacttatgtggttcgagaagacctagttgctgtggaggaaattcgtgatgttagtaaccttggcatctttatttaccaactttgcaataacagaaagtccttccgccttcgtcgcagagacctcttgctgggtttcaacaaacgtgccattgataaatgctggaagattagacacttccccaacgaatttattgttgagaccaagatctgtcaagagtaagaggcaacagatagagtgtccttggtaataagaagtcagagatttacaatatgactttaacattaaggtttatgggatactcaagatatttactcatgcatttactctattgcttatgccgtaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
BC021011 ggggagtccggggcggcgcctggaggcggagccgcccgctgggctaaatggggcagaggccgggaggggtgggggttccccgcgccgcagccatggagcagcttcgcgccgccgcccgtctgcagattgttctg
DM208660 gggatactcaaaatgggggcgctttcctttttgtctgtactgggaagtgcttcgattttggggtgtccc
AF038954 ggacccaagggggccttcgaggtgccttaggccgcttgccttgctctcagaatcgctgccgccatggctagtcagtctcaggggattcagcagctgctgcaggccgagaagcgggcagccgagaaggtgtccgaggcccgcaaaagaaagaaccggaggctgaagcaggccaaagaagaagctcaggctgaaattgaacagtaccgcctgcagagggagaaagaattcaaggccaaggaagctgcggcattgggatcccgtggcagttgcagcactgaagtggagaaggagacccaggagaagatgaccatcctccagacatacttccggcagaacagggatgaagtcttggacaacctcttggcttttgtctgtgacattcggccagaaatccatgaaaactaccgcataaatggatagaagagagaagcacctgtgctgtggagtggcattttagatgccctcacgaatatggaagcttagcacagctctagttacattcttaggagatggccattaaattatttccatatattataagagaggtccttccactttttggagagtagccaatctagctttttggtaacagacttagaaattagcaaagatgtccagctttttaccacagattcctgagggattttagatgggtaaatagagtcagactttgaccaggttttgggcaaagcacatgtatatcagtgtggacttttcctttcttagatctagtttaaaaaaaaaaaccccttaccattctttgaagaaaggaggggattaaataattttttcccctaacactttcttgaaggtcaggggctttatctatgaaaagttagtaaatagttctttgtaacctgtgtgaagcagcagccagccttaaagtagtccattcttgctaatggttagaacagtgaatactagtggaattgtttgggctgcttttagtttctcttaatcaaaattactagatgatagaattcaagaacttgttacatgtattacttggtgtatcgataatcatttaaaagtaaagactctgtcatgcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Solution
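I removed the use of $_. (I shuddered in particular when you localized it: you did that correctly, but why force yourself to worry about whether some other function is going to clobber $_, rather than using $rna_sq, which you already have?)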
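I also changed $start and $stop to be 0-based indexes into the string, which makes the rest of the arithmetic more straightforward, and I compute $genelen up front so that it can be used directly in the substr call. (Alternatively, you could localize $[ and set it to 1 to get 1-based indexing; see perldoc perlvar. Note that $[ is deprecated, and assigning a non-zero value to it is a fatal error in recent versions of Perl.)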
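To make the index arithmetic concrete, here is a minimal sketch on a made-up 13-base sequence. The sequence is an assumption chosen for illustration and is not part of the question's data; the variable names simply mirror the ones discussed above.

```perl
use strict;
use warnings;

# Toy sequence (hypothetical, for illustration only).
# Indexes: c0 c1 a2 t3 g4 a5 a6 a7 t8 a9 a10 g11 g12
my $seq = 'ccatgaaataagg';

my ($start, $stop, $genelen, $gene);
if ($seq =~ /atg/g) {
    # pos() is the 0-based offset just past the match, so step back
    # three characters to land on the 'a' of 'atg'.
    $start = pos($seq) - 3;

    # The next /g match on the same variable resumes from pos().
    if ($seq =~ /tga|taa|tag/g) {
        $stop    = pos($seq);        # offset just past the stop codon
        $genelen = $stop - $start;   # no +1/-1 corrections needed
        $gene    = substr($seq, $start, $genelen);
    }
}

# Report a 1-based start and an inclusive 1-based stop.
print join(' ', $start + 1, $stop, $gene, $genelen), "\n";
```

With 0-based offsets, the exclusive end offset returned by pos() is numerically the same as the inclusive 1-based stop position, so only the start needs a +1 when printing.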
Here is the revised script:

use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    print "processing $line\n";

    # each record is an ID followed by the sequence
    my ($id, $rna_sq) = split(/\s+/, $line);

    while ($rna_sq =~ /atg/g) {
        # $start and $stop are 0-based indexes
        my $start = pos($rna_sq) - 3;   # back up to include the start sequence

        # discard remnant if no stop sequence can be found
        last unless $rna_sq =~ /tga|taa|tag/g;

        my $stop    = pos($rna_sq);
        my $genelen = $stop - $start;

        my $gene = substr($rna_sq, $start, $genelen);
        print "\t" . join(' ', $id, $start + 1, $stop, $gene, $genelen) . "\n";
    }
}
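The `last unless` guard matters because of how pos() behaves. When a /g match in scalar context fails, Perl resets the search position for that string, so the next /atg/g starts again from the beginning. In the question's script that would explain why processing stalls on BC021011: no stop codon follows its first atg, so the outer loop keeps finding the same start codon. The sketch below shows the reset on a made-up sequence; the sequence and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Toy sequence with a start codon but no stop codon after it (hypothetical).
my $seq = 'ccatgccc';

# First /g match succeeds and leaves pos() just past 'atg'.
my $found_start     = ($seq =~ /atg/g) ? 1 : 0;
my $pos_after_start = pos($seq);

# The stop-codon match fails, which resets pos() to undef.
my $found_stop     = ($seq =~ /tga|taa|tag/g) ? 1 : 0;
my $pos_after_fail = pos($seq);

printf "start found: %d at pos %d, stop found: %d, pos after failure: %s\n",
    $found_start, $pos_after_start, $found_stop,
    defined $pos_after_fail ? $pos_after_fail : 'undef';

# With the guard in place, the loop exits as soon as no stop codon
# can be found, instead of rescanning from the start of the string.
my $iterations = 0;
while ($seq =~ /atg/g) {
    $iterations++;
    last unless $seq =~ /tga|taa|tag/g;
}
print "loop ran $iterations time(s)\n";
```

If rescanning from the start is never wanted, the /c modifier (as in /tga|taa|tag/gc) is another option, since it keeps pos() unchanged on a failed match; the outer loop then continues with the next atg rather than stopping.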