我试图只从NCBI xml BLAST文件中提取第一个命中.接下来我想获得第一个HSP.在最后阶段,我想根据最高分获得这些.
在这里清楚地说明xml文件的一个示例:
在这里清楚地说明xml文件的一个示例:
<?xml version="1.0"?> <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd"> <BlastOutput> <BlastOutput_program>blastx</BlastOutput_program> <BlastOutput_version>blastx 2.2.22 [Sep-27-2009]</BlastOutput_version> <BlastOutput_reference>~Reference: Altschul,Stephen F.,Thomas L. Madden,Alejandro A. Schaffer,~Jinghui Zhang,Zheng Zhang,Webb Miller,and David J. Lipman (1997),~"Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs",Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference> <BlastOutput_db>/Applications/blast/db/viral1.protein.faa</BlastOutput_db> <BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID> <BlastOutput_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</BlastOutput_query-def> <BlastOutput_query-len>1024</BlastOutput_query-len> <BlastOutput_param> <Parameters> <Parameters_matrix>BLOSUM62</Parameters_matrix> <Parameters_expect>1e-05</Parameters_expect> <Parameters_gap-open>11</Parameters_gap-open> <Parameters_gap-extend>1</Parameters_gap-extend> <Parameters_filter>F</Parameters_filter> </Parameters> </BlastOutput_param> <BlastOutput_iterations> <Iteration> <Iteration_iter-num>1</Iteration_iter-num> <Iteration_query-ID>lcl|1_0</Iteration_query-ID> <Iteration_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</Iteration_query-def> <Iteration_query-len>1024</Iteration_query-len> <Iteration_stat> <Statistics> <Statistics_db-num>68007</Statistics_db-num> <Statistics_db-len>19518578</Statistics_db-len> <Statistics_hsp-len>0</Statistics_hsp-len> <Statistics_eff-space>0</Statistics_eff-space> <Statistics_kappa>0.041</Statistics_kappa> <Statistics_lambda>0.267</Statistics_lambda> <Statistics_entropy>0.14</Statistics_entropy> </Statistics> </Iteration_stat> <Iteration_message>No hits found</Iteration_message> </Iteration> <Iteration> <Iteration> <Iteration_iter-num>6</Iteration_iter-num> <Iteration_query-ID>lcl|6_0</Iteration_query-ID> <Iteration_query-def>DSAD-090629_plate11A05a.g1 CHROMAT_FILE: DSAD-090629_plate11A05a.g1 PHD_FILE: DSAD-090629_plate11A05a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A05a DIRECTION: rev</Iteration_query-def> <Iteration_query-len>1068</Iteration_query-len> <Iteration_hits> <Hit> <Hit_num>1</Hit_num> <Hit_id>gnl|BL_ORD_ID|23609</Hit_id> <Hit_def>gi|38707884|ref|NP_945016.1| Putative ribose-phosphate pyrophosphokinase [Enterobacteria phage Felix 01]</Hit_def> <Hit_accession>23609</Hit_accession> <Hit_len>293</Hit_len> <Hit_hsps> <Hsp> <Hsp_num>1</Hsp_num> <Hsp_bit-score>49.2914</Hsp_bit-score> <Hsp_score>116</Hsp_score> <Hsp_evalue>5.15408e-06</Hsp_evalue> <Hsp_query-from>580</Hsp_query-from> <Hsp_query-to>792</Hsp_query-to> <Hsp_hit-from>202</Hsp_hit-from> <Hsp_hit-to>273</Hsp_hit-to> <Hsp_query-frame>-1</Hsp_query-frame> <Hsp_identity>26</Hsp_identity> <Hsp_positive>45</Hsp_positive> <Hsp_gaps>2</Hsp_gaps> <Hsp_align-len>73</Hsp_align-len> <Hsp_qseq>MHIIGDVE--GRTCILVDDMVDTAGTLCHAAKALKERGAAKVYAYCTHPVLSGRAIENIENSVLDELVVTNTI</Hsp_qseq> <Hsp_hseq>MRILDDVDLTDKTVMILDDICDGGRTFVEAAKHLREAGAKRVELYVTHGIFS-KDVENLLDNGIDHIYTTNSL</Hsp_hseq> <Hsp_midline>M I+ DV+ +T +++DD+ D T AAK L+E GA +V Y TH + S + +EN+ ++ +D + TN++</Hsp_midline> </Hsp> </Hit_hsps> </Hit> <Hit> <Hit_num>2</Hit_num> <Hit_id>gnl|BL_ORD_ID|2466</Hit_id> <Hit_def>gi|51557505|ref|YP_068339.1| large tegument protein [Suid herpesvirus 1]</Hit_def> <Hit_accession>2466</Hit_accession> <Hit_len>3084</Hit_len> <Hit_hsps> <Hsp> <Hsp_num>1</Hsp_num> <Hsp_bit-score>48.9062</Hsp_bit-score> <Hsp_score>115</Hsp_score> <Hsp_evalue>6.70494e-06</Hsp_evalue> <Hsp_query-from>369</Hsp_query-from> <Hsp_query-to>875</Hsp_query-to> <Hsp_hit-from>2312</Hsp_hit-from> <Hsp_hit-to>2465</Hsp_hit-to> <Hsp_query-frame>-2</Hsp_query-frame> <Hsp_identity>52</Hsp_identity> <Hsp_positive>70</Hsp_positive> <Hsp_gaps>4</Hsp_gaps> <Hsp_align-len>173</Hsp_align-len> <Hsp_qseq>APESQEPGASTWRSSTSVVKKGQPSQK*CTSSVTSKAVPASWSTTWSTLPAPCATPPKR*KSAAPPRSTPTAPTRCCPAAPSRTSRIPSWTSWWSPTPSRCPLRRSPARVFASSTSPR-SSPKRSAASATKNRSAP---CSAKRNWPDHTAPPRAGLFALPPEAGRKPQGGLV</Hsp_qseq> <Hsp_hseq>APPAQKPPAQPATAAATTAPKATPQTQPPTRAQTQTAPPPPSAAT-----AAAQVPPQ------PPSSQPAAKPRGAPPAPPAPP--PPSAQTTLPRPAAPPAPPPPS---AQTTLPRPAPPPPSAPAATPTPPAPGPAPSAKKSDGDRIVEPKAG---APPDVRDAKFGGKV</Hsp_hseq> <Hsp_midline>AP +Q+P A ++ + K P + T + T A P + T A PP+ PP S P A R P AP P P P+ P P+ A +T PR + P SA +AT AP SAK++ D P+AG PP+ GG V</Hsp_midline> </Hsp> </Hit_hsps> </Hit> </Iteration_hits> <Iteration_stat> <Statistics> <Statistics_db-num>68007</Statistics_db-num> <Statistics_db-len>19518578</Statistics_db-len> <Statistics_hsp-len>0</Statistics_hsp-len> <Statistics_eff-space>0</Statistics_eff-space> <Statistics_kappa>0.041</Statistics_kappa> <Statistics_lambda>0.267</Statistics_lambda> <Statistics_entropy>0.14</Statistics_entropy> </Statistics> </Iteration_stat> </Iteration>
基本上每个查询搜索都会创建一个迭代元素.每次迭代都可以有多次命中,而后者又可以有多个HSP.我想只获得第一个命中,它是每次迭代的第一个HSP.如果BLAST没有发现命中,我想忽略迭代.
我编写了这个简单的代码:
#!/usr/bin/env python from elementtree.ElementTree import parse from elementtree import ElementTree as ET file = open("/Applications/blast/blanes_viral_nr_results.xml","r") save_file = open("/Applications/blast/Blast_parse_ET.txt",'w') tree = parse(file) elem = tree.getroot() print elem Per_ID = () save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t\n\n\n\n' % ("It_Num\t","It_ID\t","Hit_Def\t","Num\t","ID\t","ACC\t")) iteration = tree.findall('BlastOutput_iterations/Iteration') for iteration in iteration: for hit in iteration.findall('Iteration_hits/Hit'): It_Num = iteration.findtext('Iteration_iter-num') It_ID = iteration.findtext('Iteration_query-def') Hit_Def = hit.findtext('Hit_def') Num = hit.findtext('Hit_num') ID = hit.findtext('Hit_id') DEF = hit.findtext('Hit_def') ACC = hit.findtext('Hit_accession') save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t' % (It_Num,It_ID[12:26],Hit_Def[1:10],Num,ID,ACC,)) for hsp in hit.findall('Hit_hsps'): HSPN = hsp.findtext('Hsp/Hsp_num') identities = hsp.findtext('Hsp/Hsp_identity') #print 'id: ',identities.rjust(4),length = hsp.findtext('Hsp/Hsp_align-len') #print 'len:',length.rjust(4),Per_ID = int(identities) * 100.0 / int(length) #print hsp.findtext('Hsp/Hsp_qseq')[:50] #print hsp.findtext('Hsp/Hsp_midline')[:50] #print hsp.findtext('Hsp/Hsp_hseq')[:50] save_file.write('%s\t%s\t%s\%st\n' % ('***','%',HSPN,Per_ID)) save_file.write('n\n' % ())
任何帮助都会非常受欢迎!
虽然构建自己的解析器可能很“有趣”,但已经有一个可以解析BLAST xml文件的包……如果你愿意,它甚至可以为你进行本地BLAST实例的中间调用.
主要网站在这里:
http://biopython.org/wiki/Biopython
XML BLAST解析器在这里:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82
就像是:
from Bio.Blast import NCBIXML with open('xml/results/file') as handle: all_records = NCBIXML.parse(handle) first_record = all_records.next()
应该管用.我一般都喜欢BioPython解析器和编写器,但我不喜欢类结构组织.所以我通常只使用解析器并将我需要的信息提取到我自己的结构中.
因人而异
希望有所帮助.