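I have a large text file with more than 100 million lines, named reads.fastq. I also have another file named takeThese.txt, which contains the line numbers (one per line) of the lines in reads.fastq that should be printed.
Currently I use
awk 'FNR == NR { h[$1]; next } (FNR in h)' takeThese.txt reads.fastq > subsample.fastq
As expected, that takes a very long time. Is there a faster way to extract lines from a text file using line numbers stored in another file? Would it speed things up if takeThese.txt were sorted?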
EDIT:
A few example lines from the files I have:
reads.fastq:
@HWI-1KL157:36:C2468ACXX
TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
+
@@@DDDDDFF>FFGGC@F?HDHIHIFIGG
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
@HWI-1KL157:36:C2468ACXX
TCATATTTTCTGATTTCTCCGTCACTCAA
takeThese.txt:
5
6
7
8
so that the output looks like this:
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
EDIT: Comparison of the suggested scripts:
$ time perl AndreasWederbrand.pl takeThese.txt reads.fastq > /dev/null
real    0m1.928s
user    0m0.819s
sys     0m1.100s

$ time ./karakfa takeThese_numbered.txt reads_numbered.fastq > /dev/null
real    0m8.334s
user    0m9.973s
sys     0m0.226s

$ time ./EdMorton takeThese.txt reads.fastq > /dev/null
real    0m0.695s
user    0m0.553s
sys     0m0.130s

$ time ./ABrothers takeThese.txt reads.fastq > /dev/null
real    0m1.870s
user    0m1.676s
sys     0m0.186s

$ time ./GlenJackman takeThese.txt reads.fastq > /dev/null
real    0m1.414s
user    0m1.277s
sys     0m0.147s

$ time ./DanielFischer takeThese.txt reads.fastq > /dev/null
real    0m1.893s
user    0m1.744s
sys     0m0.138s
Thanks for all the suggestions and efforts!
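The script in your question is already very fast, because all it does is a hash lookup of the current line number in the array h. The version below will be faster still, unless the last line of reads.fastq is among the lines you want, because it exits as soon as the last requested line has been printed instead of reading the rest of reads.fastq: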
awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' takeThese.txt reads.fastq
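You could add a delete h[FNR]; after the print; to shrink the array, which MIGHT speed up the lookups. I don't know whether that would really improve performance, though: array access is a hash lookup and therefore already very fast, so the extra delete may end up slowing the script down overall.
Actually, the following will be even faster, since it avoids testing NR==FNR for every line of both files: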
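For reference, the delete variant described above can be sketched on a tiny input as follows. This is only an illustration: the temporary directory, the file names reads.txt and take.txt, and the 20-line sample are assumptions made up for the demo, and it says nothing about whether the delete helps or hurts on a 100-million-line file.

```shell
# Minimal demo of the "delete after print" variant on made-up sample data.
# File names and sizes here are illustrative assumptions, not from the question.
tmp=$(mktemp -d)

# Stand-in for reads.fastq: line N simply contains the number N.
awk 'BEGIN { for (i = 1; i <= 20; i++) print i }' > "$tmp/reads.txt"
# Stand-in for takeThese.txt: the wanted line numbers, one per line.
printf '%s\n' 5 6 7 8 > "$tmp/take.txt"

result=$(awk '
    FNR == NR { h[$1]; c++; next }   # first file: remember the wanted line numbers
    FNR in h {
        print
        delete h[FNR]                # drop the entry once its line has been printed
        if (!--c) exit               # stop after the last wanted line
    }
' "$tmp/take.txt" "$tmp/reads.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Timing this against the version without the delete on the real files is the only reliable way to tell which one wins.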
awk -v nums='takeThese.txt' '
BEGIN { while ((getline i < nums) > 0) { h[i]; c++ } }
NR in h { print; if (!--c) exit }
' reads.fastq
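Whether that script or the one @glennjackman posted is faster depends on how many lines are listed in takeThese.txt and how close to the end of reads.fastq they fall. Glenn's script reads all of reads.fastq no matter what takeThese.txt contains, so it runs in roughly constant time, while mine gets significantly faster the further the last line number in takeThese.txt is from the end of reads.fastq. For example: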
$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq
.
$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt

$ time awk -v nums=takeThese.txt '
function next_index() {
    ("sort -n " nums) | getline i
    return i
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m28.720s
user    0m27.876s
sys     0m0.450s

$ time awk -v nums=takeThese.txt '
BEGIN { while ((getline i < nums) > 0) { h[i]; c++ } }
NR in h { print; if (!--c) exit }
' reads.fastq > /dev/null

real    0m50.060s
user    0m47.564s
sys     0m0.405s
.
$ awk 'BEGIN {for(i=1;i<=100;i++) print i*100}' > takeThat.txt

$ time awk -v nums=takeThat.txt '
function next_index() {
    ("sort -n " nums) | getline i
    return i
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m26.738s
user    0m23.556s
sys     0m0.310s

$ time awk -v nums=takeThat.txt '
BEGIN { while ((getline i < nums) > 0) { h[i]; c++ } }
NR in h { print; if (!--c) exit }
' reads.fastq > /dev/null

real    0m0.094s
user    0m0.015s
sys     0m0.000s
But you can get the best of both worlds with:
$ time awk -v nums=takeThese.txt '
function next_index() {
    if ( ( ("sort -n " nums) | getline i) > 0 ) {
        return i
    }
    else {
        exit
    }
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m28.057s
user    0m26.675s
sys     0m0.498s

$ time awk -v nums=takeThat.txt '
function next_index() {
    if ( ( ("sort -n " nums) | getline i) > 0 ) {
        return i
    }
    else {
        exit
    }
}
BEGIN { linenum = next_index() }
NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m0.094s
user    0m0.030s
sys     0m0.062s
which, if we assume takeThese.txt is already sorted, can be reduced to just:
$ time awk -v nums=takeThese.txt '
BEGIN { getline linenum < nums }
NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null

real    0m27.362s
user    0m25.599s
sys     0m0.280s

$ time awk -v nums=takeThat.txt '
BEGIN { getline linenum < nums }
NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null

real    0m0.047s
user    0m0.030s
sys     0m0.016s
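Since the reduced script depends on takeThese.txt being in ascending numeric order, it may be worth checking that before running it. The sketch below is one possible way to do so, using sort -c (check only) and sort -n -u (sort and drop duplicates). The file names take_unsorted.txt and take_sorted.txt and the sample numbers are assumptions for illustration. Duplicates matter here: if the same line number appears twice, the NR == linenum test never matches again after the first hit, so the remaining lines would be skipped without any error.

```shell
# Sketch: verify that a line-number list is sorted, and normalize it if not.
# File names and sample values are illustrative assumptions.
tmp=$(mktemp -d)
printf '%s\n' 8 5 7 5 6 > "$tmp/take_unsorted.txt"

# sort -c only checks the order; it exits non-zero if the input is not sorted.
if sort -n -c "$tmp/take_unsorted.txt" 2>/dev/null; then
    sorted_status=sorted
else
    sorted_status=unsorted
fi

# Sort numerically and remove duplicate line numbers in one pass.
sort -n -u "$tmp/take_unsorted.txt" > "$tmp/take_sorted.txt"
normalized=$(cat "$tmp/take_sorted.txt")

printf '%s\n%s\n' "$sorted_status" "$normalized"
rm -rf "$tmp"
```

The normalized file could then be passed as nums to the reduced script in place of the original list.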