bash – Print many specific lines from a text file using an index file

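I have a large text file with more than 100 million lines, called reads.fastq. I also have a second file, takeThese.txt, which contains the line numbers (one per line) of the lines in reads.fastq that should be printed.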

Currently I use:

awk 'FNR == NR { h[$1]; next } (FNR in h)' takeThese.txt reads.fastq > subsample.fastq

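Obviously that takes a very long time. Is there a faster way to extract lines from a text file using line numbers stored in another file? Would it speed things up if takeThese.txt were sorted?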

EDIT:

Here are a few example lines from the files:

reads.fastq:

@HWI-1KL157:36:C2468ACXX
TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
+
@@@DDDDDFF>FFGGC@F?HDHIHIFIGG
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
@HWI-1KL157:36:C2468ACXX
TCATATTTTCTGATTTCTCCGTCACTCAA

takeThese.txt:

5
6
7
8

The output would then look like this:

@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD

EDIT: Comparison of the suggested scripts:

$ time perl AndreasWederbrand.pl takeThese.txt reads.fastq > /dev/null

real    0m1.928s
user    0m0.819s
sys     0m1.100s

$ time ./karakfa takeThese_numbered.txt reads_numbered.fastq > /dev/null

real    0m8.334s
user    0m9.973s
sys     0m0.226s

$ time ./EdMorton takeThese.txt reads.fastq > /dev/null

real    0m0.695s
user    0m0.553s
sys     0m0.130s

$ time ./ABrothers takeThese.txt reads.fastq > /dev/null

real    0m1.870s
user    0m1.676s
sys     0m0.186s

$ time ./GlenJackman takeThese.txt reads.fastq > /dev/null

real    0m1.414s
user    0m1.277s
sys     0m0.147s

$ time ./DanielFischer takeThese.txt reads.fastq > /dev/null

real    0m1.893s
user    0m1.744s
sys     0m0.138s

Thanks for all your suggestions and efforts!

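The script in your question is already very fast, since all it does is a hash lookup of the current line number in the array h. The following will be faster still, unless the last line of reads.fastq is among the lines you want to print, because it exits as soon as the last requested line has been printed instead of reading the rest of reads.fastq: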
awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' takeThese.txt reads.fastq
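To make the behaviour concrete, here is a small sketch that runs the early-exit command on the sample data from the question. The temporary directory and the variable names `tmpdir` and `subsample` are only illustrative assumptions, not part of the original setup.

```shell
# Scratch directory so no real files are touched (illustrative only).
tmpdir=$(mktemp -d)

# The ten sample lines of reads.fastq shown in the question.
cat > "$tmpdir/reads.fastq" <<'EOF'
@HWI-1KL157:36:C2468ACXX
TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
+
@@@DDDDDFF>FFGGC@F?HDHIHIFIGG
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
@HWI-1KL157:36:C2468ACXX
TCATATTTTCTGATTTCTCCGTCACTCAA
EOF

# Line numbers to extract, one per line.
printf '%s\n' 5 6 7 8 > "$tmpdir/takeThese.txt"

# First file: remember each wanted line number in h and count them in c.
# Second file: print wanted lines and exit once c reaches zero.
subsample=$(awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' \
    "$tmpdir/takeThese.txt" "$tmpdir/reads.fastq")

printf '%s\n' "$subsample"
```

With this input awk stops after line 8 and never reads lines 9 and 10. Note that if a line number appears twice in takeThese.txt, c is counted twice but matched only once, so the early exit never triggers; the output is still correct, awk just reads the whole file.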

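You could add delete h[FNR]; after the print to shrink the array, which MAYBE speeds up lookups, but I don't know whether that really improves performance: array access is a hash lookup and therefore already very fast, so adding a delete may end up slowing the whole script down.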
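For reference, the delete variant might look like the sketch below. The generated test files and the names `tmpdir` and `picked` are assumptions made for illustration; whether the delete helps or hurts would have to be measured on real data.

```shell
# Scratch directory with a small generated data file (illustrative only).
tmpdir=$(mktemp -d)
awk 'BEGIN {for (i=1; i<=1000; i++) print "line" i}' > "$tmpdir/reads.txt"
printf '%s\n' 10 20 30 > "$tmpdir/take.txt"

# Same early-exit script, but each index is removed from h once it has
# been printed, so the array shrinks as the run progresses.
picked=$(awk '
    FNR==NR { h[$1]; c++; next }
    FNR in h { print; delete h[FNR]; if (!--c) exit }
' "$tmpdir/take.txt" "$tmpdir/reads.txt")

printf '%s\n' "$picked"
```

Timing both versions with time on the real reads.fastq is the only reliable way to tell which one wins.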

Actually, this will be even faster, since it avoids testing NR==FNR for every line of both files:

awk -v nums='takeThese.txt' '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq

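Whether this or the script @glennjackman posted is faster depends on how many lines are in takeThese.txt and how close to the end of reads.fastq they occur. Since Glenn's reads the whole of reads.fastq regardless of the contents of takeThese.txt, it runs in roughly constant time, while mine gets significantly faster the further from the end of reads.fastq the last line number in takeThese.txt occurs. For example:

$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq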

.

$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt

$ time awk -v nums=takeThese.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m28.720s
user    0m27.876s
sys     0m0.450s

$ time awk -v nums=takeThese.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real    0m50.060s
user    0m47.564s
sys     0m0.405s

.

$ awk 'BEGIN {for(i=1;i<=100;i++) print i*100}' > takeThat.txt

$ time awk -v nums=takeThat.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m26.738s
user    0m23.556s
sys     0m0.310s

$ time awk -v nums=takeThat.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real    0m0.094s
user    0m0.015s
sys     0m0.000s

But you can get the best of both worlds with:

$ time awk -v nums=takeThese.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m28.057s
user    0m26.675s
sys     0m0.498s


$ time awk -v nums=takeThat.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m0.094s
user    0m0.030s
sys     0m0.062s

If we assume takeThese.txt is already sorted, this can be simplified to:

$ time awk -v nums=takeThese.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real    0m27.362s
user    0m25.599s
sys     0m0.280s

$ time awk -v nums=takeThat.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real    0m0.047s
user    0m0.030s
sys     0m0.016s
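The simplified version depends on the index file being strictly increasing: if a number is out of order or repeated, NR has already passed it, nothing further matches, and awk reads the rest of the data file without printing. A cautious wrapper could check this first. The sketch below is one way to do it; the function name `take_sorted` and the test files are hypothetical, and it relies on the POSIX behaviour of `sort -c` combined with `-u` (check for sorted order and for duplicate keys).

```shell
# Scratch data (illustrative only).
tmpdir=$(mktemp -d)
awk 'BEGIN {for (i=1; i<=1000; i++) print "line" i}' > "$tmpdir/reads.txt"
printf '%s\n' 10 20 30 > "$tmpdir/sorted.txt"
printf '%s\n' 30 10 20 > "$tmpdir/unsorted.txt"

# take_sorted INDEX_FILE DATA_FILE  (hypothetical helper)
# Refuses to run unless INDEX_FILE is numerically sorted without duplicates.
take_sorted() {
    if ! sort -n -u -c "$1" 2>/dev/null; then
        echo "index file is not strictly increasing: $1" >&2
        return 1
    fi
    awk -v nums="$1" '
        # Stop immediately if the index file is empty or unreadable.
        BEGIN { if ((getline linenum < nums) < 1) exit }
        # Print the wanted line, then fetch the next index or stop.
        NR == linenum { print; if ((getline linenum < nums) < 1) exit }
    ' "$2"
}

sorted_out=$(take_sorted "$tmpdir/sorted.txt" "$tmpdir/reads.txt")

unsorted_status=0
take_sorted "$tmpdir/unsorted.txt" "$tmpdir/reads.txt" 2>/dev/null || unsorted_status=$?

printf '%s\n' "$sorted_out"
```

The check costs one extra pass over the index file, which is usually small next to the data file. If the index is not sorted, running it through sort -n -u first is the obvious fallback.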
