bash - Print many specific lines from a text file using an index file

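I have a large text file with more than 100 million lines, named reads.fastq. I also have another file, takeThese.txt, which contains the line numbers (one per line) of the lines in reads.fastq that should be printed.

Currently I use

awk 'FNR == NR { h[$1]; next } (FNR in h)' takeThese.txt reads.fastq > subsample.fastq

Apparently that takes really long. Is there a way to extract lines from a text file using line numbers stored in another file? Would it speed things up if takeThese.txt were sorted?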

Edit:

Here are a few example lines from the files:

reads.fastq:

@HWI-1KL157:36:C2468ACXX
TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
+
@@@DDDDDFF>FFGGC@F?HDHIHIFIGG
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
@HWI-1KL157:36:C2468ACXX
TCATATTTTCTGATTTCTCCGTCACTCAA

takeThese.txt:

5
6
7
8

so that the output looks like this:

@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD

Edit: comparison of the suggested scripts:

$ time perl AndreasWederbrand.pl takeThese.txt reads.fastq  > /dev/null

real    0m1.928s
user    0m0.819s
sys     0m1.100s

$ time ./karakfa  takeThese_numbered.txt reads_numbered.fastq  > /dev/null

real    0m8.334s
user    0m9.973s
sys     0m0.226s

$ time ./EdMorton takeThese.txt reads.fastq  > /dev/null

real    0m0.695s
user    0m0.553s
sys     0m0.130s

$ time ./ABrothers  takeThese.txt reads.fastq  > /dev/null

real    0m1.870s
user    0m1.676s
sys     0m0.186s

$ time ./GlenJackman takeThese.txt reads.fastq  > /dev/null

real    0m1.414s
user    0m1.277s
sys     0m0.147s

$ time ./DanielFischer takeThese.txt reads.fastq  > /dev/null

real    0m1.893s
user    0m1.744s
sys     0m0.138s

Thanks for all your suggestions and efforts!

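The script in your question is already very fast, since all it does is a hash lookup of the current line number in the array h. The version below will be faster still, unless the last line you want to print is also the last line of reads.fastq, because it exits after printing the last requested line instead of reading the rest of reads.fastq: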

awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' takeThese.txt reads.fastq

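You could add delete h[FNR] after the print to shrink the array, which MAY speed up lookups, but I don't know whether that would really improve performance: array access is a hash lookup and therefore already very fast, so adding a delete may end up slowing the script down overall.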
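
As a minimal sketch of that variant, the early-exit command with delete h[FNR] added after the print could look like the following. The file names reads_demo.txt, take_demo.txt, and out_delete.txt and their contents are hypothetical stand-ins for the real data, and whether the delete helps or hurts would have to be timed on the real files.

```shell
# Tiny stand-ins for the real files (hypothetical sample data).
printf '%s\n' a b c d e f g h > reads_demo.txt
printf '%s\n' 2 5 7 > take_demo.txt

# Same early-exit lookup as above, plus a delete after each hit
# so the array shrinks as lines are printed.
awk 'FNR==NR { h[$1]; c++; next }
     FNR in h { print; delete h[FNR]; if (!--c) exit }' take_demo.txt reads_demo.txt > out_delete.txt

cat out_delete.txt
```

The output is identical with or without the delete; only the running time can differ.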

Actually, this will be faster still, since it avoids testing NR==FNR for every line of both files:

awk -v nums='takeThese.txt' '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq
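
One thing to watch in this version: c counts the lines of takeThese.txt, not the distinct line numbers, so a repeated number would keep c from ever reaching zero and the early exit would not fire (the output would still be correct). Below is a minimal sketch of a guard, assuming duplicates are possible. The demo file names and contents are hypothetical, and the END rule is only there to show how far awk read.

```shell
# Tiny stand-ins for the real files (hypothetical sample data).
printf '%s\n' a b c d e f g h > reads_demo.txt
printf '%s\n' 2 2 5 > take_dups.txt

awk -v nums=take_dups.txt '
    # Count each requested line number only once.
    BEGIN { while ((getline i < nums) > 0) if (!(i in h)) { h[i]; c++ } }
    NR in h { print; if (!--c) exit }
    # exit still runs the END rule; NR holds the last record read.
    END { print NR > "last_nr.txt" }
' reads_demo.txt > out_dedup.txt

cat out_dedup.txt last_nr.txt
```

With the guard, awk stops after line 5 of the demo input instead of reading all eight lines.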

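Whether the getline version above or the script @glennjackman posted is faster depends on how many lines are in takeThese.txt and how close to the end of reads.fastq they occur. Since Glenn's script reads all of reads.fastq regardless of the contents of takeThese.txt, it runs in roughly constant time, while mine gets significantly faster the further from the end of reads.fastq the last line number in takeThese.txt falls. For example:

$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq

$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt

$ time awk -v nums=takeThese.txt '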
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m28.720s
user    0m27.876s
sys     0m0.450s

$ time awk -v nums=takeThese.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real    0m50.060s
user    0m47.564s
sys     0m0.405s

$ awk 'BEGIN {for(i=1;i<=100;i++) print i*100}' > takeThat.txt

$ time awk -v nums=takeThat.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m26.738s
user    0m23.556s
sys     0m0.310s

$ time awk -v nums=takeThat.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null
real    0m0.094s
user    0m0.015s
sys     0m0.000s

But you can have the best of both worlds:

$ time awk -v nums=takeThese.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m28.057s
user    0m26.675s
sys     0m0.498s


$ time awk -v nums=takeThat.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null
real    0m0.094s
user    0m0.030s
sys     0m0.062s

If we can assume takeThese.txt is already sorted, this simplifies to:

$ time awk -v nums=takeThese.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real    0m27.362s
user    0m25.599s
sys     0m0.280s

$ time awk -v nums=takeThat.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null
real    0m0.047s
user    0m0.030s
sys     0m0.016s
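
To sanity-check the sorted-input version on something small, the sketch below rebuilds the ten sample lines from the question and extracts lines 5 through 8. The file names sample.fastq, sample_take.txt, and sample_out.fastq are hypothetical, and the line numbers 5 to 8 are an assumption inferred from the expected output shown in the question.

```shell
# The ten sample lines from the question (hypothetical file name).
cat > sample.fastq <<'EOF'
@HWI-1KL157:36:C2468ACXX
TGTTCAGTTTCTTCGTTCTTTTTTTGGAC
+
@@@DDDDDFF>FFGGC@F?HDHIHIFIGG
@HWI-1KL157:36:C2468ACXX
CGAGGCGGTGACGGAGAGGGGGGAGACGC
+
BCCFFFFFHHHHHIGHHIHIJJDDBBDDD
@HWI-1KL157:36:C2468ACXX
TCATATTTTCTGATTTCTCCGTCACTCAA
EOF

# Assumed index file: already sorted in ascending order.
printf '%s\n' 5 6 7 8 > sample_take.txt

# Walk both files in step; stop once the index file is exhausted.
awk -v nums=sample_take.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' sample.fastq > sample_out.fastq

cat sample_out.fastq
```

The same check works for the hash-based versions, which do not depend on the order of the line numbers.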
