bash – Subset a file by row and column numbers

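We want to subset a text file by rows and columns, where the row and column numbers are read from files. The header (row 1) and the rownames (col 1) are excluded from the numbering.

inputFile.txt: tab-delimited text file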

header  62  9   3   54  6   1
25  1   2   3   4   5   6
96  1   1   1   1   0   1
72  3   3   3   3   3   3
18  0   1   0   1   1   0
82  1   0   0   0   0   1
77  1   0   1   0   1   1
15  7   7   7   7   7   7
82  0   0   1   1   1   0
37  0   1   0   0   1   0
18  0   1   0   0   1   0
53  0   0   1   0   0   0
57  1   1   1   1   1   1

subsetCols.txt: comma-separated, no spaces, one line, numerically sorted. In the real data we have 500K columns and need to subset ~10K of them.

1,4,6

subsetRows.txt: comma-separated, no spaces, one line, numerically sorted. In the real data we have 20K rows and need to subset ~300 of them.

1,3,7

Current solution using cut and an awk loop (Related post: Select rows using awk):

# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt

# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s,a,","); for (i in a) b[a[i]]} NR in b' > $fileOutput

Output file: result.txt

1   4   6
3   3   3
7   7   7

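Question:
This solution works fine for small files, but for bigger files (50K rows and 200K columns) it takes too long: 15 minutes plus, and still running. I think cutting the columns works fine and selecting the rows is the slow part.

Is there a better way?

Real input file info: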

# $fileInput:
#        Rows = 20127
#        Cols = 533633
#        Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers

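More info about the file: it contains GWAS genotype data. Every row represents a sample (individual) and every column represents a SNP. For further region-based analysis we need to subset samples (rows) and SNPs (columns) to make the data more manageable (smaller) as input for other statistical software, such as R.

System: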

$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux

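Update: the solution provided below by @JamesBrown was mixing up the order of the columns on my system, as I am using a different version of awk. My version is: GNU Awk 3.1.7

Even though in If programming languages were countries, which country would each language represent? they say that…

Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.

…whenever you see yourself piping sed, cut, grep, awk, etc., stop and say to yourself: awk can make it alone!

So in this case it is a matter of extracting the rows and columns (adjusting them to exclude the header and the first column) and then buffering the output to finally print it.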

awk -v cols="1 4 6" -v rows="1 3 7" '
    BEGIN{
       split(cols,c); for (i in c) col[c[i]]  # extract cols to print
       split(rows,r); for (i in r) row[r[i]]  # extract rows to print
    }
    (NR-1 in row){
       line=""
       for (i=2;i<=NF;i++)
           if ((i-1) in col)                       # pick columns
               line=(line=="" ? $i : line OFS $i)  # test against "" so a leading 0 is kept
       print line                                  # print them
    }' file

With your sample file:

$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){line=""; for (i=2;i<=NF;i++) if ((i-1) in col) line=(line=="" ? $i : line OFS $i); print line}' file
1 4 6
3 3 3
7 7 7
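For very wide files, one possible variation is to walk the requested column list by index instead of testing every field, and to stop reading once the last requested row has been printed. The following is only a sketch under assumptions: `subset_ordered` is a hypothetical helper name, the input is taken to be tab-separated, and both lists are assumed to be comma-separated and numerically sorted, as described in the question. Because it loops over `c[1]..c[nc]` rather than using `for (i in c)`, the output order follows the list on any awk version.

```shell
#!/bin/sh
# subset_ordered FILE COLS ROWS   (hypothetical helper, not from the original post)
# COLS and ROWS are comma-separated, sorted, 1-based numbers that do not
# count the header row or the rowname column.
subset_ordered() {
    awk -F'\t' -v OFS='\t' -v cols="$2" -v rows="$3" '
        BEGIN {
            nc = split(cols, c, ",")          # c[1..nc] keeps the list order
            nr = split(rows, r, ",")
            for (k = 1; k <= nr; k++) want[r[k]]
            last = r[nr] + 1                  # input line of the last wanted row
        }
        (NR - 1) in want {
            line = $(c[1] + 1)                # +1 skips the rowname column
            for (k = 2; k <= nc; k++) line = line OFS $(c[k] + 1)
            print line
        }
        NR >= last { exit }                   # nothing left to print after this line
    ' "$1"
}
```

It could be called as `subset_ordered "$fileInput" "$(cat "$fileCols")" "$(cat "$fileRows")" > "$fileOutput"`. Whether it is actually faster on the 31 GB file has not been measured here; awk still has to read and split each selected line.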

With your sample file and the inputs read into variables, split on commas:

awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c,/,/); for (i in c) col[c[i]]; split(rows,r,/,/); for (i in r) row[r[i]]} (NR-1 in row){line=""; for (i=2;i<=NF;i++) if ((i-1) in col) line=(line=="" ? $i : line OFS $i); print line}' "$fileInput"

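I believe this will be faster. For example, you can check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk against grep and others.

Best, Kim Jong-un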
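To check that claim on your own machine, a small harness along the following lines might help. It is a sketch, not a benchmark result: `make_matrix`, `via_cut` and `via_awk` are hypothetical helper names, the generated values `(i * j) % 3` are made-up stand-ins for genotype codes, and the awk variant sets tab as input and output separator so that its output can be compared byte for byte with the cut pipeline.

```shell
#!/bin/sh
# make_matrix NROWS NCOLS: print a synthetic tab-separated matrix with a
# header row and a rowname column (hypothetical test data, values 0..2).
make_matrix() {
    awk -v nr="$1" -v nc="$2" 'BEGIN {
        OFS = "\t"
        line = "header"
        for (j = 1; j <= nc; j++) line = line OFS "snp" j
        print line
        for (i = 1; i <= nr; i++) {
            line = "id" i
            for (j = 1; j <= nc; j++) line = line OFS ((i * j) % 3)
            print line
        }
    }'
}

# via_cut FILE COLS ROWS: the cut/sed/awk pipeline from the question.
via_cut() {
    cut -f2- "$1" | cut -f"$2" | sed '1d' |
        awk -v s="$3" 'BEGIN { split(s, a, ","); for (i in a) b[a[i]] } NR in b'
}

# via_awk FILE COLS ROWS: the single-awk approach, with tab separators.
via_awk() {
    awk -F'\t' -v OFS='\t' -v cols="$2" -v rows="$3" '
        BEGIN {
            split(cols, c, ","); for (i in c) col[c[i]]
            split(rows, r, ","); for (i in r) row[r[i]]
        }
        (NR - 1) in row {
            line = ""
            for (i = 2; i <= NF; i++)
                if ((i - 1) in col) line = (line == "" ? $i : line OFS $i)
            print line
        }
    ' "$1"
}
```

A possible session: `make_matrix 2000 5000 > test.txt`, then `time via_cut test.txt 1,4,6 1,3,7 > a.txt` and `time via_awk test.txt 1,4,6 1,3,7 > b.txt`, followed by `cmp a.txt b.txt` to confirm that both produce the same output before comparing the timings.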
