Edit: added the solution.
Hi, I currently have some code that works but is very slow.
It merges two CSV files line by line using a primary key.
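For example, if file 1 has the line:
"one,two,,four,42"
and file 2 has this line:
"one,,three,,42"
where the 0-indexed $position = 4 holds the primary key (42),
then the sub merge_file($file1,$file2,$outputfile,$position); will output:
"one,two,three,four,42"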
Every primary key is unique within each file, and a key may exist in one file but not in the other (and vice versa).
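To make the row-level merge concrete, here is a minimal sketch for a single pair of rows that share a key. The helper name `merge_rows` is hypothetical, and the rule it applies (a non-empty field from file 1 overwrites the same column in file 2) is an assumption about the intended behaviour, not something the real code is guaranteed to do for every column.

```perl
use strict;
use warnings;

# Hypothetical helper: copy every non-empty field of the file 1 row
# into the same column of the file 2 row, and return the merged line.
sub merge_rows {
    my ($row1, $row2) = @_;
    # A limit of -1 keeps trailing empty fields, which plain split drops.
    my @a = split /,/, $row1, -1;
    my @b = split /,/, $row2, -1;
    # Pad the file 2 row so that no column is left undefined.
    push @b, '' while @b < @a;
    for my $i (0 .. $#a) {
        $b[$i] = $a[$i] if $a[$i] ne '';
    }
    return join ',', @b;
}

# Rows with empty fields, key 42 in column 4 (0-indexed).
print merge_rows('one,two,,four,42', 'one,,three,,42'), "\n";
# prints one,two,three,four,42
```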
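Each file has about 1 million lines.
Going through every line in the first file, I use a hash that stores the primary key as the key and the line number as the value. The line number is the index into an array that holds every line of the first file.
Then I go through every line in the second file and check whether its primary key is in the hash. If it is, I fetch the line from file1array, copy the columns I need from the first array into the second array, and append the result to the end of the output string. I then delete that hash entry, and at the very end I dump the whole thing to a file. (I am using an SSD, so I want to minimise file writes.)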
It is probably best explained with code:

    sub merge_file2 {
        my ($file1, $file2, $out, $position) = @_;
        print "merging: \n$file1 and \n$file2, to: \n$out\n";
        my $OUTSTRING = '';

        my %line_for;      # primary key => line number in file 1
        my @file1array;    # line number => raw line of file 1
        open my $fh1, '<', $file1 or die "cannot open $file1: $!";
        print "$file1 opened\n";
        while (<$fh1>) {
            chomp;
            # reads csv line at current position (of key)
            $line_for{ read_csv_string($_, $position) } = $.;
            $file1array[$.] = $_;    # store line in file1array
        }
        close $fh1;

        print "$file2 opened - merging..\n";
        open my $fh2, '<', $file2 or die "cannot open $file2: $!";
        # which columns from file 1 are to be added into the cols. of file 2
        my @from1to2 = qw( 2 4 8 17 18 19 );
        while (<$fh2>) {
            print "$.\n" if ($. % 1000) == 0;
            chomp;
            my @array2 = split /,/, $_;    # split 2nd csv line by commas
            my $key = $array2[$position];  # pos of key
            if (exists $line_for{$key}) {
                # lookup hash for the line number, then lookup line in 1st file
                my @array1 = split /,/, $file1array[ $line_for{$key} ];
                #my @output = &merge_string(\@array1,\@array2); #merge 2 csv strings (old fn.)
                foreach (@from1to2) {
                    $array2[$_] = $array1[$_];
                }
                delete $line_for{$key};
            }
            $OUTSTRING .= join(",", @array2) . "\n";
        }
        close $fh2;

        # keys still in the hash exist only in file 1
        print "adding rest of lines\n";
        foreach my $key (sort { $a <=> $b } keys %line_for) {
            $OUTSTRING .= $file1array[ $line_for{$key} ] . "\n";
        }

        print "writing file $out\n\n\n";
        write_line($out, $OUTSTRING);
    }
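The first while loop is fine and takes less than 1 minute, but the second while loop takes about 1 hour to run, and I am wondering whether I have taken the right approach. I think a big speedup should be possible? :) Thanks in advance.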
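For comparison, here is a rough sketch of the same hash-based idea with two changes: the hash stores the already-split row instead of a line number, and each merged line is printed straight to the output handle (Perl buffers output to a filehandle by default, so this does not mean one disk write per line). The helper name `hash_merge`, the sample rows and the "non-empty field from file 1 wins" rule are assumptions for illustration. Whether this removes the slowdown has not been measured here.

```perl
use strict;
use warnings;

# Hypothetical helper: merge two CSV streams on the key column $position.
# It takes open filehandles, so it works on files and in-memory strings.
sub hash_merge {
    my ($in1, $in2, $out, $position) = @_;

    # Index file 1 by key: key => array ref of fields.
    my %row_for;
    while (my $line = <$in1>) {
        chomp $line;
        my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
        $row_for{ $fields[$position] } = \@fields;
    }

    # Stream file 2 and merge in the matching file 1 row when there is one.
    while (my $line = <$in2>) {
        chomp $line;
        my @fields = split /,/, $line, -1;
        # delete returns the removed row, or undef if the key was absent
        if (my $match = delete $row_for{ $fields[$position] }) {
            push @fields, '' while @fields < @$match;
            for my $i (0 .. $#$match) {
                $fields[$i] = $match->[$i] if $match->[$i] ne '';
            }
        }
        print {$out} join(',', @fields), "\n";
    }

    # Whatever is left in the hash exists only in file 1.
    for my $key (sort { $a <=> $b } keys %row_for) {
        print {$out} join(',', @{ $row_for{$key} }), "\n";
    }
    return;
}

# Demo on made-up in-memory data; real use would open the two CSV files.
my $file1  = "one,two,,four,42\nonly1,,,,7\n";
my $file2  = "one,,three,,42\nonly2,,,,99\n";
my $merged = '';
open my $in1, '<', \$file1  or die "cannot open in-memory file 1: $!";
open my $in2, '<', \$file2  or die "cannot open in-memory file 2: $!";
open my $out, '>', \$merged or die "cannot open in-memory output: $!";
hash_merge($in1, $in2, $out, 4);
close $out;
print $merged;
```

Keeping a million split rows in memory costs more RAM than keeping the raw lines, so this is a trade-off rather than a guaranteed improvement.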
Solution:

    sub merge_file3 {
        my ($file1, $file2, $out, $position, $hsize) = @_;
        print "merging: \n$file1 and \n$file2, to: \n$out\n";
        my $header;

        my (@file1, @file2);
        open my $fh1, '<', $file1 or die "cannot open $file1: $!";
        while (<$fh1>) {
            if ($. == 1) {
                $header = $_;    # keep the header line (newline included)
                next;
            }
            print "$.\n" if ($. % 100000) == 0;
            chomp;
            push @file1, [ split /,/, $_, -1 ];    # -1 keeps trailing empty fields
        }
        close $fh1;

        open my $fh2, '<', $file2 or die "cannot open $file2: $!";
        while (<$fh2>) {
            next if $. == 1;    # skip the header of file 2
            print "$.\n" if ($. % 100000) == 0;
            chomp;
            push @file2, [ split /,/, $_, -1 ];
        }
        close $fh2;

        print "sorting files\n";
        my @sortedf1 = sort { $a->[$position] <=> $b->[$position] } @file1;
        my @sortedf2 = sort { $a->[$position] <=> $b->[$position] } @file2;
        print "sorted\n";
        @file1 = ();    # release the unsorted copies
        @file2 = ();

        # Walk both sorted lists in step. Rows found only in file 1 are pushed
        # onto the end of @sortedf2, so remember where the original rows stop.
        my $last2 = $#sortedf2;
        my ($i, $j) = (0, 0);
        while ($i <= $#sortedf1 and $j <= $last2) {
            my $key1 = $sortedf1[$i][$position];
            my $key2 = $sortedf2[$j][$position];
            if ($key1 == $key2) {
                foreach (0 .. $hsize) {    # header size
                    $sortedf2[$j][$_] = $sortedf1[$i][$_]
                        if defined $sortedf1[$i][$_] and $sortedf1[$i][$_] ne '';
                }
                $i++;
                $j++;
            }
            elsif ($key1 < $key2) {
                push @sortedf2, [ @{ $sortedf1[$i] } ];    # key only in file 1
                $i++;
            }
            else {
                $j++;    # key only in file 2: the row stays as it is
            }
        }
        # file 1 rows whose keys are larger than every key in file 2
        push @sortedf2, @sortedf1[ $i .. $#sortedf1 ];

        print "outputting to file\n";
        open my $ofh, '>', $out or die "cannot open $out: $!";
        print $ofh $header;
        foreach (@sortedf2) {
            print $ofh join(",", @{$_}), "\n";
        }
        close $ofh;
    }
Thanks everyone, the solution is posted above. Merging the whole thing now takes about 1 minute!
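The sort-then-merge idea can also be tried on a few in-memory rows. The sketch below is a simplified variant, not the posted sub: the name `sort_merge` and the sample rows are made up, it assumes numeric keys, and it collects the result in a separate list so the output comes out in key order (the posted version appends rows found only in file 1 at the end instead).

```perl
use strict;
use warnings;

# Hypothetical helper: sort-merge join of two lists of rows (array refs)
# on the numeric key in column $position. Returns the merged rows.
sub sort_merge {
    my ($rows1, $rows2, $position) = @_;
    my @s1 = sort { $a->[$position] <=> $b->[$position] } @$rows1;
    my @s2 = sort { $a->[$position] <=> $b->[$position] } @$rows2;

    my @merged;
    my ($i, $j) = (0, 0);
    while ($i < @s1 and $j < @s2) {
        my $cmp = $s1[$i][$position] <=> $s2[$j][$position];
        if ($cmp == 0) {
            # Same key in both files: start from the file 2 row and
            # overwrite with every non-empty field of the file 1 row.
            my @row = @{ $s2[$j] };
            push @row, '' while @row < @{ $s1[$i] };
            for my $c (0 .. $#{ $s1[$i] }) {
                $row[$c] = $s1[$i][$c] if $s1[$i][$c] ne '';
            }
            push @merged, \@row;
            $i++;
            $j++;
        }
        elsif ($cmp < 0) { push @merged, $s1[$i++]; }    # key only in file 1
        else             { push @merged, $s2[$j++]; }    # key only in file 2
    }
    # One list is exhausted; the rest of the other has no partner rows.
    push @merged, @s1[ $i .. $#s1 ], @s2[ $j .. $#s2 ];
    return @merged;
}

# Made-up rows with the key in column 3 (0-indexed).
my @rows1 = ( [ 'one', 'two', '', 42 ],  [ 'p', '', '', 7 ] );
my @rows2 = ( [ 'one', '', 'three', 42 ], [ 'q', '', '', 99 ] );
print join(',', @$_), "\n" for sort_merge(\@rows1, \@rows2, 3);
# prints:
# p,,,7
# one,two,three,42
# q,,,99
```

After the two sorts, each row is visited once, so the merge pass itself is linear in the total number of rows.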