"Out of memory" error when parsing a large (100 MB) XML file
    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new();
    my $data = XML::Twig->new
                         ->parsefile("divisionhouserooms-v3.xml")
                         ->simplify( keyattr => [] );

    my @good_division_numbers = qw( 30 31 32 35 38 );

    foreach my $property ( @{ $data->{DivisionHouseRoom} } ) {
        my $house_code = $property->{HouseCode};
        print $house_code, "\n";

        my $amount_of_bedrooms = 0;
        foreach my $division ( @{ $property->{Divisions}->{Division} } ) {
            next unless grep { $_ eq $division->{DivisionNumber} } @good_division_numbers;
            $amount_of_bedrooms += $division->{DivisionQuantity};
        }

        open my $fh, ">>", "Result.csv" or die $!;
        print $fh join( "\t", $house_code, $amount_of_bedrooms ), "\n";
        close $fh;
    }
Solution
Handling large XML files that do not fit in memory is exactly what XML::Twig advertises:
    One of the strengths of XML::Twig is that it let you work with files
    that do not fit in memory (BTW storing an XML document in memory as a
    tree is quite memory-expensive, the expansion factor being often
    around 10). To do this you can define handlers, that will be called once a
    specific element has been completely parsed. In these handlers you can
    access the element and process it as you see fit (…)
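The code posted in the question does not take advantage of XML::Twig's strengths at all. Calling the simplify method is no better than using XML::Simple, because the whole document is still built in memory before anything is processed.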
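What the code lacks is 'twig_handlers' or 'twig_roots', which are what actually let the parser work through the relevant parts of the XML document in a memory-efficient way.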
Without seeing the XML, it is hard to say whether processing the document chunk-by-chunk or processing just selected parts is the better fit, but either one should solve the problem.
So the code should look something like the following (chunk-by-chunk demonstration):
    use strict;
    use warnings;
    use XML::Twig;
    use List::Util 'sum';     # To make life easier
    use Data::Dump 'dump';    # To see what's going on

    my %bedrooms;             # Data structure to store the wanted info

    my $xml = XML::Twig->new(
        twig_roots => { DivisionHouseRoom => \&count_bedrooms, },
    );
    $xml->parsefile('divisionhouserooms-v3.xml');

    sub count_bedrooms {
        my ( $twig, $element ) = @_;

        my @divParents = $element->children('Divisions');
        my $id         = $element->first_child_text('HouseCode');

        for my $divParent (@divParents) {
            my @divisions = $divParent->children('Division');
            my $total     = sum map { $_->text } @divisions;
            $bedrooms{$id} = $total;
        }

        $element->purge;    # Free up memory
    }

    dump \%bedrooms;
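The demonstration sums the text of every Division and does not filter by division number, so it may need adapting to match the loop in the question. Below is a minimal sketch of how that could look with 'twig_handlers' and a purge after each record. The XML layout is an assumption, since the question never shows the file: DivisionNumber and DivisionQuantity are taken to be child elements of Division, and the root element name and house codes in the sample are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data. The element names are inferred from the hash
# keys used in the question; the real file may be laid out differently.
my $sample = <<'XML';
<DivisionHouseRooms>
  <DivisionHouseRoom>
    <HouseCode>XX-1000-01</HouseCode>
    <Divisions>
      <Division><DivisionNumber>30</DivisionNumber><DivisionQuantity>2</DivisionQuantity></Division>
      <Division><DivisionNumber>99</DivisionNumber><DivisionQuantity>5</DivisionQuantity></Division>
      <Division><DivisionNumber>35</DivisionNumber><DivisionQuantity>1</DivisionQuantity></Division>
    </Divisions>
  </DivisionHouseRoom>
  <DivisionHouseRoom>
    <HouseCode>XX-1000-02</HouseCode>
    <Divisions>
      <Division><DivisionNumber>12</DivisionNumber><DivisionQuantity>4</DivisionQuantity></Division>
    </Divisions>
  </DivisionHouseRoom>
</DivisionHouseRooms>
XML

# Division numbers that count as bedrooms (list taken from the question).
my %is_bedroom = map { $_ => 1 } qw( 30 31 32 35 38 );
my %bedrooms;

# Called each time a complete DivisionHouseRoom element has been parsed.
sub count_bedrooms {
    my ( $twig, $house ) = @_;

    my $code  = $house->first_child_text('HouseCode');
    my $total = 0;

    for my $division ( $house->descendants('Division') ) {
        next unless $is_bedroom{ $division->first_child_text('DivisionNumber') };
        $total += $division->first_child_text('DivisionQuantity');
    }

    $bedrooms{$code} = $total;
    $twig->purge;    # Discard everything parsed so far to keep memory flat
}

my $twig = XML::Twig->new(
    twig_handlers => { DivisionHouseRoom => \&count_bedrooms },
);

# For the real file, replace this with:
#   $twig->parsefile('divisionhouserooms-v3.xml');
$twig->parse($sample);

# Tab-separated output, as in the question (written to STDOUT here).
print join( "\t", $_, $bedrooms{$_} ), "\n" for sort keys %bedrooms;
```

With the sample data this prints XX-1000-01 with a total of 3 and XX-1000-02 with a total of 0. If the division number and quantity turn out to be attributes rather than child elements, `$division->att('DivisionNumber')` would be the call to try instead of `first_child_text`.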