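I need to write a parser in Python that can process some extremely large files (> 2 GB) on a computer without much memory (only 2 GB). I wanted to use iterparse in lxml to do it.
My file is of the format: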
<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>
My solution so far is:
from lxml import etree

context = etree.iterparse(MYFILE, tag='item')

for event, elem in context:
    # The child element is named <desc> in the file format shown above
    print(elem.xpath('desc/text()'))

del context
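Unfortunately, this solution still eats up a lot of memory. I think the problem is that after dealing with each "ITEM" I need to do something to clean up the empty children. Can anyone offer some suggestions on what I might do after processing my data to clean up properly?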
Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove its descendants and also removes its preceding siblings.
from lxml import etree


def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem):
    # The child element is named <desc> in the question's file format
    print(elem.xpath('desc/text()'))


context = etree.iterparse(MYFILE, tag='item')
fast_iter(context, process_element)
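Daly's article is an excellent read, especially if you are processing large XML files.
Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive at removing other elements that are no longer needed.
The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while mod_fast_iter does delete it, thus saving more memory.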
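One rough way to check that this cleanup really detaches processed elements is to count how many children are still attached to the root once parsing has finished. The sketch below is only an illustration under stated assumptions: the `<items>` wrapper element, the `build_sample` and `count_remaining` helpers, and the sample size of 1000 are all made up for the test, and the cleanup loop is copied inline from fast_iter rather than called through it.

```python
import io

from lxml import etree


def build_sample(n_items):
    """Build an in-memory document in the question's <item> format.

    The <items> wrapper is an assumption: XML needs a single root element.
    """
    items = ''.join(
        '<item><title>Item {0}</title><desc>Description {0}</desc></item>'.format(i)
        for i in range(1, n_items + 1))
    return ('<items>' + items + '</items>').encode('utf-8')


def count_remaining(data, cleanup):
    """Parse data; return (items seen, children still attached to the root)."""
    context = etree.iterparse(io.BytesIO(data), events=('end',), tag='item')
    seen = 0
    for event, elem in context:
        seen += 1
        if cleanup:
            # Same cleanup steps as in fast_iter
            elem.clear()
            for ancestor in elem.xpath('ancestor-or-self::*'):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
    # context.root is available once the whole input has been consumed
    return seen, len(context.root)


data = build_sample(1000)
seen_plain, left_plain = count_remaining(data, cleanup=False)
seen_clean, left_clean = count_remaining(data, cleanup=True)
print('without cleanup:', seen_plain, 'items seen,', left_plain, 'left in tree')
print('with cleanup:   ', seen_clean, 'items seen,', left_clean, 'left in tree')
```

Without cleanup every item should still hang off the root at the end, while with cleanup only the last (already cleared) item should remain. The count is taken after parsing on purpose: lxml reads the input in chunks, so the tree may briefly hold elements the loop has not reached yet.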
import io
import textwrap

import lxml.etree as ET


def setup_ABC():
    content = textwrap.dedent('''\
        <root>
          <A1>
            <B1></B1>
            <C>1<D1></D1></C>
            <E1></E1>
          </A1>
          <A2>
            <B2></B2>
            <C>2<D></D></C>
            <E2></E2>
          </A2>
        </root>
        ''')
    # io.BytesIO needs bytes, so encode the text before returning it
    return content.encode('utf-8')


def show(elem):
    # Serialize as text and drop the trailing whitespace (the tail)
    return ET.tostring(elem, encoding='unicode', with_tail=False)


def study_fast_iter():
    def orig_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            print('Processing {e}'.format(e=show(elem)))
            func(elem, *args, **kwargs)
            print('Clearing {e}'.format(e=show(elem)))
            elem.clear()
            while elem.getprevious() is not None:
                print('Deleting {p}'.format(
                    p=(elem.getparent()[0]).tag))
                del elem.getparent()[0]
        del context

    def mod_fast_iter(context, func, *args, **kwargs):
        """
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        Author: Liza Daly
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            print('Processing {e}'.format(e=show(elem)))
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            print('Clearing {e}'.format(e=show(elem)))
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                print('Checking ancestor: {a}'.format(a=ancestor.tag))
                while ancestor.getprevious() is not None:
                    print(
                        'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                    del ancestor.getparent()[0]
        del context

    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end',), tag='C')
    orig_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Deleting B2

    print('-' * 80)
    """
    The improved fast_iter deletes A1. The original fast_iter does not.
    """
    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end',), tag='C')
    mod_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Checking ancestor: root
    # Checking ancestor: A1
    # Checking ancestor: C
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Checking ancestor: root
    # Checking ancestor: A2
    # Deleting A1
    # Checking ancestor: C
    # Deleting B2


study_fast_iter()