19.7. xml.etree.ElementTree — The ElementTree XML API
New in version 2.5.
Source code: Lib/xml/etree/ElementTree.py
The Element type is a flexible container object, designed to store hierarchical data structures in memory. The type can be described as a cross between a list and a dictionary.
The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data, see XML vulnerabilities.
Each element has a number of properties associated with it:
- a tag, which is a string identifying what kind of data this element represents (the element type, in other words).
- a number of attributes, stored in a Python dictionary.
- a text string.
- an optional tail string.
- a number of child elements, stored in a Python sequence.
To create an element instance, use the Element constructor or the SubElement() factory function.
The ElementTree class can be used to wrap an element structure, and convert it from and to XML.
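The properties listed above can be seen on a small hand-built element. This is a minimal sketch; the tag names and text are made up for illustration, and print() is called with a single argument so the code should run unchanged on Python 2.7 and Python 3.

```python
import xml.etree.ElementTree as ET

# Build <p class="intro">Hello <b>world</b>!</p> by hand.
p = ET.Element('p', {'class': 'intro'})   # tag plus attribute dictionary
p.text = 'Hello '                         # text before the first child
b = ET.SubElement(p, 'b')                 # new child appended to p
b.text = 'world'
b.tail = '!'                              # text after </b>, still inside <p>

print(p.tag)       # p
print(p.attrib)    # {'class': 'intro'}
print(len(p))      # 1
print(p[0].tag)    # b

# Wrap the element in an ElementTree to read or write whole documents.
tree = ET.ElementTree(p)
print(tree.getroot() is p)    # True
```

Note that the tail belongs to the child element b, even though the text it holds appears inside the parent p.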
A C implementation of this API is available as xml.etree.cElementTree.
See http://effbot.org/zone/element-index.htm for tutorials and links to other docs. Fredrik Lundh’s page is also the location of the development version of xml.etree.ElementTree.
Changed in version 2.7: The ElementTree API is updated to 1.3. For more information, see Introducing ElementTree 1.3.
19.7.1. Tutorial
This is a short tutorial for using xml.etree.ElementTree (ET for short). The goal is to demonstrate some of the building blocks and basic concepts of the module.
19.7.1.1. XML tree and elements
XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level.
19.7.1.2. Parsing XML
We’ll be using the following XML document as the sample data for this section:
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
We have a number of ways to import the data. Reading the file from disk:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
Reading the data from a string:
root = ET.fromstring(country_data_as_string)
fromstring() parses XML from a string directly into an Element, which is the root element of the parsed tree. Other parsing functions may create an ElementTree instead; check the documentation to be sure.
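The difference in return types can be checked directly. The sketch below feeds the same bytes to both functions, using an in-memory io.BytesIO object in place of a file on disk; the one-country document is a shortened stand-in for the sample data.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b'<data><country name="Panama"/></data>'

# fromstring() returns the root Element directly.
root = ET.fromstring(xml_bytes)

# parse() accepts a filename or a file-like object and returns an
# ElementTree; call getroot() to reach the root Element.
tree = ET.parse(io.BytesIO(xml_bytes))

print(type(tree).__name__)    # ElementTree
print(root.tag)               # data
print(tree.getroot().tag)     # data
```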
As an Element, root has a tag and a dictionary of attributes:
>>> root.tag
'data'
>>> root.attrib
{}
It also has child nodes over which we can iterate:
>>> for child in root:
...     print child.tag, child.attrib
...
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
Children are nested, and we can access specific child nodes by index:
>>> root[0][1].text
'2008'
19.7.1.3. Finding interesting elements
Element has some useful methods that help iterate recursively over all the sub-tree below it (its children, their children, and so on). For example, Element.iter():
>>> for neighbor in root.iter('neighbor'):
...     print neighbor.attrib
...
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}
Element.findall() finds only elements with a given tag that are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:
>>> for country in root.findall('country'):
...     rank = country.find('rank').text
...     name = country.get('name')
...     print name, rank
...
Liechtenstein 1
Singapore 4
Panama 68
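The contrast between the recursive iter() and the direct-children-only find() and findall() is easiest to see when searching from the root for a tag that only occurs deeper in the tree. A minimal sketch, using a cut-down version of the sample data; the 'capital' attribute is hypothetical and deliberately absent:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data>'
    '<country name="Singapore"><rank>4</rank>'
    '<neighbor name="Malaysia"/></country>'
    '</data>')

# findall() and find() look at direct children only, so searching the
# root for <neighbor> finds nothing ...
print(root.findall('neighbor'))    # []
print(root.find('neighbor'))       # None

# ... while iter() walks the whole subtree.
names = [n.get('name') for n in root.iter('neighbor')]
print(names)                       # ['Malaysia']

# get() takes an optional default for a missing attribute.
country = root.find('country')
print(country.get('capital', 'unknown'))    # unknown
```

Because find() returns None when nothing matches, code that goes straight to .text on its result should be sure the child exists.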
More sophisticated specification of which elements to look for is possible by using XPath.
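As a rough illustration, the sketch below passes a few path expressions to findall(). It uses only child steps, the './/' descendant step, and the [@attrib='value'] predicate, on a shortened copy of the sample data.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data>'
    '<country name="Liechtenstein">'
    '<neighbor name="Austria" direction="E"/>'
    '<neighbor name="Switzerland" direction="W"/>'
    '</country>'
    '<country name="Singapore">'
    '<neighbor name="Malaysia" direction="N"/>'
    '</country>'
    '</data>')

def names(elements):
    """Return the 'name' attribute of each element, in order."""
    return [e.get('name') for e in elements]

# All <neighbor> grandchildren of the root, in document order.
all_neighbors = names(root.findall('./country/neighbor'))

# <neighbor> elements anywhere below the root with direction="W".
western = names(root.findall(".//neighbor[@direction='W']"))

# <neighbor> children of the <country> whose name is Singapore.
singapore = names(root.findall("./country[@name='Singapore']/neighbor"))

print(all_neighbors)    # ['Austria', 'Switzerland', 'Malaysia']
print(western)          # ['Switzerland']
print(singapore)        # ['Malaysia']
```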
19.7.1.4. Modifying an XML File
ElementTree provides a simple way to build XML documents and write them to files. The ElementTree.write() method serves this purpose.
Once created, an Element object may be manipulated by directly changing its fields (such as Element.text), adding and modifying attributes (Element.set() method), as well as adding new children (for example with Element.append()).
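Each of these three kinds of change can be tried on a single in-memory element without touching any file. A minimal sketch, with values chosen only for illustration:

```python
import xml.etree.ElementTree as ET

country = ET.fromstring('<country name="Panama"><rank>68</rank></country>')

# Change a field directly.
rank = country.find('rank')
rank.text = '69'

# Add (or overwrite) an attribute.
rank.set('updated', 'yes')

# Build a new child separately, then attach it at the end.
year = ET.Element('year')
year.text = '2011'
country.append(year)

print([child.tag for child in country])    # ['rank', 'year']
print(rank.attrib)                         # {'updated': 'yes'}
```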
Let’s say we want to add one to each country’s rank, and add an updated attribute to the rank element:
>>> for rank in root.iter('rank'):
...     new_rank = int(rank.text) + 1
...     rank.text = str(new_rank)
...     rank.set('updated', 'yes')
...
>>> tree.write('output.xml')
Our XML now looks like this:
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
We can remove elements using Element.remove(). Let’s say we want to remove all countries with a rank higher than 50:
>>> for country in root.findall('country'):
...     rank = int(country.find('rank').text)
...     if rank > 50:
...         root.remove(country)
...
>>> tree.write('output.xml')
Our XML now looks like this:
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>
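The removal loop above iterates over the list returned by findall() rather than over root itself. The sketch below, using invented country names and ranks, shows the same pattern in memory with two adjacent matches, a case where looping over root directly while removing could skip an element, just as with a Python list.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data>'
    '<country name="A"><rank>60</rank></country>'
    '<country name="B"><rank>70</rank></country>'
    '<country name="C"><rank>5</rank></country>'
    '</data>')

# findall() returns a separate list, so removing children from root
# inside the loop does not disturb the iteration.
for country in root.findall('country'):
    if int(country.find('rank').text) > 50:
        root.remove(country)

remaining = [c.get('name') for c in root]
print(remaining)    # ['C']
```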
19.7.1.5. Building XML documents
The SubElement() function also provides a convenient way to create new sub-elements for a given element:
>>> a = ET.Element('a')
>>> b = ET.SubElement(a, 'b')
>>> c = ET.SubElement(a, 'c')
>>> d = ET.SubElement(c, 'd')
>>> ET.dump(a)
<a><b /><c><d /></c></a>
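Attributes and text can be filled in while the tree is being built. The following is a small sketch with made-up values; it serializes the result with tostring() and parses it again rather than relying on the exact formatting of the output, which may differ between versions.

```python
import xml.etree.ElementTree as ET

data = ET.Element('data')
# Keyword arguments to SubElement() become attributes of the new element.
country = ET.SubElement(data, 'country', name='Panama')
rank = ET.SubElement(country, 'rank')
rank.text = '68'

# tostring() serializes an element together with its subtree.
xml_out = ET.tostring(data)
print(xml_out)

# Parsing the output again gives back an equivalent tree.
reparsed = ET.fromstring(xml_out)
print(reparsed.find('country').get('name'))    # Panama
print(reparsed.find('country/rank').text)      # 68
```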
19.7.1.6. Parsing XML with Namespaces
If the XML input has namespaces, tags and attributes with prefixes in the form prefix:sometag get expanded to {uri}sometag where the prefix is replaced by the full URI. Also, if there is a default namespace, that full URI gets prepended to all of the non-prefixed tags.
Here is an XML example that incorporates two namespaces, one with the prefix “fictional” and the other serving as the default namespace:
<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
        <fictional:character>Gunther</fictional:character>
        <fictional:character>Commander Clement</fictional:character>
    </actor>
</actors>
One way to search and explore this XML example is to manually add the URI to every tag or attribute in the XPath expression passed to find() or findall():
# xml_text holds the XML document shown above; ET is xml.etree.ElementTree
root = ET.fromstring(xml_text)
for actor in root.findall('{http://people.example.com}actor'):
    name = actor.find('{http://people.example.com}name')
    print name.text
    for char in actor.findall('{http://characters.example.com}character'):
        print ' |-->', char.text
A better way to search the namespaced XML example is to create a dictionary with your own prefixes and use those in the search functions:
ns = {'real_person': 'http://people.example.com',
      'role': 'http://characters.example.com'}
for actor in root.findall('real_person:actor', ns):
    name = actor.find('real_person:name', ns)
    print name.text
    for char in actor.findall('role:character', ns):
        print ' |-->', char.text
These two approaches both output:
John Cleese
|--> Lancelot
|--> Archie Leach
Eric Idle
|--> Sir Robin
|--> Gunther
|--> Commander Clement
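The tag expansion described at the start of this section can be inspected directly on a parsed tree. The sketch below uses a one-actor version of the example and the arbitrary prefixes 'p' and 'c', chosen here to show that the dictionary keys need not match the prefixes used in the document.

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<actors xmlns:fictional="http://characters.example.com"'
    ' xmlns="http://people.example.com">'
    '<actor><name>John Cleese</name>'
    '<fictional:character>Lancelot</fictional:character>'
    '</actor>'
    '</actors>')
root = ET.fromstring(xml_text)

# The default namespace URI is prepended to the unprefixed root tag.
print(root.tag)    # {http://people.example.com}actors

# The prefixes in this dictionary are local to the search.
ns = {'p': 'http://people.example.com',
      'c': 'http://characters.example.com'}
by_uri = root.findall('{http://people.example.com}actor')
by_prefix = root.findall('p:actor', ns)
print(by_uri == by_prefix)    # True

char = by_prefix[0].find('c:character', ns)
print(char.tag)     # {http://characters.example.com}character
print(char.text)    # Lancelot
```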
19.7.1.7. Additional resources
See http://effbot.org/zone/element-index.htm for tutorials and links to other docs.