World::Create(): March 2010

Using Python etree with unicode content

It took me a while to figure out how to do this correctly, so I'll do my duty and document it for the search spiders.

Aim:
Take a UTF-8 encoded XML file, read the contents into an ElementTree, then iterate through it and print out the element contents.

Example:

<?xml version="1.0" ?>
<text>
    <group>
        <line>English</line>
        <line>Français</line>
    </group>
</text>

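from xml.etree.ElementTree import ElementTree

tree = ElementTree()

# Open in binary mode so the XML parser does the UTF-8 decoding itself
f = open("filename.xml", "rb")
tree.parse(f)
f.close()

# ".//group" matches every <group> element below the root
for group in tree.findall(".//group"):
    for line in group.findall("line"):
        # line.text is a unicode string; encode it explicitly before printing
        print(line.text.encode('utf-8'))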

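The important part is the .encode('utf-8') call. Internally, ElementTree stores the text as decoded unicode strings rather than as bytes. If you print line.text directly, Python 2 will try to encode the string using the default encoding, which is typically ASCII (for example when the output is piped or redirected). This fails with a UnicodeEncodeError because the ç character isn't in the ASCII range.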

If you call line.text.encode('utf-8') it will encode into UTF-8 so everything will be fine and dandy.
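To make the difference concrete, here is a minimal sketch that runs on Python 3 and parses the example document from an in-memory byte string instead of a file. XML_BYTES and read_lines are illustrative names I made up for this sketch; they are not part of ElementTree. It shows that encoding to ASCII fails while encoding to UTF-8 works.

```python
import io
from xml.etree.ElementTree import ElementTree

# The example document from above, as UTF-8 bytes (stand-in for filename.xml)
XML_BYTES = (
    '<?xml version="1.0" ?>\n'
    '<text>\n'
    '    <group>\n'
    '        <line>English</line>\n'
    '        <line>Français</line>\n'
    '    </group>\n'
    '</text>\n'
).encode("utf-8")


def read_lines(source):
    """Parse a binary file-like object and return the text of every <line>."""
    tree = ElementTree()
    tree.parse(source)
    return [
        line.text
        for group in tree.findall(".//group")
        for line in group.findall("line")
    ]


lines = read_lines(io.BytesIO(XML_BYTES))

# Encoding to ASCII fails for the second line because of the ç character
try:
    lines[1].encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 always succeeds and gives bytes that are safe to write out
encoded = [text.encode("utf-8") for text in lines]
```

Note that on Python 3, print(line.text) usually works as-is on a UTF-8 terminal, and the explicit encode is mainly useful when you are writing bytes to a file or a pipe.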
