My aim is to parse 25 GB of XML data. A sample of the data is shown below:
<Document>
<Data Id='12' category='1' Body="abc"/>
<Data Id='13' category='1' Body="zwq"/>
...
<Data Id='82018030' category='2' CorrespondingCategory1Id='13' Body="pqr"/>
</Document>
However, considering that the data is 25 GB, my approach is quite inefficient. Please suggest some way to improve my code or an alternative approach. Also, kindly include a small example to make things clearer.
You might find that a SAX parser works better for this task. Rather than building a DOM, a SAX parser turns the XML file into a stream of elements and calls functions you provide to let you handle each element.
The good thing is that SAX parsers can be very fast and memory-efficient compared to DOM parsers, and some don't even need to be given all the XML at once, which would be ideal when you have 25 GB of it.
Unfortunately, if you need any context information, like "I want tag <B> but only if it's inside tag <A>," you must maintain it yourself, since all the parser gives you is "start tag <A>, start tag <B>, end tag <B>, end tag <A>." It never explicitly tells you that tag <B> is inside tag <A>; you have to figure that out from what you saw. And once you have seen an element, it's gone unless you remember it yourself.
This gets very hairy for complex parsing jobs, but yours is probably manageable.
It happens that Python's standard library has a SAX parser in xml.sax. You probably want something like xml.sax.xmlreader.IncrementalParser.
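Something like this sketch, for instance (untested; it assumes, as in your sample, that a category-1 element always appears before any category-2 element that references it, and the handler class, chunk size, and filename are all illustrative):

import xml.sax

class DataHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.cat_one_bodies = {}   # the context we have to maintain ourselves

    def startElement(self, name, attrs):
        if name != 'Data':
            return
        if attrs['category'] == '1':
            self.cat_one_bodies[attrs['Id']] = attrs['Body']
        else:
            cat1 = self.cat_one_bodies.get(attrs['CorrespondingCategory1Id'])
            print("Cat1 Body: '%s' Cat2 Body: '%s'" % (cat1, attrs['Body']))

parser = xml.sax.make_parser()     # in practice this is an IncrementalParser
parser.setContentHandler(DataHandler())
with open('myfile25GB', 'rb') as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b''):
        parser.feed(chunk)         # feed the 25 GB one megabyte at a time
parser.close()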
My first suggestion from looking at your question would be to use a relational database such as MySQL or sqlite. It wouldn't be hard to get your XML data into this form, and then querying on that data would be both more straightforward and faster.
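For instance, here is a rough sketch of the sqlite route using only the standard library; the database name and table layout are placeholders. It streams the XML with xml.etree.ElementTree.iterparse, so the 25 GB never has to fit in memory, and then answers the pairing question with a single JOIN:

import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect('data.db')
conn.execute("""CREATE TABLE IF NOT EXISTS data
                (id INTEGER PRIMARY KEY, category INTEGER,
                 cat1_id INTEGER, body TEXT)""")

for event, elem in ET.iterparse('myfile25GB'):
    if elem.tag == 'Data':
        conn.execute("INSERT INTO data VALUES (?, ?, ?, ?)",
                     (int(elem.get('Id')), int(elem.get('category')),
                      elem.get('CorrespondingCategory1Id'), elem.get('Body')))
        elem.clear()   # discard each element once stored, to bound memory
conn.commit()

for cat1_body, cat2_body in conn.execute(
        """SELECT a.body, b.body FROM data a
           JOIN data b ON b.cat1_id = a.id
           WHERE a.category = 1 AND b.category = 2"""):
    print("Cat1 Body: '%s' Cat2 Body: '%s'" % (cat1_body, cat2_body))

At this scale you would also want an index on cat1_id and batched inserts, but the shape of the solution stays the same.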
Your initial algorithm runs in O(n^2), which will be very slow for 25 GB of data. Ideally you would get it down to O(n) or O(n log n). In the absence of any other information about the data (such as whether category 1 or 2 is smaller), you can do something like this, which is O(n):
from lxml import objectify

# Caution: this reads the whole document into memory at once,
# which won't fly for a 25 GB file; it only illustrates the idea.
with open('myfile25GB', 'r') as f:
    text = f.read()
root = objectify.fromstring(text)   # root is the <Document> element

cat_one_bodies = {}
for e in root.Data:                 # iterate over the <Data> children
    category = int(e.attrib['category'])
    body = e.attrib['Body']
    if category == 1:
        e_id = int(e.attrib['Id'])
        cat_one_bodies[e_id] = body
    else:  # assuming there are only 2 categories
        cat_one_id = int(e.attrib['CorrespondingCategory1Id'])
        print("Cat1 Body: '%s' Cat2 Body: '%s'" % (cat_one_bodies[cat_one_id], body))
Although this reads the entire document into memory rather than truly handling your 25 GB file, hopefully it shows you the idea: a single pass that pairs elements as it goes. It also potentially uses quite a bit of memory, since it keeps all the category-1 bodies in the dictionary, so that may be a consideration.
In XSLT 3.0 (draft) as currently implemented in Saxon-EE, you can write a streaming transformation that solves this problem as follows:
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:map="http://www.w3.org/2005/xpath-functions/map">
  <xsl:mode streamable="yes"/>
  <xsl:template match="/">
    <xsl:iterate select="Document/Data">
      <xsl:param name="map" select="map{}"/>
      <xsl:choose>
        <xsl:when test="@category='1'">
          <xsl:next-iteration>
            <xsl:with-param name="map"
                select="map:put($map, string(@Id), string(@Body))"/>
          </xsl:next-iteration>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="'Cat1 Body:',
              $map(string(@CorrespondingCategory1Id)), 'Cat2 Body:', @Body"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:iterate>
  </xsl:template>
</xsl:stylesheet>
I haven't tested this (it's late at night on the eve of a four-day holiday...) but if you are interested in pursuing this approach I'll be happy to help. XSLT 3.0 is still a draft spec and fairly fluid. Its focus is on solving problems like this one using a streaming approach that handles very large documents using bounded memory. Saxon-EE 9.4 implements a snapshot of the spec.
If the Ids are in ascending order, then you can roll your own function that reads an element at any byte position in the file. Then you can scan the entire file and, for every category-2 element, find the corresponding category-1 element with a binary search. This will run in O(n log n) while still using a negligible amount of memory.
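A minimal sketch of that random-access lookup, assuming one <Data> element per line and ascending Ids (the helper names and regex are mine, purely illustrative). The trick is to open the file in binary mode so you can seek to an arbitrary byte offset, discard the partial line there, and read the next complete line; that turns the file into the sorted "array" a binary search needs:

import os
import re

ID_RE = re.compile(rb"\bId='(\d+)'")

def line_after(f, pos):
    # Return the first complete line starting strictly after byte `pos`.
    f.seek(pos)
    f.readline()                  # skip ahead to the next line boundary
    return f.readline()

def find_element(f, target_id, size):
    # Find the leftmost offset whose following line has Id >= target_id.
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        m = ID_RE.search(line_after(f, mid))
        if m and int(m.group(1)) < target_id:
            lo = mid + 1          # every line reachable from mid is too small
        else:
            hi = mid              # Id >= target, or EOF/footer line
    line = line_after(f, lo) if lo < size else b''
    m = ID_RE.search(line)
    return line if m and int(m.group(1)) == target_id else None

with open('myfile25GB', 'rb') as f:
    size = os.path.getsize('myfile25GB')
    print(find_element(f, 13, size))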
Try iterparse from lxml. I think it will suit the problem you wish to handle.
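Roughly along these lines; a sketch of a single streaming pass that keeps only the category-1 bodies in memory (the cleanup idiom at the bottom is the usual way to stop lxml from accumulating already-processed siblings under the root):

from lxml import etree

cat_one_bodies = {}
for event, elem in etree.iterparse('myfile25GB', tag='Data'):
    if elem.get('category') == '1':
        cat_one_bodies[elem.get('Id')] = elem.get('Body')
    else:
        cat1 = cat_one_bodies.get(elem.get('CorrespondingCategory1Id'))
        print("Cat1 Body: '%s' Cat2 Body: '%s'" % (cat1, elem.get('Body')))
    elem.clear()                           # free this element's memory
    while elem.getprevious() is not None:  # drop cleared siblings from the root
        del elem.getparent()[0]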