My aim is to parse 25 GB of XML data. A sample of the data is shown below:
<Document>
<Data Id='12' category='1' Body="abc"/>
<Data Id='13' category='1' Body="zwq"/>
...
<Data Id='82018030' category='2' CorrespondingCategory1Id='13' Body="pqr"/>
</Document>
However, considering that the data is 25 GB, my approach is quite inefficient. Please suggest some way to improve my code or an alternative approach. Also, kindly include a small example to make things clearer.
You might find that a SAX parser works better for this task. Rather than building a DOM, a SAX parser turns the XML file into a stream of elements and calls functions you provide to let you handle each element.
The good thing is that SAX parsers can be very fast and memory-efficient compared to DOM parsers, and some don't even need to be given all the XML at once, which would be ideal when you have 25 GB of it.
Unfortunately, if you need any context information, like "I want tag <B> but only if it's inside tag <A>," you must maintain it yourself, since all the parser gives you is "start tag <A>, start tag <B>, end tag <B>, end tag <A>." It never explicitly tells you that tag <B> is inside tag <A>; you have to figure that out from what you saw. And once you have seen an element, it's gone unless you remember it yourself.
This gets very hairy for complex parsing jobs, but yours is probably manageable.
It happens that Python's standard library has a SAX parser in xml.sax. You probably want something like xml.sax.xmlreader.IncrementalParser.
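Something like this sketch, for instance (untested; it assumes, as in your sample, that a category-1 element always appears before any category-2 element that references it, and the handler class, chunk size, and filename are all illustrative):

import xml.sax

class DataHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.cat_one_bodies = {}   # the context we have to maintain ourselves

    def startElement(self, name, attrs):
        if name != 'Data':
            return
        if attrs['category'] == '1':
            self.cat_one_bodies[attrs['Id']] = attrs['Body']
        else:
            cat1 = self.cat_one_bodies.get(attrs['CorrespondingCategory1Id'])
            print("Cat1 Body: '%s' Cat2 Body: '%s'" % (cat1, attrs['Body']))

parser = xml.sax.make_parser()     # in practice this is an IncrementalParser
parser.setContentHandler(DataHandler())
with open('myfile25GB', 'rb') as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b''):
        parser.feed(chunk)         # feed the 25 GB one megabyte at a time
parser.close()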
My first suggestion from looking at your question would be to use a relational database such as MySQL or sqlite. It wouldn't be hard to get your XML data into this form, and then querying on that data would be both more straightforward and faster.
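For instance, here is a rough sketch of the sqlite route using only the standard library; the database name and table layout are placeholders. It streams the XML with xml.etree.ElementTree.iterparse, so the 25 GB never has to fit in memory, and then answers the pairing question with a single JOIN:

import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect('data.db')
conn.execute("""CREATE TABLE IF NOT EXISTS data
                (id INTEGER PRIMARY KEY, category INTEGER,
                 cat1_id INTEGER, body TEXT)""")

for event, elem in ET.iterparse('myfile25GB'):
    if elem.tag == 'Data':
        conn.execute("INSERT INTO data VALUES (?, ?, ?, ?)",
                     (int(elem.get('Id')), int(elem.get('category')),
                      elem.get('CorrespondingCategory1Id'), elem.get('Body')))
        elem.clear()   # discard each element once stored, to bound memory
conn.commit()

for cat1_body, cat2_body in conn.execute(
        """SELECT a.body, b.body FROM data a
           JOIN data b ON b.cat1_id = a.id
           WHERE a.category = 1 AND b.category = 2"""):
    print("Cat1 Body: '%s' Cat2 Body: '%s'" % (cat1_body, cat2_body))

At this scale you would also want an index on cat1_id and batched inserts, but the shape of the solution stays the same.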
Your initial algorithm runs in O(n^2), which will be very slow for 25 GB of data. Ideally you would get it down to O(n) or O(n log n). In the absence of any other information about the data (such as whether category 1 or 2 is smaller), you can do something like this, which is O(n):
from lxml import objectify

# Caution: this reads the whole document into memory at once,
# which won't fly for a 25 GB file; it only illustrates the idea.
with open('myfile25GB', 'r') as f:
    text = f.read()
root = objectify.fromstring(text)   # root is the <Document> element

cat_one_bodies = {}
for e in root.Data:                 # iterate over the <Data> children
    category = int(e.attrib['category'])
    body = e.attrib['Body']
    if category == 1:
        e_id = int(e.attrib['Id'])
        cat_one_bodies[e_id] = body
    else:  # assuming there are only 2 categories
        cat_one_id = int(e.attrib['CorrespondingCategory1Id'])
        print("Cat1 Body: '%s' Cat2 Body: '%s'" % (cat_one_bodies[cat_one_id], body))
Although this reads the entire document into memory rather than truly handling your 25 GB file, hopefully it shows you the idea: a single pass that pairs elements as it goes. It also potentially uses quite a bit of memory, since it keeps all the category-1 bodies in the dictionary, so that may be a consideration.
In XSLT 3.0 (draft) as currently implemented in Saxon-EE, you can write a streaming transformation that solves this problem as follows:
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:map="http://www.w3.org/2005/xpath-functions/map">
  <xsl:mode streamable="yes"/>
  <xsl:template match="/">
    <xsl:iterate select="Document/Data">
      <xsl:param name="map" select="map{}"/>
      <xsl:choose>
        <xsl:when test="@category='1'">
          <xsl:next-iteration>
            <xsl:with-param name="map"
                select="map:put($map, string(@Id), string(@Body))"/>
          </xsl:next-iteration>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="'Cat1 Body:',
              $map(string(@CorrespondingCategory1Id)), 'Cat2 Body:', @Body"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:iterate>
  </xsl:template>
</xsl:stylesheet>
I haven't tested this (it's late at night on the eve of a four-day holiday...) but if you are interested in pursuing this approach I'll be happy to help. XSLT 3.0 is still a draft spec and fairly fluid. Its focus is on solving problems like this one using a streaming approach that handles very large documents using bounded memory. Saxon-EE 9.4 implements a snapshot of the spec.
If the Ids are in ascending order, then you can roll your own function that reads an element at any byte position in the file. Then you can scan the entire file and, for every category-2 element, find the corresponding category-1 element with a binary search. This will run in O(n log n) while still using a negligible amount of memory.
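A minimal sketch of that random-access lookup, assuming one <Data> element per line and ascending Ids (the helper names and regex are mine, purely illustrative). The trick is to open the file in binary mode so you can seek to an arbitrary byte offset, discard the partial line there, and read the next complete line; that turns the file into the sorted "array" a binary search needs:

import os
import re

ID_RE = re.compile(rb"\bId='(\d+)'")

def line_after(f, pos):
    # Return the first complete line starting strictly after byte `pos`.
    f.seek(pos)
    f.readline()                  # skip ahead to the next line boundary
    return f.readline()

def find_element(f, target_id, size):
    # Find the leftmost offset whose following line has Id >= target_id.
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        m = ID_RE.search(line_after(f, mid))
        if m and int(m.group(1)) < target_id:
            lo = mid + 1          # every line reachable from mid is too small
        else:
            hi = mid              # Id >= target, or EOF/footer line
    line = line_after(f, lo) if lo < size else b''
    m = ID_RE.search(line)
    return line if m and int(m.group(1)) == target_id else None

with open('myfile25GB', 'rb') as f:
    size = os.path.getsize('myfile25GB')
    print(find_element(f, 13, size))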
Try iterparse from lxml. I think it will suit the problem you wish to handle.
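Roughly along these lines; a sketch of a single streaming pass that keeps only the category-1 bodies in memory (the cleanup idiom at the bottom is the usual way to stop lxml from accumulating already-processed siblings under the root):

from lxml import etree

cat_one_bodies = {}
for event, elem in etree.iterparse('myfile25GB', tag='Data'):
    if elem.get('category') == '1':
        cat_one_bodies[elem.get('Id')] = elem.get('Body')
    else:
        cat1 = cat_one_bodies.get(elem.get('CorrespondingCategory1Id'))
        print("Cat1 Body: '%s' Cat2 Body: '%s'" % (cat1, elem.get('Body')))
    elem.clear()                           # free this element's memory
    while elem.getprevious() is not None:  # drop cleared siblings from the root
        del elem.getparent()[0]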