I am new to MarkLogic. I have a custom Java app that queries the Moreover.com XML data feed every 30 seconds. The app uses the XCC API (MarkLogic's Java API) to insert the retrieved data into MarkLogic, all in a single XML file. The data arrives at about 6 MB per minute, so if the application runs for a day or so the file will grow into the GBs. I am not aware of any admin configuration I need in order to store this much data in a single XML file in MarkLogic. Can somebody validate my approach, or tell me whether I have to change any configuration at the admin level? The structure of the XML is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<moreovercontentdump>
<article id="_6232903453">
<description></description>
<author></author>
<source_category>Local</source_category>
<genre>General</genre>
<publisher></publisher>
<media_type>text</media_type>
<docurl>http://www.ilrestodelcarlino.it</docurl>
<harvest_time>Apr 4 2012 4:28PM</harvest_time>
<valid_time>May 14 2012 4:27PM</valid_time>
</article>
<article id="_6232903453">
<description></description>
<author></author>
<source_category>Local</source_category>
<genre>General</genre>
<publisher></publisher>
<media_type>text</media_type>
<docurl>http://www.ilrestodelcarlino.it</docurl>
<harvest_time>Apr 4 2012 4:28PM</harvest_time>
<valid_time>May 14 2012 4:27PM</valid_time>
</article>
<article id="_6232903453">
<description></description>
<author></author>
<source_category>Local</source_category>
<genre>General</genre>
<publisher></publisher>
<media_type>text</media_type>
<docurl>http://www.ilrestodelcarlino.it</docurl>
<harvest_time>Apr 4 2012 4:28PM</harvest_time>
<valid_time>May 14 2012 4:27PM</valid_time>
</article>
</moreovercontentdump>
Looking at the sample XML, I think you will probably want to store each article in its own document. You could write a FLWOR expression that calls xdmp:document-insert, or calls xdmp:spawn if you would prefer to insert each document in an asynchronous task.
The simplest code might look like this:
for $article in xdmp:http-get($some-url, $options)/moreovercontentdump/article
let $uri := concat('moreover/', $article/@id)
return xdmp:document-insert($uri, $article)
You could enhance that code by rewriting some of the original XML. For example, you might want to reformat the harvest_time and valid_time elements in xs:dateTime format. That way you can create a range index on those values.
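Since your loader is already a Java app, one option is to do that conversion on the Java side before the insert. The following is only a sketch: the class name MoreoverDates and the method toXsDateTime are made up for illustration, and the input pattern is an assumption inferred from the sample values (for example "Apr 4 2012 4:28PM"). The sample carries no time zone, so the result is a zone-less xs:dateTime lexical value.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

// Hypothetical helper, not part of XCC or of the Moreover feed.
final class MoreoverDates {
    // Assumed feed pattern, inferred from "Apr 4 2012 4:28PM" in the sample XML.
    private static final DateTimeFormatter FEED_FORMAT =
        new DateTimeFormatterBuilder()
            .parseCaseInsensitive()              // accept "PM" as well as "pm"
            .appendPattern("MMM d yyyy h:mma")
            .toFormatter(Locale.ENGLISH);

    // Converts a feed timestamp to an xs:dateTime string without a time zone.
    static String toXsDateTime(String feedValue) {
        // Collapse runs of whitespace in case the feed pads single-digit days.
        String normalized = feedValue.trim().replaceAll("\\s+", " ");
        LocalDateTime parsed = LocalDateTime.parse(normalized, FEED_FORMAT);
        return DateTimeFormatter.ISO_LOCAL_DATE_TIME.format(parsed);
    }

    public static void main(String[] args) {
        System.out.println(toXsDateTime("Apr 4 2012 4:28PM"));
        System.out.println(toXsDateTime("May 14 2012 4:27PM"));
    }
}
```

You would replace the text of harvest_time and valid_time with the converted value before inserting each article. If the feed documents a time zone for these fields, it would be better to append it than to store zone-less values.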
In general, you'll be much better served if you store each response from Moreover.com in MarkLogic as its own document. In some ways, inside MarkLogic, documents are like rows in an RDBMS.
Also, if you insert one of these every 30 seconds, I'm having trouble seeing how that comes to 6 MB per minute of ingest. Are there some details you left out?