Optimal way of storing XML data in MarkLogic



I am new to the MarkLogic world. My program uses a custom Java app to query Moreover.com and fetch its XML data feed every 30 seconds; results are returned in XML format. The Java app uses the XCC API (MarkLogic's Java API) to insert the retrieved data into MarkLogic as a single XML file. The data amounts to about 6 MB per minute, so if the application runs for a day or more the file will grow into the GBs. I am not aware of any admin configuration needed to store this much data in a single XML file in MarkLogic. Can somebody validate my approach, or suggest whether I have to make any configuration changes at the Admin level? The structure of the XML is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<moreovercontentdump>
  <article id="_6232903453">
    <description></description>
    <author></author>
    <source_category>Local</source_category>
    <genre>General</genre>
    <publisher></publisher>
    <media_type>text</media_type>
    <docurl>http://www.ilrestodelcarlino.it</docurl>
    <harvest_time>Apr  4 2012  4:28PM</harvest_time>
    <valid_time>May 14 2012  4:27PM</valid_time>
  </article>
  <article id="_6232903453">
    <description></description>
    <author></author>
    <source_category>Local</source_category>
    <genre>General</genre>
    <publisher></publisher>
    <media_type>text</media_type>
    <docurl>http://www.ilrestodelcarlino.it</docurl>
    <harvest_time>Apr  4 2012  4:28PM</harvest_time>
    <valid_time>May 14 2012  4:27PM</valid_time>
  </article>
  <article id="_6232903453">
    <description></description>
    <author></author>
    <source_category>Local</source_category>
    <genre>General</genre>
    <publisher></publisher>
    <media_type>text</media_type>
    <docurl>http://www.ilrestodelcarlino.it</docurl>
    <harvest_time>Apr  4 2012  4:28PM</harvest_time>
    <valid_time>May 14 2012  4:27PM</valid_time>
  </article>
</moreovercontentdump>
2012-04-04 20:27
by Pankaj



Looking at the sample XML, I think you will probably want to store each article in its own document. You could write a FLWOR expression to call xdmp:document-insert, or call xdmp:spawn if you would prefer to insert each document in an asynchronous task.

The simplest code might look like this:

(: xdmp:http-get returns a two-item sequence: response metadata, then the body :)
let $response := xdmp:http-get($some-url, $options)[2]
for $article in $response/moreovercontentdump/article
let $uri := concat('moreover/', $article/@id)
return xdmp:document-insert($uri, $article)
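The asynchronous variant mentioned above might be sketched like this. The module path /tasks/insert-article.xqy is a placeholder: you would write that module yourself, have it declare an external variable named article, and deploy it to your modules database.

```xquery
(: sketch: spawn one task-server task per article instead of inserting
   inline; /tasks/insert-article.xqy is a hypothetical module that reads
   the $article external variable and calls xdmp:document-insert :)
let $feed := xdmp:http-get($some-url, $options)[2]
for $article in $feed/moreovercontentdump/article
return xdmp:spawn(
  "/tasks/insert-article.xqy",
  (xs:QName("article"), document { $article })
)
```

The trade-off is that xdmp:spawn returns immediately, so the fetching request stays fast, but failed inserts surface in the task server's error log rather than in the calling request.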

You could enhance that code by rewriting some of the original XML. For example, you might want to reformat the harvest_time and valid_time elements as xs:dateTime values. That way you can create a range index on them.
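A minimal sketch of that conversion, assuming xdmp:parse-dateTime accepts a picture string matching Moreover's timestamp layout (the exact picture below is an assumption and may need tuning against your real data):

```xquery
(: sketch: convert Moreover's "Apr  4 2012  4:28PM" style timestamp into
   an xs:dateTime element before insert; collapse the double spaces first,
   since the picture assumes single-space separators :)
let $raw := normalize-space($article/harvest_time)
let $parsed := xdmp:parse-dateTime("[MNn,*-3] [D] [Y0001] [h]:[m01][P]", $raw)
return element harvest_time { $parsed }
```

With harvest_time stored as an xs:dateTime, an element range index of scalar type dateTime makes date-bounded queries and sorting efficient.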

2012-04-05 02:39
by mblakele



In general, you'll be much better served if you store each response from Moreover.com in MarkLogic as its own document. In some ways, inside MarkLogic, documents are like rows in an RDBMS.

Also, if you insert one of these every 30 seconds, I'm having trouble seeing how that comes to 6MB per minute of ingest. Are there some details you left out?

2012-04-05 01:01
by Eric Bloch
  • 1. Every hit to moreover.com fetches thousands of articles (i.e. around 6 MB per minute); I have not included all the tag details in the sample XML above. 2. If the result of every hit is saved as a new XML file in MarkLogic, there will be 2880 files in the MarkLogic DB every day, and the number will grow day by day. Is that a problem?
  • - Pankaj 2012-04-05 06:22
    If you split the documents into articles you may be able to do some de-duplication on articles that have not changed, reducing the overall ingest per day. "Is there a problem?" is a hard question to answer. Databases can clearly handle 2,880 inserts per day as long as you have the hardware capacity to handle this throughput - derickson 2012-04-05 12:56
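The de-duplication idea above might be sketched like this, inside the per-article insert loop (the URI scheme and element names follow the sample XML; the comparison strategy is one option among several):

```xquery
(: sketch: skip the insert when an identical article is already stored
   under the same URI; note the sample feed reuses article ids across
   hits, which is exactly what makes this check worthwhile :)
let $uri := concat("moreover/", $article/@id)
return
  if (doc-available($uri) and deep-equal(doc($uri)/article, $article))
  then ()  (: unchanged article: no re-insert, no new fragment :)
  else xdmp:document-insert($uri, $article)
```

Because xdmp:document-insert overwrites an existing document at the same URI, this also means a changed article with a reused id updates in place rather than duplicating.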