Returning web page abstract with Solr

I've crawled a site with Nutch successfully and am trying to return a highlighted abstract using Solr as the indexer/searcher. So, if I query "ocean" then I want to return a 20-30 word abstract from just the text of the web page (not the title or url) containing that query term.

I've copied the Nutch schema.xml as my Solr schema.xml.

So I have two questions:

1. Is the "content" field in the Nutch schema.xml the field for body elements of a web page?
2. If this field is not stored, is there a way to have Solr retrieve that field at search time so that it can be highlighted?

solr
nutch

2012-04-04 08:00
by Ramsel

I haven't used Nutch in a long time, but I think it's pretty safe to assume that "content" is the field you want to highlight.
Highlighting only works on stored fields, so Solr cannot retrieve an unstored field at search time: you need to store the field and reindex. If you want to use the FastVectorHighlighter (FVH), you also need to enable the following attributes for that field: termVectors, termPositions and termOffsets.
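
As a rough sketch of the field settings described above, the "content" entry in schema.xml could look something like the fragment below. The type name "text" is an assumption based on what the copied Nutch schema may use; keep whatever type your schema already declares. After changing these attributes the pages have to be reindexed before highlighting can work.

```xml
<!-- schema.xml: sketch of a highlightable "content" field.
     type="text" is an assumption; keep the type your schema already uses. -->
<field name="content" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

The three term* attributes are only needed for the FastVectorHighlighter; the standard highlighter only needs stored="true".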

If you use the FastVectorHighlighter (FVH), you can also use a boundaryScanner in Solr 3.5 and up to control where snippets start and end.
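
One way to try this out is to set highlighting defaults on the search handler in solrconfig.xml, sketched below. The handler name "/select" and the fragment size are assumptions, and the sketch assumes a boundary scanner named "breakIterator" is already declared in the highlighting component, as it is in the example solrconfig.xml shipped with Solr 3.5 and later. Note that hl.fragsize is measured in characters, so 200 is only a guess at a 20-30 word abstract.

```xml
<!-- solrconfig.xml: sketch of highlighting defaults, not a definitive setup.
     Handler name and values are assumptions; adjust to your configuration. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <!-- highlight only the page body, not title or url -->
    <str name="hl.fl">content</str>
    <!-- requires termVectors, termPositions and termOffsets on the field -->
    <str name="hl.useFastVectorHighlighter">true</str>
    <!-- must match a boundaryScanner declared in the highlighting component -->
    <str name="hl.boundaryScanner">breakIterator</str>
    <str name="hl.bs.type">SENTENCE</str>
    <!-- fragment size in characters, roughly 20-30 words -->
    <str name="hl.fragsize">200</str>
    <str name="hl.snippets">1</str>
  </lst>
</requestHandler>
```

The same parameters can be passed on the query string instead, for example alongside q=ocean, which is handy for experimenting before fixing the defaults.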

2012-04-04 08:54
by Okke Klein