Get tags around text in HTML document using C#
Tags: c#, html, html-parsing, html-agility-pack

I would like to search an HTML file for a certain string and then extract the tags that enclose it. Given:

<div_outer><div_inner>Happy birthday<div><div>

I would like to search the HTML for "Happy birthday" then have a function return some sort of tag structure: this is the innermost tag, this is the tag outside that one, etc. So, <div_inner></div> then <div_outer></div>.

Any ideas? I am thinking of HtmlAgilityPack, but I haven't been able to figure out how to do it.

Thanks as always, guys.

2012-04-04 19:44
by Mark Williams

What is the source of this HTML? - Oded 2012-04-04 19:45

The HtmlAgilityPack (HAP) is indeed a good fit for this.

You can use the OuterHtml and ParentNode properties of a node to get the enclosing elements and their markup.
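
A minimal sketch of that idea, assuming the HtmlAgilityPack NuGet package is referenced. The class name EnclosingTags and the helper FindEnclosing are hypothetical names made up for illustration; the sketch finds the first text node containing the search string and then follows ParentNode upwards, collecting each enclosing element, innermost first:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class EnclosingTags
{
    // Hypothetical helper (not part of HAP): returns the elements that
    // enclose the first text node containing `needle`, innermost first.
    static List<HtmlNode> FindEnclosing(HtmlDocument doc, string needle)
    {
        var result = new List<HtmlNode>();

        // Scan every node in document order and pick the first matching text node.
        var textNode = doc.DocumentNode
            .Descendants()
            .FirstOrDefault(n => n.NodeType == HtmlNodeType.Text
                              && n.InnerText.Contains(needle));
        if (textNode == null)
            return result; // nothing found: empty list

        // Walk upwards until we leave the element tree (the document root
        // is not an element, so the loop stops there).
        for (var node = textNode.ParentNode;
             node != null && node.NodeType == HtmlNodeType.Element;
             node = node.ParentNode)
        {
            result.Add(node);
        }
        return result;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div_outer><div_inner>Happy birthday<div><div>");

        foreach (var node in FindEnclosing(doc, "Happy birthday"))
            Console.WriteLine(node.Name); // node.OuterHtml gives the full markup
    }
}
```

This visits each node at most once, so the cost is linear in the size of the document.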

2012-04-04 19:46
by Oded

So are you saying iterate through each tag until I find the text and then work my way backwards? Good idea but it doesn't sound too efficient. I guess sometimes the obvious answer wins, haha - Mark Williams 2012-04-04 20:27

@MarkWilliams - If you don't have any way to navigate to the text (say, a div with a specific attribute value), that's the only way to do it with a parser. You could get the index of the string and then go backwards and forwards in the string to find the enclosing elements, but that would mean writing your own parsing routines - Oded 2012-04-04 20:31

You could use XPath for this. The expression //*[text()='Happy birthday'][1]/ancestor-or-self::* finds the first node (for simplicity) whose text content is Happy birthday, then returns that node together with all of its ancestors (parent, grandparent, etc.):

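using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div_outer><div_inner>Happy birthday<div><div>");

// Select the matching element plus all of its ancestors, then reverse
// the result so the innermost element comes first.
var ancestors = doc.DocumentNode
    .SelectNodes("//*[text()='Happy birthday'][1]/ancestor-or-self::*")
    .Reverse()
    .ToList();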

It seems that the nodes are returned in the order in which they appear in the document (outermost first), so I used the Enumerable.Reverse method to put the innermost one first.

This will return 2 nodes: div_inner and div_outer.
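
One caveat worth guarding against: SelectNodes returns null rather than an empty collection when nothing matches, so chaining Reverse() directly onto it throws if the text is absent. A hedged sketch of a null-safe variant follows, assuming the HtmlAgilityPack NuGet package is referenced; the class name AncestorsByXPath is made up for illustration:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class AncestorsByXPath
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div_outer><div_inner>Happy birthday<div><div>");

        // Same XPath as above: the matching element and all of its ancestors.
        var matches = doc.DocumentNode
            .SelectNodes("//*[text()='Happy birthday'][1]/ancestor-or-self::*");

        // SelectNodes yields null (not an empty collection) on no match.
        if (matches == null)
        {
            Console.WriteLine("No element with that text was found.");
            return;
        }

        // Reverse so the innermost element is printed first.
        foreach (var node in matches.Reverse())
            Console.WriteLine(node.Name);
    }
}
```

Note that text()='Happy birthday' requires an exact match on a text node; if the surrounding whitespace may vary, contains(text(), 'Happy birthday') is a looser alternative.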

2012-04-04 21:52
by Alex