Nokogiri XPath to retrieve text after <BR> within <TD> and <SPAN> - 【StackMirror】|html|ruby|xpath|nokogiri

I have the following HTML and would like to know how to use XPath to retrieve all the info: name (first, last), nickname, email, shipping address...

Primarily, I want to retrieve the text after each <br/>. Many thanks in advance.

<table>
<tr>
<td valign="top" width="50%" align="left">
<span>Buyer</span><br/>FirstName LastName<br/>NickName<br/>First.Last@SomeCompany.com</td>

<tr><td valign="top" width="40%" align="left">
<span><span>Shipping address - </span><span>confirmed</span></span><br/>FirstName LastName<br/>Attn: FirstName<br/>1234 Main St.<br/>TheCity, TheState, 12345<br/>United States<br/></td>
</tr></table>

After posting the question above, I learned that I can do the following, but it does not look clean:

buyer = html.xpath("//span/text()[contains(., 'Buyer')]").first.parent 
buyer_name = buyer.next.next 
puts "Buyer's Full name: #{buyer_name.text}" 
buyer_nick = buyer_name.next.next 
puts "Buyer's Nick name: #{buyer_nick.text}" 
buyer_email = buyer_nick.next.next 
puts "Buyer's email: #{buyer_email.text}"

My question now is why html.xpath("//span/text()[contains(., 'Buyer')]") returns the text node itself instead of the element. Again, thanks!!

2012-04-04 21:45
by TX T

Here's a concise way:

name, nick, email, *addr = doc.search('//td/text()[preceding-sibling::br]')

puts name, nick, email, "--", addr

The XPath does exactly what you stated: it takes all text nodes following a br. The address is slurped into one variable, but you can get the components separately if you want.

Output:

FirstName LastName
NickName
First.Last@SomeCompany.com
--
FirstName LastName
Attn: FirstName
1234 Main St.
TheCity, TheState, 12345
United States

2012-04-05 00:06
by Mark Thomas

<br> tags are a bit of a unique problem when dealing with HTML. They don't really get used for anything but formatting the content in the page, i.e., breaking lines like a new-line would in a *nix text file. So, my tactic when dealing with them while extracting text is to transform them into new-lines.

Parse the content into a Nokogiri::HTML document:

doc = Nokogiri::HTML(html_doc_to_parse)

Convert the <br> tags to new-lines:

doc.search('br').each { |br| br.replace("\n") }

Then, find the cells you want:

doc.search('//td').map{ |td| td.content }

which will return something like:

doc.search('//td').map(&:content)
=> ["\n  Buyer\nFirstName LastName\nNickName\nFirst.Last@SomeCompany.com",
 "\n  Shipping address - confirmed\nFirstName LastName\nAttn: FirstName\n1234 Main St.\nTheCity, TheState, 12345\nUnited States\n"]

which looks like this when printed:

puts doc.search('//td').map(&:content)

  Buyer
FirstName LastName
NickName
First.Last@SomeCompany.com

  Shipping address - confirmed
FirstName LastName
Attn: FirstName
1234 Main St.
TheCity, TheState, 12345
United States

From there it's a case of determining the correct array elements that you want, and then splitting on the new-lines, i.e., String#split("\n").

2012-04-04 22:30
by the Tin Man

Thank you very much, "the Tin Man", for your quick reply and help. Please see my answer to my own question below - TX T 2012-04-04 22:45

This doesn't help much. It'd be better to append it to your original question, format the code, and delete this comment - the Tin Man 2012-04-04 23:53

Good, but consider br.after instead of br.replace. Also, didn't you once say it's unsafe to modify nodes in production code? - pguardiario 2012-04-05 02:47

If you're grabbing the <br> for immediate processing it's fine because it's not changing the structure. Like I said in the answer, <br> is a problem-child. I wouldn't do it with other tags, except maybe  on rare occasion - the Tin Man 2012-04-05 03:28

See http://stackoverflow.com/q/43594656/128421 also - the Tin Man 2017-04-24 18:04
