Web scraping with XML: turning a td node into a value

I want to scrape a Google search results page for the number of hits:

library(XML)    # htmlParse(), getNodeSet(), xmlValue()
library(RCurl)  # getURL()

input <- "projektgebiet"
url <- paste("https://www.google.at/search?q=",
             input,
             "&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:de:official&client=firefox-a",
             sep = "")

# Use the CA bundle shipped with RCurl to verify the SSL certificate
CAINFO <- paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
doc <- htmlParse(script)

# Text content of the sixth td node on the page
xmlValue(getNodeSet(doc, "//td")[[6]])

I'm close. The only problem is that I don't understand how to address the two values within the node separately. I actually just want the number (in the example above, the two values are concatenated).

I'd also like to know whether there is a way to avoid the [[6]] index, but I don't know if the node can be addressed by any other characteristic.

Any help or pointers would be greatly appreciated!

PS: Of course I could use a regex, but I don't think that is the most elegant way.

2012-04-04 20:31
by Kay


You can avoid the [[6]] index by noticing that one of the div elements has an id attribute and selecting on it. The following returns the contents of its two child nodes separately, without concatenating them.

xpathSApply(doc, "//div[@id='subform_ctrl']/*", xmlValue)
# [1] "Erweiterte Suche"            "Ungefähr 245.000 Ergebnisse"
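
To go from the second value to just the number, one option is to strip every non-digit character. This is only a minimal sketch: the HTML string below is a hypothetical stand-in for the real result page (whose markup may differ), the inner id "resultStats" is made up for illustration, and the "." is assumed to be a thousands separator.

```r
library(XML)

# Hypothetical stand-in for the downloaded page; the real markup may differ.
html <- '<html><body>
  <div id="subform_ctrl">
    <div><a href="#">Erweiterte Suche</a></div>
    <div id="resultStats">Ungef\u00e4hr 245.000 Ergebnisse</div>
  </div>
</body></html>'

doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Select the child elements of the div by its id, as above.
parts <- xpathSApply(doc, "//div[@id='subform_ctrl']/*", xmlValue)

# Keep the first value that contains a digit, then drop everything
# that is not a digit (this also removes the "." thousands separator).
stats <- parts[grepl("[0-9]", parts)][1]
hits <- as.numeric(gsub("[^0-9]", "", stats))
hits
```

Picking the value by "contains a digit" rather than by position avoids hard-coding another index, at the cost of assuming that only the hit count contains digits.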
2012-04-05 05:38
by Vincent Zoonekynd