I want to scrape Google Search for the number of hits:
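library(RCurl)  # provides getURL()
library(XML)    # provides htmlParse(), getNodeSet(), xmlValue()
input <- "projektgebiet"
url <- paste("https://www.google.at/search?q=",
             input,
             "&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:de:official&client=firefox-a",
             sep = "")
# CA bundle shipped with RCurl, used to verify the HTTPS certificate
CAINFO <- paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
doc <- htmlParse(script)
# the sixth <td> holds the hit count, concatenated with another value
xmlValue(getNodeSet(doc, "//td")[[6]])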
I'm close. The only problem is that I don't understand how to address the two values within the node separately. I actually just want the number (in the example above the two values are concatenated).
I'd also like to know whether there is a way to avoid the positional index [[6]], for example by addressing the node through some other characteristic.
Any help or pointers would be greatly appreciated!
PS: Of course I could use a regex, but I don't think that is the most elegant way.
You can avoid the [[6]] by noticing that one of the div elements has an id attribute.
The following returns the contents of the two child nodes separately, without concatenating them.
xpathSApply(doc, "//div[@id='subform_ctrl']/*", xmlValue)
# [1] "Erweiterte Suche" "Ungefähr 245.000 Ergebnisse"
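If only the number itself is needed, some string handling on the second value still seems unavoidable. The following is a minimal sketch, not part of the XML package: extract_hits is a hypothetical helper name, and it assumes that every non-digit character in the string (including the "." used as a thousands separator in the German output) can simply be dropped. It would give wrong results for strings containing decimals or more than one number.

```r
# Hypothetical helper: pull the hit count out of a result string
# such as "Ungefähr 245.000 Ergebnisse".
# Assumption: "." is only a thousands separator, so all non-digits can go.
extract_hits <- function(x) {
  digits <- gsub("[^0-9]", "", x)  # keep only the characters 0-9
  as.numeric(digits)               # "245000" -> 245000
}

# "\u00e4" is the letter "ä", written as an escape to avoid encoding issues
hits_text <- "Ungef\u00e4hr 245.000 Ergebnisse"
hits <- extract_hits(hits_text)
hits
# [1] 245000
```

With the parsed page from above, the same helper could be applied to the second element returned by xpathSApply, assuming the page layout matches the output shown.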