Ruby regex uses the last match to separate the string but should use the first - ruby-on-rails|ruby|regex|mechanize

I'm parsing the source of a website and I'm using this regex:

/page\.php\?id\=([0-9]*)\"\>(.*)\<\/a\>\<\/span\>/.match(self.agent.page.content)

self.agent.page.content contains the source of the page fetched by Mechanize. The regex basically works, but the second capture group fetches more than it should: there is more than one </a></span> in the source, and the regex matches up to the last one, so I get a bunch of HTML crap. How can I tell the regex to use the first occurrence as the "end marker"?

2012-04-05 17:49
by davidb

Mechanize already has loaded Nokogiri, which is an excellent HTML parser. You'd do much better using it instead of a regex - the Tin Man 2012-04-05 19:03

I know that, but in this case it's not practical because the code I'm taking "care" of is commented out, so I can't access it using Nokogiri. Of course I know I could create substrings and interpret them manually, but no... - davidb 2012-04-05 23:06

.* is greedy: it matches as much as possible, so the match runs up to the last </a></span>. .*? is non-greedy and stops at the first one. Try:

/page\.php\?id\=([0-9]*)\"\>(.*?)\<\/a\>\<\/span\>/.match(self.agent.page.content)
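
To see the difference in isolation, here is a minimal sketch. The html string is a made-up sample standing in for self.agent.page.content, and the variable names are only for illustration. Note that without the /m flag, . does not match newlines in Ruby, so this assumes each link sits on a single line.

```ruby
# Hypothetical sample input; the real page source would come from Mechanize.
html = '<span><a href="page.php?id=12">First</a></span>' \
       '<span><a href="page.php?id=34">Second</a></span>'

# Greedy: (.*) grabs as much as it can, then backtracks to the LAST </a></span>.
greedy = /page\.php\?id=([0-9]+)">(.*)<\/a><\/span>/
# Non-greedy: (.*?) expands only as far as needed, stopping at the FIRST </a></span>.
lazy   = /page\.php\?id=([0-9]+)">(.*?)<\/a><\/span>/

greedy_title = greedy.match(html)[2]
lazy_title   = lazy.match(html)[2]

puts greedy_title  # includes the markup between the two links
puts lazy_title    # just the first link text

# String#scan collects the capture groups of every match, not only the first.
all_links = html.scan(lazy)
p all_links
```

If every link on the page is needed rather than just the first, String#scan with the non-greedy pattern returns one [id, title] pair per match.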

2012-04-05 17:57
by niiru

Thanks, this works really well. - davidb 2012-04-05 18:01