Extracting a specific string out of an HTML document

I need to sample and extract only a specific string out of an offline HTML document and write that information nice and clean into a *.txt file.

So, for example, let's assume that this is a section of the HTML file:

    <span id="dataView01">001.00 SPL</span>
    <span id="dataView02">543.00 SPL</span>
    <span id="dataView03">056.00 SPL</span>
    <span id="dataView04">228.00 SPL</span>

I need to get this as a result:

    001.00 SPL
    543.00 SPL
    056.00 SPL
    228.00 SPL

Could you please help me with this? Thanks.

2012-04-04 22:08
by mbilyanov

Use an HTML parser like BeautifulSoup.
Example:

    import re

    from bs4 import BeautifulSoup as bs

    markup = '''<span id="dataView01">001.00 SPL</span>
        <span id="dataView02">543.00 SPL</span>
        <span id="dataView03">056.00 SPL</span>
        <span id="dataView04">228.00 SPL</span>'''

    # Name the parser explicitly so bs4 does not have to guess one.
    soup = bs(markup, 'html.parser')
    # Match ids made of the literal prefix "dataView" followed by digits.
    tags = soup.find_all('span', id=re.compile(r'^dataView\d+$'))
    for t in tags:
        print(t.text)

Result:

    001.00 SPL
    543.00 SPL
    056.00 SPL
    228.00 SPL
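
A side note on the id pattern: in a regex, square brackets form a character class, so `[dataView]\d+` means "any one of the letters d, a, t, V, i, e, w, followed by digits", not the literal word. A quick check with the standard `re` module shows the difference. The id list below is made up for illustration.

```python
import re

# Hypothetical ids: two real targets and two that should NOT match.
ids = ["dataView01", "dataView02", "w7", "overview12"]

char_class = re.compile(r"[dataView]\d+")  # any ONE listed letter, then digits
literal = re.compile(r"^dataView\d+$")     # the exact prefix, then digits only

loose = [i for i in ids if char_class.search(i)]
strict = [i for i in ids if literal.search(i)]

print(loose)   # every id slips through the character class
print(strict)  # only the dataViewNN ids remain
```

The character-class version happens to work on the sample markup only because each id ends in "w" plus digits.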

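Next step: write to a .txt file:

    import csv

    # Python 3: open in text mode with newline='' as the csv docs recommend.
    with open('output.txt', 'w', newline='') as fou:
        csv_writer = csv.writer(fou)
        for tag in tags:
            # "001.00 SPL" -> ['001.00', 'SPL'], written as 001.00,SPL
            split_on_whitespace = tag.text.split()
            csv_writer.writerow(split_on_whitespace)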
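
The csv writer puts a comma between the number and the unit. If the goal is the exact lines shown in the question ("001.00 SPL", one per line), a plain text write may be closer. This is only a sketch: the `values` list stands in for the strings pulled out of the tags, and `write_lines` and the temporary output path are names invented here.

```python
import tempfile
from pathlib import Path

# Stand-in for the extracted strings, e.g. [t.text for t in tags].
values = ["001.00 SPL", "543.00 SPL", "056.00 SPL", "228.00 SPL"]


def write_lines(path, lines):
    """Write each string on its own line, trimmed of stray whitespace."""
    with open(path, "w", encoding="utf-8") as fout:
        for line in lines:
            fout.write(line.strip() + "\n")


# A temporary directory keeps the sketch from touching real files.
out_path = Path(tempfile.mkdtemp()) / "output.txt"
write_lines(out_path, values)

written = out_path.read_text(encoding="utf-8")
print(written)
```

In a real script, `out_path` would simply be the name of the .txt file you want to create.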

2012-04-04 22:14
by bernie

Please note: code is for illustrative purposes, and there are certainly other ways to do this. If you have questions feel free to post a comment and I will explain ASAP. Best of luck - bernie 2012-04-04 22:21

Hello, just downloaded bs4, works great. The only part I am not sure about is how to sample the HTML file and get all the "543.00 SPL" strings ready to be processed by the bs4 library. Thanks! Really helpful - mbilyanov 2012-04-04 22:33

Have you tried the .find_all() code sample I posted? It is tailored to the markup example you provided - bernie 2012-04-04 22:34

Sorry, I have not seen it. I am a bit new to all of this stackoverflow posting and linking etc - mbilyanov 2012-04-04 22:46

I got all of that working; the only missing part is where I open the HTML file and sample it, so that I can feed it into the bs processor. Thanks a lot for the help - mbilyanov 2012-04-04 22:53

For that have a look at urllib2; specifically .urlopen(). Here's a handy HOWTO on the subject: http://docs.python.org/howto/urllib2.htm - bernie 2012-04-04 22:57

Thanks. Sorry to keep asking, but will this work on an offline HTML file? - mbilyanov 2012-04-04 23:04

Yes, definitely. You can open it up as a plain-text file - bernie 2012-04-04 23:05

Great! Works! Thank you very much - mbilyanov 2012-04-04 23:19

@symbolix you need to accept this answer by clicking the checkmark and making it green - NoName 2012-04-09 16:02
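
For the offline-file step discussed in the comments above, a minimal sketch might look like this. It assumes the file is UTF-8 encoded and uses Python's built-in `html.parser` backend. The function name `extract_span_texts` and the sample file `page.html` are invented for illustration and are not part of the original answer.

```python
import re
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def extract_span_texts(html_path):
    """Return the text of every <span> whose id looks like dataViewNN."""
    # Read the offline file as plain text (UTF-8 is an assumption).
    html = Path(html_path).read_text(encoding="utf-8")
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("span", id=re.compile(r"^dataView\d+$"))
    return [tag.get_text(strip=True) for tag in tags]


# Build a small sample file so the sketch runs on its own.
sample = """<html><body>
<span id="dataView01">001.00 SPL</span>
<span id="dataView02">543.00 SPL</span>
<span id="other">ignore me</span>
</body></html>"""

with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text(sample, encoding="utf-8")
    values = extract_span_texts(page)

print(values)
```

With a real document, pass the path of the saved HTML file to `extract_span_texts` instead of the temporary one.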

Use BeautifulSoup

2012-04-04 22:13
by jldupont

Thanks, I progressed with the bs4 answer above. bs4 works great - mbilyanov 2012-04-04 22:44

    import re

    s = '001.00 SPL 543.00 SPL 056.00 SPL 228.00 SPL'
    # Matches four consecutive "ddd.dd SPL" values separated by whitespace.
    print(re.search(r'(\d{3}\.\d{2}\sSPL\s\d{3}\.\d{2}\sSPL\s\d{3}\.\d{2}\sSPL\s\d{3}\.\d{2}\sSPL)', s).group())

I don't know the surrounding text in the HTML document, but this might work.

I see your edit; I will update mine.

Actually, go with jldupont's answer.

2012-04-04 22:11
by apple16

Thanks! Well, I am not familiar with file operations :( I need to open the HTML file, sample it only for those lines, and then extract that middle part. The re lib will probably do the job of finding, and then I will split, filter and extract, but I am not sure how I will open the file and scan it. Thanks - mbilyanov 2012-04-04 22:13
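
As a rough sketch of the open-and-scan step using only the `re` module: read the file line by line and collect every match. This assumes each value has the form of three digits, a dot, two digits and the unit "SPL", as in the question. The names `VALUE_RE`, `scan_file` and `page.html` are invented for illustration. Regexes on HTML are brittle, so a parser is usually the sturdier choice.

```python
import re
import tempfile
from pathlib import Path

# One value looks like "543.00 SPL": ddd.dd, whitespace, then the unit.
VALUE_RE = re.compile(r"\d{3}\.\d{2}\sSPL")


def scan_file(path):
    """Collect every 'ddd.dd SPL' string found in the file, in order."""
    found = []
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            found.extend(VALUE_RE.findall(line))
    return found


# Sample file so the sketch is self-contained.
sample = (
    '<span id="dataView01">001.00 SPL</span>\n'
    '<span id="dataView02">543.00 SPL</span>\n'
    "<p>unrelated text</p>\n"
)

with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text(sample, encoding="utf-8")
    found = scan_file(page)

print(found)
```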