
Fixing broken HTML in Python - BeautifulSoup not working

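I'm interested in scraping the text from this table: https://ows.doleta.gov/unemploy/trigger/2011/trig_100211.html as well as from other, similar tables.

I wrote a quick Python script that works for other tables formatted in a similar way: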

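    # Assumes url (the page address), soup (the parsed page) and
    # f (a csv.writer) were created earlier in the script.
    state = ""
    weeks = ""
    edate = ""
    pdate = url[-11:]   # e.g. "100211.html"
    pdate = pdate[:-5]  # strip ".html", leaving the date, e.g. "100211"

    table = soup.find("table")

    for row in table.findAll("tr"):
        cells = row.findAll("td")
        # Process only rows with exactly 13 <td> cells.
        if len(cells) == 13:
            state = row.find("th").find(text=True)
            weeks = cells[11].find(text=True)
            edate = cells[12].find(text=True)
            try:
                print(pdate, state, weeks, edate)
                f.writerow([pdate, state, weeks, edate])
            except Exception:
                print("{} error".format(state))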
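
The two slicing steps in the script can be checked on their own. The snippet below is only an illustrative sketch: it applies the same slice to the URL from the question and compares it with a regular-expression alternative. The name `pdate_from_regex` and the assumption that the file name always ends in an underscore, six digits and `.html` are not from the original script.

```python
import re

url = "https://ows.doleta.gov/unemploy/trigger/2011/trig_100211.html"

# Slice-based extraction, as in the script: keep the last 11 characters
# ("100211.html") and drop the 5-character ".html" suffix.
pdate = url[-11:-5]

# Hypothetical alternative: match six digits between "_" and ".html".
# This assumes the naming pattern seen in the one URL given in the question.
match = re.search(r"_(\d{6})\.html$", url)
pdate_from_regex = match.group(1) if match else None

print(pdate, pdate_from_regex)
```

The regular expression returns `None` instead of a wrong substring if a URL does not follow the assumed pattern, which may be easier to detect than a silently shifted slice.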


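However, the script does not work for this table, because the markup is broken in half of the rows. Half of the rows have no opening `<tr>` tag to mark the start of the row: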

    </tr> #end of last row, on State0
    <td headers = "State1 no info", attributes> <FONT attributes> text </FONT> </td> …
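
One possible workaround, sketched here under the assumption that the only damage is the missing opening `<tr>`, is to repair the markup as a string before parsing it. The helper name `insert_missing_tr`, the pattern `_MISSING_TR` and the sample markup are hypothetical and only illustrate the idea; the real page may be broken in other ways as well.

```python
import re

# Matches a closing </tr> that is followed (after optional whitespace) by a
# cell tag instead of a new <tr>, i.e. a row whose opening tag is missing.
_MISSING_TR = re.compile(r"(</tr\s*>)(\s*)(?=<t[dh][\s>])", re.IGNORECASE)


def insert_missing_tr(html):
    """Return html with <tr> inserted wherever a cell directly follows </tr>."""
    return _MISSING_TR.sub(r"\1\2<tr>", html)


# Made-up sample imitating the broken structure described in the question.
broken = (
    "<table>"
    "<tr><th>State0</th><td>a</td></tr>\n"
    "<td headers='State1'>b</td></tr>"
    "</table>"
)
repaired = insert_missing_tr(broken)
print(repaired)
```

The repaired string can then be handed to the parser in place of the raw page. Because the pattern only fires when a cell tag directly follows `</tr>`, running the function on markup that is already well formed leaves it unchanged.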


python html-table tidy beautifulsoup web-scraping

Kay*_*Kay

2014-08-13

Score: 3 | Answers: 1 | Views: 1856
