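I'm interested in scraping text from this table: https://ows.doleta.gov/unemploy/trigger/2011/trig_100211.html and from others like it.
I wrote a quick Python script that works for other tables formatted in a similar way: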
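    # Assumes `url` is the page address, `soup` is the parsed BeautifulSoup
    # document, and `f` is a csv.writer; none of these are shown here.
    state = ""
    weeks = ""
    edate = ""
    pdate = url[-11:-5]  # date stamp from the file name, e.g. "100211"
    table = soup.find("table")
    for row in table.findAll("tr"):
        cells = row.findAll("td")
        # Only data rows have exactly 13 <td> cells.
        if len(cells) == 13:
            state = row.find("th").find(text=True)
            weeks = cells[11].find(text=True)
            edate = cells[12].find(text=True)
            try:
                print(pdate, state, weeks, edate)
                f.writerow([pdate, state, weeks, edate])
            except Exception:
                print(state + " error")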
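The `pdate` slicing keeps only the date stamp from the file name. Here is a small standalone check using the URL above. Reading the stamp as MMDDYY is my assumption, based on the `2011` folder in the path, and `parsed` is just an illustrative name:

```python
from datetime import datetime

url = "https://ows.doleta.gov/unemploy/trigger/2011/trig_100211.html"

# The last 11 characters are "100211.html"; dropping the final 5 (".html")
# leaves the six-digit stamp.
pdate = url[-11:-5]
print(pdate)

# Assumption: the stamp is MMDDYY, which would make this 2 October 2011.
parsed = datetime.strptime(pdate, "%m%d%y").date()
print(parsed.isoformat())
```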
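However, the script does not work for this table, because the markup is broken in half of the rows. Half of the rows are formatted without an opening <tr> tag to indicate where the row starts: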
    </tr> #end of last row, on State0
    <td headers = "State1 no info", attributes> <FONT attributes> text </FONT> </td> …
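One possible workaround is to skip tree building and group cells by the closing `</tr>` tags instead. This is only a sketch using the standard library's `html.parser`, which reports tags as it meets them and does not try to repair the tree. It assumes every row, including the broken ones, still ends with `</tr>`. The `RowCollector` class and the `sample` markup are made up for illustration and are not taken from the real page:

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Group th/td text into rows, closing a row at each </tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = []     # cells of the row being built
        self._cell = None  # text pieces of the open cell, or None

    def _flush_cell(self):
        # Finish the open cell, collapsing runs of whitespace.
        if self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row:
            self.rows.append(self._row)
            self._row = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()   # a well-formed row start also ends the previous row
        elif tag in ("td", "th"):
            self._flush_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()   # </tr> ends the row even if <tr> was missing

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


# Hypothetical markup: the second row has no opening <tr>.
sample = """
<table>
<tr><th>State0</th><td><font>13</font></td><td><font>10/01/2011</font></td></tr>
<td headers="State1"><font>State1</font></td><td><font>20</font></td><td><font>10/08/2011</font></td></tr>
</table>
"""

collector = RowCollector()
collector.feed(sample)
collector.close()
for row in collector.rows:
    print(row)
```

With rows recovered this way, the state name in a broken row would be the first cell rather than a `<th>`, so the column indexes in the script above would need adjusting against the real page.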