Faz*_*hra 1 python beautifulsoup typeerror
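I am scraping this page, http://www.crmz.com/Directory/Industry806.htm, and I am supposed to get all of the table data,
but there is an RSS link next to the company name, so I get no results and a TypeError is raised instead.
Here is my code: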
#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mech = Browser()
url = "http://www.crmz.com/Directory/Industry806.htm"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find("table", {"border":"0", "cellspacing":"1", "cellpadding":"2"})
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text+"|",
    print
Here is my output:
LRI$ python scrape.py
#| Company Name| Country| State/Province|
1.| 1300 Smiles Limited|
Traceback (most recent call last):
  File "scrape.py", line 17, in <module>
    text = ''.join(td.find(text=True))
TypeError
Some of your text searches return None, and trying to join that value raises the exception:
>>> [td.find(text=True) for td in rows[6].findAll('td')]
[u'2.', u'1st Dental Laboratories Plc', None, u'United Kingdom', u' ']
That None is what triggers the exception:
>>> ''.join(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError
That is because .find() only finds the first text object, or returns None if there is no such object. You probably meant to use td.findAll(text=True), which always returns a list:
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.findAll(text=True))
        print text+"|",
    print
Or better still, use the tag.getText() method:
for tr in rows:
    cols = tr.findAll('td')
    if cols:
        print u'|'.join([td.getText() for td in cols])
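If you would rather keep the single-result lookup, guarding against None also avoids the error. The following is only a minimal sketch of that idea in plain Python 3 (the code in this thread is Python 2). It needs no BeautifulSoup: the list simply reuses the values from the interpreter session above, and join_fragments is a hypothetical helper name, not part of any library.

```python
def join_fragments(fragments):
    """Join text fragments; a None result (no match) becomes an empty string."""
    if fragments is None:
        return ""
    return "".join(fragments)


# Values copied from the interpreter session above: the third cell,
# the one holding the RSS link, produced no text at all.
first_texts = ["2.", "1st Dental Laboratories Plc", None, "United Kingdom", " "]

# This mirrors what the original loop did when it hit the empty cell.
try:
    "".join(None)
except TypeError as exc:
    error_name = type(exc).__name__

# With the guard in place, the empty cell becomes an empty field.
line = "|".join(join_fragments(text) for text in first_texts)

print(error_name)
print(line)
```

The guard keeps the column count intact, so the empty cell still shows up as an empty field between two separators.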
I strongly recommend you switch to BeautifulSoup 4; BeautifulSoup 3 has not seen any bug fixes or other maintenance for over 2 years now.
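For reference, here is roughly what the same loop could look like with BeautifulSoup 4 on Python 3. Treat it as a sketch under assumptions: it requires the third-party beautifulsoup4 package, SAMPLE_HTML is hypothetical markup standing in for the real page (the cell values come from the interpreter session above, the RSS cell markup is a guess), and table_rows is an illustrative name.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical stand-in for the real page. The third cell mimics the
# RSS-link cell, which contains markup but no text.
SAMPLE_HTML = """
<table border="0" cellspacing="1" cellpadding="2">
  <tr><th>#</th><th>Company Name</th></tr>
  <tr>
    <td>2.</td>
    <td>1st Dental Laboratories Plc</td>
    <td><a href="#"><img src="rss.gif" alt=""></a></td>
    <td>United Kingdom</td>
    <td> </td>
  </tr>
</table>
"""


def table_rows(html):
    """Return the text of every <td> cell, one list per table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(
        "table", {"border": "0", "cellspacing": "1", "cellpadding": "2"}
    )
    rows = []
    for tr in table.find_all("tr"):
        cols = tr.find_all("td")
        if cols:  # skip rows without <td> cells, such as header rows
            # get_text() returns '' for a cell with no text, never None
            rows.append([td.get_text(strip=True) for td in cols])
    return rows


if __name__ == "__main__":
    for row in table_rows(SAMPLE_HTML):
        print("|".join(row))
```

In BeautifulSoup 4 the methods are spelled find_all() and get_text(); the older findAll() and getText() names are kept as aliases for compatibility.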
You may also want to look at the csv module for writing your output.
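As a rough illustration of that last suggestion, the sketch below writes pipe-delimited rows with the standard csv module in Python 3. The first row reuses the values from the interpreter session above; the second row is made up purely to show how a field containing the delimiter gets quoted. write_rows is a hypothetical helper name.

```python
import csv
import io


def write_rows(rows, out):
    """Write rows to a file-like object as pipe-delimited text."""
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerows(rows)


rows = [
    ["2.", "1st Dental Laboratories Plc", "", "United Kingdom", ""],
    # Hypothetical row: a value containing "|" is quoted automatically.
    ["3.", "Example | Co", "", "Nowhere", ""],
]

buffer = io.StringIO()  # stands in for a real file in this sketch
write_rows(rows, buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the StringIO buffer.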