Iyk*_*eln 4 python beautifulsoup html-parsing python-2.7
I want to extract the text from my HTML file. If I hard-code a specific file, as below:
import bs4, sys
from urllib import urlopen
#filin = open(sys.argv[1], 'r')
filin = '/home/iykeln/Desktop/R_work/file1.html'
webpage = urlopen(filin).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')
it works. But when I try to take an arbitrary file from the command line with open(sys.argv[1], 'r'), as below:
import bs4, sys
from urllib import urlopen
filin = open(sys.argv[1], 'r')
#filin = '/home/iykeln/Desktop/R_work/file1.html'
webpage = urlopen(filin).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')
or
import bs4, sys
from urllib import urlopen
with open(sys.argv[1], 'r') as filin:
    webpage = urlopen(filin).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')
I get the following error:
Traceback (most recent call last):
  File "/home/iykeln/Desktop/py/clean.py", line 5, in <module>
    webpage = urlopen(filin).read().decode('utf-8')
  File "/usr/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 180, in open
    fullurl = unwrap(toBytes(fullurl))
  File "/usr/lib/python2.7/urllib.py", line 1057, in unwrap
    url = url.strip()
AttributeError: 'file' object has no attribute 'strip'
You shouldn't call open yourself. urlopen expects a filename or URL string, not a file object, so just pass it the filename:
import bs4, sys
from urllib import urlopen
webpage = urlopen(sys.argv[1]).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')
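One caveat if this ever gets ported: the snippet above relies on Python 2.7, where urllib.urlopen accepts a bare local path. On Python 3, urllib.request.urlopen rejects such a path as an unknown URL type, so the path has to be turned into a file:// URL first. A rough sketch of that, assuming Python 3 (fetch_local is just a name I made up for illustration, not part of urllib):

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_local(path, encoding="utf-8"):
    """Read a local file through urlopen by way of a file:// URL.

    Hypothetical helper for illustration; assumes Python 3.
    """
    # as_uri() only works on absolute paths, hence resolve().
    url = Path(path).resolve().as_uri()
    with urlopen(url) as response:
        return response.read().decode(encoding)
```

Calling fetch_local(sys.argv[1]) would then give the same decoded text as the webpage variable above.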
FYI, you don't need urllib at all to open a local file:
import bs4, sys
with open(sys.argv[1], 'r') as f:
    webpage = f.read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')
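If the script might be handed either a filename or an already opened file, a small helper along these lines avoids the mix-up from the question. This is only a sketch: read_html is a made-up name (not part of bs4 or urllib), and it uses io.open so that decoding is explicit and the code should behave the same on Python 2.7 and Python 3.

```python
import io


def read_html(source, encoding="utf-8"):
    """Return the decoded text of `source`.

    `source` may be a path string or an already opened file object.
    Hypothetical helper for illustration only.
    """
    if hasattr(source, "read"):
        data = source.read()
        # A file opened in binary mode (or any Python 2 file) yields bytes
        # that still need decoding; text-mode files already give text.
        return data.decode(encoding) if isinstance(data, bytes) else data
    # Otherwise treat `source` as a path and let io.open do the decoding.
    with io.open(source, "r", encoding=encoding) as handle:
        return handle.read()
```

The returned text can be passed to bs4.BeautifulSoup exactly like webpage in the snippet above.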
Hope that helps.