Posts by Iyk*_*eln

Extracting text from an HTML file using bs4

I want to extract the text from my HTML file. If I hard-code a specific file path, as below:

import bs4, sys
from urllib import urlopen
#filin = open(sys.argv[1], 'r')
filin = '/home/iykeln/Desktop/R_work/file1.html' 
webpage = urlopen(filin).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')

it works. But when I try to handle an arbitrary file by using open(sys.argv[1], 'r'), as below:

import bs4, sys
from urllib import urlopen
filin = open(sys.argv[1], 'r')
#filin = '/home/iykeln/Desktop/R_work/file1.html' 
webpage = urlopen(filin).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')

or

import bs4, sys
from urllib import urlopen
with open(sys.argv[1], 'r') as filin:
    webpage = urlopen(filin).read().decode('utf-8')
    soup = bs4.BeautifulSoup(webpage)
    for node …
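
For comparison, here is a minimal sketch of one way to read a file named on the command line. It assumes the failing variants break because urlopen expects a URL or path string rather than an already-open file object, so it skips urlopen and reads the file directly. The helper name extract_text and its encoding parameter are hypothetical, not from the original post, and get_text() stands in for joining findAll(text=True).

```python
import io


def extract_text(path, encoding="utf-8"):
    """Return the text content of the HTML file at `path` (hypothetical helper)."""
    # Imported inside the function so bs4 is only required when the helper runs.
    from bs4 import BeautifulSoup

    # Read the local file directly instead of handing a file object to urlopen().
    # io.open accepts an encoding argument on both Python 2.7 and Python 3.
    with io.open(path, "r", encoding=encoding) as handle:
        markup = handle.read()

    # Name the parser explicitly; "html.parser" ships with the standard library.
    soup = BeautifulSoup(markup, "html.parser")

    # get_text() concatenates every text node in the document.
    return soup.get_text()
```

From a script, this could be called as extract_text(sys.argv[1]) after importing sys, with the result passed to print.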


python beautifulsoup html-parsing python-2.7

Iyk*_*eln

2016-12-20

4
votes

1
answer

10k
views