Posts by Iyk*_*eln

Extracting text from an HTML file with bs4

I want to extract the text from my HTML file. If I hard-code a specific file, as below:

import bs4, sys
from urllib import urlopen
#filin = open(sys.argv[1], 'r')
filin = '/home/iykeln/Desktop/R_work/file1.html' 
webpage = urlopen(filin).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')

it works. But when I try to take an arbitrary file with open(sys.argv[1], 'r') instead, as below:

import bs4, sys
from urllib import urlopen
filin = open(sys.argv[1], 'r')
#filin = '/home/iykeln/Desktop/R_work/file1.html' 
webpage = urlopen(filin).read().decode('utf-8')
soup = bs4.BeautifulSoup(webpage)
for node in soup.findAll('html'):
    print u''.join(node.findAll(text=True)).encode('utf-8')

or

import bs4, sys
from urllib import urlopen
with open(sys.argv[1], 'r') as filin:
    webpage = urlopen(filin).read().decode('utf-8')
    soup = bs4.BeautifulSoup(webpage)
    for node …
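The working variant passes a path string to urlopen, while the other two pass an already opened file object, which urlopen is not designed to accept. A minimal sketch of one way around this is to skip urlopen for local files and hand the open file straight to BeautifulSoup. It assumes Python 3 and bs4 4.x with the built-in "html.parser" backend rather than the Python 2.7 setup in the question, and the function name extract_text is hypothetical.

```python
def extract_text(path):
    """Return the text content of the local HTML file at `path`."""
    # Third-party dependency (pip install beautifulsoup4), imported lazily.
    from bs4 import BeautifulSoup

    # Open the path directly: urlopen() is meant for URLs, not file objects.
    with open(path, encoding="utf-8") as handle:
        # BeautifulSoup accepts an open file handle as well as a string.
        soup = BeautifulSoup(handle, "html.parser")

    # get_text() concatenates every text node in the document.
    return soup.get_text()
```

From a script this could be called as print(extract_text(sys.argv[1])) after import sys, so the file name comes from the command line as in the question.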

python beautifulsoup html-parsing python-2.7

4 votes, 1 answer, 10k views
