Reading a page source saved in a text file and extracting text

Question

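I have multiple text files that store page sources from a website, so each text file is one page source.

I need to extract the text from a div class stored in the text file, using the following code: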

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt"))
txt = soup.find('div' , attrs = { 'class' : 'id-app-orig-desc' }).text
print txt


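I have checked the type of my soup object to make sure it is not falling back to the string find method when looking for the div class. The type of the soup object: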

print type(soup)
<class 'bs4.BeautifulSoup'>


Following an earlier post, I put the open() call inside the BeautifulSoup constructor.

Error:

Traceback (most recent call last):
  File "html_desc_cleaning.py", line 13, in <module>
    txt2 = soup.find('div' , attrs = { 'class' : 'id-app-orig-desc' }).text
AttributeError: 'NoneType' object has no attribute 'text'


Source from the page:

Answer 1

Kev*_*uan 7

Try replacing this:

soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt"))


with this:

soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt").read())


By the way, it is a good idea to close the file after reading it. You can use with like this:

with open("zing.internet.accelerator.plus.txt") as f:
    soup = BeautifulSoup(f.read())


with closes the file automatically when the block exits.
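To see the automatic close in action, here is a small sketch using only the standard library. The temporary file and its contents are placeholders made up for the demo, not the file from the question.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained
# (the path and contents are placeholders, not from the question).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Hey there.")
    path = tmp.name

with open(path, encoding="utf-8") as f:
    contents = f.read()
    closed_inside = f.closed   # the file is still open inside the block

closed_after = f.closed        # the file is closed once the block exits

os.remove(path)                # clean up the placeholder file
print(contents, closed_inside, closed_after)
# prints: Hey there. False True
```

The string read inside the block stays usable after the file has been closed, so it can be passed on to BeautifulSoup afterwards.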

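Here is an example showing why .read() matters: open() returns a file object, while .read() returns the file's contents as a string: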

>>> a = open('test.txt')
>>> type(a)
<class '_io.TextIOWrapper'>

>>> print(a)
<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'>

>>> b = a.read()
>>> type(b)
<class 'str'>

>>> print(b)
Hey there.

>>> print(open('test.txt'))
<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'>

>>> print(open('test.txt').read())
Hey there.

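One more thing worth checking: the AttributeError in the traceback means soup.find() returned None, i.e. no matching div was found in whatever got parsed. Below is a minimal sketch of guarding against that, assuming a saved page may or may not contain the div. The helper name extract_description and the sample markup are hypothetical; only the class name id-app-orig-desc comes from the question.

```python
from bs4 import BeautifulSoup

# Placeholder markup standing in for one saved page source.
html = (
    '<html><body>'
    '<div class="id-app-orig-desc">Sample description</div>'
    '</body></html>'
)


def extract_description(markup, css_class="id-app-orig-desc"):
    """Return the text of the first matching div, or None if there is none."""
    # Name the parser explicitly so the result does not depend on
    # which optional parsers happen to be installed.
    soup = BeautifulSoup(markup, "html.parser")
    div = soup.find("div", class_=css_class)
    # find() returns None when nothing matches, so check before using it.
    return div.get_text(strip=True) if div is not None else None


found = extract_description(html)
missing = extract_description("<html><body><p>no such div</p></body></html>")
print(found)    # Sample description
print(missing)  # None
```

If this returns None for one of your files, it is worth opening that file and confirming the div with that class is actually present in the saved source.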

Archived:	10 years, 2 months ago
Viewed:	5788 times
Last active:	10 years, 2 months ago