Reading a downloaded HTML file with Pandas

Question

As the title says, I tried using read_html, but it gives me the following error:

In [17]:temp = pd.read_html('C:/age0.html',flavor='lxml')
  File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6


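What am I doing wrong?

Update 01

The HTML contains some JavaScript at the top, followed by an HTML table. I used R to process it: parsing the HTML with the XML package gave me a data frame. I want to do the same in Python. Should I use something else, such as BeautifulSoup, before handing the data to Pandas?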

Answer 1

ZJS*_*ZJS 6

I think you are on the right track by using an HTML parser such as Beautiful Soup. pandas.read_html() reads HTML tables, not HTML pages.

You will want to do something like this:

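from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# Parse the whole page, then isolate the first <table> element
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f.read(), 'html.parser').find('table')

# read_html takes markup text (not a BeautifulSoup object), so pass str(table);
# it returns a list of DataFrames, one per table found
df = pd.read_html(StringIO(str(table)))[0]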


I couldn't get this solution to work (but I also couldn't install lxml, which may have something to do with it). However, `df = pd.read_html('path/to/file.html',flavor='bs4')` worked fine. (4 upvotes)
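
A minimal sketch of the `flavor='bs4'` route mentioned in the comment above. The `html` string is a hypothetical stand-in for the downloaded page (a script block followed by one table), not the asker's actual file, and the column names are made up for illustration. It assumes `beautifulsoup4` and `html5lib` are installed, since pandas uses both for this flavor.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: JavaScript first, then a table.
html = """
<html>
  <head><script>var x = 1;</script></head>
  <body>
    <table>
      <tr><th>name</th><th>age</th></tr>
      <tr><td>Ann</td><td>31</td></tr>
      <tr><td>Bob</td><td>45</td></tr>
    </table>
  </body>
</html>
"""

# flavor="bs4" makes pandas parse with BeautifulSoup + html5lib instead of lxml.
tables = pd.read_html(StringIO(html), flavor="bs4")

# read_html returns a list with one DataFrame per <table> it finds.
df = tables[0]
print(df)
```

For a file on disk, passing the path directly, as in the comment, should behave the same way. Note that the result is a list, so the table itself is usually `tables[0]`.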
