Gan*_*row 25 python linux bash perl
I am looking for a way to get certain information out of HTML in a Linux shell environment.
This is the bit I am interested in:
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>
I would like to store these values in shell variables, or echo them as key-value pairs extracted from the HTML above. Example:
Tests : 103
Failures : 24
Success Rate : 76.70 %
and so on..
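What I can do right now is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this information.
But using Java here seems like overhead, since the runnable jar would have to be bundled with the "wrapper" script you want to execute.
I am sure there must be "shell" languages that can do the same thing, e.g. perl, python, bash, etc.
My problem is that I have no experience with any of these. Can somebody help me solve this "fairly easy" problem?
Quick update:
I forgot to mention that the .html file contains more tables and more rows (sorry, early morning).
Update #2:
I tried to install Bsoup like this, since I don't have root access: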
$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz
$ cp -r beautifulsoup4-4.1.0/bs4 .
$ vi htmlParse.py # (paste code from ) Tichodromas' answer, just in case this (http://pastebin.com/4Je11Y9q) is what I pasted
$ run file (python htmlParse.py)
Error:
$ python htmlParse.py
Traceback (most recent call last):
File "htmlParse.py", line 1, in ?
from bs4 import BeautifulSoup
File "/home/gdd/setup/py/bs4/__init__.py", line 29
from .builder import builder_registry
^
SyntaxError: invalid syntax
Update #3:
Running Tichodromas' answer gives this error:
Traceback (most recent call last):
File "test.py", line 27, in ?
headings = [th.get_text() for th in table.find("tr").find_all("th")]
TypeError: 'NoneType' object is not callable
Any ideas?
小智 44
A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit3: using class="details" to select the table):
from bs4 import BeautifulSoup
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""
soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})
# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print datasets
The result looks like this:
[[(u'Tests', u'103'),
(u'Failures', u'24'),
(u'Success Rate', u'76.70%'),
(u'Average Time', u'71 ms'),
(u'Min Time', u'0 ms'),
(u'Max Time', u'829 ms')]]
Edit2: To produce the desired output, use something like this:
for dataset in datasets:
    for field in dataset:
        print "{0:<16}: {1}".format(field[0], field[1])
Result:
Tests : 103
Failures : 24
Success Rate : 76.70%
Average Time : 71 ms
Min Time : 0 ms
Max Time : 829 ms
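The question mentions that BeautifulSoup could not be installed and that the file holds several tables. As a rough alternative (not the approach used in this answer), the same extraction can be sketched with only the standard-library `html.parser` module on Python 3. The names `DetailsTableParser` and `extract_records` and the extra `summary` table in the sample are made up for illustration, and the sketch assumes tables are not nested.

```python
from html.parser import HTMLParser


class DetailsTableParser(HTMLParser):
    """Collect the rows of every <table class="details"> as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []        # one list of rows per matching table
        self._in_table = False  # True while inside a matching table
        self._row = None        # cells of the row being read, or None
        self._cell = None       # text fragments of the cell being read, or None

    def _close_cell(self):
        # Store the finished cell text (whitespace stripped) in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "details" in classes:
            self._in_table = True
            self.tables.append([])
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("th", "td"):
            self._close_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.tables[-1].append(self._row)
            self._row = None
        elif tag == "table":
            self._in_table = False


def extract_records(html):
    """Return one list of (heading, value) pairs per data row."""
    parser = DetailsTableParser()
    parser.feed(html)
    records = []
    for rows in parser.tables:
        if not rows:
            continue
        headings = rows[0]  # the first row holds the field names
        for cells in rows[1:]:
            records.append(list(zip(headings, cells)))
    return records


# The "summary" table is a hypothetical extra table that should be skipped.
sample = """
<table class="summary"><tr><th>Ignored</th></tr><tr><td>1</td></tr></table>
<table class="details">
<tr><th>Tests</th><th>Failures</th><th>Success Rate</th></tr>
<tr class="Failure"><td>103</td><td>24</td><td>76.70%</td></tr>
</table>
"""

for record in extract_records(sample):
    for key, value in record:
        print("{0:<16}: {1}".format(key, value))
```

Because the parser keeps a separate row list per matching table, files with several `details` tables yield one record per data row across all of them.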
小智 9
import pandas as pd
html_tables = pd.read_html('resources/test.html')
df = html_tables[0]
df.T # transpose to align
0
Tests 103
Failures 24
Success Rate 76.70%
Average Time 71 ms
This is the top answer, adapted for Python 3 compatibility and improved by stripping whitespace from the cells:
from bs4 import BeautifulSoup
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
</table>"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table")
# The first tr contains the field names.
headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]
print(headings)
datasets = []
for row in table.find_all("tr")[1:]:
    # Strip whitespace from each cell, then pair it with its heading.
    cells = (td.get_text().strip() for td in row.find_all("td"))
    dataset = dict(zip(headings, cells))
    datasets.append(dataset)

print(datasets)
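The question also asks for the values to end up in shell variables. One possible way to bridge the two, sketched here under the assumption that each row is a dict like the ones built above, is to print `NAME=value` lines that the calling shell can evaluate. The helper name `to_shell_assignments` is hypothetical, and headings that start with a digit are not handled.

```python
import re
import shlex


def to_shell_assignments(dataset):
    """Turn {"Success Rate": "76.70%"} into lines such as SUCCESS_RATE=76.70%."""
    lines = []
    for key, value in dataset.items():
        # Shell variable names may only contain letters, digits and underscores.
        name = re.sub(r"[^A-Za-z0-9_]+", "_", key.strip()).upper()
        # shlex.quote adds quotes only where the shell would need them.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return "\n".join(lines)


# Values taken from the table in the question.
dataset = {
    "Tests": "103",
    "Failures": "24",
    "Success Rate": "76.70%",
    "Average Time": "71 ms",
}
print(to_shell_assignments(dataset))
```

If this lived in a script, say a hypothetical `extract.py`, a bash wrapper could load the variables with `eval "$(python3 extract.py)"` and then refer to `$TESTS` or `$SUCCESS_RATE`.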