Mar*_*ria 5 html python beautifulsoup pandas
How do I make the rows match the table on a Wikipedia page when a row contains a cell with a rowspan attribute?
from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring
import re
import csv
import pandas as pd

wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

try:
    table = soup.find_all('table')[6]
except AttributeError as e:
    print 'No tables found, exiting'

try:
    first = table.find_all('tr')[0]
except AttributeError as e:
    print 'No table row found, exiting'

try:
    allRows = table.find_all('tr')[1:-1]
except AttributeError as e:
    print 'No table row found, exiting'

headers = [header.get_text() for header in first.find_all(['th', 'td'])]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]

df = pd.DataFrame(data=results, columns=headers)
df
I do get a table as output, but for tables whose rows contain a rowspan cell, the table I get looks like this:

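As you know, the problem is caused by the following case.

HTML content:

<tr>
  <td rowspan="2">2=</td>
  <td>West Indies</td>
  <td>4</td>
  <td>Lord's</td>
  <td>2009</td>
</tr>
<tr>
  <td style="text-align:left;">India</td>
  <td>4</td>
  <td>Mumbai</td>
  <td>2012</td>
</tr>

So when a td has a rowspan attribute, treat its value as repeated at the same td position in the following tr tags. The rowspan value is the number of tr tags the cell covers, counting its own.

1. Collect the rowspan information and save it in a variable: the sequence number of the tr tag, the sequence number of the td tag within that tr, the rowspan value (i.e. how many tr tags share the same td), and the td's text value.
2. Update the results for the affected tr tags according to the method above.

Note: this was only checked against the given test case. More test cases need to be checked.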
Code:

from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring
import re
import csv
import pandas as pd


wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)

soup = BeautifulSoup(page)

table = soup.find_all('table')[6]

tmp = table.find_all('tr')

first = tmp[0]
allRows = tmp[1:-1]

headers = [header.get_text() for header in first.find_all('th')]

results = [[data.get_text() for data in row.find_all('td')] for row in allRows]

# Example cell: <td rowspan="2">2=</td>
# rowspan is a list of tuples:
#   (index of tr, index of td within that tr, rowspan count, text value)
# e.g. [(1, 0, 2, u'2=')]
#   (tr index 1, td index 0 within that tr, repeated 2 times, value is "2=")
rowspan = []

for no, tr in enumerate(allRows):
    for td_no, data in enumerate(tr.find_all('td')):
        # has_attr replaces the deprecated has_key
        if data.has_attr("rowspan"):
            rowspan.append((no, td_no, int(data["rowspan"]), data.get_text()))

if rowspan:
    for i in rowspan:
        # The originating tr already holds the value in results,
        # so start at offset 1 and fill only the following rows.
        for j in xrange(1, i[2]):
            # Insert the value into the next tr at the same td position.
            results[i[0]+j].insert(i[1], i[3])

df = pd.DataFrame(data=results, columns=headers)
print df

Output:
   Rank      Opponent  No. wins  Most recent venue  Season
0     1  South Africa         6             Lord's    1951
1    2=   West Indies         4             Lord's    2009
2    2=         India         4             Mumbai    2012
3     4     Australia         3             Sydney    1932
4     5      Pakistan         2       Trent Bridge    1967
5     6     Sri Lanka         1       Old Trafford    2002

It also works for table 10:
   Rank  Hundreds            Player  Matches  Innings  Average
0     1        25     Alastair Cook      107      191    45.61
1     2        23   Kevin Pietersen      104      181    47.28
2     3        22     Colin Cowdrey      114      188    44.07
3     3        22     Wally Hammond       85      140    58.46
4     3        22  Geoffrey Boycott      108      193    47.72
5     6        21    Andrew Strauss      100      178    40.91
6     6        21          Ian Bell      103      178    45.30
7    8=        20    Ken Barrington       82      131    58.67
8    8=        20      Graham Gooch      118      215    42.58
9    10        19        Len Hutton       79      138    56.67
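Since the approach was only checked against the given tables, here is a hedged sketch of a variant that may hold up better when a rowspan cell starts in a row that is itself shifted by an earlier rowspan. It is not the method used in the code above: instead of inserting values after the fact, it fills each row column by column and carries spanning cells downward. The function name `expand_rowspans` and the `(text, rowspan)` input format are assumptions made for illustration, and the sketch is written for Python 3 with no parsing dependency.

```python
def expand_rowspans(rows):
    """Expand rows of (text, rowspan) cells into full rows of text.

    `rows` is a list of rows; each row lists only the cells physically
    present in that <tr>, in document order, as (text, rowspan) tuples.
    (Hypothetical helper, not part of BeautifulSoup or pandas.)
    """
    grid = []
    pending = {}  # column index -> [rows still to fill, text]
    for cells in rows:
        out = []
        queue = list(cells)
        col = 0
        while queue or col in pending:
            if col in pending:
                # A cell from an earlier row still covers this column.
                pending[col][0] -= 1
                out.append(pending[col][1])
                if pending[col][0] == 0:
                    del pending[col]
            else:
                text, span = queue.pop(0)
                out.append(text)
                if span > 1:
                    # Remember the value for the next span - 1 rows.
                    pending[col] = [span - 1, text]
            col += 1
        grid.append(out)
    return grid


# The two rows from the HTML sample in the answer.
sample = [
    [("2=", 2), ("West Indies", 1), ("4", 1), ("Lord's", 1), ("2009", 1)],
    [("India", 1), ("4", 1), ("Mumbai", 1), ("2012", 1)],
]
expanded = expand_rowspans(sample)
```

With BeautifulSoup, the input tuples could presumably be built per row with something like `[(td.get_text(), int(td.get('rowspan', 1))) for td in tr.find_all(['th', 'td'])]`, and the returned rows passed to `pd.DataFrame` as before. The sketch ignores colspan and malformed tables.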