What should I do when a <tr> has rowspan?

Mar*_*ria 5 html python beautifulsoup pandas

If a row has an element with rowspan, how do I make that row match the table as it appears on the Wikipedia page?

from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring 
import re
import csv
import pandas as pd

wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

try:
    table = soup.find_all('table')[6]
except AttributeError as e:
    print 'No tables found, exiting'

try:
    first = table.find_all('tr')[0]
except AttributeError as e:
    print 'No table row found, exiting'

try:
    allRows = table.find_all('tr')[1:-1]
except AttributeError as e:
    print 'No table row found, exiting'


headers = [header.get_text() for header in first.find_all(['th', 'td'])]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]


df = pd.DataFrame(data=results, columns=headers)
df

I get the table as output, but for rows that contain rowspan the table comes out as follows: [image]

Viv*_*ble 3

As you know, the problem is caused by the following case.

HTML content:

<tr>
     <td rowspan="2">2=</td>
     <td>West Indies</td>
     <td>4</td>
     <td>Lord's</td>
     <td>2009</td>
</tr>
<tr>
     <td style="text-align:left;">India</td>
     <td>4</td>
     <td>Mumbai</td>
     <td>2012</td>
</tr>
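To make the mismatch concrete, here is a minimal sketch that parses the snippet above and counts the cells in each row. It is written for Python 3 (unlike the Python 2 code in this thread) and uses only the standard library's html.parser so it runs without extra packages. The class name CellCollector and its attributes are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser

# Same markup as the snippet above.
HTML = """
<tr>
     <td rowspan="2">2=</td>
     <td>West Indies</td>
     <td>4</td>
     <td>Lord's</td>
     <td>2009</td>
</tr>
<tr>
     <td style="text-align:left;">India</td>
     <td>4</td>
     <td>Mumbai</td>
     <td>2012</td>
</tr>
"""


class CellCollector(HTMLParser):
    """Hypothetical helper: collects (text, rowspan) per cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of (text, rowspan) tuples per <tr>
        self._span = None   # rowspan of the cell being read, None outside a cell
        self._text = []     # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            # A missing rowspan attribute means the cell covers one row.
            self._span = int(dict(attrs).get("rowspan") or 1)
            self._text = []

    def handle_data(self, data):
        if self._span is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._span is not None:
            self.rows[-1].append(("".join(self._text).strip(), self._span))
            self._span = None


collector = CellCollector()
collector.feed(HTML)
rows = collector.rows

# The second <tr> is one cell short because its first column is
# covered by the rowspan="2" cell of the row above.
print([len(r) for r in rows])  # [5, 4]
```

The second row has four cells instead of five, which is why its values shift one column to the left when the rows are passed to a DataFrame as they are.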

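So when a td has a rowspan attribute, treat the same td value as repeated at the same position in the following tr tags; the rowspan value is the number of tr tags that share that td.

  1. Collect all such rowspan information and save it in a variable: the sequence number of the tr tag, the sequence number of the td tag within it, the rowspan value (how many tr tags share the same td), and the text value of the td.
  2. Update the results for every tr according to the method above.

Note: this was checked only against the given test case. More test cases need to be checked.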
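The two steps can also be sketched on plain Python data, independent of any HTML parser. This is a Python 3 illustration, not the code from this answer: it tracks spanned cells per output column, so it also copes with several rowspan cells that overlap, a case the note above leaves untested. The function name expand_rowspans and the (text, rowspan) tuple layout are assumptions made for this example, and it assumes a well-formed table with no colspan.

```python
def expand_rowspans(rows):
    """Repeat every spanned cell in the rows it covers.

    rows: list of rows, each a list of (text, rowspan) tuples in document order.
    Returns a list of rows containing only the cell texts.
    """
    pending = {}  # column index -> (text, number of further rows to fill)
    result = []
    for cells in rows:
        cells = list(cells)  # copy so the caller's data is not modified
        out = []
        col = 0
        while cells or col in pending:
            if col in pending:
                # This column is covered by a cell from an earlier row.
                text, left = pending[col]
                out.append(text)
                if left == 1:
                    del pending[col]
                else:
                    pending[col] = (text, left - 1)
            else:
                text, span = cells.pop(0)
                out.append(text)
                if span > 1:
                    pending[col] = (text, span - 1)
            col += 1
        result.append(out)
    return result


# Cells taken from the snippet and output in this answer.
rows = [
    [("1", 1), ("South Africa", 1), ("6", 1), ("Lord's", 1), ("1951", 1)],
    [("2=", 2), ("West Indies", 1), ("4", 1), ("Lord's", 1), ("2009", 1)],
    [("India", 1), ("4", 1), ("Mumbai", 1), ("2012", 1)],
]

expanded = expand_rowspans(rows)
print(expanded[2])  # ['2=', 'India', '4', 'Mumbai', '2012']
```

With BeautifulSoup, the input tuples could be built from each cell with cell.get_text() and int(cell.get("rowspan", 1)), and the expanded rows passed to pd.DataFrame together with the headers.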


Code:

from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring
import re
import csv
import pandas as pd


wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)

soup = BeautifulSoup(page)

table = soup.find_all('table')[6]

tmp = table.find_all('tr')

first = tmp[0]
allRows = tmp[1:-1]
#table.find_all('tr')[1:-1]


headers = [header.get_text() for header in first.find_all('th')]

results = [[data.get_text() for data in row.find_all('td')] for row in allRows]

#<td rowspan="2">2=</td>
# list of tuples (index of tr, index of td, total count, text value)
#e.g.
#[(1, 0, 2, u'2=')]
# (<tr> is 1, td position in tr is 0, repeated 2 times, value is 2=)
rowspan = []

for no, tr in enumerate(allRows):
    for td_no, data in enumerate(tr.find_all('td')):
        # has_attr replaces the deprecated has_key in BeautifulSoup 4
        if data.has_attr("rowspan"):
            rowspan.append((no, td_no, int(data["rowspan"]), data.get_text()))


if rowspan:
    for i in rowspan:
        # the row that owns the rowspan cell already has the value in results
        for j in xrange(1, i[2]):
            # insert the value into each following tr covered by the span
            results[i[0]+j].insert(i[1], i[3])


df = pd.DataFrame(data=results, columns=headers)
print df

Output:

  Rank       Opponent No. wins Most recent venue Season
0    1   South Africa        6            Lord's   1951
1   2=    West Indies        4            Lord's   2009
2   2=          India        4            Mumbai   2012
3    4      Australia        3            Sydney   1932
4    5       Pakistan        2      Trent Bridge   1967
5    6      Sri Lanka        1      Old Trafford   2002

It also works for table 10:

  Rank Hundreds            Player Matches Innings Average
0    1       25     Alastair Cook     107     191   45.61
1    2       23   Kevin Pietersen     104     181   47.28
2    3       22     Colin Cowdrey     114     188   44.07
3    3       22     Wally Hammond      85     140   58.46
4    3       22  Geoffrey Boycott     108     193   47.72
5    6       21    Andrew Strauss     100     178   40.91
6    6       21          Ian Bell     103     178   45.30
7   8=       20    Ken Barrington      82     131   58.67
8   8=       20      Graham Gooch     118     215   42.58
9   10       19        Len Hutton      79     138   56.67