Converting BeautifulSoup output into a matrix

L S*_*haw 5 python beautifulsoup matrix

I have scraped web data with BeautifulSoup, but I can't convert the output into a matrix/array that I can manipulate.

from bs4 import BeautifulSoup
import urllib2

headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('http://statsheet.com/mcb/teams/duke/game_stats', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html)

#statdiv = soup.find('div', attrs={'id': 'basic_stats'})  #not needed
table = soup.find('table', attrs={'class': 'sortable statsb'})
rows = table.findAll('tr')
for tr in rows:
  text = []
  cols = tr.findAll('td')
  for td in cols:
    try:
      text = ''.join(td.find(text=True))
    except Exception:
        text = "000"
    print text+",",
  print

Note: the try/except around ''.join(td.find(text=True)) is there to keep the program from failing on blank cells.

Which outputs:

W, GSU, 32, 42, 74, 24-47, 51.1, 15-23, 65.2, 11-24, 45.8, 6, 25, 31, 17, 4, 6, 15, 19,
W, UK, 33, 42, 75, 26-57, 45.6, 15-22, 68.2, 8-18, 44.4, 11, 20, 31, 16, 6, 6, 8, 17,
W, FGCU, 52, 36, 88, 30-63, 47.6, 19-23, 82.6, 9-31, 29.0, 16, 21, 37, 19, 9, 4, 18, 14,
W, @MINN, 40, 49, 89, 30-55, 54.5, 21-26, 80.8, 8-10, 80.0, 10, 22, 32, 12, 12, 4, 15, 21,
W, VCU, 29, 38, 67, 20-48, 41.7, 24-27, 88.9, 3-15, 20.0, 4, 30, 34, 14, 4, 8, 8, 18,
W, Lville, 36, 40, 76, 24-55, 43.6, 23-27, 85.2, 5-20, 25.0, 8, 25, 33, 13, 8, 6, 14, 20,
W, OSU, 23, 50, 73, 24-51, 47.1, 20-27, 74.1, 5-12, 41.7, 8, 29, 37, 11, 3, 5, 8, 19,

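This is perfect, except that now I can't figure out how to get the data into a matrix so that I can manipulate certain columns, add new columns, and so on.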

I've been playing with numpy, but every time I try I end up with something like this:

[u'W,']
[u'GSU,']
[u'32,']
[u'42,']
[u'74,']
[u'24-47,']
[u'51.1,']
[u'15-23,']
[u'65.2,']
[u'11-24,']
[u'45.8,']
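Output like this suggests that each cell ended up in its own list or array. A minimal sketch of the alternative, assuming Python 3 and bs4: collect the cell texts of each row into one list and append that list to an outer list, which gives a list of rows that numpy or pandas can take as 2-D input. The HTML snippet is made up for illustration, and `cell_text` is a hypothetical helper that substitutes "000" for blank cells, mirroring the try/except in the loop above.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the scraped table; note the empty <td>.
html = """
<table class="sortable statsb">
  <tr><th>Result</th><th>Opp</th><th>1H</th><th>FG</th><th>FG%</th></tr>
  <tr><td>W</td><td>GSU</td><td>32</td><td>24-47</td><td></td></tr>
  <tr><td>W</td><td>UK</td><td>33</td><td>26-57</td><td>45.6</td></tr>
</table>
"""


def cell_text(td, default="000"):
    """Return the stripped text of a cell, or `default` if the cell is blank."""
    return td.get_text(strip=True) or default


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # the snippet contains a single table

rows = []  # one inner list per table row
for tr in table.find_all("tr"):
    cells = [cell_text(td) for td in tr.find_all("td")]
    if cells:  # rows made only of <th> cells yield an empty list; skip them
        rows.append(cells)

print(rows)
```

Every value in `rows` is still a string, so numeric columns need an explicit conversion before any arithmetic.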

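What I want is to take my scraped data and be able to add columns, move columns, change the text in a column, and split the data in one column into two (the hyphenated columns).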

This is my second day with Python. I'm assuming that putting my data into a matrix/array is the easiest approach. If it isn't, please let me know.

roo*_*oot 7

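You can use pandas. This example converts the data into a pandas DataFrame, which offers convenient methods for further processing, such as splitting columns or converting them to different data types.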


From the documentation:

DataFrame is a 2-dimensional labeled data structure with columns of potentially
different types. You can think of it like a spreadsheet or SQL table, or a dict
of Series objects. It is generally the most commonly used pandas object. Like
Series, DataFrame accepts many different kinds of input.

import pandas as pd

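# `soup` is the BeautifulSoup object built in the question
table  = soup.find('table', attrs={'class': 'sortable statsb'})
header = [th.text for th in table.find('thead').select('th')]
# replace the first two header entries with distinct blank labels
header[:2] = ['', ' ']
# one list of cell texts per row that carries the onmouseover attribute
body   = [[td.text for td in row.select('td')]
          for row in table.findAll('tr', attrs={"onmouseover": "hl(this)"})]
cols   = zip(*body)  # transpose the rows into columns
tbl_d  = {name: col for name, col in zip(header, cols)}

print(pd.DataFrame(tbl_d, columns=header))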

Output:

              1H  2H   T     FG   FG%     FT   FT%    3PT    3%  OR  DR REB  AS  ST  B  TO  PF
0   W     GSU  32  42  74  24-47  51.1  15-23  65.2  11-24  45.8   6  25  31  17   4  6  15  19
1   W      UK  33  42  75  26-57  45.6  15-22  68.2   8-18  44.4  11  20  31  16   6  6   8  17
2   W    FGCU  52  36  88  30-63  47.6  19-23  82.6   9-31  29.0  16  21  37  19   9  4  18  14
3   W   @MINN  40  49  89  30-55  54.5  21-26  80.8   8-10  80.0  10  22  32  12  12  4  15  21
4   W     VCU  29  38  67  20-48  41.7  24-27  88.9   3-15  20.0   4  30  34  14   4  8   8  18
5   W  Lville  36  40  76  24-55  43.6  23-27  85.2   5-20  25.0   8  25  33  13   8  6  14  20
6   W     OSU  23  50  73  24-51  47.1  20-27  74.1   5-12  41.7   8  29  37  11   3  5   8  19
7   W      UD  42  46  88  35-67  52.2  11-21  52.4   7-20  35.0  10  37  47  23   9  8  11  19
8   W  TEMPLE  46  44  90  28-59  47.5  22-29  75.9  12-20  60.0  11  27  38  19   6  2   6  15
9   W    CORN  41  47  88  34-60  56.7  13-17  76.5   7-21  33.3   4  26  30  23  11  7  10  11
10  W    ELON  35  41  76  29-67  43.3   7-16  43.8  11-19  57.9  15  28  43  16  10  2  12  13
11  W     SCU  38  52  90  31-63  49.2  21-33  63.6   7-14  50.0  14  26  40  15   6  2  11  17
12  W    @DAV  29  38  67  21-46  45.7  20-22  90.9   5-11  45.5   6  27  33   8   5  5  12  17
13  W     WFU  41  39  80  29-63  46.0  11-22  50.0  11-24  45.8  10  25  35  22   6  4   6  16
14  W    CLEM  25  43  68  27-56  48.2   6-14  42.9   8-15  53.3  13  29  42  13   8  1  13  12
15  L   @NCSU  39  37  76  30-67  44.8  10-12  83.3   6-20  30.0  13  22  35  10   6  4  12  20
16  W      GT  27  46  73  26-65  40.0  11-16  68.8  10-21  47.6  15  25  40  12  10  5  13  18
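Once the data is in a DataFrame, the operations the question asks about (splitting a hyphenated column, changing text, adding and moving columns) might look roughly like the sketch below. It assumes Python 3 and a recent pandas, and uses three rows copied from the output above so that it runs without the network. The labels "Result" and "Opp" for the two unnamed columns, and the new column names "FGM", "FGA", "Away" and "FG%_check", are hypothetical names chosen for illustration.

```python
import pandas as pd

# Three rows copied from the output above, truncated to the first seven columns.
# "Result" and "Opp" are hypothetical labels for the two unnamed columns.
header = ["Result", "Opp", "1H", "2H", "T", "FG", "FG%"]
body = [
    ["W", "GSU", "32", "42", "74", "24-47", "51.1"],
    ["W", "UK", "33", "42", "75", "26-57", "45.6"],
    ["L", "@NCSU", "39", "37", "76", "30-67", "44.8"],
]

# A list of rows can be passed directly, without transposing into a dict.
df = pd.DataFrame(body, columns=header)

# Split the hyphenated "made-attempted" column into two integer columns.
made_att = df["FG"].str.split("-", expand=True).astype(int)
df["FGM"] = made_att[0]
df["FGA"] = made_att[1]
df = df.drop(columns="FG")

# Convert the remaining numeric text columns to numbers.
df = df.astype({"1H": int, "2H": int, "T": int, "FG%": float})

# Change the text in a column: flag away games, then drop the leading "@".
df["Away"] = df["Opp"].str.startswith("@")
df["Opp"] = df["Opp"].str.lstrip("@")

# Add a computed column: shooting percentage recomputed from the split values.
df["FG%_check"] = (100 * df["FGM"] / df["FGA"]).round(1)

# Move columns by selecting them in the desired order.
df = df[["Opp", "Away", "Result", "1H", "2H", "T",
         "FGM", "FGA", "FG%", "FG%_check"]]

print(df)
```

The same pattern should carry over to the other hyphenated columns (FT and 3PT) by repeating the split with different target names.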