Extracting selected columns from a table using BeautifulSoup

Question

mac*_*389 11 python beautifulsoup html-parsing

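I am trying to use BeautifulSoup to extract the first and third columns of this data table. Looking at the HTML, the first column uses a <th> tag, while the other column of interest uses <td> tags. Either way, all I can get is a list of the column's cells with the tags still attached. I only want the text.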

table is already a list, so I can't call findAll(text=True) on it. I don't know how to get the first column as a list in any other form.

from BeautifulSoup import BeautifulSoup
from sys import argv
import re

filename = argv[1] #get HTML file as a string
html_doc = ''.join(open(filename,'r').readlines())
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody.th.findAll('th') #The relevant table is the first one

print table

Answer 1

jon*_*hkr 31

You can try this code:

# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
# Download the page and parse it
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Walk every row of the first table on the page
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    # .contents is the list of child nodes inside the cell
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column

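As you can see, the code simply connects to the URL and fetches the HTML. BeautifulSoup then finds the first table, iterates over all of its 'tr' rows, and picks the first column, which is a 'th', and the third column, which is a 'td'.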
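
For readers on Python 3, the same approach can be sketched with the `bs4` package (BeautifulSoup 4), using `get_text()` so that only the text comes back rather than the cell contents. This is a minimal sketch and not part of the original answer: the `SAMPLE_HTML` table and the `extract_columns` helper are hypothetical stand-ins for the real page, and the code follows the answer's indexing by taking the first `<th>` and the third `<td>` of each row.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table standing in for the real page (assumption)
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Alabama</th><td>1.0</td><td>2.0</td><td>3.0</td></tr>
    <tr><th>Alaska</th><td>4.0</td><td>5.0</td><td>6.0</td></tr>
  </tbody>
</table>
"""


def extract_columns(html):
    """Return (row header text, third <td> text) for each row of the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # the relevant table is assumed to be the first one
    rows = []
    for tr in table.find_all("tr"):
        header = tr.find("th")
        cells = tr.find_all("td")
        # Skip rows lacking a <th> or having fewer than three <td> cells
        if header is None or len(cells) < 3:
            continue
        rows.append((header.get_text(strip=True), cells[2].get_text(strip=True)))
    return rows


result = extract_columns(SAMPLE_HTML)
print(result)
```

Searching for 'tr' directly under the table, instead of going through `.tbody`, keeps the sketch working when the markup has no explicit <tbody> element.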
