que*_*ter 3 python wikipedia beautifulsoup html-parsing web-scraping
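I am trying to retrieve 3 columns (NFL team, player name, college team) from the following Wikipedia page. I am new to Python and have been trying to use BeautifulSoup to get this done. I only need the columns that belong to QBs, but I can't even get all the columns regardless of position. This is what I have so far; it produces no output and I am not entirely sure why. I believe it is due to the tags, but I don't know what to change. Any help would be greatly appreciated.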
import urllib2
from bs4 import BeautifulSoup  # assumed: the bs4 package (BeautifulSoup 4)

wiki = "http://en.wikipedia.org/wiki/2008_NFL_draft"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
rnd = ""
pick = ""
NFL = ""
player = ""
pos = ""
college = ""
conf = ""
notes = ""
table = soup.find("table", { "class" : "wikitable sortable" })
#print table
#output = open('output.csv','w')
for row in table.findAll("tr"):
    cells = row.findAll("href")
    print "---"
    print cells.text
    print "---"
    #For each "tr", assign each "td" to a variable.
    #if len(cells) > 1:
        #NFL = cells[1].find(text=True)
        #player = cells[2].find(text = True)
        #pos = cells[3].find(text=True)
        #college = cells[4].find(text=True)
        #write_to_file = player + " " + NFL + " " + college + " " + pos
        #print write_to_file
        #output.write(write_to_file)

#output.close()
I know a lot of it is commented out; that is because I was trying to find where it breaks.
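Here is what I would do:

- find the Player Selections paragraph
- get the next wikitable using find_next_sibling()
- find all tr tags inside it
- for every row, find the td and th tags and get the desired cells by index

Here is the code: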
filter_position = 'QB'
player_selections = soup.find('span', id='Player_selections').parent
for row in player_selections.find_next_sibling('table', class_='wikitable').find_all('tr')[1:]:
    cells = row.find_all(['td', 'th'])
    try:
        nfl_team, name, position, college = cells[3].text, cells[4].text, cells[5].text, cells[6].text
    except IndexError:
        continue

    if position != filter_position:
        continue

    print nfl_team, name, position, college
Here is the output (filtered to quarterbacks only):
Atlanta Falcons Ryan, MattMatt Ryan† QB Boston College
Baltimore Ravens Flacco, JoeJoe Flacco QB Delaware
Green Bay Packers Brohm, BrianBrian Brohm QB Louisville
Miami Dolphins Henne, ChadChad Henne QB Michigan
New England Patriots O'Connell, KevinKevin O'Connell QB San Diego State
Minnesota Vikings Booty, John DavidJohn David Booty QB USC
Pittsburgh Steelers Dixon, DennisDennis Dixon QB Oregon
Tampa Bay Buccaneers Johnson, JoshJosh Johnson QB San Diego
New York Jets Ainge, ErikErik Ainge QB Tennessee
Washington Redskins Brennan, ColtColt Brennan QB Hawaiʻi
New York Giants Woodson, Andre'Andre' Woodson QB Kentucky
Green Bay Packers Flynn, MattMatt Flynn QB LSU
Houston Texans Brink, AlexAlex Brink QB Washington State
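
For readers on Python 3, the same steps can be sketched roughly as follows. This is not the code from the answer: it runs against a small inline HTML fragment (hypothetical markup that only imitates the page layout the answer relies on) so it works offline, and the function name `players_at_position` is made up for illustration. It also assumes the player cell holds a hidden sort key followed by a link with the display name, which would explain doubled names such as "Ryan, MattMatt Ryan" in the output above; reading the link text avoids that.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a heading span with id="Player_selections" followed
# by a sibling "wikitable". Column order mirrors the indices used above
# (3 = NFL team, 4 = player, 5 = position, 6 = college).
SAMPLE_HTML = """
<h2><span id="Player_selections">Player selections</span></h2>
<table class="wikitable sortable">
<tr><th></th><th>Rnd.</th><th>Pick No.</th><th>NFL team</th><th>Player</th><th>Pos.</th><th>College</th></tr>
<tr><th></th><th>1</th><td>1</td><td>Miami Dolphins</td>
    <td><span style="display:none">Long, Jake</span><a href="/wiki/Jake_Long">Jake Long</a></td>
    <td>OT</td><td>Michigan</td></tr>
<tr><th></th><th>1</th><td>3</td><td>Atlanta Falcons</td>
    <td><span style="display:none">Ryan, Matt</span><a href="/wiki/Matt_Ryan">Matt Ryan</a></td>
    <td>QB</td><td>Boston College</td></tr>
<tr><td colspan="7">A row without the usual cells</td></tr>
<tr><th></th><th>1</th><td>18</td><td>Baltimore Ravens</td>
    <td><span style="display:none">Flacco, Joe</span><a href="/wiki/Joe_Flacco">Joe Flacco</a></td>
    <td>QB</td><td>Delaware</td></tr>
</table>
"""


def players_at_position(html, position):
    """Return (team, name, position, college) tuples for one position."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: find the heading, step 2: take the next sibling wikitable.
    heading = soup.find("span", id="Player_selections").parent
    table = heading.find_next_sibling("table", class_="wikitable")

    results = []
    # Step 3: walk every row except the header row.
    for row in table.find_all("tr")[1:]:
        # Step 4: collect td and th cells and pick them by index.
        cells = row.find_all(["td", "th"])
        if len(cells) < 7:
            continue  # skip rows that lack the expected columns

        # Assumption: the display name is the text of the first link in the
        # player cell; fall back to the whole cell if there is no link.
        link = cells[4].find("a")
        name_source = link if link is not None else cells[4]

        team = cells[3].get_text(strip=True)
        name = name_source.get_text(strip=True)
        pos = cells[5].get_text(strip=True)
        college = cells[6].get_text(strip=True)

        if pos == position:
            results.append((team, name, pos, college))
    return results


if __name__ == "__main__":
    for team, name, pos, college in players_at_position(SAMPLE_HTML, "QB"):
        print(team, name, pos, college)
```

One note on the loop in the question: findAll("href") searches for tags named href, but href is an attribute of a tags, so the search matches nothing. findAll also returns a list of results rather than a single tag, so .text has to be read from each element, not from the list itself.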