How to extract the infobox vcard from Wikipedia using the Python wikipedia library

Question

How to extract the infobox vcard from Wikipedia using the Python wikipedia library

Mic*_*hal 4 python beautifulsoup wikipedia-api

I have been trying to extract the infobox content using the wikipedia Python package.

My code is as follows (for this page):

import wikipedia
Aldi = wikipedia.page('Aldi')

When I enter:

Aldi.content

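I get the article text, but not the infobox.

I tried getting the data from DBpedia, without success. I also tried extracting the page with BeautifulSoup4, but the table is structured oddly (an image spans two columns, followed by unnamed columns).

Here is how far I got with BeautifulSoup: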

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen  # Python 3 replacement for urllib2
site = "http://en.wikipedia.org/wiki/Aldi"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, 'html.parser')
print(soup)

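I also looked at Wikidata, but it doesn't contain most of the information I need from the table.

I'm not tied to the Python package as a solution. Anything that can parse that table would be great.

Ideally, I'd like a dictionary containing the infobox values: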

Type     Private
Industry Retail

and so on.

Answer 1

ZZY*_*ZZY 5

A BeautifulSoup-based solution:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen  # Python 3 replacement for urllib2
site = "http://en.wikipedia.org/wiki/Aldi"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page.read(), 'html.parser')
# match a table carrying both classes, even if other classes are present
table = soup.select_one('table.infobox.vcard')
result = {}
exceptional_row_count = 0
for tr in table.find_all('tr'):
    th, td = tr.find('th'), tr.find('td')
    if th and td:
        result[th.text] = td.text
    else:
        # rows without a th/td pair (such as the first row, the logo) fall here
        exceptional_row_count += 1
if exceptional_row_count > 1:
    print('WARNING ExceptionalRow>1: ', table)
print(result)

Tested on http://en.wikipedia.org/wiki/Aldi, but not fully tested on other wiki pages.
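
Since the code above was only checked against one page, it may help to have a version of the same row-walking idea that can be tried offline. The following is a minimal sketch, not the answer's method: it swaps BeautifulSoup for the standard library's html.parser so it has no third-party dependencies, and the names InfoboxParser, parse_infobox and the SAMPLE markup are hypothetical (SAMPLE is hand-written and only imitates the shape of an infobox, it is not real Wikipedia output). Rows that lack either a header or a value are counted rather than treated as errors.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect th/td pairs from the first table whose class list has 'infobox'."""

    def __init__(self):
        super().__init__()
        self.depth = 0            # table nesting depth; 0 means outside the infobox
        self.cell = None          # 'th' or 'td' while inside a top-level cell
        self.row = {'th': [], 'td': []}
        self.result = {}
        self.skipped_rows = 0     # rows without both a header and a value
        self.done = False         # set once the infobox table has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == 'table':
            classes = (dict(attrs).get('class') or '').split()
            if self.depth or 'infobox' in classes:
                self.depth += 1
        elif self.depth == 1:
            if tag == 'tr':
                self.row = {'th': [], 'td': []}
            elif tag in ('th', 'td'):
                self.cell = tag

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == 'table':
            self.depth -= 1
            if self.depth == 0:
                self.done = True
        elif self.depth == 1:
            if tag in ('th', 'td'):
                self.cell = None
            elif tag == 'tr':
                self._finish_row()

    def handle_data(self, data):
        if self.depth and self.cell:
            self.row[self.cell].append(data)

    def _finish_row(self):
        # Join the text chunks of each cell and collapse runs of whitespace.
        key = ' '.join(''.join(self.row['th']).split())
        value = ' '.join(''.join(self.row['td']).split())
        if key and value:
            self.result[key] = value
        else:
            self.skipped_rows += 1
        self.row = {'th': [], 'td': []}


def parse_infobox(html):
    """Return (dict of header -> value, number of skipped rows)."""
    parser = InfoboxParser()
    parser.feed(html)
    parser.close()
    return parser.result, parser.skipped_rows


# Hand-written stand-in for an infobox; not real Wikipedia markup.
SAMPLE = """
<table class="infobox vcard">
  <tr><td colspan="2"><img src="logo.png"></td></tr>
  <tr><th>Type</th><td>Private</td></tr>
  <tr><th>Industry</th><td><a href="/wiki/Retail">Retail</a></td></tr>
  <tr><th colspan="2">Section header</th></tr>
</table>
"""

if __name__ == '__main__':
    result, skipped = parse_infobox(SAMPLE)
    print(result)
    print('skipped rows:', skipped)
```

To try it on a live page, the decoded HTML of the article (for example, fetched with urllib.request and a User-Agent header as in the code above) can be passed to parse_infobox. How well it copes with real infoboxes on other pages is untested here.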
