Post by A2D*_*2D2

Extracting data from an HTML file with BeautifulSoup and Python

I need to extract data from HTML files. The files in question are most likely automatically generated. I have uploaded the source of one of the files to Pastebin at http://pastebin.com/9Nj2Edfv, and this is the link to the actual page: http://eur-lex.europa.eu/Notice.do?checktexts=checkbox&val=60504%3A&call=1&page=1&lang=en&pgs=10&nbl=1&list=60504%3Acs%C2&hwords=&action=GO&VISU=%23texte

The data I need to extract can be found under different headings.

This is what I have so far:

from BeautifulSoup import BeautifulSoup
ecj_data = open(r"data\ecj_1.html", 'r').read()  # raw string keeps the backslash literal

soup = BeautifulSoup(ecj_data)

celex = soup.find('h1')
auth_lang = soup('ul', limit=14)[13].li
procedure = soup('ul', limit=20)[17].li

print "Celex number:", celex.renderContents(),
print "Authentic language:", auth_lang
print "Type of procedure:", procedure


I have all the data stored locally, which is why the script opens the file ecj_1.html.
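For illustration, here is a minimal sketch of looking values up by the heading they sit under instead of by a fixed position such as `[13]` or `[17]`. It makes several assumptions that are not confirmed by the actual page: that each heading is an `<h2>` followed by a `<ul>` of `<li>` items, and that the sample markup and values below (all hypothetical) resemble the real file. It also uses the standard library's `html.parser` on Python 3 rather than BeautifulSoup, purely so it runs without extra installs. The names `HeadingCollector` and `extract_sections` are made up for this sketch.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real EUR-Lex page may be structured differently.
SAMPLE_HTML = """
<h2>Authentic language</h2>
<ul><li>Italian</li></ul>
<h2>Type of procedure</h2>
<ul><li>Reference for a preliminary ruling</li></ul>
"""


class HeadingCollector(HTMLParser):
    """Collect the text of <li> items, grouped under the preceding <h2> text."""

    def __init__(self):
        super().__init__()
        self.sections = {}    # heading text -> list of <li> texts
        self._heading = None  # most recent heading seen so far
        self._tag = None      # tag currently being read ("h2", "li" or None)
        self._buffer = []     # text fragments of the tag being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "li"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "h2":
            self._heading = text
            self.sections.setdefault(text, [])
        elif self._heading is not None:
            # An <li> only counts once a heading has been seen.
            self.sections[self._heading].append(text)
        self._tag = None


def extract_sections(html):
    """Return a dict mapping each heading to the list items below it."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.sections


sections = extract_sections(SAMPLE_HTML)
print("Authentic language:", sections["Authentic language"][0])
print("Type of procedure:", sections["Type of procedure"][0])
```

Keying on the heading text means the lookup should keep working if a file has more or fewer lists before the one of interest, which a positional index cannot guarantee. The same idea carries over to BeautifulSoup: find the heading element by its text, then read the list that follows it.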

The Celex number and the authentic language sort of work.

celex returns