Web scraping with BeautifulSoup or lxml.html

Mer*_*lin 0 python yahoo lxml beautifulsoup web-scraping

I've watched some webcasts and need help doing this. I've been using lxml.html, but Yahoo recently changed its page structure.

Target page:

http://finance.yahoo.com/quote/IBM/options?date=1469750400&straddle=true

In Chrome, using the inspector, I can see the data at:

 //*[@id="main-0-Quote-Proxy"]/section/section/div[2]/section/section/table

Now, for some code:

How can I get this data into a list? I'd also like to switch to other stocks, e.g. from "LLY" to "MSFT".
How do I switch between dates... and get all the months?
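One way to switch tickers and expiry dates is to build the URL programmatically. This is a sketch based only on the URL pattern shown in the question; it assumes Yahoo's `date` parameter is a Unix timestamp of the expiry at UTC midnight (the question's `date=1469750400` decodes to 2016-07-29 UTC), and the site may change this format at any time:

```python
from datetime import date, datetime, timezone

def options_url(symbol, expiry):
    """Build the straddle-view options URL for one ticker and expiry date.

    Pattern taken from the question; `options_url` itself is a
    hypothetical helper, not part of any Yahoo API.
    """
    # Yahoo encodes the expiry as a Unix timestamp (UTC midnight).
    ts = int(datetime(expiry.year, expiry.month, expiry.day,
                      tzinfo=timezone.utc).timestamp())
    return ("http://finance.yahoo.com/quote/%s/options?date=%d&straddle=true"
            % (symbol.upper(), ts))

# Reproduce the question's URL, then swap in another ticker:
print(options_url("IBM", date(2016, 7, 29)))
print(options_url("msft", date(2016, 7, 29)))
```

Looping over a list of symbols and a list of expiry dates then gives one URL per (stock, month) combination to feed into the parser.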

nos*_*klo 7

I know you said you can't use lxml.html. But here is how to do it with that library anyway, because it is a very good library. I'm providing this code for completeness, since I don't use BeautifulSoup anymore - it's unmaintained, slow and has an awkward API.

The code below parses the page and writes the results to a csv file.

import lxml.html
import csv

doc = lxml.html.parse('http://finance.yahoo.com/q/os?s=lly&m=2011-04-15')
# find the first table containing any tr with a td of class yfnc_tabledata1
table = doc.xpath("//table[tr/td[@class='yfnc_tabledata1']]")[0]

with open('results.csv', 'w', newline='') as f:  # use 'wb' on Python 2
    cf = csv.writer(f)
    # find all trs inside that table:
    for tr in table.xpath('./tr'):
        # add the text of all tds inside each tr to a list
        row = [td.text_content().strip() for td in tr.xpath('./td')]
        # write the list to the csv file:
        cf.writerow(row)

That's it! lxml.html is really easy! Too bad you can't use it.

Here are a few rows from the generated results.csv file:

LLY110416C00017500,N/A,0.00,17.05,18.45,0,0,17.50,LLY110416P00017500,0.01,0.00,N/A,0.03,0,182
LLY110416C00020000,15.70,0.00,14.55,15.85,0,0,20.00,LLY110416P00020000,0.06,0.00,N/A,0.03,0,439
LLY110416C00022500,N/A,0.00,12.15,12.80,0,0,22.50,LLY110416P00022500,0.01,0.00,N/A,0.03,2,50
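If lxml isn't available, the same find-table / iterate-rows / write-csv logic can be sketched with only the standard library's `xml.etree.ElementTree`. This is a minimal offline illustration: the sample HTML below is made up to mirror the old Yahoo markup, and ElementTree's limited XPath means the table filter is done in Python rather than in one expression:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical snippet imitating the old Yahoo options table markup.
SAMPLE = """
<html><body>
  <table>
    <tr><td class="yfnc_tabledata1">LLY110416C00017500</td>
        <td class="yfnc_tabledata1">17.50</td></tr>
    <tr><td class="yfnc_tabledata1">LLY110416C00020000</td>
        <td class="yfnc_tabledata1">20.00</td></tr>
  </table>
</body></html>
"""

doc = ET.fromstring(SAMPLE)
# Keep only tables that contain a td with the marker class,
# mirroring the answer's XPath filter.
tables = [t for t in doc.iter('table')
          if t.find(".//td[@class='yfnc_tabledata1']") is not None]
table = tables[0]

buf = io.StringIO()
writer = csv.writer(buf)
for tr in table.findall('tr'):
    # join all text inside each td, like lxml's text_content()
    row = [''.join(td.itertext()).strip() for td in tr.findall('td')]
    writer.writerow(row)

print(buf.getvalue())
```

This only works on well-formed markup; for real pages, lxml.html (or html.parser) is more forgiving of broken HTML.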