Web Scraper using Scrapy

Pro*_*loy 1 python scrapy

I just need to parse the positions and points from this link. The link has 21 listings (I don't actually know what to call them) [screenshot], and each listing has 40 players [screenshot], except the last one. Right now I have written code like this:

from bs4 import BeautifulSoup
import urllib2

def overall_standing():
    url_list = ["http://www.afl.com.au/afl/stats/player-ratings/overall-standings#", 
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/3",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/4",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/5",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/6",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/7",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/8",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/9",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/10",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/11",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/12",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/13",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/14",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/15",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/16",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/17",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/18",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/19",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/20",
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/21"]

    gDictPlayerPointsInfo = {}
    for url in url_list:
        print url
        header = {'User-Agent': 'Mozilla/5.0'}
        req = urllib2.Request(url,headers=header)
        page = urllib2.urlopen(req)
        soup = BeautifulSoup(page)
        table = soup.find("table", { "class" : "ladder zebra player-ratings" })

        lCount = 1
        for row in table.find_all("tr"):
            lPlayerName = ""
            lTeamName = ""
            lPosition = ""
            lPoint = ""
            for cell in row.find_all("td"):
                if lCount == 2:
                    lPlayerName = str(cell.get_text()).strip().upper()
                elif lCount == 3:
                    lTeamName = str(cell.get_text()).strip().split("\n")[-1].strip().upper()
                elif lCount == 4:
                    lPosition = str(cell.get_text().strip())
                elif lCount == 6:
                    lPoint = str(cell.get_text().strip())

                lCount += 1

            if url == "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2":
                print lTeamName, lPlayerName, lPoint
            if lPlayerName != "" and lTeamName != "":
                lStr = lPosition + "," + lPoint

#                 if gDictPlayerPointsInfo.has_key(lTeamName):
#                     gDictPlayerPointsInfo[lTeamName].append({lPlayerName:lStr})
#                 else:
                gDictPlayerPointsInfo[lTeamName+","+lPlayerName] = lStr
            lCount = 1


    lfp = open("a.txt","w")
    for key in gDictPlayerPointsInfo:
        if key.find("RICHMOND"):
            lfp.write(str(gDictPlayerPointsInfo[key]))

    lfp.close()
    return gDictPlayerPointsInfo


# overall_standing()

But the problem is that it always gives me the points and positions from the first listing only and ignores the other 20. How can I get the positions and points for all 21? I have heard that Scrapy can do this sort of thing easily, but I am not really familiar with it. Is there any other way besides using Scrapy?

Far*_*Joe 5

This happens because only part of each of those links is processed by the server. The portion of a URL after the # symbol, called a fragment identifier, is handled by the browser and refers to an anchor or to some JavaScript behaviour, in this case loading a different set of results.

I would suggest two approaches: either find a way to use links that the server can evaluate, in which case you can continue using Scrapy, or use a webdriver such as Selenium.

Scrapy

Your first step is to identify the JavaScript loading calls, often AJAX, and use those links to pull your information; these are calls to the site's database. You can find them by opening your web inspector and watching the network traffic while you click through to the next page of results:

[screenshot: network traffic before clicking]

and then after clicking:

[screenshot: network traffic after clicking, showing the new request]

We can see that there is a new call to this url:

http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=3&pageSize=40

This url returns a JSON file that you can parse, and it even lets you shorten your steps, since it looks like you can control how much information is returned to you.
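Before writing any spider, it can help to hit that endpoint once directly and inspect the JSON it returns. Here is a minimal check using urllib2, the same library your code already uses; the structure of the JSON is not shown on this page, so print it first rather than assuming field names:

import json
import urllib2

url = ("http://www.afl.com.au/api/cfs/afl/playerRatings"
       "?roundId=CD_R201401408&pageNum=1&pageSize=40")
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.load(urllib2.urlopen(req))

# Print the top-level structure first, then drill down to the player,
# position and points fields you actually need.
print json.dumps(data, indent=2)[:1000]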

You could write a method that generates the series of links for you:

def gen_url(page_no):
    return "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=" + str(page_no) + "&pageSize=40"

and then, for example, use them as a Scrapy seed list:

seed = [gen_url(i) for i in range(1, 22)]  # pageNum looks 1-based; 21 pages in total
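If you stay with Scrapy, a minimal spider built on those API links might look roughly like the sketch below. Treat it as a sketch, not a drop-in solution: the class name is made up, yielding plain dicts requires a reasonably recent Scrapy, and the JSON key and field names ("playerRatings", "player", "position", "ratingPoints") are guesses you would need to confirm against the actual response:

import json
import scrapy

API = ("http://www.afl.com.au/api/cfs/afl/playerRatings"
       "?roundId=CD_R201401408&pageNum=%d&pageSize=40")

class PlayerRatingsSpider(scrapy.Spider):
    name = "player_ratings"
    start_urls = [API % page for page in range(1, 22)]  # pages 1..21

    def parse(self, response):
        data = json.loads(response.body)
        # The key and field names below are assumptions -- inspect the
        # real JSON in your network tab and adjust them accordingly.
        for entry in data.get("playerRatings", []):
            yield {
                "player": entry.get("player"),
                "position": entry.get("position"),
                "points": entry.get("ratingPoints"),
            }

Running it with scrapy runspider spider.py -o ratings.json (or wiring it into a project) would then pull all 21 pages without touching a browser.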

Or you could play around with the url parameters and see what you get back; perhaps you can retrieve multiple pages at once:

http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=1&pageSize=200

Here I changed the pageSize parameter at the end to 200, since it appears to correspond directly to the number of results returned.

Note: it is possible this method will not work, as sites sometimes block their data API from being used externally by filtering the IP addresses that requests come from.

If that is the case, you should go with the following approach.

Selenium (or another webdriver)

Using something like the Selenium webdriver, you can work with what gets loaded into a browser, which lets you evaluate data that is loaded after the server has returned the page.

There is some initial setup needed before Selenium will be usable, but once you have it, it is a very powerful tool.

A simple example of this would be:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")

You will see a Python-controlled Firefox browser (this can be done with other browsers too) open on your screen and load the URL you supply. It then follows the commands you give, which can even be issued from a shell (useful for debugging), and you can search and parse the HTML in much the same way you would with Scrapy (the code continues from the previous section...).
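For example, one simple way to hand the rendered page over to BeautifulSoup (as your original code does) is through driver.page_source. This is just a sketch: the table class is copied from your code and should be checked against what the browser actually renders:

from bs4 import BeautifulSoup

# driver is the webdriver.Firefox() instance created above
soup = BeautifulSoup(driver.page_source)
table = soup.find("table", {"class": "ladder zebra player-ratings"})
if table is not None:
    for row in table.find_all("tr"):
        cells = [cell.get_text().strip() for cell in row.find_all("td")]
        if cells:
            print cells  # one list of cell texts per player row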

If you want to do something like find the next-page buttons:

driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")

This expression may need some tweaking, but the intent is to find all the li elements with class='page' that sit inside a div with class='pagination'. The // means the path between elements is abbreviated; your alternative would be something like /html/body/div/div/... all the way down to the element in question, which is why //div/... is so useful and appealing.

For specific help and references on locating elements, see the Selenium documentation page.

My usual approach is trial and error, tweaking the expression until it hits the target elements I want. This is where the console/shell comes in handy: after setting up the driver as above, I usually start building up my expression.

Say you have an html structure like this:

<html>
    <head></head>
    <body>
        <div id="container">
             <div id="info-i-want">
                  treasure chest
             </div>
        </div>
    </body>
</html> 

I would start with something like:

>>> print driver.find_element_by_xpath("//body").get_attribute("outerHTML")
<body>
    <div id="container">
         <div id="info-i-want">
              treasure chest
         </div>
    </div>
</body>
>>> print driver.find_element_by_xpath("//div[@id='container']").get_attribute("outerHTML")
<div id="container">
     <div id="info-i-want">
          treasure chest
     </div>
</div>
>>> print driver.find_element_by_xpath("//div[@id='info-i-want']").get_attribute("outerHTML")
<div id="info-i-want">
     treasure chest
</div>
>>> print driver.find_element_by_xpath("//div[@id='info-i-want']").text
treasure chest
>>> # BOOM TREASURE!

Usually it will be more complex than this, but it is a good, and often necessary, debugging tactic.

Back to your case, you could save the pagination links into an array:

links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")

and then click them one by one, scrape the new data, and click the next one:

import time
from selenium import webdriver

driver = None
try:
    driver = webdriver.Firefox()
    driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")

    #
    # Scrape the first page
    #

    links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")

    for link in links:
        link.click()
        #
        # scrape the next page
        #
        time.sleep(1) # pause for a time period to let the data load
finally:
    if driver:
        driver.close()

It is best to wrap it all in a try...finally style block to make sure the driver instance gets closed.
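As a side note, the time.sleep(1) above is a blunt instrument: it waits a fixed second whether or not the data has finished loading. If you want something less fragile, Selenium's explicit waits can block until a given element is actually present. A rough sketch, where the XPath is an assumption based on the table class from your own code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# after each click, wait up to 10 seconds for the ratings table to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, "//table[contains(@class, 'player-ratings')]")
    )
)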

If you decide to dig into the Selenium approach, refer to their documentation, which is excellent, very explicit, and full of examples.

Happy scraping!