
How do I parse a sitemap.xml file with Scrapy's XMLFeedSpider?

I am trying to parse a sitemap.xml file with Scrapy. The sitemap looks like the file below, only with many more url nodes.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.sitemaps.org/schemas/sitemap-video/1.1">
    <url>
        <loc>
            http://www.site.com/page.html
        </loc>
        <video:video>
            <video:thumbnail_loc>
                http://www.site.com/thumb.jpg
            </video:thumbnail_loc>
            <video:content_loc>http://www.example.com/video123.flv</video:content_loc>
            <video:player_loc allow_embed="yes" autoplay="ap=1">
                http://www.example.com/videoplayer.swf?video=123
            </video:player_loc>
            <video:title>here is the page title</video:title>
            <video:description>and an awesome description</video:description>
            <video:duration>302</video:duration>
            <video:publication_date>2011-02-24T02:03:43+02:00</video:publication_date>
            <video:tag>w00t</video:tag>
            <video:tag>awesome</video:tag>
            <video:tag>omgwtfbbq</video:tag>
            <video:tag>kthxby</video:tag>
        </video:video>
    </url>
</urlset>

I read the relevant Scrapy documentation and wrote the following snippet to check whether I was doing it right (apparently I am not ^^):

class SitemapSpider(XMLFeedSpider):
    name = "sitemap"
    namespaces = [
        ('', 'http://www.sitemaps.org/schemas/sitemap/0.9'),
        ('video', 'http://www.sitemaps.org/schemas/sitemap-video/1.1'),
    ]
    start_urls = ["http://example.com/sitemap.xml"]
    itertag = 'url'

    def parse_node(self, response, node):
        print "Parsing: %s" % str(node)

But when I run the spider, I get this error:

File "/.../python2.7/site-packages/scrapy/utils/iterators.py", line 32, …
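To separate the namespace handling from Scrapy itself, here is a minimal sketch that parses a shortened copy of the sitemap above with the standard library's xml.etree.ElementTree. It assumes the default sitemap namespace is bound to a non-empty prefix; the prefix name `sm` and the helper `parse_sitemap` are made up for illustration and are not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sitemap from the question (same namespaces).
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.sitemaps.org/schemas/sitemap-video/1.1">
    <url>
        <loc>
            http://www.site.com/page.html
        </loc>
        <video:video>
            <video:title>here is the page title</video:title>
            <video:tag>w00t</video:tag>
            <video:tag>awesome</video:tag>
        </video:video>
    </url>
</urlset>
"""

# Assumption: "sm" is an arbitrary, non-empty prefix chosen here for the
# default sitemap namespace; any name works as long as queries use it.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "video": "http://www.sitemaps.org/schemas/sitemap-video/1.1",
}


def parse_sitemap(xml_text):
    """Return one dict per <url> node with its location, title, and tags."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        # <loc> is padded with whitespace in the sample, so strip it.
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        title = url.findtext("video:video/video:title", namespaces=NS)
        tags = [tag.text for tag in url.findall("video:video/video:tag", NS)]
        entries.append({"loc": loc, "title": title, "tags": tags})
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(SITEMAP_XML):
        print(entry)
```

If the empty prefix in `namespaces` is what trips XMLFeedSpider, binding the sitemap namespace to a non-empty prefix and using a prefixed `itertag` may be the analogous change in the spider. I have not confirmed that against the truncated traceback.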

python xml sitemap namespaces scrapy
