Replacing BeautifulSoup with another (standard) HTML parsing module in this Python script

Question

hel*_*ker 0 python beautifulsoup html-parsing

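I wrote a script with BeautifulSoup. It works well and is very readable, but I would like to redistribute it someday, and BeautifulSoup is an external dependency I would like to avoid, especially considering use on Windows.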

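Here is the code. It fetches the link to every user map of a given Google Maps user. The lines marked with trailing #### comments are the ones that use BeautifulSoup: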

# coding: utf-8
# Python 2 script (print statement, urllib.urlopen); uses BeautifulSoup 3.

import urllib, re
from BeautifulSoup import BeautifulSoup as bs

uid = '200931058040775970557'
start = 0   # offset of the first map on the current results page
shown = 1   # running counter printed next to each map

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    soup = bs(source)  ####
    # Every element whose id looks like "map<number>"
    maptables = soup.findAll(id=re.compile('^map[0-9]+$'))  #################
    for table in maptables:
        # Every <a class="maptitle"> inside that element
        for line in table.findAll('a', 'maptitle'):  ################
            mapid = re.search(uid+'\.([^"]*)', str(line)).group(1)
            mapname = re.search('>(.*)</a>', str(line)).group(1).strip()[:-3]
            print shown, mapid, '\t', mapname
            shown += 1

            # Download the map as KML, named after the map title
            urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
                               '&msa=0&output=kml', mapname + '.kml')


    # Results are paged five at a time; stop when there is no "Next" link
    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break


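As you can see, only three lines use BeautifulSoup, but I am not a programmer and had a lot of trouble trying other standard HTML and XML parsing tools, I guess because I was going about it the wrong way.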

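EDIT: This question is more about replacing the three lines of this script than about finding a solution to every generic HTML parsing problem that may exist.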

Any help would be greatly appreciated. Thanks for reading!

Answer 1

Mik*_*ham 5

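Unfortunately, Python does not have useful HTML parsing in the standard library, so the only reasonable way to parse HTML is with a third-party module such as lxml.html or BeautifulSoup. This does not mean you have to have a separate dependency: these modules are free software, and if you do not want an external dependency you are welcome to bundle them with your code, which makes them no more of a dependency than the code you write yourself.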
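
For comparison, here is a rough sketch of what the three BeautifulSoup lines from the question might look like with the standard library's event-based parser. It is not from the original answer, and it rests on several assumptions: it targets Python 3 (`html.parser`; the question's script is Python 2, where the module is named `HTMLParser`), the names `MapLinkParser` and `find_map_links` are made up for illustration, and the page markup is only guessed from the selectors in the question (containers with an id like `map12`, links with class `maptitle`, the map id somewhere in the `href`). It also assumes non-void tags are properly closed; real-world tag soup may break the depth counter, which is part of why the answer recommends a third-party parser.

```python
import re
from html.parser import HTMLParser

MAP_ID = re.compile(r'^map[0-9]+$')
# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'param', 'source', 'track', 'wbr'}


class MapLinkParser(HTMLParser):
    """Collect (href, text) for <a class="maptitle"> inside id="map<N>" elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) pairs
        self._depth = 0      # nesting depth inside a matching container; 0 = outside
        self._href = None    # href of the <a> currently being read, else None
        self._text = []      # text chunks of the <a> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            if tag not in VOID_TAGS:
                self._depth += 1
        elif MAP_ID.match(attrs.get('id') or ''):
            self._depth = 1  # entered a container such as <table id="map12">
        classes = (attrs.get('class') or '').split()
        if self._depth and tag == 'a' and 'maptitle' in classes:
            self._href = attrs.get('href') or ''
            self._text = []

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


def find_map_links(source):
    """Return a list of (href, link text) pairs found in an HTML string."""
    parser = MapLinkParser()
    parser.feed(source)
    parser.close()
    return parser.links
```

In the script, `find_map_links(source)` would stand in for the `bs(source)` and the two `findAll` calls, with the map id then extracted from each `href` by the existing regex. Under Python 3 the downloaded bytes would need decoding first, for example `source.decode('utf-8', 'replace')`, which again assumes the page encoding.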

Archived:	14 years, 3 months ago
Views:	677
Last recorded:	14 years, 3 months ago