Posts by Iss*_*ssn

Python: Extracting specific data with the HTML parser

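I have started using HTMLParser in Python to extract data from a website. I get everything I want except the text between the opening and closing tags of an HTML element. Here is an example of the HTML tag: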

<a href="http://wold.livingsources.org/vocabulary/1" title="Swahili" class="Vocabulary">Swahili</a>


There are also other tags that start with "a". They have other attributes and values, so I do not want their data:

<a href="http://wold.livingsources.org/contributor#schadebergthilo" title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a>


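The tag is embedded in a table. I do not know whether that makes any difference compared to other tags. I only want the information from the tags named "a" that have the attribute class="Vocabulary", and I want the data inside the tag, which in the example is "Swahili". So what I did is: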

from HTMLParser import HTMLParser  # Python 2.7 module name (html.parser in Python 3)

class AllLanguages(HTMLParser):
    '''
    classdocs
    '''
    #counter for the languages
    #countLanguages = 0
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None
        #self.text = ""


    def handle_starttag(self, tag, attr):
        #print "Encountered a start tag:", tag      
        if tag == 'a':
            for name, value in attr:
                if …

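For reference, the goal described above (the text inside "a" tags whose class is "Vocabulary") could be approached roughly as follows. This is only a minimal sketch, not the code from the truncated listing: the class name VocabularyLinkParser, its attributes, and the sample string wrapping the two example links in a table are all assumptions made for illustration. The idea is to set a flag in handle_starttag when a matching tag opens, collect text in handle_data while the flag is set, and store the result in handle_endtag.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class VocabularyLinkParser(HTMLParser):
    """Hypothetical helper: collect the text of <a class="Vocabulary"> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_vocabulary_link = False  # True while inside a matching <a>
        self.current_text = []           # text chunks of the link being read
        self.languages = []              # collected link texts

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict() makes lookup easy.
        if tag == 'a' and dict(attrs).get('class') == 'Vocabulary':
            self.in_vocabulary_link = True
            self.current_text = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_vocabulary_link:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_vocabulary_link:
            self.languages.append(''.join(self.current_text).strip())
            self.in_vocabulary_link = False


# Assumed sample input: the two links from the question inside a table row.
sample = (
    '<table><tr>'
    '<td><a href="http://wold.livingsources.org/vocabulary/1" '
    'title="Swahili" class="Vocabulary">Swahili</a></td>'
    '<td><a href="http://wold.livingsources.org/contributor#schadebergthilo" '
    'title="Thilo Schadeberg" class="Contributor">Thilo Schadeberg</a></td>'
    '</tr></table>'
)

parser = VocabularyLinkParser()
parser.feed(sample)
parser.close()
print(parser.languages)
```

With this approach the surrounding table should not matter, since the parser only reacts to "a" tags, and the Contributor link is skipped because its class value does not match.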

html python html-parsing html-parser python-2.7

Iss*_*ssn

2014-04-28

Score: 3

Answers: 1

Views: 40k