How do I parse HTML with ElementTree to find a specific RegEx?

Pac*_*ver 2 python regex elementtree data-structures

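I'm using Python 2.7.6 with ElementTree to load and parse an HTML file from the file system, and then walk the file to store specific RegEx matches in a data structure.

So, in my project's folder, I have an HTML file named person.html: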

<!DOCTYPE html>
<html>
    <body>
        <ul>
            <li>Name: $name</li>
            <li>Age: $age</li>
        </ul>
    </body> 
</html>

Here is my Python script (main.py) so far:

#!/usr/bin/env python
import web
import xml.etree.ElementTree as ET  # alias must match the name used below

tree = ET.parse('person.html')  # the file name has to be a string

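Questions:

  1. How do I use a RegEx or ElementTree to parse out the values that start with $ (e.g. $name and $age)?

  2. How do I store those values in a data structure that I can iterate over later?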

7st*_*tud 6

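lxml is for searching HTML by tag, and the standard library's ElementTree used below offers the same tag-based search. For example, to find all the <li> tags and get their text: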

import xml.etree.ElementTree as et

tree = et.parse('data.html')
html_tag = tree.getroot()

for li in html_tag.iter('li'):
    text = li.text
    print(text)

--output:--
Name: $name
Age: $age
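
One caveat worth sketching here: ElementTree is an XML parser, so the approach above only works when the HTML also happens to be well-formed XML, as data.html is. The snippet below is a minimal illustration rather than part of the original answer; the helper name try_parse and both sample strings are made up for the example.

```python
import xml.etree.ElementTree as et


def try_parse(markup):
    """Return the root element, or None if the markup is not well-formed XML."""
    try:
        return et.fromstring(markup)
    except et.ParseError:
        return None


# Hypothetical sample strings, not taken from the question.
well_formed = "<ul><li>Name: $name</li><li>Age: $age</li></ul>"
not_well_formed = "<ul><li>Name: $name<br></li></ul>"  # <br> is never closed

root = try_parse(well_formed)
li_texts = [li.text for li in root.iter('li')]
print(li_texts)

# The unclosed <br> makes the XML parser give up, so the helper returns None.
print(try_parse(not_well_formed))
```

If the files may contain unclosed tags like this, an HTML-aware parser is probably a better fit than ElementTree.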

If your target text can be in any tag, you can do this:

import xml.etree.ElementTree as et
import re

tree = et.parse('data.html')
html_tag = tree.getroot()

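pattern = r"""
    \$    # Match a literal $ sign...
    .+?   # followed by any character, 1 or more times, non-greedy
    \b    # followed by the (first) word boundary
"""

for tag in html_tag.iter('*'):  # '*' => all tags
    text = (tag.text or '').strip()  # tag.text is None for empty elements

    if text:
        match_list = re.findall(pattern, text, flags=re.X)
        print(match_list)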

--output:--
['$name']
['$age']
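
Note that .text only holds the text before an element's first child; text that follows a child element lives on that child's .tail and is skipped by the loop above. As a rough sketch of one way to cover both, Element.itertext() can be used to walk every text fragment. The pattern \$\w+ is an assumption about what counts as a variable name (letters, digits, underscore), and find_vars and the sample markup are hypothetical names invented for this example.

```python
import re
import xml.etree.ElementTree as et

# Assumption: a variable is a $ followed by one or more word characters.
VAR_PATTERN = re.compile(r'\$\w+')


def find_vars(root):
    """Collect $variables from all text under root, including tail text."""
    found = []
    for chunk in root.itertext():  # yields .text and .tail fragments in document order
        found.extend(VAR_PATTERN.findall(chunk))
    return found


# Hypothetical markup: "$dog1, $dog2" is tail text after the <b> element.
sample = "<ul><li>Name: $name</li><li><b>Dogs:</b> $dog1, $dog2</li></ul>"
variables = find_vars(et.fromstring(sample))
print(variables)
```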

How do I store those values in a data structure that I can iterate over later?

You can use the shelve module:

$ cat data.html
<!DOCTYPE html>
<html>
    <body>
        <ul>
            <li>Name: $name</li>
            <li>Age: $age</li>
            <li>Dogs: $dog1, $dog2</li>     
        </ul>
    </body> 
</html>

The script that scans data.html and shelves the matches:

import xml.etree.ElementTree as et
import re
import shelve
import collections as coll

tree = et.parse('data.html')
html_tag = tree.getroot()

pattern = r"""
    \$    #Match a literal $ sign...
    .+?   #followed by any character, 1 or more times, non-greedy
    \b    #followed by the (first) word boundary
"""

results = coll.defaultdict(list)

for tag in html_tag.iter('*'):
    text = (tag.text or '').strip()  # tag.text is None for empty elements

    if text: 
        match_list = re.findall(pattern, text, flags=re.X)
        if match_list:
            results['data.html'].extend(match_list)


print(results)

with shelve.open('mydb.db') as db:
    db['html vars'] = results

with shelve.open('mydb.db') as db:
    for key, val in db['html vars'].items():
        print("{}: {}".format(key, val))

--output:--
defaultdict(<class 'list'>, {'data.html': ['$name', '$age', '$dog1', '$dog2']})

data.html: ['$name', '$age', '$dog1', '$dog2']
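
The same round trip can be sketched in a self-contained form that writes to a temporary directory instead of the working directory. This is only an illustration under a few assumptions: it keys the shelf by file name and stores plain lists rather than a defaultdict, and the names save_vars, load_vars and html_vars are invented for the example. Using shelve.open in a with statement requires Python 3.4 or later.

```python
import os
import shelve
import tempfile


def save_vars(db_path, filename, variables):
    """Store the variables found in one file under that file's name."""
    with shelve.open(db_path) as db:
        db[filename] = list(variables)


def load_vars(db_path):
    """Read everything back into an ordinary dict that is easy to iterate."""
    with shelve.open(db_path) as db:
        return {key: db[key] for key in db}


# Hypothetical location; a real script would pick a permanent path.
db_path = os.path.join(tempfile.mkdtemp(), 'html_vars')

save_vars(db_path, 'data.html', ['$name', '$age', '$dog1', '$dog2'])
loaded = load_vars(db_path)

for filename, variables in loaded.items():
    print("{}: {}".format(filename, variables))
```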

If your end goal is to replace those variables in the HTML, your format matches the format of Python's string.Template:

import string

with open('data.html') as f:
    template = string.Template(f.read())


values = {
    'name': 'socal_javaguy',
    'age': 25,
    'dog1': 'Rover',
    'dog2': 'Jane',
}

results = template.substitute(values)
print(results)

--output:--
<!DOCTYPE html>
<html>
    <body>
        <ul>
            <li>Name: socal_javaguy</li>
            <li>Age: 25</li>
            <li>Dogs: Rover, Jane</li>     
        </ul>
    </body> 
</html>
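
substitute() raises a KeyError when a placeholder has no value. As a small sketch of two related tools, the snippet below lists the placeholder names through the class's own compiled regex (Template.pattern) and fills in only some of them with safe_substitute(), which leaves unknown placeholders untouched. The sample string is made up for the example.

```python
import string

# Hypothetical one-line template instead of reading data.html from disk.
template = string.Template("<li>Name: $name</li><li>Age: $age</li>")

# Template.pattern is the regex the class itself uses to locate placeholders;
# the 'named' group matches $name and the 'braced' group matches ${name}.
names = []
for match in template.pattern.finditer(template.template):
    name = match.group('named') or match.group('braced')
    if name:
        names.append(name)

# Only 'name' is supplied; $age is left in place rather than raising KeyError.
partial = template.safe_substitute(name='socal_javaguy')

print(names)
print(partial)
```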