Extracting the field names of an HTML form - Python

Question

Suppose there is a link, "http://www.someHTMLPageWithTwoForms.com", which is basically an HTML page containing two forms (say Form 1 and Form 2). I have code like this:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer
h = httplib2.Http('.cache')
response, content = h.request('http://www.someHTMLPageWithTwoForms.com')
for field in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
    if field.has_key('name'):
        print field['name']


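This returns all the field names belonging to both Form 1 and Form 2 of my HTML page. Is there any way to get only the field names that belong to a particular form (say, Form 2 only)?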

Answer 1

Ana*_*ass 5

If there are only two forms, you can try this:

from BeautifulSoup import BeautifulSoup

forms = BeautifulSoup(content).findAll('form')
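# findAll searches every descendant tag of the second form,
# so nested inputs are found and text nodes are skipped
for field in forms[1].findAll('input'):
    if field.has_key('name'):
        print field['name']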


If it is not always the second form you are after, you can make the lookup more specific (by id or class attribute):

from BeautifulSoup import BeautifulSoup

forms = BeautifulSoup(content).findAll(attrs={'id' : 'yourFormId'})
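# forms[0] is the first element whose id matches; search its descendants for inputs
for field in forms[0].findAll('input'):
    if field.has_key('name'):
        print field['name']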
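
The idea in this answer (pick one form, then collect the names of the fields inside it) can also be sketched without any third-party package. The following is a minimal Python 3 sketch that swaps in the standard library's html.parser instead of BeautifulSoup; the class name FormFieldParser, the CONTROL_TAGS set, and the sample markup are assumptions made up for illustration, not part of the original answer.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the names of form controls, grouped by the form that contains them."""

    # Tags treated as form controls here; adjust to taste (assumption).
    CONTROL_TAGS = {"input", "select", "textarea", "button"}

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of field names per <form>, in document order
        self._current = None  # names for the form being parsed, None outside a form

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._current = []
            self.forms.append(self._current)
        elif tag in self.CONTROL_TAGS and self._current is not None:
            name = dict(attrs).get("name")
            if name:
                self._current.append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


# Hypothetical sample page standing in for the downloaded content.
sample = """
<form action="/foo"><input name="uselessInput" type="text"></form>
<form action="/bar">
  <p><input name="firstInput" type="text"></p>
  <select name="secondInput"><option>a</option></select>
  <input type="submit" value="Go">
</form>
"""

parser = FormFieldParser()
parser.feed(sample)
print(parser.forms[1])  # field names of the second form only
```

As with forms[1] above, indexing parser.forms selects a form by its position in the page. Because names are recorded while the parser is inside a form, fields nested in other tags (the input inside the p element here) are picked up as well, and the unnamed submit button is skipped.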


Answer 2

mde*_*ous 1

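This kind of parsing is also very easy with lxml (which I personally prefer over BeautifulSoup because of its XPath support). For example, the following snippet prints all the field names (if any) that belong to the form named "form2":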

# you can ignore this part, it's only here for the demo
from StringIO import StringIO
HTML = StringIO("""
<html>
<body>
    <form name="form1" action="/foo">
        <input name="uselessInput" type="text" />
    </form>
    <form name="form2" action="/bar">
        <input name="firstInput" type="text" />
        <input name="secondInput" type="text" />
    </form>
</body>
</html>
""")

# here goes the useful code
import lxml.html
tree = lxml.html.parse(HTML) # you can pass parse() a file-like object or a URL
root = tree.getroot()
for form in root.xpath('//form[@name="form2"]'):
    for field in form.getchildren():
        if 'name' in field.keys():
            print field.get('name')


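This isn't great: it only looks at the direct children of the form element and doesn't check whether they are form inputs (other elements can also have a name attribute). (2 upvotes)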
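
One way to address both points in this comment is to walk every descendant of the form and keep only elements whose tag is a form control. The following is a rough Python 3 sketch that uses the standard library's xml.etree.ElementTree rather than lxml, so it only works on well-formed markup like the demo page above; lxml elements offer a similar iter() method. The function name field_names, the CONTROL_TAGS set, and the extra div in the sample are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; real-world HTML is often not valid XML.
PAGE = """
<html>
<body>
    <form name="form1" action="/foo">
        <input name="uselessInput" type="text" />
    </form>
    <form name="form2" action="/bar">
        <div name="notAField">
            <input name="firstInput" type="text" />
        </div>
        <textarea name="secondInput"></textarea>
    </form>
</body>
</html>
"""

# Tags treated as form controls (assumption; extend as needed).
CONTROL_TAGS = {"input", "select", "textarea"}


def field_names(markup, form_name):
    """Return names of form controls anywhere inside the forms named form_name."""
    root = ET.fromstring(markup)
    names = []
    for form in root.findall(".//form[@name='%s']" % form_name):
        # iter() walks every descendant, not just the direct children.
        for element in form.iter():
            if element.tag in CONTROL_TAGS and "name" in element.attrib:
                names.append(element.get("name"))
    return names


print(field_names(PAGE, "form2"))
```

Here the div carries a name attribute but is ignored because its tag is not a form control, while the input nested inside it is still found.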

Archived:	14 years, 4 months ago
Views:	10017
Last activity:	6 years ago