How to find a specific XPath td class in a table using lxml and Python

Question

I'm trying to import a list of text strings from a page using Python and lxml. Here is what I have so far.

Source of test_page.html:

<html>
<head>
    <title>Test</title>
</head>
<body>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tbody>
    <tr><td><a title="This page is cool" class="producttitlelink" href="about:mozilla">This page is cool</a></td></tr>
    <tr height="10"></tr>
    <tr><td class="plaintext">This is a really cool description for my really cool page.</td></tr>

            <tr><td class="plaintext">Published: 7/15/15</td></tr>

    <tr><td class="plaintext">



    </td></tr>
    <tr><td class="plaintext">


    </td></tr>
    <tr><td class="plaintext">


    </td></tr>
    <tr><td class="plaintext">

    </td></tr>


    </tbody>
</table>
</body>

Python code:

from lxml import html
import requests

page = requests.get('http://127.0.0.1/test_page.html')
tree = html.fromstring(page.text)
description = tree.xpath('//table//td[@class="plaintext"]/text()')
>>> print(description)
['This is a really cool description for my really cool page.', 'Published: 7/15/15', '\n\t\t\n\t\t\t\t\n\t\t\n\t', '\n\t\t\t\t\n\t\n\t', '\n\t\t\t\t\n\t\n\t', '\n\t\t\t\t\n\t']

However, the desired end result is:

['This is a really cool description for my really cool page. Published: 7/15/15']

I thought that using [1]:

tree.xpath('//table//td[@class="plaintext"][1]/text()')

would get me just the first row:

['This is a really cool description for my really cool page.']

But it still pulls the whole list.

Is there a way to select a single row, or a list of rows, from this HTML using XPath alone?

Answer 1

har*_*r07 5

You can try it this way:

from lxml import html

source = """html posted in the question here"""
tree = html.fromstring(source)
tds = tree.xpath('//table//td[@class="plaintext"]/text()[normalize-space()]')
description = ' '.join(tds)
print(description)

The XPath predicate [normalize-space()] applied to text() returns only the text nodes that are not empty or whitespace-only.
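
To see the effect of the predicate in isolation, here is a minimal self-contained sketch. It assumes a trimmed copy of the question's table (the `SOURCE` string and the variable names are mine, not from the original post) and compares the text nodes selected with and without the filter:

```python
from lxml import html

# Trimmed stand-in for the question's HTML (assumption: only three rows kept,
# one of them containing nothing but whitespace).
SOURCE = """
<html><body>
<table>
<tbody>
    <tr><td class="plaintext">This is a really cool description for my really cool page.</td></tr>
    <tr><td class="plaintext">Published: 7/15/15</td></tr>
    <tr><td class="plaintext">

    </td></tr>
</tbody>
</table>
</body></html>
"""

tree = html.fromstring(SOURCE)

# Without the predicate, the whitespace-only cell is returned as well.
all_text = tree.xpath('//table//td[@class="plaintext"]/text()')

# normalize-space() yields an empty string for whitespace-only text nodes,
# and an empty string is false in a predicate, so those nodes are dropped.
non_blank = tree.xpath('//table//td[@class="plaintext"]/text()[normalize-space()]')

print(len(all_text), len(non_blank))
print(' '.join(non_blank))
```

The predicate only filters nodes; it does not trim them. If the kept cells could carry leading or trailing whitespace, calling `.strip()` on each string before joining is probably still worthwhile.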

With the HTML from the question, the first snippet in this answer prints exactly the desired text:

This is a really cool description for my really cool page. Published: 7/15/15
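
As for the [1] attempt in the question: in XPath, a positional predicate inside a step counts among the nodes that step selects for each parent. Every td in this table is the only td in its tr, so each one is at position 1 and all of them match. Wrapping the path in parentheses applies the index to the combined result instead. A minimal sketch, again assuming a trimmed copy of the question's HTML (`SOURCE` and the variable names are illustrative):

```python
from lxml import html

# Trimmed stand-in for the question's HTML (assumption: three rows kept).
SOURCE = """
<html><body>
<table>
<tbody>
    <tr><td class="plaintext">This is a really cool description for my really cool page.</td></tr>
    <tr><td class="plaintext">Published: 7/15/15</td></tr>
    <tr><td class="plaintext">

    </td></tr>
</tbody>
</table>
</body></html>
"""

tree = html.fromstring(SOURCE)

# [1] here means "first matching td within its own parent row",
# so every row contributes its td and the whole list comes back.
per_row = tree.xpath('//table//td[@class="plaintext"][1]/text()')

# Parentheses make the predicate index the full node-set in document order.
first = tree.xpath('(//table//td[@class="plaintext"])[1]/text()')
first_two = tree.xpath('(//table//td[@class="plaintext"])[position() <= 2]/text()')

print(len(per_row))
print(first)
print(first_two)
```

The `position() <= 2` form selects a range of rows with XPath alone, which may be enough when the number of meaningful rows is known in advance. When it is not, the normalize-space() filter is likely the more robust choice.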
