I'm trying to wrap the contents of a tag with BeautifulSoup. This:
<div class="footnotes">
<p>Footnote 1</p>
<p>Footnote 2</p>
</div>
should become this:
<div class="footnotes">
<ol>
<p>Footnote 1</p>
<p>Footnote 2</p>
</ol>
</div>
So I'm using the following code:
footnotes = soup.findAll("div", { "class" : "footnotes" })
footnotes_contents = ''
new_ol = soup.new_tag("ol")
for content in footnotes[0].children:
    new_tag = soup.new_tag(content)
    new_ol.append(new_tag)
footnotes[0].clear()
footnotes[0].append(new_ol)
print footnotes[0]
But I get the following:
<div class="footnotes"><ol><
></
><<p>Footnote 1</p>></<p>Footnote 1</p>><
></
><<p>Footnote 2</p>></<p>Footnote 2</p>><
></
></ol></div>
Suggestions?
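A minimal sketch of one way to get the expected output with bs4 (assuming the markup above is in a string html; note that new_tag() expects a tag name, and passing an element is what produces the garbled output above):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # `html` holds the markup above
div = soup.find("div", class_="footnotes")
new_ol = soup.new_tag("ol")
# move the existing children into the new <ol> instead of creating
# new tags from them
for child in list(div.children):
    new_ol.append(child.extract())
div.append(new_ol)
print(div)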
I downloaded the lxml tarball and tried to install it with ipython setup.py install. Unfortunately, it gives me a huge pile of error messages:

src/lxml/lxml.etree.c:200651: error: ‘XML_XPATH_INVALID_OPERAND’ undeclared (first use in this function)
src/lxml/lxml.etree.c:200661: error: ‘XML_XPATH_INVALID_TYPE’ undeclared (first use in this function)
src/lxml/lxml.etree.c:200671: error: ‘XML_XPATH_INVALID_ARITY’ undeclared (first use in this function)
src/lxml/lxml.etree.c:200681: error: ‘XML_XPATH_INVALID_CTXT_SIZE’ undeclared (first use in this function)
src/lxml/lxml.etree.c:200691: error: ‘XML_XPATH_INVALID_CTXT_POSITION’ undeclared (first use in this function)
src/lxml/lxml.etree.c:200921: error: ‘LIBXSLT_VERSION’ undeclared (first use in this function)
src/lxml/lxml.etree.c:200933: error: ‘xsltLibxsltVersion’ undeclared (first use in this function)
src/lxml/lxml.etree.c:200945: error: ‘__pyx_v_4lxml_5etree_XSLT_DOC_DEFAULT_LOADER’ undeclared (first use in this function)
src/lxml/lxml.etree.c:200945: error: ‘xsltDocDefaultLoader’ undeclared (first use …
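For what it's worth, the undeclared LIBXSLT_VERSION and xsltLibxsltVersion symbols in a log like this usually mean the compiler cannot find the libxslt development headers: building lxml from source requires the libxml2 and libxslt headers (and Python's own headers) to be present before setup.py install can compile lxml.etree.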
I have a dict that maps each xml tag to a dict key. I want to walk every tag and text field in the xml and compare it with the associated dict key's value, which is a key in another dict.

<2gMessage>
  <Request>
    <pid>daemon</pid>
    <emf>123456</emf>
    <SENum>2041788209</SENum>
    <MM>
      <MID>jbr1</MID>
      <URL>http://jimsjumbojoint.com</URL>
    </MM>
    <AppID>reddit</AppID>
    <CCS>
      <Mode>
        <CardPresent>true</CardPresent>
        <Recurring>false</Recurring>
      </Mode>
      <Date>
        <ASCII>B4788250000028291^RRR^15121015432112345601</ASCII>
      </Date>
      <Amount>100.00</Amount>
    </CCS>
  </Request>
</2gMessage>
The code I have so far:
parser = etree.XMLParser(ns_clean=True, remove_blank_text=True)
tree = etree.fromstring(strRequest, parser)
for tag in tree.xpath('//Request'):
    subfields = tag.getchildren()
    for subfield in subfields:
        print (subfield.tag, subfield.text)
return strRequest
However, this only prints the tags that are direct children of Request. I want to be able to reach children of children in the same loop. I don't want to hard-code values, because the tags and the structure may change.
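A minimal sketch of one way to reach nested tags without hard-coding the structure, using iter(), which walks an element and all of its descendants depth-first (assuming strRequest holds the XML above; note that lxml will actually reject that sample as-is, since XML element names such as 2gMessage may not start with a digit):

from lxml import etree

parser = etree.XMLParser(ns_clean=True, remove_blank_text=True)
tree = etree.fromstring(strRequest, parser)
# iter() yields the element and every descendant, so <MID> and <URL>
# inside <MM> come out of the same loop as the direct children
for element in tree.iter():
    if element.text and element.text.strip():
        print(element.tag, element.text.strip())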
Here is my sample Python code:
import requests
import lxml.html
page = '<div class="aaaa12"><span class="test">22</span><span class="number">33</span></div><div class="dddd13"><span>Kevin</span></div>'
tree = lxml.html.fromstring(page)
number = tree.xpath('//span[@class="number"]/text()')
price = tree.xpath('.//div[@class="dddd13"]/span/text()')
print number
print price
When I run it, I get the following:
['33']
['Kevin']
However, I want to get both at once, like ['33', 'Kevin']. I tried:
number = tree.xpath('//span[@class="number"]/text() or //div[@class="dddd13"]/span/text()')
but I can't get the values. What is the syntax for getting the two different classes?
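In XPath, or is a boolean operator, so that expression evaluates to true/false rather than a node-set; the union of two node-sets is written with |. A sketch against the same tree:

# '|' unions the two node-sets; results come back in document order
values = tree.xpath('//span[@class="number"]/text() | //div[@class="dddd13"]/span/text()')
print(values)  # ['33', 'Kevin']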
I'm trying to import a list of text from a page using Python lxml. This is what I have so far.
test_page.html source:
<html>
<head>
<title>Test</title>
</head>
<body>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr><td><a title="This page is cool" class="producttitlelink" href="about:mozilla">This page is cool</a></td></tr>
<tr height="10"></tr>
<tr><td class="plaintext">This is a really cool description for my really cool page.</td></tr>
<tr><td class="plaintext">Published: 7/15/15</td></tr>
<tr><td class="plaintext">
</td></tr>
<tr><td class="plaintext">
</td></tr>
<tr><td class="plaintext">
</td></tr>
<tr><td class="plaintext">
</td></tr>
</tbody>
</table>
</body>
Python code:
from lxml import html
import requests
page = requests.get('http://127.0.0.1/test_page.html')
tree = html.fromstring(page.text)
description = tree.xpath('//table//td[@class="plaintext"]/text()')
>> print (description)
['This is …
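If the goal is to drop the entries coming from the whitespace-only <td class="plaintext"> cells, one sketch (same tree as above) is to strip and filter:

texts = tree.xpath('//table//td[@class="plaintext"]/text()')
# keep only entries that contain real text, not just whitespace
description = [t.strip() for t in texts if t.strip()]
print(description)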
Printing an lxml.objectify.ObjectifiedElement just prints an empty line, so I have to access it through its tags, and when I don't know the tags of the response I'm just guessing. How do I print the whole object, showing the children's names and values?
As requested, here is my code. Not sure what purpose this serves, but:
from amazonproduct import API
api = API('xxxxx', 'xxxxx', 'us', 'xxxx')
result = api.item_lookup('B00H8U93JO', ResponseGroup='OfferSummary')
print result
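A sketch of two ways to inspect the whole element, assuming result is the ObjectifiedElement returned by the lookup:

from lxml import etree, objectify

# serialize the element back to XML text
print(etree.tostring(result, pretty_print=True))
# or dump the tree with tag names, values and Python types
print(objectify.dump(result))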
This is probably something really simple, but I keep failing at it. When root contains one or more <link /> elements, root.xpath('(//link)') returns all of them. But root.xpath('(//link)[0]') returns an empty list. What's wrong?
from unittest import TestCase, TestProgram

class T(TestCase):
    base_path = r'(//_:link)'

    def test0ok(self):
        self._test(2, self.base_path)

    def test1ng(self):
        self._test(1, self.base_path + r'[0]')

    def _test(self, expected, path):
        try:
            from lxml.etree import fromstring as parse_xml_string
        except ImportError:
            raise
        root = parse_xml_string(_xhtml)
        nsmap = dict(_=root.nsmap[None])
        gotten = root.xpath(path, namespaces=nsmap)
        gotten = len(gotten)
        self.assertEqual(expected, gotten)
_xhtml = br'''
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"
>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<link rev="made" href="./" …
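XPath numbers positions from 1, not 0, so (//_:link)[0] can never match anything; a sketch of the corrected test:

# XPath positions are 1-based: [1] selects the first matching <link>
self._test(1, self.base_path + r'[1]')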
from lxml import html
import requests
url = 'https://www.data.gov/'
r = requests.get(url)
doc = html.fromstring(r.content)
link = doc.xpath('/html/body/header/div[4]/div/div/h4/label/small/a')
print(link)
This keeps giving me:
[<Element a at 0x1c64c963f48>]
as the response, instead of the actual number I'm looking for on the page. Any idea why?
Also, why can't I get the type(link) value to see the type?
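xpath() returned a list of Element objects here, and printing the list shows their reprs. A sketch of two ways to get at the text (same doc as above):

# append /text() to select the text nodes instead of the elements...
print(doc.xpath('/html/body/header/div[4]/div/div/h4/label/small/a/text()'))
# ...or keep the element and read its text; note that type() works on
# a single element, not on the list that xpath() returns
link = doc.xpath('/html/body/header/div[4]/div/div/h4/label/small/a')
if link:
    print(link[0].text_content())
    print(type(link[0]))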
import lxml.html
from lxml.cssselect import CSSSelector

def extract_page_data(html):
    tree = lxml.html.fromstring(html)
    item_sel = CSSSelector('.my-item')
    text_sel = CSSSelector('.my-text-content')
    time_sel = CSSSelector('.time')
    author_sel = CSSSelector('.author-text')
    a_tag = CSSSelector('.a')
    for item in item_sel(tree):
        yield {'href': a_tag(item)[0].text_content(),
               'my pagetext': text_sel(item)[0].text_content(),
               'time': time_sel(item)[0].text_content().strip(),
               'author': author_sel(item)[0].text_content()}
I want to extract the href, but I can't get it with this code.
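Two likely problems, shown as a sketch (assuming the items contain ordinary <a href="..."> links): CSSSelector('.a') matches elements with class="a" (a tag selector is just 'a'), and the href is an attribute, so it comes from .get() rather than .text_content():

a_tag = CSSSelector('a')  # tag selector; '.a' would match class="a"
for item in item_sel(tree):
    links = a_tag(item)
    # the URL lives in the href attribute, not in the link text
    href = links[0].get('href') if links else None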
I recently installed a new Anaconda release, 2019-10, which uses Python 3.7.4. To be able to read/write MS Word .docx files from Python I use the library module python-docx, which I installed with: conda install -c conda-forge python-docx
What got installed is python-docx 0.8.10. Now a Python script that I have long used with my previous Anaconda installation to read/write MS Word .docx files (that was Python 3.5.4; I don't know which python-docx version)
The script (shortened):
import docx
doc = docx.Document('demo.docx') # demo.docx exists in same dir
print(len(doc.paragraphs))
suddenly raises an error:
Traceback (most recent call last):
File "D:\pa\Python\ProjectsWorkspace\Py001Proj\src\printenfrompython\wordprinten.py", line 19, in <module>
import docx
File "C:\Users\pa\Anaconda3\lib\site-packages\docx\__init__.py", line 3, in <module>
from docx.api import Document # noqa
File "C:\Users\pa\Anaconda3\lib\site-packages\docx\api.py", line 14, in <module>
from docx.package import Package
File "C:\Users\pa\Anaconda3\lib\site-packages\docx\package.py", line 9, …Run Code Online (Sandbox Code Playgroud)