I am trying to run this program. It worked fine until today, and I have not changed my code at all.
import lxml.etree
import urlparse
import re
def parse_url(url):
    return lxml.etree.parse(url, lxml.etree.HTMLParser())
urlivv = "http://finance.yahoo.com/q?s=IVV"
docivv = parse_url(urlivv)
Here is my error message:
IOError: Error reading file 'http://finance.yahoo.com/q?s=IVV': failed to load external entity "http://finance.yahoo.com/q?s=IVV"
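There is some documentation on the website about passing a StringIO argument (see below), but I find that odd, since I have never had to do that before.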
tree = etree.parse(StringIO(myString))
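A possible workaround, sketched under the assumption that the failure comes from lxml's built-in URL loader (libxml2 does not speak HTTPS, so a site that starts redirecting to HTTPS can break a call that used to work): download the page yourself and hand the bytes to lxml. The helper name `parse_html_bytes` and the sample markup are made up for illustration, and the example is written for Python 3.

```python
from io import BytesIO

from lxml import etree


def parse_html_bytes(data):
    """Parse raw HTML bytes so that lxml never has to perform the download itself."""
    return etree.parse(BytesIO(data), etree.HTMLParser())


# In real use the bytes would come from an HTTP client, for example
# urllib.request.urlopen(url).read(); a literal is used here to stay offline.
sample = b"<html><head><title>IVV quote</title></head><body><p>hi</p></body></html>"
doc = parse_html_bytes(sample)
print(doc.findtext(".//title"))
```

Wrapping the downloaded content in a file-like object is the same idea as the StringIO line from the documentation; `BytesIO` is used because the parser is given bytes.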
Edit: a more complete stack trace.
>>> import lxml.etree
>>> tree = lxml.etree.parse('http://finance.yahoo.com/q?s=IVV', parser=lxml.etree.HTMLParser())
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "src\lxml\lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:81100)
File "src\lxml\parser.pxi", line 1811, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:117831)
File "src\lxml\parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:118178)
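File "src\lxml\parser.pxi", …
I have some HTML that I want to parse with lxml in Python. The page contains a number of elements, each representing a poster. I want to grab the ID of each poster so that I can scrape a piece of information from that poster's page. The poster's ID is currently stored in the id attribute, so I want to use lxml to get the value of that attribute.
For example: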
<div onclick="showDetail(9202)">
<div class="maincard narrower Poster" id="maincard_9202"> </div>
</div>
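I want to get "maincard_9202" from the id attribute so that I can use a regular expression to extract 9202. From there, I can use that value to go directly to the poster's page, because I know the URL pattern goes from
https://nips.cc/Conferences/2017/Schedule?type=Poster (the current page) to https://nips.cc/Conferences/2017/Schedule?showEvent=9202 (the poster's page).
I tried to use the following code: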
from lxml import html
import requests
page = requests.get('https://nips.cc/Conferences/2017/Schedule?type=Poster')
tree = html.fromstring(page.content)
paper_numbers = tree.xpath('//div[@onclick]/id/')
But this returns an empty list.
How can I get the attribute value in this case?
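A minimal sketch of one way to read the attribute, assuming the real page has the same nesting as the snippet above. In XPath an attribute is addressed with `@id`, not as a child step `/id/`. The markup is inlined here instead of being downloaded.

```python
import re

from lxml import html

SNIPPET = """<html><body>
<div onclick="showDetail(9202)">
  <div class="maincard narrower Poster" id="maincard_9202"> </div>
</div>
</body></html>"""

tree = html.fromstring(SNIPPET)

# Select the id attribute of the div nested inside each div that has an onclick.
card_ids = tree.xpath('//div[@onclick]/div/@id')

# Pull the numeric part out of values such as "maincard_9202".
paper_numbers = [re.search(r"\d+", card_id).group() for card_id in card_ids]
print(card_ids, paper_numbers)
```

The attribute query returns plain strings, so the regular expression can be applied to each result directly.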
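I am trying to run a transaction report on data from authorize.net.
I have determined that the returned object is an lxml.objectify.ObjectifiedElement, which is hard to work with as it is.
I would like to turn this object into a more workable dict, but I am having trouble doing so.
I have determined that an authorize.net transaction looks like this: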
<getTransactionDetailsRequest xmlns="AnetApi/xml/v1/schema/AnetApiSchema.xsd">
<messages>
<resultCode>...</resultCode>
<message>
<code>...</code>
<text>...</text>
</message>
</messages>
<transaction>
<transId>...</transId>
<submitTimeUTC>...</submitTimeUTC>
<submitTimeLocal>...</submitTimeLocal>
<transactionType>...</transactionType>
<transactionStatus>...</transactionStatus>
<responseCode>...</responseCode>
<responseReasonCode>...</responseReasonCode>
<responseReasonDescription>...</responseReasonDescription>
<AVSResponse>...</AVSResponse>
<cardCodeResponse>...</cardCodeResponse>
<batch>
<batchId>...</batchId>
<settlementTimeUTC>...</settlementTimeUTC>
<settlementTimeLocal>...</settlementTimeLocal>
<settlementState>...</settlementState>
</batch>
<order>
<description>...</description>
</order>
<authAmount>...</authAmount>
<settleAmount>...</settleAmount>
<lineItems>
<lineItem>
<itemId>...</itemId>
<name>...</name>
<description>...</description>
<quantity>...</quantity>
<unitPrice>...</unitPrice>
<taxable>...</taxable>
</lineItem>
</lineItems>
<taxExempt>...</taxExempt>
<payment>
<creditCard>
<cardNumber>...</cardNumber>
<expirationDate>...</expirationDate>
<cardType>...</cardType>
</creditCard>
</payment>
<customer>
<email>...</email>
</customer>
<billTo>
<firstName>...</firstName>
<lastName>...</lastName>
<phoneNumber>...</phoneNumber>
</billTo>
<recurringBilling>...</recurringBilling>
<product>...</product>
<marketType>...</marketType>
</transaction>
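One way an ObjectifiedElement like the response above might be turned into a plain dict is a small recursive walk over its children. This is only a sketch: the function name `element_to_dict` is hypothetical, the sample document is heavily shortened, and the values in it ("Ok", "123", "7") are made up.

```python
from lxml import etree, objectify

SAMPLE = b"""<getTransactionDetailsRequest xmlns="AnetApi/xml/v1/schema/AnetApiSchema.xsd">
  <messages><resultCode>Ok</resultCode></messages>
  <transaction>
    <transId>123</transId>
    <batch><batchId>7</batchId></batch>
  </transaction>
</getTransactionDetailsRequest>"""


def element_to_dict(element):
    """Return the text of a leaf element, or a dict of its children keyed by local tag name."""
    children = list(element.iterchildren())
    if not children:
        return element.text
    result = {}
    for child in children:
        key = etree.QName(child).localname  # drop the {namespace} prefix
        value = element_to_dict(child)
        if key in result:
            # Repeated tags (for example several lineItem entries) become a list.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    return result


root = objectify.fromstring(SAMPLE)
data = {etree.QName(root).localname: element_to_dict(root)}
print(data)
```

All leaf values come back as strings here; converting amounts or timestamps to richer types would be a separate step.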
I would like to produce a dict that looks like
{getTransactionDetailsRequest …
I need to build an XML file using special names for the items. This is my current code:
from lxml import etree
import lxml
from lxml.builder import E
wp = E.wp
tmp = wp("title")
print(etree.tostring(tmp))
The current output is:
b'<wp>title</wp>'
What I want is:
b'<wp:title>title</wp:title>'
How can I create elements with a name like wp:title?
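A prefixed name such as wp:title is a namespaced element, so one approach is to give `ElementMaker` a namespace and a prefix mapping. The namespace URI below is an assumption (the WordPress export format uses a URI of this kind); substitute whatever URI the target format actually requires.

```python
from lxml import etree
from lxml.builder import ElementMaker

# Assumed namespace URI for the "wp" prefix; replace with the one your format needs.
WP_NS = "http://wordpress.org/export/1.2/"

# Every element built through this maker lives in WP_NS and is serialized as wp:...
WP = ElementMaker(namespace=WP_NS, nsmap={"wp": WP_NS})

title = WP.title("title")
serialized = etree.tostring(title)
print(serialized)
```

The serialized element carries an `xmlns:wp` declaration in addition to the prefixed tag, which is what makes the prefix valid XML; when the element is nested under a root that already declares the prefix, the declaration appears only once on that root.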
I am trying to get a list of all the titles on Reddit.com using lxml. I used this query:
reddit = etree.HTML( urllib.urlopen("http://www.reddit.com/r/all/top").read() )
reddit.xpath("//div[contains(@class,'title')]//b/text()")
However, when I run the expression, nothing shows up in the Python shell. Is the XPath incorrect?
Running with Python 2.7.
Here is the full code:
import urllib
import os, random, sys, math
from lxml import etree
def main():
    reddit = etree.HTML( urllib.urlopen("http://www.reddit.com/r/all/top").read() )
    reddit.xpath("//div[contains(@class,'title')]//b/text()")
if __name__ == "__main__":
    main()
I am fetching data from a page that contains an <iframe>, and I want to get the data out of the <iframe> with lxml.
I have not found any resource on how to get at an <iframe> with lxml.
Can anyone tell me how to do this?
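An <iframe> is a separate document, so its content is not part of the outer page's tree. A common approach is to read each iframe's src attribute, resolve it against the page URL, and then fetch and parse that URL as a page of its own. The sketch below covers only the first two steps; the URLs and markup are hypothetical.

```python
from urllib.parse import urljoin

from lxml import html

PAGE_URL = "https://example.com/articles/page.html"  # hypothetical page address
PAGE_HTML = """<html><body>
<iframe src="/embed/widget.html"></iframe>
<iframe src="https://other.example.org/frame"></iframe>
</body></html>"""


def iframe_urls(page_html, base_url):
    """Return the absolute URL of every iframe found in the page."""
    tree = html.fromstring(page_html)
    return [urljoin(base_url, src) for src in tree.xpath("//iframe/@src")]


urls = iframe_urls(PAGE_HTML, PAGE_URL)
print(urls)
```

Each resulting URL can then be downloaded and passed to `html.fromstring` just like the outer page.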
I need to get some information that follows a specific tag using lxml. The XML doc looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<web-app xmlns="http://java.sun.com/xml/ns/j2ee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/
ns/j2ee/web-app_2_4.xsd"
version="2.4">
<display-name>Community Bank</display-name>
<description>WebGoat for Cigital</description>
<context-param>
<param-name>PropertiesPath</param-name>
<param-value>/WEB-INF/properties.txt</param-value>
<description>This is the path to the properties file from the servlet root</description>
</context-param>
<servlet>
<servlet-name>Index</servlet-name>
<servlet-class>com.cigital.boi.servlet.index</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>Index</servlet-name>
<url-pattern>/index</url-pattern>
</servlet-mapping>
<servlet-mapping>
<servlet-name>Index</servlet-name>
<url-pattern>/index.html</url-pattern>
</servlet-mapping>
I want to read com.cigital.boi.servlet.index.
I have used this code to read everything under servlet:
context = etree.parse(handle)
list = parser.xpath('//servlet')
print list
list is empty. More information: iterating over context, I found these lines.
<Element {http://java.sun.com/xml/ns/j2ee}servlet-name at 2ad19e6eca48>
<Element {http://java.sun.com/xml/ns/j2ee}servlet-class at 2ad19e6ecaf8>
I think the output is an empty list because I did not include the namespace in my search. Please suggest how to read "com.cigital.boi.servlet.index" from the servlet-class tag.
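A sketch of a namespace-aware query. The document declares a default namespace, and XPath has no notion of a default namespace, so the query has to bind that URI to a prefix of its own. The prefix `j2ee` is an arbitrary choice, and the XML is shortened from the document above.

```python
from lxml import etree

XML = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<web-app xmlns="http://java.sun.com/xml/ns/j2ee" version="2.4">
  <servlet>
    <servlet-name>Index</servlet-name>
    <servlet-class>com.cigital.boi.servlet.index</servlet-class>
  </servlet>
</web-app>"""

root = etree.fromstring(XML)

# Any prefix works as long as it maps to the document's namespace URI.
NS = {"j2ee": "http://java.sun.com/xml/ns/j2ee"}

unqualified = root.xpath("//servlet")
servlet_classes = root.xpath("//j2ee:servlet/j2ee:servlet-class/text()", namespaces=NS)
print(unqualified, servlet_classes)
```

The unqualified query matches nothing because `servlet` without a prefix means "an element named servlet in no namespace", which does not exist in this document.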
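I am trying to get a stock's company name, sector and industry. I download the HTML from 'https://finance.yahoo.com/q/in?s={}+Industry'.format(sign) and then try to parse it with .xpath() from lxml.html.
To get the XPath for the data I am trying to scrape, I go to the site in Chrome, right-click the item, click Inspect Element, right-click the highlighted area, and click Copy XPath. This has always worked for me in the past.
The problem can be reproduced with the following code (I am using Apple as an example):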
import requests
from lxml import html
page_p = 'https://finance.yahoo.com/q/in?s=AAPL+Industry'
name_p = '//*[@id="yfi_rt_quote_summary"]/div[1]/div/h2/text()'
sect_p = '//*[@id="yfncsumtab"]/tbody/tr[2]/td[1]/table[2]/tbody/tr/td/table/tbody/tr[1]/td/a/text()'
indu_p = '//*[@id="yfncsumtab"]/tbody/tr[2]/td[1]/table[2]/tbody/tr/td/table/tbody/tr[2]/td/a/text()'
page = requests.get(page_p)
tree = html.fromstring(page.text)
name = tree.xpath(name_p)
sect = tree.xpath(sect_p)
indu = tree.xpath(indu_p)
print('Name: {}\nSector: {}\nIndustry: {}'.format(name, sect, indu))
This gives the following output:
Name: ['Apple Inc. (AAPL)']
Sector: []
Industry: []
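One plausible cause of empty results like these, offered as an assumption and not a confirmed diagnosis: browsers insert `<tbody>` elements into the DOM they display, so an XPath copied from Chrome can contain `tbody` steps that do not exist in the raw HTML lxml receives. The table below is a made-up stand-in for the real page.

```python
from lxml import html

# Raw markup as a server might send it: rows sit directly under <table>.
RAW = """<html><body>
<table id="yfncsumtab">
  <tr><td>Sector:</td><td><a href="#">Consumer Goods</a></td></tr>
  <tr><td>Industry:</td><td><a href="#">Example Industry</a></td></tr>
</table>
</body></html>"""

tree = html.fromstring(RAW)

# Path in the style Chrome produces, with a tbody step that is absent from the source.
with_tbody = tree.xpath('//*[@id="yfncsumtab"]/tbody/tr[1]/td/a/text()')

# Same path with the tbody step removed.
without_tbody = tree.xpath('//*[@id="yfncsumtab"]/tr[1]/td/a/text()')
print(with_tbody, without_tbody)
```

Comparing the copied path against the page source (View Source, not the Elements panel) is a quick way to check whether this applies.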
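It has no trouble downloading the page, since it is able to retrieve name, but the other two do not work. If I replace their paths with tr[1]/td/a/text() and tr[1]/td/a/text() respectively, it returns: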
Name: ['Apple Inc. (AAPL)']
Sector: ['Consumer Goods', 'Industry Summary', 'Company …
The HTML structure is like this:
<div class="image">
<a target="_top" href="someurl">
<img class="_verticallyaligned" src="cdn.translte" alt="">
</a>
<button class="dui-button -icon" data-shop-id="343170" data-id="14145140">
<i class="dui-icon -favorite"></i>
</button>
</div>
The code to extract the text:
buyers = doc.xpath("//div[@class='image']/a[0]/text()")
The output is:
[]
What am I doing wrong?
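Two things stand out in the query above. XPath positions start at 1, so `a[0]` can never match, and the `<a>` element here contains only an `<img>`, so there is no meaningful text to return. A sketch that reads attributes instead, assuming the href and src values are what is wanted:

```python
from lxml import html

SNIPPET = """<html><body>
<div class="image">
  <a target="_top" href="someurl">
    <img class="_verticallyaligned" src="cdn.translte" alt="">
  </a>
  <button class="dui-button -icon" data-shop-id="343170" data-id="14145140">
    <i class="dui-icon -favorite"></i>
  </button>
</div>
</body></html>"""

doc = html.fromstring(SNIPPET)

zero_index = doc.xpath("//div[@class='image']/a[0]")            # position 0 does not exist
links = doc.xpath("//div[@class='image']/a[1]/@href")           # first <a>, href attribute
images = doc.xpath("//div[@class='image']/a[1]/img/@src")       # image inside that link
shop_ids = doc.xpath("//div[@class='image']/button/@data-shop-id")
print(zero_index, links, images, shop_ids)
```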
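I have the XML below, which I have saved in a file called movies.xml. I need to convert only certain values to JSON. For a direct conversion I could use xmltodict. I am using etree and etree.XMLParser(). I am trying to load the result into Elasticsearch afterwards. I have successfully extracted a single node using the attrib method.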
<?xml version="1.0" encoding="UTF-8" ?>
<collection>
<genre category="Action">
<decade years="1980s">
<movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
<format multiple="No">DVD</format>
<year>1981</year>
<rating>PG</rating>
<description>
'Archaeologist and adventurer Indiana Jones
is hired by the U.S. government to find the Ark of the
Covenant before the Nazis.'
</description>
</movie>
<movie favorite="True" title="THE KARATE KID">
<format multiple="Yes">DVD,Online</format>
<year>1984</year>
<rating>PG</rating>
<description>None provided.</description>
</movie>
<movie favorite="False" title="Back 2 the Future">
<format multiple="False">Blu-ray</format>
<year>1985</year>
<rating>PG</rating>
<description>Marty McFly</description>
</movie>
</decade>
<decade years="1990s">
<movie favorite="False" title="X-Men">
<format …
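A sketch of picking only certain values out of XML shaped like movies.xml and dumping them as JSON. Which fields to keep is an assumption made for the example, the function name `movies_to_records` is hypothetical, and the XML is shortened to two of the movies shown above.

```python
import json

from lxml import etree

XML = b"""<collection>
  <genre category="Action">
    <decade years="1980s">
      <movie favorite="True" title="THE KARATE KID">
        <format multiple="Yes">DVD,Online</format>
        <year>1984</year>
        <rating>PG</rating>
        <description>None provided.</description>
      </movie>
      <movie favorite="False" title="Back 2 the Future">
        <format multiple="False">Blu-ray</format>
        <year>1985</year>
        <rating>PG</rating>
        <description>Marty McFly</description>
      </movie>
    </decade>
  </genre>
</collection>"""


def movies_to_records(root):
    """Build one flat dict per <movie>, keeping only the selected fields."""
    records = []
    for movie in root.iter("movie"):
        genre = next(movie.iterancestors("genre"))
        records.append({
            "title": movie.get("title"),           # attribute value
            "year": int(movie.findtext("year")),   # child element text
            "rating": movie.findtext("rating"),
            "genre": genre.get("category"),        # attribute of an ancestor
        })
    return records


root = etree.fromstring(XML, etree.XMLParser(remove_blank_text=True))
records = movies_to_records(root)
print(json.dumps(records, indent=2))
```

Each record is a plain dict, so the list can be serialized with `json.dumps` or passed to an indexing client one document at a time.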