For example, if I have
<form name="blah">
<input name="1"/>
<input name="2"/>
<table>
<tr>
<td>
<unknown number of levels more>
<input name="3"/>
</td>
</tr>
</table>
</form>
How can I compose a query that returns inputs 1, 2, and 3?
Edit: I should note that I am not interested in grabbing every input element on the page; I only want the input elements that are children of one specific form, so a document-wide "//" is out.
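For illustration, a minimal sketch of one way to do this with lxml (assuming the markup above is held in a string named page): anchor the query at the form, then use the descendant axis so only inputs inside that form match, at any depth.

from lxml import html

page = '''<form name="blah">
<input name="1"/><input name="2"/>
<table><tr><td><div><input name="3"/></div></td></tr></table>
</form>'''

doc = html.fromstring(page)
# the '//input' step after the form step is relative to each matched form,
# so only inputs inside form[@name="blah"] are returned, however deep
inputs = doc.xpath('//form[@name="blah"]//input')
print([i.get('name') for i in inputs])  # ['1', '2', '3']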
Is there a way to strip/escape HTML tags using lxml.html, rather than BeautifulSoup, which has some XSS issues? I tried using clean, but I want to remove all of the HTML.
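A hedged sketch of one approach: run the fragment through lxml's Cleaner to drop active content such as scripts, then take text_content() to discard every remaining tag. (This assumes the lxml.html.clean module that ships with older lxml releases; in recent versions it lives in the separate lxml_html_clean package.)

import lxml.html
from lxml.html.clean import Cleaner

dirty = '<p>Hello <script>alert("xss")</script><b>world</b>!</p>'
# remove script/style bodies first so their text does not survive
safe = Cleaner(scripts=True, javascript=True, style=True).clean_html(dirty)
# text_content() then drops all remaining markup, leaving plain text
print(lxml.html.fromstring(safe).text_content())  # Hello world!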
My code behaves a bit strangely:
import lxml.html
myxml='''
<cooperate>
<job DecreaseHour="1" table="tpa_radio_sum">
</job>
<job DecreaseHour="2" table="tpa_radio_sum">
</job>
<job DecreaseHour="3" table="tpa_radio_sum">
</job>
</cooperate>
'''
root=lxml.html.fromstring(myxml)
nodes1=root.xpath('//job[@DecreaseHour="1"]')
nodes2=root.xpath('//job[@table="tpa_radio_sum"]')
print "nodes1=",nodes1
print "nodes2=",nodes2
What I get is:
nodes1=[] and
nodes2=[ Element job at 0x1241240,
Element job at 0x1362690,
Element job at 0x13626c0]
Why is nodes1 []? It seems very strange. Why?
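As a point of reference, a sketch of the likely cause (an assumption worth verifying): lxml.html normalizes tag and attribute names to lowercase while parsing, so the attribute is stored as decreasehour. Querying in lowercase, or parsing the (well-formed XML) string with lxml.etree instead, both match:

import lxml.html
import lxml.etree

myxml = '<cooperate><job DecreaseHour="1" table="tpa_radio_sum"/></cooperate>'

root = lxml.html.fromstring(myxml)
# the HTML parser lowercased the attribute name
print(root.xpath('//job[@decreasehour="1"]'))   # one match
root = lxml.etree.fromstring(myxml)
# XML parsing is case-sensitive and preserves the original name
print(root.xpath('//job[@DecreaseHour="1"]'))   # one match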
I want to ignore the unicode in my xml; I'm willing to handle it somehow in the output processing instead.
My python:
import urllib2, os, zipfile
from lxml import etree
doc = etree.XML(item)
docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
target = doc.xpath('//references-cited/citation/nplcit/*/text()')
#target = '-'.join(target).replace('\n-','')
print "docID: {0}\nCitation: {1}\n".format(docID,target)
outFile.write(str(docID) +"|"+ str(target) +"\n")
which creates the following output:
docID: US-D0607176-S1-20100105
Citation: [u"\u201cThe birth of Lee Min Ho's donuts.\u201d Feb. 25, 2009. Jazzholic. Apr. 22, 2009 <http://www
However, if I try to re-add the '-'.join(target).replace('\n-','') line, I get this error for both print and outFile.write:
Traceback (most recent call last):
File "C:\Documents and Settings\mine\Desktop\test_lxml.py", line 77, in <module>
print "docID: {0}\nCitation: {1}\n".format(docID,target)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' …
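For what it's worth, a minimal sketch of one common fix under Python 2 (sample data stands in for the parsed values, and out.txt is a hypothetical output file): join the citation as unicode and encode it explicitly before writing, instead of letting str() apply the ascii codec.

# -*- coding: utf-8 -*-
docID = "US-D0607176-S1-20100105"                            # sample value
target = [u"\u201cThe birth of Lee Min Ho's donuts.\u201d"]  # sample value
citation = u'-'.join(target).replace(u'\n-', u'')
line = u"{0}|{1}\n".format(docID, citation)
with open('out.txt', 'w') as outFile:
    outFile.write(line.encode('utf-8'))  # explicit utf-8, no ascii codec involved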
I am using python 2.7 and lxml. My code is as follows:

import urllib
from lxml import html

def get_value(el):
    # note: get_text is not defined in the snippet as posted
    return get_text(el, 'value') or el.text_content()

response = urllib.urlopen('http://www.edmunds.com/dealerships/Texas/Frisco/DavidMcDavidHondaofFrisco/fullsales-504210667.html').read()
dom = html.fromstring(response)
try:
    description = get_value(dom.xpath("//div[@class='description item vcard']")[0].xpath(".//p[@class='sales-review-paragraph loose-spacing']")[0])
except IndexError, e:
    description = ''
The code crashes inside the try block, giving the error:
UnicodeDecodeError at /
'utf8' codec can't decode byte 0x92 in position 85: invalid start byte
The string that fails to encode/decode is: ouldn t
I have tried many techniques, including .encode('utf8'), but none of them solved the problem. I have 2 questions:
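A hedged guess, for illustration: byte 0x92 is a curly apostrophe in Windows-1252 (which would explain the "ouldn t" fragment from a word like "couldn't"), so the page is probably mislabeled as UTF-8. One workaround is to decode the bytes explicitly before handing them to lxml:

import urllib
from lxml import html

url = 'http://www.edmunds.com/dealerships/Texas/Frisco/DavidMcDavidHondaofFrisco/fullsales-504210667.html'
response = urllib.urlopen(url).read()
# 0x92 is a right single quote in cp1252; 'replace' keeps any other
# bad bytes from raising, substituting U+FFFD for them instead
text = response.decode('cp1252', 'replace')
dom = html.fromstring(text)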
How can I download multiple links simultaneously? My script below works, but it only downloads one page at a time and is very slow. I can't figure out how to incorporate multithreading into my script.
The Python script:
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
url = link.get('href')
name = urlparse.urlparse(url).path.split('/')[-1]
dirname = urlparse.urlparse(url).path.split('.')[-1]
f = urllib2.urlopen(url)
s = f.read()
if (os.path.isdir(dirname) == 0):
os.mkdir(dirname)
soup = BeautifulSoup(s)
articleTag = soup.html.body.article
converted = str(articleTag)
full_path = os.path.join(dirname, name)
open(full_path, 'w').write(converted)
print(name)
The HTML file, named links.html:
<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>
<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>
I am trying to parse an HTML document. It contains several tables. I am able to find the right table and get data from it:
for cell in doc.xpath('//table[@class="CE_13"]')[0]:
    for a in cell:
        print a.text_content()
The table consists of 6 columns, and I only need the fifth one. Is it possible to get all the values into a dict (like: { column1 : values_of_clmn1, column2 : values_of_clmn2, ... })? If so, how, and then how do I read from that dict? Or would you suggest a different solution?
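A sketch under assumptions (rows are tr elements, cells are td, and 'page.html' is a hypothetical input file): collect the cells column by column into a dict keyed by column number.

from lxml import html

doc = html.parse('page.html')  # hypothetical file containing the table
table = doc.xpath('//table[@class="CE_13"]')[0]
columns = {}
for row in table.xpath('.//tr'):
    for i, cell in enumerate(row.xpath('./td'), 1):
        # column numbers start at 1; each key maps to that column's values
        columns.setdefault(i, []).append(cell.text_content().strip())
fifth = columns.get(5, [])  # every value from the fifth column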
I am trying to get my script to work properly. So far, it has not managed to output anything.
Here is my test.xml:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="it">
<page>
<title>MediaWiki:Category</title>
<ns>0</ns>
<id>2</id>
<revision>
<id>11248</id>
<timestamp>2003-12-31T13:47:54Z</timestamp>
<contributor>
<username>Frieda</username>
<id>0</id>
</contributor>
<minor />
<text xml:space="preserve">categoria</text>
<sha1>0acykl71lto9v65yve23lmjgia1h6sz</sha1>
<model>wikitext</model>
<format>text/x-wiki</format>
</revision>
</page>
</mediawiki>
And here is my code:
from lxml import etree

def fast_iter(context, func):
    # fast_iter is useful if you need to free memory while iterating through a
    # very large XML file.
    #
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context …
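As a hedged aside on why such scripts often print nothing: the test.xml above declares a default namespace, so iterparse only sees pages under the namespaced tag name. A sketch of a driver for the fast_iter above, under that assumption:

from lxml import etree

NS = '{http://www.mediawiki.org/xml/export-0.8/}'
context = etree.iterparse('test.xml', events=('end',), tag=NS + 'page')

def print_title(elem):
    # every step of the path needs the namespace prefix too
    print(elem.findtext(NS + 'title'))

fast_iter(context, print_title)  # fast_iter as defined in the question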
I have a number of "root" tags, each with a "name" child. I want to sort the "root" blocks alphabetically by their "name" element. I have tried lxml / etree / minidom, but I can't get it working... I can't get it to read the value inside the tag and then sort the parent root tags by it.

<?xml version='1.0' encoding='UTF-8'?>
<roots>
<root>
<path>//1.1.1.100/Alex</path>
<name>Alex Space</name>
</root>
<root>
<path>//1.1.1.101/Steve</path>
<name>Steve Space</name>
</root>
<root>
<path>//1.1.1.150/Bethany</path>
<name>Bethanys</name>
</root>
</roots>
Here is what I tried:
import xml.etree.ElementTree as ET

def sortchildrenby(parent, child):
    parent[:] = sorted(parent, key=lambda child: child)

tree = ET.parse('data.xml')
root = tree.getroot()
sortchildrenby(root, 'name')
for child in root:
    sortchildrenby(child, 'name')
tree.write('output.xml')
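By way of contrast, a sketch of a sort key that reads the text of the <name> child (an assumption about the intent; the lambda above compares the elements themselves, which does not sort by name):

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
roots = tree.getroot()
# reorder the <root> children alphabetically by the text of their <name> child
roots[:] = sorted(roots, key=lambda r: r.findtext('name', default=''))
tree.write('output.xml')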
I am trying to port my code to Google Colaboratory. It is strange: even with
!pip3 install xml
in my code, it still requires me to install lxml.
Has anyone had this problem?
Requirement already satisfied: lxml in /usr/local/lib/python3.6/dist-packages
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-17-eda66c9ec97a> in <module>()
48 #df = financial_statement(2017,3)
...
/usr/local/lib/python3.6/dist-packages/pandas/io/html.py in _parser_dispatch(flavor)
695 else:
696 if not _HAS_LXML:
--> 697 raise ImportError("lxml not found, please install it")
698 return _valid_parsers[flavor]
699
ImportError: lxml not found, please install it
code:
!pip3 install lxml
import requests
import pandas as pd
import numpy as np
import keras
import lxml
import html5lib
from bs4 import BeautifulSoup …
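One possible explanation, offered as an assumption rather than a confirmed diagnosis: pandas checks for lxml once, when its HTML reader is first used, so installing lxml after pandas has already been imported can leave that cached check False until the runtime restarts. A sketch of the usual Colab remedy:

# run this in the first cell, before anything imports pandas;
# if pandas was already imported, restart the runtime afterwards
# (Runtime > Restart runtime) so the parser check runs again
!pip3 install lxml
import lxml
import pandas as pd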