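lxml: converting an Element to an ElementTree

The following test reads a file and uses lxml.html to generate the leaf nodes of the page's DOM/graph.

However, I'm also trying to figure out how to take the input from a "string". Using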

 lxml.html.fromstring(s)


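doesn't work, because it produces an "Element" rather than an "ElementTree".

So, I'm trying to figure out how to convert an Element into an ElementTree.

Thoughts?

Test code: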

import subprocess

import lxml.html
from lxml import etree    # trying this to see if it is needed
                          # to convert from Element to ElementTree

#cmd = 'cat osu_test.txt'
cmd = 'cat o2.txt'
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
s = proc.communicate()[0].strip()

# s contains HTML, not XML text
#doc = lxml.html.parse(s)
doc = lxml.html.parse('osu_test.txt')   # parse() returns an ElementTree
doc1 = lxml.html.fromstring(s)          # fromstring() returns an Element

# print every leaf node (an element with no children) and its path
for node in doc.iter():
    if len(node) == 0:
        print("aaa ", node.tag, doc.getpath(node))
        #print("aaa ", node.tag)

nt = etree.ElementTree(doc1)        # <<<<< doesn't work.. so what will??
for node in nt.iter():
    if len(node) …

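A minimal sketch of the Element-to-ElementTree step, assuming lxml is installed. The inline `html_text` is a made-up stand-in for the contents of the file in the question. Every lxml Element exposes `getroottree()`, which returns the ElementTree the element belongs to, and `etree.ElementTree(element)` wraps an element as the root of a tree; either result offers `iter()` and `getpath()`.

```python
import lxml.html
from lxml import etree

# Hypothetical sample input; the question reads its HTML from a file instead.
html_text = "<html><body><div><p>one</p><p>two</p></div></body></html>"

root = lxml.html.fromstring(html_text)   # an Element (HtmlElement)
tree = root.getroottree()                # the ElementTree that owns the element
tree2 = etree.ElementTree(root)          # alternative: wrap the element explicitly

# Collect the leaf nodes (elements without children) together with their paths.
leaves = [(node.tag, tree.getpath(node)) for node in tree.iter() if len(node) == 0]

for tag, path in leaves:
    print(tag, path)
```

If the string holds only a fragment rather than a full document, `lxml.html.fromstring` may return a wrapper element, so the printed paths can differ from those of the parsed file.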

python lxml element elementtree

tom*_*ith

2015 02-17

12
votes

2
answers

7579
views

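./configure not seeing/finding Boost header files

Trying to build gearman on a Fedora 64 system from gearmand-0.33.tar.gz, obtained from Launchpad using bzr.

Running ./configure by itself, as well as with the "--with-boost=/usr/include" argument, produces warnings/errors, because the configure process can't seem to find the Boost header files.

We have also removed/reinstalled the Boost header files via "yum install boost*".

Any pointers will be tried!

Thanks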

./configure
.
.
.
checking if more special flags are required for pthreads... no
checking for PTHREAD_PRIO_INHERIT... yes
checking for Boost headers version >= 1.39.0... yes
checking for Boost's header version... 1_41
checking for the toolset name used by Boost for g++... gcc44 -gcc
checking boost/program_options.hpp usability... no
checking boost/program_options.hpp presence... yes
configure: WARNING: boost/program_options.hpp: present but cannot be compiled
configure: WARNING: boost/program_options.hpp:     check for missing prerequisite headers?
configure: WARNING: boost/program_options.hpp: …


compiler-construction boost fedora gearman

tom*_*ith

lucky-day

5
votes

2
answers

10k
views

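xpath - using contains with wildcards

I have the following and am trying to see whether there is a better way. I know it can be done using starts-with/contains. I'm testing with Firefox 10, which I believe implements XPath 2.+.

The test nodes are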

<a id="foo">
.
.
.
<a id="foo1">
.
<a id="foo2">


Is there a way to use a wildcard to get the foo1/foo2 nodes?

Something like

//a[@id =* 'foo'] 

or 

//a[contains(@id*,'foo')]


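which would say: give me every "a" whose id starts with "foo" but has additional characters. That would skip the first node, whose id is exactly "foo".

I thought I had seen an article on this, but I can't find it!

As I recall, the article said that XPath has a set of operators that can be used to specify the start/end of a given pattern in a string.

Thanks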
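One way to express "id starts with foo but is longer than foo" uses only XPath 1.0 functions (`starts-with` and `string-length`), with no wildcard syntax. A minimal sketch, assuming lxml is available to evaluate the expression; the sample markup, including the extra `bar` anchor, is made up for illustration.

```python
import lxml.html

# Hypothetical markup modelled on the test nodes in the question.
html_text = (
    '<div>'
    '<a id="foo">a</a>'
    '<a id="foo1">b</a>'
    '<a id="foo2">c</a>'
    '<a id="bar">d</a>'
    '</div>'
)
root = lxml.html.fromstring(html_text)

# Keep anchors whose id begins with "foo" and has at least one more character.
expr = "//a[starts-with(@id, 'foo') and string-length(@id) > string-length('foo')]"
ids = [a.get("id") for a in root.xpath(expr)]
print(ids)
```

The same expression string should work in any XPath 1.0 engine. Replacing the length test with `@id != 'foo'` gives the same result for this sample.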

xpath pattern-matching

tom*_*ith

lucky-day

5
votes

2
answers

20k
views

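scrapy - python question

Maybe this isn't the right place to post, but I'm going to try anyway!

I have a few test Python parsing scripts that I created. They work well enough for me to test what I'm doing.

However, I recently came across Scrapy, the Python framework for web scraping. My app runs as a distributed process across a testbed of multiple servers. I'm trying to understand Scrapy and see whether it offers an advantage over what I'm doing.

So, if possible, I'd really like to talk to some people who have built on, or who use, Scrapy.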

python distributed web-crawler scrapy

tom*_*ith

2011 01-15

4
votes

1
answer

1092
views

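pycurl - 302 redirect/page moved

Trying to use pycurl to successfully fetch a page and its headers (response/request). I can fetch it successfully using java/htmlunit.

I'm missing something subtle in getting to the new/redirected page.

I'm trying to get the "new" redirected URL and then feed it to pycurl to fetch the new page.

Thanks

The sample test code is: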

import pycurl
from io import BytesIO

# user_agent, COOKIEFILE and COOKIEJAR are assumed to be defined elsewhere

# set up the base url for the curl
initurl = "http://louisville.bncollege.com/"

# set up the pycurl object
crl = pycurl.Curl()
qq = BytesIO()
header = BytesIO()

#
# init the curl for the parse
#
test1 = 0
test = 0
while test == 0:
    print("aaaaattttt \n")
    try:
        crl.setopt(pycurl.URL, initurl)
        crl.setopt(pycurl.HEADER, 1)             # appears to allow/disallow the display of the header data
        #crl.setopt(pycurl.HEADER, 0)
        crl.setopt(pycurl.USERAGENT, user_agent)
        crl.setopt(pycurl.FOLLOWLOCATION, 0)     # redirects are not followed automatically
        crl.setopt(pycurl.COOKIEFILE, COOKIEFILE)
        crl.setopt(pycurl.COOKIEJAR, COOKIEJAR)
        crl.setopt(pycurl.WRITEFUNCTION, qq.write)
        crl.setopt(pycurl.HEADERFUNCTION, header.write)
        crl.perform()

        print("aaaappppp \n")
    except pycurl.error as e:
        print("ffff1111 " + str(e.args[0]) + "\n")
        if e.args[0] != "":
            test1 = 1 …

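With `FOLLOWLOCATION` set to 0, the redirect target is only available in the response headers, so one possible approach is to read the `Location` header from the text captured by `HEADERFUNCTION` and resolve it against the request URL. The sketch below uses only the standard library. The helper name `redirect_target` and the sample header bytes are hypothetical, not taken from the site in the question.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)

def redirect_target(raw_headers, base_url):
    """Return the absolute redirect URL found in raw response headers, or None."""
    lines = raw_headers.decode("iso-8859-1").splitlines()
    if not lines:
        return None
    # The status line looks like "HTTP/1.1 302 Found".
    parts = lines[0].split()
    if len(parts) < 2 or not parts[1].isdigit():
        return None
    if int(parts[1]) not in REDIRECT_CODES:
        return None
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            # Location may be relative, so resolve it against the request URL.
            return urljoin(base_url, value.strip())
    return None

# Made-up header block standing in for what header.getvalue() might hold.
sample = b"HTTP/1.1 302 Found\r\nLocation: /webapp/home\r\nContent-Length: 0\r\n\r\n"
print(redirect_target(sample, "http://louisville.bncollege.com/"))
```

The returned URL could then be passed to `crl.setopt(pycurl.URL, ...)` for the next request. Alternatively, setting `pycurl.FOLLOWLOCATION` to 1 lets libcurl follow the redirect itself, and `crl.getinfo(pycurl.EFFECTIVE_URL)` reports the final URL after `perform()`.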

python redirect pycurl http-status-code-302

tom*_*ith

2011 02-20

4
votes

1
answer

4983
views

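XPath: only the directly following siblings

I have the following kind of HTML. The content is grouped by <div id="foo"> and <div id="foo1"> elements, with <div style="padding…"> elements in between.

I'm trying to figure out how to craft an XPath expression that lets me key on "id=foo" and return the sibling <div>s that have "style=padding…".

Getting the <div id="foo"> is trivial. However, I can't just do a following-sibling based on "style=padding…", because that returns all of the matching <div>s.

I need a way to return the matching <div>s only until I hit the sibling matching "id=foo1". I'm pretty sure there's a simple way that I'm missing!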

<div id="foo">stuff...</div>

<div style="padding:2px; ">stuff...</div>

<div id="foo1">stuff...</div>

<div id="foo">stuff...</div>

<div style="padding:2px; ">stuff...</div>
<div style="padding:2px; ">stuff...</div>
<div style="padding:2px; ">stuff...</div>

<div id="foo1">stuff...</div>

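One XPath 1.0 technique for this kind of grouping is to count how many `id="foo"` siblings precede each candidate: a padding `<div>` belongs to the n-th group exactly when n `foo` divs come before it. A minimal sketch using lxml follows. The wrapping `<div id="wrap">` and the text inside each div are made up for illustration, and the sketch assumes the padding divs only appear between a `foo` and its `foo1`.

```python
import lxml.html

# Hypothetical container and contents around the structure from the question.
html_text = """
<div id="wrap">
<div id="foo">head 1</div>
<div style="padding:2px; ">a</div>
<div id="foo1">end 1</div>
<div id="foo">head 2</div>
<div style="padding:2px; ">b</div>
<div style="padding:2px; ">c</div>
<div style="padding:2px; ">d</div>
<div id="foo1">end 2</div>
</div>
"""
root = lxml.html.fromstring(html_text)

groups = []
for foo in root.xpath("//div[@id='foo']"):
    # Position of this foo among its foo siblings (1-based).
    n = foo.xpath("count(preceding-sibling::div[@id='foo'])") + 1
    # Padding divs after this foo that have exactly n foo divs before them.
    members = foo.xpath(
        "following-sibling::div[starts-with(@style, 'padding')]"
        "[count(preceding-sibling::div[@id='foo']) = $n]",
        n=n,
    )
    groups.append([div.text for div in members])

print(groups)
```

`$n` is an XPath variable that lxml fills in from the keyword argument. In an engine without variable support, the number could be written into the expression string instead.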

xpath

tom*_*ith

2010 01-06

3
votes

1
answer

5740
views