How do I convert this XPath expression to BeautifulSoup?

Question

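In answers to a previous question, several people suggested that I use BeautifulSoup for my project. I've been struggling with its documentation and I just can't parse it. Can someone point me to the section that would let me translate this expression into a BeautifulSoup expression?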

hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')

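The expression above is from Scrapy. I'm trying to apply the regex re('\.a\w+') to the td elements with class altRow to get the links from there.

I'd also appreciate pointers to any other tutorials or documentation. I couldn't find any.

Thanks for your help.

Edit: I'm looking at this page: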

>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>

However, if you look at the page source, "/cabel" is there:

 <td class="altRow" valign="middle" width="34%"> 
 <a href='/cabel'>Abel, Christian</a>

For some reason BeautifulSoup doesn't see the search results, but XPath does, because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches "/cabel".

Edit: cobbal: it still isn't working. But when I search for this:

>>> soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>

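It returns all the links whose path contains a segment with "a" as its second character, but not the lawyer names. So for some reason BeautifulSoup doesn't see those links (such as "/cabel"). I don't understand why.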
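
A side note on the pattern itself, as a minimal sketch using only the standard re module. The result list above includes "/FCWSite/Include/styles/main.css", which does not start with the pattern, so BeautifulSoup appears to apply a compiled regex with search semantics (a match anywhere in the attribute value). The href strings below are copied from the output and page source above; the helper name matched_part is made up for illustration.

```python
import re

pattern = re.compile(r"/.a\w+")

# href values copied from the findAll output above, plus "/cabel"
# from the page source and "/diversity/committee" from the earlier session
hrefs = [
    "/FCWSite/Include/styles/main.css",
    "/careers/northamerica",
    "/diversity/manager",
    "/diversity/committee",
    "/cabel",
]

def matched_part(href):
    """Return the first substring the pattern matches anywhere in href, or None."""
    m = pattern.search(href)
    return m.group() if m else None

for href in hrefs:
    print(href, "->", matched_part(href))
```

The pattern on its own does match "/cabel", which suggests the regex is not the reason the link is missing; it is more likely that the tag never made it into the parsed tree, though the source does not confirm the cause.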

Answer 1

cob*_*bal 6

One option is to use lxml (I'm not familiar with BeautifulSoup, so I can't say how to do it with that), which supports XPath by default.

Edit:
Try this (tested):

soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)

I used the docs at http://www.crummy.com/software/BeautifulSoup/documentation.html

soup should be a BeautifulSoup object:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html_string)
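
One caveat, sketched here under assumptions: in XPath, the [2] in //td[@class="altRow"][2] counts position within each parent, so it selects the second altRow cell of every row, while indexing findAll(...)[1] takes the second altRow cell of the whole document. The sketch below uses the newer bs4 package (an assumption; the snippet above uses the older BeautifulSoup 3 import) and a hypothetical two-row table built from names that appear on this page. The function name second_altrow_links is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page layout may differ.
html = """
<table>
  <tr>
    <td class="altRow"><a href='/cabel'>Abel, Christian</a></td>
    <td class="altRow"><a href="/jacevedo">Acevedo, Linda Jeannine</a></td>
  </tr>
  <tr>
    <td class="altRow"><a href="/igbadegesin">Adeyemi, Ike</a></td>
    <td class="altRow"><a href="/aadler">Adler, Avraham</a></td>
  </tr>
</table>
"""

pattern = re.compile(r"/.a\w+")

def second_altrow_links(soup):
    """Roughly emulate //td[@class="altRow"][2]/a/@href followed by .re(pattern)."""
    links = []
    # Note: class_="altRow" also matches cells with extra classes,
    # whereas XPath's @class="altRow" requires an exact attribute value.
    for td in soup.find_all("td", class_="altRow"):
        # Position is counted among the altRow cells of the same parent row.
        siblings = td.parent.find_all("td", class_="altRow", recursive=False)
        if len(siblings) >= 2 and siblings[1] is td:
            # recursive=False keeps only direct <a> children, like /a in XPath.
            for a in td.find_all("a", href=True, recursive=False):
                match = pattern.search(a["href"])
                if match:
                    links.append(match.group())
    return links

soup = BeautifulSoup(html, "html.parser")
print(second_altrow_links(soup))
```

On this sample the function returns the second cell's link from both rows, whereas indexing the document-wide list with [1] would only reach the first row.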

Answer 2

Pau*_*McG 4

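I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape a few substrings out of some HTML, and pyparsing has some useful methods for doing that. Using this code: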

from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib

# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()

# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes,
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))

# compose total matching pattern (add trailing tdStart to filter out
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart

# scan input HTML source for matching refs, and print out the text and
# href values
for ref,s,e in patt.scanString(html):
    print ref.text, ref.a.href

I extracted 914 references from your page, from Abel to Zupikova.

Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova

Archived: 16 years ago
Views: 8887
Last activity: 11 years ago