Find the nearest link with BeautifulSoup (Python)

And*_*eas 7 python lxml beautifulsoup

I'm working on a small project where I extract the political leaders mentioned in newspapers. Sometimes a politician is mentioned and neither a parent nor a child element has a link (because of semantically poor markup, I guess).

So I want to create a function that can find the nearest link and then extract it. In the case below the search string is Rasmussen, and the link I want is /307046.

#-*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import re

tekst = '''
<li>
  <div class="views-field-field-webrubrik-value">
    <h3>
      <a href="/307046">Claus Hjort spiller med mrkede kort</a>
    </h3>
  </div>
  <div class="views-field-field-skribent-uid">
    <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
  </div>
  <div class="views-field-field-webteaser-value">
    <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
      trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
      snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
      genkomst som statsministe
    </div>
  </div>
  <span class="views-field-view-node">
    <span class="actions">
      <a href="/307046">Ls mere</a>
      |
      <a href="/307046/#comments">Kommentarer (4)</a>
    </span>
  </span>
</li>
'''

to_find = "Rasmussen"
soup = BeautifulSoup(tekst)
contexts = soup.find_all(text=re.compile(to_find)) 

def find_nearest(element, url, direction="both"):
    """Find the nearest link, relative to a text string.
    When complete it will search up and down (parent, child),
    and only X levels up down. These features are not implemented yet.
    Will then return the link the fewest steps away from the
    original element. Assumes we have already found an element"""

    # Is the nearest link readily available?
    # If so - this works and extracts the link.
    if element.find_parents('a'):
        for artikel_link in element.find_parents('a'):
            link = artikel_link.get('href')
            # sometimes the link is a relative link - sometimes it is not
            if ("http" or "www") not in link:
                link = url+link
                return link
    # But if the link is not readily available, we will go up
    # This is (I think) where it goes wrong
    # ???????????????????????????????????
    if not element.find_parents('a'):
        element =  element.parent
        # Print for debugging
        print element # on the 2nd run (i.e. <li>) this finds <a href=/307046>
        # So shouldn't it be caught as readily available above?
        print u"Found: %s" % element.name
        # the recursive call
        find_nearest(element,url)

# run it
if contexts:
    for a in contexts:
        find_nearest( element=a, url="http://information.dk")

The following direct call works:

print contexts[0].parent.parent.parent.a['href'].encode('utf-8')

For reference, the whole regrettable code is on bitbucket: https://bitbucket.org/achristoffersen/politikere-i-medierne

(ps: using BeautifulSoup 4)


Edit: SimonSapin asked me to define nearest: by nearest I mean the link that is the fewest nesting levels away from the search term, in either direction. In the text above, the a href produced by the Drupal-based newspaper site is neither a direct parent nor a child of the tag where the search string is found, so BeautifulSoup can't find it.

I suspect "fewest characters away" would often work too. In that case it could be hacked together with find and rfind, but I'd really like to do it with BS. Since this works: contexts[0].parent.parent.parent.a['href'].encode('utf-8'), it must be possible to generalize it into a script.

Edit: maybe I should emphasize that I'm looking for a BeautifulSoup solution. Combining BS with a custom/simple breadth-first search, as @erik85 suggests, quickly gets messy.
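For what it's worth, the "fewest nesting levels" definition maps naturally onto a breadth-first search over the BeautifulSoup parse tree itself, treating each node's parent and children as its neighbours. A rough sketch under that assumption (the function name and traversal details are mine, not from any answer here; Python 3):

```python
import re
from collections import deque
from bs4 import BeautifulSoup

def find_nearest_link(soup, text):
    """Breadth-first search outwards from the text node matching `text`.
    Neighbours of a node are its parent plus its children, so the first
    <a href=...> reached is the one with the fewest nesting steps away."""
    start = soup.find(string=re.compile(text))
    if start is None:
        return None
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if id(node) in seen:          # id() avoids Tag.__eq__ pitfalls
            continue
        seen.add(id(node))
        if getattr(node, "name", None) == "a" and node.get("href"):
            return node["href"]
        if node.parent is not None:   # go up...
            queue.append(node.parent)
        queue.extend(getattr(node, "contents", []))  # ...and down

# A trimmed-down version of the question's markup:
html = ('<li><div><h3><a href="/307046">headline</a></h3></div>'
        '<div><div>... Rasmussens genkomst ...</div></div></li>')
soup = BeautifulSoup(html, "html.parser")
print(find_nearest_link(soup, "Rasmussen"))  # prints /307046
```

The first <a> dequeued is the one with the fewest parent/child steps from the matched text, which is exactly the "nearest" defined above.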

eri*_*ork 12

Someone will probably come up with a copy-and-paste solution, and you'll think it solves your problem. But your problem is not the code! It's your strategy. There is a software design principle called "divide and conquer" that you should apply when redesigning your code: separate the code that interprets the HTML/string as a tree/graph from the code that searches for the nearest node (probably a breadth-first search). Not only will you learn to design better software, your problem will probably cease to exist.

I think you are smart enough to solve this yourself, but I also want to provide a skeleton:

def parse_html(txt):
    """ reads a string of html and returns a dict/list/tuple presentation"""
    pass

def breadth_first_search(graph, start, end):
    """ finds the shortest way from start to end
    You can probably customize start and end to work well with the input you want
    to provide. For implementation details see the link in the text above.
    """
    pass

def find_nearest_link(html,name):
    """putting it all together"""
    return breadth_first_search(parse_html(html),name,"link")

PS: Doing it this way also applies another principle, this one from mathematics: assume there is a problem you don't know a solution to (finding a link close to a chosen substring) and a group of problems you do know solutions to (graph traversal). Then try to transform your problem so that it matches the group you can solve, so you can use basic solution patterns (probably already implemented in your language/framework of choice), and you're done.
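As an illustration of that transformation, here is a minimal, self-contained breadth-first search over a plain adjacency dict; the graph below is a hand-built toy mirror of the question's markup (the node names are invented for the example):

```python
from collections import deque

def breadth_first_search(graph, start):
    """Return the first node whose name starts with 'link:' reachable
    from `start`, walking the graph level by level (fewest hops first).
    `graph` maps each node to its neighbours (parent plus children)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node.startswith("link:"):
            return node
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return None

# Toy graph: the text node and the link share the <li> ancestor.
graph = {
    "text:Rasmussen": ["div:webteaser"],
    "div:webteaser": ["text:Rasmussen", "div:teaser-value"],
    "div:teaser-value": ["div:webteaser", "li"],
    "li": ["div:teaser-value", "div:webrubrik"],
    "div:webrubrik": ["li", "h3"],
    "h3": ["div:webrubrik", "link:/307046"],
}
print(breadth_first_search(graph, "text:Rasmussen"))  # prints link:/307046
```

Filling in `parse_html` from the skeleton then reduces to producing such an adjacency mapping from the soup.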

  • +1 Very good conceptual answer, for someone who probably needs principles rather than code. (2 upvotes)

unu*_*tbu 2

Here is a solution using lxml. The main idea is to find all preceding and following elements and then iterate over them round-robin:

def find_nearest(elt):
    preceding = elt.xpath('preceding::*/@href')[::-1]
    following = elt.xpath('following::*/@href')
    parent = elt.xpath('parent::*/@href')
    for href in roundrobin(parent, preceding, following):
        return href

A similar solution using BeautifulSoup's (or bs4's) next_elements and previous_elements should also be possible.
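For illustration, here is one way such a bs4 variant might look, using find_all_previous/find_all_next (which walk the same document-order streams) and a Python 3 rendition of the round-robin; the names here are mine. Like the lxml answer, it measures distance in document order rather than nesting depth:

```python
import re
from bs4 import BeautifulSoup

def roundrobin_(*iterables):
    """Python 3 rendition of the itertools 'roundrobin' recipe."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        for it in list(iterators):
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)

def find_nearest_bs4(elt):
    """Alternate between the <a> tags before and after `elt` in
    document order; return the first href found."""
    preceding = (t["href"] for t in elt.find_all_previous("a") if t.get("href"))
    following = (t["href"] for t in elt.find_all_next("a") if t.get("href"))
    for href in roundrobin_(preceding, following):
        return href

html = ('<li><h3><a href="/307046">headline</a></h3>'
        '<div>... Rasmussens genkomst ...</div></li>')
soup = BeautifulSoup(html, "html.parser")
node = soup.find(string=re.compile("Rasmussen"))
print(find_nearest_bs4(node))  # prints /307046
```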

import lxml.html as LH
import itertools

def find_nearest(elt):
    preceding = elt.xpath('preceding::*/@href')[::-1]
    following = elt.xpath('following::*/@href')
    parent = elt.xpath('parent::*/@href')
    for href in roundrobin(parent, preceding, following):
        return href

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # http://docs.python.org/library/itertools.html#recipes
    # Author: George Sakkis
    pending = len(iterables)
    nexts = itertools.cycle(iter(it).next for it in iterables)
    while pending:
        try:
            for n in nexts:
                yield n()
        except StopIteration:
            pending -= 1
            nexts = itertools.cycle(itertools.islice(nexts, pending))

tekst = '''
<li>
  <div class="views-field-field-webrubrik-value">
    <h3>
      <a href="/307046">Claus Hjort spiller med mørkede kort</a>
    </h3>
  </div>
  <div class="views-field-field-skribent-uid">
    <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
  </div>
  <div class="views-field-field-webteaser-value">
    <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
      trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
      snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
      genkomst som statsministe
    </div>
  </div>
  <span class="views-field-view-node">
    <span class="actions">
      <a href="/307046">Læs mere</a>
      |
      <a href="/307046/#comments">Kommentarer (4)</a>
    </span>
  </span>
</li>
'''

to_find = "Rasmussen"
doc = LH.fromstring(tekst)

for x in doc.xpath('//*[contains(text(),{s!r})]'.format(s = to_find)):
    print(find_nearest(x))

which yields

/307046