Regex is not working in bs4

Question

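I am trying to extract links for a specific file host from the watchseriesfree.to website. In the case below I want the rapidvideo links, so I use a regex to filter for tags whose text contains rapidvideo.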

import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html


def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"

    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))

    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

    return firstVod

print(findLatest())

However, the code above returns an empty list. What am I doing wrong?

Answer 1

ale*_*cxe 6

The problem is here:

firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

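When BeautifulSoup applies your text regex pattern, it tests it against the .string attribute of each matched tr element. However, .string has an important caveat: when an element has more than one child, .string is None: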

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None.

Hence, you get no results.
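
The caveat can be reproduced with a minimal sketch. The two HTML snippets below are made up for illustration and are not the real page markup. The sketch uses string=, which is the newer name of the text= argument in Beautiful Soup 4.4.0 and later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the actual watchseriesfree.to page).
# First table: the row has exactly one cell, so .string can be resolved.
simple = BeautifulSoup(
    "<table><tr><td>rapidvideo.com</td></tr></table>", "html.parser"
)
# Second table: the row has two cells, so .string is ambiguous.
nested = BeautifulSoup(
    "<table><tr><td>rapidvideo.com</td><td>Watch</td></tr></table>",
    "html.parser",
)

single_string = simple.tr.string  # resolves through the single <td> child
multi_string = nested.tr.string   # None, because <tr> has two children

pattern = re.compile("rapidvideo")
simple_matches = simple.find_all("tr", string=pattern)  # the row is found
nested_matches = nested.find_all("tr", string=pattern)  # nothing is found

print(repr(single_string), repr(multi_string))
print(len(simple_matches), len(nested_matches))
```

Real table rows almost always hold several cells, which is presumably why the original search came back empty.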

What you can do instead is check the actual text of each tr element by passing a function to the search and calling .get_text() inside it:

soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())
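
As a hedged usage sketch, the same filter can be written as a named function and combined with a second search to pull the links out of the matching rows. The markup, the has_host helper, and the /open/... paths are all hypothetical, chosen only to make the example self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the episode page (an assumption).
html = """
<table>
  <tr><td>rapidvideo.com</td><td><a href="/open/1">Watch</a></td></tr>
  <tr><td>otherhost.example</td><td><a href="/open/2">Watch</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def has_host(tag, host="rapidvideo"):
    """Return True for <tr> tags whose combined text mentions the host."""
    return tag.name == "tr" and host in tag.get_text()


# find_all calls the function once per tag and keeps those returning True.
rows = soup.find_all(has_host)

# Collect the href of every link found inside the matching rows.
links = [a["href"] for row in rows for a in row.find_all("a", href=True)]
print(links)
```

Unlike .string, .get_text() concatenates the text of all descendants, so the check works no matter how many cells a row contains.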
