Tag: bs4

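BS4 select_one vs find

I would like to know the difference between bs.find('div') and bs.select_one('div'). The same question applies to find_all and select.

Is there any difference in performance, or are there specific cases where one is preferable to the other?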
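A minimal sketch of the comparison being asked about, assuming a small hypothetical HTML snippet (the markup, ids and class names below are invented for illustration). For a bare tag name the two APIs should return the same elements; the practical difference is that select and select_one take a CSS selector string, while find and find_all take a tag name plus keyword filters. This sketch says nothing about performance.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = (
    "<div id='a'><p class='x'>one</p></div>"
    "<div id='b'><p class='x'>two</p></div>"
)
bs = BeautifulSoup(html, "html.parser")

# For a plain tag name, both calls locate the first matching element.
first_by_find = bs.find("div")
first_by_select = bs.select_one("div")

# The same holds for the "all matches" variants.
all_by_find = bs.find_all("div")
all_by_select = bs.select("div")

# A nested lookup: one CSS selector versus chained find() calls.
nested_by_select = bs.select_one("div#b p.x")
nested_by_find = bs.find("div", id="b").find("p", class_="x")

print(first_by_find == first_by_select)
print(nested_by_select.get_text(), nested_by_find.get_text())
```

The nested lookup is where the two styles start to read differently: the selector expresses the whole path in one string, whereas find needs one call per level.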

python beautifulsoup html-parsing bs4

Sal*_*med

2016 08-19

5
Votes

1
Answers

2620
Views

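Python Beautiful Soup: 'ascii' codec can't encode character u'\xa5'

When scraping certain elements of a web page, I run into some strange characters. The characters that seem to cause the error are: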

？?? ??了¢¢阿？/？/>？/ ??? ？/¢¥?? %%？Á？？？？？一个？？>/???¥??>¥？¥©Á？>¢¥/ %% /¥??>？Â>Á？一个？Á？???¢%Á？¥??? /%Á%Á？¥??> ?? />？Â??了？??¥?? ??¢¥????¢`¢¢¢ ?? %%？Á??À？/？Á？¥？_Á？¥？> ??Á/¢？>ÀÁ??? Á>¥?? ??¥阿？/>？?? __？> ?? /¥??>¢？Á

My code is as follows:

url = "http://www.nsf.gov#######@#@#@##"
# webbrowser.open(url, new=new)
flagcnt += 1
if flagcnt % 20 == 0:  # auto-sleep to avoid being shut out
    print "flagcount: "
    print flagcnt
    time.sleep(5)
# Program code extraction
r = requests.get(url)
sp = BeautifulSoup(r.content)


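Page: http://www.nsf.gov/awardsearch

I have read every page I could find about this error. Some suggest decoding and encoding, but that did not seem to help. I do not know which encoding is used here. I have already downgraded my BS version, but that did not help either. Any help is appreciated. Python 2.7, BS 4.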

html python beautifulsoup web-scraping bs4

Pul*_*waj

2015 04-17

4
Votes

1
Answers

8742
Views

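BS4 and BeautifulSoup error: from: can't read /var/mail/BeautifulSoup

Running from BeautifulSoup import BeautifulSoup immediately responds with the error "from: can't read /var/mail/BeautifulSoup". I tried the same with BS4 and got the same result. I uninstalled and reinstalled BS4 and BeautifulSoup using the Synaptic package manager, with the same result. I tried a complete removal and got the same result. In the terminal, BS4 and BeautifulSoup show as not installed.

Using Python 2.7.6

I reviewed the existing question, but it had only 2 replies and neither helped.

Any suggestions?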

python beautifulsoup bs4

Yan*_*e26

lucky-day

4
Votes

1
Answers

4399
Views

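What is the practical difference between these two ways of making a web connection in Python?

I have noticed that there are several ways to open an HTTP connection for web scraping. I am not sure whether some are more recent and up-to-date ways of coding it, or whether they are simply different modules with different advantages and disadvantages. More specifically, I am trying to understand the difference between the following two approaches, and which would you recommend?

1) Using urllib3: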

from urllib3 import PoolManager
from bs4 import BeautifulSoup

http = PoolManager()
r = http.urlopen('GET', url, preload_content=False)
soup = BeautifulSoup(r, "html.parser")


2) Using requests:

import requests
from bs4 import BeautifulSoup

html = requests.get(url).content
soup = BeautifulSoup(html, "html5lib")


What distinguishes these two options, apart from the simple fact that they require importing different modules?

http urllib3 python-3.x python-requests bs4

Vic*_*gos

lucky-day

3
Votes

1
Answers

449
Views

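Regex not working in bs4

I am trying to extract some links for a specific file host from the watchseriesfree.to website. In the case below I want the rapidvideo links, so I use a regex to filter for tags whose text contains rapidvideo.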

import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html


def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"

    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))

    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

    return firstVod

print(findLatest())


However, the code above returns an empty list. What am I doing wrong?
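One possible cause, offered as a hedged sketch rather than a diagnosis of the actual page: when a text filter is combined with a tag name, Beautiful Soup compares the pattern against the tag's .string, and .string is None for a tag with more than one child, which is typical of a table row. The markup below is hypothetical, since the real page structure is not shown in the question. A function filter that tests get_text() is one way around it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical table rows; the real page markup is not shown in the question.
html = (
    "<table>"
    '<tr><td><a href="https://example.com/r/abc">rapidvideo.com</a></td><td>Watch</td></tr>'
    '<tr><td><a href="https://example.com/o/xyz">otherhost.com</a></td><td>Watch</td></tr>'
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("rapidvideo")

# A row with several cells has no single string, so .string is None.
first_row_string = soup.find("tr").string

# Filter rows on their combined text instead of on .string.
def row_mentions_host(tag):
    return tag.name == "tr" and bool(pattern.search(tag.get_text()))

matching_rows = soup.find_all(row_mentions_host)
links = [a["href"] for row in matching_rows for a in row.find_all("a", href=True)]
print(links)
```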

python regex urllib2 bs4

Ech*_*yak

lucky-day

3
Votes

1
Answers

542
Views

Import error with BeautifulSoup

I installed BeautifulSoup with pip3 install beautifulsoup, and it worked fine.

But when I try from bs4 import BeautifulSoup or import BeautifulSoup, I get the error ModuleNotFoundError: No module named 'BeautifulSoup' or ModuleNotFoundError: No module named 'bs4', depending on which line of code I use.

I do not know what went wrong. Why am I getting this error?

python beautifulsoup bs4

elm*_*ado

2017 04-03

3
Votes

1
Answers

10k
Views

How to extract a div's attribute value using BeautifulSoup

I have a div with the id "img-cont"

<div class="img-cont-box" id="img-cont" style='background-image: url("http://example.com/example.jpg");'>


I want to use Beautiful Soup to extract the URL from the background image. How can I do that?
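One way this could be approached, as a sketch: Beautiful Soup exposes the style attribute as a plain string and does not parse CSS, so the URL has to be pulled out of that string separately. The regular expression below is an assumption that fits the url("...") form shown in the question. The closing div tag was added to make the snippet complete.

```python
import re
from bs4 import BeautifulSoup

# The div from the question, closed so that the snippet is well-formed.
html = """<div class="img-cont-box" id="img-cont" style='background-image: url("http://example.com/example.jpg");'></div>"""
soup = BeautifulSoup(html, "html.parser")

# Look the div up by id and read its style attribute as a string.
div = soup.find("div", id="img-cont")
style = div["style"]

# Capture whatever sits inside url(...), with or without quotes.
match = re.search(r"url\(\s*['\"]?(.*?)['\"]?\s*\)", style)
image_url = match.group(1) if match else None
print(image_url)
```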

python bs4

lat*_*ish

lucky-day

3
Votes

1
Answers

1134
Views

Scraping a class within a class

I want to get class_="href" inside class_="_e4d". Basically, I want to use BeautifulSoup to scrape a class within a class.

from bs4 import BeautifulSoup
import selenium.webdriver as webdriver

url = ("https://www.google.com/search?...")

def get_related_search(url):
    driver = webdriver.Chrome("C:\\Users\\John\\bin\\chromedriver.exe")
    driver.get(url)
    soup = BeautifulSoup(driver.page_source)
    relate_result = soup.find_all("p", class_="_e4b")
    return relate_result[0]

relate_url = get_related_search(url)
print(relate_url)


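Result: markup_type=markup_type)) p class="_e4b"}{a href="/search?...a}{/p}

I now want to grab the href from the result. I am not sure what the next step would be. Thanks for your help.

Note: I replaced <> with {} because it did not display as an HTML script.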
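A possible next step, sketched against a stand-in for the result above. href is an attribute of the a tag, not a class, so it can be read with dictionary-style access once the nested tag is found. The href value "/search?q=example" is a hypothetical placeholder, because the real value is elided in the question. Passing "html.parser" explicitly should also silence the parser warning whose tail appears at the start of the printed result.

```python
from bs4 import BeautifulSoup

# Stand-in markup; the href value is a hypothetical placeholder.
html = '<p class="_e4b"><a href="/search?q=example">example</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Walk from each matching paragraph down to the links inside it.
hrefs = [
    a["href"]
    for p in soup.find_all("p", class_="_e4b")
    for a in p.find_all("a", href=True)
]

# The same lookup written as a single CSS selector.
hrefs_css = [a["href"] for a in soup.select("p._e4b a[href]")]
print(hrefs)
```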

python selenium webdriver beautifulsoup bs4

Mws*_*cer

2017 05-14

3
Votes

1
Answers

1331
Views

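How to install BeautifulSoup4 for python3 on a Mac

I have the original Python 2.7.5 at /usr/bin/python, and I installed Python 3 by downloading the Python 3.5.1 package, which went to /usr/local/bin/python3. I then installed BeautifulSoup4 as follows: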

sudo easy_install BeautifulSoup4
Searching for BeautifulSoup4
Best match: beautifulsoup4 4.4.1
Processing beautifulsoup4-4.4.1-py2.7.egg
beautifulsoup4 4.4.1 is already the active version in easy-install.pth

Using /Library/Python/2.7/site-packages/beautifulsoup4-4.4.1-py2.7.egg
Processing dependencies for BeautifulSoup4
Finished processing dependencies for BeautifulSoup4


This way I cannot use bs4 in python3. How do I install bs4 for python3?

beautifulsoup python-2.7 python-3.x bs4

mik*_*ang

2016 02-21

2
Votes

1
Answers

6280
Views

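Accessing the title attribute of tags in a web page using Python and BeautifulSoup

I am new to Python and am trying to retrieve all the titles from a specific URL, but I cannot get it to work. The code runs without any errors, yet I still get no output.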

import requests
import sys
from bs4 import BeautifulSoup

def test_function(num):
    url = "https://www.zomato.com/chennai/restaurants?buffet=1&page=" + str(num)
    source_code = requests.get(url) 
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll('title'):
        print(link)
test_function(1)
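A small sketch of the distinction the title of this question hints at: findAll('title') looks for title elements, whereas the title attribute of other tags can be selected by passing title=True as a filter. The markup and class name below are hypothetical and only illustrate the lookup. They are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure is not shown in the question.
html = (
    '<a class="result" title="Restaurant A" href="/a">A</a>'
    '<a class="result" title="Restaurant B" href="/b">B</a>'
    '<a href="/c">no title attribute</a>'
)
soup = BeautifulSoup(html, "html.parser")

# title=True keeps only tags that carry a title attribute.
titles = [tag["title"] for tag in soup.find_all(title=True)]
print(titles)
```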


python beautifulsoup bs4

RDP*_*DPD

2015 04-17

1
Votes

1
Answers

2190
Views

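Counting the number of digits or letters on a page in Python

I make a request in Python using requests.

I then use bs4 to select the desired div. I now want to count the length of the text in that div, but the string I get from it also includes all the tags, for example: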

<div><a class="some_class">Text here!</a></div>


I want to count only Text here!, without all the div and a tags.

Does anyone know how I can do this?
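A minimal sketch using the div from the question: get_text() returns only the text nodes of a tag and its descendants, so the markup is excluded from the count. The separate count of letters and digits is an extra added here for illustration, since the question title mentions both.

```python
from bs4 import BeautifulSoup

html = '<div><a class="some_class">Text here!</a></div>'
soup = BeautifulSoup(html, "html.parser")

# get_text() drops the tags and keeps only the text content.
div = soup.find("div")
text = div.get_text()

# Total number of characters, including the space and the punctuation.
length = len(text)

# Letters and digits only.
alnum_count = sum(ch.isalnum() for ch in text)
print(text, length, alnum_count)
```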

html python beautifulsoup bs4

Luk*_*uka

2015 11-06

1
Votes

1
Answers

535
Views

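Extracting the URL of a stored HTML file

I have stored some HTML files and renamed them. Is there any way to extract the URL of an HTML file in Python?

Edit: I want to find the URL of the .html file itself, not the links present inside it. I am looking for a general approach because I have many files.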

python urllib2 bs4

Abh*_*tia

2015 05-19

-1
Votes

1
Answers

104
Views

Tag statistics

bs4 ×12

python ×10

beautifulsoup ×8

html ×2

python-3.x ×2

urllib2 ×2

html-parsing ×1

http ×1

python-2.7 ×1

python-requests ×1

regex ×1

selenium ×1

urllib3 ×1

web-scraping ×1

webdriver ×1
