Tag: beautifulsoup

How do I extract HTML with beautifulsoup4?

The HTML looks like this:

<td class='Thistd'><a ><img /></a>Here is some text.</td>


I only want the string inside the <td>. I don't need the <a>...</a>. How can I do that?

My code:

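from bs4 import BeautifulSoup
html = """<td class='Thistd'><a><img /></a>Here is some text.</td>"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
tds = soup.find_all('td', {'class': 'Thistd'})
for td in tds:
    print(td)  # prints the whole <td> element, markup included
    print('=============')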


What I get is <td class='Thistd'><a ><img /></a>Here is some text.</td>

But I only need Here is some text.
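
One possible approach, offered as a sketch rather than a definitive answer: collect only the text nodes that are direct children of the <td>, so anything nested inside <a> is skipped. The explicit "html.parser" argument and the variable name own_text are choices made for this example, not part of the question.

```python
from bs4 import BeautifulSoup

html = """<td class='Thistd'><a><img /></a>Here is some text.</td>"""

soup = BeautifulSoup(html, "html.parser")
for td in soup.find_all("td", class_="Thistd"):
    # recursive=False keeps only the strings that sit directly inside <td>,
    # so text nested in <a> (or any other child tag) is ignored.
    own_text = "".join(td.find_all(string=True, recursive=False)).strip()
    print(own_text)
```

If the cell may contain text both before and after the link, the same call still returns every direct string, joined in document order.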

python beautifulsoup

jia*_* Ma

2015-10-15

Score: 1 | Answers: 1 | Views: 68

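Extracting numbers from a website using beautifulsoup in Python

I'm trying to use urllib to fetch an HTML page and then beautifulsoup to extract the data. I want to get all the numbers from comments_42.html, print their sum, and then show how many numbers there are. Here is my code; I tried using a regular expression, but it doesn't work for me.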

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'http://python-data.dr-chuck.net/comments_42.html'
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('span')  # calling the soup is shorthand for find_all
for tag in tags:
    print(tag)

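
A minimal sketch of the counting and summing step, assuming each <span> holds one integer. The question does not show the page's markup, so sample_html and the helper name count_and_sum_spans are hypothetical stand-ins; with the real page, the downloaded HTML would be passed in instead.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = """
<table>
  <tr><td>Alice</td><td><span class="comments">97</span></td></tr>
  <tr><td>Bob</td><td><span class="comments">90</span></td></tr>
  <tr><td>Carol</td><td><span class="comments">12</span></td></tr>
</table>
"""


def count_and_sum_spans(html):
    """Return (how many numbers, their sum) found inside <span> tags."""
    soup = BeautifulSoup(html, "html.parser")
    numbers = []
    for tag in soup("span"):
        # get_text() drops the markup; the regex then picks out runs of digits.
        numbers.extend(int(n) for n in re.findall(r"\d+", tag.get_text()))
    return len(numbers), sum(numbers)


count, total = count_and_sum_spans(sample_html)
print("Count", count)
print("Sum", total)
```

Running the regular expression on the tag's text, rather than on the printed tag, keeps digits in attribute values out of the result.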

python regex beautifulsoup

Sal*_*sha

Score: 1 | Answers: 1 | Views: 10k

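Scraping a PDF from a URL with Python

I want to scrape text from the URL "http://www.nycgo.com/venues/thalia-restaurant#menu"; the text I'm interested in is in the page's "Menu" tab. I tried BeautifulSoup to get all the text on the page, but the output of the following code is missing all of the text in the menu.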

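from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

# Fetch the page and parse the HTML that the server returns.
html = urlopen("http://www.nycgo.com/venues/thalia-restaurant#menu").read()
soup = BS(html, "html.parser")
print(soup.get_text())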


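When I inspect the menu elements, the menu content appears to be part of the page's HTML. I did notice that, when actually browsing the page, the menu takes a few seconds to load completely. I'm not sure whether that is why the code above fails to get the menu content.

Any insight would be appreciated.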
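
One thing worth checking, sketched here under assumptions: content that appears a few seconds after the page loads is often pulled in by a script or an embedded frame, so it is not in the HTML that urlopen returns, even though the browser's inspector shows it. The sample markup, the example.com URL, and the helper name external_sources are all hypothetical; the real page may load its menu differently.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML the server returns: the menu container
# holds only a frame, and the rest is filled in later by the browser.
sample_html = """
<div id="menu"><iframe src="https://example.com/menu.pdf"></iframe></div>
<script src="/static/app.js"></script>
"""


def external_sources(html):
    """List the src of every iframe, embed, or script in the static HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True matches only tags that actually carry a src attribute.
    tags = soup.find_all(["iframe", "embed", "script"], src=True)
    return [tag["src"] for tag in tags]


for src in external_sources(sample_html):
    print(src)
```

If the menu turns out to live at one of those addresses (for instance a PDF, as the title suggests), it would need a second request; if it is built by JavaScript, a plain HTTP fetch will not see it.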

html python beautifulsoup

Cam*_*slu

Score: 1 | Answers: 1 | Views: 5509

Can this semi-structured text file be parsed with BeautifulSoup or regex in Python?

How do I parse this text file and extract only the first value from each line?

file.txt:

HTTP://google.com,username2,mypassword1
HTTP://yahoo.com,username3,mypassword2
HTTP://ebay.com,username4,mypassword7

Expected output:

http://google.com
http://yahoo.com
http://ebay.com


Is this possible with Beautiful Soup or some kind of regular expression?
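
Since the file is comma-separated text rather than HTML, neither BeautifulSoup nor a regular expression is strictly needed. A minimal sketch using the standard csv module follows; the helper name first_fields is made up, and the file's lines are inlined so the example runs on its own.

```python
import csv

# The lines from file.txt, inlined; blank lines are included to show
# that they are skipped.
sample_lines = [
    "HTTP://google.com,username2,mypassword1",
    "",
    "HTTP://yahoo.com,username3,mypassword2",
    "",
    "HTTP://ebay.com,username4,mypassword7",
]


def first_fields(lines):
    """Return the first comma-separated value of every non-empty line."""
    # csv.reader yields an empty list for a blank line, which "if row" drops.
    return [row[0] for row in csv.reader(lines) if row]


for url in first_fields(sample_lines):
    print(url)
```

With a real file, an open file object (opened with newline="") can be passed in place of the list. The sketch returns each value exactly as written and does not change its case.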

python regex parsing beautifulsoup

Cod*_*alk

2016-01-24

Score: 1 | Answers: 1 | Views: 234

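AttributeError: 'bytes' object has no attribute 'find_all' after encoding in Python

I'm getting the following error. I have searched Google extensively, but nothing has solved my problem, which seems to differ from what others have run into. I'm using BeautifulSoup.

I think the following line is causing the problem.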

soup = BeautifulSoup(req.content, 'html.parser').encode("utf-8")


When I try to find all div elements that have the class holder:

data = soup.find_all("div", {"class":"holder"})


It shows the following error:

Traceback (most recent call last):
  File "web_crawler.py", line 32
    data = soup.find_all("div", {"class":"holder"})
AttributeError: 'bytes' object has no attribute 'find_all'

Is the encoding causing the problem?
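
A minimal sketch of what seems to be going on, assuming req.content holds the raw bytes of the page: calling .encode("utf-8") on a BeautifulSoup object returns plain bytes, which no longer have find_all. Keeping the soup object for searching and encoding only the extracted text avoids the error. The sample bytes below are a made-up stand-in for req.content.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for req.content.
content = b'<div class="holder">first</div><div class="other">second</div>'

soup = BeautifulSoup(content, "html.parser")

# Encoding the whole soup yields bytes, which is why find_all is missing.
as_bytes = soup.encode("utf-8")
print(type(as_bytes).__name__)

# Search on the soup itself, then encode individual results if bytes are needed.
data = soup.find_all("div", {"class": "holder"})
encoded_texts = [div.get_text().encode("utf-8") for div in data]
print(encoded_texts)
```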

python django beautifulsoup

Md *_*man

Score: 1 | Answers: 1 | Views: 3921

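python beautifulsoup: lxml or html.parser

I have to use beautifulsoup, but I don't know which parser I should use. I'm torn between lxml and html.parser, or perhaps both. How can I tell whether a web page is suited to lxml? How can I tell whether it is suited to html.parser? Thanks a lot.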
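
Neither parser requires the page to "conform" to it; both accept arbitrary HTML and mainly differ in speed, leniency, and whether an extra package has to be installed. A common pattern, sketched here with a hypothetical helper make_soup, is to prefer lxml when it is installed and fall back to the built-in html.parser otherwise.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml if it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is an optional third-party package; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


# Deliberately unclosed tags: both parsers cope with them.
soup = make_soup("<p>Hello <b>world")
print(soup.b.get_text())
```

The two parsers can build slightly different trees from invalid markup, so it is usually best to pick one and use it consistently within a project.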

python lxml beautifulsoup html-parser

Ano*_*mus

Score: 1 | Answers: 1 | Views: 1054

Parsing <br> tags with beautifulsoup

I'm crawling a website, and the markup is structured like this:

<div class="content">
    <p> 
        "C Space"
        <br>
        "802 white avenue"
        <br>
        "xyz 123"
        <br>
        "Lima"
    </p>


When I use beautifulsoup to get the text with the following commands:

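from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("something")  # "something" is the asker's placeholder for the real URL
bsObj = BeautifulSoup(html, "html5lib")
templist = bsObj.find("div", {"class": "content"})
print(templist.get_text())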


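I get the following output: C Space802 white avenuexyz 123Lima

whereas I want the output to be: C Space 802 white avenue xyz 123 Lima.

How can I add the extra space when getting the data that follows each br tag?

Thanks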
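
One way to get the spaces, shown as a sketch: get_text() accepts a separator that is placed between adjacent text fragments, and <br> splits the paragraph into such fragments. The example assumes the quotation marks in the markup above come from the browser inspector rather than the page itself, and it uses html.parser instead of html5lib only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# The markup from the question, without the inspector's quotation marks.
html = """
<div class="content">
    <p>
        C Space
        <br>
        802 white avenue
        <br>
        xyz 123
        <br>
        Lima
    </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", {"class": "content"})

# separator is placed between adjacent text fragments; strip=True trims each
# fragment and drops the ones that are only whitespace.
text = content.get_text(separator=" ", strip=True)
print(text)
```

Passing separator="\n" instead would put each address part on its own line.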

html tags beautifulsoup web-crawler web-scraping

ksh*_*ava

2017-04-27

Score: 1 | Answers: 1 | Views: 2453

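Pandas: cannot strip HTML tags from a DataFrame column

I have a Pandas DataFrame with a column, text, that contains HTML. I want to get just the text, i.e. strip the tags. I tried the following: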

from bs4 import BeautifulSoup
result_df['text'] = BeautifulSoup(result_df['text']).get_text()


However, I end up with this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().


What am I doing wrong?

Thanks!
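
BeautifulSoup expects a single string of markup, not a whole Series, which is likely what triggers the error. A minimal sketch, assuming every cell in the column is a string: apply the parser to each value. The sample DataFrame and the helper name strip_tags are made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up data standing in for result_df.
result_df = pd.DataFrame(
    {"text": ["<p>Hello <b>world</b></p>", "<div>No tags left</div>"]}
)


def strip_tags(html):
    """Return the text content of one HTML string."""
    return BeautifulSoup(html, "html.parser").get_text()


# apply() calls strip_tags once per cell, so BeautifulSoup only ever
# sees one string at a time.
result_df["text"] = result_df["text"].apply(strip_tags)
print(result_df["text"].tolist())
```

If the column can contain missing values, they would need to be filled or skipped first, since the parser cannot take NaN as markup.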

python beautifulsoup pandas

bcl*_*man

Score: 1 | Answers: 2 | Views: 1765

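Scraping Google Finance (BeautifulSoup)

I'm trying to scrape Google Finance and get the "related stocks" table, which, according to the web inspector in Chrome, has the id "cc-table" and the class "gf-table". (Example link: https://www.google.com/finance?q=tsla)

But when I run .find("table") or .findAll("table"), this table does not show up. I can find a JSON-looking object containing the table's contents in the HTML that Python receives, but I don't know how to get at it. Any ideas?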
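
When a table is built in the browser from data embedded in a <script>, one option is to read that data directly. The sketch below is only an illustration: the question does not show the embedded object, so the variable name relatedStocks, the sample page, and the helper json_assigned_to are all hypothetical, and it assumes the object is strict JSON (a JavaScript literal with unquoted keys would not parse this way).

```python
import json
import re
from bs4 import BeautifulSoup

# Hypothetical page: there is no <table> in the static HTML, and the data
# sits in a script as an assignment.
sample_html = """
<html><body>
<div id="related"></div>
<script>
var relatedStocks = {"rows": [{"symbol": "TSLA", "price": 250.5},
                              {"symbol": "F", "price": 12.1}]};
</script>
</body></html>
"""


def json_assigned_to(html, var_name):
    """Return the JSON value assigned to var_name in any <script>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    assignment = re.compile(re.escape(var_name) + r"\s*=\s*")
    for script in soup.find_all("script"):
        text = script.string or ""
        match = assignment.search(text)
        if match:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever JavaScript follows it.
            value, _ = json.JSONDecoder().raw_decode(text, match.end())
            return value
    return None


data = json_assigned_to(sample_html, "relatedStocks")
for row in data["rows"]:
    print(row["symbol"], row["price"])
```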

python beautifulsoup web-scraping python-3.x web

use*_*034

Score: 1 | Answers: 1 | Views: 2873

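Web scraping with BeautifulSoup, getting an empty list

I'm trying web scraping by getting basic weather data (such as daily high/low temperatures) from https://www.wunderground.com/ (searching a random zip code).

I've tried several variations of my code, but it keeps returning an empty list where the temperature should be. Honestly, I just don't know where I'm going wrong. Can anyone point me in the right direction?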

import requests
from bs4 import BeautifulSoup
response=requests.get('https://www.wunderground.com/cgi-bin/findweather/getForecast?query=76502')
response_data = BeautifulSoup(response.content, 'html.parser')
results=response_data.select("strong.high")


I also tried various other variations, such as:

results = response_data.find_all('strong', class_ = 'high')
results = response_data.select('div.small_6 columns > strong.high' )

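
One detail that can be checked without touching the live site: in a CSS selector, a space means "descendant", so 'div.small_6 columns' looks for a <columns> element inside the div rather than a div carrying both classes. The sketch below uses made-up markup, because the real page's structure is not shown in the question. The page's values may also be filled in by JavaScript, in which case no selector will find them in the downloaded HTML.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
sample_html = """
<div class="small_6 columns">
  <strong class="high">91</strong>
  <strong class="low">72</strong>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Space between the names: matches a <columns> tag inside div.small_6.
wrong = soup.select("div.small_6 columns > strong.high")

# Dot between the names: matches a div that has both classes.
right = soup.select("div.small_6.columns > strong.high")

highs = [tag.get_text(strip=True) for tag in right]
print(len(wrong), highs)
```

Printing response.text and searching it for a temperature seen in the browser is a quick way to tell whether the value is in the static HTML at all.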

python beautifulsoup web-scraping

rez*_*ale

Score: 1 | Answers: 1 | Views: 835

Tag statistics

beautifulsoup ×10

python ×9

web-scraping ×3

html ×2

regex ×2

django ×1

html-parser ×1

lxml ×1

pandas ×1

parsing ×1

python-3.x ×1

tags ×1

web ×1

web-crawler ×1
