Tag: beautifulsoup

Parsing all child elements of each item in an RSS feed with beautifulsoup

How can I get, as a string, everything inside each item tag of an RSS feed?

Example input (simplified):

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Test</title>
<item>
  <title>Hello world1</title>
  <comments>Hi there</comments>
  <pubDate>Tue, 21 Nov 2011 20:10:10 +0000</pubDate>
</item>
<item>
  <title>Hello world2</title>
  <comments>Good afternoon</comments>
  <pubDate>Tue, 22 Nov 2011 20:10:10 +0000</pubDate>
</item>
<item>
  <title>Hello world3</title>
  <comments>blue paint</comments>
  <pubDate>Tue, 23 Nov 2011 20:10:10 +0000</pubDate>
</item>
</channel>
</rss>

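I need a Python function that takes this RSS file (I am using beautifulsoup right now) and loops over every item. On each pass I need a variable holding a string of everything inside that item.

Example result for the first pass of the loop: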

<title>Hello world1</title>
<comments>Hi there</comments>
<pubDate>Tue, 21 Nov 2011 20:10:10 +0000</pubDate>

This code gives me the first result, but how do I get all the following ones?

html_data = BeautifulSoup(xml)
print html_data.channel.item
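
A minimal sketch of one way to do this, assuming Python 3 and the bs4 package: `soup.channel.item` only ever returns the first match, while `find_all("item")` returns every item. The helper name `item_strings` and the shortened feed are made up for illustration. Note that the built-in `html.parser` lowercases tag names (`pubDate` would come back as `pubdate`); the `"xml"` parser keeps the case but needs lxml installed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the feed from the question (two items, no pubDate).
RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Test</title>
<item>
  <title>Hello world1</title>
  <comments>Hi there</comments>
</item>
<item>
  <title>Hello world2</title>
  <comments>Good afternoon</comments>
</item>
</channel>
</rss>"""


def item_strings(xml_text):
    """Return the inner markup of every <item> as a list of strings."""
    soup = BeautifulSoup(xml_text, "html.parser")
    # find_all returns every <item>; decode_contents renders what is inside it.
    return [item.decode_contents().strip() for item in soup.find_all("item")]


results = item_strings(RSS)
for text in results:
    print(text)
    print("---")
```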

python rss beautifulsoup

dee*_*ell

2011 11-22

2
votes

1
answer

7468
views

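Python: how do I parse the HTML of a web page that requires login?

I am trying to parse the HTML of a web page that requires login. I can get the HTML of a web page with the following script: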

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
import re

webpage = urlopen ('https://www.example.com')
soup = BeautifulSoup (webpage)
print soup
#This would print the source of example.com

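But getting the source of a page that I am logged in to is proving harder. I tried replacing ('https://www.example.com') with ('https://user:pass@example.com'), but I got an invalid URL error.

Does anyone know how I could do this? Thanks in advance.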
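
A hedged sketch, assuming the site uses HTTP Basic authentication; that is an assumption, since many sites use a login form with cookies instead, which this does not cover. It uses Python 3's `urllib.request` rather than the `urllib2` of the question, and passes the credentials through a password manager instead of embedding them in the URL. The function names and the placeholder credentials are hypothetical.

```python
import urllib.request

from bs4 import BeautifulSoup


def build_basic_auth_opener(url, user, password):
    """Build an opener that answers HTTP Basic Auth challenges for url."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, url, user, password)  # None matches any realm
    handler = urllib.request.HTTPBasicAuthHandler(manager)
    return urllib.request.build_opener(handler)


def fetch_soup(url, user, password):
    """Download url with credentials and parse it (performs a network request)."""
    opener = build_basic_auth_opener(url, user, password)
    with opener.open(url) as response:
        return BeautifulSoup(response.read(), "html.parser")


# Example call with placeholder values (not executed here):
# soup = fetch_soup("https://www.example.com", "user", "pass")
```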

html python parsing beautifulsoup

Dam*_*en

2
votes

1
answer

6691
views

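What does this error in Beautiful Soup mean?

I am writing a small script with PyQt4 and BeautifulSoup. Basically, you specify a URL and the script is supposed to download all the pictures from that web page.

In the output, when I provide http://yahoo.com, it downloads all the pictures except one: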

...
Download Complete
Download Complete
File name is wrong 
Traceback (most recent call last):
  File "./picture_downloader.py", line 41, in loadComplete
    self.download_image()
  File "./picture_downloader.py", line 58, in download_image
    print 'File name is wrong ',image['src']
  File "/usr/local/lib/python2.7/dist-packages/beautifulsoup4-4.1.3-py2.7.egg/bs4/element.py", line 879, in __getitem__
    return self.attrs[key]
KeyError: 'src'

The output for http://stackoverflow.com is:

Download Complete
File name is wrong  h
Download Complete

And finally, here is part of the code:

# SLOT for loadFinished
def loadComplete(self): 
    self.download_image()

def download_image(self):
    html = unicode(self.frame.toHtml()).encode('utf-8')
    soup = bs(html) …
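
The traceback shows `image['src']` raising `KeyError: 'src'`, which is what bs4 does when a tag has no such attribute, so at least one `<img>` on the page presumably has no `src`. A minimal sketch of a lookup that tolerates this, using made-up sample markup rather than the pages from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one usable image, one without src, one with an empty src.
HTML = """
<img src="/logo.png">
<img alt="placeholder without a source">
<img src="">
"""

soup = BeautifulSoup(HTML, "html.parser")

# tag["src"] raises KeyError when the attribute is absent;
# tag.get("src") returns None instead.
sources = []
for image in soup.find_all("img"):
    src = image.get("src")
    if src:  # skips both a missing and an empty src
        sources.append(src)

print(sources)
```

`soup.find_all("img", src=True)` is another option: it only returns tags that carry a `src` attribute at all.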

python pyqt beautifulsoup

Vor*_*Vor

2
votes

1
answer

about 10,000
views

Python - Beautiful Soup find text not working

commentary = soup.find('div', {'id' : 'live-text-commentary-wrapper'})
findtoure = commentary.findAll(text = 'Gnegneri Toure Yaya')

I can't see why this doesn't work.

The output of commentary is:

<div id="live-text-commentary-wrapper">
  <h2 id="live-text-introduction">Live Text Commentary</h2>
  <div class="live-text blq-clearfix" id="live-text">
    <span>90:00 
    <span class="extra-info">+3:04 
    <span class="icon-live-text-full-time">Full time</span></span></span>
    <p class="event">
    <span class="event-title">
      <strong>Full Time</strong>
    </span> The referee ends the match.</p>
    <span>90:00 
    <span class="extra-info">+2:52</span></span>
    <p>Gael Clichy produces a cross, clearance made by Mike Williamson.</p>
    <span>90:00 
    <span class="extra-info">+0:41</span></span>
    <p>Shot by Shola Ameobi from 20 yards. Save made by Joe Hart.</p>
    <span>90:00 
    <span class="extra-info">+0:07</span></span>
    <p>The ball is crossed by Davide …
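
A likely explanation, offered as a sketch rather than a confirmed diagnosis: passing a plain string to the text filter matches only text nodes that are exactly equal to it, so a name embedded in a longer sentence is never found. A compiled regular expression matches substrings. The sample sentence below is made up, and the `string=` keyword assumes bs4 4.4.0 or later (older releases spell it `text=`).

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup; the first sentence is invented for illustration.
HTML = """
<div id="live-text-commentary-wrapper">
  <p>Shot by Gnegneri Toure Yaya from 20 yards.</p>
  <p>Gael Clichy produces a cross.</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
commentary = soup.find("div", {"id": "live-text-commentary-wrapper"})

# Exact match: the whole text node would have to equal the name.
exact = commentary.find_all(string="Gnegneri Toure Yaya")

# Substring match: any text node containing the name.
partial = commentary.find_all(string=re.compile("Gnegneri Toure Yaya"))

print(len(exact), len(partial))
for text in partial:
    print(text.parent.name, text)
```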

python beautifulsoup

use*_*606

2013 02-25

2
votes

1
answer

5347
views

beautifulsoup 4 + python: string returns "None"

I am trying to parse some HTML with BeautifulSoup4 and Python 2.7.6, but the string is returning "None". The HTML I am trying to parse is:

<div class="booker-booking">
    2&nbsp;rooms
    &#0183;
    USD&nbsp;0
    <!-- Commission: USD  -->
</div>

My Python snippet is:

 data = soup.find('div', class_='booker-booking').string

I have also tried the following two:

data = soup.find('div', class_='booker-booking').text
data = soup.find('div', class_='booker-booking').contents[0]

Both return:

u'\n\t\t2\xa0rooms \n\t\t\xb7\n\t\tUSD\xa00\n\t\t\n

Ultimately I am trying to get the first line into a variable that says just "2 rooms", and the third line into another variable that says just "USD 0".
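
A minimal sketch under Python 3 and bs4: `.string` is `None` here because the div has several children (text plus a comment), so there is no single string to return. One way to get the two values is to split the text into lines and replace the non-breaking spaces. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

HTML = """<div class="booker-booking">
    2&nbsp;rooms
    &#0183;
    USD&nbsp;0
    <!-- Commission: USD  -->
</div>"""

soup = BeautifulSoup(HTML, "html.parser")
div = soup.find("div", class_="booker-booking")

# None: the div holds more than one child node.
print(div.string)

# Split the text into lines, swap non-breaking spaces for plain ones,
# and drop the lines that are empty after stripping.
lines = [
    line.replace("\xa0", " ").strip()
    for line in div.get_text().splitlines()
]
lines = [line for line in lines if line]

rooms, price = lines[0], lines[2]
print(rooms)
print(price)
```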

python parsing beautifulsoup html-parsing

cro*_*eaf

2
votes

1
answer

about 10,000
views

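Scraping the <h2> tag with beautifulsoup

I am scraping website data with Beautiful Soup. I want the anchor text shown below (My name is nick). I have googled a lot but could not find a solution that fits my case.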

news_panel = soup.findAll('div', {'class': 'menuNewsPanel_MenuNews1'})
for news in news_panel:
    temp = news.find('h2')        
    print temp

Output:

<h2 class="menuNewsHl2_MenuNews1"><a href="index.php?ref=MjBfMDFfMDhfMTRfMV84XzFfOTk2NDA=">My name is nick</a></h2>

But I want output like this: My name is nick
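
A short sketch, assuming Python 3 and bs4: printing a tag prints its markup, whereas `get_text()` returns only the text inside it. The markup is the output from the question wrapped in the div the loop searches for; the `titles` list is added here for illustration.

```python
from bs4 import BeautifulSoup

HTML = (
    '<div class="menuNewsPanel_MenuNews1">'
    '<h2 class="menuNewsHl2_MenuNews1">'
    '<a href="index.php?ref=MjBfMDFfMDhfMTRfMV84XzFfOTk2NDA=">My name is nick</a>'
    "</h2></div>"
)

soup = BeautifulSoup(HTML, "html.parser")

titles = []
for news in soup.find_all("div", {"class": "menuNewsPanel_MenuNews1"}):
    heading = news.find("h2")
    if heading is not None:  # a panel without an <h2> is skipped
        titles.append(heading.get_text(strip=True))

print(titles)
```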

python beautifulsoup web-scraping

Dev*_*per

2021 02-01

2
votes

1
answer

about 20,000
views

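Removing style from specific tags with BeautifulSoup/Python

Say I have a soup and I would like to remove the style attribute from all the paragraphs. So, across the whole soup, I want to turn <p style='blah' id='bla' class=...> into <p id='bla' class=...>. But I don't want to touch the <img style='...'> tags. How would I do this?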
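
A minimal sketch of one approach, assuming Python 3 and bs4: restrict the search to `<p>` tags and delete the attribute from each, which leaves every other tag alone. The sample markup is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a styled paragraph and a styled image.
HTML = (
    "<p style='color:red' id='bla' class='intro'>text</p>"
    "<img style='width:10px' src='a.png'>"
)

soup = BeautifulSoup(HTML, "html.parser")

# style=True matches only <p> tags that actually have a style attribute.
for paragraph in soup.find_all("p", style=True):
    del paragraph["style"]

print(soup)
```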

html python beautifulsoup html-parsing

Cha*_*les

2014 09-20

2
votes

1
answer

3207
views

Saving the content of a web page with BeautifulSoup

I am trying to scrape a web page with BeautifulSoup using the following code:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("http://en.wikipedia.org//wiki//Markov_chain.htm") as url:
    s = url.read()

soup = BeautifulSoup(s)

with open("scraped.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())
    f.close()

The problem is that it saves Wikipedia's main page rather than the specific article. Why doesn't this address work, and how should I change it?
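
A hedged sketch: the address in the question has doubled slashes and an `.htm` suffix, which probably does not match the article's path, so the server falls back to another page. The URL below, with single slashes and no suffix, is assumed to be the intended article. The function names are made up, the download is defined but not run here, and sites that reject requests without a descriptive User-Agent header are not handled.

```python
import urllib.request

from bs4 import BeautifulSoup

# Single slashes and no ".htm" suffix; assumed to be the intended article.
ARTICLE_URL = "https://en.wikipedia.org/wiki/Markov_chain"


def html_to_text(html):
    """Return the visible text of an HTML document (str or bytes)."""
    return BeautifulSoup(html, "html.parser").get_text()


def save_article_text(url, path):
    """Download url and write its text to path (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        html = response.read()
    # The with block closes the file, so no explicit close() is needed.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_to_text(html))


# Example call (not executed here):
# save_article_text(ARTICLE_URL, "scraped.txt")
```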

python beautifulsoup web-scraping python-3.x

Omi*_*mid

2014 08-12

2
votes

1
answer

3806
views

Finding a CSRF token in an HTML page with Beautifulsoup

The HTML looks like this

<input type="hidden" name="csrfToken" value="ajax:SOME_TOKEN"/>

I have tried it a few different ways but always get errors. I think this way looks right, but apparently it isn't.

soup = BeautifulSoup(html_page)
soup.find('input', {'name':'csrfToken'})

I keep getting:

TypeError: 'expected string or buffer'

Any ideas?
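
The `find` call itself looks fine. One plausible cause of the `TypeError`, stated here as an assumption because the question does not show where `html_page` comes from, is that it is not a string (for example a response object rather than the page text). With the markup as a plain string, a minimal sketch under Python 3 and bs4 looks like this; the surrounding `<form>` is added for illustration.

```python
from bs4 import BeautifulSoup

# The markup must be a str or bytes object, not a response object.
html_page = '<form><input type="hidden" name="csrfToken" value="ajax:SOME_TOKEN"/></form>'

soup = BeautifulSoup(html_page, "html.parser")
token_input = soup.find("input", {"name": "csrfToken"})

# Guard against the input being absent before reading its value.
token = token_input["value"] if token_input is not None else None
print(token)  # ajax:SOME_TOKEN
```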

python beautifulsoup

Mor*_*len

2
votes

1
answer

7480
views

Beautiful Soup div with class and id

I am a beginner and would like to ask how to extract data with Beautiful Soup from the following kind of code:

<div class="about-book" id="aboutbook">
Blah blah blah
</div>

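How do I get "Blah blah blah" when there are other elements with class "about-book" but a different id, and others with id "aboutbook" but a different class name? What I want is to match on the combination of class name and id.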
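
A minimal sketch, assuming Python 3 and bs4: `find` accepts several filters at once and returns only a tag that satisfies all of them. The two decoy divs are made up to mirror the situation described.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two decoys that match only one of the two criteria.
HTML = """
<div class="about-book" id="other">same class, different id</div>
<div class="summary" id="aboutbook">same id, different class</div>
<div class="about-book" id="aboutbook">
Blah blah blah
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Both conditions must hold: class "about-book" and id "aboutbook".
target = soup.find("div", class_="about-book", id="aboutbook")
text = target.get_text(strip=True)
print(text)  # Blah blah blah
```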

python beautifulsoup web-scraping python-2.7

Nik*_*ilS

2015 01-22

2
votes

1
answer

about 10,000
views

Tag statistics

beautifulsoup ×10

python ×10

web-scraping ×3

html ×2

html-parsing ×2

parsing ×2

pyqt ×1

python-2.7 ×1

python-3.x ×1

rss ×1
