Using BeautifulSoup with html5lib, it automatically adds html, head and body tags:
BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>
Is there any option I can set to turn off this behavior?
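If html5lib isn't strictly required, one way around this is to parse the fragment with Python's built-in parser instead, since html.parser does not add the html/head/body wrapper (html5lib builds a full document tree by design). A minimal sketch:

from bs4 import BeautifulSoup

# html.parser leaves the fragment as-is instead of wrapping it in <html>/<head>/<body>
fragment = BeautifulSoup('<h1>FOO</h1>', 'html.parser')
print(fragment)  # <h1>FOO</h1>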
I'm trying to scrape a simple table with Beautiful Soup. This is my code:
import requests
from bs4 import BeautifulSoup
url = 'https://gist.githubusercontent.com/anonymous/c8eedd8bf41098a8940b/raw/c7e01a76d753f6e8700b54821e26ee5dde3199ab/gistfile1.txt'
r = requests.get(url)
soup = BeautifulSoup(r.text)
table = soup.find_all(class_='dataframe')
first_name = []
last_name = []
age = []
preTestScore = []
postTestScore = []
for row in table.find_all('tr'):
    col = table.find_all('td')
    column_1 = col[0].string.strip()
    first_name.append(column_1)
    column_2 = col[1].string.strip()
    last_name.append(column_2)
    column_3 = col[2].string.strip()
    age.append(column_3)
    column_4 = col[3].string.strip()
    preTestScore.append(column_4)
    column_5 = col[4].string.strip()
    postTestScore.append(column_5)
columns = {'first_name': first_name, 'last_name': last_name, 'age': age, 'preTestScore': preTestScore, 'postTestScore': postTestScore}
df = pd.DataFrame(columns)
df
However, whenever I run it, I get this error:
---------------------------------------------------------------------------
…
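The full traceback is cut off above, but one likely cause (an assumption, since the error text isn't shown) is that find_all() returns a ResultSet, i.e. a list of tags, so the later call table.find_all('tr') fails; pandas is also never imported in the snippet. A minimal sketch of the usual row-by-row approach, reusing the column names from the code above:

import pandas as pd

table = soup.find('table', class_='dataframe')   # a single Tag rather than a ResultSet

rows = []
for row in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:  # the header row holds <th> cells, so it yields an empty list and is skipped
        rows.append(cells)

df = pd.DataFrame(rows, columns=['first_name', 'last_name', 'age',
                                 'preTestScore', 'postTestScore'])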
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
I'm using BeautifulSoup to load the page and find other things (it also grabs the article id from an id attribute tucked away in the source), but I don't know the right way to search the html and find these bits; I've tried variations of find and findAll to no avail. The code currently iterates over a list of URLs...
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#importing the libraries
from urllib import urlopen
from bs4 import BeautifulSoup
def get_data(page_no):
    webpage = urlopen('http://superfunevents.com/?p=' + str(i)).read()
    soup = BeautifulSoup(webpage, "lxml")
    for tag in soup.find_all("article"):
        id = tag.get('id')
        print id
        # the hard part that doesn't work - I know this example is well off the mark!
        title = soup.find("og:title", "content")
        print (title.get_text())
        url = soup.find("og:url", "content")
        print …
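For the og: tags themselves, one approach is to match on the property attribute and then read the content attribute rather than the tag text. A minimal sketch, assuming the soup object built above:

title_tag = soup.find("meta", property="og:title")
url_tag = soup.find("meta", property="og:url")

# content is an attribute of the tag, not its text, so index it instead of calling get_text()
if title_tag is not None:
    print title_tag["content"]
if url_tag is not None:
    print url_tag["content"]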
I followed all the steps to install Beautiful Soup, but I still get this error:

AttributeError: module 'collections' has no attribute 'Callable'

I'm using Python 3.10.
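Python 3.10 removed the old collections.Callable alias (it now only exists as collections.abc.Callable), and older BeautifulSoup releases still reference the removed name, which is what raises this error. Upgrading the package is the usual fix; a quick sketch of the check and the upgrade command:

# Callable moved out of the top-level collections module in Python 3.10
import collections.abc
print(collections.abc.Callable)    # fine
# print(collections.Callable)      # AttributeError on 3.10+

# Recent beautifulsoup4 releases use collections.abc, so upgrading resolves the error:
#   pip install --upgrade beautifulsoup4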
I have the following html (line breaks are marked with \n):
...
<tr>
<td class="pos">\n
"Some text:"\n
<br>\n
<strong>some value</strong>\n
</td>
</tr>
<tr>
<td class="pos">\n
"Fixed text:"\n
<br>\n
<strong>text I am looking for</strong>\n
</td>
</tr>
<tr>
<td class="pos">\n
"Some other text:"\n
<br>\n
<strong>some other value</strong>\n
</td>
</tr>
...
How can I find the text I'm looking for? The code below returns the first value it finds, so I need to filter on the fixed text somehow.
result = soup.find('td', {'class' :'pos'}).find('strong').text
Update. If I use the following code:
title = soup.find('td', text = re.compile(ur'Fixed text:(.*)', re.DOTALL), attrs = {'class': 'pos'})
self.response.out.write(str(title.string).decode('utf8'))
then it just returns Fixed text: itself.
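One way to anchor on the fixed label is to locate the matching text node first and then walk from it to the <strong> in the same cell. A minimal sketch, assuming soup already holds the markup above:

import re

# find the text node containing the label, then take the <strong> inside the same <td>
label = soup.find(text=re.compile('Fixed text:'))
if label is not None:
    value = label.find_parent('td').find('strong').get_text(strip=True)
    print(value)  # text I am looking for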
I'm learning urllib2 and Beautiful Soup, and on my first test I'm running into an error like this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)
There seem to be a lot of posts about this type of error, and I've tried the solutions I could understand, but there seems to be a catch-22 with them, for example:
I want to print post.text (text is a Beautiful Soup attribute that returns just the text).
Both str(post.text) and post.text produce unicode errors (on characters like curly apostrophes and ellipses).
So I added post = unicode(post) above str(post.text), and then I get:
AttributeError: 'unicode' object has no attribute 'text'
I also tried (post.text).encode() and (post.text).renderContents(). The latter produces the error:
AttributeError: 'unicode' object has no attribute 'renderContents'
I then tried str(post.text).renderContents() and got the error:
AttributeError: 'str' object has no attribute 'renderContents'
It would be great if I could define somewhere at the top of the document "make this content interpretable" and still have access to the text functionality when I need it.
Update: after the suggestion:
If I add post = post.decode("utf-8") above str(post.text) …
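On Python 2 the usual pattern is to keep the value as unicode and only encode it at the output boundary (when printing or writing), rather than wrapping the tag in str() or unicode(). A minimal sketch, assuming post is a Beautiful Soup tag:

# post.text is already unicode; encode only when handing it to an ASCII-bound stream
text = post.text
print text.encode('utf-8')

# the same applies when writing to a file
with open('out.txt', 'wb') as f:
    f.write(text.encode('utf-8'))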
I've modified an html file with beautifulsoup by removing some tags. Now I want to write the result back into an html file. My code:
from bs4 import BeautifulSoup
from bs4 import Comment
soup = BeautifulSoup(open('1.html'),"html.parser")
[x.extract() for x in soup.find_all('script')]
[x.extract() for x in soup.find_all('style')]
[x.extract() for x in soup.find_all('meta')]
[x.extract() for x in soup.find_all('noscript')]
[x.extract() for x in soup.find_all(text=lambda text:isinstance(text, Comment))]
html = soup.contents
for i in html:
    print i
html = soup.prettify("utf-8")
with open("output1.html", "wb") as file:
    file.write(html)
Since I used soup.prettify, it generates html like this:
<p>
<strong>
BATAM.TRIBUNNEWS.COM, BINTAN
</strong>
- Tradisi pedang pora mewarnai serah terima jabatan pejabat di
<a href="http://batam.tribunnews.com/tag/polres/" title="Polres">
Polres
</a>
<a …
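If the goal is simply to write the cleaned-up document back out without prettify() re-indenting every tag onto its own line, serialising the soup directly keeps the original layout. A sketch, reusing the soup from the code above:

# str(soup) / soup.encode() preserve the existing layout instead of reformatting it
with open("output1.html", "wb") as out:
    out.write(soup.encode("utf-8"))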
I'm trying to fetch some data from a website. However, it gives me back an incomplete read. The data I'm trying to fetch is a huge set of nested links. I did some research online and found that this may be due to a server error (a chunked transfer encoding finishing before reaching the expected size). I also found a workaround for it at this link.

However, I'm not sure how to use it in my case. Here is the code I'm working with:
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1;Trident/5.0)')]
urls = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"
page = urllib2.urlopen(urls).read()
soup = BeautifulSoup(page)
links = soup.findAll('img',url=True)
for tag in links:
    name = tag['alt']
    tag['url'] = urlparse.urljoin(urls, tag['url'])
    r = br.open(tag['url'])
    page_child = br.response().read()
    soup_child = BeautifulSoup(page_child)
    contracts = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "tariff-duration"})]
    data_usage = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "allowance"})]
    print contracts
    print data_usage
Please help me. Thanks.
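One common way to apply that kind of workaround (a sketch, not necessarily the exact code from the linked answer) is to catch the exception around the read and fall back to whatever partial data the server did send, reusing urls and BeautifulSoup from the snippet above:

import httplib
import urllib2

def read_tolerant(url):
    # IncompleteRead carries the bytes received so far in e.partial
    try:
        return urllib2.urlopen(url).read()
    except httplib.IncompleteRead as e:
        return e.partial

page = read_tolerant(urls)
soup = BeautifulSoup(page)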
Is there a way to define a custom indent width for the .prettify() function? As far as I can tell from its source:
def prettify(self, encoding=None, formatter="minimal"):
    if encoding is None:
        return self.decode(True, formatter=formatter)
    else:
        return self.encode(encoding, True, formatter=formatter)
there's no way to specify an indent width. I think this is because of this line in decode_contents():
s.append(" " * (indent_level - 1))
where the width is fixed at 1 space! (Why!!) I tried specifying indent_level=4, which just resulted in:
   <section>
    <article>
     <h1>
     </h1>
     <p>
     </p>
    </article>
   </section>
which looks simply silly. :|
Now, I can work around this, but I just want to make sure there isn't something I'm missing, since this ought to be a basic feature. :-/
If you have a better way of prettifying HTML code, please let me know.
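One simple workaround is to post-process the output and widen the leading whitespace after the fact, rather than changing prettify() itself. A sketch, assuming soup is the parsed document; newer bs4 releases may also accept a formatter object with an indent setting, which would be cleaner if available:

import re

def prettify_with_indent(soup, indent=4):
    # prettify() always emits one space per nesting level; multiply it afterwards
    pretty = soup.prettify()
    return re.sub(r'^( *)',
                  lambda m: ' ' * (len(m.group(1)) * indent),
                  pretty,
                  flags=re.MULTILINE)

print(prettify_with_indent(soup, indent=4))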
I'm trying to parse a website and get some information with BeautifulSoup.findAll, but it doesn't find them all... I'm using python3.
The code is like this:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen
page = urlopen ("http://mangafox.me/directory/")
# print (page.read ())
soup = BeautifulSoup (page.read ())
manga_img = soup.findAll ('a', {'class' : 'manga_img'}, limit=None)
for manga in manga_img:
    print (manga['href'])
It only prints half of them...
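A frequent cause of this is that the default parser gives up partway through malformed markup, so the rest of the document never makes it into the tree; switching to a more lenient parser often recovers the missing tags. A sketch, assuming html5lib (or lxml) is installed:

#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# html5lib is slower but far more tolerant of broken markup than html.parser
soup = BeautifulSoup(page.read(), "html5lib")

for manga in soup.find_all("a", class_="manga_img"):
    print(manga["href"])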