Using BeautifulSoup with html5lib, it automatically adds html, head and body tags:
BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>
Is there any option I can set to turn off this behavior?
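If html5lib isn't strictly required, one way around this is to parse the fragment with Python's built-in parser instead, since html.parser does not add the html/head/body wrapper (html5lib builds a full document tree by design). A minimal sketch:

from bs4 import BeautifulSoup

# html.parser leaves the fragment as-is instead of wrapping it in <html>/<head>/<body>
fragment = BeautifulSoup('<h1>FOO</h1>', 'html.parser')
print(fragment)  # <h1>FOO</h1>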
I'm trying to scrape a simple table with Beautiful Soup. This is my code:
import requests
from bs4 import BeautifulSoup
url = 'https://gist.githubusercontent.com/anonymous/c8eedd8bf41098a8940b/raw/c7e01a76d753f6e8700b54821e26ee5dde3199ab/gistfile1.txt'
r = requests.get(url)
soup = BeautifulSoup(r.text)
table = soup.find_all(class_='dataframe')
first_name = []
last_name = []
age = []
preTestScore = []
postTestScore = []
for row in table.find_all('tr'):
    col = table.find_all('td')
    column_1 = col[0].string.strip()
    first_name.append(column_1)
    column_2 = col[1].string.strip()
    last_name.append(column_2)
    column_3 = col[2].string.strip()
    age.append(column_3)
    column_4 = col[3].string.strip()
    preTestScore.append(column_4)
    column_5 = col[4].string.strip()
    postTestScore.append(column_5)
columns = {'first_name': first_name, 'last_name': last_name, 'age': age, 'preTestScore': preTestScore, 'postTestScore': postTestScore}
df = pd.DataFrame(columns)
df
However, whenever I run it, I get this error:
---------------------------------------------------------------------------
…
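The full traceback is cut off above, but one likely cause (an assumption, since the error text isn't shown) is that find_all() returns a ResultSet, i.e. a list of tags, so the later call table.find_all('tr') fails; pandas is also never imported in the snippet. A minimal sketch of the usual row-by-row approach, reusing the column names from the code above:

import pandas as pd

table = soup.find('table', class_='dataframe')   # a single Tag rather than a ResultSet

rows = []
for row in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:  # the header row holds <th> cells, so it yields an empty list and is skipped
        rows.append(cells)

df = pd.DataFrame(rows, columns=['first_name', 'last_name', 'age',
                                 'preTestScore', 'postTestScore'])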
<meta property="og:title" content="Super Fun Event 1" />
<meta property="og:url" content="http://superfunevents.com/events/super-fun-event-1/" />
I'm using BeautifulSoup to load the page and find other things (it also grabs the article id from an id attribute tucked away in the source), but I don't know the right way to search the html and find these bits; I've tried variations of find and findAll to no avail. The code currently iterates over a list of URLs...
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#importing the libraries
from urllib import urlopen
from bs4 import BeautifulSoup
def get_data(page_no):
    webpage = urlopen('http://superfunevents.com/?p=' + str(i)).read()
    soup = BeautifulSoup(webpage, "lxml")
    for tag in soup.find_all("article"):
        id = tag.get('id')
        print id
        # the hard part that doesn't work - I know this example is well off the mark!
        title = soup.find("og:title", "content")
        print (title.get_text())
        url = soup.find("og:url", "content")
        print …
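For the og: tags themselves, one approach is to match on the property attribute and then read the content attribute rather than the tag text. A minimal sketch, assuming the soup object built above:

title_tag = soup.find("meta", property="og:title")
url_tag = soup.find("meta", property="og:url")

# content is an attribute of the tag, not its text, so index it instead of calling get_text()
if title_tag is not None:
    print title_tag["content"]
if url_tag is not None:
    print url_tag["content"]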
I followed all the steps to install Beautiful Soup, but I still get this error:

AttributeError: module 'collections' has no attribute 'Callable'

I'm using Python 3.10.
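Python 3.10 removed the old collections.Callable alias (it now only exists as collections.abc.Callable), and older BeautifulSoup releases still reference the removed name, which is what raises this error. Upgrading the package is the usual fix; a quick sketch of the check and the upgrade command:

# Callable moved out of the top-level collections module in Python 3.10
import collections.abc
print(collections.abc.Callable)    # fine
# print(collections.Callable)      # AttributeError on 3.10+

# Recent beautifulsoup4 releases use collections.abc, so upgrading resolves the error:
#   pip install --upgrade beautifulsoup4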
I have the following html (line breaks are marked with \n):
...
<tr>
<td class="pos">\n
"Some text:"\n
<br>\n
<strong>some value</strong>\n
</td>
</tr>
<tr>
<td class="pos">\n
"Fixed text:"\n
<br>\n
<strong>text I am looking for</strong>\n
</td>
</tr>
<tr>
<td class="pos">\n
"Some other text:"\n
<br>\n
<strong>some other value</strong>\n
</td>
</tr>
...
How can I find the text I'm looking for? The code below returns the first value it finds, so I need to filter on the fixed text somehow.
result = soup.find('td', {'class' :'pos'}).find('strong').text
Update. If I use the following code:
title = soup.find('td', text = re.compile(ur'Fixed text:(.*)', re.DOTALL), attrs = {'class': 'pos'})
self.response.out.write(str(title.string).decode('utf8'))
then it just returns Fixed text: itself.
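One way to anchor on the fixed label is to locate the matching text node first and then walk from it to the <strong> in the same cell. A minimal sketch, assuming soup already holds the markup above:

import re

# find the text node containing the label, then take the <strong> inside the same <td>
label = soup.find(text=re.compile('Fixed text:'))
if label is not None:
    value = label.find_parent('td').find('strong').get_text(strip=True)
    print(value)  # text I am looking for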
I'm learning urllib2 and Beautiful Soup, and on my first test I'm running into an error like this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)
There seem to be a lot of posts about this type of error, and I've tried the solutions I could understand, but there seems to be a catch-22 with them, for example:
I want to print post.text (text is a Beautiful Soup attribute that returns just the text).
Both str(post.text) and post.text produce unicode errors (on characters like curly apostrophes and ellipses).
So I added post = unicode(post) above str(post.text), and then I get:
AttributeError: 'unicode' object has no attribute 'text'
I also tried (post.text).encode() and (post.text).renderContents(). The latter produces the error:
AttributeError: 'unicode' object has no attribute 'renderContents'
I then tried str(post.text).renderContents() and got the error:
AttributeError: 'str' object has no attribute 'renderContents'
It would be great if I could define somewhere at the top of the document "make this content interpretable" and still have access to the text functionality when I need it.
Update: after the suggestion:
If I add post = post.decode("utf-8") above str(post.text) …
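On Python 2 the usual pattern is to keep the value as unicode and only encode it at the output boundary (when printing or writing), rather than wrapping the tag in str() or unicode(). A minimal sketch, assuming post is a Beautiful Soup tag:

# post.text is already unicode; encode only when handing it to an ASCII-bound stream
text = post.text
print text.encode('utf-8')

# the same applies when writing to a file
with open('out.txt', 'wb') as f:
    f.write(text.encode('utf-8'))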
I've modified an html file with beautifulsoup by removing some tags. Now I want to write the result back into an html file. My code:
from bs4 import BeautifulSoup
from bs4 import Comment
soup = BeautifulSoup(open('1.html'),"html.parser")
[x.extract() for x in soup.find_all('script')]
[x.extract() for x in soup.find_all('style')]
[x.extract() for x in soup.find_all('meta')]
[x.extract() for x in soup.find_all('noscript')]
[x.extract() for x in soup.find_all(text=lambda text:isinstance(text, Comment))]
html = soup.contents
for i in html:
    print i
html = soup.prettify("utf-8")
with open("output1.html", "wb") as file:
    file.write(html)
Since I used soup.prettify, it generates html like this:
<p>
<strong>
BATAM.TRIBUNNEWS.COM, BINTAN
</strong>
- Tradisi pedang pora mewarnai serah terima jabatan pejabat di
<a href="http://batam.tribunnews.com/tag/polres/" title="Polres">
Polres
</a>
<a …
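If the goal is simply to write the cleaned-up document back out without prettify() re-indenting every tag onto its own line, serialising the soup directly keeps the original layout. A sketch, reusing the soup from the code above:

# str(soup) / soup.encode() preserve the existing layout instead of reformatting it
with open("output1.html", "wb") as out:
    out.write(soup.encode("utf-8"))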
I'm trying to fetch some data from a website. However, it gives me back an incomplete read. The data I'm trying to fetch is a huge set of nested links. I did some research online and found that this may be due to a server error (a chunked transfer encoding finishing before reaching the expected size). I also found a workaround for it at this link.

However, I'm not sure how to use it in my case. Here is the code I'm working with:
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1;Trident/5.0)')]
urls = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"
page = urllib2.urlopen(urls).read()
soup = BeautifulSoup(page)
links = soup.findAll('img',url=True)
for tag in links:
    name = tag['alt']
    tag['url'] = urlparse.urljoin(urls, tag['url'])
    r = br.open(tag['url'])
    page_child = br.response().read()
    soup_child = BeautifulSoup(page_child)
    contracts = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "tariff-duration"})]
    data_usage = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "allowance"})]
    print contracts
    print data_usage
Please help me. Thanks.
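One common way to apply that kind of workaround (a sketch, not necessarily the exact code from the linked answer) is to catch the exception around the read and fall back to whatever partial data the server did send, reusing urls and BeautifulSoup from the snippet above:

import httplib
import urllib2

def read_tolerant(url):
    # IncompleteRead carries the bytes received so far in e.partial
    try:
        return urllib2.urlopen(url).read()
    except httplib.IncompleteRead as e:
        return e.partial

page = read_tolerant(urls)
soup = BeautifulSoup(page)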
Is there a way to define a custom indent width for the .prettify() function? As far as I can tell from its source:
def prettify(self, encoding=None, formatter="minimal"):
    if encoding is None:
        return self.decode(True, formatter=formatter)
    else:
        return self.encode(encoding, True, formatter=formatter)
there's no way to specify an indent width. I think this is because of this line in decode_contents():
s.append(" " * (indent_level - 1))
where the width is fixed at 1 space! (Why!!) I tried specifying indent_level=4, which just resulted in:
   <section>
    <article>
     <h1>
     </h1>
     <p>
     </p>
    </article>
   </section>
which looks simply silly. :|
Now, I can work around this, but I just want to make sure there isn't something I'm missing, since this ought to be a basic feature. :-/
If you have a better way of prettifying HTML code, please let me know.
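One simple workaround is to post-process the output and widen the leading whitespace after the fact, rather than changing prettify() itself. A sketch, assuming soup is the parsed document; newer bs4 releases may also accept a formatter object with an indent setting, which would be cleaner if available:

import re

def prettify_with_indent(soup, indent=4):
    # prettify() always emits one space per nesting level; multiply it afterwards
    pretty = soup.prettify()
    return re.sub(r'^( *)',
                  lambda m: ' ' * (len(m.group(1)) * indent),
                  pretty,
                  flags=re.MULTILINE)

print(prettify_with_indent(soup, indent=4))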
I'm trying to parse a website and get some information with BeautifulSoup.findAll, but it doesn't find them all... I'm using python3.
The code is like this:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen
page = urlopen ("http://mangafox.me/directory/")
# print (page.read ())
soup = BeautifulSoup (page.read ())
manga_img = soup.findAll ('a', {'class' : 'manga_img'}, limit=None)
for manga in manga_img:
    print (manga['href'])
It only prints half of them...
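A frequent cause of this is that the default parser gives up partway through malformed markup, so the rest of the document never makes it into the tree; switching to a more lenient parser often recovers the missing tags. A sketch, assuming html5lib (or lxml) is installed:

#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# html5lib is slower but far more tolerant of broken markup than html.parser
soup = BeautifulSoup(page.read(), "html5lib")

for manga in soup.find_all("a", class_="manga_img"):
    print(manga["href"])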