Posts by cas*_*ova

Coercing to Unicode: need string or buffer, Tag found

I am trying to do some web scraping with the following code:

import mechanize
from bs4 import BeautifulSoup

url = "http://www.indianexpress.com/news/indian-actions-discriminating-against-us-exp/1131015/"
br =  mechanize.Browser()
htmltext = br.open(url).read()
articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('p'):
    articletext += tag.contents[0]
print articletext

But I get the following error:

Traceback (most recent call last):
  File "C:/Python27/crawler/express.py", line 15, in <module>
    articletext += tag.contents[0]
TypeError: coercing to Unicode: need string or buffer, Tag found

Could someone help me resolve this error? I am new to Python programming.
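The TypeError arises because `tag.contents[0]` is the paragraph's first child, which may itself be a Tag (for example a nested `<b>` or `<a>`) rather than a string, and concatenating a Tag onto a string triggers the Unicode coercion error. A minimal sketch of the usual fix, using `tag.get_text()` (assumes BeautifulSoup 4 is installed; the sample HTML is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second <p> starts with a nested Tag,
# which is exactly the case where contents[0] is not a string.
html = '<p>Plain text</p><p><b>Nested</b> tag first</p>'
soup = BeautifulSoup(html, 'html.parser')

# get_text() flattens each tag's whole subtree into one string,
# so it never hands back a Tag object.
texts = [tag.get_text() for tag in soup.find_all('p')]
# texts == ['Plain text', 'Nested tag first']
```

With this change, `articletext = "".join(texts)` collects the article body without ever touching `contents[0]` directly.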

python beautifulsoup web-crawler web-scraping python-2.7

4 votes · 1 answer · 2971 views

Writing Unicode web-scrape output to an ascii file in Python

I am new to Python programming. I am using the following code in my Python file:

import gethtml
import articletext
url = "http://www.thehindu.com/news/national/india-calls-for-resultoriented-steps-at-asem/article5339414.ece"
result = articletext.getArticle(url)
text_file = open("Output.txt", "w")

text_file.write(result)

text_file.close()

The file articletext.py contains the following code:

from bs4 import BeautifulSoup
import gethtml
def getArticleText(webtext):
    articletext = ""
    soup = BeautifulSoup(webtext)
    for tag in soup.findAll('p'):
        articletext += tag.contents[0]
    return articletext

def getArticle(url):
    htmltext = gethtml.getHtmlText(url)
    return getArticleText(htmltext)

But I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 473: ordinal not in range(128)
To print the result into the output file, what proper code should I write? …
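The traceback comes from `text_file.write(result)`: on Python 2 the file is opened in the default ascii text mode, so the curly quote `u'\u201c'` cannot be encoded. One common fix is to open the file with an explicit UTF-8 encoding via `io.open`, which works the same way on Python 2 and 3 (a sketch; the `result` string here is a made-up stand-in for the scraped text):

```python
import io

# Stand-in for the scraped article text, containing the curly
# quotes that break the default ascii encoder.
result = u'India\u2019s \u201cresult-oriented\u201d steps'

# An explicit encoding avoids the implicit ascii encode that
# raises UnicodeEncodeError on Python 2.
with io.open('Output.txt', 'w', encoding='utf-8') as text_file:
    text_file.write(result)

# Reading it back with the same encoding round-trips the text.
with io.open('Output.txt', 'r', encoding='utf-8') as text_file:
    assert text_file.read() == result
```

Alternatively, `result.encode('utf-8')` before writing to a plain binary file achieves the same thing.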

python unicode

3 votes · 1 answer · 484 views

Web scraping to build a news database

I am building a web scraper for different news outlets, and am trying to create one for The Hindu newspaper.

I want to fetch the news from the various links listed in its archive. Say I want the news from the links listed for a given day, for example http://www.thehindu.com/archive/web/2010/06/19/ (that is, June 19, 2010).

I have now written the following lines of code:

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br =  mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]
print articletext

But I cannot get the desired result and am basically stuck. Could someone help me with this?
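On an archive page like this, the matching `<li>` elements most likely contain `<a>` links rather than bare text, so `tag.contents[0]` again yields a Tag and, even when it works, would not give the article body. A sketch of one way to collect the headline links first and then fetch each article separately (assumes BeautifulSoup 4; the HTML snippet is invented to mirror the assumed `data-section` markup and should be checked against the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the archive page's assumed markup.
html = '''
<li data-section="Business"><a href="http://example.com/a1">Story one</a></li>
<li data-section="Sport"><a href="http://example.com/a2">Story two</a></li>
<li data-section="Business"><a href="http://example.com/a3">Story three</a></li>
'''
soup = BeautifulSoup(html, 'html.parser')

# Collect (headline, url) pairs for the Business section only.
links = [(a.get_text(), a['href'])
         for li in soup.find_all('li', attrs={'data-section': 'Business'})
         for a in li.find_all('a')]
# links == [('Story one', 'http://example.com/a1'),
#           ('Story three', 'http://example.com/a3')]
```

Each collected href can then be opened with the existing mechanize code and run through the `<p>`-extraction loop from the first question to build up the database.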

python beautifulsoup python-2.7 python-3.x

1 vote · 1 answer · 1003 views