How to read HTML from a URL in Python 3

Question

How to read HTML from a URL in Python 3

use*_*305 32 html python url

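I have looked at earlier, similar questions and only ended up more confused.

In Python 3.4, I want to read an HTML page as a string, given its URL.

In Perl I do this with LWP::Simple, using get().

A matplotlib 1.3.1 example says: import urllib; u1=urllib.urlretrieve(url). Python 3 can't find urlretrieve.

I tried u1 = urllib.request.urlopen(url), which appears to give me an HTTPResponse object, but I can't print it, get its length, or index it.

u1.body doesn't exist. I can't find a description of HTTPResponse for Python 3.

Is there an attribute of the HTTPResponse object that will give me the raw bytes of the HTML page?

(Irrelevant material from other questions includes urllib2, which doesn't exist in my Python, CSV parsers, etc.)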

Edit:

I found something in an earlier question that partially (mostly) does the job:

import urllib.request

u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1')

for line in u2.readlines():
    print(line)

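I say 'partially' because I don't want to read separate lines, just one big string.

I could concatenate the lines, but every printed line has a 'b' prepended to it.

Where does that come from?

Again, I suppose I could strip the first character before concatenating, but that really is a kludge.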

Answer 1

dav*_*dgh 55

Note that Python 3 does not read the HTML as a string but as a bytes object, so you need to convert it to a string with decode.

import urllib.request

fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()

mystr = mybytes.decode("utf8")
fp.close()

print(mystr)
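
The 'b' the question asks about is not part of the data. It is how print() displays a bytes object. The following is a small sketch of that behavior; the byte strings in raw_lines are made-up stand-ins for what readlines() would return, so it runs without a network connection.

```python
# Hypothetical stand-in for urlopen(...).readlines(): a list of bytes objects.
raw_lines = [b"<html>\n", b"<body>caf\xc3\xa9</body>\n", b"</html>\n"]

# print() shows the repr of a bytes object, hence the leading b and the quotes.
print(raw_lines[0])

# Join the bytes first, then decode once to get a single str.
page = b"".join(raw_lines).decode("utf-8")
print(page)
```

There is no need to strip the first character. Decoding turns the bytes into an ordinary string, and the b prefix never appears.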

Assuming the page is UTF-8 encoded is not a good idea. You should try reading the headers. (2 upvotes)
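
One way to act on that comment is to take the charset from the Content-Type header and fall back to UTF-8 only when the server does not declare one. This is a sketch, not the answerer's code: decode_html and fetch_html are hypothetical helper names, and the UTF-8 fallback and errors="replace" are assumptions. A page may also declare its encoding in a meta tag, which this does not inspect.

```python
import urllib.request
from email.message import Message


def decode_html(raw, content_type, fallback="utf-8"):
    """Decode raw bytes using the charset named in a Content-Type header value."""
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter in lowercase,
    # or the fallback when the header is absent or names no charset.
    charset = msg.get_content_charset(failobj=fallback)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The server named an encoding Python does not know.
        return raw.decode(fallback, errors="replace")


def fetch_html(url):
    """Fetch a URL and return its body as str (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return decode_html(response.read(), response.headers.get("Content-Type"))
```

Calling fetch_html("http://www.python.org") would then return the page as one string, decoded with whatever charset the server reported.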

Answer 2

小智 35

Try the requests module; it is simpler.

# Install with: pip install requests

import requests

url = 'https://www.google.com/'
r = requests.get(url)
print(r.text)

More info: http://docs.python-requests.org/en/master/

What do you mean? import libname is used in py3 as well. (4 upvotes)

Answer 3

小智 9

urllib.request.urlopen(url).read() should return the raw HTML page to you as bytes; call .decode() on the result to get a string.

@user1067305 Odd... `request.urlopen()` [returns an `HTTPResponse`](https://docs.python.org/3.4/library/urllib.request.html?highlight=urllib.request#urllib.request.urlopen), and [those do have](https://docs.python.org/3.4/library/http.client.html#http.client.HTTPResponse.read) a `read()` method... (2 upvotes)

Answer 4

Ram*_*ngh 8

import requests

# requests.get() returns a Response object, not a URL
response = requests.get("http://yahoo.com")
htmltext = response.text
print(htmltext)

This works much like urllib.request.urlopen.

Answer 5

Dis*_*ath 5

Reading an HTML page with urllib is quite simple. Since you want to read it as a single string, I'll show you how.

Import urllib.request:

#!/usr/bin/python3.5

import urllib.request

Prepare our request:

request = urllib.request.Request('http://www.w3schools.com')

When requesting a web page, always use try/except, because things can easily go wrong. urlopen() requests the page.

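try:
    response = urllib.request.urlopen(request)
except urllib.error.URLError as err:
    # URLError also covers HTTPError; stop here, since response was never assigned.
    raise SystemExit("Request failed: {}".format(err))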
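
The try/except advice can be made more specific by telling apart a server that answered with an error status from a server that could not be reached at all. This is a sketch under assumptions: fetch_bytes is a hypothetical helper, the opener parameter exists only so the function can be exercised without a network, and the 10-second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def fetch_bytes(url, opener=urllib.request.urlopen, timeout=10):
    """Return (body, error_message); exactly one of the two is None."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read(), None
    except urllib.error.HTTPError as err:
        # The server responded, but with an error status such as 404 or 500.
        return None, "HTTP {}: {}".format(err.code, err.reason)
    except urllib.error.URLError as err:
        # No usable response: DNS failure, refused connection, and so on.
        return None, "could not reach server: {}".format(err.reason)
```

HTTPError is a subclass of URLError, so it has to be caught first; otherwise the second handler would catch both.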

type() is a handy function that tells us what type a variable is. Here, response is an http.client.HTTPResponse object.

print(type(response))

The read() function of our response object stores the HTML in our variable as bytes. Again, type() will verify this.

htmlBytes = response.read()

print(type(htmlBytes))

Now we use the decode function on the bytes variable to get a single string.

htmlStr = htmlBytes.decode("utf8")

print(type(htmlStr))

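If you do want to split this string into separate lines, you can use the split() function. In that form we can easily iterate over it to print out the whole page or do any other processing.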

htmlSplit = htmlStr.split('\n')

print(type(htmlSplit))

for line in htmlSplit:
    print(line)
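
One caveat with split('\n'): pages served with Windows-style line endings keep a trailing '\r' on every line, and a final newline produces an empty last element. The sketch below compares it with str.splitlines() on a made-up sample string. Note that splitlines() also breaks on a few other boundary characters, such as form feed, which may or may not be wanted.

```python
# Made-up sample with \r\n line endings, as some servers send them.
html_str = "<html>\r\n<body>\r\n</body>\r\n</html>\r\n"

with_split = html_str.split("\n")       # keeps '\r' and adds a trailing ''
with_splitlines = html_str.splitlines()  # strips the line endings entirely

for line in with_splitlines:
    print(line)
```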

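I hope this provides a more detailed answer. The Python documentation and tutorials are great; I use them as a reference because they answer most of the questions you are likely to have.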

Archived:	11 years, 8 months ago
Views:	89397
Last activity:	6 years, 8 months ago