Removing line breaks in Python when using urllib

Question

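I am using Python 3.x. When I download a web page with urllib.request, I get a lot of \n in between. I tried to remove them using the methods given in other threads on this forum, but I could not. I have used the strip() and replace() functions... but no luck! I am running this code in Eclipse. Here is my code: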

import urllib.request

#Downloading entire Web Document 
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return""  
raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)

#Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)

I cannot figure out why I get so many \n in the raw_html variable.

Answer 1

jfs*_*jfs 7

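Your download_page() function corrupts the html (the str() call), which is why you see \n (two characters, \ and n) in the output. Don't use .replace() or other similar workarounds; fix the download_page() function instead: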

from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    html_content = response.read()
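
To see why the str() call is the culprit, here is a small offline sketch. The `raw` value is a made-up stand-in for whatever response.read() returns, not the actual content of the page in the question:

```python
# Hypothetical stand-in for the bytes returned by response.read()
raw = b"<html>\n<body>hi</body>\n</html>"

# str() on a bytes object returns its repr: the b'...' wrapper is kept
# and every newline byte is rendered as the two characters \ and n.
mangled = str(raw)

# decode() turns the bytes into real text with real newline characters.
decoded = raw.decode("utf-8")

print(mangled)
print("\n" in mangled)    # False: no real newline to replace
print("\\n" in mangled)   # True: backslash followed by n
print(decoded.replace("\n", ""))
```

This is why replace('\n', '') did nothing in the question: the string produced by str() contains no newline character at all, only a backslash followed by n.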

At this point html_content contains a bytes object. To get it as text, you need to know its character encoding, e.g., get it from the Content-Type http header:

encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)
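
Putting the two snippets together, a corrected download_page() might look roughly like the sketch below. The helper name decode_body, the 'utf-8' fallback, and errors="replace" are my own choices for illustration, not something the question requires:

```python
from email.message import Message
from urllib.request import urlopen


def decode_body(body: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode a response body using the charset from the Content-Type header."""
    # get_content_charset() returns `default` when the header has no charset.
    encoding = headers.get_content_charset(default)
    # errors="replace" is an assumption: it keeps going on malformed bytes
    # instead of raising UnicodeDecodeError.
    return body.decode(encoding, errors="replace")


def download_page(url: str) -> str:
    """Return the page as text, not as the repr of a bytes object."""
    with urlopen(url) as response:
        # response.headers is an http.client.HTTPMessage (a Message subclass)
        return decode_body(response.read(), response.headers)
```

With this version, download_page(url).replace('\n', '') removes the line breaks as expected. Network errors and unknown charset names are not handled here.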

See "A good way to get the charset/encoding of an HTTP response in Python".

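If the server does not pass the charset in the Content-Type header, there are complex rules for determining the character encoding of an html5 document, e.g., it may be specified inside the html document itself: <meta charset="utf-8"> (you would need an html parser to get it).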
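
As a rough illustration of the meta-tag case only, the sketch below uses the standard library's html.parser. It is not the full html5 encoding-detection algorithm; the names MetaCharsetFinder and sniff_meta_charset are hypothetical, and scanning just the first 1024 bytes is a simplification:

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)  # attribute names arrive lowercased
        if attrs.get("charset"):
            # <meta charset="...">
            self.charset = attrs["charset"].strip()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            content = attrs.get("content")
            if content:
                header = Message()
                header["Content-Type"] = content
                self.charset = header.get_content_charset()


def sniff_meta_charset(body: bytes, default: str = "utf-8") -> str:
    """Guess the encoding from a <meta> tag near the start of the document."""
    finder = MetaCharsetFinder()
    # Decode leniently as ASCII: the tag itself is ASCII-compatible markup.
    finder.feed(body[:1024].decode("ascii", errors="replace"))
    return finder.charset or default
```

A possible use is html_content.decode(sniff_meta_charset(html_content)) when the Content-Type header carries no charset. A declared name that Python does not recognise would still raise LookupError.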

If you read the html correctly, then you should not see literal \n characters in the page.
