How do I open an HTML file?

Question

dav*_*vid 19 python character-encoding python-2.7

I have an HTML file called test.html that contains a single word: ?????.

I open test.html and print its contents with the following code:

file = open("test.html", "r")
print file.read()

But it prints ??????. Why does this happen, and how can I fix it?

BTW, it works fine when I open a plain text file.

Edit: I tried this:

>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????
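A possible reason for the question marks, offered as a sketch rather than a diagnosis of this particular file: when a character cannot be represented in the target encoding, replacement-style error handling substitutes "?", and a console with a limited code page can behave in a similar way. The snippet below is written for Python 3 (the question itself is Python 2.7), and the sample word is made up because the real contents of test.html are not shown.

```python
# Hypothetical sample text; the actual word in test.html is not known here.
word = u"h\u00e9llo w\u00f6rld"

# On disk, text is stored as bytes. UTF-8 uses two bytes for each accented letter.
raw = word.encode("utf-8")

# Encoding into a charset that lacks a character, with errors="replace",
# puts "?" in its place.
as_ascii = word.encode("ascii", errors="replace")

print(repr(raw))
print(as_ascii.decode("ascii"))
```

If this is what is happening, the file itself is intact and only the display step loses information.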

Answer 1

vks*_*vks 34

import codecs
f=codecs.open("test.html", 'r')
print f.read()

Try something like this.

I also tried codecs.open("test.html",'r','utf-8'), but when I print f.read() I get a Unicode decode error! (2 upvotes)
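A decode error like the one in the comment above usually suggests the file is not stored in the encoding that was requested. One way to investigate, sketched here for Python 3, is to read the raw bytes and try a short list of candidate encodings. The helper name read_text_with_fallback and the candidate list are assumptions for illustration, not something from the thread.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes the file.

    Note: latin-1 accepts any byte sequence, so it never raises here;
    a successful decode is not proof that the guess was right.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode %s" % path)


# Demo: a file saved as cp1252 cannot be decoded as UTF-8.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(u"caf\u00e9".encode("cp1252"))
text, used = read_text_with_fallback(tmp.name)
os.remove(tmp.name)
print(used)
```

Order the candidates from strictest to most permissive, since a permissive codec will happily return wrong characters.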

Answer 2

Anonymous user 12

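I ran into this problem today as well. I am on Windows, and the system language defaults to Chinese, so other people may hit a similar Unicode error. Just add encoding='utf-8' (this uses the encoding keyword of open(), which requires Python 3):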

with open("test.html", "r", encoding='utf-8') as f:
    text= f.read()
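To see why the system language matters: in Python 3, open() in text mode typically falls back to the locale's preferred encoding when none is given, and on a Simplified Chinese Windows install that is often a legacy code page rather than UTF-8. The sketch below prints that default and reads a file with an explicit encoding instead. The helper name read_html is hypothetical.

```python
import locale
import os
import tempfile

# The encoding open() usually falls back to when none is passed.
# The value depends on the platform and its settings.
print(locale.getpreferredencoding(False))


def read_html(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the locale default."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Demo: write UTF-8 bytes, then read them back independent of the locale.
expected = u"<p>\u4f60\u597d</p>"
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(expected.encode("utf-8"))
content = read_html(tmp.name)
os.remove(tmp.name)
print(content == expected)
```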

Answer 3

Ben*_*min 8

You can use 'urllib' to read an HTML page.

# Python 2.x
import urllib

page = urllib.urlopen("your path ").read()
print page
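For readers on Python 3, where urllib.urlopen no longer exists, a rough equivalent uses urllib.request.urlopen with a file:// URL. This is a sketch under the assumption that the page is a local file saved as UTF-8. The helper name read_local_html is made up for this example.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_local_html(path, encoding="utf-8"):
    """Read a local file through urllib and decode the bytes it returns."""
    url = Path(path).resolve().as_uri()  # builds a file:// URL from an absolute path
    with urlopen(url) as response:
        return response.read().decode(encoding)


# Demo with a temporary file standing in for test.html.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(u"<p>caf\u00e9</p>".encode("utf-8"))
page = read_local_html(tmp.name)
os.remove(tmp.name)
print(len(page))
```

Note that urlopen returns bytes in Python 3, so the decode step is still needed. Going through urllib does not remove the encoding question.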

Answer 4

wen*_*zul 5

Use codecs.open with the encoding argument.

import codecs
f = codecs.open("test.html", 'r', 'utf-8')
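The snippet above only opens the file. A small Python 3 sketch of the remaining steps (reading, closing, and checking that the bytes really were decoded into characters) might look like this. The function name count_chars_and_bytes is hypothetical, and the sample text is chosen only because its characters take several bytes each in UTF-8.

```python
import codecs
import os
import tempfile


def count_chars_and_bytes(path, encoding="utf-8"):
    """Return (number of decoded characters, file size in bytes)."""
    # The with-block closes the file even if decoding fails.
    with codecs.open(path, "r", encoding) as f:
        text = f.read()
    return len(text), os.path.getsize(path)


# Demo: two characters that occupy three bytes each in UTF-8.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(u"\u4f60\u597d".encode("utf-8"))
counts = count_chars_and_bytes(tmp.name)
os.remove(tmp.name)
print(counts)
```

A character count smaller than the byte count is a quick sign that multi-byte text was decoded rather than read as raw bytes.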

Answer 5

Dib*_*eph 5

You can use the following code:

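from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup

f = codecs.open("test.html", 'r', 'utf-8')
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
document = BeautifulSoup(f.read(), "html.parser").get_text()
print document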
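If installing bs4 is not an option, the general idea of stripping tags and keeping text can be sketched with the standard library's html.parser in Python 3. This is not equivalent to BeautifulSoup's get_text() and is far less forgiving of broken markup. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(" ".join(parser.parts).split())


sample = ("<html><head><style>p{}</style></head><body>"
          "<p>Hello <b>world</b></p><script>var x=1;</script></body></html>")
print(html_to_text(sample))
```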

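If you also want to drop the blank lines in between and collect all the words into a single string (skipping special characters and numbers as well), add the following: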

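import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumption: `stop` is NLTK's English stop-word list (the original leaves it
# undefined). Requires the NLTK "stopwords" and tokenizer data to be downloaded.
stop = set(stopwords.words("english"))
st = ""  # accumulates the words that pass the filters

docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    if line:
        # Keep tokens made only of ASCII letters.
        if re.match("^[A-Za-z]*$", line):
            if line not in stop and len(line) > 1:
                st = st + " " + line
print st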

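*Define st as an empty string first: st = "". The snippet also needs re to be imported and stop to hold a collection of stop words.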
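The same filtering idea can be tried without NLTK. The Python 3 sketch below uses a regular expression as a crude tokenizer and a tiny, hypothetical stop-word set, so its output will differ from word_tokenize and from NLTK's real stop-word list. Unlike the snippet above, it compares against the stop words in lower case.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}


def keep_alpha_words(text, stop_words=STOP_WORDS):
    """Join tokens that are purely ASCII letters, longer than one character,
    and not stop words."""
    kept = []
    for token in re.findall(r"\w+", text):
        if not re.fullmatch(r"[A-Za-z]+", token):
            continue  # drops numbers and mixed tokens such as "HTML5"
        if len(token) > 1 and token.lower() not in stop_words:
            kept.append(token)
    return " ".join(kept)


result = keep_alpha_words("The page is 42 lines of plain HTML5 text, a test!")
print(result)
```

Collecting the words in a list and joining once avoids the leading space that repeated string concatenation produces.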

Asked: 11 years ago
Viewed: 114355 times
Last active: 6 years, 10 months ago