Converting io.BytesIO to io.StringIO to parse an HTML page

Question

Shi*_*pra 18 html beautifulsoup type-conversion pycurl stringio

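I'm trying to parse an HTML page that I retrieved with pyCurl, but the pyCurl WRITEFUNCTION returns the page as bytes rather than as a string, so I can't parse it with BeautifulSoup.

Is there a way to convert io.BytesIO to io.StringIO?

Or is there some other way to parse the HTML page?

I'm using Python 3.3.2.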

Answer 1

kak*_*eys 32

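The code in the accepted answer reads the whole stream in order to decode it. The approach below instead converts one stream into another, so the data can be read in chunks.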

import io

# Initialize a read buffer (its position starts at 0)
buffer = io.BytesIO(
    b'Initial value for read buffer with unicode characters ' +
    'ÁÇÊ'.encode('utf-8')
)
# Wrap the byte stream in a text stream that decodes on demand
wrapper = io.TextIOWrapper(buffer, encoding='utf-8')

# Read from the buffer
print(wrapper.read())

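
To make the "read in chunks" point concrete, here is a minimal sketch. The payload and the chunk size of 4 are made up for illustration; in practice the bytes would come from the download. Note that `read(n)` on a `TextIOWrapper` counts characters, not bytes, so multi-byte UTF-8 sequences are not split.

```python
import io

# Hypothetical payload standing in for a downloaded page.
# A BytesIO created from initial bytes starts at position 0.
raw = io.BytesIO('<p>ÁÇÊ</p>'.encode('utf-8') * 3)

wrapper = io.TextIOWrapper(raw, encoding='utf-8')

chunks = []
while True:
    chunk = wrapper.read(4)  # at most 4 characters (not bytes) per call
    if not chunk:            # an empty string signals end of stream
        break
    chunks.append(chunk)

text = ''.join(chunks)
print(text)
```

If the `BytesIO` was filled with `write()` calls rather than created from initial bytes, it would need a `seek(0)` before being wrapped.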

Answer 2

Ant*_*ile 10

A naive approach:

# assume bytes_io is a `BytesIO` object
# read() starts at the current position; call bytes_io.seek(0) first if needed
byte_str = bytes_io.read()

# Convert to a "unicode" (str) object
text_obj = byte_str.decode('UTF-8')  # Or use the encoding you expect

# Use text_obj how you see fit!
# io.StringIO(text_obj) will get you to a StringIO object if that's what you need

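
As one possible way to "use text_obj", the sketch below feeds the decoded text to a parser. It swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without third-party packages. The `TitleParser` class and the sample page are hypothetical, and UTF-8 is an assumption about the page's encoding.

```python
import io
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Made-up page bytes, standing in for what the download wrote into the buffer
bytes_io = io.BytesIO(
    '<html><head><title>Café</title></head></html>'.encode('utf-8')
)

# Decode the bytes to text (encoding assumed to be UTF-8)
text_obj = bytes_io.getvalue().decode('utf-8')

parser = TitleParser()
parser.feed(text_obj)
parser.close()
print(parser.title)
```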

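Thanks, it did work. However, I used bytes_io.getvalue() instead of bytes_io.read(), because read() did not work for me. (4 upvotes)
Usually you have to call `bytes_io.seek(0)` before the read() call. As @AnthonySottile mentioned, `getvalue` solves this. (2 upvotes)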
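
The difference the comments describe can be reproduced with a small sketch. The sample HTML is made up; the `write()` call stands in for whatever filled the buffer, such as a download callback.

```python
import io

bytes_io = io.BytesIO()
# Writing leaves the stream position at the end of the data
bytes_io.write('<html><body>héllo</body></html>'.encode('utf-8'))

empty = bytes_io.read()      # reads from the current position: nothing left
whole = bytes_io.getvalue()  # returns the entire buffer, ignoring the position

bytes_io.seek(0)             # rewind to the start
rewound = bytes_io.read()    # now read() returns everything

# Either result can be decoded and wrapped in a StringIO if one is needed
text_io = io.StringIO(whole.decode('utf-8'))

print(empty, whole == rewound)
```

`getvalue()` is the simpler choice when the whole page is wanted at once; `seek(0)` followed by `read()` matters when the buffer is handed to code that expects a file-like object.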

Archived: 11 years, 5 months ago
Views: 23459
Last activity: 7 years, 5 months ago