How to convert "bytes" objects into literal strings in a pandas DataFrame, Python 3.x?

Question

Sha*_*ang 15 python arrays byte python-3.x pandas

I have a Python 3.x pandas DataFrame in which certain columns are strings stored as bytes (as in Python 2.x):

import pandas as pd
df = pd.DataFrame(...)
df
       COLUMN1         ....
0      b'abcde'        ....
1      b'dog'          ....
2      b'cat1'         ....
3      b'bird1'        ....
4      b'elephant1'    ....

When I access the column with df.COLUMN1, I see Name: COLUMN1, dtype: object.

However, if I access an individual element, it is a "bytes" object:

df.COLUMN1.iloc[0].dtype
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'dtype'

How do I convert these into "regular" strings? That is, how can I get rid of the b'' prefix?

Answer 1

EdC*_*ica 35

You can use the vectorised str.decode to decode the byte strings into ordinary strings:

df['COLUMN1'].str.decode("utf-8")
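
Note that str.decode returns a new Series rather than changing the column in place, so the result has to be assigned back. A minimal sketch, assuming a small made-up frame shaped like the one in the question (the sample values are illustrative only):

```python
import pandas as pd

# Hypothetical sample data mirroring the question's COLUMN1.
df = pd.DataFrame({"COLUMN1": [b"abcde", b"dog", b"cat1"]})

# str.decode returns a new Series; assign it back to replace the bytes values.
df["COLUMN1"] = df["COLUMN1"].str.decode("utf-8")

print(df["COLUMN1"].tolist())  # ['abcde', 'dog', 'cat1']
```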

To do this for multiple columns, you can select just the object-dtype (string) columns:

str_df = df.select_dtypes(include=[object])

Convert all of them:

str_df = str_df.stack().str.decode('utf-8').unstack()

Then you can swap the converted columns in for the original df columns:

for col in str_df:
    df[col] = str_df[col]
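
Putting the three steps together, the whole approach might look like the sketch below. The frame and its column names (name, kind, legs) are made up for illustration, and it assumes every object column holds only bytes values:

```python
import pandas as pd

# Hypothetical frame: two bytes columns and one numeric column.
df = pd.DataFrame({
    "name": [b"dog", b"cat1"],
    "kind": [b"mammal", b"mammal"],
    "legs": [4, 4],
})

# Step 1: select only the object-dtype columns (the bytes columns here).
str_df = df.select_dtypes(include=[object])

# Step 2: flatten to a single Series, decode every value, restore the columns.
str_df = str_df.stack().str.decode("utf-8").unstack()

# Step 3: write the decoded columns back over the originals.
for col in str_df:
    df[col] = str_df[col]

print(df["name"].tolist())
```

The numeric column is never selected, so it passes through untouched.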

Answer 2

Chr*_*nto 6

Combining the answers from @EdChum and @Yu Zhou, a simpler solution is:

for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process byte object columns.
        df[col] = df[col].apply(lambda x: x.decode("utf-8"))

apply is not the way to go here. Use df[col].str.decode('utf-8') instead. (4 upvotes)

Answer 3

小智 5

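I ran into the additional problem that some columns in the DataFrame contain only str values, or a mix of str and bytes. A minor modification of the solution provided by @Christabella Irwanto handles this (I prefer the str.decode('utf-8') approach suggested by @Mad Physicist):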

for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process object columns.
        # Decode; where decode returns NaN (the value was not bytes), keep the original value.
        df[col] = df[col].str.decode('utf-8').fillna(df[col])


>>> df[col]
0        Element
1     b'Element'
2         b'165'
3            165
4             25
5             25

>>> df[col].str.decode('utf-8').fillna(df[col])
0     Element
1     Element
2         165
3         165
4          25
5          25
6          25
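
For anyone who wants to try this on a mixed column, here is a small self-contained sketch. The column name col and the sample values are made up; the example only assumes that the column mixes str and bytes:

```python
import pandas as pd

# Hypothetical column mixing str and bytes values.
df = pd.DataFrame({"col": ["Element", b"Element", b"165", "165"]})

for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process object columns.
        # str.decode yields NaN for values that are not bytes ...
        decoded = df[col].str.decode("utf-8")
        # ... so fall back to the original value in those positions.
        df[col] = decoded.fillna(df[col])

print(df["col"].tolist())  # ['Element', 'Element', '165', '165']
```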

(Use the builtin object rather than np.object; the np.object alias has been removed from recent NumPy versions.)
