Finding nested columns in a pandas DataFrame

Question


Dan*_*ats 8 python python-3.x pandas pyarrow

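I have a large dataset in (compressed) JSON format with many columns. I am trying to convert it to Parquet for later processing. Some columns have a nested structure. For now I want to ignore that structure and simply write those columns out as (JSON) strings.

So for the columns I have identified, I am doing: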

df[column] = df[column].astype(str)


However, I am not sure which columns are nested and which are not. When I write to Parquet, I see the following message:

<stack trace redacted> 

  File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children: struct<coordinates: list<item: double>, type: string>


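This shows that I failed to convert one of my columns from a nested object to a string. But which column is to blame? How can I find out?

When I print the .dtypes of my pandas DataFrame, I cannot tell strings and nested values apart, because both show up as object.

Edit: the error hints at the nested column by showing the struct details, but debugging this way is time-consuming. It also reports only the first error, which is annoying when there are several nested columns.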

Answer 1

gdl*_*lmx 3

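Converting nested structures to strings

If I understand your question correctly, you want to serialize the nested Python objects (lists, dicts) in df to JSON strings and leave all other elements unchanged. It is best to write your own conversion function: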

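import json

def json_serializer(obj):
    # Serialize nested containers to JSON; add any other types you
    # consider nested to this tuple (isinstance needs a tuple, not a list)
    if isinstance(obj, (list, dict)):
        return json.dumps(obj)
    return obj

# On pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap
df = df.applymap(json_serializer)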


If the DataFrame is large, using astype(str) will be faster.

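nested_cols = []
for c in df:
    # isinstance needs a tuple of types, not a list
    if any(isinstance(obj, (list, dict)) for obj in df[c]):
        nested_cols.append(c)

for c in nested_cols:
    # Converts every element in the column with str(), regardless of type;
    # lists and dicts become Python reprs (single quotes), not strict JSON
    df[c] = df[c].astype(str)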


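This approach has a performance advantage thanks to short-circuit evaluation in the any(...) call: it returns as soon as it hits the first nested object in a column and wastes no time checking the rest. If one of the "Dtype Introspection" methods fits your data, using it would be faster still.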
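
To answer the "which column is to blame" part directly, the same short-circuit scan can be wrapped in a helper that returns every nested column at once instead of stopping at the first error. This is only a sketch: the helper name find_nested_columns, the NESTED_TYPES tuple, and the sample data are made up for illustration. It uses json.dumps rather than astype(str), so the stored strings stay parseable as JSON.

```python
import json

import pandas as pd

# Assumption: these are the Python types treated as "nested"; extend as needed.
NESTED_TYPES = (list, dict, tuple)


def find_nested_columns(df, nested_types=NESTED_TYPES):
    """Return the names of columns holding at least one nested Python object."""
    nested = []
    for col in df.columns:
        # Nested Python objects can only live in object-dtype columns.
        if df[col].dtype != object:
            continue
        # any() stops at the first nested value it finds in the column.
        if any(isinstance(value, nested_types) for value in df[col]):
            nested.append(col)
    return nested


# Hypothetical sample data, loosely modelled on the struct in the error message.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["a", "b"],
    "geo": [{"type": "Point", "coordinates": [1.0, 2.0]}, None],
    "tags": [["x"], ["y", "z"]],
})

nested_cols = find_nested_columns(df)
print(nested_cols)

# Serialize only the nested values; None and plain scalars are left untouched.
for col in nested_cols:
    df[col] = df[col].map(
        lambda v: json.dumps(v) if isinstance(v, NESTED_TYPES) else v
    )
```

Printing nested_cols before converting gives the complete list of offending columns in one pass, so there is no need to fix them one error message at a time.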

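Check for a newer version of pyarrow

I am assuming these nested structures need to be converted to strings only because they cause an error in pyarrow.parquet.write_table. You may not need to convert them at all, because the problem with handling nested columns in pyarrow has reportedly been fixed recently (March 29, 2020, version 0.17.0). The support may still have issues, though, and is under active discussion.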
