When should I use pandas' Categorical dtype?

Question

Bra*_*mon 5 python memory pandas categorical-data

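My question concerns optimizing memory usage for pandas Series. The docs note:

"The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data."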

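My understanding is that pandas Categorical data is effectively a mapping to unique (downcast) integers that represent the categories, where the integers themselves (presumably) occupy fewer bytes than the strings that make up the object dtype.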
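The mapping described above can be inspected directly through the .cat accessor. A minimal sketch, using a made-up six-element Series (the color values are illustrative and not from the question):

```python
import pandas as pd

# Illustrative data (not from the question): three distinct strings, repeated.
s = pd.Series(["red", "green", "red", "blue", "green", "red"], dtype=object)
cat = s.astype("category")

# Each distinct value is stored once, in the categories index.
categories = list(cat.cat.categories)

# Each row stores only a small integer code that indexes into the categories.
codes = cat.cat.codes.tolist()

# pandas picks the smallest signed integer type that can hold all the codes.
codes_dtype = cat.cat.codes.dtype

print(categories)   # ['blue', 'green', 'red']
print(codes)        # [2, 1, 2, 0, 1, 2]
print(codes_dtype)  # int8
```

With fewer than 128 categories the codes fit in int8, so each row costs one byte plus the one-time cost of storing the distinct strings.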

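My question: is there a rule of thumb for when using pd.Categorical will not save memory over object? How direct is the proportionality mentioned above, and doesn't it also depend on the length of each element (string) in the Series?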
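One way to probe both parts of this question is to vary the number of distinct values and the string length independently. The sketch below is an experiment under stated assumptions, not a rule from the pandas docs: the helper names (deep_bytes, compare), the sample size, and the generated strings are all hypothetical, and the byte counts depend on the pandas and Python versions. It assumes the strings are held as Python objects (object dtype).

```python
import numpy as np
import pandas as pd


def deep_bytes(s):
    """Bytes used by the values of a Series, excluding its index."""
    return int(s.memory_usage(index=False, deep=True))


def compare(values):
    """Return (object_bytes, category_bytes) for the same values."""
    obj = pd.Series(values, dtype=object)
    return deep_bytes(obj), deep_bytes(obj.astype("category"))


n = 10_000
rng = np.random.default_rng(444)

# Case 1: few distinct short strings, heavily repeated.
few_short = rng.choice(["a", "b", "c"], size=n).tolist()

# Case 2: the same cardinality, but each string is 100 characters long.
few_long = rng.choice(["a" * 100, "b" * 100, "c" * 100], size=n).tolist()

# Case 3: every value is distinct, so nothing is shared between rows.
all_unique = [f"id_{i}" for i in range(n)]

results = {
    "few_short": compare(few_short),
    "few_long": compare(few_long),
    "all_unique": compare(all_unique),
}

for name, (obj_b, cat_b) in results.items():
    print(f"{name:>10}: object={obj_b:>9,} B  category={cat_b:>9,} B")
```

With repeated values, longer strings should increase the saving, since object dtype pays for the string on every row while the categorical stores it once. When every value is distinct and the categories are stored as Python objects, the categorical should come out larger than object, because every string is still stored and the codes are added on top.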

In the test below, pd.Categorical seems to win by a wide margin.

import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(444)
# IPython/Jupyter magic; remove this line when running as a plain Python script
%matplotlib inline

def mem_usage(obj, index=False, total=True, deep=True):
    """Memory usage of pandas Series or DataFrame."""
    # Ported from https://www.dataquest.io/blog/pandas-big-data/
    usg = obj.memory_usage(index=index, deep=deep)
    if isinstance(obj, pd.DataFrame) and total:
        usg = usg.sum()
    # Bytes to megabytes
    return usg / 1024 ** 2

catgrs = tuple(string.printable)

lengths = np.arange(1, 10001, dtype=np.uint16)
sizes = []
for length in lengths:
    obj = pd.Series(np.random.choice(catgrs, size=length))
    cat = obj.astype('category')
    sizes.append((mem_usage(obj), mem_usage(cat)))
sizes = np.array(sizes)

fig, ax = plt.subplots()
ax.plot(sizes)
ax.set_ylabel('Size (MB)')
ax.set_xlabel('Series length')
ax.legend(['object dtype', 'category dtype'])
ax.set_title('Memory usage of object vs. category dtype')


However, for n < 125, pd.Categorical is slightly larger.
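For short Series, it can help to split the categorical footprint into its two parts. A rough sketch, assuming the same 100 printable characters as in the test above; the variable names and n = 50 are arbitrary choices for illustration, and the exact byte counts vary by pandas and Python version:

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(444)
catgrs = list(string.printable)  # 100 single-character categories

n = 50  # a short Series, chosen arbitrarily for illustration
obj = pd.Series(rng.choice(catgrs, size=n).tolist(), dtype=object)
cat = obj.astype("category")

# Part 1: one integer code per row (grows with the length of the data).
codes_bytes = cat.cat.codes.to_numpy().nbytes

# Part 2: each distinct string stored once (grows with the number of categories).
categories_bytes = int(cat.cat.categories.memory_usage(deep=True))

n_categories = len(cat.cat.categories)
obj_bytes = int(obj.memory_usage(index=False, deep=True))
cat_bytes = int(cat.memory_usage(index=False, deep=True))

print(f"rows={n}, distinct values={n_categories}")
print(f"codes: {codes_bytes} B, categories: {categories_bytes} B")
print(f"object total: {obj_bytes} B, category total: {cat_bytes} B")
```

When most values in a short Series are distinct, the second part tends to approach the cost of the object Series itself, so the codes add overhead instead of saving space. This is one plausible contributor to the small-n behavior in the plot, not a full account of the n < 125 threshold.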

fig, ax = plt.subplots()
ax.plot(sizes[:200])
ax.set_ylabel('Size (MB)')
ax.set_xlabel('Series length')
ax.legend(['object dtype', 'category dtype'])
ax.set_title('Memory usage of object vs. category dtype')


Answer 1

Gol*_*ion 0

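Casting to the category dtype uses less memory. However, one-hot encoding lets you preserve the categorical ranking of the levels. You can then analyze the classifier coefficients to understand the behavior of, and predictions on, the categorical data.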
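For a rough sense of the memory claim, the same column can be stored as object, as category, and as one-hot columns via pd.get_dummies. This is only a sketch: the three levels below are invented for illustration, and the dtype of the dummy columns (and so their size) differs between pandas versions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(444)
n = 10_000

# Hypothetical column with three levels (not from the original post).
levels = ["small", "medium", "large"]
obj = pd.Series(rng.choice(levels, size=n).tolist(), dtype=object)
cat = obj.astype("category")

# One-hot encoding: one indicator column per level.
dummies = pd.get_dummies(cat)

obj_bytes = int(obj.memory_usage(index=False, deep=True))
cat_bytes = int(cat.memory_usage(index=False, deep=True))
dummies_bytes = int(dummies.memory_usage(index=False, deep=True).sum())

print(f"object:   {obj_bytes:>8,} B")
print(f"category: {cat_bytes:>8,} B")
print(f"one-hot:  {dummies_bytes:>8,} B across {dummies.shape[1]} columns")
```

The one-hot frame stores one value per row per level, so its size grows with the number of levels, while the category column stores a single code per row.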

Archived:	8 years ago
Views:	1124
Last active:	5 years ago