Pandas DataFrame is slow to show shape or dtypes

Ric*_*d H 5 python performance dataframe pandas

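I am very new to Python and pandas. Any guidance, comments, and suggestions are appreciated!

Here is my problem: after I call df.shape or df.dtypes, it takes several minutes for the result to come back. The DataFrame has 1610658 rows and 5 columns. Three columns are stored as int64, one as float64, and one as datetime64.

I am using the following code to practice loading and transforming data in Python. Both the load and the transform perform well, but I ran into this problem when checking the output.

Update 1:

After setting some columns as the index, the time for df.shape dropped from 80+ s to 1.7 s, but df.dtypes still takes 80+ s.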

import pandas as pd

###############
# Load
###############
raw = pd.read_csv("data.zip", compression='zip')

###############
# Transform
###############

payment_method = {
    "Cash": 1,
    "Card": 2,
}

df = raw.assign(
    # Encode site ids to int. Only two sites in this data
    site=(raw.site == "A").astype(int),
    # Encode payment types to int
    payment=[payment_method.get(k, 0) for k in raw.payment],
    # Rescale values
    amount=raw.amount / 1e6,
    # Convert integer date key to datetime
    sold_date=pd.to_datetime(
        [str(dt) for dt in raw.sold_date],
        format="%Y%m%d"),
)

###############
# Check point
###############

df.shape # pain point I. Took minutes to return
# Out[9]: (1610658, 5)

df.dtypes # pain point II
# Out[10]: 
# site                       int64
# acct_key                   int64
# sold_date         datetime64[ns]
# amount                   float64
# payment                    int64
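For comparison, the transform step could probably also be written with vectorized column operations instead of Python list comprehensions. This is only a sketch: the small `raw` frame below is a made-up stand-in for the CSV, with the same column names, and `df_vec` is a hypothetical name. Nothing here is confirmed to change how fast df.shape or df.dtypes return.

```python
import pandas as pd

# Synthetic stand-in for the CSV (hypothetical values, same column names)
raw = pd.DataFrame({
    "site": ["A", "B", "A"],
    "acct_key": [101, 102, 103],
    "sold_date": [20170101, 20170215, 20170330],
    "amount": [1500000, 250000, 990000],
    "payment": ["Cash", "Card", "Cheque"],
})

payment_method = {"Cash": 1, "Card": 2}

df_vec = raw.assign(
    # Encode site ids to int. Only two sites in this data
    site=(raw["site"] == "A").astype(int),
    # map() yields NaN for payment types missing from the dict; fill those with 0
    payment=raw["payment"].map(payment_method).fillna(0).astype(int),
    # Rescale values
    amount=raw["amount"] / 1e6,
    # Convert integer date key to datetime via its string form
    sold_date=pd.to_datetime(raw["sold_date"].astype(str), format="%Y%m%d"),
)

print(df_vec)
```

Every column produced this way is a regular pandas Series, whereas the original code passes a plain Python list for payment.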

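If I convert the DataFrame to a numpy.ndarray, I get the result immediately. I think I must be missing something. Please give me some pointers.

Thanks a lot!

System: OS X 10.12, Python 3.6.1, NumPy 1.12, pandas 0.20.2, Jupyter console 5.1.0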
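One way to make the comparison measurable is sketched below. It builds a synthetic frame (made-up values, smaller than the real data) with the same column names and kinds of dtypes, then times df.shape, df.dtypes, and the shape of the converted array. The names `df_demo` and `time_call` are hypothetical helpers, not part of the original code.

```python
import time

import numpy as np
import pandas as pd

# Synthetic frame mirroring the columns in the question (values are made up)
n = 100_000
df_demo = pd.DataFrame({
    "site": np.random.randint(0, 2, size=n),
    "acct_key": np.arange(n),
    "sold_date": pd.date_range("2017-01-01", periods=n, freq="min"),
    "amount": np.random.rand(n),
    "payment": np.random.randint(0, 3, size=n),
})


def time_call(fn):
    """Return (result, elapsed seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


shape_df, t_shape = time_call(lambda: df_demo.shape)
dtypes_df, t_dtypes = time_call(lambda: df_demo.dtypes)
# .values converts the whole frame to a single numpy.ndarray first
shape_np, t_np = time_call(lambda: df_demo.values.shape)

print(f"df.shape         {t_shape:.6f} s")
print(f"df.dtypes        {t_dtypes:.6f} s")
print(f"df.values.shape  {t_np:.6f} s")
```

Running the same three measurements outside the Jupyter console, for example in a plain script, would help show whether the delay comes from pandas itself or from the console.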

小智 1

Try reducing the size of the DataFrame:

int_columns = df.select_dtypes(include=["int"]).columns
df[int_columns] = df[int_columns].apply(pd.to_numeric, downcast='unsigned')
float_columns = df.select_dtypes(include=["float"]).columns
df[float_columns] = df[float_columns].apply(pd.to_numeric, downcast='float')
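To check what the downcast actually does, memory usage can be compared before and after. The sketch below uses a tiny made-up frame (`df_small` is a hypothetical name) and selects columns by the explicit dtype names int64 and float64. Note that downcast='unsigned' only takes effect when a column has no negative values, and nothing in this thread establishes that a smaller frame makes df.shape or df.dtypes return faster.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame (made-up values) with 64-bit columns
df_small = pd.DataFrame({
    "site": np.array([1, 0, 1, 1], dtype="int64"),
    "acct_key": np.array([101, 102, 103, 104], dtype="int64"),
    "amount": np.array([1.5, 0.25, 0.75, 3.0], dtype="float64"),
})

before = df_small.memory_usage(deep=True).sum()

# Downcast integer columns to the smallest unsigned type that holds the values
int_columns = df_small.select_dtypes(include=["int64"]).columns
df_small[int_columns] = df_small[int_columns].apply(pd.to_numeric, downcast="unsigned")

# Downcast float columns to float32 where the values allow it
float_columns = df_small.select_dtypes(include=["float64"]).columns
df_small[float_columns] = df_small[float_columns].apply(pd.to_numeric, downcast="float")

after = df_small.memory_usage(deep=True).sum()

print(df_small.dtypes)
print(f"bytes before: {before}, bytes after: {after}")
```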