Ric*_*d H 5 python performance dataframe pandas
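I am new to Python and pandas. Any guidance, comments, and suggestions are appreciated!
Here is my problem: after I call df.shape or df.dtypes, it takes several minutes for the result to come back. The DataFrame has 1,610,658 rows and 5 columns: three columns are int64, one is float64, and one is datetime64.
I use the following code to practice loading and transforming data in Python. Both the load and the transform perform well, but I run into this problem when I inspect the output.
Update 1:
After setting some columns as the index, the time for df.shape dropped from 80+ s to 1.7 s, but df.dtypes still takes 80+ s.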
import pandas as pd
###############
# Load
###############
raw = pd.read_csv("data.zip", compression='zip')
###############
# Transform
###############
payment_method = {
    "Cash": 1,
    "Card": 2,
}
df = raw.assign(
    # Encode site ids to int. Only two sites in this data
    site=(raw.site == "A").astype(int),
    # Encode payment types to int
    payment=[payment_method.get(k, 0) for k in raw.payment],
    # Rescale values
    amount=raw.amount / 1e6,
    # Convert integer date key to datetime
    sold_date=pd.to_datetime(
        [str(dt) for dt in raw.sold_date],
        format="%Y%m%d"),
)
###############
# Check point
###############
df.shape # pain point I. Took minutes to return
# Out[9]: (1610658, 5)
df.dtypes # pain point II
# Out[10]:
# site int64
# acct_key int64
# sold_date datetime64[ns]
# amount float64
# payment int64
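If I convert the DataFrame to a numpy.ndarray, I get the results immediately. I think I must be missing something. Please give me some pointers.
Thank you very much!
System: OS X 10.12, Python 3.6.1, NumPy 1.12, pandas 0.20.2, Jupyter console 5.1.0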
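For anyone trying to reproduce the timings described above, here is a minimal sketch of how they could be measured. It does not use the original data.zip: the frame is a synthetic stand-in with the same five column names and dtypes, and `time_call`, `n_rows`, and the random values are assumptions made purely for illustration.

```python
import time

import numpy as np
import pandas as pd


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic stand-in for the real data (hypothetical values, same column layout).
n_rows = 100_000
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "site": rng.randint(0, 2, n_rows),
    "acct_key": rng.randint(0, 10_000, n_rows),
    "sold_date": pd.to_datetime("2017-01-01")
    + pd.to_timedelta(rng.randint(0, 365, n_rows), unit="D"),
    "amount": rng.rand(n_rows),
    "payment": rng.randint(0, 3, n_rows),
})

# Time the two attribute lookups from the question, plus the ndarray route.
timings = {
    "df.shape": time_call(lambda: df.shape),
    "df.dtypes": time_call(lambda: df.dtypes),
    "df.values.shape": time_call(lambda: df.values.shape),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Running the same three measurements in a plain python session and in the Jupyter console would help show whether the delay comes from pandas itself or from the environment displaying the result.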
小智 1
Try reducing the size of the DataFrame:
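# Downcast integer columns to the smallest unsigned integer dtype that fits
int_columns = df.select_dtypes(include=["int"]).columns
df[int_columns] = df[int_columns].apply(pd.to_numeric, downcast='unsigned')
# Downcast float columns to the smallest float dtype that fits (float32 at minimum)
float_columns = df.select_dtypes(include=["float"]).columns
df[float_columns] = df[float_columns].apply(pd.to_numeric, downcast='float')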
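To check what the downcast actually saves, a small sketch along these lines may help. The frame and its values are hypothetical, and the loop applies `pd.to_numeric` one column at a time, which is a per-column variant of the code above rather than the exact same call.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame using the numeric dtypes from the question.
df = pd.DataFrame({
    "site": np.array([1, 0, 1, 0], dtype="int64"),
    "acct_key": np.array([1001, 1002, 1003, 1004], dtype="int64"),
    "amount": np.array([1.5, 2.25, 0.5, 3.0], dtype="float64"),
})

# Bytes used by the column data before downcasting.
before = df.memory_usage(index=False, deep=True).sum()

# Downcast each integer column to the smallest unsigned dtype that fits.
for col in df.select_dtypes(include=["int64"]).columns:
    df[col] = pd.to_numeric(df[col], downcast="unsigned")

# Downcast each float column (float32 is the smallest target pandas uses).
for col in df.select_dtypes(include=["float64"]).columns:
    df[col] = pd.to_numeric(df[col], downcast="float")

# Bytes used by the column data after downcasting.
after = df.memory_usage(index=False, deep=True).sum()

print(df.dtypes)
print(f"bytes before: {before}, bytes after: {after}")
```

Note that float32 keeps only about seven significant digits, so downcasting a column such as amount can lose precision. This sketch only shows the memory effect; it does not establish that a smaller frame makes df.shape or df.dtypes return faster.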