kia*_*ari 3 python performance timezone datetime pandas
I have many csv files containing Unix epoch times that need to be converted to human-readable dates/times. The following Python code does the job, but it is very slow.
df['dt'] = pd.to_datetime(df['epoch'], unit='s')
df['dt'] = df.apply(lambda x: x['dt'].tz_localize('UTC').tz_convert('Europe/Amsterdam'), axis=1)
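In practice, the second line is the bottleneck (about 30 seconds per million rows). Even with multiprocessing it does not scale, because I have more than a billion records in total. How can I make it faster?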
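Use pandas; for a pure-Python version, see "Converting unix timestamp string to readable date".
pandas.Series.dt.tz_localize and pandas.Series.dt.tz_convert are both vectorized functions, so .apply() is not needed.
.apply() is not vectorized.
The .dt accessor must be used, because these methods are called on the whole Series rather than on a single Timestamp.
Using pd.to_datetime(df['DT'], unit='s', utc=True) and dropping .dt.tz_localize('UTC') may be better.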
import pandas as pd

# test dataframe with 1M rows
df = pd.DataFrame({'DT': [1349720105, 1349806505, 1349892905, 1349979305, 1350065705]})
df['DT'] = pd.to_datetime(df['DT'], unit='s')
df = pd.concat([df]*200000).reset_index(drop=True)

# display(df.head())
#                  DT
# 2012-10-08 18:15:05
# 2012-10-09 18:15:05
# 2012-10-10 18:15:05
# 2012-10-11 18:15:05
# 2012-10-12 18:15:05

# convert the column
df['DT'] = df['DT'].dt.tz_localize('UTC').dt.tz_convert('Europe/Amsterdam')

# display(df.head())
#                        DT
# 2012-10-08 20:15:05+02:00
# 2012-10-09 20:15:05+02:00
# 2012-10-10 20:15:05+02:00
# 2012-10-11 20:15:05+02:00
# 2012-10-12 20:15:05+02:00

df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1000000 entries, 0 to 999999
# Data columns (total 1 columns):
#  #   Column  Non-Null Count    Dtype
# ---  ------  --------------    -----
#  0   DT      1000000 non-null  datetime64[ns, Europe/Amsterdam]
# dtypes: datetime64[ns, Europe/Amsterdam](1)
# memory usage: 7.6 MB
This option is more concise: passing utc=True to pandas.to_datetime() localizes to 'UTC' while converting to a datetime dtype.
df['DT'] = pd.to_datetime(df['DT'], unit='s', utc=True).dt.tz_convert('Europe/Amsterdam')
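Since the question involves many csv files, the one-line conversion can be applied chunk by chunk so that no file has to fit in memory at once. The following is only a minimal sketch under assumptions: the helper name `convert_epoch_csv`, the column name `epoch`, the extra `value` column, and the chunk size are all hypothetical, and an in-memory string stands in for a real file.

```python
import io

import pandas as pd


def convert_epoch_csv(source, epoch_col="epoch", tz="Europe/Amsterdam", chunksize=100_000):
    """Yield chunks of a CSV with an added timezone-aware 'dt' column.

    Hypothetical helper: the column names and chunk size are assumptions.
    """
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Vectorized: parse epoch seconds as UTC, then convert to the target zone.
        chunk["dt"] = pd.to_datetime(chunk[epoch_col], unit="s", utc=True).dt.tz_convert(tz)
        yield chunk


# Small in-memory stand-in for one csv file (made-up layout).
sample_csv = io.StringIO("epoch,value\n1349720105,1\n1349806505,2\n1349892905,3\n")

result = pd.concat(convert_epoch_csv(sample_csv, chunksize=2), ignore_index=True)
print(result)
```

In practice each chunk would more likely be written out or aggregated right away than concatenated, which keeps memory use bounded by the chunk size.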
For comparison, here is the same conversion without x['dt'].tz_localize('UTC') in .apply(), first vectorized and then row by row:
df['DT_1'] = pd.to_datetime(df['DT'], unit='s', utc=True).dt.tz_convert('Europe/Amsterdam')
df['DT_2'] = pd.to_datetime(df['DT'], unit='s', utc=True).apply(lambda x: x.tz_convert('Europe/Amsterdam'))
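Before swapping the row-wise call for the vectorized one, it can help to confirm on a small sample that both produce the same instants. This is a sketch on three of the epoch values used above; the variable names are illustrative only.

```python
import pandas as pd

epochs = pd.Series([1349720105, 1349806505, 1349892905])
utc = pd.to_datetime(epochs, unit="s", utc=True)

# Vectorized conversion through the .dt accessor.
vectorized = utc.dt.tz_convert("Europe/Amsterdam")

# Row-wise conversion: one Timestamp.tz_convert call per element.
row_wise = utc.apply(lambda ts: ts.tz_convert("Europe/Amsterdam"))

# Element-wise comparison of the two results.
same = bool((vectorized == row_wise).all())
print(same)
```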
%%timeit test
This tests the comparable vectorized version against the .apply() version from the OP, where 'DT' has already been converted to a datetime dtype.
%%timeit
df['DT'].dt.tz_localize('UTC').dt.tz_convert('Europe/Amsterdam')
[out]:
4.4 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.apply(lambda x: x['DT'].tz_localize('UTC').tz_convert('Europe/Amsterdam'), axis=1)
[out]:
35.9 s ± 572 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
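%%timeit is an IPython/Jupyter magic, so it is not available in a plain script. A rough equivalent with the standard-library timeit module might look like the sketch below. The frame is deliberately small (2,000 rows, an arbitrary choice) so that the row-wise version finishes quickly, and the absolute numbers will differ from those above.

```python
import timeit

import pandas as pd

# Small test frame with a naive datetime column, as in the OP's setup.
df = pd.DataFrame({"DT": pd.to_datetime([1349720105, 1349806505] * 1000, unit="s")})


def vectorized():
    return df["DT"].dt.tz_localize("UTC").dt.tz_convert("Europe/Amsterdam")


def row_wise():
    return df.apply(
        lambda x: x["DT"].tz_localize("UTC").tz_convert("Europe/Amsterdam"), axis=1
    )


runs = 3
t_vec = timeit.timeit(vectorized, number=runs) / runs
t_row = timeit.timeit(row_wise, number=runs) / runs

print(f"vectorized: {t_vec:.6f} s per call")
print(f"row-wise:   {t_row:.6f} s per call")
```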