kia*_*ari 3 python performance timezone datetime pandas
I have many CSV files containing Unix epoch times that need to be converted to human-readable dates/times. The Python code below does the job, but it is very slow.
df['dt'] = pd.to_datetime(df['epoch'], unit='s')
df['dt'] = df.apply(lambda x: x['dt'].tz_localize('UTC').tz_convert('Europe/Amsterdam'), axis=1)
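In fact, the second line is the bottleneck (about 30 seconds per 1 million rows). So even with multiprocessing it does not scale, because I have more than a billion records in total. How can I make it faster?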
This answer is for pandas; a pure-Python version is covered in "Converting unix timestamp string to readable date".

pandas.Series.dt.tz_localize and pandas.Series.dt.tz_convert are both vectorized functions, so .apply() is not needed. .apply() is not vectorized. The vectorized methods must be called through the .dt accessor.

Passing utc=True, as in pd.to_datetime(df['DT'], unit='s', utc=True), and dropping .dt.tz_localize('UTC') is probably better.

    import pandas as pd

    # test dataframe with 1M rows
    df = pd.DataFrame({'DT': [1349720105, 1349806505, 1349892905, 1349979305, 1350065705]})
    df['DT'] = pd.to_datetime(df['DT'], unit='s')
    df = pd.concat([df]*200000).reset_index(drop=True)

    # display(df.head())
                     DT
    2012-10-08 18:15:05
    2012-10-09 18:15:05
    2012-10-10 18:15:05
    2012-10-11 18:15:05
    2012-10-12 18:15:05

    # convert the column
    df['DT'] = df['DT'].dt.tz_localize('UTC').dt.tz_convert('Europe/Amsterdam')

    # display(df.head())
                           DT
    2012-10-08 20:15:05+02:00
    2012-10-09 20:15:05+02:00
    2012-10-10 20:15:05+02:00
    2012-10-11 20:15:05+02:00
    2012-10-12 20:15:05+02:00

    print(df.info())
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000000 entries, 0 to 999999
    Data columns (total 1 columns):
     #   Column  Non-Null Count    Dtype
    ---  ------  --------------    -----
     0   DT      1000000 non-null  datetime64[ns, Europe/Amsterdam]
    dtypes: datetime64[ns, Europe/Amsterdam](1)
    memory usage: 7.6 MB

This option is more concise: it localizes to 'UTC' while converting to a datetime dtype with pandas.to_datetime().

    df['DT'] = pd.to_datetime(df['DT'], unit='s', utc=True).dt.tz_convert('Europe/Amsterdam')

With utc=True, x['dt'].tz_localize('UTC') is also no longer needed inside .apply():

    df['DT_1'] = pd.to_datetime(df['DT'], unit='s', utc=True).dt.tz_convert('Europe/Amsterdam')
    df['DT_2'] = pd.to_datetime(df['DT'], unit='s', utc=True).apply(lambda x: x.tz_convert('Europe/Amsterdam'))

%%timeit test: this compares the vectorized version with the .apply() version from the OP, with 'DT' already converted to a datetime dtype.

    %%timeit
    df['DT'].dt.tz_localize('UTC').dt.tz_convert('Europe/Amsterdam')
    [out]:
    4.4 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    df.apply(lambda x: x['DT'].tz_localize('UTC').tz_convert('Europe/Amsterdam'), axis=1)
    [out]:
    35.9 s ± 572 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
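As a quick sanity check, the vectorized conversion and the row-wise version from the question should produce identical timestamps. The following is only a minimal sketch: the helper name epoch_to_local and the third sample value (chosen to fall in winter time) are illustrative assumptions, not part of the original post.

```python
import pandas as pd


def epoch_to_local(epoch: pd.Series, tz: str = "Europe/Amsterdam") -> pd.Series:
    """Hypothetical helper: epoch seconds -> tz-aware local timestamps (vectorized)."""
    # utc=True localizes to UTC during parsing, so no separate tz_localize step is needed
    return pd.to_datetime(epoch, unit="s", utc=True).dt.tz_convert(tz)


# Small sample: two summer-time values from the answer plus one assumed winter-time value
df = pd.DataFrame({"epoch": [1349720105, 1349806505, 1357000000]})
df["dt"] = epoch_to_local(df["epoch"])

# Row-wise approach from the question, kept only for comparison
naive = pd.to_datetime(df["epoch"], unit="s")
slow = naive.apply(lambda x: x.tz_localize("UTC").tz_convert("Europe/Amsterdam"))

# Both approaches should agree element by element
same = list(df["dt"]) == list(slow)
print(same)
print(df)
```

If the two results match on a sample like this, the .apply() line in the original code can be swapped for the vectorized call without changing the output.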