How to convert Unix epoch time to a timezone-aware datetime with pandas

kia*_*ari 3 python performance timezone datetime pandas

I have many CSV files containing Unix epoch times that need to be converted to human-readable dates/times. The Python code below does the job, but it is very slow.

df['dt'] = pd.to_datetime(df['epoch'], unit='s')
df['dt'] = df.apply(lambda x: x['dt'].tz_localize('UTC').tz_convert('Europe/Amsterdam'), axis=1)

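In fact, the second line is the bottleneck (about 30 seconds for 1 million rows). It will not scale even with multiprocessing, because I have more than a billion records in total. How can I make it faster?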

Tre*_*ney 7

import pandas as pd

# test dataframe with 1M rows
df = pd.DataFrame({'DT': [1349720105, 1349806505, 1349892905, 1349979305, 1350065705]})
df['DT'] = pd.to_datetime(df['DT'], unit='s')
df = pd.concat([df]*200000).reset_index(drop=True)

# display(df.head())
                 DT
2012-10-08 18:15:05
2012-10-09 18:15:05
2012-10-10 18:15:05
2012-10-11 18:15:05
2012-10-12 18:15:05

# convert the column
df['DT'] = df['DT'].dt.tz_localize('UTC').dt.tz_convert('Europe/Amsterdam')

# display(df.head())
                       DT
2012-10-08 20:15:05+02:00
2012-10-09 20:15:05+02:00
2012-10-10 20:15:05+02:00
2012-10-11 20:15:05+02:00
2012-10-12 20:15:05+02:00

print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   DT      1000000 non-null  datetime64[ns, Europe/Amsterdam]
dtypes: datetime64[ns, Europe/Amsterdam](1)
memory usage: 7.6 MB
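As a sanity check on the conversion above, the sketch below compares the row-wise `.apply()` from the question with the vectorized `.dt` chain on a three-row sample. The names `sample`, `row_wise`, `vectorized`, and `matches` are illustrative only and do not come from the original posts.

```python
import pandas as pd

# Small sample of epoch seconds (the first three values of the test frame)
sample = pd.DataFrame({'epoch': [1349720105, 1349806505, 1349892905]})
sample['dt'] = pd.to_datetime(sample['epoch'], unit='s')

# Row-wise version from the question: one Python-level call per row
row_wise = sample.apply(
    lambda x: x['dt'].tz_localize('UTC').tz_convert('Europe/Amsterdam'), axis=1
)

# Vectorized version: the whole column is localized and converted at once
vectorized = sample['dt'].dt.tz_localize('UTC').dt.tz_convert('Europe/Amsterdam')

# Compare element by element as Timestamp objects
matches = list(row_wise) == list(vectorized)
print(matches)
```

If the two approaches agree on a sample like this, the vectorized form should be a drop-in replacement for the slow line in the question.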

Alternative

  • This option is more concise: it localizes to 'UTC' while converting to a datetime dtype with pandas.to_datetime()
df['DT'] = pd.to_datetime(df['DT'], unit='s', utc=True).dt.tz_convert('Europe/Amsterdam')
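To illustrate the one-step form, here is a minimal sketch that converts two epoch values, one falling in Central European Summer Time and one in winter time, and inspects the resulting UTC offsets. The series name `epochs` and the second value (1356998400, midnight UTC on 2013-01-01) are chosen here for illustration and are not part of the original answer.

```python
import pandas as pd

# One instant during summer time (Oct 2012) and one during winter time (Jan 2013)
epochs = pd.Series([1349720105, 1356998400])

# utc=True makes the result timezone-aware (UTC) directly, so no tz_localize is needed
local = pd.to_datetime(epochs, unit='s', utc=True).dt.tz_convert('Europe/Amsterdam')

# Collect the UTC offset of each converted timestamp
offsets = [ts.strftime('%z') for ts in local]
print(local.dtype)
print(offsets)
```

The differing offsets show that `tz_convert` applies the daylight-saving rules of the target zone to each value rather than adding a fixed shift.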
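  • The most time-consuming part of the OP's original implementation is calling x['dt'].tz_localize('UTC') inside .apply()
  • The following two lines run in approximately the same time, to within a few milliseconds.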
df['DT_1'] = pd.to_datetime(df['DT'], unit='s', utc=True).dt.tz_convert('Europe/Amsterdam')
df['DT_2'] = pd.to_datetime(df['DT'], unit='s', utc=True).apply(lambda x: x.tz_convert('Europe/Amsterdam'))

%%timeit test

  • 1M rows
  • This compares the equivalent vectorized version against the .apply() version from the OP, where 'DT' has already been converted to a datetime dtype.
%%timeit
df['DT'].dt.tz_localize('UTC').dt.tz_convert('Europe/Amsterdam')
[out]:
4.4 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.apply(lambda x: x['DT'].tz_localize('UTC').tz_convert('Europe/Amsterdam'), axis=1)
[out]:
35.9 s ± 572 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
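`%%timeit` is an IPython cell magic, so the measurement above only runs in Jupyter or IPython. A rough equivalent for a plain Python script might look like the sketch below, which uses `time.perf_counter` on a much smaller frame. The 10,000-row size and the names `df_small`, `fast`, and `slow` are assumptions made here to keep the row-wise version quick. Absolute timings will vary by machine and pandas version.

```python
import time
import pandas as pd

n_rows = 10_000  # assumed sample size, far smaller than the 1M rows timed above
df_small = pd.DataFrame({'DT': pd.to_datetime([1349720105] * n_rows, unit='s')})

# Time the vectorized .dt chain
start = time.perf_counter()
fast = df_small['DT'].dt.tz_localize('UTC').dt.tz_convert('Europe/Amsterdam')
vectorized_seconds = time.perf_counter() - start

# Time the row-wise .apply() version from the question
start = time.perf_counter()
slow = df_small.apply(
    lambda x: x['DT'].tz_localize('UTC').tz_convert('Europe/Amsterdam'), axis=1
)
row_wise_seconds = time.perf_counter() - start

print(f'vectorized: {vectorized_seconds:.4f} s, row-wise: {row_wise_seconds:.4f} s')
```

A single timed run like this is noisier than the repeated runs `%%timeit` performs, so treat the numbers as an order-of-magnitude comparison only.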