Converting string date to epoch time not working with Cython and POSIX C libraries

dsi*_*mie 5 python date epoch cython pandas

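I have a very large pandas dataframe and I would like to create a column that contains the time in seconds since the epoch for an ISO-8601 format date string.

I originally used the standard Python libraries for this, but the result is quite slow. I have tried to replace this by calling the POSIX C library functions strptime and mktime directly, but I have not been able to get the right answer for the time conversion.

Here is the code (to be run in an IPython window):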

%load_ext cythonmagic

%%cython
from posix.types cimport time_t
cimport numpy as np
import numpy as np
import time
cdef extern from "sys/time.h" nogil:
    struct tm:
        int tm_sec
        int tm_min
        int tm_hour
        int tm_mday
        int tm_mon
        int tm_year
        int tm_wday
        int tm_yday
        int tm_isdst
    time_t mktime(tm *timeptr)
    char *strptime(const char *s, const char *format, tm *tm)
cdef to_epoch_c(const char *date_text):
    cdef tm time_val
    strptime(date_text, "%Y-%m-%d", &time_val)
    return <unsigned int>mktime(&time_val)
cdef to_epoch_py(const char *date_text):
    return np.uint32(time.mktime(time.strptime(date_text, "%Y-%m-%d")))
cpdef np.ndarray[unsigned int] apply_epoch_date_c(np.ndarray col_date):
    cdef Py_ssize_t i, n = len(col_date)
    cdef np.ndarray[unsigned int] res = np.empty(n, dtype=np.uint32)
    for i in range(len(col_date)):
        res[i] = to_epoch_c(col_date[i])
    return res
cpdef np.ndarray[unsigned int] apply_epoch_date_py(np.ndarray col_date):
    cdef Py_ssize_t i, n = len(col_date)
    cdef np.ndarray[unsigned int] res = np.empty(n, dtype=np.uint32)
    for i in range(len(col_date)):
        res[i] = to_epoch_py(col_date[i])
    return res

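The struct filled in by strptime looks incorrect: the hour, minute and second values are far too large, and removing them or setting them to 0 does not seem to give the answer I am looking for.

Here is a small test df which shows that the values from the C method are not correct: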

from pandas import DataFrame
test = DataFrame({'date_text':["2015-05-18" for i in range(3)]}, dtype=np.uint32)

apply_epoch_date_py(test['date_text'].values)
Output: array([1431903600, 1431903600, 1431903600], dtype=uint32)
apply_epoch_date_c(test['date_text'].values)
Output: array([4182545380, 4182617380, 4182602980], dtype=uint32)

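I don't understand why the values from the C version are not always the same, and why they are so far from what they should be. I hope the mistake is reasonably small, because the time difference between the two on a large dataframe is substantial (I'm not sure how much less work the C version is doing at the moment, since it is not working as expected).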

test_large = DataFrame({'date_text':["2015-05-18" for i in range(int(10e6))]}, dtype=np.uint32)
%timeit -n 1 -r 1 apply_epoch_date_py(test_large['date_text'].values)
Output: 1 loops, best of 1: 1min 58s per loop
%timeit apply_epoch_date_c(test_large['date_text'].values)
Output: 1 loops, best of 3: 5.59 s per loop

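I have looked at this cython time.h post and at a general post on creating a C unix time from a string, which may be useful to anyone answering.

My main question is therefore about the function to_epoch_c: why does it produce incorrect values? Thanks.

Update:

The method from @Jeff is indeed the fastest and simplest way to solve this using pandas.

The performance of strptime/mktime in Python is poor compared with the other methods. The other Python-based method mentioned here is much faster. Running the conversion for all the methods mentioned in this post (plus pd.to_datetime with the string format given) gives interesting results. Pandas with infer_datetime_format is easily the fastest and scales very well. Somewhat counterintuitively, it is much slower if you tell pandas what the date format is.

Performance comparison

Profile comparison of the two pandas methods: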

%prun -l 3 pd.to_datetime(df['date_text'],infer_datetime_format=True, box=False).values.view('i8')/10**9
352 function calls (350 primitive calls) in 0.021 seconds
Ordered by: internal time
List reduced from 96 to 3 due to restriction <3>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.013    0.013    0.013    0.013 {pandas.tslib.array_to_datetime}
    1    0.005    0.005    0.005    0.005 {pandas.lib.isnullobj}
    1    0.001    0.001    0.021    0.021 <string>:1(<module>)

%prun -l 3 pd.to_datetime(df['date_text'],format="%Y-%m-%d", box=False).values.view('i8')/10**9
109 function calls (107 primitive calls) in 0.253 seconds

Ordered by: internal time
List reduced from 55 to 3 due to restriction <3>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.251    0.251    0.251    0.251 {pandas.tslib.array_strptime}
    1    0.001    0.001    0.253    0.253 <string>:1(<module>)
    1    0.000    0.000    0.252    0.252 tools.py:176(to_datetime)

Pad*_*ham 4

It seems that if you don't set time_val.tm_hour, time_val.tm_min and time_val.tm_sec, the date is parsed incorrectly; setting the values to 0 returns the correct timestamp:

cdef extern from "sys/time.h" nogil:
    struct tm:
        int    tm_sec   # Seconds [0,60].
        int    tm_min   # Minutes [0,59].
        int    tm_hour  # Hour [0,23].
        int    tm_mday  # Day of month [1,31].
        int    tm_mon   # Month of year [0,11].
        int    tm_year  # Years since 1900.
        int    tm_wday  # Day of week [0,6] (Sunday = 0).
        int    tm_yday  # Day of year [0,365].
        int    tm_isdst # Daylight Savings
    time_t mktime(tm *timeptr)
    char *strptime(const char *s, const char *format, tm *tm)
cdef to_epoch_c(const char *date_text):
    cdef tm time_val
    time_val.tm_hour, time_val.tm_min, time_val.tm_sec = 0, 0, 0
    strptime(date_text, "%Y-%m-%d", &time_val)
    return <unsigned int>mktime(&time_val)

If you print(time.strptime(date_text, "%Y-%m-%d")) you can see that Python sets those values to 0 when the format passed to strptime does not include them:

time.struct_time(tm_year=2015, tm_mon=5, tm_mday=18, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=138, tm_isdst=-1)
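The Python-side behaviour described above can be checked directly. This is a minimal sketch using only the standard library; `parse_date` is a hypothetical helper name, not something from the question. It shows that fields the format does not name come back with defaults (0 for the time of day, -1 for `tm_isdst`), whereas the Cython version starts from an uninitialised C struct, and C's strptime typically leaves fields it was not asked to parse untouched.

```python
import time


def parse_date(date_text):
    """Hypothetical helper: parse a YYYY-MM-DD string into a time.struct_time."""
    return time.strptime(date_text, "%Y-%m-%d")


parsed = parse_date("2015-05-18")

# The format names only year, month and day, so the time-of-day fields
# are filled with defaults instead of leftover memory.
print(parsed.tm_hour, parsed.tm_min, parsed.tm_sec)

# tm_isdst is -1, meaning "unknown, let mktime work it out".
print(parsed.tm_isdst)
```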

Setting the values to a default of 0 in to_epoch_c also gives you 0 for those fields:

{'tm_sec': 0, 'tm_hour': 0, 'tm_mday': 18, 'tm_isdst': 1, 'tm_year': 115, 'tm_mon': 4, 'tm_yday': 137, 'tm_wday': 1, 'tm_min': 0}

If you don't set them, you get random timestamps back, because tm_sec and the other fields hold arbitrary values:

{'tm_sec': -1437999996, 'tm_hour': 0, 'tm_mday': 0, 'tm_isdst': -1438000080, 'tm_year': 32671, 'tm_mon': -1412460224, 'tm_yday': 0, 'tm_wday': 5038405, 'tm_min': 32671}
{'tm_sec': -1437999996, 'tm_hour': 4, 'tm_mday': 14, 'tm_isdst': 0, 'tm_year': 69, 'tm_mon': 10, 'tm_yday': 317, 'tm_wday': 5, 'tm_min': 32671}
{'tm_sec': -1437999996, 'tm_hour': 9, 'tm_mday': 14, 'tm_isdst': 0, 'tm_year': 69, 'tm_mon': 10, 'tm_yday': 317, 'tm_wday': 5, 'tm_min': 32671}

I imagine Python handles the fields you don't pass in a similar way, but I have not looked at the source, so maybe someone more experienced in C can confirm.

If you try to pass fewer than 9 elements to time.struct_time you get an error, which somewhat confirms what I thought:

In [60]: import time
In [61]: struct = time.struct_time((2015, 6, 18))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-61-ee40483c37d4> in <module>()
----> 1 struct = time.struct_time((2015, 6, 18))

TypeError: time.struct_time() takes a 9-sequence (3-sequence given)

You have to pass a sequence with 9 elements:

In [63]: struct = time.struct_time((2015, 6, 18, 0, 0, 0, 0, 0, 0))
In [64]: struct
Out[64]: time.struct_time(tm_year=2015, tm_mon=6, tm_mday=18, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=0, tm_isdst=0)
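As a small illustration of the 9-element requirement, the sketch below pads a date-only triple explicitly. `date_struct` is a hypothetical helper introduced here for illustration, and using -1 for `tm_isdst` (rather than the 0 shown above) is an assumption, chosen to match what time.strptime returns for a date-only format.

```python
import time


def date_struct(year, month, day):
    """Hypothetical helper: pad (year, month, day) to the 9 fields struct_time needs."""
    # hour, minute, second -> 0; weekday and yearday -> 0 as placeholders;
    # isdst -> -1, meaning "unknown".
    return time.struct_time((year, month, day, 0, 0, 0, 0, 0, -1))


# A 3-element sequence is rejected outright instead of being padded.
try:
    time.struct_time((2015, 6, 18))
    short_sequence_accepted = True
except TypeError:
    short_sequence_accepted = False

padded = date_struct(2015, 6, 18)
print(short_sequence_accepted, padded.tm_hour, padded.tm_isdst)
```

The point is that Python never leaves a field undefined: every value is either supplied or an explicit default, which is what the C struct needs to have done by hand.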

Anyway, with the change you get the same behaviour from both:

In [16]: import pandas as pd

In [17]: import numpy as np

In [18]: test = pd.DataFrame({'date_text' : ["2015-05-18" for i in range(3)]}, dtype=np.uint32)

In [19]: apply_epoch_date_c(test['date_text'].values)
Out[19]: array([1431903600, 1431903600, 1431903600], dtype=uint32)

In [20]: apply_epoch_date_py(test['date_text'].values)
Out[20]: array([1431903600, 1431903600, 1431903600], dtype=uint32)
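Note that 1431903600 is 3600 seconds less than midnight UTC on 2015-05-18 (1431907200). That is consistent with mktime interpreting the parsed date in the machine's local time zone, apparently UTC+1 here. If a time-zone-independent value is wanted, one option is calendar.timegm from the standard library. This is a minimal sketch, and `to_epoch_utc` is a hypothetical name rather than part of the code above:

```python
import calendar
import time


def to_epoch_utc(date_text):
    """Seconds since the epoch for midnight UTC on a YYYY-MM-DD date string."""
    parsed = time.strptime(date_text, "%Y-%m-%d")
    # calendar.timegm treats the struct as UTC, whereas time.mktime
    # treats it as local time and so depends on the machine's time zone.
    return calendar.timegm(parsed)


print(to_epoch_utc("2015-05-18"))  # 1431907200
```

This is still a per-string Python call, so it addresses the time-zone dependence only, not the speed problem the question is about.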

Some tests on every date since 1970-1-1 show that both return the same timestamps:

In [55]: from datetime import datetime, timedelta

In [56]: tests = np.array([(datetime.strptime("1970-1-1","%Y-%m-%d")+timedelta(i)).strftime("%Y-%m-%d") for i in range(16604)])

In [57]: a = apply_epoch_date_c( tests)

In [58]: b = apply_epoch_date_py( tests)

In [59]: for d1, d2 in zip(a, b):
   ....:     assert d1 == d2
   ....:

In [60]:

Timing both implementations, the Cython code does indeed seem to be quite a bit more efficient:

In [21]: timeit apply_epoch_date_py(test['date_text'].values)
10000 loops, best of 3: 73 µs per loop

In [22]: timeit apply_epoch_date_c(test['date_text'].values)
100000 loops, best of 3: 10.8 µs per loop