dsi*_*mie 5 python date epoch cython pandas
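I have a very large pandas DataFrame and want to create a column containing the number of seconds since the epoch for an ISO-8601 date string.
I originally used the standard Python library for this, but it turned out to be quite slow. I tried to replace it by calling the POSIX C library functions strptime and mktime directly, but I have not been able to get the right answers from the time conversion.
Here is the code (run in an IPython window):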
%load_ext cythonmagic

%%cython
from posix.types cimport time_t
cimport numpy as np
import numpy as np
import time

cdef extern from "sys/time.h" nogil:
    struct tm:
        int tm_sec
        int tm_min
        int tm_hour
        int tm_mday
        int tm_mon
        int tm_year
        int tm_wday
        int tm_yday
        int tm_isdst
    time_t mktime(tm *timeptr)
    char *strptime(const char *s, const char *format, tm *tm)

cdef to_epoch_c(const char *date_text):
    cdef tm time_val
    strptime(date_text, "%Y-%m-%d", &time_val)
    return <unsigned int>mktime(&time_val)

cdef to_epoch_py(const char *date_text):
    return np.uint32(time.mktime(time.strptime(date_text, "%Y-%m-%d")))

cpdef np.ndarray[unsigned int] apply_epoch_date_c(np.ndarray col_date):
    cdef Py_ssize_t i, n = len(col_date)
    cdef np.ndarray[unsigned int] res = np.empty(n, dtype=np.uint32)
    for i in range(len(col_date)):
        res[i] = to_epoch_c(col_date[i])
    return res

cpdef np.ndarray[unsigned int] apply_epoch_date_py(np.ndarray col_date):
    cdef Py_ssize_t i, n = len(col_date)
    cdef np.ndarray[unsigned int] res = np.empty(n, dtype=np.uint32)
    for i in range(len(col_date)):
        res[i] = to_epoch_py(col_date[i])
    return res
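The struct produced by strptime does not look right: the hour, minute and second values are far too large, and removing them or setting them to 0 does not seem to give the answer I am looking for.
Here is a small test df that shows the values from the C method are incorrect: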
from pandas import DataFrame
test = DataFrame({'date_text':["2015-05-18" for i in range(3)]}, dtype=np.uint32)
apply_epoch_date_py(test['date_text'].values)
Output: array([1431903600, 1431903600, 1431903600], dtype=uint32)
apply_epoch_date_c(test['date_text'].values)
Output: array([4182545380, 4182617380, 4182602980], dtype=uint32)
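I don't understand why the values from the C version are not always the same, and why they are so far from what they should be. I hope the bug is fairly small, because the timing difference between the two on a large DataFrame is substantial (I'm not sure how much less work the C version is doing right now, given that it isn't working as expected).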
test_large = DataFrame({'date_text':["2015-05-18" for i in range(int(10e6))]}, dtype=np.uint32)
%timeit -n 1 -r 1 apply_epoch_date_py(test_large['date_text'].values)
Output: 1 loops, best of 1: 1min 58s per loop
%timeit apply_epoch_date_c(test_large['date_text'].values)
Output: 1 loops, best of 3: 5.59 s per loop
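The test column above repeats the same string ten million times, so a cheap pure-Python mitigation is to parse each distinct string only once. The sketch below is an assumption on my part rather than anything from the original post: `to_epoch_cached` is a hypothetical helper, and it only pays off when the column contains many repeated dates.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def to_epoch_cached(date_text):
    """Hypothetical helper: local-time epoch seconds, parsed once per distinct string."""
    return int(time.mktime(time.strptime(date_text, "%Y-%m-%d")))


# A small stand-in for a column with heavily repeated dates.
dates = ["2015-05-18"] * 1000 + ["2015-05-19"] * 1000
epochs = [to_epoch_cached(d) for d in dates]

# Only the first occurrence of each string is actually parsed.
info = to_epoch_cached.cache_info()
print(info.misses, info.hits)
```

Every later lookup is a dictionary hit, so the cost of strptime/mktime is paid once per unique date instead of once per row.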
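I have looked at this cython time.h post and at a general C post on creating Unix time from a string, which may be useful to anyone answering.
So my main question is: why does the function to_epoch_c produce incorrect values? Thanks.
Update:
The method from @Jeff is indeed the fastest and simplest way to solve this with pandas.
The performance of strptime/mktime in Python is poor compared with the other methods. The other Python-based method mentioned here is much faster. Running the conversion for all the methods mentioned in this post (plus pd.to_datetime with the string format given) gives interesting results. Pandas with infer_datetime_format is easily the fastest and scales very well. It is slightly unintuitive how sluggish pandas becomes if you tell it the date format.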
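For reference, here is a minimal sketch of the pandas route on a tiny hypothetical frame. As far as I know, the `box` argument used in the profile lines below belongs to older pandas releases and is not accepted by recent ones, so this version subtracts the epoch and floor-divides by one second instead of viewing the values as `i8`. Note that it treats the strings as UTC, whereas `mktime` interprets them in the machine's local timezone, so the numbers can differ from the Cython output by the local UTC offset.

```python
import pandas as pd

# Hypothetical sample column; the real frame in the question is far larger.
df = pd.DataFrame({"date_text": ["2015-05-18", "1970-01-02"]})

# Parse the ISO-8601 strings in one vectorised call.
parsed = pd.to_datetime(df["date_text"], format="%Y-%m-%d")

# Whole seconds since 1970-01-01, independent of the datetime resolution in use.
df["epoch"] = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

print(df["epoch"].tolist())
```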

Profile comparison of the two pandas approaches:
%prun -l 3 pd.to_datetime(df['date_text'],infer_datetime_format=True, box=False).values.view('i8')/10**9
352 function calls (350 primitive calls) in 0.021 seconds
Ordered by: internal time
List reduced from 96 to 3 due to restriction <3>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.013 0.013 0.013 0.013 {pandas.tslib.array_to_datetime}
1 0.005 0.005 0.005 0.005 {pandas.lib.isnullobj}
1 0.001 0.001 0.021 0.021 <string>:1(<module>)
%prun -l 3 pd.to_datetime(df['date_text'],format="%Y-%m-%d", box=False).values.view('i8')/10**9
109 function calls (107 primitive calls) in 0.253 seconds
Ordered by: internal time
List reduced from 55 to 3 due to restriction <3>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.251 0.251 0.251 0.251 {pandas.tslib.array_strptime}
1 0.001 0.001 0.253 0.253 <string>:1(<module>)
1 0.000 0.000 0.252 0.252 tools.py:176(to_datetime)
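It appears that if you do not set time_val.tm_hour, time_val.tm_min and time_val.tm_sec, the date is parsed incorrectly: strptime only fills in the fields named in the format string, so the remaining fields keep whatever happened to be in memory. Setting the values to 0 first returns the correct timestamp: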
cdef extern from "sys/time.h" nogil:
    struct tm:
        int tm_sec    # Seconds [0,60].
        int tm_min    # Minutes [0,59].
        int tm_hour   # Hour [0,23].
        int tm_mday   # Day of month [1,31].
        int tm_mon    # Month of year [0,11].
        int tm_year   # Years since 1900.
        int tm_wday   # Day of week [0,6] (Sunday = 0).
        int tm_yday   # Day of year [0,365].
        int tm_isdst  # Daylight Savings
    time_t mktime(tm *timeptr)
    char *strptime(const char *s, const char *format, tm *tm)

cdef to_epoch_c(const char *date_text):
    cdef tm time_val
    time_val.tm_hour, time_val.tm_min, time_val.tm_sec = 0, 0, 0
    strptime(date_text, "%Y-%m-%d", &time_val)
    return <unsigned int>mktime(&time_val)

If you print(time.strptime(date_text, "%Y-%m-%d")) you can see that Python sets those values to 0 when you do not pass them to strptime:
time.struct_time(tm_year=2015, tm_mon=5, tm_mday=18, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=138, tm_isdst=-1)

Setting the values to a default of 0 in to_epoch_c also gives 0 for them:
{'tm_sec': 0, 'tm_hour': 0, 'tm_mday': 18, 'tm_isdst': 1, 'tm_year': 115, 'tm_mon': 4, 'tm_yday': 137, 'tm_wday': 1, 'tm_min': 0}

If you do not set them, you get random timestamps back, because tm_sec and the other fields hold arbitrary values:
{'tm_sec': -1437999996, 'tm_hour': 0, 'tm_mday': 0, 'tm_isdst': -1438000080, 'tm_year': 32671, 'tm_mon': -1412460224, 'tm_yday': 0, 'tm_wday': 5038405, 'tm_min': 32671}
{'tm_sec': -1437999996, 'tm_hour': 4, 'tm_mday': 14, 'tm_isdst': 0, 'tm_year': 69, 'tm_mon': 10, 'tm_yday': 317, 'tm_wday': 5, 'tm_min': 32671}
{'tm_sec': -1437999996, 'tm_hour': 9, 'tm_mday': 14, 'tm_isdst': 0, 'tm_year': 69, 'tm_mon': 10, 'tm_yday': 317, 'tm_wday': 5, 'tm_min': 32671}

I imagine Python handles it in a similar way when you don't pass them, but I haven't looked at the source, so perhaps someone more experienced with C can confirm.
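The Python side of that guess can be checked from the standard library alone. A minimal sketch, assuming nothing beyond `time.strptime`: fields that the format string does not mention come back with fixed defaults rather than leftover memory, and `tm_isdst` is reported as -1 (unknown).

```python
import time

# Only year, month and day appear in the format string.
parsed = time.strptime("2015-05-18", "%Y-%m-%d")

# The time-of-day fields are filled with defaults of 0.
print(parsed.tm_hour, parsed.tm_min, parsed.tm_sec)

# -1 means "unknown", which leaves the daylight-saving decision to mktime.
print(parsed.tm_isdst)
```

On the C side the analogous precaution would be to zero the whole struct before calling strptime (for example with memset) and to set tm_isdst to -1, since an uninitialized tm_isdst can also shift the result of mktime by an hour.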
If you try to pass fewer than 9 elements to time.struct_time you get an error, which somewhat confirms what I thought:
In [60]: import time

In [61]: struct = time.struct_time((2015, 6, 18))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-61-ee40483c37d4> in <module>()
----> 1 struct = time.struct_time((2015, 6, 18))

TypeError: time.struct_time() takes a 9-sequence (3-sequence given)

You must pass a sequence of 9 elements:
In [63]: struct = time.struct_time((2015, 6, 18, 0, 0, 0, 0, 0, 0))

In [64]: struct
Out[64]: time.struct_time(tm_year=2015, tm_mon=6, tm_mday=18, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=0, tm_isdst=0)

Either way, with the change you get the same behaviour from both:
In [16]: import pandas as pd

In [17]: import numpy as np

In [18]: test = pd.DataFrame({'date_text' : ["2015-05-18" for i in range(3)]}, dtype=np.uint32)

In [19]: apply_epoch_date_c(test['date_text'].values)
Out[19]: array([1431903600, 1431903600, 1431903600], dtype=uint32)

In [20]: apply_epoch_date_py(test['date_text'].values)
Out[20]: array([1431903600, 1431903600, 1431903600], dtype=uint32)

Some tests on every date since January 1, 1970 show that both return the same timestamps:
In [55]: from datetime import datetime, timedelta

In [56]: tests = np.array([(datetime.strptime("1970-1-1","%Y-%m-%d")+timedelta(i)).strftime("%Y-%m-%d") for i in range(16604)])

In [57]: a = apply_epoch_date_c(tests)

In [58]: b = apply_epoch_date_py(tests)

In [59]: for d1, d2 in zip(a, b):
   ....:     assert d1 == d2
   ....:

Timing the two implementations shows that the cython code is indeed considerably more efficient:
In [21]: timeit apply_epoch_date_py(test['date_text'].values)
10000 loops, best of 3: 73 µs per loop

In [22]: timeit apply_epoch_date_c(test['date_text'].values)
100000 loops, best of 3: 10.8 µs per loop
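One caveat on the value 1431903600 itself: `mktime` interprets the struct in the machine's local timezone, so both functions agree with each other but not necessarily with UTC. A small standard-library sketch of the difference, assuming a timezone-independent result is what is wanted, uses `calendar.timegm`:

```python
import calendar
import time

parsed = time.strptime("2015-05-18", "%Y-%m-%d")

# timegm treats the struct as UTC, so the result is the same on every machine.
utc_epoch = calendar.timegm(parsed)
print(utc_epoch)  # 1431907200

# mktime treats the struct as local time, so this depends on the machine's timezone.
local_epoch = int(time.mktime(parsed))

# The gap is the local UTC offset in seconds on that date.
print(utc_epoch - local_epoch)
```

The timestamps shown above are 3600 seconds smaller than the UTC value, which is consistent with a machine running one hour ahead of UTC on that date.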