Eli*_*i S 5 python timezone timestamp pandas timezone-offset
I have hourly data records that were recorded in local daylight time (for me this is US/Pacific). These will be read in through csv. A gap exists at the beginning of DST at 02:00 when we spring forward. In fall, I believe that the data collected at 01:00 PDT is labeled 01:00 and the next hour is labeled 02:00 (and assumes PST).
I would like to translate the timestamps so they play well with other data stored in PST. Below is my attempt, which focuses only on the index to simplify the discussion.
import pandas as pd

tndx = pd.DatetimeIndex(["2016-11-06 00:00", "2016-11-06 01:00", "2016-11-06 02:00", "2016-11-06 03:00"])
pst = tndx.tz_localize('US/Pacific', ambiguous="NaT").tz_convert('Etc/GMT+8')
print(pst.tz_localize(None))
Output is:
DatetimeIndex(['2016-11-05 23:00:00-08:00', 'NaT',
'2016-11-06 02:00:00-08:00', '2016-11-06 03:00:00-08:00']
There are two things wrong about this. First, from the perspective of PST it seems like I am now missing two timestamps, at 00:00 and 01:00. I get that the procedure is lossy, but I don't see why it has to be lossy beyond one timestamp. I get an exception with ambiguous="infer" because there are no redundant values. When I explicitly set this to a boolean array, as suggested by karajdaar, I don't lose the extra time point. However, the boolean list isn't that easy to come by: I can't use tndx because it isn't tz-aware yet. The only way I can think of is this circuitous route through datetime.dst that involves a separate DataFrame and conversion:
# Create an hourly, tz-aware date range that spans the possible times
ndx2 = pd.date_range(start=pd.Timestamp(2016, 11, 5), end=pd.Timestamp(2016, 11, 7), freq='h', tz='US/Pacific')
# Here is the determination of whether it is dst
isdst = [bool(x.dst()) for x in ndx2.to_pydatetime()]
# I use DataFrame indexing to perform the lookup
# for values in my original (naive) index
df2 = pd.DataFrame({"isdst": isdst}, index=ndx2.tz_localize(None))
# Of the repeated 01:00 label, keep only the last occurrence
df2 = df2.loc[~df2.index.duplicated(keep="last")]
ambig = df2.loc[tndx, "isdst"].to_numpy()  # This is what I would use for ambiguous
Second, I used Etc/GMT+8 because I essentially blundered into discovering it gives the right offsets and timestamps, particularly after I make the stamps naive again. If I do not strip the time zone information (i.e. without the last tz_localize(None)), the output would be:
>>> tndx.tz_localize('US/Pacific',ambiguous='NaT').tz_convert('Etc/GMT+8')
DatetimeIndex(['2016-11-05 23:00:00-08:00', 'NaT',
'2016-11-06 02:00:00-08:00', '2016-11-06 03:00:00-08:00'],
dtype='datetime64[ns, Etc/GMT+8]', freq=None)
The offsets in this case look fine, but the timezone in the dtype seems misleading and in any event why is a time zone called GMT+8 giving offsets of -8? What am I not understanding about these conversions?
小智 0
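"The offsets in this case look fine, but the timezone in the dtype seems misleading and in any event why is a time zone called GMT+8 giving offsets of -8? What am I not understanding about these conversions?"
I am posting on this question because searching for an answer led me here, and I have since found more information.
The pandas time zone conversion functionality appears to be based on the IANA time zone database.
The etcetera file of the time zone database contains this handy comment:
Be consistent with POSIX TZ settings in the Zone names, even though this is the opposite of what many people expect. POSIX has positive signs west of Greenwich, but many people expect positive signs east of Greenwich. For example, TZ='Etc/GMT+4' uses the abbreviation "-04" and corresponds to 4 hours behind UT (i.e. west of Greenwich) even though many people would expect it to mean 4 hours ahead of UT (i.e. east of Greenwich).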
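The inverted sign described in that comment can be checked directly. This is only a minimal sketch, assuming Python 3.9+ (for the standard-library zoneinfo module) and an installed IANA database; the variable names are illustrative and not from the question.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Any instant works: the Etc/GMT±N zones are fixed offsets with no DST.
when = datetime(2016, 11, 6, 12, 0)

# POSIX-style name: "+8" means 8 hours WEST of Greenwich, i.e. UTC-08:00.
west = when.replace(tzinfo=ZoneInfo("Etc/GMT+8")).utcoffset()
# Conversely, "-8" means 8 hours EAST of Greenwich, i.e. UTC+08:00.
east = when.replace(tzinfo=ZoneInfo("Etc/GMT-8")).utcoffset()

print(west.total_seconds() / 3600)  # offset of Etc/GMT+8 in hours
print(east.total_seconds() / 3600)  # offset of Etc/GMT-8 in hours
```

So Etc/GMT+8 is a fixed UTC-08:00 zone, which is why it lines up with PST year-round.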
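See also the Wikipedia entry on the IANA time zone database, which says:
The special area of "Etc" is used for some administrative zones, particularly for "Etc/UTC" which represents Coordinated Universal Time. In order to conform with the POSIX style, those zone names beginning with "Etc/GMT" have their sign reversed from the standard ISO 8601 convention. In the "Etc" area, zones west of GMT have a positive sign and those east have a negative sign in their name (e.g. "Etc/GMT-14" is 14 hours ahead of GMT).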
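On the question's first point, the boolean array for ambiguous may not need a lookup table at all. The pandas documentation says the flag only applies to ambiguous times, so a constant mask could be enough. The sketch below assumes that every repeated 01:00 label was recorded in daylight time (PDT), as the question suggests; assume_dst and pst_naive are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Naive wall-clock labels from the question's example.
tndx = pd.DatetimeIndex(["2016-11-06 00:00", "2016-11-06 01:00",
                         "2016-11-06 02:00", "2016-11-06 03:00"])

# Assumption: any ambiguous label is PDT. The flag is only consulted for
# ambiguous times, so an all-True mask leaves the other labels unaffected.
assume_dst = np.ones(len(tndx), dtype=bool)

aware = tndx.tz_localize("US/Pacific", ambiguous=assume_dst)

# Convert to the fixed UTC-08:00 zone, then drop the tz to get naive PST.
pst_naive = aware.tz_convert("Etc/GMT+8").tz_localize(None)
print(pst_naive)
```

Under that assumption the 01:00 label maps to 00:00 PST, so the converted index should have a single gap (01:00 PST) rather than two. If the data can contain both occurrences of 01:00, a constant mask would not distinguish them and a per-row mask is still needed.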