ptp*_*til 5 python datetime pandas
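I'm using pandas to work with some time-series-based data that contains dates, numbers, categories, and so on.
The problem I'm having is getting pandas to handle my date/time columns correctly in a DataFrame created from a CSV. There are 18 date columns in my data, they are not contiguous, and unknown values in the raw CSV have the string value "Unknown". Some columns have a valid datetime in ALL cells, and the pandas read_csv method guesses their dtype correctly. However, some columns in a particular data sample have every cell set to "Unknown", and those columns come in as object.
My code for loading the CSV is as follows: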
self.datecols = ['Claim Date', 'Lock Date', 'Closed Date', 'Service Date', 'Latest_Submission', 'Statement Date 1', 'Statement Date 2', 'Statement Date 3', 'Patient Payment Date 1', 'Patient Payment Date 2', 'Patient Payment Date 3', 'Primary 1 Payment Date', 'Primary 2 Payment Date', 'Primary 3 Payment Date', 'Secondary 1 Payment Date', 'Secondary 2 Payment Date', 'Tertiary Payment Date']
self.csvbear = pd.read_csv(file_path, index_col="Claim ID", parse_dates=True, na_values=['Unknown'])
self.csvbear = pd.DataFrame.convert_objects(self.csvbear, convert_dates='coerce')
print self.csvbear.dtypes
print self.csvbear['Tertiary Payment Date'].values
Output of print self.csvbear.dtypes:
Prac object
Doctor Name object
Practice Name object
Specialty object
Speciality Code int64
Claim Date datetime64[ns]
Lock Date datetime64[ns]
Progress Note Locked object
Aging by Claim Date int64
Aging by Lock Date int64
Closed Date datetime64[ns]
Service Date datetime64[ns]
Week Number int64
Month datetime64[ns]
Current Insurance object
...
Secondary 2 Deductible float64
Secondary 2 Co Insurance float64
Secondary 2 Member Balance float64
Secondary 2 Paid float64
Secondary 2 Witheld float64
Secondary 2 Ins object
Tertiary Payment Date object
Tertiary Payment ID float64
Tertiary Allowed float64
Tertiary Deductible float64
Tertiary Co Insurance float64
Tertiary Member Balance float64
Tertiary Paid float64
Tertiary Witheld float64
Tertiary Ins float64
Length: 96, dtype: object
[nan nan nan ..., nan nan nan]
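As you can see, the Tertiary Payment Date column should have a datetime64 dtype, but it is simply object, and its actual content is just NaN (put there by read_csv in place of the string 'Unknown').
How can I reliably convert all of the date columns so that they have datetime64 as their dtype and use NaT for the "Unknown" cells?
If you have an all-NaN column, read_csv will not coerce it properly. The easiest thing is just to do this (a column that is already datetime64[ns] will simply pass through).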
In [3]: df = DataFrame(dict(A = Timestamp('20130101'), B = np.random.randn(5), C = np.nan))
In [4]: df
Out[4]:
A B C
0 2013-01-01 00:00:00 -0.859994 NaN
1 2013-01-01 00:00:00 -2.562136 NaN
2 2013-01-01 00:00:00 0.410673 NaN
3 2013-01-01 00:00:00 0.480578 NaN
4 2013-01-01 00:00:00 0.464771 NaN
[5 rows x 3 columns]
In [5]: df.dtypes
Out[5]:
A datetime64[ns]
B float64
C float64
dtype: object
In [6]: df['A'] = pd.to_datetime(df['A'])
In [7]: df['C'] = pd.to_datetime(df['C'])
In [8]: df
Out[8]:
A B C
0 2013-01-01 00:00:00 -0.859994 NaT
1 2013-01-01 00:00:00 -2.562136 NaT
2 2013-01-01 00:00:00 0.410673 NaT
3 2013-01-01 00:00:00 0.480578 NaT
4 2013-01-01 00:00:00 0.464771 NaT
[5 rows x 3 columns]
In [9]: df.dtypes
Out[9]:
A datetime64[ns]
B float64
C datetime64[ns]
dtype: object
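convert_objects won't force a column to datetime unless it contains at least one non-NaN value that is a date (which is why your example fails). to_datetime can be more aggressive because it "knows" that you really want to convert the column.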
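Applied to the question's situation, the same idea can be run over every date column in a loop. The following is only a minimal sketch under some assumptions: the CSV text and the shortened date_cols list are made up for illustration (they stand in for the real file and for self.datecols), and errors='coerce' is used so that anything unparseable becomes NaT. Note also that convert_objects was later deprecated and removed from pandas, so calling to_datetime per column is the approach that still works in current releases.

```python
import io

import pandas as pd

# Hypothetical miniature of the CSV from the question; the real file has 96 columns.
csv_text = """Claim ID,Claim Date,Tertiary Payment Date,Tertiary Paid
1,2013-01-01,Unknown,10.5
2,2013-01-02,Unknown,0.0
3,Unknown,Unknown,3.25
"""

# Shortened stand-in for the question's self.datecols list.
date_cols = ['Claim Date', 'Tertiary Payment Date']

# 'Unknown' is turned into NaN while reading, as in the question.
df = pd.read_csv(io.StringIO(csv_text), index_col='Claim ID', na_values=['Unknown'])

# Convert each date column explicitly. An all-NaN column becomes all NaT,
# and a column that is already datetime passes through unchanged.
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors='coerce')

print(df.dtypes)
```

After the loop, both columns should report a datetime dtype, with the all-"Unknown" column holding only NaT.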