I have a list of DataFrames that I am trying to combine with the concat function.
dataframe_lists = [df1, df2, df3]
result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
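A side note on the call above: `keys` and `ignore_index=True` work against each other. The sketch below uses two made-up frames (the real `df1`, `df2`, `df3` are not shown in the question) to illustrate that the keys do not survive when the index is reset.

```python
import pandas as pd

# Hypothetical stand-in frames; the real df1, df2, df3 are not shown in the question.
df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3]})

# With ignore_index=True the result gets a fresh 0..n-1 index, so the keys are not kept.
flat = pd.concat([df1, df2], keys=["one", "two"], ignore_index=True)

# Without ignore_index the keys become the outer level of a MultiIndex.
keyed = pd.concat([df1, df2], keys=["one", "two"])

print(flat.index.tolist())
print(keyed.index.tolist())
```

If the labels 'one', 'two', 'three' are needed later to tell the sources apart, dropping `ignore_index=True` is probably the simpler choice.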
The full traceback is:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-198-a30c57d465d0> in <module>()
----> 1 result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
2 check(dataframe_lists)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
753 verify_integrity=verify_integrity,
754 copy=copy)
--> 755 return op.get_result()
756
757
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
924
925 new_data = concatenate_block_managers(
--> 926 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
927 if not self.copy:
928 new_data._consolidate_inplace()
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
4061 copy=copy),
4062 placement=placement)
-> 4063 for placement, join_units in concat_plan]
4064
4065 return BlockManager(blocks, axes)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
4061 copy=copy),
4062 placement=placement)
-> 4063 for placement, join_units in concat_plan]
4064
4065 return BlockManager(blocks, axes)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_join_units(join_units, concat_axis, copy)
4150 raise AssertionError("Concatenating join units along axis0")
4151
-> 4152 empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
4153
4154 to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in get_empty_dtype_and_na(join_units)
4139 return np.dtype('m8[ns]'), tslib.iNaT
4140 else: # pragma
-> 4141 raise AssertionError("invalid dtype determination in get_concat_dtype")
4142
4143
AssertionError: invalid dtype determination in get_concat_dtype
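I believe the error occurs because one of the DataFrames is empty. I used a simple function, check, to verify this and return the headers of the empty DataFrames: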
def check(list_of_df):
    # Collect the column headers of every empty DataFrame in the list.
    headers = []
    for df in list_of_df:  # iterate over the argument, not the global list
        if df.empty:
            headers.append(df.columns)
    return headers
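I am wondering whether this function can be used so that, when a DataFrame is empty, only its headers are returned and appended to the concatenated DataFrame. The output would be a single row of headers (and, where a column name repeats, just a single instance of that header, as concat itself produces). I have two sample data sources, one and two, which are non-empty data sets, and one empty DataFrame.
I would like the resulting concatenation to have the column headers...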
'AT','AccountNum', 'AcctType', 'Amount', 'City', 'Comment', 'Country','DuplicateAddressFlag', 'FromAccount', 'FromAccountNum', 'FromAccountT','PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress','PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1','Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA','WC', 'Zip'
with the headers of the empty DataFrame added to that row (if they are new):
'A', 'AT','AccountNum', 'AcctType', 'Amount', 'B', 'C', 'City', 'Comment', 'Country', 'D', 'DuplicateAddressFlag', 'E', 'F', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'G', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress','PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1','Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA','WC', 'Zip'
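The desired result described above could be sketched roughly as follows. This is only an illustration: the three frames and their column names are made up, and sorting the headers alphabetically is an assumption based on the ordering of the list above.

```python
import pandas as pd

# Hypothetical frames: two with data, one empty frame that only carries headers.
df1 = pd.DataFrame({"AT": [1], "Zip": ["10001"]})
df2 = pd.DataFrame({"AT": [2], "City": ["Oslo"]})
empty = pd.DataFrame(columns=["A", "B", "AT"])
frames = [df1, df2, empty]

# Concatenate only the frames that actually have rows.
non_empty = [df for df in frames if not df.empty]
result = pd.concat(non_empty, ignore_index=True)

# Union of all headers (each name once), including those of the empty frame.
all_columns = list(dict.fromkeys(c for df in frames for c in df.columns))

# Reindex so headers that only exist in the empty frame appear as all-NaN columns.
result = result.reindex(columns=sorted(all_columns))

print(result.columns.tolist())
```

The empty frame contributes no rows here, only the extra headers 'A' and 'B'. Note that `reindex` needs the column labels of `result` to be unique.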
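I welcome feedback on the best way to do this.
As the answer below details, this is an unexpected result:
Unfortunately, due to the sensitivity of this material, I cannot share the actual data. What is presented in the gist was derived as follows: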
A= data[data['RRT'] == 'A'] # Select just the rows of the DataFrame "data" where RRT is 'A'
B= data[data['RRT'] == 'B']
C= data[data['RRT'] == 'C']
D= data[data['RRT'] == 'D']
For each new DataFrame, I then apply this logic:
for column_name, column in A.transpose().iterrows():
    # Select a subset of columns from the DataFrame "A"
    AColumns = A[['ANum','RTID', 'Description','Type','Status', 'AD', 'CD', 'OD', 'RCD']]
When I evaluate the bound method count on the empty DataFrame A:
AColumns.count
This is the output:
<bound method DataFrame.count of Empty DataFrame
Columns: [ANum,RTID, Description,Type,Status, AD, CD, OD, RCD]
Index: []>
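The output above is the repr of the method itself, because `count` was referenced without parentheses. A minimal sketch of the difference, using a made-up empty frame with a few of the column names from the question:

```python
import pandas as pd

# Hypothetical empty frame with a few of the column names from the question.
a_columns = pd.DataFrame(columns=["ANum", "RTID", "Description"])

method = a_columns.count     # no parentheses: just the bound method object
counts = a_columns.count()   # called: number of non-null values per column
n_rows = len(a_columns)      # number of rows
is_empty = a_columns.empty   # True when the frame has no rows

print(counts)
print(n_rows, is_empty)
```

For a quick emptiness check, `len(df)` or `df.empty` is usually enough.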
Finally, I imported the CSV using the following:
data=pd.read_csv('Merged_Success2.csv', dtype=str, error_bad_lines = False, iterator=True, chunksize=1000)
data=pd.concat([chunk for chunk in data], ignore_index=True)
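I am not sure what else I can provide. The concat method works for all the other DataFrames needed to meet the requirement. I have also looked at the pandas internals.py and the full trace. Either I have too many NaN columns, duplicate column names, or mixed dtypes (the last being the least likely culprit).
Thank you again for your guidance.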
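This usually means that one of the DataFrames has two columns with the same name.
You can check whether this is the case by looking at the output of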
len(df.columns) > len(np.unique(df.columns))
for each DataFrame df that you are trying to concatenate.
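The same check can be run over every frame in one pass. A small sketch, assuming the frames are kept in a dict keyed by a label. The two frames here are invented, and `Index.is_unique` is used as an alternative to comparing lengths.

```python
import pandas as pd

# Hypothetical frames; the second one has the column name "x" twice.
frames = {
    "one": pd.DataFrame([[1, 2]], columns=["x", "y"]),
    "two": pd.DataFrame([[3, 4]], columns=["x", "x"]),
}

# True for every frame whose column labels are not unique.
has_duplicates = {name: not df.columns.is_unique for name, df in frames.items()}

print(has_duplicates)
```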
You can identify the culprit columns using Counter, as follows:
from collections import Counter
duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]
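To see what the Counter approach reports, here is a self-contained sketch that uses a plain list as a stand-in for `df.columns`. The names are borrowed from the question, and the repeat of 'RTID' is invented for illustration.

```python
from collections import Counter

# Stand-in for list(df.columns); "RTID" appears twice on purpose.
columns = ["ANum", "RTID", "Status", "RTID"]

# Keep only the names that occur more than once.
duplicates = [name for name, n in Counter(columns).items() if n > 1]

print(duplicates)
```

Once the offending names are known, the columns can be renamed or dropped before calling concat.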
I cannot reproduce your error; it works fine for me:
df1 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/42708e6a3ca0aed9b79b/raw/f37738994c3285e1b670d3926e716ae027dc30bc/sample_data.csv')
df2 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/26eb4ce1578e0844eb82/raw/23d9063dad7793d87a2fed2275857c85b59d56bb/sample2.csv')
df3 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/0721bd8b71416b54eccd/raw/b7ecae63beff88bd076a93d83500eb5fa67e1278/empty_df.csv')
pd.concat([df1,df2,df3], keys = ['one', 'two','three'], ignore_index=True).head()
Out[68]:
   'B'  'C'  'D'  'E'  'F'  'G'  'A'   AT  AccountNum  AcctType  ...
0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN         NaN       NaN  ...
1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN         NaN       NaN  ...
2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN         NaN       NaN  ...
3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN         NaN       NaN  ...
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN         NaN       NaN  ...

   ToAccountNum  ToAccountT  TransferAmount  TransferMade  TransferTimestamp
0           NaN         NaN               4          True      1/7/2000 0:00
1           NaN         NaN               4          True      1/8/2000 0:00
2           NaN         NaN               6          True      1/9/2000 0:00
3           NaN         NaN               6          True     1/10/2000 0:00
4           NaN         NaN               0         False     1/11/2000 0:00

  Ttype  Unnamed: 0   WA   WC  Zip
0     D           4  NaN  NaN  NaN
1     D           5  NaN  NaN  NaN
2     D          13  NaN  NaN  NaN
3     D          14  NaN  NaN  NaN
4     T          25  NaN  NaN  NaN

[5 rows x 41 columns]