Solution for "AssertionError: invalid dtype determination in get_concat_dtype" when concatenating a list of DataFrames

ahl*_*989 6 python csv pandas

I have a list of dataframes that I am trying to combine using the concat function.

dataframe_lists = [df1, df2, df3]

result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)

The full traceback is:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-198-a30c57d465d0> in <module>()
----> 1 result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
      2 check(dataframe_lists)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    753                        verify_integrity=verify_integrity,
    754                        copy=copy)
--> 755     return op.get_result()
    756 
    757 

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
    924 
    925             new_data = concatenate_block_managers(
--> 926                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
    927             if not self.copy:
    928                 new_data._consolidate_inplace()

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_join_units(join_units, concat_axis, copy)
   4150         raise AssertionError("Concatenating join units along axis0")
   4151 
-> 4152     empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
   4153 
   4154     to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in get_empty_dtype_and_na(join_units)
   4139         return np.dtype('m8[ns]'), tslib.iNaT
   4140     else:  # pragma
-> 4141         raise AssertionError("invalid dtype determination in get_concat_dtype")
   4142 
   4143 

AssertionError: invalid dtype determination in get_concat_dtype

I think the error arises because one of the dataframes is empty. I use a simple function, check, to verify this and return the headers of any empty dataframes:

def check(list_of_df):
    headers = []
    for df in list_of_df:
        if df.empty:
            headers.append(df.columns)
    return headers

I would like to know whether I can use this function so that, in the case of an empty dataframe, only its headers are returned and appended to the concatenated dataframe. The output would be a single row of headers (and, where column names are duplicated, only a single instance of each header, as with the concat function). I have two sample data sources: two non-empty datasets, and one empty dataframe.

The concatenation I would like to get has these column headers...

 'AT','AccountNum', 'AcctType', 'Amount', 'City', 'Comment', 'Country','DuplicateAddressFlag', 'FromAccount', 'FromAccountNum', 'FromAccountT','PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress','PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1','Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA','WC', 'Zip'

with the headers of the empty dataframe added to that row (where they are new):

 'A', 'AT','AccountNum', 'AcctType', 'Amount', 'B', 'C', 'City', 'Comment', 'Country', 'D', 'DuplicateAddressFlag', 'E', 'F', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'G', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress','PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1','Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA','WC', 'Zip'

I would welcome feedback on the best approach to this.
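One way to get the desired behavior is to concatenate only the non-empty frames and then graft on any columns that appear only in the empty ones. This is a minimal sketch, not the asker's actual code; the helper name `concat_with_empty_headers` is hypothetical, and it assumes at least one frame in the list is non-empty:

```python
import pandas as pd

def concat_with_empty_headers(dfs):
    """Concatenate the non-empty frames, then add the column headers
    (but no rows) contributed only by the empty frames."""
    non_empty = [df for df in dfs if not df.empty]
    empty = [df for df in dfs if df.empty]
    result = pd.concat(non_empty, ignore_index=True)
    for df in empty:
        for col in df.columns:
            # columns already present are kept as a single instance
            if col not in result.columns:
                result[col] = float('nan')
    return result
```

Any column shared between an empty and a non-empty frame appears once, matching the single-instance behavior described above.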

As the answer below details, this was an unexpected result:

Unfortunately, due to the sensitivity of this material, I cannot share the actual data. What was posted in the gists was produced as follows:

A = data[data['RRT'] == 'A']  # select the rows where 'RRT' == 'A' from the dataframe "data"
B = data[data['RRT'] == 'B']
C = data[data['RRT'] == 'C']
D = data[data['RRT'] == 'D']
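As an aside, the four boolean selections above can be done in one pass with `groupby`. A small sketch with a hypothetical stand-in for the real `data` frame:

```python
import pandas as pd

# hypothetical stand-in for the real "data" frame
data = pd.DataFrame({'RRT': ['A', 'B', 'A', 'C'],
                     'Amount': [1, 2, 3, 4]})

# one pass instead of four separate boolean selections
frames = {key: group for key, group in data.groupby('RRT')}
A = frames['A']  # same rows as data[data['RRT'] == 'A']
```

Note that `groupby` only produces groups for values that actually occur, so a key like 'D' is simply absent (rather than an empty frame) if no row has it.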

For each new dataframe, I then apply this logic:

for column_name, column in A.transpose().iterrows():
    AColumns= A[['ANum','RTID', 'Description','Type','Status', 'AD', 'CD', 'OD', 'RCD']]  #get select columns indexed with dataframe, "A"

When I evaluate the bound method (note: without parentheses, so it is not actually called) on the empty dataframe AColumns:

AColumns.count

This is the output:

<bound method DataFrame.count of Empty DataFrame
Columns: [ANum,RTID, Description,Type,Status, AD, CD, OD, RCD]
Index: []>
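That repr is what you get when you omit the parentheses: `AColumns.count` is the bound method object itself, while `AColumns.count()` actually runs it. A minimal sketch, using hypothetical column names analogous to the ones above:

```python
import pandas as pd

# an empty frame analogous to AColumns above
AColumns = pd.DataFrame(columns=['ANum', 'RTID', 'Description'])

method = AColumns.count    # no parentheses: the bound method itself
counts = AColumns.count()  # with parentheses: per-column non-NA counts
```

For an empty frame, `counts` is a Series of zeros indexed by the column names, which is usually the diagnostic you actually want.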

Finally, I imported the CSV using the following:

data=pd.read_csv('Merged_Success2.csv', dtype=str, error_bad_lines = False, iterator=True,  chunksize=1000)
data=pd.concat([chunk for chunk in data], ignore_index=True)

I am not sure what else I can provide. The concat method works for all the other dataframes needed to satisfy the requirements. I have also looked at pandas internals.py and the full traceback. Either I have too many NaN columns, duplicate column names, or mixed dtypes (the last being the least likely culprit).

Thanks again for your guidance.

小智 11

We ran into the same error in one of our projects. After debugging, we found the problem: one of our dataframes had two columns with the same name. After renaming one of those columns, our problem was solved.


Abr*_*odj 7

This usually means that you have two columns with the same name in one of your dataframes.

You can check whether this is the case by looking at the output of

import numpy as np

len(df.columns) > len(np.unique(df.columns))

for each dataframe df that you are trying to concatenate.

You can identify the culprit columns using Counter as follows:

from collections import Counter
duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]
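Once the duplicates are identified, renaming them is enough to make concat work, as the other answer notes. A minimal sketch of one way to do that; the helper name `dedupe_columns` and the suffix scheme are my own, not part of the question:

```python
from collections import Counter
import pandas as pd

def dedupe_columns(df):
    """Return a copy of df with duplicate column names suffixed,
    e.g. ['Amount', 'Amount'] -> ['Amount', 'Amount_1']."""
    seen = Counter()
    new_cols = []
    for col in df.columns:
        # first occurrence keeps its name; later ones get _1, _2, ...
        new_cols.append(col if seen[col] == 0 else '%s_%d' % (col, seen[col]))
        seen[col] += 1
    out = df.copy()
    out.columns = new_cols
    return out
```

Applying this to each frame before `pd.concat` sidesteps the AssertionError without silently dropping data.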


max*_*moo 0

I can't reproduce your error; it works fine for me:

df1 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/42708e6a3ca0aed9b79b/raw/f37738994c3285e1b670d3926e716ae027dc30bc/sample_data.csv')
df2 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/26eb4ce1578e0844eb82/raw/23d9063dad7793d87a2fed2275857c85b59d56bb/sample2.csv')
df3 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/0721bd8b71416b54eccd/raw/b7ecae63beff88bd076a93d83500eb5fa67e1278/empty_df.csv')
pd.concat([df1,df2,df3], keys = ['one', 'two','three'], ignore_index=True).head()

Out[68]: 
   'B'  'C'  'D'  'E'  'F'  'G'  'A'  AT  AccountNum  AcctType ...   0  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
1  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
2  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
3  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    

   ToAccountNum  ToAccountT  TransferAmount  TransferMade  TransferTimestamp  
0           NaN         NaN               4          True      1/7/2000 0:00   
1           NaN         NaN               4          True      1/8/2000 0:00   
2           NaN         NaN               6          True      1/9/2000 0:00   
3           NaN         NaN               6          True     1/10/2000 0:00   
4           NaN         NaN               0         False     1/11/2000 0:00   

   Ttype  Unnamed: 0  WA   WC  Zip  
0      D           4 NaN  NaN  NaN  
1      D           5 NaN  NaN  NaN  
2      D          13 NaN  NaN  NaN  
3      D          14 NaN  NaN  NaN  
4      T          25 NaN  NaN  NaN  

[5 rows x 41 columns]