Returning multiple columns with groupby/aggregate

Ian*_*des 6 python pandas

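I have a sample dataset that I want to group by one column and then produce 4 new columns based on all the values of an existing column.

Here is some sample data: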

import pandas as pd
from numpy import nan

data = {'AlignmentId': {0: u'ENSMUST00000000001.4-1',
  1: u'ENSMUST00000000001.4-1',
  2: u'ENSMUST00000000003.13-0',
  3: u'ENSMUST00000000003.13-0',
  4: u'ENSMUST00000000003.13-0'},
 'name': {0: u'NonCodingDeletion',
  1: u'NonCodingInsertion',
  2: u'CodingDeletion',
  3: u'CodingInsertion',
  4: u'NonCodingDeletion'},
 'value_CDS': {0: nan, 1: nan, 2: 1.0, 3: 1.0, 4: nan},
 'value_mRNA': {0: 21.0, 1: 26.0, 2: 1.0, 3: 1.0, 4: 2.0}}
df = pd.DataFrame.from_dict(data)

It looks like this:

               AlignmentId                name  value_mRNA  value_CDS
0   ENSMUST00000000001.4-1   NonCodingDeletion        21.0        NaN
1   ENSMUST00000000001.4-1  NonCodingInsertion        26.0        NaN
2  ENSMUST00000000003.13-0      CodingDeletion         1.0        1.0
3  ENSMUST00000000003.13-0     CodingInsertion         1.0        1.0
4  ENSMUST00000000003.13-0   NonCodingDeletion         2.0        NaN

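I want to return booleans based on the presence or absence of certain values in the name column, depending on whether the value_CDS column contains only null values. I wrote this function to do that: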

def aggfunc(s):
    if s.value_CDS.any():
        c = set(s.name)
    else:
        c = set(s.name)
    return ('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

and ran this:

merged = df.groupby('AlignmentId').aggregate(aggfunc)

This gives me the error ValueError: Shape of passed values is (318, 4), indices imply (318, 3)

How can I return multiple new columns from a groupby-aggregate?

The output I am looking for is:

ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (True, True, True, False)

Ideally I would then put this into a 5-column DataFrame.

If I use .apply, the output is incorrect:

ENSMUST00000000001.4-1     (False, False, False, False)
ENSMUST00000000003.13-0    (False, False, False, False)

But if I take one group at a time, it is correct:

In [380]: for aln_id, d in df.groupby('AlignmentId'):
   .....:     print aggfunc(d)
   .....:
(False, False, False, False)
(True, True, True, False)

jez*_*ael 7

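You need to change .name to ['name'], because inside the applied function .name returns the name of the group (the value of the column being grouped by), not the name column: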

def aggfunc(s):
    if s.value_CDS.any():
        c = set(s['name'])
    else:
        c = set(s['name'])

    return ('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)
AlignmentId
ENSMUST00000000001.4-1     (False, False, False, False)
ENSMUST00000000003.13-0       (True, True, True, False)
dtype: object

Printing both inside the function shows the difference:

def aggfunc(s):

    print ('Name of group is: {}'.format((s.name)))  
    print ('Column name is:\n {}'.format(s['name']))  


merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)

Name of group is: ENSMUST00000000001.4-1
Column name is:
 0     NonCodingDeletion
1    NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000001.4-1
Column name is:
 0     NonCodingDeletion
1    NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000003.13-0
Column name is:
 2       CodingDeletion
3      CodingInsertion
4    NonCodingDeletion
Name: name, dtype: object
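This would also account for the all-False result from .apply in the question. If s.name is the group key (a plain string, as the printout above suggests), then set(s.name) is a set of single characters, and no full label such as 'CodingDeletion' can be a member of it. A minimal sketch in plain Python, with group_key standing in for the value of s.name:

```python
# group_key stands in for what s.name holds inside the applied function
group_key = 'ENSMUST00000000003.13-0'

# set() over a string yields its individual characters, not the string itself
c = set(group_key)

# A multi-character label is never an element of a set of single characters
print('CodingDeletion' in c)   # False
print('E' in c)                # True
```

In the for loop from the question, d is an ordinary DataFrame with no group name attached, so d.name falls back to the name column, which is why that version gave the expected result.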

Improved code:

def aggfunc(s):
    # the if and else branches built the same set, so the branch is dropped
    c = set(s['name'])

    # return a Series instead of a tuple so each value becomes its own column
    cols = ['col1','col2','col3','col4']
    return pd.Series(('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c), index=cols)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)

                          col1   col2   col3   col4
AlignmentId                                        
ENSMUST00000000001.4-1   False  False  False  False
ENSMUST00000000003.13-0   True   True   True  False
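As an alternative that avoids a custom function entirely, the same four flags can be computed row by row and then reduced per group with any(). This is a different technique from the apply-based code above, not part of the original answer; the column names col1 to col4 are reused from it, and the sample data is the one from the question.

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({
    'AlignmentId': ['ENSMUST00000000001.4-1', 'ENSMUST00000000001.4-1',
                    'ENSMUST00000000003.13-0', 'ENSMUST00000000003.13-0',
                    'ENSMUST00000000003.13-0'],
    'name': ['NonCodingDeletion', 'NonCodingInsertion', 'CodingDeletion',
             'CodingInsertion', 'NonCodingDeletion'],
    'value_CDS': [nan, nan, 1.0, 1.0, nan],
    'value_mRNA': [21.0, 26.0, 1.0, 1.0, 2.0],
})

names = df['name']

# One boolean column per flag, evaluated for every row
flags = pd.DataFrame({
    'col1': names.isin(['CodingDeletion', 'CodingInsertion']),
    'col2': names.eq('CodingInsertion'),
    'col3': names.eq('CodingDeletion'),
    'col4': names.isin(['CodingMult3Deletion', 'CodingMult3Insertion']),
})

# A flag is True for a group if it is True for any row in that group
merged = flags.groupby(df['AlignmentId']).any()
print(merged)
```

Calling merged.reset_index() afterwards turns AlignmentId into a regular column, which gives the 5-column DataFrame the question asks for.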

  • When I try the improved code I get the error ValueError: Function does not reduce (3 upvotes)