Appending list elements in long format in Python Pandas

Kub*_*888 4 python list dataframe python-3.x pandas

I have the following data:

study_id       list_value
1              ['aaa', 'bbb']
1              ['aaa']
1              ['ccc']
2              ['ddd', 'eee', 'aaa']
2              np.NaN
2              ['zzz', 'aaa', 'bbb']

How can I convert it into something like this?

study_id       list_value
1              ['aaa', 'bbb', 'ccc']
1              ['aaa', 'bbb', 'ccc']
1              ['aaa', 'bbb', 'ccc']
2              ['aaa', 'bbb', 'ddd', 'eee', 'zzz'] 
2              ['aaa', 'bbb', 'ddd', 'eee', 'zzz'] 
2              ['aaa', 'bbb', 'ddd', 'eee', 'zzz'] # order of list item doesn't matter

piR*_*red 5

defaultdict

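from collections import defaultdict

# Collect the unique values seen for each study_id
d = defaultdict(set)

# Skip rows whose list_value is NaN, then union each list into its group's set
for t in df.dropna(subset=['list_value']).itertuples():
    d[t.study_id] |= set(t.list_value)

# Sort each set into a list and map it back onto every row by study_id
df.assign(list_value=df.study_id.map(pd.Series(d).apply(sorted)))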


   study_id       list_value
0         1        [a, b, c]
1         1        [a, b, c]
2         1        [a, b, c]
3         2  [a, b, d, e, z]
4         2  [a, b, d, e, z]
5         2  [a, b, d, e, z]
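
The accumulation step above can be sketched without pandas, which may make the logic easier to follow. This is only an illustration: the `rows` list is a hypothetical stand-in for the DataFrame, and `None` plays the role of the NaN entry.

```python
from collections import defaultdict

# Hypothetical stand-in for the DataFrame rows: (study_id, list_value) pairs,
# with None playing the role of the missing (NaN) entry.
rows = [
    (1, ['a', 'b']), (1, ['a']), (1, ['c']),
    (2, ['d', 'e', 'a']), (2, None), (2, ['z', 'a', 'b']),
]

merged = defaultdict(set)
for study_id, values in rows:
    if values is None:               # skip missing rows, like dropna(subset=...)
        continue
    merged[study_id] |= set(values)  # in-place set union per group

# Sort each set so the output order is deterministic
lookup = {key: sorted(vals) for key, vals in merged.items()}

# Broadcast the merged list back to every row, like Series.map
result = [(study_id, lookup[study_id]) for study_id, _ in rows]
```

The row with the missing value still receives its group's merged list, because the lookup is keyed on `study_id` only.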

np.unique and other trickery

Note that the results are ndarrays, not lists.

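# Series.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df.assign(
    list_value=df.study_id.map(
        df.set_index('study_id').list_value.dropna()
          .groupby(level=0).sum().apply(np.unique)
    )
)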

   study_id       list_value
0         1        [a, b, c]
1         1        [a, b, c]
2         1        [a, b, c]
3         2  [a, b, d, e, z]
4         2  [a, b, d, e, z]
5         2  [a, b, d, e, z]

We need to apply sorted as well to get all the way to plain lists.

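# Series.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df.assign(
    list_value=df.study_id.map(
        df.set_index('study_id').list_value.dropna()
          .groupby(level=0).sum().apply(np.unique).apply(sorted)
    )
)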

Going all the way with str.join!

df.assign(
    list_value=df.study_id.map(
        df.list_value.str.join('|').groupby(df.study_id).apply(
            lambda x: sorted(set('|'.join(x.dropna()).split('|')))
        )
    )
)

   study_id       list_value
0         1        [a, b, c]
1         1        [a, b, c]
2         1        [a, b, c]
3         2  [a, b, d, e, z]
4         2  [a, b, d, e, z]
5         2  [a, b, d, e, z]
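
The join-and-split approach above relies on the separator never appearing inside a value. A minimal pure-Python sketch of the same two steps, with a hypothetical `SEP` constant and `None` standing in for NaN, shows both the normal case and what goes wrong when that assumption fails:

```python
# One group's lists; None stands in for a missing (NaN) row.
group = [['d', 'e', 'a'], None, ['z', 'a', 'b']]

SEP = '|'  # assumed never to occur inside any value

# Step 1: join each list into one string, leaving missing rows out
joined_rows = [SEP.join(values) for values in group if values is not None]

# Step 2: join the rows, split once, then de-duplicate and sort
merged = sorted(set(SEP.join(joined_rows).split(SEP)))

# A value that contains the separator is split apart, which is the risk
broken = sorted(set(SEP.join(['x|y', 'z']).split(SEP)))
```

Here `'x|y'` ends up as two separate items, so a separator that cannot occur in the data is the safer choice.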

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    study_id=[1, 1, 1, 2, 2, 2],
    list_value=[['a', 'b'], ['a'], ['c'], ['d', 'e', 'a'], np.nan, ['z', 'a', 'b']]
), columns=['study_id', 'list_value'])


cs9*_*s95 5

itertools.chain and GroupBy.transform

First, get rid of the NaNs inside your column using a list comprehension (messy, I know, but this is the fastest way to do it).

df['list_value'] = [
    [] if not isinstance(x, list) else x for x in df.list_value
]

Next, group by study_id, flatten your lists inside GroupBy.transform, and extract the unique values using a set.

from itertools import chain

df['list_value'] = df.groupby('study_id').list_value.transform(
    lambda x: [list(set(chain.from_iterable(x)))]
)
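
The transform call above depends on pandas broadcasting the one-element list that the lambda returns, and that behaviour may differ between pandas versions. A variant that avoids it, as a sketch only: it builds one list per group in a plain dict and maps it back by `study_id`. The names `cleaned` and `per_group` are hypothetical, and sorting is added purely to make the output deterministic.

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'study_id': [1, 1, 1, 2, 2, 2],
    'list_value': [['a', 'b'], ['a'], ['c'],
                   ['d', 'e', 'a'], np.nan, ['z', 'a', 'b']],
})

# Replace missing entries with empty lists so every cell is iterable
cleaned = pd.Series(
    [x if isinstance(x, list) else [] for x in df['list_value']],
    index=df.index,
)

# One sorted, de-duplicated list per study_id, kept in a plain dict
per_group = {
    key: sorted(set(chain.from_iterable(values)))
    for key, values in cleaned.groupby(df['study_id'])
}

# Give every row its own copy of its group's list
df['list_value'] = [list(per_group[key]) for key in df['study_id']]
```

Because each row receives a fresh copy, mutating one row's list leaves the other rows in the group unchanged.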

As a last step, if you intend to mutate individual list items, you may want to do this:

df['list_value'] = [x[:] for x in df['list_value']]

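If you don't, a change made to one list will be reflected in every other list in that group, because they are all the same object.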
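
The sharing described here is ordinary Python aliasing, so it can be shown without pandas. In this small sketch the row lists are hypothetical stand-ins for the cells of one group:

```python
shared = ['aaa', 'bbb', 'ccc']

# Three rows pointing at the same list object
aliased_rows = [shared, shared, shared]
aliased_rows[0].append('zzz')   # every row now shows 'zzz'

# Shallow copies (x[:]) give each row an independent list
base = ['aaa', 'bbb', 'ccc']
copied_rows = [x[:] for x in [base, base, base]]
copied_rows[0].append('zzz')    # only row 0 changes
```

A shallow copy is enough here because the items are immutable strings.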

df
   study_id                 list_value
0         1            [aaa, ccc, bbb]
1         1            [aaa, ccc, bbb]
2         1            [aaa, ccc, bbb]
3         2  [bbb, ddd, eee, aaa, zzz]
4         2  [bbb, ddd, eee, aaa, zzz]
5         2  [bbb, ddd, eee, aaa, zzz]