Extracting dictionary values from a column in a DataFrame

Question

I am looking for a way to optimize my code.

I have input data in this form:

import pandas as pn

a=[{'Feature1': 'aa1','Feature2': 'bb1','Feature3': 'cc2' },
 {'Feature1': 'aa2','Feature2': 'bb2' },
 {'Feature1': 'aa1','Feature2': 'cc1' }
 ]
b=['num1','num2','num3']


df= pn.DataFrame({'num':b, 'dic':a })

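I want to extract the element 'Feature3' from the dictionaries in the 'dic' column of the DataFrame above, wherever it exists. I have managed to solve it so far, but I don't know whether this is the fastest way, and it seems a bit overcomplicated.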

Feature3=[]
for idx, row in df['dic'].items():  # iteritems() was removed in pandas 2.0
    l=row.keys()

    if 'Feature3' in l:
        Feature3.append(row['Feature3'])
    else:
        Feature3.append(None)

df['Feature3']=Feature3
print(df)

Is there a better/faster/simpler way to extract Feature3 into a separate column of the DataFrame?

Thanks in advance for your help.

Answer 1

Ale*_*der 14

You can use a list comprehension to extract Feature3 from each row of the DataFrame, which returns a list.

feature3 = [d.get('Feature3') for d in df.dic]

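If 'Feature3' is not in the dictionary, get returns None by default.

You don't even need pandas, since you can again use a list comprehension to extract the feature from the original list of dictionaries, a.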

feature3 = [d.get('Feature3') for d in a]

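As a small illustration of the point about get, the sketch below shows that it also accepts an explicit default, and that the resulting list can be assigned straight back as a column. The name feature3_or_missing and the 'missing' placeholder are made up for this example; the usual pd alias is used instead of the question's pn.

```python
import pandas as pd

a = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
     {'Feature1': 'aa2', 'Feature2': 'bb2'},
     {'Feature1': 'aa1', 'Feature2': 'cc1'}]
df = pd.DataFrame({'num': ['num1', 'num2', 'num3'], 'dic': a})

# dict.get returns None when the key is absent...
feature3 = [d.get('Feature3') for d in df['dic']]

# ...or a caller-supplied default ('missing' is an arbitrary placeholder).
feature3_or_missing = [d.get('Feature3', 'missing') for d in df['dic']]

# A plain list of the right length can be assigned as a new column.
df['Feature3'] = feature3
```

This only works as written when every entry in the column really is a dictionary.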

Answer 2

moz*_*way 8

There is now a vectorized* method: you can use the str accessor:

df['dic'].str['Feature3']

Output:

0     cc2
1    None
2    None
Name: dic, dtype: object

*Vectorized here means that the method acts directly on the Series and handles missing values; internally it is still a loop.
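
To make the remark about missing values concrete, here is a minimal sketch. The third row, which holds no dictionary at all, is an assumption added for illustration and is not part of the question's data.

```python
import pandas as pd

# Two dictionaries as in the question, plus one row with no dictionary
# (the None entry is an assumption made for this example).
s = pd.Series([{'Feature1': 'aa1', 'Feature3': 'cc2'},
               {'Feature1': 'aa2'},
               None], name='dic')

by_index = s.str['Feature3']      # indexing form used in the answer
by_get = s.str.get('Feature3')    # equivalent method form

# Rows lacking the key, or lacking a dict entirely, come back as missing.
missing_mask = by_index.isna()
```

Calling x.get(...) directly on the None row would raise an AttributeError, which is the practical difference from a plain loop.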

Answer 3

小智 6

df['Feature3'] = df['dic'].apply(lambda x: x.get('Feature3'))

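The lambda above assumes every cell in dic is a dictionary. As a hedged sketch, if some cells might be empty (an assumption, the question's data has no such rows), a small guard avoids an AttributeError. get_feature is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def get_feature(cell, key='Feature3'):
    """Return cell[key] if cell is a dict containing key, else None."""
    return cell.get(key) if isinstance(cell, dict) else None

# The last row holds no dictionary (assumed here for illustration).
df = pd.DataFrame({'num': ['num1', 'num2', 'num3'],
                   'dic': [{'Feature1': 'aa1', 'Feature3': 'cc2'},
                           {'Feature1': 'aa2'},
                           None]})

df['Feature3'] = df['dic'].apply(get_feature)
```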

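Agree with maxymoo. Consider changing the format of your DataFrame.

(Side note: pandas is usually imported as pd.)

Didn't work for me; use ['key_name'] to get the value (3 upvotes)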

Answer 4

jez*_*ael 5

I think you can first create a new DataFrame with a list comprehension and then create the new column, for example:

df1 = pd.DataFrame([x for x in df['dic']])
print(df1)
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

df['Feature3'] = df1['Feature3']
print(df)
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN

Or in one line:

df['Feature3'] = pd.DataFrame([x for x in df['dic']])['Feature3']
print(df)
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN

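One caveat worth sketching: assigning a column from a freshly built DataFrame aligns on the index, so the one-liner relies on df having the default 0..n-1 index. Assuming a custom index (the labels 10, 20, 30 below are invented for illustration), passing index=df.index to the constructor keeps the rows aligned.

```python
import pandas as pd

a = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
     {'Feature1': 'aa2', 'Feature2': 'bb2'},
     {'Feature1': 'aa1', 'Feature2': 'cc1'}]

# Hypothetical non-default index, chosen only to show the alignment issue.
df = pd.DataFrame({'num': ['num1', 'num2', 'num3'], 'dic': a},
                  index=[10, 20, 30])

# Without index=..., the new frame is labelled 0, 1, 2, so nothing lines up.
df['misaligned'] = pd.DataFrame(df['dic'].tolist())['Feature3']

# Reusing df.index keeps each expanded row next to its source row.
expanded = pd.DataFrame(df['dic'].tolist(), index=df.index)
df['Feature3'] = expanded['Feature3']
```

With the default index, as in the question, both assignments give the same result.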

Timings:

len(df) = 3:

In [24]: %timeit pd.DataFrame([x for x in df['dic']])
The slowest run took 4.63 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 596 µs per loop

In [25]: %timeit df.dic.apply(pn.Series)
1000 loops, best of 3: 1.43 ms per loop

len(df) = 3000:

In [27]: %timeit pd.DataFrame([x for x in df['dic']])
100 loops, best of 3: 3.16 ms per loop

In [28]: %timeit df.dic.apply(pn.Series)
1 loops, best of 3: 748 ms per loop

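For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The 3000-row frame is built by repeating the question's three dictionaries, and the function names via_constructor and via_apply are made up for this example. Absolute numbers depend on the machine and pandas version, so none are quoted here.

```python
import timeit

import pandas as pd

records = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
           {'Feature1': 'aa2', 'Feature2': 'bb2'},
           {'Feature1': 'aa1', 'Feature2': 'cc1'}] * 1000   # 3000 rows
df = pd.DataFrame({'dic': records})

def via_constructor():
    # Build the expanded frame in one constructor call.
    return pd.DataFrame(df['dic'].tolist())

def via_apply():
    # Build one Series per row, then combine them.
    return df['dic'].apply(pd.Series)

# Sanity check: both approaches produce the same set of columns.
same_columns = sorted(via_constructor().columns) == sorted(via_apply().columns)

t_constructor = timeit.timeit(via_constructor, number=1)
t_apply = timeit.timeit(via_apply, number=1)
print(f"constructor: {t_constructor:.4f}s  apply(pd.Series): {t_apply:.4f}s")
```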

Answer 5

Ami*_*ory 5

If you apply Series to the column, you get quite a nice DataFrame:

>>> df.dic.apply(pn.Series)
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

From this point on, you can just use regular pandas operations.
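
As one possible example of such regular operations, the sketch below joins the expanded columns back onto the original frame and filters on Feature3. The names expanded, wide and has_feature3 are made up for this illustration, and the usual pd alias is used instead of the question's pn.

```python
import pandas as pd

a = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
     {'Feature1': 'aa2', 'Feature2': 'bb2'},
     {'Feature1': 'aa1', 'Feature2': 'cc1'}]
df = pd.DataFrame({'num': ['num1', 'num2', 'num3'], 'dic': a})

# One column per dictionary key; absent keys become NaN.
expanded = df['dic'].apply(pd.Series)

# Replace the dictionary column with the expanded columns.
wide = df.drop(columns='dic').join(expanded)

# Ordinary boolean indexing now works on the extracted values.
has_feature3 = wide[wide['Feature3'].notna()]
```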

Archived: 9 years, 8 months ago
Views: 22030
Last activity: 7 years ago