Convert a ragged dictionary of lists to a pandas DataFrame

Question

(or a list of lists ... I just edited this)

Is there an existing Python/pandas method for converting a structure like this

food2 = {}
food2["apple"]   = ["fruit", "round"]
food2["bananna"] = ["fruit", "yellow", "long"]
food2["carrot"]  = ["veg", "orange", "long"]
food2["raddish"] = ["veg", "red"]

into a pivot table like this?

+---------+-------+-----+-------+------+--------+--------+-----+
|         | fruit | veg | round | long | yellow | orange | red |
+---------+-------+-----+-------+------+--------+--------+-----+
| apple   | 1     |     | 1     |      |        |        |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| bananna | 1     |     |       | 1    | 1      |        |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| carrot  |       | 1   |       | 1    |        | 1      |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| raddish |       | 1   |       |      |        |        | 1   |
+---------+-------+-----+-------+------+--------+--------+-----+

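Naively, I would probably just loop over the dictionary. I can see how to use map on each inner list, but I don't know how to join/stack the results across the dictionary. Once I had joined them, I could just use pandas.pivot_table.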

import pandas as pd

for key in food2:
    attrlist = food2[key]
    # Pair the key with each of its attributes: [key, attr]
    onefruit_pairs = [[key, x] for x in attrlist]
    one_fruit_frame = pd.DataFrame(onefruit_pairs, columns=['fruit', 'attr'])
    print(one_fruit_frame)

     fruit    attr
0  bananna   fruit
1  bananna  yellow
2  bananna    long
    fruit    attr
0  carrot     veg
1  carrot  orange
2  carrot    long
   fruit   attr
0  apple  fruit
1  apple  round
     fruit attr
0  raddish  veg
1  raddish  red

Answer 1

Pau*_*ine 2

Pure Python:

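from itertools import chain

def count(d):
    # Collect every distinct attribute across all lists (set order is arbitrary).
    cols = set(chain(*d.values()))
    # The first row yielded is the header.
    yield ['name'] + list(cols)
    # One row per key: True where the key's list contains that attribute.
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]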

Test:

>>> food2 = {           
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"]
}

>>> list(count(food2))
[['name', 'long', 'veg', 'fruit', 'yellow', 'orange', 'round', 'red'],
 ['bananna', True, False, True, True, False, False, False],
 ['carrot', True, True, False, False, True, False, False],
 ['apple', False, False, True, False, False, True, False],
 ['raddish', False, True, False, False, False, False, True]]
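
The rows yielded by count() are plain lists, so they still need to be loaded into pandas to get the table from the question. A minimal sketch, assuming the first yielded row is used as the header and that 1/0 integers are acceptable in place of the 1/blank cells shown in the question (the names rows, header and df are only illustrative):

```python
from itertools import chain

import pandas as pd


def count(d):
    # Same generator as above: a header row, then one boolean row per key.
    cols = set(chain(*d.values()))
    yield ["name"] + list(cols)
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]


food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

rows = count(food2)
header = next(rows)  # the first yielded row holds the column names
df = (
    pd.DataFrame(list(rows), columns=header)
    .set_index("name")
    .astype(int)  # True/False -> 1/0
)
print(df)
```

The column order depends on set iteration order, so it may differ between runs; select columns by name rather than by position.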

[Update]

Performance test:

>>> from itertools import product
>>> labels = list("".join(_) for _ in product(*(["ABCDEF"] * 7)))
>>> attrs = labels[:1000]
>>> import random
>>> sample = {}
>>> for k in labels:
...     sample[k] = random.sample(attrs, 5)
>>> import time
>>> n = time.time(); list(count(sample)); print(time.time() - n)
62.0367980003

On my busy machine (with lots of Chrome tabs open), it took less than 2 minutes to build 279936 rows x 1000 columns. Let me know if that performance is not acceptable.

[Update]

Testing the performance of the other answer:

>>> import pandas as pd
>>> n = time.time(); \
...     df = pd.DataFrame(dict([(k, pd.Series(v)) for k,v in sample.items()])); \
...     print(time.time() - n)
72.0512290001

The next line (df = pd.melt(...)) took too long, so I cancelled the test. Take this result with a grain of salt, because it was run on a busy machine.
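
For comparison, the join/stack-then-pivot idea described in the question can be sketched directly in pandas. This is not the code from the other answer timed above, and it has not been benchmarked here; it is only one possible approach, assuming a long table of (name, attr) pairs followed by pd.crosstab (the names pairs, long_df and table are illustrative):

```python
import pandas as pd

food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

# Stack every (name, attribute) pair into one long two-column frame.
pairs = [(name, attr) for name, attrs in food2.items() for attr in attrs]
long_df = pd.DataFrame(pairs, columns=["name", "attr"])

# Cross-tabulate: one row per name, one column per attribute, counts as cells.
table = pd.crosstab(long_df["name"], long_df["attr"])
print(table)
```

pd.crosstab fills missing combinations with 0 and sorts both axes, so the rows and columns come out in alphabetical order rather than in the order shown in the question.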
