Mar*_*ler 5 python pivot-table pandas
(or a list of lists... I just edited this)
Is there an existing Python/pandas method for converting a structure like this
food2 = {}
food2["apple"] = ["fruit", "round"]
food2["bananna"] = ["fruit", "yellow", "long"]
food2["carrot"] = ["veg", "orange", "long"]
food2["raddish"] = ["veg", "red"]
into a pivot table like this?
+---------+-------+-----+-------+------+--------+--------+-----+
| | fruit | veg | round | long | yellow | orange | red |
+---------+-------+-----+-------+------+--------+--------+-----+
| apple | 1 | | 1 | | | | |
+---------+-------+-----+-------+------+--------+--------+-----+
| bananna | 1 | | | 1 | 1 | | |
+---------+-------+-----+-------+------+--------+--------+-----+
| carrot | | 1 | | 1 | | 1 | |
+---------+-------+-----+-------+------+--------+--------+-----+
| raddish | | 1 | | | | | 1 |
+---------+-------+-----+-------+------+--------+--------+-----+
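Naively, I would probably just loop through the dictionary. I can see how to use map on each inner list, but I don't know how to join/stack the results across the dictionary. Once they are joined, I could use pandas.pivot_table.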
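import pandas as pd

for key in food2:
    attrlist = food2[key]
    # Pair the key with each of its attributes (list() so this also works on Python 3)
    onefruit_pairs = list(map(lambda x: [key, x], attrlist))
    one_fruit_frame = pd.DataFrame(onefruit_pairs, columns=['fruit', 'attr'])
    print(one_fruit_frame)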
     fruit    attr
0  bananna   fruit
1  bananna  yellow
2  bananna    long
    fruit    attr
0  carrot     veg
1  carrot  orange
2  carrot    long
   fruit   attr
0  apple  fruit
1  apple  round
     fruit  attr
0  raddish   veg
1  raddish   red
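The per-key frames printed above can be stacked by building all the (key, attribute) pairs in one pass and pivoting the resulting long frame. This is a minimal sketch, not a definitive answer: the column names `name` and `attr` and the helper column `present` are illustrative assumptions that do not appear in the original post.

```python
import pandas as pd

food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

# Flatten the dict of lists into one long list of (name, attribute) pairs.
pairs = [(name, attr) for name, attrs in food2.items() for attr in attrs]
long_frame = pd.DataFrame(pairs, columns=["name", "attr"])

# 'present' is a hypothetical helper column: one marker per (name, attr) pair.
table = long_frame.assign(present=1).pivot_table(
    index="name", columns="attr", values="present", aggfunc="sum", fill_value=0
)
print(table)
```

Each cell holds 1 where the item has the attribute and 0 otherwise; pandas sorts the columns alphabetically rather than in the order shown in the desired table.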
Pure Python:
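from itertools import chain

def count(d):
    # Every distinct attribute becomes a column
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row, values in d.items():
        # One boolean per column: does this row have the attribute?
        yield [row] + [(col in values) for col in cols]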
Test:
>>> food2 = {
...     "apple": ["fruit", "round"],
...     "bananna": ["fruit", "yellow", "long"],
...     "carrot": ["veg", "orange", "long"],
...     "raddish": ["veg", "red"]
... }
>>> list(count(food2))
[['name', 'long', 'veg', 'fruit', 'yellow', 'orange', 'round', 'red'],
['bananna', True, False, True, True, False, False, False],
['carrot', True, True, False, False, True, False, False],
['apple', False, False, True, False, False, True, False],
['raddish', False, True, False, False, False, False, True]]
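The column order in the output above comes from a set, so it can differ between runs and Python versions. If a stable layout matters, a variant along the following lines might help. This is a sketch under assumptions: `count_sorted` is a hypothetical name, and sorting the columns and emitting 0/1 instead of booleans are choices made here, not part of the original answer.

```python
from itertools import chain


def count_sorted(d):
    """Hypothetical variant of count() with sorted, reproducible columns."""
    cols = sorted(set(chain.from_iterable(d.values())))
    yield ["name"] + cols
    for row, values in d.items():
        present = set(values)  # set membership avoids rescanning the list per column
        yield [row] + [int(col in present) for col in cols]


food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

rows = list(count_sorted(food2))
for r in rows:
    print(r)
```

On Python 3.7+ the data rows follow the dictionary's insertion order, so the header row and every row after it are reproducible.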
[Update]
Performance test:
>>> from itertools import product
>>> labels = list("".join(_) for _ in product(*(["ABCDEF"] * 7)))
>>> attrs = labels[:1000]
>>> import random
>>> sample = {}
>>> for k in labels:
...     sample[k] = random.sample(attrs, 5)
...
>>> import time
>>> n = time.time(); list(count(sample)); print(time.time() - n)
62.0367980003
On my busy machine (lots of Chrome tabs open), it took about 62 seconds, well under 2 minutes, to process 279936 rows x 1000 columns. Let me know if that performance is not acceptable.
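For anyone who wants to repeat the measurement without waiting a minute, a scaled-down version could look like the sketch below. The sizes (6**4 = 1296 rows, 100 candidate attributes), the fixed random seed, and the use of `time.perf_counter` are assumptions chosen here for a quick, repeatable run; they are not the settings used for the number reported above.

```python
import random
import time
from itertools import chain, product


def count(d):
    # Same generator as in the answer above.
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]


# Smaller input than the original test: 4-letter labels instead of 7-letter ones.
labels = ["".join(p) for p in product("ABCDEF", repeat=4)]
attrs = labels[:100]

random.seed(0)  # fixed seed so the sample is the same on every run
sample = {k: random.sample(attrs, 5) for k in labels}

start = time.perf_counter()
rows = list(count(sample))
elapsed = time.perf_counter() - start

print("%d rows x %d columns in %.3f s" % (len(rows) - 1, len(rows[0]) - 1, elapsed))
```

The elapsed time depends on the machine and its load, so it is only meaningful when compared against other runs on the same setup.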
[Update]
Testing the performance of the other answer:
>>> import pandas as pd
>>> n = time.time(); \
... df = pd.DataFrame(dict([(k, pd.Series(v)) for k,v in sample.items()])); \
... print(time.time() - n)
72.0512290001
The next line (df = pd.melt(...)) took too long, so I cancelled the test. Take this result with a grain of salt, since it was run on a busy machine.