Tho*_*hew 7 python numpy pandas
甲熊猫数据帧列系列,same_group需要根据两个现有的列的值,以从布尔值被创建,row和col.如果一行中的两个单元格在字典中具有相似的值(相交值),则行需要显示True memberships,否则为False(没有相交的值).如何以矢量化方式(不使用apply)执行此操作?
import pandas as pd
import numpy as np
n = np.nan
memberships = {
'a':['vowel'],
'b':['consonant'],
'c':['consonant'],
'd':['consonant'],
'e':['vowel'],
'y':['consonant', 'vowel']
}
congruent = pd.DataFrame.from_dict(
{'row': ['a','b','c','d','e','y'],
'a': [ n, -.8,-.6,-.3, .8, .01],
'b': [-.8, n, .5, .7,-.9, .01],
'c': [-.6, .5, n, .3, .1, .01],
'd': [-.3, .7, .3, n, .2, .01],
'e': [ .8,-.9, .1, .2, n, .01],
'y': [ .01, .01, .01, .01, .01, n],
}).set_index('row')
congruent.columns.names = ['col']
Run Code Online (Sandbox Code Playgroud)
cs = congruent.stack().to_frame()
cs.columns = ['score']
cs.reset_index(inplace=True)
cs.head(6)
Run Code Online (Sandbox Code Playgroud)
如何基于对字典的查找来完成创建此新列?
请注意,我试图找到交集,而不是等价.例如,第4行应该具有same_group1,因为a并且y都是元音(尽管这y有时是"元音",因此属于辅音和元音组).
# create a series to make it convenient to map
# make each member a set so I can intersect later
lkp = pd.Series(memberships).apply(set)
# get number of rows and columns
# map the sets to column and row indices
n, m = congruent.shape
c = congruent.columns.to_series().map(lkp).values
r = congruent.index.to_series().map(lkp).values
Run Code Online (Sandbox Code Playgroud)
print(c)
[{'vowel'} {'consonant'} {'consonant'} {'consonant'} {'vowel'}
{'consonant', 'vowel'}]
Run Code Online (Sandbox Code Playgroud)
print(r)
[{'vowel'} {'consonant'} {'consonant'} {'consonant'} {'vowel'}
{'consonant', 'vowel'}]
Run Code Online (Sandbox Code Playgroud)
# use np.repeat, np.tile, zip to create cartesian product
# this should match index after stacking
# apply set intersection for each pair
# empty sets are False, otherwise True
same = [
bool(set.intersection(*tup))
for tup in zip(np.repeat(r, m), np.tile(c, n))
]
# use dropna=False to ensure we maintain the
# cartesian product I was expecting
# then slice with boolean list I created
# and dropna
congruent.stack(dropna=False)[same].dropna()
row col
a e 0.80
y 0.01
b c 0.50
d 0.70
y 0.01
c b 0.50
d 0.30
y 0.01
d b 0.70
c 0.30
y 0.01
e a 0.80
y 0.01
y a 0.01
b 0.01
c 0.01
d 0.01
e 0.01
dtype: float64
Run Code Online (Sandbox Code Playgroud)
产生想要的结果
congruent.stack(dropna=False).reset_index(name='Score') \
.assign(same_group=np.array(same).astype(int)).dropna()
Run Code Online (Sandbox Code Playgroud)