FaC*_*fee 5 python string text split pandas
Say I have the following DataFrame df:
               A                B              C
0   mom;dad;son;      sister;son;  yes;no;maybe;
1           dad;  daughter;niece;       no;snow;
2       son;dad;     cat;son;dad;  tree;dad;son;
3  daughter;mom;           niece;       referee;
4  dad;daughter;             cat;           dad;
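I want to check whether columns A, B and C share a common word, and create a column D that is 1 if they do and 0 if they don't. For a word to count as common, it is enough for it to appear in two of the three columns.
The result should be: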
               A                B              C  D
0   mom;dad;son;      sister;son;  yes;no;maybe;  1
1           dad;  daughter;niece;       no;snow;  0
2       son;dad;     cat;son;dad;  tree;dad;son;  1
3  daughter;mom;           niece;       referee;  0
4  dad;daughter;             cat;           dad;  1
I tried to do this with:
for index, row in df.iterrows():
    w1=row['A'].split(';')
    w2=row['B'].split(';')
    w3=row['C'].split(';')
    if len(set(w1).intersection(w2))>0 or len(set(w1).intersection(w3))>0 or len(set(w2).intersection(w3))>0:
        df['D'][index]==1
    else:
        df['D'][index]==0
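However, the resulting column D contains only 0s, because (probably) I am not comparing each word in w1 with the other words in w2 and w3. How can I achieve this?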
Use stack + pandas.Series.str.get_dummies
df.assign(
    D=df.stack().str.get_dummies(';').sum(level=0).gt(1).any(1).astype(int)
)
               A                B              C  D
0   mom;dad;son;      sister;son;  yes;no;maybe;  1
1           dad;  daughter;niece;       no;snow;  0
2       son;dad;     cat;son;dad;  tree;dad;son;  1
3  daughter;mom;           niece;       referee;  0
4  dad;daughter;             cat;           dad;  1
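A note for newer pandas versions: as far as I know, the `level` argument of `sum` and the positional axis in `any(1)` are no longer accepted in pandas 2.x. The sketch below is one equivalent spelling that uses `groupby(level=0).sum()` and `any(axis=1)` instead. The DataFrame is rebuilt by hand from the question's table, and the names `counts` and `result` are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed to contain no extra whitespace).
df = pd.DataFrame({
    "A": ["mom;dad;son;", "dad;", "son;dad;", "daughter;mom;", "dad;daughter;"],
    "B": ["sister;son;", "daughter;niece;", "cat;son;dad;", "niece;", "cat;"],
    "C": ["yes;no;maybe;", "no;snow;", "tree;dad;son;", "referee;", "dad;"],
})

# Per original row: in how many of the columns A, B, C does each word appear?
counts = df.stack().str.get_dummies(";").groupby(level=0).sum()

# D is 1 when at least one word appears in more than one column.
result = df.assign(D=counts.gt(1).any(axis=1).astype(int))
print(result)
```

The logic is unchanged from the one-liner above; only the spelling of the group-wise sum and the axis argument differs.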
Note that when we stack and get the dummies, the intermediate result looks like this:
     cat  dad  daughter  maybe  mom  niece  no  referee  sister  snow  son  tree  yes
0 A    0    1         0      0    1      0   0        0       0     0    1     0    0
  B    0    0         0      0    0      0   0        0       1     0    1     0    0
  C    0    0         0      1    0      0   1        0       0     0    0     0    1
1 A    0    1         0      0    0      0   0        0       0     0    0     0    0
  B    0    0         1      0    0      1   0        0       0     0    0     0    0
  C    0    0         0      0    0      0   1        0       0     1    0     0    0
2 A    0    1         0      0    0      0   0        0       0     0    1     0    0
  B    1    1         0      0    0      0   0        0       0     0    1     0    0
  C    0    1         0      0    0      0   0        0       0     0    1     1    0
3 A    0    0         1      0    1      0   0        0       0     0    0     0    0
  B    0    0         0      0    0      1   0        0       0     0    0     0    0
  C    0    0         0      0    0      0   0        1       0     0    0     0    0
4 A    0    1         1      0    0      0   0        0       0     0    0     0    0
  B    1    0         0      0    0      0   0        0       0     0    0     0    0
  C    0    1         0      0    0      0   0        0       0     0    0     0    0
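The former columns are now embedded in the second level of the index. So I sum across the first level to see how many columns each word showed up in.
That intermediate sum looks like this: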
   cat  dad  daughter  maybe  mom  niece  no  referee  sister  snow  son  tree  yes
0    0    1         0      1    1      0   1        0       1     0    2     0    1
1    0    1         1      0    0      1   1        0       0     1    0     0    0
2    1    3         0      0    0      0   0        0       0     0    3     1    0
3    0    0         1      0    1      1   0        1       0     0    0     0    0
4    1    2         1      0    0      0   0        0       0     0    0     0    0
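To inspect these intermediates yourself, the chain can be split into named steps. This is only a sketch: the variable names `stacked`, `dummies` and `counts` are mine, the data is rebuilt by hand from the question, and `groupby(level=0).sum()` stands in for `sum(level=0)` so that it should also run on newer pandas.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "A": ["mom;dad;son;", "dad;", "son;dad;", "daughter;mom;", "dad;daughter;"],
    "B": ["sister;son;", "daughter;niece;", "cat;son;dad;", "niece;", "cat;"],
    "C": ["yes;no;maybe;", "no;snow;", "tree;dad;son;", "referee;", "dad;"],
})

# Step 1: one string per (row, column) pair; the old column labels
# become the second level of the index.
stacked = df.stack()

# Step 2: one 0/1 indicator column per distinct word; the empty piece
# after each trailing ';' does not get a column of its own.
dummies = stacked.str.get_dummies(";")

# Step 3: sum over the first index level, i.e. per original row.
counts = dummies.groupby(level=0).sum()

print(dummies.shape)   # 15 (row, column) pairs by 13 distinct words
print(counts)
```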
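Notice that we catch 'son' in the first row (index 0), 'dad' and 'son' in the third row (index 2), and so on.
If a word shows up in more than 1 column (hence gt(1)), then I want to count the row as 1 (hence any(1).astype(int)).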
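For comparison, the loop from the question can also be made to work. Two things stand out in it: `df['D'][index]==1` is a comparison, not an assignment, so nothing is ever written; and `'dad;'.split(';')` returns `['dad', '']`, so every cell contributes an empty string that would make every intersection non-empty. A minimal sketch that addresses both, assuming the cells contain no stray whitespace (the helper `words` and the list `flags` are hypothetical names):

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "A": ["mom;dad;son;", "dad;", "son;dad;", "daughter;mom;", "dad;daughter;"],
    "B": ["sister;son;", "daughter;niece;", "cat;son;dad;", "niece;", "cat;"],
    "C": ["yes;no;maybe;", "no;snow;", "tree;dad;son;", "referee;", "dad;"],
})

def words(cell):
    """Split a cell on ';' and drop the empty piece left by the trailing ';'."""
    return {w for w in cell.split(";") if w}

flags = []
for _, row in df.iterrows():
    a, b, c = words(row["A"]), words(row["B"]), words(row["C"])
    # A word is "common" if it appears in at least two of the three columns.
    shared = (a & b) | (a & c) | (b & c)
    flags.append(int(bool(shared)))

# Assign the whole column at once instead of writing cell by cell.
df["D"] = flags
print(df)
```

This is likely slower than the stack/get_dummies approach on large frames, but it keeps the original pairwise-intersection idea.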