pandas - group by partial string

Question

I want to group a DataFrame by a partial substring. Here is a sample .csv file:

GridCode,Key
1000,Colour
1000,Colours
1001,Behaviours
1001,Behaviour
1002,Favourite
1003,COLORS
1004,Honours


So far, I have imported the file with `df = pd.read_csv('sample.csv')` and then converted all the strings to lowercase with `df['Key'] = df['Key'].str.lower()`. The first thing I tried was grouping by GridCode and Key:

g = df.groupby([df['GridCode'],df['Key']]).size()


Then I unstacked the result and filled the missing values:

d = g.unstack().fillna(0)


The resulting DataFrame is:

Key       behaviour  behaviours  colors  colour  colours  favourite  honours
GridCode                                                                    
1000              0           0       0       1        1          0        0
1001              1           1       0       0        0          0        0
1002              0           0       0       0        0          1        0
1003              0           0       1       0        0          0        0
1004              0           0       0       0        0          0        1


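What I would like to do now is group only the strings that contain the substring 'our', which in this case excludes only the colors key, and create a new column named after that substring. The expected result would be: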

Key       'our'
GridCode
1000          2
1001          2
1002          1
1003          0
1004          1


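I also tried masking the DataFrame with `mask = df['Key'].str.contains('our')` and then `df1 = df[mask]`, but I don't know how to create a new column with the new groupby counts. Any help would be greatly appreciated.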

Answer 1

beh*_*uri 6

>>> import re  # for the re.IGNORECASE flag
>>> # pass the flag by keyword; in current pandas the second positional parameter is `case`
>>> df['Key'].str.contains('our', flags=re.IGNORECASE).groupby(df['GridCode']).sum()
GridCode
1000        2
1001        2
1002        1
1003        0
1004        1
Name: Key, dtype: float64

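For reference, the one-liner above can be expanded into a self-contained sketch that also produces the single 'our' column asked for in the question. The inline CSV text, the variable names, and the use of `to_frame` to name the column are assumptions made for illustration, not part of the original answer.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch runs without a file.
csv_text = """GridCode,Key
1000,Colour
1000,Colours
1001,Behaviours
1001,Behaviour
1002,Favourite
1003,COLORS
1004,Honours
"""
df = pd.read_csv(io.StringIO(csv_text))

# Boolean Series: True where Key contains "our", ignoring case.
has_our = df['Key'].str.contains('our', case=False)

# Summing booleans per GridCode counts the matching rows in each group.
# Groups with no match (1003 here) stay in the result with a count of 0.
counts = has_our.groupby(df['GridCode']).sum().astype(int)

# Turn the Series into a one-column DataFrame named after the substring.
result = counts.to_frame('our')
print(result)
```

Grouping the boolean Series, rather than filtering the rows first with `df[mask]`, is what keeps GridCode 1003 in the output with a zero count.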

Also, instead of

df.groupby([df['GridCode'],df['Key']])


it is better to do:

df.groupby(['GridCode', 'Key'])

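A quick check that the two spellings group identically. The frame below is the question's data already lower-cased and typed in by hand, and `unstack(fill_value=0)` is swapped in here as a shorthand for the question's `unstack().fillna(0)`; both choices are assumptions for this sketch.

```python
import pandas as pd

# The question's sample data after lower-casing the Key column.
df = pd.DataFrame({
    'GridCode': [1000, 1000, 1001, 1001, 1002, 1003, 1004],
    'Key': ['colour', 'colours', 'behaviours', 'behaviour',
            'favourite', 'colors', 'honours'],
})

# Grouping by Series objects versus grouping by column names.
by_series = df.groupby([df['GridCode'], df['Key']]).size()
by_name = df.groupby(['GridCode', 'Key']).size()

# True when both results have the same index and the same counts.
same = by_series.equals(by_name)

# Wide table of counts; missing combinations are filled with 0.
table = by_name.unstack(fill_value=0)
print(same)
print(table)
```

Passing column names is shorter and avoids repeating `df`, which is presumably why the answer recommends it.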

`.str.contains` has a `case` parameter, which seems to do exactly what `re.IGNORECASE` is meant to do. So this should work too: `df['Key'].str.contains('our', case=False)` (4 upvotes)
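The comment's point can be checked on a tiny Series. This is only a sketch: the value 'HONOURS' is made up (the question's data has no upper-case key containing "our"), and the variable names are hypothetical.

```python
import re

import pandas as pd

# 'HONOURS' is a made-up upper-case value, added so that case matters.
keys = pd.Series(['Colour', 'HONOURS', 'COLORS'])

# Default behaviour is case-sensitive, so 'HONOURS' is not matched.
sensitive = keys.str.contains('our')

# Two keyword spellings that both ignore case.
with_case = keys.str.contains('our', case=False)
with_flag = keys.str.contains('our', flags=re.IGNORECASE)

print(sensitive.tolist())
print(with_case.tolist())
print(with_flag.tolist())
```

Using keywords for both variants avoids depending on the positional order of `case` and `flags`.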
