I have the following example DataFrame d consisting of two columns, 'col1' and 'col2'. I would like to get the list of unique names across the whole DataFrame d.
import pandas as pd
d = {'col1':['Pat, Joseph',
'Tony, Hoffman',
'Miriam, Goodwin',
'Roxanne, Padilla',
'Julie, Davis',
'Muriel, Howell',
'Salvador, Reese',
'Kristopher, Mckenzie',
'Lucille, Thornton',
'Brenda, Wilkerson'],
'col2':['Kristopher, Mckenzie',
'Lucille, Thornton',
'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis',
'Muriel, Howell', 'Harriet, Phillips',
'Belinda, Drake;David, Ford', 'Jared, Cummings;Joanna, Burns;Bob, Cunningham',
'Keith, Hernandez;Pat, Joseph', 'Kristopher, Mckenzie', 'Lucille, Thornton']}
df = pd.DataFrame(data=d)
For column col1, I can do this with the unique() function.
df.col1.unique()
array(['Pat, Joseph', 'Tony, Hoffman', 'Miriam, Goodwin',
'Roxanne, Padilla', 'Julie, Davis', 'Muriel, Howell',
'Salvador, Reese', 'Kristopher, Mckenzie', 'Lucille, Thornton',
'Brenda, Wilkerson'], dtype=object)
len(df.col1)
10 # total number of rows
len(df.col1.unique())
10 # number of unique values in col1 (all ten names in the sample are distinct)
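For col2, some rows contain several names separated by semicolons, for example 'Pete, Fitzgerald; Cecelia, Bass; Julie, Davis'.
How can I get the unique names from col2 using vectorized operations? I am trying to avoid a for loop because the actual dataset is large.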
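First split by ;\s* (a regex: ; followed by zero or more whitespace characters) with expand=True to get a DataFrame, then reshape it into a Series with stack, and finally use unique: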
print(df['col2'].str.split(r';\s*', expand=True).stack().unique())
['Kristopher, Mckenzie' 'Lucille, Thornton' 'Pete, Fitzgerald'
'Cecelia, Bass' 'Julie, Davis' 'Muriel, Howell' 'Harriet, Phillips'
'Belinda, Drake' 'David, Ford' 'Jared, Cummings' 'Joanna, Burns'
'Bob, Cunningham' 'Keith, Hernandez' 'Pat, Joseph']
Details:
print(df['col2'].str.split(r';\s*', expand=True))
                      0              1                2
0  Kristopher, Mckenzie           None             None
1     Lucille, Thornton           None             None
2      Pete, Fitzgerald  Cecelia, Bass     Julie, Davis
3        Muriel, Howell           None             None
4     Harriet, Phillips           None             None
5        Belinda, Drake    David, Ford             None
6       Jared, Cummings  Joanna, Burns  Bob, Cunningham
7      Keith, Hernandez    Pat, Joseph             None
8  Kristopher, Mckenzie           None             None
9     Lucille, Thornton           None             None
print(df['col2'].str.split(r';\s*', expand=True).stack())
0  0    Kristopher, Mckenzie
1  0       Lucille, Thornton
2  0        Pete, Fitzgerald
   1           Cecelia, Bass
   2            Julie, Davis
3  0          Muriel, Howell
4  0       Harriet, Phillips
5  0          Belinda, Drake
   1             David, Ford
6  0         Jared, Cummings
   1           Joanna, Burns
   2         Bob, Cunningham
7  0        Keith, Hernandez
   1             Pat, Joseph
8  0    Kristopher, Mckenzie
9  0       Lucille, Thornton
dtype: object
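The intermediate steps can also be reproduced on a smaller Series. This is only a minimal sketch: the sample values are made up, and the explicit dropna() and str.strip() calls are defensive additions that are not in the answer, on the assumption that your pandas version might keep the None padding after stack() or that names might carry stray spaces.

```python
import pandas as pd

# Made-up sample: the middle row holds several names separated by ';'
s = pd.Series(['Ann, Lee', 'Bob, Ray; Ann, Lee ;Cy, Fox', 'Bob, Ray'])

wide = s.str.split(r';\s*', expand=True)  # one column per name, short rows padded with None
tall = wide.stack().dropna()              # back to one name per row, padding removed
names = tall.str.strip().unique()         # trim stray spaces, then deduplicate

print(list(names))
```

The result keeps the order in which each name first appears, reading the rows left to right.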
Alternative solution:
import numpy as np
print(np.unique(np.concatenate(df['col2'].str.split(r';\s*').values)))
['Belinda, Drake' 'Bob, Cunningham' 'Cecelia, Bass' 'David, Ford'
'Harriet, Phillips' 'Jared, Cummings' 'Joanna, Burns' 'Julie, Davis'
'Keith, Hernandez' 'Kristopher, Mckenzie' 'Lucille, Thornton'
'Muriel, Howell' 'Pat, Joseph' 'Pete, Fitzgerald']
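Note that np.unique returns the names sorted, whereas Series.unique keeps the order of first appearance, which is why the two solutions list the same names in a different order. A small sketch with made-up values shows the difference; it assumes every cell is a string, since np.concatenate would fail on a missing value.

```python
import numpy as np
import pandas as pd

# Made-up sample; every cell is assumed to be a string (no NaN)
col = pd.Series(['Cy, Fox;Ann, Lee', 'Bob, Ray', 'Ann, Lee'])

# Split each cell into a list, then flatten the lists into one 1-D array
flat = np.concatenate(col.str.split(r';\s*').to_numpy())

sorted_names = np.unique(flat)            # sorted alphabetically
ordered_names = pd.Series(flat).unique()  # order of first appearance

print(sorted_names.tolist())
print(list(ordered_names))
```

Which one to prefer depends on whether the original order of the names matters for the later processing.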
EDIT:
For the unique names across all columns, first stack all columns into a single Series:
print(df.stack().str.split(r';\s*', expand=True).stack().unique())
['Pat, Joseph' 'Kristopher, Mckenzie' 'Tony, Hoffman' 'Lucille, Thornton'
'Miriam, Goodwin' 'Pete, Fitzgerald' 'Cecelia, Bass' 'Julie, Davis'
'Roxanne, Padilla' 'Muriel, Howell' 'Harriet, Phillips' 'Belinda, Drake'
'David, Ford' 'Salvador, Reese' 'Jared, Cummings' 'Joanna, Burns'
'Bob, Cunningham' 'Keith, Hernandez' 'Brenda, Wilkerson']
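A possible variant, assuming pandas 0.25 or later where Series.explode is available: splitting into lists and exploding avoids the wide intermediate frame padded with None. This is a sketch with made-up values, not the approach from the answer. melt is used here to put all columns into one Series, so the names come out column by column rather than row by row.

```python
import pandas as pd

# Made-up sample with the same layout as the question's frame
df_small = pd.DataFrame({
    'col1': ['Ann, Lee', 'Bob, Ray'],
    'col2': ['Cy, Fox; Ann, Lee', 'Dee, Orr;Bob, Ray'],
})

all_names = (
    df_small.melt()['value']     # put col1 and col2 into a single Series
            .str.split(r';\s*')  # each cell becomes a list of names
            .explode()           # one name per row
            .dropna()            # ignore missing cells, if any
            .unique()
)

print(list(all_names))
```

On a large frame it may be worth timing this against the stack-based version, since the relative speed can depend on how many names each cell holds.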