How to get the first letter of each word in a string in a DataFrame column

Mat*_*s M 2 python dataframe pandas

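I have a DataFrame column containing first and last names. I want to extract the initials from each name as another column in the DataFrame. For the following DataFrame: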

   Name
0 'Brad Pitt'
1 'Bill Gates'
2 'Elon Musk'

I came up with this solution:

df['initials'] = [df['Name'][i].split()[0][0] + df['Name'][i].split()[1][0] for i in range(len(df))]

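However, this does not work for names such as "John David Smith", because I want the first letter of every word in the name. Also, since my DataFrame is very large, I would like to know whether there is a "vectorized" solution (without a for loop).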

Thanks in advance.

jez*_*ael 5

If performance matters, use a list comprehension with split and join:

df['initials'] = [' '.join(y[0] for y in x.split()) for x in df['Name']]
print(df)
         Name initials
0   Brad Pitt      B P
1  Bill Gates      B G
2   Elon Musk      E M
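
To make the list comprehension easy to try on its own, here is a minimal self-contained sketch. The sample frame is built by hand for illustration and includes the three-word name from the question; the variable names `word` and `name` are only illustrative.

```python
import pandas as pd

# Sample data; the three-word name is the case the question's attempt missed.
df = pd.DataFrame(
    {"Name": ["Brad Pitt", "Bill Gates", "Elon Musk", "John David Smith"]}
)

# For each name: split on whitespace, take the first character of every word,
# and join those characters with a single space.
df["initials"] = [
    " ".join(word[0] for word in name.split()) for name in df["Name"]
]

print(df["initials"].tolist())
# ['B P', 'B G', 'E M', 'J D S']
```

Because `split()` with no argument splits on any run of whitespace, the same line handles names with any number of words.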

Or:

df['initials'] = df['Name'].apply(lambda x: ' '.join(y[0] for y in x.split()))

A solution without a for loop, but it is really slow:

df['initials'] = df['Name'].str.split(expand=True).apply(lambda x: x.str[0]).fillna('').agg(' '.join, axis=1).str.rstrip()
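
The chained expression is easier to follow when it is split into its intermediate steps. The sketch below does that on a small hand-made frame; the intermediate names (`words`, `firsts`, `filled`) are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Brad Pitt", "John David Smith"]})

# 1. One column per word; shorter names are padded with missing values.
words = df["Name"].str.split(expand=True)

# 2. First character of every word, column by column.
firsts = words.apply(lambda col: col.str[0])

# 3. Replace missing values with empty strings so every cell can be joined.
filled = firsts.fillna("")

# 4. Join each row with spaces, then strip the trailing space
#    that the empty padding cells leave behind.
df["initials"] = filled.agg(" ".join, axis=1).str.rstrip()

print(df["initials"].tolist())
# ['B P', 'J D S']
```

The row-wise `agg(' '.join, axis=1)` in step 4 calls a Python function once per row, which is one plausible reason this variant is slow on large frames.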

Performance on 400k rows:

print(df)
               Name
0         Brad Pitt
1        Bill Gates
2         Elon Musk
3  John David Smith

df = pd.concat([df] * 100000, ignore_index=True)

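The fastest are the second and first solutions above, followed by the first solution from @mozway's answer. The third solution above is far slower, and the second @mozway solution is the slowest: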

In [178]: %%timeit
     ...: df['initials2'] = df['Name'].apply(lambda x: ' '.join(y[0] for y in x.split()))
     ...:
442 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [177]: %%timeit
     ...: df['initials1'] = [' '.join(y[0] for y in x.split()) for x in df['Name']]
     ...:
485 ms ± 7.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [180]: %%timeit
     ...: df['initials'] = df['Name'].str.replace(r'(?<=\w)\w', '', regex=True)
     ...:
830 ms ± 8.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [179]: %%timeit
     ...: df['initials3'] = df['Name'].str.split(expand=True).apply(lambda x: x.str[0]).fillna('').agg(' '.join, axis=1).str.rstrip()
     ...:
18.8 s ± 772 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [181]: %%timeit
     ...: df['initials'] = (df['Name'].str.extractall(r'(?<!\w)(\w)').groupby(level=0).agg(' '.join))
     ...:
25.3 s ± 692 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
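
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard `timeit` module. It is not the setup used for the numbers above: the frame is only 4,000 rows to keep the run short, the helper names (`list_comprehension`, `apply_lambda`, `regex_replace`, `candidates`) are made up for this example, and absolute timings will differ by machine and pandas version. It also checks that the three fast approaches return the same initials on this data.

```python
import timeit

import pandas as pd

base = pd.DataFrame(
    {"Name": ["Brad Pitt", "Bill Gates", "Elon Musk", "John David Smith"]}
)
# 4,000 rows to keep the run short; the timings above used 400k rows.
df = pd.concat([base] * 1000, ignore_index=True)


def list_comprehension(names):
    return pd.Series(
        [" ".join(w[0] for w in x.split()) for x in names], index=names.index
    )


def apply_lambda(names):
    return names.apply(lambda x: " ".join(w[0] for w in x.split()))


def regex_replace(names):
    # Delete every word character that directly follows another word
    # character, which leaves only the first letter of each word.
    return names.str.replace(r"(?<=\w)\w", "", regex=True)


candidates = {
    "list_comprehension": list_comprehension,
    "apply_lambda": apply_lambda,
    "regex_replace": regex_replace,
}

# Check that all approaches agree before comparing their speed.
results = {name: func(df["Name"]).tolist() for name, func in candidates.items()}
all_agree = all(r == results["list_comprehension"] for r in results.values())
print("all approaches agree:", all_agree)

# Total seconds for 3 calls of each approach.
for name, func in candidates.items():
    seconds = timeit.timeit(lambda: func(df["Name"]), number=3)
    print(f"{name}: {seconds:.4f} s for 3 runs")
```

Note that the regex variant keeps the original whitespace between words, so it can differ from the split-and-join variants when names contain repeated spaces.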