Special text to Latin characters in Python

Question

I have the following pandas DataFrame:

the_df = pd.DataFrame({'id':[1,2],'name':['Joe','']})
the_df
    id  name
0   1   Joe
1   2

As you can see, the second name can be read as "Sarah", but it is written with special characters.

I want to create a new column that converts these characters to Latin characters. I tried this approach:

the_df['latin_name'] = the_df['name'].str.extract(r'(^[a-zA-Z\s]*)')
the_df
    id  name    latin_name
0   1   Joe     Joe
1   2

But it does not recognize the letters. Any help with this would be greatly appreciated.

Answer 1

the_df['name'].str.normalize('NFKC').str.extract(r'(^[a-zA-Z\s]*)')

Output:

       0
0    Joe
1  Sarah
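
For readers who want to see what the normalization step does to each value, here is a minimal sketch using only the standard library. It assumes the stylized name is spelled with Mathematical Bold letters; the characters actually used in the question may differ. `FANCY_SARAH`, `LATIN_PREFIX`, and `to_latin` are hypothetical names introduced for illustration.

```python
import re
import unicodedata

# Assumption: the stylized name uses Mathematical Bold letters (U+1D400 block).
# The characters in the original question may be from a different block.
FANCY_SARAH = "\U0001D412\U0001D41A\U0001D42B\U0001D41A\U0001D421"

# Same pattern as in the question: a leading run of ASCII letters and whitespace.
LATIN_PREFIX = re.compile(r"^[a-zA-Z\s]*")


def to_latin(text: str) -> str:
    """Apply NFKC normalization, then keep the leading ASCII letters/whitespace."""
    # NFKC replaces compatibility characters (such as styled math letters)
    # with their plain equivalents.
    folded = unicodedata.normalize("NFKC", text)
    # The pattern can match the empty string, so match() never returns None.
    return LATIN_PREFIX.match(folded).group(0)


names = ["Joe", FANCY_SARAH]
latin_names = [to_latin(name) for name in names]
print(latin_names)  # ['Joe', 'Sarah']
```

Without the normalization step, the regex matches only an empty prefix for the stylized name, which is consistent with the blank `latin_name` cell in the question.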

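Beat me to it. I assume [this is what is used under the hood](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize), but it is convenient that it is available as one of the `.str` accessor methods. (3 upvotes)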
@juanpa.arrivillaga Yes, I did some research on this about three weeks ago, so it was still fresh in my mind. (3 upvotes)