Remove rows containing non-English words from a Pandas dataframe

Question

Oma*_*mar 6 python dataframe python-3.x pandas

I have a pandas dataframe with 4 rows. The English rows contain news titles, but some rows contain non-English words, like this one:

**SheÃ¢â‚¬â„¢s the Hollywood Power Behind Those ...**

I want to remove all rows like this one, i.e. every row in the Pandas dataframe that contains at least one non-English character.

Answer 1

Cai*_*lva 9

If using Python >= 3.7:

    df[df['col'].map(lambda x: x.isascii())]

where col is your target column.
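One caveat: `isascii()` is a string method, so the lambda above raises `AttributeError` when the column holds a missing value (`None` or `NaN`). A minimal sketch of a more defensive filter, assuming a hypothetical column named `col`, made-up sample values, and that non-string cells should be dropped together with the non-ASCII ones:

```python
import pandas as pd

# Hypothetical sample data; the None cell stands in for a missing headline.
df = pd.DataFrame({"col": ["Hello, world!", "Cainã", None, "test123*"]})


def is_ascii_text(value):
    """Return True only for str values made up entirely of ASCII characters."""
    return isinstance(value, str) and value.isascii()


# Boolean mask: True marks the rows to keep.
mask = df["col"].map(is_ascii_text)
filtered = df[mask]
print(filtered)
```

Whether missing values should be dropped or kept is a choice that depends on the data; returning `True` for non-strings in the helper would keep them instead.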

Data:

|    | colA                                                  |
|---:|:------------------------------------------------------|
|  0 | **SheÃ¢â‚¬â„¢s the Hollywood Power Behind Those ...** |
|  1 | Hello, world!                                         |
|  2 | Cainã                                                 |
|  3 | another value                                         |
|  4 | test123*                                              |
|  5 | âbc                                                   |

Identifying and filtering out strings that contain non-English characters (see ASCII printable characters):

    df[df.colA.map(lambda x: x.isascii())]

Output:

                colA
    1  Hello, world!
    3  another value
    4       test123*
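Note that `str.isascii()` accepts every code point below 128, which includes control characters such as tabs and newlines, not only the printable range. If only printable ASCII should pass, a stricter check is possible. A minimal sketch, assuming similar `colA` sample data (with one made-up value containing a tab) and a hypothetical helper named `is_printable_ascii`:

```python
import pandas as pd

# Sample data modelled on the table above, plus one value containing a tab.
df = pd.DataFrame(
    {"colA": ["Hello, world!", "Cainã", "tab\there", "test123*", "âbc"]}
)


def is_printable_ascii(text):
    """True if every character lies in the printable ASCII range (space to tilde)."""
    return all(" " <= ch <= "~" for ch in text)


strict = df[df["colA"].map(is_printable_ascii)]
print(strict)
```

Here the value containing a tab is rejected, although `isascii()` alone would have kept it.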

The sample data above was built as follows:

    import pandas as pd

    df = pd.DataFrame({
        'colA': ['**SheÃ¢â‚¬â„¢s the Hollywood Power Behind Those ...**',
                 'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
    })

    # to_markdown() requires the optional tabulate package
    print(df.to_markdown())

The initial approach was to use a user-defined function.
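The user-defined function itself is not shown here. A minimal sketch of what such a function could look like, which may be useful on Python versions older than 3.7 where `str.isascii()` is unavailable. The helper name `is_ascii` and its implementation are assumptions for illustration, not the answer's original code:

```python
import pandas as pd


def is_ascii(text):
    """Return True if text encodes cleanly as ASCII (no code point above 127)."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Sample values taken from the table in the answer.
df = pd.DataFrame(
    {"colA": ["Hello, world!", "Cainã", "another value", "test123*", "âbc"]}
)

english_only = df[df["colA"].map(is_ascii)]
print(english_only)
```

Passing the named function to `map` directly avoids the extra lambda and keeps the mask logic testable on its own.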

Archived: 4 years, 11 months ago
Views: 7998
Last activity: 4 years, 11 months ago