How to retrieve only Arabic text from a string using a regular expression?

Question


Ahs*_*que 5 python regex string unicode python-2.7

I have a string that contains both Arabic and English sentences. All I want is to extract the Arabic sentences.

my_string="""
What is the reason
?????? ?????????? ??? ?????? ????? ????? ??????????????
behind this?
?????? ?????????? ??? ?????? ????? ????? ??????????????
"""


The linked page states that the Unicode range for Arabic letters is 0600-06FF.

So the very basic attempt that came to mind was:

import re
print re.findall(r'[\u0600-\u06FF]+',my_string)


However, this fails, because it returns the following list.

['What', 'is', 'the', 'reason', 'behind', 'this?']


As you can see, this is exactly the opposite of what I want. What am I missing here?

NB

I know I can match the Arabic letters by using a negated match like this:

print re.findall(r'[^a-zA-Z\s0-9]+',my_string)


However, I do not want that.

Answer 1

sty*_*ane 5

You can use re.sub to replace the ASCII characters with an empty string.

>>> import re
>>> my_string="""
... What is the reason
... ?????? ?????????? ??? ?????? ????? ????? ??????????????
... behind this?
... ?????? ?????????? ??? ?????? ????? ????? ??????????????
... """
>>> print(re.sub(r'[a-zA-Z?]', '', my_string).strip())
?????? ?????????? ??? ?????? ????? ????? ??????????????

?????? ?????????? ??? ?????? ????? ????? ??????????????


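Your regex does not work because you are using Python 2, where my_string is a str (a byte string) and has to be converted to unicode before the pattern can match. The same pattern works fine on Python 3.x.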
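A likely explanation for the inverted result, sketched here in Python 3 and resting on an assumption about how Python 2's re module parses a byte-string pattern: \u is not a recognised escape there, so it is read as a plain u, and the class ends up containing the range 0-u, which covers the digits, the uppercase letters and most lowercase letters. The pattern below spells out that effective class by hand; the sample text is the English part of the question's string.

```python
import re

# Assumption: this is the class Python 2 effectively builds from
# r'[\u0600-\u06FF]' on a byte string, with each "\u" read as a literal "u".
# It holds a few literal characters plus the ASCII range "0" through "u".
effective_class = re.compile(r'[u0600-u06FF]+')

text = "What is the reason\nbehind this?"

# Spaces and newlines fall outside "0"-"u", so they split the matches.
matches = effective_class.findall(text)
print(matches)
```

Under that reading, the UTF-8 bytes of the Arabic text (all above 0x7F) fall outside the class, so only the English words come back.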

>>> print "".join(re.findall(ur'[\u0600-\u06FF]', unicode(my_string, "utf-8"), re.UNICODE))
??????????????????????????????????????????????????????????????????????????????????????????????????

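For comparison, a minimal Python 3 sketch of the same idea, where every str is already Unicode and no conversion step is needed. The helper names arabic_words and arabic_lines and the sample text are illustrative assumptions (the Arabic in the original post did not survive), and the pattern covers only the 0600-06FF block cited in the question.

```python
import re

# Arabic block U+0600-U+06FF, the range cited in the question.
ARABIC_RUN = re.compile(r'[\u0600-\u06FF]+')

def arabic_words(text):
    """Return every maximal run of characters from the Arabic block."""
    return ARABIC_RUN.findall(text)

def arabic_lines(text):
    """Return the lines that contain at least one Arabic character."""
    return [line.strip() for line in text.splitlines()
            if ARABIC_RUN.search(line)]

# Hypothetical sample, written with escapes to avoid bidirectional-text
# problems in the source: two Arabic words between two English lines.
sample = (
    "What is the reason\n"
    "\u0645\u0631\u062d\u0628\u0627 \u0628\u0643\n"
    "behind this?\n"
)

print(arabic_words(sample))   # the two Arabic words, as separate items
print(arabic_lines(sample))   # the single Arabic line, kept whole
```

arabic_lines keeps whole sentences together, which is closer to what the question asks for than a list of individual words.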

Archived:	10 years ago
Views:	2375
Last activity:	10 years ago