Removing emojis from a string in Python

Question

Mon*_*lal 25 python string unicode special-characters emoji

I found this code for removing emojis in Python, but it doesn't work. Can you help with other code, or with a fix for this one?

I have observed that all my emojis start with \xf, but when I try to search with str.startswith("\xf") I get an invalid character error.

emoji_pattern = r'/[x{1F601}-x{1F64F}]/u'
re.sub(emoji_pattern, '', word)

Here is the error:

Traceback (most recent call last):
  File "test.py", line 52, in <module>
    re.sub(emoji_pattern,'',word)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Each item in the list can be a word, e.g. ['This', 'dog', '\xf0\x9f\x98\x82', 'https://t.co/5N86jYipOI']

Update: I tried this other code:

emoji_pattern=re.compile(ur" " " [\U0001F600-\U0001F64F] # emoticons \
                                 |\
                                 [\U0001F300-\U0001F5FF] # symbols & pictographs\
                                 |\
                                 [\U0001F680-\U0001F6FF] # transport & map symbols\
                                 |\
                                 [\U0001F1E0-\U0001F1FF] # flags (iOS)\
                          " " ", re.VERBOSE)

emoji_pattern.sub('', word)

But this still doesn't remove the emojis; they are still displayed! Any clue why that is?

Answer 1

Abd*_*dam 39

I am updating my answer to match the one by @jfs, because my previous answer failed to account for other Unicode scripts such as Latin, Greek, etc. Stack Overflow doesn't allow me to delete my previous answer, so I am updating it to match the most accepted answer to the question.

#!/usr/bin/env python
import re

text = u'This is a smiley face \U0001f602'
print(text) # with emoji

def deEmojify(text):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',text)

print(deEmojify(text))

This was my previous answer, do not use this.

def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')

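This strips all non-ASCII characters, and does so **very inefficiently** (why not just `inputString.encode('ascii', 'ignore').decode('ascii')` and be done with it in a single step?). The Unicode standard covers far more than emojis; you can't just strip Latin, Greek, Korean, Burmese, Tibetan, Egyptian, or [any other script that Unicode supports](https://en.wikipedia.org/wiki/Script_(Unicode)#List_of_scripts_in_Unicode) just to remove emojis. (32 upvotes)
@MonaJalal: That string isn't actually Unicode (it is raw bytes representing the UTF-8 encoding of the actual Unicode text). Even when decoded, it contains no emoji at all (those bytes decode to left and right "smart quotes"). If this solved your problem, then your problem wasn't the one you asked about. This removes all non-ASCII characters (including simple ones such as an accented e, `é`), not just emojis. (2 upvotes)
@IsharaMalaviarachchi: I wrote an answer to a different question that removes emojis: [Remove emojis from multi-lingual Unicode text](//stackoverflow.com/a/51785357) (2 upvotes)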
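
To make the objection in these comments concrete, here is a minimal Python 3 sketch. The sample string and the helper name `strip_non_ascii` are made up for illustration; the body is the same ASCII round trip as the previous answer. It shows the accented letter disappearing together with the emoji:

```python
def strip_non_ascii(text):
    # Same technique as the previous answer: drop every non-ASCII character.
    return text.encode('ascii', 'ignore').decode('ascii')

# Illustrative input: "café" followed by an emoji.
sample = 'caf\u00e9 \U0001F602'
result = strip_non_ascii(sample)
print(repr(result))  # 'caf '
```

The `é` is lost as well, which is why this approach is unsuitable for multilingual text.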

Answer 2

jfs*_*jfs 38

On Python 2, you have to use a u'' literal to create a Unicode string. You should also pass the re.UNICODE flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')):

#!/usr/bin/env python
import re

text = u'This dog \U0001f602'
print(text) # with emoji

emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji

Output

This dog 😂
This dog

Note: emoji_pattern matches only some emojis (not all). See "Which characters are emoji".

It doesn't work for the string `เบอร์10 !! ส้มสวย01แฝดของ08พร้อมส่ง!`, whose bytes are `\xF0\x9F\x92\x8B\xF0\x9F` (2 upvotes)
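
For readers on Python 3, here is a minimal sketch of the same idea. It assumes the words from the question arrive as UTF-8 `bytes` objects (an assumption; the question shows Python 2 byte strings). The emoji bytes do start with `\xf0`, but the pattern can only match once the bytes have been decoded into code points:

```python
import re

# Python 3 string literals are Unicode, so no u'' prefix or re.UNICODE is needed.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)

# The list from the question, written as bytes objects (assumed input form).
words = [b'This', b'dog', b'\xf0\x9f\x98\x82', b'https://t.co/5N86jYipOI']

# Decode first, then substitute: the pattern only sees real code points.
cleaned = [emoji_pattern.sub('', w.decode('utf-8')) for w in words]
print(cleaned)

# The complete four-byte sequence quoted in the comment decodes to U+1F48B,
# which falls inside the "symbols & pictographs" range.
kiss = b'\xF0\x9F\x92\x8B'.decode('utf-8')
print(hex(ord(kiss)), repr(emoji_pattern.sub('', kiss)))
```

With this input, `cleaned` comes out as `['This', 'dog', '', 'https://t.co/5N86jYipOI']`.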

Answer 3

小智 19

Complete version for removing emojis:

import re
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
        u"\U00002702-\U000027B0"
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', data)

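This works, but: `u"\U00002702-\U000027B0"` is duplicated, and `u"\U000024C2-\U0001F251"` already includes the ranges `u"\U00002500-\U00002BEF"` and `u"\U00002702-\U000027B0"`. Also, `u"\U00010000-\U0010ffff"` already includes every range before it that has 5 or more digits, and `u"\u2600-\u2B55"` already includes `u"\u2640-\u2642"`. So this answer could be shorter and more concise. (4 upvotes)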
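
The redundancy described in this comment can be checked mechanically. The following is a sketch only: `merge_ranges` is a hypothetical helper, not part of any library, and the range list is transcribed by hand from the answer above.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent (start, end) code point ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]

# Ranges and single code points from the character class in the answer.
ranges = [
    (0x1F600, 0x1F64F), (0x1F300, 0x1F5FF), (0x1F680, 0x1F6FF),
    (0x1F1E0, 0x1F1FF), (0x2500, 0x2BEF), (0x2702, 0x27B0),
    (0x2702, 0x27B0), (0x24C2, 0x1F251), (0x1F926, 0x1F937),
    (0x10000, 0x10FFFF), (0x2640, 0x2642), (0x2600, 0x2B55),
    (0x200D, 0x200D), (0x23CF, 0x23CF), (0x23E9, 0x23E9),
    (0x231A, 0x231A), (0xFE0F, 0xFE0F), (0x3030, 0x3030),
]

merged = merge_ranges(ranges)
for start, end in merged:
    print('U+%04X-U+%04X' % (start, end))
```

The eighteen entries collapse to four single code points (U+200D, U+231A, U+23CF, U+23E9) plus one range, U+24C2-U+10FFFF. That range also spans CJK ideographs, Hangul and many other scripts, so the pattern removes far more than emojis.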

Answer 4

scw*_*ner 16

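If you are using the example from the accepted answer and still get a "bad character range" error, then you are probably using a narrow build (see this answer for more details). A reformatted version of the regex that seems to work is: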

emoji_pattern = re.compile(
    u"(\ud83d[\ude00-\ude4f])|"  # emoticons
    u"(\ud83c[\udf00-\uffff])|"  # symbols & pictographs (1 of 2)
    u"(\ud83d[\u0000-\uddff])|"  # symbols & pictographs (2 of 2)
    u"(\ud83d[\ude80-\udeff])|"  # transport & map symbols
    u"(\ud83c[\udde0-\uddff])"  # flags (iOS)
    "+", flags=re.UNICODE)
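
To see where the `\ud83d` and `\ude00` values in this pattern come from, here is a small Python 3 sketch of the standard UTF-16 surrogate pair calculation. The function name `to_surrogate_pair` is made up for illustration:

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = to_surrogate_pair(0x1F602)
print('%04X %04X' % (high, low))  # D83D DE02

# Cross-check against Python's own UTF-16 encoder.
print('\U0001F602'.encode('utf-16-be').hex())  # d83dde02
```

On a narrow build each emoji is stored as such a pair, which is why the pattern matches a fixed high surrogate followed by a range of low surrogates.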

Answer 5

小智 11

Complete version for removing emojis:

import re

def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

Answer 6

Kev*_*cka 9

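The accepted answer and the others worked for me to some extent, but I ultimately decided to strip all characters outside of the Basic Multilingual Plane. This also excludes future additions to the other Unicode planes (where emojis and the like live), which means I don't have to update my code every time new Unicode characters are added :).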

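In Python 2.7, convert your text to unicode if it isn't already, and then use the negated regex below (it substitutes anything not listed in the regex; the listed characters are all of the BMP except the surrogates, which are used in pairs to encode characters from the supplementary planes).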

NON_BMP_RE = re.compile(u"[^\U00000000-\U0000d7ff\U0000e000-\U0000ffff]", flags=re.UNICODE)
NON_BMP_RE.sub(u'', unicode(text, 'utf-8'))
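
A rough Python 3 equivalent might look like the sketch below. Python 3 strings are sequences of code points, so surrogates need no special handling here; `strip_non_bmp` is a name chosen for illustration.

```python
import re

# Matches any code point above U+FFFF, i.e. outside the Basic Multilingual Plane.
NON_BMP_RE = re.compile('[\U00010000-\U0010FFFF]')

def strip_non_bmp(text):
    return NON_BMP_RE.sub('', text)

print(repr(strip_non_bmp('This dog \U0001F602')))  # 'This dog '

# Symbols that live inside the BMP, such as U+2764 (heavy black heart),
# are left untouched by this approach.
print(strip_non_bmp('I \u2764 caf\u00e9') == 'I \u2764 caf\u00e9')  # True
```

The trade-off is visible in the second call: emoji-like symbols inside the BMP survive, while non-emoji characters from the supplementary planes are removed.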

Answer 7

小智 9

I was able to get rid of the emojis in the following way.

Install emoji: https://pypi.org/project/emoji/

$ pip3 install emoji

Answer 8

小智 8

Use the demoji package: https://pypi.org/project/demoji/

import demoji

text=""
emoji_less_text = demoji.replace(text, "")

Answer 9

小智 7

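The best solution for this is to use the external library emoji. The library is continuously updated with the latest emojis, so it can be used to find them in any text. Unlike the ASCII-decode approach, which removes all non-ASCII Unicode characters, this method keeps them and removes only the emojis.

First install the emoji library if you don't have it: pip install emoji
Next, import it into your file/project: import emoji
Now, to remove all emojis, use the following statement: emoji.get_emoji_regexp().sub("", msg), where msg is the text to be edited

That's all you need.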

Answer 10

kin*_*ing 6

If you'd rather not use a regex, the best solution may be to use the emoji package.

Here is a simple function that returns emoji-free text (thanks to this SO answer):

import emoji
def give_emoji_free_text(text):
    allchars = [str for str in text.decode('utf-8')]
    emoji_list = [c for c in allchars if c in emoji.UNICODE_EMOJI]
    clean_text = ' '.join([str for str in text.decode('utf-8').split() if not any(i in str for i in emoji_list)])
    return clean_text

If you are dealing with strings that contain emojis, this is straightforward:

>> s1 = "Hi  How is your  and . Have a nice weekend "
>> print s1
Hi  How is your  and . Have a nice weekend 
>> print give_emoji_free_text(s1)
Hi How is your and Have a nice weekend

If you are dealing with unicode (as in the example by @jfs), just encode it with utf-8 first.

>> s2 = u'This dog \U0001f602'
>> print s2
This dog 😂
>> print give_emoji_free_text(s2.encode('utf8'))
This dog

Edit

Based on the comments, it should be as simple as:

def give_emoji_free_text(text):
    return emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))

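The code in the edit will raise an error if `text` is already decoded. In that case, the return statement should be `return emoji.get_emoji_regexp().sub(r'', text)`, where we drop the unnecessary `.decode('utf8')` (8 upvotes)
The project does one better: it *includes a regex generator function*. Use `emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))` and be done with it. Don't loop over all the characters one by one; that is very inefficient. (7 upvotes)
The `emoji` package has a dedicated function for emoji replacement: `emoji.replace_emoji(str, replace='')` (5 upvotes)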

Answer 11

Hel*_*nda 6

I found two libraries for replacing emojis:

emoji: https://pypi.org/project/emoji/

import emoji
string = "  "
emoji.replace_emoji(string, replace="!")

demoji: https://pypi.org/project/demoji/

import demoji
string = "  "
demoji.replace(string, repl="!")

Both of them have other useful methods.

Answer 12

小智 5

I tried to collect a complete list of the Unicode ranges. I use it to extract emojis from tweets, and it works very well for me.

# Emojis pattern
emoji_pattern = re.compile("["
                u"\U0001F600-\U0001F64F"  # emoticons
                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                u"\U00002702-\U000027B0"
                u"\U000024C2-\U0001F251"
                u"\U0001f926-\U0001f937"
                u'\U00010000-\U0010ffff'
                u"\u200d"
                u"\u2640-\u2642"
                u"\u2600-\u2B55"
                u"\u23cf"
                u"\u23e9"
                u"\u231a"
                u"\u3030"
                u"\ufe0f"
    "]+", flags=re.UNICODE)
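
For extraction rather than removal, a pattern like this can be used with `findall`. Below is a minimal Python 3 sketch; the tweet text is made up, and only the four named ranges are included to keep the example short:

```python
import re

emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)

# Illustrative input, not real data.
tweet = 'good morning \U0001F602\U0001F602 off to work \U0001F680'

# Because of the trailing "+", each match is a run of adjacent emojis.
found = emoji_pattern.findall(tweet)

# Flatten the runs to get one entry per code point.
single = [ch for run in found for ch in run]
print(found, single)
```

Here `found` holds two runs (the doubled face and the rocket), and `single` holds three individual code points.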

Archived: 10 years ago
Views: 46153
Last activity: 6 years ago