Mon*_*lal 25 python string unicode special-characters emoji
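I found this code in Python for removing emojis, but it does not work. Can you help with other code or a fix for this one?
I have observed that all my emojis start with \xf, but when I try to search with str.startswith("\xf") I get an invalid character error.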
emoji_pattern = r'/[x{1F601}-x{1F64F}]/u'
re.sub(emoji_pattern, '', word)
Here is the error:
Traceback (most recent call last):
File "test.py", line 52, in <module>
re.sub(emoji_pattern,'',word)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib/python2.7/re.py", line 244, in _compile
raise error, v # invalid expression
sre_constants.error: bad character range
Each item in the list can be a word: ['This', 'dog', '\xf0\x9f\x98\x82', 'https://t.co/5N86jYipOI']
UPDATE: I used this other code:
emoji_pattern=re.compile(ur" " " [\U0001F600-\U0001F64F] # emoticons \
|\
[\U0001F300-\U0001F5FF] # symbols & pictographs\
|\
[\U0001F680-\U0001F6FF] # transport & map symbols\
|\
[\U0001F1E0-\U0001F1FF] # flags (iOS)\
" " ", re.VERBOSE)
emoji_pattern.sub('', word)
Abd*_*dam 39
I am updating my answer to match the one by @jfs, because my previous answer failed to account for other Unicode scripts such as Latin, Greek, etc. Stack Overflow doesn't allow me to delete my previous answer, so I am updating it to match the most accepted answer to the question.
#!/usr/bin/env python
import re
text = u'This is a smiley face \U0001f602'
print(text) # with emoji
def deEmojify(text):
    regex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags = re.UNICODE)
    return regex_pattern.sub(r'', text)

print(deEmojify(text))  # without emoji
This was my previous answer, do not use this.
def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
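To illustrate why the ASCII round-trip above is discouraged, here is a minimal sketch (Python 3 assumed; the helper name strip_non_ascii and the sample string are made up for illustration). It removes accented Latin and Greek letters along with the emoji:

```python
def strip_non_ascii(text):
    # Same technique as the old answer: drop every character that is not ASCII.
    return text.encode('ascii', 'ignore').decode('ascii')

# Hypothetical sample: "cafe" with an accented e, two Greek letters, one emoji.
sample = 'caf\u00e9 \u03b1\u03b2 \U0001F602'

# The accented letter and the Greek letters disappear together with the emoji.
print(repr(strip_non_ascii(sample)))
```

Only the plain ASCII letters and spaces survive, which is rarely what you want for multilingual text.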
jfs*_*jfs 38
On Python 2, you have to use a u'' literal to create a Unicode string. You should also pass the re.UNICODE flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')):
#!/usr/bin/env python
import re
text = u'This dog \U0001f602'
print(text) # with emoji
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji
Output:
This dog 😂
This dog
Note: emoji_pattern matches only some emoji (not all). See "Which characters are emoji".
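For readers on Python 3, the same idea might look like the sketch below. It assumes the words arrive as UTF-8 byte strings like the list in the question; the names raw_words and cleaned are made up for illustration. In Python 3, str literals are already Unicode and re.UNICODE is the default for str patterns, so neither the u prefix nor the flag is needed:

```python
import re

emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)

# Assumed input: the UTF-8 encoded items from the question, as bytes.
raw_words = [b'This', b'dog', b'\xf0\x9f\x98\x82', b'https://t.co/5N86jYipOI']

# Decode first, so the pattern sees code points rather than raw bytes.
words = [w.decode('utf-8') for w in raw_words]

# Remove the emoji, then drop items that became empty.
cleaned = [emoji_pattern.sub('', w) for w in words]
cleaned = [w for w in cleaned if w]

print(cleaned)  # ['This', 'dog', 'https://t.co/5N86jYipOI']
```

The decode step is the important part: the bytes \xf0\x9f\x98\x82 are the UTF-8 encoding of U+1F602, and the pattern can only match it after decoding.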
小智 19
Complete version for removing emojis:
import re

def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, dingbats, arrows (not Chinese characters)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
        "]+", re.UNICODE)
    return re.sub(emoj, '', data)
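One caveat worth checking before adopting such a broad character class: the range \U000024C2-\U0001F251 spans most of the Basic Multilingual Plane above U+24C2, including the CJK ideographs (U+4E00 to U+9FFF). The sketch below (Python 3 assumed; the names broad and narrow and the sample text are made up for illustration) compares that single range with the narrower emoji ranges:

```python
import re

# The single wide range taken from the character class above.
broad = re.compile("[\U000024C2-\U0001F251]+")

# Narrower ranges: pictographs, emoticons, transport symbols.
narrow = re.compile("[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF]+")

# Hypothetical sample: two Chinese characters, an English word, one emoji.
sample = '\u4f60\u597d world \U0001F602'

# The wide range deletes the Chinese characters (and misses U+1F602,
# which lies above U+1F251).
print(repr(broad.sub('', sample)))

# The narrow ranges delete only the emoji.
print(repr(narrow.sub('', sample)))
```

If the input can contain Chinese, Japanese, or Korean text, the narrower ranges are the safer starting point.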
scw*_*ner 16
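If you are using the example from the accepted answer and still get "bad character range" errors, then you are probably using a narrow build (see this answer for more details). A reformatted version of the regex that seems to work is: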
emoji_pattern = re.compile(
u"(\ud83d[\ude00-\ude4f])|" # emoticons
u"(\ud83c[\udf00-\uffff])|" # symbols & pictographs (1 of 2)
u"(\ud83d[\u0000-\uddff])|" # symbols & pictographs (2 of 2)
u"(\ud83d[\ude80-\udeff])|" # transport & map symbols
u"(\ud83c[\udde0-\uddff])" # flags (iOS)
"+", flags=re.UNICODE)
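As a rough check of where the \ud83d and \ude00 values come from, the sketch below tests for a narrow build and computes the UTF-16 surrogate pair for a code point. The helper name to_surrogate_pair is made up for illustration. Note that on Python 3.3 and later sys.maxunicode is always 0x10FFFF, so the narrow-build case only applies to older interpreters:

```python
import sys

def to_surrogate_pair(code_point):
    """Return the UTF-16 (high, low) surrogates for a supplementary-plane code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# True only on a narrow build (Python 2 or Python < 3.3 compiled with UCS-2).
is_narrow_build = sys.maxunicode == 0xFFFF
print(is_narrow_build)

# First and last emoticon code points map to the range used in the regex.
print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
print([hex(u) for u in to_surrogate_pair(0x1F64F)])  # ['0xd83d', '0xde4f']
```

On a narrow build each emoji is stored as such a pair of code units, which is why the pattern has to match a high surrogate followed by a range of low surrogates.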
小智 11
Complete version for removing emojis:
def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)
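The accepted answer, and others, worked for me for a while, but I ultimately decided to strip all characters outside the Basic Multilingual Plane (BMP). This also excludes future additions to the other Unicode planes (where emoji and such live), which means I don't have to update my code every time new Unicode characters are added :).
In Python 2.7, convert your text to unicode if it isn't already, then use the negated regex below (it substitutes anything not in the character class, and the class is every BMP character except the surrogates, which are used in pairs to encode supplementary-plane characters).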
NON_BMP_RE = re.compile(u"[^\U00000000-\U0000d7ff\U0000e000-\U0000ffff]", flags=re.UNICODE)
NON_BMP_RE.sub(u'', unicode(text, 'utf-8'))
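On Python 3 the same approach can be sketched without the unicode() call, since str is already a sequence of code points. The version below is an assumption-laden variant rather than a drop-in port: it matches the supplementary planes directly instead of negating the BMP, so unlike the original it leaves lone surrogates alone. The name strip_non_bmp and the sample string are made up for illustration:

```python
import re

# Everything above the Basic Multilingual Plane (U+10000 to U+10FFFF).
NON_BMP_RE = re.compile("[\U00010000-\U0010FFFF]")

def strip_non_bmp(text):
    # Remove every supplementary-plane character, emoji or not.
    return NON_BMP_RE.sub("", text)

# Hypothetical sample: one supplementary-plane emoji (U+1F602),
# an accented letter, and a BMP symbol (U+2764, heavy black heart).
sample = "This dog \U0001F602 caf\u00e9 \u2764"
print(strip_non_bmp(sample))
```

The accented letter and the heart at U+2764 are kept because they are inside the BMP, so this trades completeness for not having to maintain a list of ranges.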
小智 9
I was able to get rid of the emoji in the following way.
Install the emoji package: https://pypi.org/project/emoji/
$ pip3 install emoji
小智 8
Use the demoji package: https://pypi.org/project/demoji/
import demoji
text=""
emoji_less_text = demoji.replace(text, "")
If you are not keen on using regex, the best solution could be the emoji Python package.
Here is a simple function that returns emoji-free text (thanks to this SO answer):
import emoji

def give_emoji_free_text(text):
    # text is a UTF-8 byte string (Python 2 str); decode it once
    decoded = text.decode('utf-8')
    # emoji characters that occur in the text
    emoji_list = [c for c in decoded if c in emoji.UNICODE_EMOJI]
    # keep only the whitespace-separated words that contain no emoji
    clean_text = ' '.join([word for word in decoded.split() if not any(e in word for e in emoji_list)])
    return clean_text
If you are dealing with strings that contain emojis, this is straightforward:
>> s1 = "Hi How is your and . Have a nice weekend "
>> print s1
Hi How is your and . Have a nice weekend
>> print give_emoji_free_text(s1)
Hi How is your and Have a nice weekend
If you are dealing with unicode (as in the example by @jfs), just encode it with utf-8 first.
>> s2 = u'This dog \U0001f602'
>> print s2
This dog
>> print give_emoji_free_text(s2.encode('utf8'))
This dog
Edit
Based on the comment, it should be as easy as:
def give_emoji_free_text(text):
    return emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))
I found two libraries for replacing emojis:
emoji: https://pypi.org/project/emoji/
import emoji
string = " "
emoji.replace_emoji(string, replace="!")
demoji: https://pypi.org/project/demoji/
import demoji
string = " "
demoji.replace(string, repl="!")
Both of them have other useful methods.
小智 5
I tried to collect a complete list of the Unicode ranges. I use it to extract emojis from tweets, and it works very well for me.
# Emojis pattern
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
u"\U0001f926-\U0001f937"
u'\U00010000-\U0010ffff'
u"\u200d"
u"\u2640-\u2642"
u"\u2600-\u2B55"
u"\u23cf"
u"\u23e9"
u"\u231a"
u"\u3030"
u"\ufe0f"
"]+", flags=re.UNICODE)
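Since the stated goal here is extraction rather than removal, a usage sketch may help. It assumes Python 3 and, to stay short, uses only the first four ranges of the pattern above; the function name extract_emojis and the sample tweet are made up for illustration:

```python
import re

# Subset of the ranges listed above, kept short for the example.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)

def extract_emojis(text):
    # findall returns each run of consecutive matching characters as one string.
    return emoji_pattern.findall(text)

# Hypothetical tweet: U+1F602 twice in a row, then U+1F3DF (stadium).
tweet = "Great game \U0001F602\U0001F602 see you at the \U0001F3DF tomorrow"
print(extract_emojis(tweet))
```

Because of the trailing +, adjacent emoji come back as a single run; drop the + if you want one list entry per character.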