Wrong charset after re.sub()

Question

I have this code:

import chardet, re    

content = "????? ????????????? ? ???????."
print content
print chardet.detect(content)
content = re.sub(u"(?i)[^-0-9a-z?-??«»\&\;\/\<\>\.,\s\(\)\*:!\?]", "", content)
print content
print chardet.detect(content)

and it outputs:

????? ????????????? ? ???????.
{'confidence': 0.99, 'encoding': 'utf-8'}
? ?  .
{'confidence': 0.5, 'encoding': 'windows-1252'}

What am I doing wrong? How can I get a UTF-8 str after re.sub()? (Python 2.7, UTF-8 source file, PyCharm IDE.)

Thanks.

Answer 1

geo*_*org 7

This is (I think) what you are trying to achieve (I have simplified the regex for clarity):

#coding=utf8
import re    
content = u"????? XYZ ????????????? ? ??????????."
content = re.sub(u"(?iu)[^?-??]", ".", content)
print content.encode('utf8') # ?????.....?????????????.?....???????.

Note the key points:

the subject is unicode
the expression is unicode
the expression uses the unicode flag (?u) so that case folding works.
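
The three points above can be sketched as a decode, substitute, encode round trip, which is also how you end up with a UTF-8 str again on Python 2. This is only an illustration, not the asker's exact code: the helper name `keep_cyrillic` and the sample text are assumptions, and the character class is a simplified Cyrillic range written with escapes so the snippet does not depend on the source file encoding.

```python
import re

# Hypothetical pattern: anything that is neither a Cyrillic letter
# (U+0430..U+044F, U+0451) nor whitespace.
# (?i) folds case; (?u) makes that folding Unicode-aware on Python 2.
NOT_CYRILLIC = re.compile(u"(?iu)[^\u0430-\u044f\u0451\\s]")


def keep_cyrillic(raw_bytes):
    """Decode UTF-8 bytes, strip non-Cyrillic characters, re-encode as UTF-8."""
    text = raw_bytes.decode("utf-8")        # the subject becomes unicode
    cleaned = NOT_CYRILLIC.sub(u"", text)   # unicode pattern on a unicode subject
    return cleaned.encode("utf-8")          # back to a UTF-8 byte string


# Sample input (an assumption): a Cyrillic word, "XYZ", another Cyrillic word, a dot.
sample = u"\u041f\u0440\u0438\u0432\u0435\u0442 XYZ \u043c\u0438\u0440.".encode("utf-8")
result = keep_cyrillic(sample)
```

Running chardet.detect(result) afterwards should be more meaningful than in the question, because the bytes were never matched against a unicode pattern directly.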

Also, for serious unicode work I recommend the regex module, which provides excellent and nearly complete unicode support. Consider:

# drop everything except Cyrillic and spaces 
import regex
content = regex.sub(u'[^\p{Cyrillic}\p{Zs}]', '', content)

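If installing the third-party regex module is not an option, a rough standard-library approximation can be built on unicodedata. This is a sketch under assumptions: the function name `cyrillic_and_spaces` is made up here, and checking the character name for a `CYRILLIC` prefix only approximates what `\p{Cyrillic}` matches, so the two may disagree on edge cases such as combining marks.

```python
import unicodedata


def cyrillic_and_spaces(text):
    """Keep characters whose Unicode name starts with CYRILLIC, plus Zs separators."""
    kept = []
    for ch in text:
        # name() returns the fallback (empty string) for unnamed characters.
        is_cyrillic = unicodedata.name(ch, u"").startswith("CYRILLIC")
        is_space = unicodedata.category(ch) == "Zs"
        if is_cyrillic or is_space:
            kept.append(ch)
    return u"".join(kept)


# Sample input (an assumption): Cyrillic word, comma, "world", Cyrillic word, "!".
demo = cyrillic_and_spaces(u"\u041f\u0440\u0438\u0432\u0435\u0442, world \u043c\u0438\u0440!")
```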

Although re.UNICODE is documented as only changing \w and friends, in my tests it also affects case folding (re.IGNORECASE):

Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> src = u'Σσ Φφ Γγ'
>>> src
u'\u03a3\u03c3 \u03a6\u03c6 \u0393\u03b3'
>>> re.sub(ur'(?i)[α-ω]', '-', src)
u'\u03a3- \u03a6- \u0393-'
>>> re.sub(ur'(?iu)[α-ω]', '-', src)
u'-- -- --'

So this is either an undocumented feature or a documentation problem.
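
One way to avoid relying on this interaction is to spell out both cases in the character class and drop (?i) altogether. A small sketch, assuming the Greek sample string from the session above and the basic ranges U+0391..U+03A9 and U+03B1..U+03C9 (which leave out accented letters):

```python
import re

src = u"\u03a3\u03c3 \u03a6\u03c6 \u0393\u03b3"

# Upper-case and lower-case ranges are listed explicitly, so no flags are needed.
both_cases = re.compile(u"[\u0391-\u03a9\u03b1-\u03c9]")
replaced = both_cases.sub(u"-", src)
print(replaced)  # -- -- --
```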
