Python: check whether a UTF-8 string is uppercase

mat*_*hew 7 python unicode utf-8

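I am having trouble with .isupper() when I have a UTF-8 encoded string. I have a lot of text files that I am converting to XML. While the text varies a lot, the format is static: words in all caps should be wrapped in <title> tags and everything else in <p>. It is considerably more complex than this, but this should be sufficient for my question.

My problem is that this is a UTF-8 file. That is a must, because there will be a lot of non-English characters in the final output. This is probably a good time to provide a brief example: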

inputText.txt

RÉSUMÉ

Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage. Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.

DesiredOutput

    <title>RÉSUMÉ</title>
    <p>Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud
       aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick
       ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage.
       Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone
       sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison
       mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.
    </p>

Sample code

    #!/usr/local/bin/python2.7
    # yes this is an alt-install of python

    import codecs
    import sys
    import re
    from xml.dom.minidom import Document

    def main():
        fn = sys.argv[1]
        input = codecs.open(fn, 'r', 'utf-8')
        output = codecs.open('desiredOut.xml', 'w', 'utf-8')
        doc = Document()
        doc = parseInput(input,doc)
        print>>output, doc.toprettyxml(indent='  ',encoding='UTF-8')

    def parseInput(input, doc):
        tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n'] #remove blank lines

        for i in range(len(tokens)):
            # THIS IS MY PROBLEM. .isupper() is never true.
            if str(tokens[i]).isupper():
                title = doc.createElement('title')
                tText = str(tokens[i]).strip('[\']')
                titleText = doc.createTextNode(tText.title())
                doc.appendChild(title)
                title.appendChild(titleText)
            else:
                p = doc.createElement('p')
                pText = str(tokens[i]).strip('[\']')
                paraText = doc.createTextNode(pText)
                doc.appendChild(p)
                p.appendChild(paraText)

        return doc

    if __name__ == '__main__':
        main()

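Ultimately it is pretty straightforward, and I would welcome critiques or suggestions on my code. Who wouldn't? In particular I am unhappy with str(tokens[i]); perhaps there is a better way to loop through a list of strings?

The purpose of this question, though, is to figure out the most efficient way to check whether a UTF-8 string is uppercase. Perhaps I should look into crafting a regular expression for this.

Note that I did not run this code, and it may not run correctly. I hand-picked the parts from working code and may have mistyped something. Alert me and I will correct it. Lastly, note that I am not using lxml.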

Joh*_*hin 9

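The main reason the code you posted fails (even with only ASCII characters!) is that in Python 2.7 re.split() does not split on a zero-width match, and r'\b' matches zero characters. (Python 3.7 and later do split on empty matches, so the first result below differs there.)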

>>> re.split(r'\b', 'foo-BAR_baz')
['foo-BAR_baz']
>>> re.split(r'\W+', 'foo-BAR_baz')
['foo', 'BAR_baz']
>>> re.split(r'[\W_]+', 'foo-BAR_baz')
['foo', 'BAR', 'baz']

In addition, you need flags=re.UNICODE to ensure that the Unicode definitions of \b, \W, etc. are used. And using str() the way you did is at best unnecessary.
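A minimal sketch of the splitting fix, assuming Python 3 (where str is already Unicode text and the Unicode definitions are the default for str patterns). The helper name split_words is hypothetical, not something from the question:

```python
import re

# Split on runs of non-word characters and underscores.
# re.UNICODE is the default for str patterns on Python 3; it is passed
# explicitly here only to mirror what the Python 2 code would need.
WORD_SEPARATOR = re.compile(r'[\W_]+', flags=re.UNICODE)


def split_words(line):
    """Return the non-empty words of a line (hypothetical helper)."""
    return [word for word in WORD_SEPARATOR.split(line) if word]


# Iterate over the words directly; no str() call is needed.
for word in split_words('R\xc9SUM\xc9 du jour'):
    print(repr(word), word.isupper())
```

Because É counts as a word character under the Unicode definitions, the accented title stays in one piece instead of being split at the accents.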

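So it is not really a Unicode problem per se. However, some answerers tried to address it as a Unicode problem, with varying degrees of success... here is my take on the Unicode problem:

The general solution to this kind of problem is to follow the standard bog-simple advice that applies to all text problems: decode your input from byte strings to unicode strings as early as possible. Do all processing in unicode. Encode your output unicode into byte strings as late as possible.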

So: byte_string.decode('utf8').isupper() is the way to go. Hacks like byte_string.decode('ascii', 'ignore').isupper() are to be avoided; they can be all of (complicated, unneeded, failure-prone); see below.
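Applied to the title/paragraph task in the question, the decode-early, encode-late advice might look roughly like this. It is a sketch assuming Python 3 and a line-based input; classify_lines and render are hypothetical names, and the output is plain concatenated markup rather than the minidom document the question builds:

```python
from xml.sax.saxutils import escape


def classify_lines(raw_bytes):
    """Decode once, then tag each non-blank line as 'title' or 'p'."""
    text = raw_bytes.decode('utf-8')      # decode as early as possible
    classified = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        # str.isupper() works on Unicode text, so accented capitals count.
        tag = 'title' if line.isupper() else 'p'
        classified.append((tag, line))
    return classified


def render(classified):
    """Build the markup as text, then encode as late as possible."""
    body = ''.join('<%s>%s</%s>\n' % (tag, escape(line), tag)
                   for tag, line in classified)
    return body.encode('utf-8')


raw = 'R\xc9SUM\xc9\n\nBacon ipsum dolor sit amet.\n'.encode('utf-8')
print(render(classify_lines(raw)).decode('utf-8'))
```

All the decisions (blank-line filtering, the uppercase test) happen on decoded text; bytes appear only at the two edges.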

Some code:

# coding: ascii
import unicodedata

tests = (
    (u'\u041c\u041e\u0421\u041a\u0412\u0410', True), # capital of Russia, all uppercase
    (u'R\xc9SUM\xc9', True), # RESUME with accents
    (u'R\xe9sum\xe9', False), # Resume with accents
    (u'R\xe9SUM\xe9', False), # ReSUMe with accents
    )

for ucode, expected in tests:
    print
    print 'unicode', repr(ucode)
    for uc in ucode:
        print 'U+%04X %s' % (ord(uc), unicodedata.name(uc))
    u8 = ucode.encode('utf8')
    print 'utf8', repr(u8)
    actual1 = u8.decode('utf8').isupper() # the natural way of doing it
    actual2 = u8.decode('ascii', 'ignore').isupper() # @jathanism
    print expected, actual1, actual2

Output from Python 2.7.1:

unicode u'\u041c\u041e\u0421\u041a\u0412\u0410'
U+041C CYRILLIC CAPITAL LETTER EM
U+041E CYRILLIC CAPITAL LETTER O
U+0421 CYRILLIC CAPITAL LETTER ES
U+041A CYRILLIC CAPITAL LETTER KA
U+0412 CYRILLIC CAPITAL LETTER VE
U+0410 CYRILLIC CAPITAL LETTER A
utf8 '\xd0\x9c\xd0\x9e\xd0\xa1\xd0\x9a\xd0\x92\xd0\x90'
True True False

unicode u'R\xc9SUM\xc9'
U+0052 LATIN CAPITAL LETTER R
U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
U+0053 LATIN CAPITAL LETTER S
U+0055 LATIN CAPITAL LETTER U
U+004D LATIN CAPITAL LETTER M
U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
utf8 'R\xc3\x89SUM\xc3\x89'
True True True

unicode u'R\xe9sum\xe9'
U+0052 LATIN CAPITAL LETTER R
U+00E9 LATIN SMALL LETTER E WITH ACUTE
U+0073 LATIN SMALL LETTER S
U+0075 LATIN SMALL LETTER U
U+006D LATIN SMALL LETTER M
U+00E9 LATIN SMALL LETTER E WITH ACUTE
utf8 'R\xc3\xa9sum\xc3\xa9'
False False False

unicode u'R\xe9SUM\xe9'
U+0052 LATIN CAPITAL LETTER R
U+00E9 LATIN SMALL LETTER E WITH ACUTE
U+0053 LATIN CAPITAL LETTER S
U+0055 LATIN CAPITAL LETTER U
U+004D LATIN CAPITAL LETTER M
U+00E9 LATIN SMALL LETTER E WITH ACUTE
utf8 'R\xc3\xa9SUM\xc3\xa9'
False False True

The only differences with Python 3.x are syntactical; the principle (do all processing in unicode) remains the same.
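For reference, the same comparison might be written for Python 3 along these lines. This is a sketch, not a port checked against the 2.7.1 run above; the function name check is hypothetical:

```python
# Same test strings as in the Python 2 script, written as Python 3 str literals.
tests = (
    ('\u041c\u041e\u0421\u041a\u0412\u0410', True),  # capital of Russia, all uppercase
    ('R\xc9SUM\xc9', True),    # RESUME with accents
    ('R\xe9sum\xe9', False),   # Resume with accents
    ('R\xe9SUM\xe9', False),   # ReSUMe with accents
)


def check(text):
    """Return (natural, lossy) uppercase results for the UTF-8 bytes of text."""
    u8 = text.encode('utf-8')                        # bytes, as read from a file
    natural = u8.decode('utf-8').isupper()           # decode properly, then test
    lossy = u8.decode('ascii', 'ignore').isupper()   # drops every non-ASCII byte
    return natural, lossy


for text, expected in tests:
    natural, lossy = check(text)
    print(ascii(text), expected, natural, lossy)
```

The lossy variant gets the first and last cases wrong for the same reason as before: once the non-ASCII bytes are thrown away, the accented or Cyrillic letters no longer take part in the test.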