Python: testing for a UTF-8 character in a string

use*_*660 2 python unicode right-to-left

I need to test whether a string that has already been encoded with str.encode('utf-8') is right-to-left. I tried:

if u'\u200f' in str.decode('utf-8'):
  print 'found it'

It neither complains nor works.

Q: What is the correct syntax for testing whether a single non-ASCII character occurs in a string? I am on Python 2.6 and cannot use 3.

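Q: I recall that even without an explicit RLM, text made up mainly of right-to-left characters defaults to RTL. Does anyone know a way to test for such a string without knowing which language to expect (i.e. the string could be Arabic, Hebrew, or any other RTL language)?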

Thanks for your help.

unu*_*tbu 6

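Every Unicode character has a "bidirectional" class. You can look it up with unicodedata.bidirectional. The function returns a string such as 'L', 'R', 'AL', etc., with the following meanings: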

| L   | Left_To_Right           | any strong left-to-right character                                |
| LRE | Left_To_Right_Embedding | U+202A: the LR embedding control                                  |
| LRO | Left_To_Right_Override  | U+202D: the LR override control                                   |
| R   | Right_To_Left           | any strong right-to-left (non-Arabic-type) character              |
| AL  | Arabic_Letter           | any strong right-to-left (Arabic-type) character                  |
| RLE | Right_To_Left_Embedding | U+202B: the RL embedding control                                  |
| RLO | Right_To_Left_Override  | U+202E: the RL override control                                   |
| PDF | Pop_Directional_Format  | U+202C: terminates an embedding or override control               |
| EN  | European_Number         | any ASCII digit or Eastern Arabic-Indic digit                     |
| ES  | European_Separator      | plus and minus signs                                              |
| ET  | European_Terminator     | a terminator in a numeric format context, includes currency signs |
| AN  | Arabic_Number           | any Arabic-Indic digit                                            |
| CS  | Common_Separator        | commas, colons, and slashes                                       |
| NSM | Nonspacing_Mark         | any nonspacing mark                                               |
| BN  | Boundary_Neutral        | most format characters, control codes, or noncharacters           |
| B   | Paragraph_Separator     | various newline characters                                        |
| S   | Segment_Separator       | various segment-related control codes                             |
| WS  | White_Space             | spaces                                                            |
| ON  | Other_Neutral           | most other symbols and punctuation marks                          |

For example:

In [3]: import unicodedata as UD
In [5]: UD.bidirectional(u'\u0688')
Out[5]: 'AL'

In [6]: UD.bidirectional(u'\u200f')
Out[6]: 'R'

In [7]: UD.bidirectional(u'H')
Out[7]: 'L'
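To see the classes for a whole string rather than one character at a time, something like the following could be used. This is only a sketch: bidi_classes and sample are names made up for this illustration, and the input is assumed to be a unicode string that has already been decoded.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, bidirectional class) pairs for a unicode string.

    Hypothetical helper for illustration; it only wraps
    unicodedata.bidirectional.
    """
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Latin letter, Hebrew letter, Arabic letter, space, ASCII digit
sample = u'a\u05d0\u0688 1'
pairs = bidi_classes(sample)
classes = [cls for _, cls in pairs]
print(classes)
```

Only the 'R' and 'AL' entries count as strong right-to-left characters; the space and the digit are weak or neutral.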

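So you can guess whether a string is right-to-left by determining whether it consists mostly of characters whose bidirectional class is 'R' or 'AL'.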

For example,

# coding: utf-8
import unicodedata as UD

texts = ['?????'.decode('utf-8'),
         u'Hello']
for text in texts:
    x = len([None for ch in text if UD.bidirectional(ch) in ('R', 'AL')])/float(len(text))
    print('{t} => {c}'.format(t=text.encode('utf-8'), c='RTL' if x>0.5 else 'LTR'))

yields

????? => RTL
Hello => LTR
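The ratio above is taken over every character, so spaces, digits and punctuation pull a short RTL string toward 'LTR', and an empty string raises ZeroDivisionError. A possible variant, which is not the code above but a sketch under the assumption that only strong characters should vote, compares the count of 'R'/'AL' characters with the count of 'L' characters. The name guess_direction and the None return value are choices made for this illustration.

```python
import unicodedata

RTL_CLASSES = ('R', 'AL')   # strong right-to-left classes
LTR_CLASSES = ('L',)        # strong left-to-right class

def guess_direction(text):
    """Guess 'RTL' or 'LTR' for a unicode string.

    Hypothetical helper: counts only strong characters and returns
    None when the string contains none (empty, digits only, ...).
    Ties are reported as 'LTR'.
    """
    rtl = ltr = 0
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in RTL_CLASSES:
            rtl += 1
        elif cls in LTR_CLASSES:
            ltr += 1
    if rtl == 0 and ltr == 0:
        return None
    return 'RTL' if rtl > ltr else 'LTR'

hebrew = u'\u05e9\u05dc\u05d5\u05dd'   # four Hebrew letters, class 'R'
print(guess_direction(hebrew))
print(guess_direction(u'Hello'))
```

As with the example above, this is a heuristic about which script dominates, not an implementation of the Unicode bidirectional algorithm.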

Regarding the first question:

Q: What is the correct syntax for testing whether a single non-ASCII character occurs in a string? I am on Python 2.6 and cannot use 3.

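The way you are testing whether a character is in a unicode string is correct. If u'\u200f' in str.decode('utf-8') neither complains nor works, then u'\u200f' is simply not in the unicode string.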
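A quick way to convince yourself is to build one byte string that does contain the mark and one that does not, then run the same membership test on both. The names contains_char, with_mark and without_mark are invented for this sketch, and the data is assumed to be valid UTF-8.

```python
RLM = u'\u200f'  # RIGHT-TO-LEFT MARK

def contains_char(encoded, char, encoding='utf-8'):
    """Decode a byte string and test whether it contains the unicode character.

    Hypothetical helper for illustration; it is the same
    `char in data.decode(...)` test used in the question.
    """
    return char in encoded.decode(encoding)

with_mark = (u'abc' + RLM).encode('utf-8')
without_mark = u'abc'.encode('utf-8')

print(contains_char(with_mark, RLM))
print(contains_char(without_mark, RLM))
```

If the test comes back False on your real data, the string has no explicit mark and its direction has to be inferred from the characters themselves.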