Java regex to match Vietnamese characters

aut*_*umn 3 java regex

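I have to write a regular expression to restrict an input field so that it only allows Vietnamese characters, English characters and digits. I know how to restrict it to English characters ([a-zA-Z]) and digits ([0-9]), but I don't know how to do it for Vietnamese characters.

Can anyone give me a Java regex that matches Vietnamese characters?

Vietnamese characters look like: ể, ứ (Edit: but I don't know all of them. Otherwise, I could use [a-list-of-chars], or maybe there is a range, like [a-d] instead of [abcd])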

nha*_*tdh 17

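Vietnamese alphabet

The intersection of the Vietnamese alphabet and the English alphabet (i.e. whatever is common between the 2 alphabets) is the English alphabet minus the letters f, j, w, z.

In Vietnamese, a, e, i, o, u, y are considered vowels.

Apart from those, Vietnamese also uses several other characters with diacritics. The list below shows the uppercase form of the characters (the lowercase version has a 1 character to 1 character mapping, unlike ß in German):

Vowels: Ă, Â, Ê, Ô, Ơ, Ư
Consonant: Đ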

Vietnamese has 6 tones. Except for the first tone, the other 5 tones are indicated by an additional diacritic on the vowel. The tonal diacritics are acute á, grave à, hook ả, tilde ã and dot below ạ. Since there are (6 + 6) vowels times 5 tones with diacritics, plus 6 vowels that already carry a diacritic in the first tone, there are 66 glyphs of vowels with diacritic(s):

Here is the list of all (67) consonants and vowels with diacritic(s):

  Á À Ã Ả Ạ
Ă Ắ Ằ Ẵ Ẳ Ặ
Â Ấ Ầ Ẫ Ẩ Ậ
Đ
  É È Ẽ Ẻ Ẹ
Ê Ế Ề Ễ Ể Ệ
  Í Ì Ĩ Ỉ Ị
  Ó Ò Õ Ỏ Ọ
Ô Ố Ồ Ỗ Ổ Ộ
Ơ Ớ Ờ Ỡ Ở Ợ
  Ú Ù Ũ Ủ Ụ
Ư Ứ Ừ Ữ Ử Ự
  Ý Ỳ Ỹ Ỷ Ỵ
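The count of 67 can be cross-checked by building the letters from a base vowel plus combining marks and composing them with java.text.Normalizer. This is only a sketch: the class name VietnameseLetters and the helper build() are made up for illustration, and the combining mark code points are written as escapes so the source stays ASCII.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

public class VietnameseLetters {
    // The 12 vowels: 6 plain ones plus 6 carrying a breve (U+0306),
    // circumflex (U+0302) or horn (U+031B).
    static final String[] VOWELS = {
        "A", "A\u0306", "A\u0302", "E", "E\u0302", "I",
        "O", "O\u0302", "O\u031B", "U", "U\u031B", "Y"
    };

    // The 5 tone marks: acute, grave, hook above, tilde, dot below.
    static final String[] TONES = {"\u0301", "\u0300", "\u0309", "\u0303", "\u0323"};

    // Hypothetical helper: returns every uppercase letter with diacritic(s).
    static Set<String> build() {
        Set<String> letters = new LinkedHashSet<>();
        letters.add("\u0110"); // D WITH STROKE, the only consonant in the list
        for (String vowel : VOWELS) {
            if (vowel.length() > 1) {
                // A vowel that already carries a diacritic in the first tone.
                letters.add(Normalizer.normalize(vowel, Normalizer.Form.NFC));
            }
            for (String tone : TONES) {
                letters.add(Normalizer.normalize(vowel + tone, Normalizer.Form.NFC));
            }
        }
        return letters;
    }

    public static void main(String[] args) {
        Set<String> letters = build();
        System.out.println(letters.size());            // 1 + 6 + 12 * 5 = 67
        System.out.println(String.join("", letters));
    }
}
```

Every entry composes to a single char, so the joined string can be dropped into a character class. The order produced here is not the order required by the CANON_EQ pattern discussed later.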

These characters are spread across different Latin blocks in Unicode. I handpicked them from Character Map, and I had to be careful not to pick characters from other scripts that are visually identical to the ones above. To be sure, we can print the names of the characters and check that they are Latin characters rather than Greek or Cyrillic.

String VIETNAMESE_DIACRITIC_CHARACTERS = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";

for (char c: VIETNAMESE_DIACRITIC_CHARACTERS.toCharArray()) {
    System.out.println(c + ": " + Character.getName(c));
}

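Combining characters

Vietnamese input methods such as Unikey have 2 modes: single code point mode ("Unicode dựng sẵn") and combining mark mode ("Unicode tổ hợp").

For example, for the same character ợ (U+1EE3), there are several ways to specify it:

  • As a single code point (1 code point): ợ
  • As a combination of ơ (U+01A1) and combining dot below (U+0323) (2 code points): ơ + U+0323
  • As a combination of o, combining horn (U+031B) and combining dot below (U+0323) (3 code points): o + U+031B + U+0323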

You can paste these strings into your browser's console and check their lengths:

["\u1EE3", "\u01A1\u0323", "o\u031B\u0323"].forEach(function (e) {console.log(e.length);})

If you want to match all those 3 variations above, you must list all possible combinations and permutations to specify the character, and you would have to do this for all the characters with diacritics as listed above, and in both uppercase and lowercase.

Easy enough?

Even if you answer yes, your code will become an unmaintainable mess that no one can understand.

Canonical Equivalence

Since there is more than one way to specify the same text ợ, without any transformation the single code point ợ (U+1EE3) and the sequence o + U+031B + U+0323 do not compare as equal.

"\u1EE3".equals("o\u031B\u0323") --> false

The Unicode Standard therefore defines all 3 ways to specify ợ above as canonically equivalent, and also defines methods to normalize a string for comparison purposes.
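As a small illustration of normalizing before comparing, the sketch below uses java.text.Normalizer on the three spellings of ợ. The class name CanonEquivDemo and the helper canonicallyEqual are invented for this example.

```java
import java.text.Normalizer;

public class CanonEquivDemo {
    // Hypothetical helper: two strings are canonically equivalent
    // when their NFD forms are identical.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        String single = "\u1EE3";        // 1 code point
        String two = "\u01A1\u0323";     // o with horn, then combining dot below
        String three = "o\u031B\u0323";  // o, combining horn, combining dot below

        System.out.println(single.equals(three));            // false
        System.out.println(canonicallyEqual(single, two));   // true
        System.out.println(canonicallyEqual(single, three)); // true
        // The order of the two combining marks does not matter either.
        System.out.println(canonicallyEqual(three, "o\u0323\u031B")); // true
    }
}
```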

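Java Pattern support for Canonical Equivalence

The reference implementation of the Pattern class (by Oracle, widely used on Windows and other platforms) has (partial) support for canonical equivalence matching via the Pattern.CANON_EQ mode. As can be seen from the bug reports filed against it, it is rather unusable as is. As of the time of writing, the bug is present in all versions since CANON_EQ was "supported", and it is unlikely to be fixed soon. However, it is not completely broken, and we can still make use of whatever the option currently offers.

Here is the construction of the Pattern that matches the Vietnamese + English alphabet: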

String VIETNAMESE_DIACRITIC_CHARACTERS
        = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";

Pattern p =
    Pattern.compile("(?:[" + VIETNAMESE_DIACRITIC_CHARACTERS + "]|[A-Z])++",
                    Pattern.CANON_EQ |
                    Pattern.CASE_INSENSITIVE |
                    Pattern.UNICODE_CASE);

The additional flags Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE make the pattern match case-insensitively for all Unicode characters. Pattern.CASE_INSENSITIVE alone only makes the pattern match case-insensitively for characters in the US-ASCII charset.
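The difference between the two flag combinations can be seen on a single non-ASCII letter. The sketch below matches uppercase Đ (U+0110) against lowercase đ (U+0111). The class name CaseFlagsDemo and the helper are made up for the demonstration.

```java
import java.util.regex.Pattern;

public class CaseFlagsDemo {
    // Hypothetical helper: does the pattern "D WITH STROKE" (uppercase, U+0110)
    // match the lowercase letter U+0111 under the given flags?
    static boolean upperMatchesLower(int flags) {
        return Pattern.compile("\u0110", flags).matcher("\u0111").matches();
    }

    public static void main(String[] args) {
        // ASCII-only case folding: the non-ASCII letter is compared exactly.
        System.out.println(upperMatchesLower(Pattern.CASE_INSENSITIVE)); // false

        // Unicode-aware case folding.
        System.out.println(upperMatchesLower(
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));       // true
    }
}
```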

Note that the order of the characters in VIETNAMESE_DIACRITIC_CHARACTERS is significant. I don't recommend changing the order of the characters unless you understand the implication.

The input should be normalized with Canonical Decomposition (NFD) or Canonical Composition (NFC) before matching is performed on it. This ensures that the combining marks are in canonical order.

Regardless of whether the input is preprocessed with Canonical Composition or Canonical Decomposition, the result looks the same. Running the code in the appendix should return visually identical result for the second and the third output:

Bạn chính là tác giả của Wikipedia Mọi người đều có thể biên tập bài ngay lập tức chỉ cần nhớ vài quy tắc Có sẵn rất nhiều trang trợ giúp như tạo bài sửa bài hay tải ảnh Bạn cũng đừng ngại đặt câu hỏi Hiện chúng ta có bài viết và thành viên

Bạn chính là tác giả của Wikipedia Mọi người đều có thể biên tập bài ngay lập tức chỉ cần nhớ vài quy tắc Có sẵn rất nhiều trang trợ giúp như tạo bài sửa bài hay tải ảnh Bạn cũng đừng ngại đặt câu hỏi Hiện chúng ta có bài viết và thành viên
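If the input is always normalized to NFC first, every Vietnamese letter becomes a single precomposed code point, and a plain character class without CANON_EQ may be enough for validating a field. The following is a sketch of that alternative, not the CANON_EQ construction above. The class name VietnameseInputCheck and the method isValid are hypothetical, and the letters are written as escapes and ranges (U+1EA0 to U+1EF9 holds the tone-marked letters in both cases).

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class VietnameseInputCheck {
    // English letters, digits and the precomposed Vietnamese letters.
    // Both cases are listed explicitly, so no case flags are needed.
    private static final Pattern ALLOWED = Pattern.compile(
        "[A-Za-z0-9"
        // Uppercase letters in Latin-1: A, E, I, O, U, Y with grave/acute/circumflex/tilde
        + "\u00C0-\u00C3\u00C8-\u00CA\u00CC\u00CD\u00D2-\u00D5\u00D9\u00DA\u00DD"
        // Their lowercase counterparts
        + "\u00E0-\u00E3\u00E8-\u00EA\u00EC\u00ED\u00F2-\u00F5\u00F9\u00FA\u00FD"
        // A with breve, D with stroke, I/U with tilde, O/U with horn (both cases)
        + "\u0102\u0103\u0110\u0111\u0128\u0129\u0168\u0169\u01A0\u01A1\u01AF\u01B0"
        // Latin Extended Additional: Vietnamese tone-marked letters
        + "\u1EA0-\u1EF9"
        + "]+");

    // Hypothetical helper: true if the whole input consists of allowed characters.
    static boolean isValid(String input) {
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        return ALLOWED.matcher(composed).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Vi\u1EC7t123"));    // true, precomposed input
        System.out.println(isValid("o\u031B\u0323"));   // true, composed by NFC first
        System.out.println(isValid("f\u00FCr"));        // false, u with diaeresis
    }
}
```

The trade-off is that the caller must not skip the normalization step. Without it, input typed in combining mark mode is rejected.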

Failed attempts

Here are some failed attempts, which will be used to explain why the regex is constructed as shown above.

Attempt 1

String VIETNAMESE_DIACRITIC_CHARACTERS
        = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";

Pattern p =
    Pattern.compile("[A-Z" + VIETNAMESE_DIACRITIC_CHARACTERS + "]++",
                    Pattern.CANON_EQ |
                    Pattern.CASE_INSENSITIVE |
                    Pattern.UNICODE_CASE);

Why don't we include A-Z into a single character class instead of putting it in a separate character class and alternate with the diacritic character class?

Nope, the result is broken when we try to match on the Canonical Decomposition of the input string. The diacritics are not matched at all.

Ba n chi nh la ta c gia cu a Wikipedia Mo i ngu o i đe u co the bie n ta p ba i ngay la p tu c chi ca n nho va i quy ta c Co sa n ra t nhie u trang tro giu p nhu ta o ba i su a ba i hay ta i a nh Ba n cu ng đu ng nga i đa t ca u ho i Hie n chu ng ta co ba i vie t va tha nh vie n

Attempt 2

String VIETNAMESE_DIACRITIC_CHARACTERS
        = "ÁÀÃẢẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒÕỎỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";

Pattern p =
    Pattern.compile("(?:[" + VIETNAMESE_DIACRITIC_CHARACTERS + "]|[A-Z])++",
                    Pattern.CANON_EQ |
                    Pattern.CASE_INSENSITIVE |
                    Pattern.UNICODE_CASE);

The diacritic characters are declared in a character class, so the code should behave the same when I change the order of the character... Right?

Nope, some results are broken when we try to match on the Canonical Decomposition of the input string.

Bạn chính là tác giả của Wikipedia Mọi ngươ i đê u có thê biên tạ p bài ngay lạ p tư c chỉ câ n nhơ vài quy tă c Có să n râ t nhiê u trang trơ giúp như tạo bài sư a bài hay tải ảnh Bạn cũng đư ng ngại đạ t câu hỏi Hiẹ n chúng ta có bài viê t và thành viên

Explanation

The reference implementation (Oracle) implements Pattern.CANON_EQ mode by picking out the characters in the expression that can be expanded into multiple characters under Canonical Decomposition and performing a textual transformation of the regex. The transformed expression is then compiled as normal.

The first pass to transform the regex doesn't parse the expression properly, so it exhibits crazy behavior for very simple matching as seen in the bug reports above.

Fortunately, Pattern class spits out the regex after the transformation if there is an unmatched ( in the regex. Therefore, we can just add ( at the end to trigger PatternSyntaxException and look at the transformed regex string.
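The trick can be wrapped in a small helper. This is a sketch with invented names (TransformedRegexDump, dump), and what the exception reports depends on the Pattern implementation of the Java version in use, so the printed regex may differ from the dumps shown in this answer.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class TransformedRegexDump {
    // Hypothetical helper: appends an unmatched "(" so that compilation fails,
    // then returns the regex string reported by the exception.
    static String dump(String regex, int flags) {
        try {
            Pattern.compile(regex + "(", flags);
            // Only reached if the trailing "(" did not open a group,
            // e.g. when the regex ends with a backslash.
            return null;
        } catch (PatternSyntaxException e) {
            return e.getPattern();
        }
    }

    public static void main(String[] args) {
        // U+00C2 is A WITH CIRCUMFLEX, U+1EA4 is A WITH CIRCUMFLEX AND ACUTE.
        String regex = "(?:[\u1EA4\u00C2]|[A-Z])++";
        System.out.println(dump(regex, Pattern.CANON_EQ
                | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));
    }
}
```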

Let's mess up the solution regex above and see what is the regex string that enters the compilation step:

java.util.regex.PatternSyntaxException: Unclosed group near index 596
(?:(?:[?]|A??|??|?|A??|??|?|A??|??|?|A??|??|?|A??|??|?|A??|??|?|A?|?|A??|Â?|?|A??|Â?|?|A??|Â?|?|A??|Â?|?|A??|??|?|A??|Â?|?|A?|Â|A?|Á|A?|À|A?|Ã|A?|?|A?|?|E??|Ê?|?|E??|Ê?|?|E??|Ê?|?|E??|Ê?|?|E??|??|?|E??|Ê?|?|E?|Ê|E?|É|E?|È|E?|?|E?|?|E?|?|I?|Í|I?|Ì|I?|?|I?|?|I?|?|O??|Ô?|?|O??|Ô?|?|O??|Ô?|?|O??|Ô?|?|O??|??|?|O??|Ô?|?|O?|Ô|O??|??|?|O??|Ó?|?|O??|??|?|O??|Ò?|?|O??|??|?|O??|??|?|O??|??|?|O??|Õ?|?|O??|??|?|O??|??|?|O?|?|O?|Ó|O?|Ò|O?|Õ|O?|?|O?|?|U??|??|?|U??|Ú?|?|U??|??|?|U??|Ù?|?|U??|??|?|U??|??|?|U??|??|?|U??|??|?|U??|??|?|U??|??|?|U?|?|U?|Ú|U?|Ù|U?|?|U?|?|U?|?|Y?|Ý|Y?|?|Y?|?|Y?|?|Y?|?)|[A-Z])++(


As we can see, the engine grabs all the characters that can expand under Canonical Decomposition, takes them out of the character class and builds an alternation from them.

It is still not very clear what is happening with the same characters repeating in alternation, so I will insert space between every character:

( ? : ( ? : [ ? ] | A ? ? | ? ? | ? | A ? ? | ? ? | ? | A ? ? | ? ? | ? | A ? ? | ? ? | ? | A ? ? | ? ? | ? | A ? ? | ? ? | ? | A ? | ? | A ? ? | Â ? | ? | A ? ? | Â ? | ? | A ? ? | Â ? | ? | A ? ? | Â ? | ? | A ? ? | ? ? | ? | A ? ? | Â ? | ? | A ? | Â | A ? | Á | A ? | À | A ? | Ã | A ? | ? | A ? | ? | E ? ? | Ê ? | ? | E ? ? | Ê ? | ? | E ? ? | Ê ? | ? | E ? ? | Ê ? | ? | E ? ? | ? ? | ? | E ? ? | Ê ? | ? | E ? | Ê | E ? | É | E ? | È | E ? | ? | E ? | ? | E ? | ? | I ? | Í | I ? | Ì | I ? | ? | I ? | ? | I ? | ? | O ? ? | Ô ? | ? | O ? ? | Ô ? | ? | O ? ? | Ô ? | ? | O ? ? | Ô ? | ? | O ? ? | ? ? | ? | O ? ? | Ô ? | ? | O ? | Ô | O ? ? | ? ? | ? | O ? ? | Ó ? | ? | O ? ? | ? ? | ? | O ? ? | Ò ? | ? | O ? ? | ? ? | ? | O ? ? | ? ? | ? | O ? ? | ? ? | ? | O ? ? | Õ ? | ? | O ? ? | ? ? | ? | O ? ? | ? ? | ? | O ? | ? | O ? | Ó | O ? | Ò | O ? | Õ | O ? | ? | O ? | ? | U ? ? | ? ? | ? | U ? ? | Ú ? | ? | U ? ? | ? ? | ? | U ? ? | Ù ? | ? | U ? ? | ? ? | ? | U ? ? | ? ? | ? | U ? ? | ? ? | ? | U ? ? | ? ? | ? | U ? ? | ? ? | ? | U ? ? | ? ? | ? | U ? | ? | U ? | Ú | U ? | Ù | U ? | ? | U ? | ? | U ? | ? | Y ? | Ý | Y ? | ? | Y ? | ? | Y ? | ? | Y ? | ? ) | [ A - Z ] ) + + (

We can see that the runs of seemingly identical characters are not really identical: they are different sequences that represent the same character.

With the same method, let us analyze the regex in attempt 2 to see why it fails.

java.util.regex.PatternSyntaxException: Unclosed group near index 596
(?:(?:[?]|A?|Á|A?|À|A?|Ã|A?|?|A?|?|A?|?|A??|??|?|A??|??|?|A??|??|?|A??|??|?|A??|??|?|A??|??|?|A?|Â|A??|Â?|?|A??|Â?|?|A??|Â?|?|A??|Â?|?|A??|??|?|A??|Â?|?|E?|É|E?|È|E?|?|E?|?|E?|?|E?|Ê|E??|Ê?|?|E??|Ê?|?|E??|Ê?|?|E??|Ê?|?|E??|??|?|E??|Ê?|?|I?|Í|I?|Ì|I?|?|I?|?|I?|?|O?|Ó|O?|Ò|O?|Õ|O?|?|O?|?|O?|Ô|O??|Ô?|?|O??|Ô?|?|O??|Ô?|?|O??|Ô?|?|O??|??|?|O??|Ô?|?|O?|?|O??|??|?|O??|Ó?|?|O??|??|?|O??|Ò?|?|O??|??|?|O??|??|?|O??|??|?|O??|Õ?|?|O??|??|?|O??|??|?|U?|Ú|U?|Ù|U?|?|U?|?|U?|?|U?|?|U??|??|?|U??|Ú?|?|U??|??|?|U??|Ù?|?|U??|??|?|U??|??|?|U??|??|?|U??|??|?|U??|??|?|U??|??|?|Y?|Ý|Y?|?|Y?|?|Y?|?|Y?|?)|[A-Z])++(


Insert space between every character:

( ? : ( ? : [ ? ] | A ? | Á | A ? | À | A ? | Ã | A ? | ? | A ? | ? | A ? | ? | A ? ? | ? ? | ? | A ? ? | ? ? | ? | A ? ? | ? ? | ? | A ? ? | ? ? | ? | A ? ? | ? ? | ? | A ? ? | ? ? | ? | A ? | Â | A ? ? | Â ? | ? | A ? ? | Â ? | ? | A ? ? | Â ? | ? | A ? ? | Â ? | ? | A ? ? | ? ? | ? | A ? ? | Â ? | ? | E ? | É | E ? | È | E ? | ? | E ? | ? | E ? | ? | E ? | Ê | E ? ? | Ê ? | ? | E ? ? | Ê ? | ? | E ? ? | Ê ? | ? | E ? ? | Ê ? | ? | E ? ? | ? ? | ? | E ? ? | Ê ? | ? | I ? | Í | I ? | Ì | I ? | ? | I ? | ? | I ? | ? | O ? | Ó | O ? | Ò | O ? | Õ | O ? | ? | O ? | ? | O ? | Ô | O ? ? | Ô ? | ? | O ? ? | Ô ? | ? | O ? ? | Ô ? | ? | O ? ? | Ô ? | ? | O ? ? | ? ? | ? | O ? ? | Ô ? | ? | O ? | ? | O ? ? | ? ? | ? | O ? ? | Ó ? | ? | O ? ? | ? ? | ? | O ? ? | Ò ? | ? | O ? ? | ? ? | ? | O ? ? | ? ? | ? | O ? ? | ? ? | ? | O ? ? | Õ ? | ? | O ? ? | ? ? | ? | O ? ? | ? ? | ? | U ? | Ú | U ? | Ù | U ? | ? | U ? | ? | U ? | ? | U ? | ? | U ? ? | ? ? | ? | U ? ? | Ú ? | ? | U ? ? | ? ? | ? | U ? ? | Ù ? | ? | U ? ? | ? ? | ? | U ? ? | ? ? | ? | U ? ? | ? ? | ? | U ? ? | ? ? | ? | U ? ? | ? ? | ? | U ? ? | ? ? | ? | Y ? | Ý | Y ? | ? | Y ? | ? | Y ? | ? | Y ? | ? ) | [ A - Z ] ) + + (

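Notice that the alternatives for Â (A + U+0302 | Â) come before the alternatives for Ấ (A + U+0302 + U+0301 | Â + U+0301 | Ấ) in the regex. This means that A + U+0302 is tried first on the decomposed input Ấ (A + U+0302 + U+0301) and matches, and the repetition ends when the next iteration fails to match the leftover U+0301.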

Since the order of the alternation is important, as a general rule, between 2 strings where one string is a prefix of the other, the longer string should go first in the alternation. In our case, we need to place the characters with more diacritics before the character with less or without diacritics.
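The prefix rule can be reproduced without CANON_EQ at all, using hand-written alternations over the decomposed form of Ấ (A + U+0302 + U+0301). The class and helper names in this sketch are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {
    // Hypothetical helper: returns the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // A, combining circumflex, combining acute (3 chars)
        String input = "A\u0302\u0301";

        // Shorter alternative first: stops after A + circumflex,
        // leaving the acute accent unmatched.
        String shortFirst = "(?:A\u0302|A\u0302\u0301)++";
        System.out.println(firstMatch(shortFirst, input).length()); // 2

        // Longer alternative first: consumes the whole sequence.
        String longFirst = "(?:A\u0302\u0301|A\u0302)++";
        System.out.println(firstMatch(longFirst, input).length());  // 3
    }
}
```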

Same problem with attempt 1:

java.util.regex.PatternSyntaxException: Unclosed group near index 589
(?:[A-Z?]|A??|??|?|A??|??|?|A??|??|?|A??|??|?|A??|??|?|A??|??|?|A?|?|A??|Â?|?|A??|Â?|?|A??|Â?|?|A??|Â?|?|A??|??|?|A??|Â?|?|A?|Â|A?|Á|A?|À|A?|Ã|A?|?|A?|?|E??|Ê?|?|E??|Ê?|?|E??|Ê?|?|E??|Ê?|?|E??|??|?|E??|Ê?|?|E?|Ê|E?|É|E?|È|E?|?|E?|?|E?|?|I?|Í|I?|Ì|I?|?|I?|?|I?|?|O??|Ô?|?|O??|Ô?|?|O??|Ô?|?|O??|Ô?|?|O??|??|?|O??|Ô?|?|O?|Ô|O??|??|?|O??|Ó?|?|O??|??|?|O??|Ò?|?|O??|??|?|O??|??|?|O??|??|?|O??|Õ?|?|O??|??|?|O??|??|?|O?|?|O?|Ó|O?|Ò|O?|Õ|O?|?|O?|?|U??|??|?|U??|Ú?|?|U??|??|?|U??|Ù?|?|U??|??|?|U??|??|?|U??|??|?|U??|??|?|U??|??|?|U??|??|?|U?|?|U?|Ú|U?|Ù|U?|?|U?|?|U?|?|Y?|Ý|Y?|?|Y?|?|Y?|?|Y?|?)++(


Since the alternations are formed after the original character class, the vowels in [A-Z] will be tried first, leading to the repetition terminating early when it encounters a stray combining mark.


Appendix

Below is the source code of the testing program.

Demo on ideone

import java.util.regex.*;
import java.text.*;

class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        String VIETNAMESE_DIACRITIC_CHARACTERS
            = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";
        /*
        for (char c: VIETNAMESE_DIACRITIC_CHARACTERS.toCharArray()) {
            System.out.println(c + ": " + Character.getName(c));
        }
        */

        String tests[] = new String[3];
        tests[0] =
            "Bạn chính là tác giả của Wikipedia!\n" +
            "Mọi người đều có thể biên tập bài ngay lập tức, chỉ cần nhớ vài quy tắc." +
            "Có sẵn rất nhiều trang trợ giúp như tạo bài, sửa bài hay tải ảnh." +
            "Bạn cũng đừng ngại đặt câu hỏi.\n" +
            "Hiện chúng ta có 1.109.446 bài viết và 406.782 thành viên.";

        tests[1] =
            Normalizer.normalize(tests[0], Normalizer.Form.NFD);
        /*
        for (char c: tests[1].toCharArray()) {
            System.out.printf("%04x ", (int) c);
        }
        */  
        tests[2] =
            Normalizer.normalize(tests[0], Normalizer.Form.NFC);

        try {
            Pattern p = Pattern.compile("(?:[" + VIETNAMESE_DIACRITIC_CHARACTERS + "]|[A-Z])++", Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

            for (String t: tests) {
                Matcher m = p.matcher(t);
                while (m.find()) {
                    System.out.print(m.group() + " ");
                }
                System.out.println();
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}