Is there a character set of all international full-stop punctuation?

JDe*_*age 7 unicode parsing character-encoding punctuation string-parsing

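I'm trying to parse UTF-8 strings into "bite-sized" segments. For example, I'd like to break a text down into "sentences".

Is there a comprehensive set of characters (or a regex) that corresponds to the end of a sentence in all languages? I'm looking for something that would capture the Latin period, the exclamation and question marks, the Chinese and Japanese full stop, and so on.

Something like the above, but for the equivalent of a comma, would also be great.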

tch*_*ist 6

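You need to look at code points that have the \p{Sentence_Break=STerm} or \p{Sentence_Break=ATerm} property and also have the \p{Terminal_Punctuation} property. Running the unichars script against Unicode v6.1, we learn that these code points meet all of those criteria: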

$ unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
U+00021 ? !  GC=Po SC=Common       EXCLAMATION MARK
U+0002E ? .  GC=Po SC=Common       FULL STOP
U+0003F ? ?  GC=Po SC=Common       QUESTION MARK
U+00589 ? ?  GC=Po SC=Common       ARMENIAN FULL STOP
U+0061F ? ?  GC=Po SC=Common       ARABIC QUESTION MARK
U+006D4 ? ?  GC=Po SC=Arabic       ARABIC FULL STOP
U+00700 ? ?  GC=Po SC=Syriac       SYRIAC END OF PARAGRAPH
U+00701 ? ?  GC=Po SC=Syriac       SYRIAC SUPRALINEAR FULL STOP
U+00702 ? ?  GC=Po SC=Syriac       SYRIAC SUBLINEAR FULL STOP
U+007F9 ? ?  GC=Po SC=Nko          NKO EXCLAMATION MARK
U+00964 ? ?  GC=Po SC=Common       DEVANAGARI DANDA
U+00965 ? ?  GC=Po SC=Common       DEVANAGARI DOUBLE DANDA
U+0104A ? ?  GC=Po SC=Myanmar      MYANMAR SIGN LITTLE SECTION
U+0104B ? ?  GC=Po SC=Myanmar      MYANMAR SIGN SECTION
U+01362 ? ?  GC=Po SC=Ethiopic     ETHIOPIC FULL STOP
U+01367 ? ?  GC=Po SC=Ethiopic     ETHIOPIC QUESTION MARK
U+01368 ? ?  GC=Po SC=Ethiopic     ETHIOPIC PARAGRAPH SEPARATOR
U+0166E ? ?  GC=Po SC=Canadian_Aboriginal CANADIAN SYLLABICS FULL STOP
U+01803 ? ?  GC=Po SC=Common       MONGOLIAN FULL STOP
U+01809 ? ?  GC=Po SC=Mongolian    MONGOLIAN MANCHU FULL STOP
U+01944 ? ?  GC=Po SC=Limbu        LIMBU EXCLAMATION MARK
U+01945 ? ?  GC=Po SC=Limbu        LIMBU QUESTION MARK
U+01AA8 ? ?  GC=Po SC=Tai_Tham     TAI THAM SIGN KAAN
U+01AA9 ? ?  GC=Po SC=Tai_Tham     TAI THAM SIGN KAANKUU
U+01AAA ? ?  GC=Po SC=Tai_Tham     TAI THAM SIGN SATKAAN
U+01AAB ? ?  GC=Po SC=Tai_Tham     TAI THAM SIGN SATKAANKUU
U+01B5A ? ?  GC=Po SC=Balinese     BALINESE PANTI
U+01B5B ? ?  GC=Po SC=Balinese     BALINESE PAMADA
U+01B5E ? ?  GC=Po SC=Balinese     BALINESE CARIK SIKI
U+01B5F ? ?  GC=Po SC=Balinese     BALINESE CARIK PAREREN
U+01C3B ? ?  GC=Po SC=Lepcha       LEPCHA PUNCTUATION TA-ROL
U+01C3C ? ?  GC=Po SC=Lepcha       LEPCHA PUNCTUATION NYET THYOOM TA-ROL
U+01C7E ? ?  GC=Po SC=Ol_Chiki     OL CHIKI PUNCTUATION MUCAAD
U+01C7F ? ?  GC=Po SC=Ol_Chiki     OL CHIKI PUNCTUATION DOUBLE MUCAAD
U+0203C ? ?  GC=Po SC=Common       DOUBLE EXCLAMATION MARK
U+0203D ? ?  GC=Po SC=Common       INTERROBANG
U+02047 ? ?  GC=Po SC=Common       DOUBLE QUESTION MARK
U+02048 ? ?  GC=Po SC=Common       QUESTION EXCLAMATION MARK
U+02049 ? ?  GC=Po SC=Common       EXCLAMATION QUESTION MARK
U+02E2E ? ?  GC=Po SC=Common       REVERSED QUESTION MARK
U+03002 ? ? GC=Po SC=Common       IDEOGRAPHIC FULL STOP
U+0A4FF ? ?  GC=Po SC=Lisu         LISU PUNCTUATION FULL STOP
U+0A60E ? ?  GC=Po SC=Vai          VAI FULL STOP
U+0A60F ? ?  GC=Po SC=Vai          VAI QUESTION MARK
U+0A6F3 ? ?  GC=Po SC=Bamum        BAMUM FULL STOP
U+0A6F7 ? ?  GC=Po SC=Bamum        BAMUM QUESTION MARK
U+0A876 ? ?  GC=Po SC=Phags_Pa     PHAGS-PA MARK SHAD
U+0A877 ? ?  GC=Po SC=Phags_Pa     PHAGS-PA MARK DOUBLE SHAD
U+0A8CE ? ?  GC=Po SC=Saurashtra   SAURASHTRA DANDA
U+0A8CF ? ?  GC=Po SC=Saurashtra   SAURASHTRA DOUBLE DANDA
U+0A92F ? ?  GC=Po SC=Kayah_Li     KAYAH LI SIGN SHYA
U+0A9C8 ? ?  GC=Po SC=Javanese     JAVANESE PADA LINGSA
U+0A9C9 ? ?  GC=Po SC=Javanese     JAVANESE PADA LUNGSI
U+0AA5D ? ?  GC=Po SC=Cham         CHAM PUNCTUATION DANDA
U+0AA5E ? ?  GC=Po SC=Cham         CHAM PUNCTUATION DOUBLE DANDA
U+0AA5F ? ?  GC=Po SC=Cham         CHAM PUNCTUATION TRIPLE DANDA
U+0AAF0 ? ?  GC=Po SC=Meetei_Mayek MEETEI MAYEK CHEIKHAN
U+0AAF1 ? ?  GC=Po SC=Meetei_Mayek MEETEI MAYEK AHANG KHUDAM
U+0ABEB ? ?  GC=Po SC=Meetei_Mayek MEETEI MAYEK CHEIKHEI
U+0FE52 ? ? GC=Po SC=Common       SMALL FULL STOP
U+0FE56 ? ? GC=Po SC=Common       SMALL QUESTION MARK
U+0FE57 ? ? GC=Po SC=Common       SMALL EXCLAMATION MARK
U+0FF01 ? ? GC=Po SC=Common       FULLWIDTH EXCLAMATION MARK
U+0FF0E ? ? GC=Po SC=Common       FULLWIDTH FULL STOP
U+0FF1F ? ? GC=Po SC=Common       FULLWIDTH QUESTION MARK
U+0FF61 ? ?  GC=Po SC=Common       HALFWIDTH IDEOGRAPHIC FULL STOP
U+11047 ?   GC=Po SC=Brahmi       BRAHMI DANDA
U+11048 ?   GC=Po SC=Brahmi       BRAHMI DOUBLE DANDA
U+110BE ?   GC=Po SC=Kaithi       KAITHI SECTION MARK
U+110BF ?   GC=Po SC=Kaithi       KAITHI DOUBLE SECTION MARK
U+110C0 ?   GC=Po SC=Kaithi       KAITHI DANDA
U+110C1 ?   GC=Po SC=Kaithi       KAITHI DOUBLE DANDA
U+11141 ?   GC=Po SC=Chakma       CHAKMA DANDA
U+11142 ?   GC=Po SC=Chakma       CHAKMA DOUBLE DANDA
U+11143 ?   GC=Po SC=Chakma       CHAKMA QUESTION MARK
U+111C5 ?   GC=Po SC=Sharada      SHARADA DANDA
U+111C6 ?   GC=Po SC=Sharada      SHARADA DOUBLE DANDA
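As a rough illustration of how a list like this might be used, here is a minimal Python sketch that splits text after any run of terminator characters. The constant `SENTENCE_TERMINATORS` and the function `split_sentences` are hypothetical names, and the character set is only a hand-copied subset of the listing above, not the full set.

```python
import re

# Hand-copied SUBSET of the code points listed above (assumption: extend as needed).
SENTENCE_TERMINATORS = (
    "\u0021\u002E\u003F"   # ! . ?
    "\u0589"               # ARMENIAN FULL STOP
    "\u061F\u06D4"         # ARABIC QUESTION MARK, ARABIC FULL STOP
    "\u0964\u0965"         # DEVANAGARI DANDA, DOUBLE DANDA
    "\u1362"               # ETHIOPIC FULL STOP
    "\u3002"               # IDEOGRAPHIC FULL STOP
    "\uFF01\uFF0E\uFF1F"   # FULLWIDTH ! . ?
    "\uFF61"               # HALFWIDTH IDEOGRAPHIC FULL STOP
)

# One or more consecutive terminators count as a single sentence end.
_TERMINATOR_RUN = re.compile("[" + re.escape(SENTENCE_TERMINATORS) + "]+")


def split_sentences(text):
    """Split text after each run of terminators; purely character-based."""
    sentences = []
    start = 0
    for match in _TERMINATOR_RUN.finditer(text):
        piece = text[start:match.end()].strip()
        if piece:
            sentences.append(piece)
        start = match.end()
    tail = text[start:].strip()  # trailing text without a terminator
    if tail:
        sentences.append(tail)
    return sentences
```

This looks at characters only, so it will also split inside strings such as "1.5"; it is a starting point rather than a real sentence segmenter.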

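To go the other way, that is, to find the properties of a given code point rather than the code points for a given set of properties, use the accompanying uniprops script, which pulls out all properties of a given code point: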

$ uniprops -a . \? \!
U+002E ‹.› \N{FULL STOP}
    \pP \p{Po}
    All Any ASCII Assigned Basic_Latin Case_Ignorable CI Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
       Pattern_Syntax PatSyn POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print
       X_POSIX_Punct
    Age=1.1 Block=Basic_Latin Bidi_Class=Common_Separator BC=CS Bidi_Class=CS Block=ASCII BLK=ASCII Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na
       East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
       Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
       Line_Break=Infix_Numeric LB=IS Line_Break=IS Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0
       Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0
       IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=AT Sentence_Break=ATerm SB=AT
       Word_Break=MB Word_Break=MidNumLet WB=MB _Case_Ignorable _X_Begin
U+003F ‹?› \N{QUESTION MARK}
    \pP \p{Po}
    All Any ASCII Assigned Basic_Latin Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn Pattern_Syntax PatSyn
       POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
    Age=1.1 Block=Basic_Latin Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=ASCII BLK=ASCII Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na
       East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
       Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
       Line_Break=EX Line_Break=Exclamation LB=EX Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0
       Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0
       IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=ST Sentence_Break=STerm SB=ST
       Word_Break=Other WB=XX Word_Break=XX _X_Begin
U+0021 ‹!› \N{EXCLAMATION MARK}
    \pP \p{Po}
    All Any ASCII Assigned Basic_Latin Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn Pattern_Syntax PatSyn
       POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
    Age=1.1 Block=Basic_Latin Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=ASCII BLK=ASCII Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na
       East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
       Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
       Line_Break=EX Line_Break=Exclamation LB=EX Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0
       Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0
       IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=ST Sentence_Break=STerm SB=ST
       Word_Break=Other WB=XX Word_Break=XX _X_Begin
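For readers without the Perl scripts at hand, a small part of this lookup can be approximated with Python's standard `unicodedata` module. This is only a sketch: `describe` is a hypothetical helper, and `unicodedata` does not expose Sentence_Break or Terminal_Punctuation, so it cannot replace uniprops for this question.

```python
import unicodedata


def describe(ch):
    """Return a few basic Unicode properties of a single character."""
    return {
        "code_point": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
        "combining_class": unicodedata.combining(ch),
    }


if __name__ == "__main__":
    for ch in ".?!":
        print(describe(ch))
```

The values it reports for the full stop (name FULL STOP, category Po, Bidi class CS, East Asian width Na, combining class 0) agree with the corresponding fields in the uniprops output above.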

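I suspect you should be checking the whole Sentence_Break property.

There is also a third script in the suite, uninames, which does things like this: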

$ uninames sentence
 ;  037E        GREEK QUESTION MARK
        = erotimatiko
        * sentence-final punctuation
        * 003B is the preferred character
        x (question mark - 003F)
        : 003B semicolon
 ?  205A        TWO DOT PUNCTUATION
        * historically used to indicate the end of a sentence or change of speaker
        * extends from baseline to cap height
        x (presentation form for vertical two dot leader - FE30)
        x (greek acrophonic epidaurean two - 1015B)
   110BE       KAITHI SECTION MARK
        * marks end of sentence
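A similar, much more limited search can be sketched in Python by scanning character names. The helper `find_by_name` is hypothetical, and it matches formal character names only; unlike the uninames output above, it does not see the annotations (such as "marks end of sentence"), because `unicodedata` does not include them.

```python
import sys
import unicodedata


def find_by_name(fragment):
    """Return (code point, name) pairs whose character name contains fragment."""
    fragment = fragment.upper()
    hits = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed code points (controls, surrogates, unassigned) yield "".
        name = unicodedata.name(chr(cp), "")
        if fragment in name:
            hits.append((cp, name))
    return hits
```

For example, `find_by_name("full stop")` returns entries such as FULL STOP, ARMENIAN FULL STOP and IDEOGRAPHIC FULL STOP; the exact list depends on the Unicode version bundled with the Python interpreter.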

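I find these three programs indispensable for exploring Unicode properties. You can install them from CPAN as part of the Unicode::Tussle suite, or check them out individually here.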

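  • The Sentence_Break property classifies characters according to whether they *may* terminate a sentence or some other grammatical construct. This information is not language-sensitive: a sentence terminator in one language may be just a word separator in another. UAX #29 http://unicode.org/reports/tr29/ has some information on using this data for text segmentation, along with some considerable limitations. (3 upvotes)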

Juk*_*ela 3

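I have not come across any compilation of such information, and I would expect that collecting it would take considerable effort. For some widely used languages, you can get the information from The Chicago Manual of Style. Some information on the punctuation commonly used in different languages is available at http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters-other.html but it covers only a small set of languages and does not distinguish sentence-terminating characters.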

Using characters alone isn't sufficient, because in English, for example, the full stop “.” appears in many contexts where it does not terminate a sentence, as in “e.g.” or “1.5”.
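To make the point concrete, here is a small Python sketch comparing a purely character-based rule with a slightly guarded one. The names `NAIVE_END`, `GUARDED_END` and `split_at` are hypothetical, and the guard (terminator must be followed by whitespace or end of text) is just one simple heuristic, not a recommended algorithm.

```python
import re

# Naive rule: every ".", "!" or "?" ends a sentence.
NAIVE_END = re.compile(r"[.!?]")
# Guarded rule: the terminator must be followed by whitespace or end of text.
GUARDED_END = re.compile(r"[.!?](?=\s|$)")


def split_at(pattern, text):
    """Cut text after every match of pattern and strip surrounding spaces."""
    pieces, start = [], 0
    for match in pattern.finditer(text):
        piece = text[start:match.end()].strip()
        if piece:
            pieces.append(piece)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        pieces.append(tail)
    return pieces


text = "It costs 1.5 dollars. Fine."
print(split_at(NAIVE_END, text))    # ['It costs 1.', '5 dollars.', 'Fine.']
print(split_at(GUARDED_END, text))  # ['It costs 1.5 dollars.', 'Fine.']
```

The guard keeps "1.5" together, but an abbreviation followed by a space, such as "e.g. apples", is still split after "e.g.", which is why a character list cannot settle the question on its own.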