Convert numbered pinyin to pinyin with tone marks

Vil*_*age 17 python bash cjk

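Is there any script, library, or program using Python or BASH tools (e.g. awk, perl, sed) that can correctly convert numbered pinyin (e.g. dian4 nao3) into UTF-8 pinyin with tone marks (e.g. diàn nǎo)?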

I have found the following examples, but they require PHP or C#:

I have also found various online tools, but they cannot handle a large number of conversions.

Gre*_*ill 19

I have some Python 3 code that does this, and it is small enough to put directly in the answer here.

import re  # needed for re.search in decode_pinyin below

PinyinToneMark = {
    0: "aoeiuv\u00fc",
    1: "\u0101\u014d\u0113\u012b\u016b\u01d6\u01d6",
    2: "\u00e1\u00f3\u00e9\u00ed\u00fa\u01d8\u01d8",
    3: "\u01ce\u01d2\u011b\u01d0\u01d4\u01da\u01da",
    4: "\u00e0\u00f2\u00e8\u00ec\u00f9\u01dc\u01dc",
}

def decode_pinyin(s):
    s = s.lower()
    r = ""
    t = ""
    for c in s:
        if c >= 'a' and c <= 'z':
            t += c
        elif c == ':':
            assert t[-1] == 'u'
            t = t[:-1] + "\u00fc"
        else:
            if c >= '0' and c <= '5':
                tone = int(c) % 5
                if tone != 0:
                    m = re.search("[aoeiuv\u00fc]+", t)
                    if m is None:
                        t += c
                    elif len(m.group(0)) == 1:
                        t = t[:m.start(0)] + PinyinToneMark[tone][PinyinToneMark[0].index(m.group(0))] + t[m.end(0):]
                    else:
                        if 'a' in t:
                            t = t.replace("a", PinyinToneMark[tone][0])
                        elif 'o' in t:
                            t = t.replace("o", PinyinToneMark[tone][1])
                        elif 'e' in t:
                            t = t.replace("e", PinyinToneMark[tone][2])
                        elif t.endswith("ui"):
                            t = t.replace("i", PinyinToneMark[tone][3])
                        elif t.endswith("iu"):
                            t = t.replace("u", PinyinToneMark[tone][4])
                        else:
                            t += "!"
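            # Flush the finished syllable. The separator c itself (a space,
            # punctuation, or a tone digit that was applied) is not copied.
            r += t
            t = ""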
    r += t
    return r

This handles ü, u:, and v, which are all the variants I have come across. Python 2 compatibility would need minor modifications.
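As a rough illustration of the ü handling described above, the three spellings can be folded into one before any tone mark is applied. The helper name `normalize_u_umlaut` is hypothetical and not part of the answer's code, and the sketch assumes the input is pinyin only, since a `v` in ordinary English text would be rewritten as well.

```python
def normalize_u_umlaut(text):
    """Rewrite the ASCII spellings `u:` and `v` of ü as the real letter.

    Hypothetical helper for illustration; assumes pinyin-only input.
    """
    # Handle the two-character spelling first so the colon is consumed.
    text = text.replace("u:", "ü").replace("U:", "Ü")
    # Standard pinyin has no letter v, so it can only stand in for ü.
    return text.replace("v", "ü").replace("V", "Ü")


print(normalize_u_umlaut("lu:e4 nv3"))
```

Doing this normalization up front means the tone-marking step only has to know about a single spelling of ü.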

  • Thanks! FYI, the only change needed for Python 2.x is to add a `u` (for unicode) in front of any string containing `\u....` characters; that fixed it for me. (4 upvotes)

cbu*_*mer 5

The cjklib library covers what you need:

Using the Python shell:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> print f.convert('Bei3jing1', 'Pinyin', 'Pinyin', sourceOptions={'toneMarkType': 'numbers'})
Běijīng

Or just on the command line:

$ cjknife -m Bei3jing1
Běijīng

Disclaimer: I developed that library.


dan*_*i_l 5

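I wrote another Python function that does this. It is case-insensitive and preserves spaces, punctuation, and other text (except for false positives, of course):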

# -*- coding: utf-8 -*-
import re

pinyinToneMarks = {
    u'a': u'āáǎà', u'e': u'ēéěè', u'i': u'īíǐì',
    u'o': u'ōóǒò', u'u': u'ūúǔù', u'ü': u'ǖǘǚǜ',
    u'A': u'ĀÁǍÀ', u'E': u'ĒÉĚÈ', u'I': u'ĪÍǏÌ',
    u'O': u'ŌÓǑÒ', u'U': u'ŪÚǓÙ', u'Ü': u'ǕǗǙǛ'
}

def convertPinyinCallback(m):
    tone=int(m.group(3))%5
    r=m.group(1).replace(u'v', u'ü').replace(u'V', u'Ü')
    # for multiple vowels, use the first one if it is a/e/o, otherwise use the second one
    pos=0
    if len(r)>1 and not r[0] in 'aeoAEO':
        pos=1
    if tone != 0:
        r=r[0:pos]+pinyinToneMarks[r[pos]][tone-1]+r[pos+1:]
    return r+m.group(2)

def convertPinyin(s):
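    # A u'' literal parses on both Python 2.7 and Python 3.3+ (ur'' is a syntax
    # error on Python 3); the pattern has no backslashes, so no raw prefix is needed.
    return re.sub(u'([aeiouüvÜ]{1,3})(n?g?r?)([012345])', convertPinyinCallback, s, flags=re.IGNORECASE)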

print(convertPinyin(u'Ni3 hao3 ma0?'))
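To make the regular expression in the answer above easier to follow, here is a small Python 3 sketch that only prints what each of the three groups captures. The name `SYLLABLE_TAIL` and the sample words are assumptions chosen for illustration; `area51` is included to show the kind of false positive the answer warns about.

```python
import re

# Same pattern as in the answer: (vowel cluster)(optional n/g/r ending)(tone digit).
SYLLABLE_TAIL = re.compile(u"([aeiouüvÜ]{1,3})(n?g?r?)([012345])", re.IGNORECASE)

for word in ["zhuang1", "er2", "Nü3", "area51"]:
    for m in SYLLABLE_TAIL.finditer(word):
        vowels, ending, tone = m.groups()
        # Initial consonants are not part of the match, so they pass through untouched.
        print(word, "->", repr(vowels), repr(ending), repr(tone))
```

Because the pattern anchors on nothing but a vowel cluster followed by a digit from 0 to 5, any non-pinyin token of that shape is converted as well, which is why mixed text may need to be filtered first.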


Eze*_*hez 5

Updated code: note that @Lakedaemon's Kotlin code does not take the tone placement rules into account.

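  • A and e trump all other vowels and always take the tone mark. There are no Mandarin syllables in Hanyu Pinyin that contain both a and e.
  • In the combination ou, o takes the mark.
  • In all other cases, the final vowel takes the mark.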
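The three placement rules listed above can be sketched in Python roughly as follows. The names `TONE_MARKS`, `tone_mark_index` and `mark_vowels` are hypothetical, and the sketch assumes it is handed only the lowercase vowel cluster of a single syllable, with ü already written as the real letter.

```python
# Tone-marked forms of each base vowel, indexed by tone number minus one.
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def tone_mark_index(vowels):
    """Return the index of the vowel in `vowels` that carries the tone mark."""
    v = vowels.lower()
    # Rule 1: a or e always takes the mark (they never occur together).
    for preferred in "ae":
        if preferred in v:
            return v.index(preferred)
    # Rule 2: in the combination ou, o takes the mark.
    if "ou" in v:
        return v.index("ou")
    # Rule 3: in all other cases the final vowel takes the mark.
    return len(v) - 1


def mark_vowels(vowels, tone):
    """Apply tone 1-4 to a lowercase vowel cluster; tone 0 or 5 leaves it bare."""
    if tone % 5 == 0:
        return vowels
    i = tone_mark_index(vowels)
    return vowels[:i] + TONE_MARKS[vowels[i]][tone % 5 - 1] + vowels[i + 1:]


for cluster, tone in [("iao", 3), ("ou", 4), ("ui", 4), ("iu", 2), ("üe", 4)]:
    print(cluster, tone, mark_vowels(cluster, tone))
```

Keeping the position lookup separate from the substitution mirrors the helper-function approach this answer describes, and makes the rules easy to test on their own.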

I originally ported @Lakedaemon's Kotlin code to Java. I have now modified it, and I urge anyone using this or @Lakedaemon's Kotlin code to update it.


I added an extra helper function to get the correct tone mark position.

    // Requires java.util.HashMap, java.util.Map, java.util.regex.Matcher
    // and java.util.regex.Pattern in the enclosing class file.

    private static int getTonePosition(String r) {
        String lowerCase = r.toLowerCase();

        // exception to the rule
        if (lowerCase.equals("ou")) return 0;

        // higher precedence, both never go together
        int preferencePosition = lowerCase.indexOf('a');
        if (preferencePosition >= 0) return preferencePosition;
        preferencePosition = lowerCase.indexOf('e');
        if (preferencePosition >= 0) return preferencePosition;

        // otherwise the last one takes the tone mark
        return lowerCase.length() - 1;
    }

    static public String getCharacter(String string, int position) {
        char[] characters = string.toCharArray();
        return String.valueOf(characters[position]);
    }

    static public String toPinyin(String asciiPinyin) {
        Map<String, String> pinyinToneMarks = new HashMap<>();
        pinyinToneMarks.put("a", "āáǎà"); pinyinToneMarks.put("e", "ēéěè");
        pinyinToneMarks.put("i", "īíǐì"); pinyinToneMarks.put("o",  "ōóǒò");
        pinyinToneMarks.put("u", "ūúǔù"); pinyinToneMarks.put("ü", "ǖǘǚǜ");
        pinyinToneMarks.put("A",  "ĀÁǍÀ"); pinyinToneMarks.put("E", "ĒÉĚÈ");
        pinyinToneMarks.put("I", "ĪÍǏÌ"); pinyinToneMarks.put("O", "ŌÓǑÒ");
        pinyinToneMarks.put("U", "ŪÚǓÙ"); pinyinToneMarks.put("Ü",  "ǕǗǙǛ");

        Pattern pattern = Pattern.compile("([aeiouüvÜ]{1,3})(n?g?r?)([012345])");
        Matcher matcher = pattern.matcher(asciiPinyin);
        StringBuilder s = new StringBuilder();
        int start = 0;

        while (matcher.find(start)) {
            s.append(asciiPinyin, start, matcher.start(1));
            int tone = Integer.parseInt(matcher.group(3)) % 5;
            String r = matcher.group(1).replace("v", "ü").replace("V", "Ü");
            if (tone != 0) {
                int pos = getTonePosition(r);
                s.append(r, 0, pos).append(getCharacter(pinyinToneMarks.get(getCharacter(r, pos)),tone - 1)).append(r, pos + 1, r.length());
            } else {
                s.append(r);
            }
            s.append(matcher.group(2));
            start = matcher.end(3);
        }
        if (start != asciiPinyin.length()) {
            s.append(asciiPinyin, start, asciiPinyin.length());
        }
        return s.toString();
    }