Python: How do I add the string 'ub' before every pronounced vowel in a string?

Sah*_*bov 7 python regex string nlp

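Example: Speak -> Spubeak, more info here

Don't give me a complete solution; just point me in the right direction or tell me which Python library I could use. I'm thinking of a regex, since I have to find a vowel, but which method could I use to insert 'ub' in front of a vowel?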

jfs*_*jfs 9

It is more complicated than a simple regex, e.g.,

"Hi, how are you?" → "Hubi, hubow ubare yubou?"

A simple regex won't catch that the e in "are" is not pronounced.
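To make the problem concrete, here is a minimal sketch of the naive letter-based approach. The function name naive_ub is made up for illustration; it treats every run of the letters a, e, i, o, u as a vowel, whether or not it is pronounced:

```python
import re

def naive_ub(text):
    # Insert 'ub' before every run of vowel *letters*, pronounced or not.
    return re.sub(r"[aeiou]+", lambda m: "ub" + m.group(0), text,
                  flags=re.IGNORECASE)

print(naive_ub("Hi, how are you?"))
# prints: Hubi, hubow ubarube yubou?
```

The silent e in "are" gets its own 'ub', giving "ubarube" instead of the desired "ubare".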

You need a library that provides a pronunciation dictionary, such as nltk.corpus.cmudict:

from nltk.corpus import cmudict # $ pip install nltk
# $ python -c "import nltk; nltk.download('cmudict')"

def spubeak(word, pronunciations=cmudict.dict()):
    istitle = word.istitle() # remember, to preserve titlecase
    w = word.lower() # note: ignore Unicode case-folding
    # each entry is a list of phonemes; only the first pronunciation is used
    for syllables in pronunciations.get(w, []):
        parts = []
        for syl in syllables:
            if syl[:1] == syl[1:2]:
                syl = syl[1:] # remove duplicate letter, e.g. 'AA1' -> 'A1'
            isvowel = syl[-1].isdigit() # vowel phonemes end in a stress digit
            # prefix vowels with 'ub' and drop the stress digit
            parts.append('ub'+syl[:-1] if isvowel else syl)
        result = ''.join(map(str.lower, parts))
        return result.title() if istitle else result
    return word # word not found in the dictionary

Example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

sent = "Hi, how are you?"
# split on whitespace, then keep punctuation as separate pieces via the capture group
subent = " ".join(["".join(map(spubeak, re.split(r"(\W+)", nonblank)))
                   for nonblank in sent.split()])
print('"{}" → "{}"'.format(sent, subent))

Output

"Hi, how are you?" → "Hubay, hubaw ubar yubuw?"

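Note: this differs from the first example: each word is replaced by its pronunciation (the phonemes from the dictionary), not by its original spelling.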
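The difference comes from the dictionary format. As a rough illustration, the sketch below uses a small hand-copied excerpt in CMUdict's ARPAbet notation (SAMPLE and vowel_phonemes are names invented here; the real data comes from nltk.corpus.cmudict). Vowel phonemes carry a trailing stress digit, which is what the isdigit() check in the answer relies on:

```python
# Hand-copied excerpt for illustration only (an assumption, not loaded
# from nltk); one pronunciation per word, as a list of ARPAbet phonemes.
SAMPLE = {
    "hi": ["HH", "AY1"],
    "how": ["HH", "AW1"],
    "are": ["AA1", "R"],
    "you": ["Y", "UW1"],
}

def vowel_phonemes(phonemes):
    # Vowel phonemes end in a stress digit (0, 1 or 2); consonants do not.
    return [p for p in phonemes if p[-1].isdigit()]

for word, phonemes in SAMPLE.items():
    print(word, phonemes, vowel_phonemes(phonemes))
```

Each of these words has exactly one vowel phoneme, so each gets exactly one 'ub'; "are" has no phoneme for its silent e, which is why the output reads "ubar".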

  • I have yet to see any approach that correctly identifies *spY, AcRe, fIRe, fIeRY, lIttle, rhYthM, queUe, Nth, pSst, yEarlY*, but these are good test cases for it. (2 upvotes)