JavaScript - normalize accented Greek characters

tgo*_*gos 5 javascript regex character normalization

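I am trying to apply some form of normalization to Greek text (use lowercase letters, remove accents, and replace ? with ?). For example, I want both "?????????" (polytonic Greek) and "?????????" (modern Greek) to become "?????????". I went through unicode-table.com and noted down which characters should be replaced.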

Greek and Coptic (Range: 0370-03FF)
ΆΑά -> α
ΈΕέ -> ε
ΉΗή -> η
????? -> ?
??? -> ?
?????? -> ?
??? -> ?

Greek Extended (Range: 1F00-1FFF)
?????????????????????????????????????????????? -> ?
???????????????? -> ?
?????????????????????????????????????????? -> ?
???????????????????????????? -> ?
???????????????? -> ?
???????????????????????? -> ?
?????????????????????????????????????????? -> ?
??? -> ?

I would like to know whether there is a clever way to make these replacements and avoid checking the string character by character.

First attempt (thanks to @Tyblitz; see that answer below).

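Second attempt:

Please check my answer below. It makes use of String.prototype.normalize() and spares you from keeping a list of every accented Greek character in the Unicode tables.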

tgo*_*gos 9

I also found the following approach, which makes use of String.prototype.normalize():

const normal = 'Αντίθετα με αυτό που θεωρεί η πλειοψηφία, το Lorem Ipsum δεν είναι απλά ένα τυχαίο κείμενο. Οι ρίζες του βρίσκονται σε ένα κείμενο Λατινικής λογοτεχνίας του 45 π.Χ., φτάνοντάς την ηλικία του πάνω από 2000 έτη.';

const pol = '??????? ??? ???? ??? ??? ??? ????? ????????? ??? ??????;';

console.log(normalizeGreek(normal));
console.log(normalizePolytonicGreek(pol));

// Decompose to NFD, then drop the combining marks (U+0300 to U+036F).
function normalizeGreek(text) {
    return text.normalize('NFD').replace(/[\u0300-\u036f]/g, "");
}

// Same steps: polytonic accents and breathings decompose to marks in the same range.
function normalizePolytonicGreek(text) {
    return text.normalize('NFD').replace(/[\u0300-\u036f]/g, "");
}
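The question also asks for lowercase output and one extra character replacement, which the two functions above leave out. A minimal sketch of how those steps might be combined follows. The function name normalizeGreekText is hypothetical, and the replacement target is garbled in the question, so folding final sigma (U+03C2) into sigma (U+03C3) is only an assumption about what was meant. The sample strings are written as escapes so they do not depend on the file encoding.

```javascript
// Sketch only: the function name and the sigma rule are assumptions,
// not part of the original answer.
function normalizeGreekText(text) {
  return text
    .toLowerCase()                    // fold case first
    .normalize('NFD')                 // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks
    .replace(/\u03c2/g, '\u03c3');    // assumed rule: final sigma -> sigma
}

// A polytonic word (capital alpha with psili, eta with perispomeni).
console.log(normalizeGreekText('\u1F08\u03B8\u1FC6\u03BD\u03B1\u03B9'));

// An all-caps word ending in sigma; any final sigma produced by
// lowercasing is folded by the last step.
console.log(normalizeGreekText('\u039F\u0394\u039F\u03A3'));
```

Lowercasing before decomposing keeps the mark-stripping step unchanged, since uppercase and lowercase precomposed letters decompose to marks in the same range.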

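How it works + example:

With .normalize('NFD'), an accented character is decomposed into:

  • the base character itself
  • followed by the equivalent combining diacritical marks (see the range [0300-036f])

These marks are then easy to remove with: .replace(/[\u0300-\u036f]/g, "")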

const a = "?";
console.log(a);             // prints: ?
console.log(Array.from(a)); // prints: [ "?" ]

const b = a.normalize('NFD');
console.log(b);             // prints: ???
console.log(Array.from(b)); // prints: [ "?", "?", "?" ]

const c = a.normalize('NFD').replace(/[\u0300-\u036f]/g, "");
console.log(c);             // prints: ?
console.log(Array.from(c)); // prints: [ "?" ]
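The sample character in the snippet above did not survive the page encoding, so here is the same walk-through with an assumed stand-in: U+1F04 (small alpha with psili and oxia), written as an escape. The helper codePoints is a hypothetical name added only to make the individual code points visible.

```javascript
// Assumed sample: U+1F04, small alpha with psili and oxia.
const sample = '\u1F04';

// Hypothetical helper: list each code point as a U+XXXX string.
function codePoints(text) {
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

// NFD splits the precomposed letter into a base letter plus two marks.
const decomposed = sample.normalize('NFD');

// Removing the combining marks leaves only the base letter.
const stripped = decomposed.replace(/[\u0300-\u036f]/g, '');

console.log(codePoints(sample));     // [ 'U+1F04' ]
console.log(codePoints(decomposed)); // [ 'U+03B1', 'U+0313', 'U+0301' ]
console.log(codePoints(stripped));   // [ 'U+03B1' ]
```

U+0313 and U+0301 are the combining forms of the breathing and the acute accent; both fall inside the 0300-036f range, which is why a single character class is enough to remove them.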



Tyb*_*itz 1

I don't think there is any way to do this other than checking every letter, but that doesn't have to make things any harder.


Simply chain your .replace calls like this:

result = string.replace(/Ά|Α|ά/g, 'α')
  .replace(/Έ|Ε|έ/g, 'ε')
  .replace(/Ή|Η|ή/g, 'η');
// & so on...

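Alternatively, if you would rather loop over them (which you probably will if you have more characters to check, and which is also better for code maintainability), store the character mappings in an array of objects or arrays. For example, with objects: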

var cvtValues = [ /* from = chars to convert; to = conversion output */
  {from: ['Ά', 'Α', 'ά'], to: 'α'},
  {from: ['Έ', 'Ε', 'έ'], to: 'ε'},
  {from: ['Ή', 'Η', 'ή'], to: 'η'}];
/* loop over all from-to containers */
for (var i = 0; i < cvtValues.length; i++) {
  /* loop over all characters in the 'from' array & replace them with the 'to' value */
  for (var x = 0; x < cvtValues[i].from.length; x++) {
    string = string.replace(new RegExp(cvtValues[i].from[x], 'g'), cvtValues[i].to);
    /* You could assign this to another variable, e.g. result, if you wanted */
  }
}
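The loop above calls .replace once per source character. As a variation, the same table could be flattened into a lookup so that the string is scanned only once. This is a sketch, not part of the original answer: the names GREEK_FOLDS, FOLD_LOOKUP, FOLD_PATTERN and foldGreek are hypothetical, and only the three vowels from the example are included.

```javascript
// Hypothetical mapping table; only three vowels are shown, extend as needed.
const GREEK_FOLDS = [
  { from: '\u0386\u0391\u03AC', to: '\u03B1' }, // Ά Α ά -> α
  { from: '\u0388\u0395\u03AD', to: '\u03B5' }, // Έ Ε έ -> ε
  { from: '\u0389\u0397\u03AE', to: '\u03B7' }, // Ή Η ή -> η
];

// Flatten the table into a per-character lookup.
const FOLD_LOOKUP = new Map();
for (const { from, to } of GREEK_FOLDS) {
  for (const ch of from) FOLD_LOOKUP.set(ch, to);
}

// One character class covering every key, so a single replace call suffices.
const FOLD_PATTERN = new RegExp('[' + [...FOLD_LOOKUP.keys()].join('') + ']', 'g');

function foldGreek(text) {
  // The callback receives each matched character and returns its replacement.
  return text.replace(FOLD_PATTERN, (ch) => FOLD_LOOKUP.get(ch));
}

console.log(foldGreek('\u0386\u03BB\u03C6\u03B1')); // "Άλφα" in, "αλφα" out
```

Building the character class from the map keys keeps the table as the single place to edit. It relies on the keys being plain letters with no special meaning inside a regular expression character class.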