JavaScript - normalize accented Greek characters

Question


tgo*_*gos 5 javascript regex character normalization

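I am trying to apply some kind of normalization to Greek text (lowercasing, removing accents, and replacing ? with ?). For example, I want both "?????????" (polytonic Greek) and "?????????" (modern Greek) to become "?????????". I went through unicode-table.com and noted which characters should be replaced.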

Greek and Coptic (Range: 0370-03FF)
ΆΑά -> α
ΈΕέ -> ε
ΉΗή -> η
????? -> ?
??? -> ?
?????? -> ?
??? -> ?

Greek Extended (Range: 1F00-1FFF)
?????????????????????????????????????????????? -> ?
???????????????? -> ?
?????????????????????????????????????????? -> ?
???????????????????????????? -> ?
???????????????? -> ?
???????????????????????? -> ?
?????????????????????????????????????????? -> ?
??? -> ?


I would like to know whether there is a clever way to do these replacements and avoid checking the string character by character.

First attempt (thanks to @Tyblitz)


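Second attempt:

Please check my answer below, which uses String.prototype.normalize() and saves you from having to keep a list of every accented Greek character in the Unicode table.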

Answer 1

tgo*_*gos 9

I also found the following approach, which uses String.prototype.normalize():

const normal = 'Αντίθετα με αυτό που θεωρεί η πλειοψηφία, το Lorem Ipsum δεν είναι απλά ένα τυχαίο κείμενο. Οι ρίζες του βρίσκονται σε ένα κείμενο Λατινικής λογοτεχνίας του 45 π.Χ., φτάνοντας την ηλικία του πάνω από 2000 έτη.';

const pol = '??????? ??? ???? ??? ??? ??? ????? ????????? ??? ??????;';

console.log(normalizeGreek(normal));
console.log(normalizePolytonicGreek(pol));

function normalizeGreek(text) {
    return text.normalize('NFD').replace(/[\u0300-\u036f]/g, "");
}


function normalizePolytonicGreek(text) {
    return text.normalize('NFD').replace(/[\u0300-\u036f]/g, "");
}
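
The question also asks for lowercasing and for replacing one letter with another, which the two functions above do not do. A minimal sketch extending the same approach might look like this. The name `normalizeGreekText` is hypothetical, and the sketch assumes the replacement meant in the question is final sigma (ς) to sigma (σ), which the surviving text does not confirm.

```javascript
// Hypothetical helper, not part of the original answer.
function normalizeGreekText(text) {
  return text
    .normalize('NFD')                 // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining diacritical marks
    .toLowerCase()                    // lowercase what remains
    .replace(/\u03c2/g, '\u03c3');    // assumption: map final sigma (ς) to sigma (σ)
}

// The same word in polytonic (Ἄνθρωπος) and monotonic (Άνθρωπος) spelling,
// written with escapes so the example does not depend on the file encoding.
const polytonic = '\u1f0c\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2';
const monotonic = '\u0386\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2';

console.log(normalizeGreekText(polytonic) === normalizeGreekText(monotonic)); // true
```

The sigma replacement runs after `toLowerCase()` because lowercasing a word-final capital sigma can itself produce ς.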


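How it works, with an example:

Internally, .normalize('NFD') decomposes each accented character into:

the base character itself
followed by the equivalent combining diacritical marks (see the range [0300-036f])

These marks are then easy to remove with: .replace(/[\u0300-\u036f]/g, "")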

const a = "?";
console.log(a);             // prints: ?
console.log(Array.from(a)); // prints: [ "?" ]

const b = a.normalize('NFD');
console.log(b);             // prints: ???
console.log(Array.from(b)); // prints: [ "?", "?", "?" ]

const c = a.normalize('NFD').replace(/[\u0300-\u036f]/g, "");
console.log(c);             // prints: ?
console.log(Array.from(c)); // prints: [ "?" ]
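
The same steps can be traced with explicit code points, which makes the decomposition visible regardless of font or encoding. This is only a sketch: the sample character U+1F04 (Greek small letter alpha with psili and oxia) and the helper name `codePoints` are illustrative choices, not part of the original answer.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

const composed = '\u1f04';                     // one precomposed code point
const decomposed = composed.normalize('NFD');  // base letter + two combining marks
const stripped = decomposed.replace(/[\u0300-\u036f]/g, '');

console.log(codePoints(composed));   // [ 'U+1F04' ]
console.log(codePoints(decomposed)); // [ 'U+03B1', 'U+0313', 'U+0301' ]
console.log(codePoints(stripped));   // [ 'U+03B1' ]
```

The range U+0300 to U+036F is the Combining Diacritical Marks block. It also contains U+0345 (the combining iota subscript), so that mark is stripped as well.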


Answer 2

Tyb*_*itz 1

I don't think there is any way to do this other than checking every letter, but that doesn't make the approach any worse.


Simply chain your .replace() calls like this:


var result = string.replace(/Ά|Α|ά/g, 'α')
  .replace(/Έ|Ε|έ/g, 'ε')
  .replace(/Ή|Η|ή/g, 'η');
// & so on...


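Or, if you would rather loop over them (which you probably will if you have more characters to check, and which is also better for code maintainability), store the character mappings in an array of objects or arrays. For example, with objects: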


var cvtValues = [ /* from = chars to convert; to = conversion output */
  {from: ['Ά', 'Α', 'ά'], to: 'α'},
  {from: ['Έ', 'Ε', 'έ'], to: 'ε'},
  {from: ['Ή', 'Η', 'ή'], to: 'η'}];
/* loop over all from-to containers */
for (var i = 0; i < cvtValues.length; i++) {
  /* loop over all characters in the 'from' array & replace them with the 'to' value */
  for (var x = 0; x < cvtValues[i].from.length; x++) {
    string = string.replace(new RegExp(cvtValues[i].from[x], 'g'), cvtValues[i].to);
    /* You could assign this to another variable, e.g. result, if you wanted */
  }
}
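
The nested loops above build and run one regular expression per character. As a possible alternative, the same table can be flattened into a lookup map and applied in a single pass with a replacer callback. This is only a sketch: the names `mappings`, `lookup`, `pattern`, and `convert` are hypothetical, and only the three rows shown above are included.

```javascript
// Same from/to rows as above, written with escapes (Ά Α ά -> α, and so on).
const mappings = [
  { from: ['\u0386', '\u0391', '\u03ac'], to: '\u03b1' }, // alpha
  { from: ['\u0388', '\u0395', '\u03ad'], to: '\u03b5' }, // epsilon
  { from: ['\u0389', '\u0397', '\u03ae'], to: '\u03b7' }, // eta
];

// Flatten the rows into a character -> replacement lookup.
const lookup = new Map();
for (const { from, to } of mappings) {
  for (const ch of from) lookup.set(ch, to);
}

// One character class covering every source character.
const pattern = new RegExp('[' + Array.from(lookup.keys()).join('') + ']', 'g');

// Replace every listed character in a single pass over the text.
function convert(text) {
  return text.replace(pattern, (ch) => lookup.get(ch));
}

console.log(convert('\u0386\u0388\u0389') === '\u03b1\u03b5\u03b7'); // true
```

Adding a row to `mappings` is then the only change needed to cover more characters.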


Archived:	11 years, 7 months ago
Views:	2401
Last activity:	6 years, 10 months ago