tgo*_*gos 5 javascript regex character normalization
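I am trying to apply some form of normalization to Greek text (lowercase everything, remove accents, and replace ? with ?). For example, I want " ?????????" (polytonic Greek) and " ?????????" (modern Greek) to both become " ?????????". I went through unicode-table.com and noted which characters should be replaced.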
Greek and Coptic (Range: 0370— 03FF)
??? -> ?
??? -> ?
??? -> ?
????? -> ?
??? -> ?
?????? -> ?
??? -> ?
Greek Extended (Range: 1F00— 1FFF)
?????????????????????????????????????????????? -> ?
???????????????? -> ?
?????????????????????????????????????????? -> ?
???????????????????????????? -> ?
???????????????? -> ?
???????????????????????? -> ?
?????????????????????????????????????????? -> ?
??? -> ?
I would like to know whether there is a clever way to make these replacements without checking the string character by character.
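Please see my answer below, which uses String.prototype.normalize() and spares you from keeping a list of every accented Greek character in the Unicode tables.
I also found the following approach, which uses String.prototype.normalize():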
normal = '???????? ?? ???? ??? ?????? ? ??????????, ?? Lorem Ipsum ??? ????? ???? ??? ?????? ???????. ?? ????? ??? ?????????? ?? ??? ??????? ????????? ??????????? ??? 45 ?.?., ????????? ??? ?????? ??? ???? ??? 2000 ???.';
pol = '??????? ??? ???? ??? ??? ??? ????? ????????? ??? ??????;';
console.log(normalizeGreek(normal));
console.log(normalizePolytonicGreek(pol));
function normalizeGreek(text) {
return text.normalize('NFD').replace(/[\u0300-\u036f]/g, "");
}
function normalizePolytonicGreek(text) {
return text.normalize('NFD').replace(/[\u0300-\u036f]/g, "");
}
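Under .normalize('NFD'), each accented character is decomposed into its base letter followed by the combining diacritical marks.
Those marks are then easy to remove with: .replace(/[\u0300-\u036f]/g, "")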
a = "?"
console.log(a); // prints: ?
console.log(Array.from(a)); // prints: [ "?" ]
b = a.normalize('NFD')
console.log(b); // prints: ???
console.log(Array.from(b)); // prints: [ "?", "?", "?" ]
c = a.normalize('NFD').replace(/[\u0300-\u036f]/g, "")
console.log(c); // prints: ?
console.log(Array.from(c)); // prints: [ "?" ]
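The functions above only strip accents, while the question also asks for lowercasing and one character replacement. Below is a minimal sketch that combines the three steps. The name `normalizeGreekFull` is hypothetical, and the last step assumes the replacement meant in the question is final sigma (ς) to sigma (σ), which the garbled text does not confirm.

```javascript
// Hypothetical helper, not part of the original answer.
function normalizeGreekFull(text) {
  return text
    .normalize('NFD')                 // split each letter from its combining marks
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks (accents, breathings, iota subscript)
    .toLowerCase()                    // lowercase the remaining base letters
    .replace(/\u03c2/g, '\u03c3');    // assumption: fold final sigma (ς) into sigma (σ)
}

console.log(normalizeGreekFull('Ἐν ἀρχῇ')); // prints: εν αρχη
console.log(normalizeGreekFull('ΟΔΟΣ'));    // prints: οδοσ
```

The range \u0300-\u036f covers the Combining Diacritical Marks block, so marks that NFD places outside that block would need an extra range.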
I don't think there is any way to do this other than checking every letter, but that doesn't make things any worse.
Simply chain your .replace calls like this:
result = string.replace(/Ά|Α|ά/g, 'α')
    .replace(/Έ|Ε|έ/g, 'ε')
    .replace(/Ή|Η|ή/g, 'η');
// & so on...
Or, if you would rather loop over them (which you probably will if you have many more characters to check, and which is also better for code maintainability), store the character mappings in an array of objects or arrays.
For example, with objects:
var cvtValues = [ /* from = chars to convert; to = conversion output */
    {from: ['Ά', 'Α', 'ά'], to: 'α'},
    {from: ['Έ', 'Ε', 'έ'], to: 'ε'},
    {from: ['Ή', 'Η', 'ή'], to: 'η'}];
/* loop over all from-to containers */
for ( var i = 0; i < cvtValues.length; i++ ) {
    /* loop over all characters in the 'from' array & replace them with the 'to' value */
    for ( var x = 0; x < cvtValues[i].from.length; x++ ) {
        string = string.replace(new RegExp(cvtValues[i].from[x], 'g'), cvtValues[i].to);
        /* You could assign this to another variable, e.g. result, if you wanted */
    }
}
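The same mapping can also be applied in one pass instead of one replace call per character. This is only a sketch: `GREEK_FOLD`, `GREEK_FOLD_RE` and `foldGreek` are hypothetical names, and the table holds just the three rows shown above, so it would need the remaining rows added.

```javascript
// Hypothetical lookup table: each key is a character to convert, each value its output.
const GREEK_FOLD = {
  'Ά': 'α', 'Α': 'α', 'ά': 'α',
  'Έ': 'ε', 'Ε': 'ε', 'έ': 'ε',
  'Ή': 'η', 'Η': 'η', 'ή': 'η',
};

// Build a single character class from the keys so the string is scanned only once.
const GREEK_FOLD_RE = new RegExp('[' + Object.keys(GREEK_FOLD).join('') + ']', 'g');

function foldGreek(text) {
  // The callback looks up the replacement for each matched character.
  return text.replace(GREEK_FOLD_RE, (ch) => GREEK_FOLD[ch]);
}

console.log(foldGreek('Άλφα Έψιλον Ήτα')); // prints: αλφα εψιλον ητα
```

The keys must be precomposed (NFC) characters. Input that has already been decomposed with NFD would not match them.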