Jam*_*all 415 .net string diacritics
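I'm trying to convert some strings that are in French Canadian. Basically, I'd like to be able to take the French accent marks out of the letters while keeping the letter itself (e.g. convert é to e, so crème brûlée would become creme brulee).
What is the best method for achieving this?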
Bla*_*rad 506
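I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)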
// Requires: using System.Globalization; using System.Text;
static string RemoveDiacritics(string text)
{
    // Form D splits each accented letter into a base letter plus combining marks.
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    // Recompose whatever is left into Form C.
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
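Note that this is a followup to his earlier post: Stripping diacritics....
The approach uses String.Normalize to split the input string into its constituent glyphs (basically separating the "base" characters from the diacritics), then scans the result and retains only the base characters. It's a little complicated, but really you're looking at a complicated problem.
Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.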
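To make the decomposition step concrete, here is a minimal sketch (not taken from Kaplan's post; the class name DecompositionDemo and the helper Describe are made up for illustration) that prints the UTF-16 code units of "é" before and after normalizing to Form D:

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class DecompositionDemo
{
    // Hypothetical helper: lists each UTF-16 code unit with its Unicode category.
    public static string Describe(string s) =>
        string.Join(" ", s.Select(c =>
            $"U+{(int)c:X4}({CharUnicodeInfo.GetUnicodeCategory(c)})"));

    public static void Main()
    {
        string composed = "\u00E9"; // é stored as a single code point
        string decomposed = composed.Normalize(NormalizationForm.FormD);

        Console.WriteLine(Describe(composed));   // U+00E9(LowercaseLetter)
        Console.WriteLine(Describe(decomposed)); // U+0065(LowercaseLetter) U+0301(NonSpacingMark)
    }
}
```

Once the accent is a separate NonSpacingMark character, dropping it leaves the plain "e" behind, which is all the loop in RemoveDiacritics does.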
小智 149
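This did the trick for me...
string accentedStr = "crème brûlée"; // sample input
// ISO-8859-8 has no accented Latin letters; this trick relies on the encoder substituting the plain letter.
// On .NET Core / .NET 5+, register CodePagesEncodingProvider first, or GetEncoding will throw.
byte[] tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);
Quick and short!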
cdi*_*die 38
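The accepted answer is totally correct, but nowadays it should be updated to use the Rune class instead of CharUnicodeInfo, because C# and .NET changed the way strings are analysed in recent versions (the Rune class was added in .NET Core 3.0).
The following code, for .NET 5+, is now recommended, as it goes further for non-Latin characters: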
static string RemoveDiacritics(string text)
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    // EnumerateRunes walks whole code points, so surrogate pairs are handled as one unit.
    foreach (var c in normalizedString.EnumerateRunes())
    {
        var unicodeCategory = Rune.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            // Rune.ToString() yields the one or two UTF-16 chars of the code point.
            stringBuilder.Append(c.ToString());
        }
    }

    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
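A small sketch of why the Rune-based loop reaches further: a combining mark outside the Basic Multilingual Plane is stored as a surrogate pair, so a char-by-char scan sees two Surrogate code units and never a NonSpacingMark. The sample character U+1D167 (a musical combining mark) and the names RuneDemo, StripByChar and StripByRune are my own choices for illustration, not part of the answer above. It needs .NET Core 3.0 or later.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class RuneDemo
{
    // Char-based scan, as in the accepted answer (normalization omitted for brevity).
    public static string StripByChar(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        return sb.ToString();
    }

    // Rune-based scan: each Rune is a whole code point, even beyond U+FFFF.
    public static string StripByRune(string s)
    {
        var sb = new StringBuilder();
        foreach (Rune r in s.EnumerateRunes())
            if (Rune.GetUnicodeCategory(r) != UnicodeCategory.NonSpacingMark)
                sb.Append(r.ToString());
        return sb.ToString();
    }

    public static void Main()
    {
        // "a" followed by U+1D167, a non-spacing mark encoded as a surrogate pair.
        string input = "a" + char.ConvertFromUtf32(0x1D167);

        Console.WriteLine(StripByChar(input).Length); // 3: the surrogate pair survives
        Console.WriteLine(StripByRune(input).Length); // 1: the mark is removed
    }
}
```

For French text the two loops behave the same, since every accent involved lives in the Basic Multilingual Plane.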
Luk*_*Luk 29
In case someone is interested, I was looking for something similar and ended up writing the following:
public static string NormalizeStringForUrl(string name)
{
    String normalizedString = name.Normalize(NormalizationForm.FormD);
    StringBuilder stringBuilder = new StringBuilder();

    foreach (char c in normalizedString)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            // Keep letters and digits as they are.
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.DecimalDigitNumber:
                stringBuilder.Append(c);
                break;
            // Turn spaces, connectors and dashes into underscores; everything else is dropped.
            case UnicodeCategory.SpaceSeparator:
            case UnicodeCategory.ConnectorPunctuation:
            case UnicodeCategory.DashPunctuation:
                stringBuilder.Append('_');
                break;
        }
    }

    string result = stringBuilder.ToString();
    return String.Join("_", result.Split(new char[] { '_' },
        StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores
}
CIR*_*CLE 24
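I needed something that converts all major Unicode characters, and the voted answer left a few out, so I created a C# version of CodeIgniter's convert_accented_characters($str) that is easily customisable: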
using System;
using System.Text;
using System.Collections.Generic;
public static class Strings
{
static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
{
{ "äæ?", "ae" },
{ "öœ", "oe" },
{ "ü", "ue" },
{ "Ä", "Ae" },
{ "Ü", "Ue" },
{ "Ö", "Oe" },
{ "ÀÁÂÃÄÅ???????????????????", "A" },
{ "àáâãå?????ª???????????????", "a" },
{ "?", "B" },
{ "?", "b" },
{ "Ç????", "C" },
{ "ç????", "c" },
{ "?", "D" },
{ "?", "d" },
{ "Ð???", "Dj" },
{ "ð???", "dj" },
{ "ÈÉÊË?????????????????", "E" },
{ "èéêë?????????????????", "e" },
{ "?", "F" },
{ "?", "f" },
{ "???????", "G" },
{ "???????", "g" },
{ "??", "H" },
{ "??", "h" },
{ "ÌÍÎÏ???????????????", "I" },
{ "ìíîï????????????????", "i" },
{ "?", "J" },
{ "?", "j" },
{ "???", "K" },
{ "???", "k" },
{ "???????", "L" },
{ "???????", "l" },
{ "?", "M" },
{ "?", "m" },
{ "Ñ?????", "N" },
{ "ñ??????", "n" },
{ "ÒÓÔÕ?????Ø??????????????????", "O" },
{ "òóôõ?????ø?º?????????????????", "o" },
{ "?", "P" },
{ "?", "p" },
{ "?????", "R" },
{ "?????", "r" },
{ "????Š??", "S" },
{ "????š????", "s" },
{ "??????", "T" },
{ "?????", "t" },
{ "ÙÚÛ?????????????????????", "U" },
{ "ùúû???????????????????????", "u" },
{ "ÝŸ?????????", "Y" },
{ "ýÿ??????", "y" },
{ "?", "V" },
{ "?", "v" },
{ "?", "W" },
{ "?", "w" },
{ "??Ž??", "Z" },
{ "??ž??", "z" },
{ "Æ?", "AE" },
{ "ß", "ss" },
{ "?", "IJ" },
{ "?", "ij" },
{ "Œ", "OE" },
{ "ƒ", "f" },
{ "?", "ks" },
{ "?", "p" },
{ "?", "v" },
{ "?", "m" },
{ "?", "ps" },
{ "?", "Yo" },
{ "?", "yo" },
{ "?", "Ye" },
{ "?", "ye" },
{ "?", "Yi" },
{ "?", "Zh" },
{ "?", "zh" },
{ "?", "Kh" },
{ "?", "kh" },
{ "?", "Ts" },
{ "?", "ts" },
{ "?", "Ch" },
{ "?", "ch" },
{ "?", "Sh" },
{ "?", "sh" },
{ "?", "Shch" },
{ "?", "shch" },
{ "????", "" },
{ "?", "Yu" },
{ "?", "yu" },
{ "?", "Ya" },
{ "?", "ya" },
};
    // Returns the first character of the replacement, or c itself when there is none.
    public static char RemoveDiacritics(this char c)
    {
        foreach (KeyValuePair<string, string> entry in foreign_characters)
        {
            if (entry.Key.IndexOf(c) != -1)
            {
                // An empty replacement cannot be expressed as a char, so keep the original.
                return entry.Value.Length > 0 ? entry.Value[0] : c;
            }
        }
        return c;
    }

    public static string RemoveDiacritics(this string s)
    {
        StringBuilder sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            bool replaced = false;
            foreach (KeyValuePair<string, string> entry in foreign_characters)
            {
                if (entry.Key.IndexOf(c) != -1)
                {
                    sb.Append(entry.Value);
                    replaced = true;
                    break;
                }
            }
            // Characters without a table entry pass through unchanged.
            if (!replaced)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
Usage
// for strings
"crème brûlée".RemoveDiacritics(); // creme brulee
// for chars
"Ã"[0].RemoveDiacritics(); // A
Ken*_*enE 15
In case anyone is interested, here is the Java equivalent:
import java.text.Normalizer;

public class MyClass
{
    public static String removeDiacritics(String input)
    {
        String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder stripped = new StringBuilder();
        for (int i = 0; i < nrml.length(); ++i)
        {
            if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
            {
                stripped.append(nrml.charAt(i));
            }
        }
        return stripped.toString();
    }
}
rea*_*art 13
I often use an extension method based on another version I found here (see Replacing characters in C# (ascii)). The comments give a quick explanation.
Code:
using System.Linq;
using System.Text;
using System.Globalization;

// namespace here
public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (null == str) return null;
        var chars =
            // Form D separates base letters from their combining marks
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            // drop the non-spacing marks
            where uc != UnicodeCategory.NonSpacingMark
            select c;

        // recompose the remaining characters into Form C
        var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);

        return cleanStr;
    }

    // or, alternatively
    public static string RemoveDiacritics2(this string str)
    {
        if (null == str) return null;
        var chars = str
            .Normalize(NormalizationForm.FormD)
            .ToCharArray()
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();

        return new string(chars).Normalize(NormalizationForm.FormC);
    }
}
The Greek (ISO) code page can do it.
Information about this code page is available through System.Text.Encoding.GetEncodings(). Learn more at: https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx
The Greek (ISO) code page is 28597 and its name is iso-8859-7.
On to the code... \o/
string text = "Você está numa situação lamentável";
string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
//result: "Voce+esta+numa+situacao+lamentavel"
string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
//result: "Voce esta numa situacao lamentavel"
So, write this function...
public string RemoveAcentuation(string text)
{
    return
        System.Web.HttpUtility.UrlDecode(
            System.Web.HttpUtility.UrlEncode(
                text, Encoding.GetEncoding("iso-8859-7")));
}
Note that Encoding.GetEncoding("iso-8859-7") is equivalent to Encoding.GetEncoding(28597), because the first is the name and the second is the code page of the Encoding.
Same as the accepted answer but faster, using Span instead of StringBuilder.
Requires .NET Core 3.1 or a newer version of .NET.
static string RemoveDiacritics(string text)
{
    ReadOnlySpan<char> normalizedString = text.Normalize(NormalizationForm.FormD);
    int i = 0;
    // Size the buffer from the normalized string: decomposition can make it longer than the input.
    Span<char> span = normalizedString.Length < 1000
        ? stackalloc char[normalizedString.Length]
        : new char[normalizedString.Length];

    foreach (char c in normalizedString)
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            span[i++] = c;
    }

    // Only the first i characters were written; the rest of the buffer is unused.
    return new string(span.Slice(0, i)).Normalize(NormalizationForm.FormC);
}
Also, this can be extended with additional character replacements, e.g. for the Polish Ł.
span[i++] = c switch
{
    'Ł' => 'L',
    'ł' => 'l',
    _ => c
};
A small note: stack allocation (stackalloc) is considerably faster than heap allocation (new), and it leaves less work for the garbage collector. 1000 is a threshold that avoids allocating large structures on the stack, which may cause a StackOverflowException. While 1000 is a pretty safe value, in most cases 10000 or even 100000 would also work (100k allocates up to 200 kB on the stack, while the default stack size is 1 MB). However, 100k looks a bit dangerous to me.
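TL;DR - C# string extension method
I think the best solution for preserving the meaning of the string is to convert the characters instead of stripping them, which is well illustrated by the example crème brûlée: crme brle vs. creme brulee.
I checked out Alexander's comment above and saw that the Lucene.Net code is Apache 2.0 licensed, so I modified the class into a simple string extension method. You can use it like this:
var originalString = "crème brûlée";
var maxLength = originalString.Length; // limit output length as necessary
var foldedString = originalString.FoldToASCII(maxLength);
// "creme brulee"
The function is too long to post in a StackOverflow answer (~139k characters of the 30k allowed, lol), so I made a gist and attributed the authors: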
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/// <summary>
/// This class converts alphabetic, numeric, and symbolic Unicode characters
/// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode
/// block) into their ASCII equivalents, if one exists.
/// <para/>
/// Characters from the following Unicode blocks are converted; however, only
/// those characters with reasonable ASCII alternatives are converted:
///
/// <ul>
/// <item><description>C1 Controls and Latin-1 Supplement: <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a></description></item>
/// <item><description>Latin Extended-A: <a href="http://www.unicode.org/charts/PDF/U0100.pdf">http://www.unicode.org/charts/PDF/U0100.pdf</a></description></item>
/// <item><description>Latin Extended-B: <a href="http://www.unicode.org/charts/PDF/U0180.pdf">http://www.unicode.org/charts/PDF/U0180.pdf</a></description></item>
/// <item><description>Latin Extended Additional: <a href="http://www.unicode.org/charts/PDF/U1E00.pdf">http://www.unicode.org/charts/PDF/U1E00.pdf</a></description></item>
/// <item><description>Latin Extended-C: <a href="http://www.unicode.org/charts/PDF/U2C60.pdf">http://www.unicode.org/charts/PDF/U2C60.pdf</a></description></item>
/// <item><description>Latin Extended-D: <a href="http://www.unicode.org/charts/PDF/UA720.pdf">http://www.unicode.org/charts/PDF/UA720.pdf</a></description></item>
/// <item><description>IPA Extensions: <a href="http://www.unicode.org/charts/PDF/U0250.pdf">http://www.unicode.org/charts/PDF/U0250.pdf</a></description></item>
/// <item><description>Phonetic Extensions: <a href="http://www.unicode.org/charts/PDF/U1D00.pdf">http://www.unicode.org/charts/PDF/U1D00.pdf</a></description></item>
/// <item><description>Phonetic Extensions Supplement: <a href="http://www.unicode.org/charts/PDF/U1D80.pdf">http://www.unicode.org/charts/PDF/U1D80.pdf</a></description></item>
/// <item><description>General Punctuation: <a href="http://www.unicode.org/charts/PDF/U2000.pdf">http://www.unicode.org/charts/PDF/U2000.pdf</a></description></item>
/// <item><description>Superscripts and Subscripts: <a href="http://www.unicode.org/charts/PDF/U2070.pdf">http://www.unicode.org/charts/PDF/U2070.pdf</a></description></item>
/// <item><description>Enclosed Alphanumerics: <a href="http://www.unicode.org/charts/PDF/U2460.pdf">http://www.unicode.org/charts/PDF/U2460.pdf</a></description></item>
/// <item><description>Dingbats: <a href="http://www.unicode.org/charts/PDF/U2700.pdf">http://www.unicode.org/charts/PDF/U2700.pdf</a></description></item>
/// <item><description>Supplemental Punctuation: <a href="http://www.unicode.org/charts/PDF/U2E00.pdf">http://www.unicode.org/charts/PDF/U2E00.pdf</a></description></item>
/// <item><description>Alphabetic Presentation Forms: <a href="http://www.unicode.org/charts/PDF/UFB00.pdf">http://www.unicode.org/charts/PDF/UFB00.pdf</a></description></item>
/// <item><description>Halfwidth and Fullwidth Forms: <a href="http://www.unicode.org/charts/PDF/UFF00.pdf">http://www.unicode.org/charts/PDF/UFF00.pdf</a></description></item>
/// </ul>
/// <para/>
/// See: <a href="http://en.wikipedia.org/wiki/Latin_characters_in_Unicode">http://en.wikipedia.org/wiki/Latin_characters_in_Unicode</a>
/// <para/>
/// For example, '&agrave;' will be replaced by 'a'.
/// </summary>
public static partial class StringExtensions
{
    /// <summary>
    /// Converts characters above ASCII to their ASCII equivalents. For example,
    /// accents are removed from accented characters.
    /// </summary>
    /// <param name="input"> The string of characters to fold </param>
    /// <param name="length"> The length of the folded return string </param>
    /// <returns> The folded string </returns>
    public static string FoldToASCII(this string input, int? length = null)
    {
        // See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c
    }
}
Hope that helps someone else, this is the most robust solution I've found!
小智 6
For simply removing French Canadian accent marks, as the original question asked, here is an alternative that uses a regular expression instead of hardcoded conversions and For/Next loops. Depending on your needs, it could be condensed into a single line of code; however, I added it to an extensions class for easier reuse.
Visual Basic
Imports System.Text
Imports System.Text.RegularExpressions

Public MustInherit Class StringExtension
    Public Shared Function RemoveDiacritics(Text As String) As String
        Return New Regex("\p{Mn}", RegexOptions.Compiled).Replace(Text.Normalize(NormalizationForm.FormD), String.Empty)
    End Function
End Class
Implementation
    Private Shared Sub DoStuff()
        MsgBox(StringExtension.RemoveDiacritics(inputString))
    End Sub
C#
using System.Text;
using System.Text.RegularExpressions;

namespace YourApplication
{
    public abstract class StringExtension
    {
        public static string RemoveDiacritics(string Text)
        {
            return new Regex(@"\p{Mn}", RegexOptions.Compiled).Replace(Text.Normalize(NormalizationForm.FormD), string.Empty);
        }
    }
}
Implementation
    private static void DoStuff()
    {
        MessageBox.Show(StringExtension.RemoveDiacritics(inputString));
    }
Input:  äáčďěéíľľňôóřŕšťúůýž ÄÁČĎĚÉÍĽĽŇÔÓŘŔŠŤÚŮÝŽ ÖÜË łŁđĐ ţŢşŞçÇ øı
Output: aacdeeillnoorrstuuyz AACDEEILLNOORRSTUUYZ OUE łŁđĐ tTsScC øı
I included characters that are not converted, to help visualize what happens when unexpected input is received.
If you also need it to convert other kinds of characters, such as the Polish ł and Ł, then depending on your needs consider incorporating this answer (.NET Core friendly), which uses CodePagesEncodingProvider, into your solution.
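The linked answer is not reproduced here, so the following is only a sketch of what registering CodePagesEncodingProvider looks like. On .NET Core and .NET 5+, code page encodings such as ISO-8859-8 are not available until the provider is registered (older targets may also need the System.Text.Encoding.CodePages package). The class and method names are hypothetical, and which letters come out unaccented depends on the encoding's best-fit tables, so check the result against your own data.

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    // Hypothetical helper: provider registration plus the ISO-8859-8 round trip
    // shown in an earlier answer in this thread.
    public static string ToAsciiViaCodePage(string text)
    {
        // Makes the code page encodings available to Encoding.GetEncoding.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] bytes = Encoding.GetEncoding("ISO-8859-8").GetBytes(text);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        Console.WriteLine(ToAsciiViaCodePage("crème brûlée łŁ"));
    }
}
```

Registration is usually done once at application startup rather than inside the helper; it sits here only to keep the sketch self-contained.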
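Funny how such a question can get so many answers and yet none fit my requirements :) There are so many languages around that a fully language-agnostic solution is AFAIK not really possible, as others have mentioned that FormC or FormD give issues.
Since the original question was about French, the simplest working answer is indeed
public static string ConvertWesternEuropeanToASCII(this string str)
{
    return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
}
1251 should be replaced by the encoding code of the input language.
However, this only replaces one character with one character. Since I am also working with German as input, I did a manual conversion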
public static string LatinizeGermanCharacters(this string str)
{
    StringBuilder sb = new StringBuilder(str.Length);
    foreach (char c in str)
    {
        switch (c)
        {
            case 'ä':
                sb.Append("ae");
                break;
            case 'ö':
                sb.Append("oe");
                break;
            case 'ü':
                sb.Append("ue");
                break;
            case 'Ä':
                sb.Append("Ae");
                break;
            case 'Ö':
                sb.Append("Oe");
                break;
            case 'Ü':
                sb.Append("Ue");
                break;
            case 'ß':
                sb.Append("ss");
                break;
            default:
                sb.Append(c);
                break;
        }
    }
    return sb.ToString();
}
It may not deliver the best performance, but at least it is very easy to read and extend. Regex is a no-go, much slower than any char/string stuff.
I also have a very simple method to remove spaces:
public static string RemoveSpace(this string str)
{
    return str.Replace(" ", string.Empty);
}
Eventually, I am using a combination of all three extensions above:
public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
{
    str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();
    return keepSpace ? str : str.RemoveSpace();
}
And a small unit test for it (not exhaustive), which passes successfully.
[TestMethod()]
public void LatinizeAndConvertToASCIITest()
{
    string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ";
    string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
    string actual = europeanStr.LatinizeAndConvertToASCII();
    Assert.AreEqual(expected, actual);
}
Views: 180849