Dav*_*d G 4 java unicode character-encoding
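I need to be able to detect Japanese characters in a Java string.
Currently I get the UnicodeBlock and check whether it equals Character.UnicodeBlock.KATAKANA or Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS, but I'm not 100% sure that covers everything.
Any suggestions?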
I use the following Java method. It may not fully meet your requirements.
<!-- language: lang-java -->
/**
 * Returns whether a character is a Chinese-Japanese-Korean (CJK) character.
 *
 * @param c the character to be tested
 * @return true if CJK, false otherwise
 */
private boolean isCharCJK(final char c) {
    // Look the block up once instead of once per comparison.
    final Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
    // Note: Extension B lies outside the Basic Multilingual Plane, so a single
    // char can never match it; that needs the int code point overload of of().
    return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_RADICALS_SUPPLEMENT
            || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == Character.UnicodeBlock.ENCLOSED_CJK_LETTERS_AND_MONTHS;
}
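Since the method above takes a `char`, it cannot see characters outside the Basic Multilingual Plane, such as those in CJK Unified Ideographs Extension B. A minimal sketch of a code-point-based variant follows, assuming Java 8 or later for `String.codePoints()`. The class name `CjkBlocks`, its method names, and the reduced block list are illustrative choices, not part of the original answer.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: checks ideograph blocks per code point rather than per char.
final class CjkBlocks {
    // Illustrative subset of blocks; extend it to match your own definition of "CJK".
    private static final Set<UnicodeBlock> IDEOGRAPH_BLOCKS = new HashSet<>(Arrays.asList(
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B,
            UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS));

    // UnicodeBlock.of(int) returns null for code points in no block;
    // HashSet.contains(null) is simply false in that case.
    static boolean isCjkIdeograph(int codePoint) {
        return IDEOGRAPH_BLOCKS.contains(UnicodeBlock.of(codePoint));
    }

    // Iterates by code point, so surrogate pairs are treated as one character.
    static boolean containsCjkIdeograph(String text) {
        return text.codePoints().anyMatch(CjkBlocks::isCjkIdeograph);
    }
}
```

Keep in mind that these blocks are shared by Chinese, Japanese, and Korean text, so a match says "ideograph present", not "Japanese present".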
Also, these look like they should work for hiragana and katakana characters:
private boolean isHiragana(final char c) {
    return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HIRAGANA;
}

private boolean isKatakana(final char c) {
    return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.KATAKANA;
}
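The block-based checks above do not cover half-width katakana, which live in the HALFWIDTH_AND_FULLWIDTH_FORMS block mentioned in the question. One possible alternative, sketched here under the assumption that Java 7's `Character.UnicodeScript` is available (plus Java 8 for `codePoints()`), is to test the script instead of the block. The class and method names are hypothetical.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper: classifies kana by Unicode script rather than by block.
final class JapaneseScripts {
    // True for hiragana and katakana letters, including half-width katakana
    // letters, which belong to the Katakana script despite sitting in another block.
    static boolean isKana(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA;
    }

    // Kana is a strong hint that text is Japanese, since Han ideographs
    // alone are shared with Chinese.
    static boolean containsKana(String text) {
        return text.codePoints().anyMatch(JapaneseScripts::isKana);
    }
}
```

Some marks used with kana have the script Common rather than Hiragana or Katakana, so this check is best treated as a heuristic.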
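According to regular-expressions.info, Japanese is not made up of a single script: "There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of."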
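In that case, this regex should do the trick (in Java, script names take the `Is` prefix, which requires Java 7 or later):
yourString.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsLatin}]*+")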
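Note that `String.matches` only succeeds when the whole string consists of the listed scripts, so spaces, digits, and punctuation (script Common) make it fail. If the goal is to detect whether any Japanese character occurs, `Matcher.find()` may be a better fit. A rough sketch, assuming Java 7 or later, where `java.util.regex.Pattern` documents script names with the `Is` prefix; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper contrasting "contains any" with "consists only of".
final class JapaneseRegex {
    // One character from a script commonly used in Japanese text.
    // Han also matches Chinese text, so this is a heuristic.
    private static final Pattern ANY_JAPANESE =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]");

    // Whole string made only of Hiragana, Katakana, Han, or Latin characters.
    private static final Pattern ONLY_JAPANESE_OR_LATIN =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsLatin}]*+");

    static boolean containsJapanese(String s) {
        return ANY_JAPANESE.matcher(s).find();
    }

    static boolean isOnlyJapaneseOrLatin(String s) {
        return ONLY_JAPANESE_OR_LATIN.matcher(s).matches();
    }
}
```

With these definitions, a string such as "Hello 世界" contains Japanese-script characters but is rejected by the whole-string check because of the space.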