How to extract emoji and alphabetic characters from a string

Question


Kis*_*nga 5 java android character utf-8 emoji

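I want to extract the emoji and alphabetic characters from a string into a collection. The string can contain any kind of emoji (activities, families, flags, animals) as well as alphabetic characters. When I read the string from an EditText, it looks like "ABCD\u200D\u200D\u200DE\uFE0F\u200D\u200D". I tried the code below, but the resulting array is not what I expected. Can anyone suggest what I need to do to get the expected array?

I tried this code in Eclipse; please correct me if I'm wrong.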

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    public class CodePoints {

        public static void main(String[] args) {
            List<String> list = new ArrayList<>();
            for (int codePoint : codePoints("ABCD\u200D\u200D\u200DE\uFE0F\u200D\u200D")) {
                // Each code point becomes its own list element
                list.add(String.valueOf(Character.toChars(codePoint)));
            }

            System.out.println(Arrays.toString(list.toArray()));
        }

        // Iterates over the Unicode code points of a string (not its UTF-16 chars)
        public static Iterable<Integer> codePoints(final String string) {
            return new Iterable<Integer>() {
                public Iterator<Integer> iterator() {
                    return new Iterator<Integer>() {
                        int nextIndex = 0;

                        public boolean hasNext() {
                            return nextIndex < string.length();
                        }

                        public Integer next() {
                            int result = string.codePointAt(nextIndex);
                            // Advance by 2 chars for a surrogate pair, 1 otherwise
                            nextIndex += Character.charCount(result);
                            return result;
                        }

                        public void remove() {
                            throw new UnsupportedOperationException();
                        }
                    };
                }
            };
        }
    }

Output:

    [A, B, , C, , D, , \u200D, , \u200D, , \u200D, , E, , \uFE0F, \u200D, , \u200D, ]

Expected:

    [A, B, , C, , D, \u200D\u200D\u200D, E, \uFE0F\u200D\u200D, ]

Answer 1

lov*_*343 0

The problem is that your string contains invisible characters. They are:

Unicode Character 'ZERO WIDTH JOINER' (U+200D)
Unicode Character 'VARIATION SELECTOR-16' (U+FE0F)

Other similar characters include:

Unicode Character 'SOFT HYPHEN' (U+00AD)
...
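
One way to see which invisible characters a string actually contains is to print every code point with its Unicode name. The following is only a minimal sketch: the class name `CodePointInspector` and the method `describe` are made up for illustration, and it assumes a JDK where `Character.getName` is available (Java 7 or later).

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointInspector {

    /** Returns one "U+XXXX NAME" entry per code point in the input. */
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // Character.getName returns null for unassigned code points
            out.add(String.format("U+%04X %s", cp, Character.getName(cp)));
            // Move past one or two chars, depending on the code point
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // 'A' followed by the three invisible characters listed above
        for (String line : describe("A\u200D\uFE0F\u00AD")) {
            System.out.println(line);
        }
    }
}
```

Running `describe` on the string read from the EditText should make the joiners and variation selectors visible as separate entries, which helps explain the unexpected elements in the output.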

Java chars are UTF-16 encoded; see https://en.wikipedia.org/wiki/UTF-16 and https://docs.oracle.com/javase/7/docs/api/java/lang/String.html, which states:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
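
The quoted paragraph can be checked with a small experiment. This sketch uses U+1F600 purely as an example of a supplementary character; the class and method names are illustrative and not part of any library.

```java
final class SurrogatePairDemo {

    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Build the supplementary character from its code point value
        String emoji = new String(Character.toChars(0x1F600));
        String text = "A" + emoji;

        System.out.println(codeUnits(text));   // 3: 'A' plus a surrogate pair
        System.out.println(codePoints(text));  // 2: 'A' plus one emoji
        System.out.println(Character.isHighSurrogate(text.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(text.charAt(2)));  // true
    }
}
```

This is why indexing a string char by char can split an emoji in half, while stepping by `Character.charCount` keeps each code point intact.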

Here is one way to iterate over the individual Unicode characters in a string.

    import java.util.ArrayList;
    import java.util.List;

    import org.junit.jupiter.api.Test; // assumption: JUnit 5, since the test method is not public

    // Class name taken from the call in the test below
    class UTF_16 {

        // Splits a string into one element per Unicode code point
        public static List<String> getUnicodeCharacters(String str) {
            List<String> result = new ArrayList<>();
            char[] charArray = str.toCharArray();
            for (int i = 0; i < charArray.length; ) {
                if (Character.isHighSurrogate(charArray[i])
                        && (i + 1) < charArray.length
                        && Character.isLowSurrogate(charArray[i + 1])) {
                    // A valid surrogate pair forms a single supplementary character
                    result.add(new String(new char[]{charArray[i], charArray[i + 1]}));
                    i += 2;
                } else {
                    result.add(new String(new char[]{charArray[i]}));
                    i++;
                }
            }
            return result;
        }

        @Test
        void getUnicodeCharacters() {
            String str = "ABCD\u200D\u200D\u200DE\uFE0F\u200D\u200D";
            System.out.println(str.codePointCount(0, str.length()));
            for (String unicodeCharacter : UTF_16.getUnicodeCharacters(str)) {
                // Skip the invisible joiner and variation selector
                if ("\u200D".equals(unicodeCharacter)
                        || "\uFE0F".equals(unicodeCharacter))
                    continue;
                System.out.println(unicodeCharacter);
            }
        }
    }
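
If the goal is the grouping shown in the question's expected output, where a joined emoji sequence stays in one element, one option is to attach joiners and variation selectors to the preceding code point instead of skipping them. The sketch below is not the answerer's method; `EmojiClusterSplitter` and `split` are hypothetical names, and the sample emoji (U+1F468, U+1F469, U+1F467, U+2764) are chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class EmojiClusterSplitter {

    private static final int ZWJ = 0x200D;   // ZERO WIDTH JOINER
    private static final int VS16 = 0xFE0F;  // VARIATION SELECTOR-16

    /** Splits a string into clusters, keeping ZWJ sequences together. */
    static List<String> split(String s) {
        List<String> clusters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean joinNext = false; // true when the previous code point was a ZWJ
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // A ZWJ or VS16 extends the current cluster, and so does
            // whatever code point comes directly after a ZWJ
            boolean attaches = cp == ZWJ || cp == VS16 || joinNext;
            if (!attaches && current.length() > 0) {
                clusters.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            joinNext = (cp == ZWJ);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            clusters.add(current.toString());
        }
        return clusters;
    }

    public static void main(String[] args) {
        String man = new String(Character.toChars(0x1F468));
        String woman = new String(Character.toChars(0x1F469));
        String girl = new String(Character.toChars(0x1F467));
        String family = man + "\u200D" + woman + "\u200D" + girl;
        String heart = "\u2764\uFE0F";
        // Five clusters: A, B, the family sequence, C, the heart
        System.out.println(split("AB" + family + "C" + heart));
    }
}
```

This sketch only handles ZWJ sequences and VARIATION SELECTOR-16. Flags built from regional indicator pairs, skin-tone modifiers, and combining marks would need extra rules or a full grapheme-cluster implementation.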


Asked:	7 years, 6 months ago
Viewed:	1228 times
Active:	7 years ago