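I have a program that parses tweets in real time from the Twitter streaming API. Before storing them, I encode them as UTF-8. Some characters end up in the string as ?, ?? or ??? instead of their respective Unicode code points, and that causes problems. After further investigation, I found that the problematic characters come from the "Emoticons" block, U+1F600 - U+1F64F, and the "Miscellaneous Symbols and Pictographs" block, U+1F300 - U+1F5FF. I tried to remove them, but without success, because the matcher ended up replacing almost every character in the string rather than just the Unicode range I wanted.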
String utf8tweet = "";
try {
    byte[] utf8Bytes = status.getText().getBytes("UTF-8");
    utf8tweet = new String(utf8Bytes, "UTF-8");
}
catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
Pattern unicodeOutliers = Pattern.compile("[\\u1f300-\\u1f64f]", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
What can I do to remove these characters?
jua*_*rro 32
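Add the negation operator ^ to the regex pattern. To keep only ASCII characters and replace everything else, you can use the expression [^\\x00-\\x7F], and you should get the desired result.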
import java.io.UnsupportedEncodingException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UTF8 {
    public static void main(String[] args) {
        String utf8tweet = "";
        try {
            byte[] utf8Bytes = "#Hello twitter ? How are you?".getBytes("UTF-8");
            utf8tweet = new String(utf8Bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        Pattern unicodeOutliers = Pattern.compile("[^\\x00-\\x7F]",
                Pattern.UNICODE_CASE | Pattern.CANON_EQ
                        | Pattern.CASE_INSENSITIVE);
        Matcher unicodeOutlierMatcher = unicodeOutliers.matcher(utf8tweet);
        System.out.println("Before: " + utf8tweet);
        utf8tweet = unicodeOutlierMatcher.replaceAll(" ");
        System.out.println("After: " + utf8tweet);
    }
}
This produces:
Before: #Hello twitter ? How are you?
After: #Hello twitter How are you?
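Edit
To explain further: you can also keep expressing the range in \u form, [^\\u0000-\\u007F], which matches every character that is not one of the first 128 Unicode characters (the same as before). If you want to widen the range to allow additional characters, consult a list of Unicode characters to find the upper bound you need.
For example, if you want to keep accented vowels (as used in Spanish), extend the range up to \u00FF, which gives [^\\u0000-\\u00FF] or [^\\x00-\\xFF]: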
Before: #Hello twitter ? How are you? á é í ó ú
After: #Hello twitter How are you? á é í ó ú
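One consequence of the negated-range approach is that it removes everything outside the chosen range, not only emoji. The sketch below illustrates this with the Latin-1 range from the example above; the class name `NonAsciiFilterDemo` and the method `stripNonLatin1` are made up for illustration, and the emoji is written as a `\uD83D\uDE00` escape (U+1F600) so that it survives any source-file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class NonAsciiFilterDemo {
    // Matches any code point outside U+0000..U+00FF.
    // Java's regex engine works on code points, so a surrogate pair
    // counts as a single match.
    private static final Pattern NON_LATIN1 = Pattern.compile("[^\\u0000-\\u00FF]");

    static String stripNonLatin1(String input) {
        return NON_LATIN1.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // U+1F600 (emoji), two accented vowels, and two CJK characters.
        String tweet = "Hola \uD83D\uDE00 \u00E1\u00E9 \u4F60\u597D";
        // The accented vowels stay; the emoji AND the CJK characters
        // are each replaced by a single space.
        System.out.println(stripNonLatin1(tweet));
    }
}
```

If tweets in non-Latin scripts need to be preserved, a pattern that targets the emoji ranges directly is probably a better fit than a negated ASCII or Latin-1 range.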
Joo*_*gen 24
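First, the Unicode block in question is specified in Java (which follows the standard strictly) as Character.UnicodeBlock MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS. In a regular expression, you can match the general category So (Other Symbol), which is broader than that block but includes most of these characters: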
s = s.replaceAll("\\p{So}+", "");
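As for why the pattern in the question replaced almost everything: in a Java regex, `\uXXXX` consumes exactly four hex digits, so `[\\u1f300-\\u1f64f]` is read as U+1F30, then the range from `0` to U+1F64, then `f`. That range covers most ordinary text. A minimal sketch of two ways to target only the two blocks follows, assuming Java 7 or later; the class name `EmojiBlockFilterDemo` and the helper `strip` are hypothetical. The first pattern uses the `\x{h...h}` code point syntax, the second uses `\p{In...}` block names taken from the `Character.UnicodeBlock` constant identifiers.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class EmojiBlockFilterDemo {
    // Explicit code point range U+1F300..U+1F64F. \x{...} accepts more than
    // four hex digits, unlike \\uXXXX.
    static final Pattern BY_RANGE = Pattern.compile("[\\x{1F300}-\\x{1F64F}]");

    // The same two blocks, named via their UnicodeBlock identifiers:
    // U+1F300..U+1F5FF and U+1F600..U+1F64F.
    static final Pattern BY_BLOCK = Pattern.compile(
            "[\\p{InMISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS}\\p{InEMOTICONS}]");

    static String strip(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // \uD83D\uDE00 is U+1F600, \uD83C\uDF00 is U+1F300.
        String tweet = "Hi \uD83D\uDE00 there \uD83C\uDF00!";
        System.out.println(strip(BY_RANGE, tweet));
        System.out.println(strip(BY_BLOCK, tweet));
    }
}
```

Unlike `\p{So}`, both patterns leave other symbols (for example © or ™) untouched, which may or may not be what you want.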
小智 7
I tried this. The Unicode ranges are taken from the emoji ranges.
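import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.common.base.Strings; // Guava, used for isNullOrEmpty

class EmojiEraser {
    // Each surrogate pair below is one code point; the ranges are
    // U+1F300-U+1F5FF, U+1F600-U+1F64F, U+1F680-U+1F6FF, U+2600-U+26FF, U+2700-U+27BF.
    private static final String EMOJI_RANGE_REGEX =
            "[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|[\u2600-\u26FF]|[\u2700-\u27BF]";
    private static final Pattern PATTERN = Pattern.compile(EMOJI_RANGE_REGEX);

    /**
     * Finds and removes emojis from the input.
     *
     * @param input the input string potentially containing emojis
     * @return the input string with emojis removed; null or empty input is returned unchanged
     */
    public String eraseEmojis(String input) {
        if (Strings.isNullOrEmpty(input)) {
            return input;
        }
        Matcher matcher = PATTERN.matcher(input);
        StringBuffer sb = new StringBuffer();
        // Replace every match with the empty string, copying the text in between.
        while (matcher.find()) {
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        return sb.toString();
    }
}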
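For comparison, here is a sketch of the same idea without the Guava dependency, assuming Java 7 or later. It expresses the ranges from `EMOJI_RANGE_REGEX` above as code points in a single character class and relies on `Matcher.replaceAll("")`, which does the same job as the `appendReplacement` loop. The class name `EmojiRangeDemo` is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical variant for illustration, not the original author's class.
class EmojiRangeDemo {
    // U+1F300-U+1F5FF, U+1F600-U+1F64F, U+1F680-U+1F6FF, U+2600-U+26FF, U+2700-U+27BF
    static final Pattern EMOJI = Pattern.compile(
            "[\\x{1F300}-\\x{1F5FF}\\x{1F600}-\\x{1F64F}\\x{1F680}-\\x{1F6FF}"
            + "\\x{2600}-\\x{26FF}\\x{2700}-\\x{27BF}]");

    static String eraseEmojis(String input) {
        // Plain null/empty check in place of Guava's Strings.isNullOrEmpty.
        if (input == null || input.isEmpty()) {
            return input;
        }
        return EMOJI.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // \uD83D\uDE80 is U+1F680 (rocket), \u2600 is U+2600 (sun).
        System.out.println(eraseEmojis("go \uD83D\uDE80 \u2600 ok"));
    }
}
```

These ranges cover only part of what Unicode treats as emoji, so characters outside them will pass through unchanged.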