ste*_*bot 9 java string truncate
I created the following to truncate a string in Java to a new string with a given number of bytes.
String truncatedValue = "";
String currentValue = string;
int pivotIndex = (int) Math.round(((double) string.length())/2);
while(!truncatedValue.equals(currentValue)){
    currentValue = string.substring(0,pivotIndex);
    byte[] bytes = null;
    bytes = currentValue.getBytes(encoding);
    if(bytes==null){
        return string;
    }
    int byteLength = bytes.length;
    int newIndex = (int) Math.round(((double) pivotIndex)/2);
    if(byteLength > maxBytesLength){
        pivotIndex = newIndex;
    } else if(byteLength < maxBytesLength){
        pivotIndex = pivotIndex + 1;
    } else {
        truncatedValue = currentValue;
    }
}
return truncatedValue;
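This was the first thing that came to mind, and I know I could improve on it. I saw another post asking a similar question, but there they truncated the string using the bytes rather than String.substring. I think I would rather use String.substring in my case.
EDIT: I just removed the UTF8 reference because I would rather be able to do this for different storage types as well.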
Rex*_*err 13
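Why not convert to bytes and walk forward, obeying UTF8 character boundaries as you go, until you've reached the maximum number, then convert those bytes back into a string?
Or, if you keep track of where the cut should occur, you could just cut the original string: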
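The first suggestion (encode, walk forward over whole characters, decode the prefix) could be sketched roughly as follows. The class name `Utf8BytePrefix` and the method `truncate` are made up for illustration, and the sketch assumes UTF-8 only and relies on `getBytes` producing well-formed output.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BytePrefix {
    // Hypothetical helper: longest prefix of s whose UTF-8 encoding fits in maxBytes.
    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= maxBytes) {
            return s;
        }
        int end = 0;
        while (end < maxBytes) {
            // The lead byte tells us how many bytes this character occupies.
            int lead = utf8[end] & 0xFF;
            int len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            if (end + len > maxBytes) {
                break; // the next character would cross the limit
            }
            end += len;
        }
        // Decode only the bytes that form whole characters.
        return new String(utf8, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "h" + U+00E9 (2 bytes) + "llo": a 2-byte limit leaves room for "h" only.
        System.out.println(truncate("h\u00e9llo", 2)); // prints "h"
    }
}
```

This version pays for one encode and one decode; the variant that cuts the original string avoids the decode step.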
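import java.nio.charset.StandardCharsets;

// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
public class UTF8Cutter {
    public static String cut(String s, int n) {
        // Request UTF-8 explicitly; a bare getBytes() uses the platform default charset
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length < n) n = utf8.length;
        int n16 = 0;      // number of UTF-16 chars that fit within n bytes
        int advance = 1;  // UTF-16 chars used by the current character
        int i = 0;
        while (i < n) {
            advance = 1;
            if ((utf8[i] & 0x80) == 0) i += 1;
            else if ((utf8[i] & 0xE0) == 0xC0) i += 2;
            else if ((utf8[i] & 0xF0) == 0xE0) i += 3;
            else { i += 4; advance = 2; }  // 4-byte sequence = surrogate pair
            if (i <= n) n16 += advance;
        }
        return s.substring(0,n16);
    }
}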
Note: edited to fix bugs on 2014-08-25
A more sane solution is to use a decoder:
final Charset CHARSET = Charset.forName("UTF-8"); // or any other charset
final byte[] bytes = inputString.getBytes(CHARSET);
final CharsetDecoder decoder = CHARSET.newDecoder();
// Silently drop a multi-byte sequence that the limit cuts in half
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.reset();
// Clamp the limit so wrap() does not throw when the input is shorter than the limit
final CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes, 0, Math.min(limit, bytes.length)));
final String outputString = decoded.toString();
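For reference, the decoder idea might be wrapped into a self-contained method along these lines. This is only a sketch: the names `DecoderTruncate` and `truncate` and the `maxBytes` parameter are invented for illustration, and rethrowing `CharacterCodingException` as an unchecked exception is one arbitrary way to handle it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecoderTruncate {
    // Hypothetical helper: decode at most maxBytes of the encoded string,
    // letting the decoder discard a trailing partial character.
    static String truncate(String input, Charset charset, int maxBytes) {
        byte[] bytes = input.getBytes(charset);
        if (bytes.length <= maxBytes) {
            return input; // already fits, nothing to cut
        }
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            ByteBuffer prefix = ByteBuffer.wrap(bytes, 0, Math.max(0, maxBytes));
            return decoder.decode(prefix).toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE actions; kept because decode() declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+00E9 needs 2 bytes in UTF-8, so a 2-byte limit keeps only "h".
        System.out.println(truncate("h\u00e9llo", StandardCharsets.UTF_8, 2)); // prints "h"
    }
}
```

Because the charset is a parameter, the same method should also work for encodings other than UTF-8, which the byte-inspecting versions cannot do.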
I think Rex Kerr's solution has 2 bugs.
Please find my corrected version below:
public String cut(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return s;
    }
    int n16 = 0;
    boolean extraLong = false;
    int i = 0;
    while (i < charLimit) {
        // Unicode characters above U+FFFF need 2 words in utf16
        extraLong = ((utf8[i] & 0xF0) == 0xF0);
        if ((utf8[i] & 0x80) == 0) {
            i += 1;
        } else {
            int b = utf8[i];
            while ((b & 0x80) > 0) {
                ++i;
                b = b << 1;
            }
        }
        if (i <= charLimit) {
            n16 += (extraLong) ? 2 : 1;
        }
    }
    return s.substring(0, n16);
}
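I still think this is far from efficient. So if you don't really need the String representation of the result and a byte array will do, you can use this: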
private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return utf8;
    }
    // Checking only the 0x80 bit here would treat a lead byte as "inside a sequence"
    // and needlessly drop the complete character before it.
    if ((utf8[charLimit] & 0xC0) != 0x80) {
        // the first dropped byte starts a new character: the limit doesn't cut a UTF-8 sequence
        return Arrays.copyOf(utf8, charLimit);
    }
    int i = 0;
    // step back over continuation bytes (10xxxxxx)
    while ((utf8[charLimit-i-1] & 0x80) > 0 && (utf8[charLimit-i-1] & 0x40) == 0) {
        ++i;
    }
    if ((utf8[charLimit-i-1] & 0x80) > 0) {
        // we have to skip the starter UTF-8 byte
        return Arrays.copyOf(utf8, charLimit-i-1);
    } else {
        // we passed all UTF-8 bytes
        return Arrays.copyOf(utf8, charLimit-i);
    }
}
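Interestingly, with a realistic 20-500 byte limit they perform almost the same if you create a String from the byte array again.
Please note that both methods assume valid UTF-8 input, which is a valid assumption after using Java's getBytes() function.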
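The validity assumption can be checked with a strict decoder. The sketch below (class and method names are hypothetical) encodes a string containing an unpaired surrogate, which has no UTF-8 form, and confirms the resulting bytes still decode cleanly; as far as I know `String.getBytes(Charset)` substitutes the charset's replacement bytes in that case rather than emitting malformed output.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class GetBytesValidity {
    // Hypothetical helper: true if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 'a', an unpaired high surrogate, 'b'
        String odd = new String(new char[] {'a', (char) 0xD800, 'b'});
        byte[] encoded = odd.getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(encoded));                   // prints true
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC3})); // prints false: truncated sequence
    }
}
```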