Truncating a string by bytes

Question

I have created the following to truncate a string in Java to a new string with a given number of bytes.

        String truncatedValue = "";
        String currentValue = string;
        int pivotIndex = (int) Math.round(((double) string.length())/2);
        while(!truncatedValue.equals(currentValue)){
            currentValue = string.substring(0,pivotIndex);
            byte[] bytes = null;
            bytes = currentValue.getBytes(encoding);
            if(bytes==null){
                return string;
            }
            int byteLength = bytes.length;
            int newIndex =  (int) Math.round(((double) pivotIndex)/2);
            if(byteLength > maxBytesLength){
                pivotIndex = newIndex;
            } else if(byteLength < maxBytesLength){
                pivotIndex = pivotIndex + 1;
            } else {
                truncatedValue = currentValue;
            }
        }
        return truncatedValue;


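It was the first thing that came to mind, and I know I could improve on it. I saw another post asking a similar question, but there they truncated strings using the bytes instead of String.substring. I think I would rather use String.substring in my case.

EDIT: I just removed the UTF8 reference because I would rather be able to do this for different storage types as well.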

Answer 1

Rex*_*err 13

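Why not convert to bytes and walk forward, obeying UTF-8 character boundaries as you go, until you have reached the maximum number, then convert those bytes back into a string?

Or you could just cut the original string, if you keep track of where the cut should occur: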

// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
public class UTF8Cutter {
  public static String cut(String s, int n) {
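    // Request UTF-8 explicitly; the no-argument getBytes() uses the platform default charset.
    byte[] utf8 = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);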
    if (utf8.length < n) n = utf8.length;
    int n16 = 0;
    int advance = 1;
    int i = 0;
    while (i < n) {
      advance = 1;
      if ((utf8[i] & 0x80) == 0) i += 1;
      else if ((utf8[i] & 0xE0) == 0xC0) i += 2;
      else if ((utf8[i] & 0xF0) == 0xE0) i += 3;
      else { i += 4; advance = 2; }
      if (i <= n) n16 += advance;
    }
    return s.substring(0,n16);
  }
}

Note: edited on 2014-08-25 to fix bugs.

Answer 2

kan*_*kan 7

A more sane solution is to use a decoder:

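final Charset CHARSET = Charset.forName("UTF-8"); // or any other charset
final byte[] bytes = inputString.getBytes(CHARSET);
final CharsetDecoder decoder = CHARSET.newDecoder();
// Silently drop a multi-byte sequence that the cut leaves incomplete.
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.reset();
// limit is the maximum number of bytes to keep; clamp it so wrap() cannot run past the array.
final CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes, 0, Math.min(limit, bytes.length)));
final String outputString = decoded.toString();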


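Cutting at an arbitrary byte index can create invalid encoded data, because a single character may use multiple bytes (especially in UTF-8). Worse, with other encodings it may produce wrong but valid characters, which will not be ignored. You can easily avoid this by first allocating a ByteBuffer of the desired size and then using it with a CharsetEncoder, which automatically encodes only the valid characters that fit into the buffer, and then decoding the buffer into a String. It is a similar approach, but without the error, and even more efficient, since it does not encode characters beyond the intended limit. (2 upvotes)
Yes, for UTF-8 with CodingErrorAction.IGNORE it will do the right thing. But the OP said "I would rather be able to do this for different storage types as well", and for other encodings, splitting a multi-byte sequence may result in valid (but wrong) characters. (2 upvotes)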
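
The encoder-based approach described in the first comment above can be sketched roughly as follows. This is a minimal illustration, not code from the thread: the class name EncoderTruncator and the method truncate are hypothetical, and it assumes a stateless charset (stateful encodings that need a trailing flush, or charsets that emit a byte-order mark, would need extra care). Instead of decoding the buffer again, it uses the number of chars the encoder consumed to cut the original string, which should give the same result.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class EncoderTruncator {
    /**
     * Returns the longest prefix of s whose encoding in the given charset
     * fits into maxBytes bytes (hypothetical helper, for illustration only).
     */
    public static String truncate(String s, Charset charset, int maxBytes) {
        CharsetEncoder encoder = charset.newEncoder()
                // Mirror String.getBytes(): replace bad input instead of failing.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // The output buffer is exactly as large as the byte budget.
        ByteBuffer out = ByteBuffer.allocate(Math.max(0, maxBytes));
        CharBuffer in = CharBuffer.wrap(s);
        // The encoder stops with OVERFLOW before the first character that does
        // not fit completely, so it never writes a partial character.
        encoder.encode(in, out, true);
        // in.position() is the number of UTF-16 chars that were consumed.
        return s.substring(0, in.position());
    }
}
```

For example, EncoderTruncator.truncate("123456789á1", StandardCharsets.UTF_8, 10) keeps only "123456789", because "á" needs two bytes and only one is left.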

Answer 3

Zso*_*kai 5

I think the solution of Rex Kerr has 2 bugs.

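First, it will truncate to limit + 1 if a non-ASCII character is right before the limit. Truncating "123456789á1" will yield "123456789á", which takes 11 bytes in UTF-8.
Second, I think he misinterpreted the UTF standard. https://en.wikipedia.org/wiki/UTF-8#Description shows that a 110xxxxx byte at the beginning of a UTF-8 sequence tells us that the representation is 2 bytes long (as opposed to 3). That is the reason his implementation usually does not use up all the available space (as Nissim Avitan noted).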
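
To make the lead-byte rule concrete, here is a small sketch that maps the first byte of a UTF-8 sequence to the sequence length in bytes. The class name Utf8LeadByte and the method sequenceLength are hypothetical names chosen for this illustration; the bit patterns follow the table on the Wikipedia page linked above.

```java
public class Utf8LeadByte {
    /** Length in bytes of the UTF-8 sequence that starts with the given lead byte. */
    public static int sequenceLength(byte lead) {
        if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: needs a surrogate pair in UTF-16
        // 10xxxxxx is a continuation byte and can never start a sequence.
        throw new IllegalArgumentException("not a UTF-8 lead byte");
    }
}
```

With this rule, "á" (lead byte 0xC3) counts as 2 bytes, so "123456789á" is 9 + 2 = 11 bytes long.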

Please find my corrected version below:

public String cut(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return s;
    }
    int n16 = 0;
    boolean extraLong = false;
    int i = 0;
    while (i < charLimit) {
        // Unicode characters above U+FFFF need 2 words in utf16
        extraLong = ((utf8[i] & 0xF0) == 0xF0);
        if ((utf8[i] & 0x80) == 0) {
            i += 1;
        } else {
            int b = utf8[i];
            while ((b & 0x80) > 0) {
                ++i;
                b = b << 1;
            }
        }
        if (i <= charLimit) {
            n16 += (extraLong) ? 2 : 1;
        }
    }
    return s.substring(0, n16);
}

I still think this is far from efficient. So if you do not really need the String representation of the result and a byte array will do, you can use this:

// Needs java.util.Arrays and java.io.UnsupportedEncodingException.
private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return utf8;
    }
    // utf8[end] is the first byte that would be dropped. While it is a
    // continuation byte (10xxxxxx), the cut would split a multi-byte
    // sequence, so step back until it lands on a sequence boundary.
    int end = Math.max(charLimit, 0);
    while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
        --end;
    }
    // The cut is now on a boundary: every kept sequence is complete.
    return Arrays.copyOf(utf8, end);
}


Interestingly, with a realistic 20-500 byte limit the two perform pretty much the same if you create a String from the byte array again.

Please note that both methods assume valid UTF-8 input, which is a valid assumption after using Java's getBytes() function.
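
The assumption about getBytes() can be checked with a short sketch. The class name GetBytesValidity and the helper utf8 are hypothetical; the behaviour relied on is that String.getBytes(Charset) replaces malformed input, such as an unpaired surrogate, with the charset's replacement bytes (a single '?' for UTF-8), so the resulting array is always well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class GetBytesValidity {
    /** Encodes s as UTF-8, as the methods above do. */
    public static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \uD83D is a lone high surrogate, which is not a valid character on its own.
        byte[] bytes = utf8("a\uD83Db");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b));
        }
        // The lone surrogate is encoded as 3F ('?'), not as an invalid byte sequence.
        System.out.println(hex.toString().trim());
    }
}
```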
