Split a string with a byte-length limit in Java

KYH*_*ode 5 java string character-encoding

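I want to split a String into a String[] array whose elements meet the following conditions.

  • s.getBytes(encoding).length should not exceed maxsize (an int).

  • If I join the split strings with a StringBuilder or the + operator, the result should be exactly the original string.

  • The input string may contain Unicode characters that take multiple bytes when encoded in, for example, UTF-8.

The desired prototype is shown below.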
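To make the third condition concrete, here is a minimal sketch of how many bytes different characters take in UTF-8. The class name `Utf8ByteLengths` and the helper `utf8Length` are hypothetical, introduced only for this illustration; they are not part of the question.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteLengths {
    // Number of bytes the given text occupies once encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "a",             // U+0061, 1 char, 1 byte
            "\u00e9",        // U+00E9, 1 char, 2 bytes
            "\u20ac",        // U+20AC, 1 char, 3 bytes
            "\uD83D\uDE00"   // U+1F600, 2 chars (surrogate pair), 4 bytes
        };
        for (String s : samples) {
            System.out.println(s.length() + " char(s) -> " + utf8Length(s) + " byte(s)");
        }
    }
}
```

Because the number of bytes per char varies, a split position cannot be computed from the char index alone, and cutting between the two halves of a surrogate pair would leave chunks that no longer encode to the original bytes.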

public static String[] SplitStringByByteLength(String src,String encoding, int maxsize)

And the test code:

public boolean isNice(String str, String encoding, int max) throws UnsupportedEncodingException
{
    StringBuilder b = new StringBuilder();
    String[] splitted = SplitStringByByteLength(str, encoding, max);
    for (String s : splitted)
    {
        if (s.getBytes(encoding).length > max)
            return false;               // a chunk exceeds the byte limit
        b.append(s);
    }
    // the concatenated chunks must reproduce the original string
    return str.equals(b.toString());
}

This seems easy when the input string contains only ASCII characters, but the fact that multibyte characters can be mixed in confuses me.

Thank you in advance.

Edit: I added my own implementation. (Inefficient)

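import java.io.UnsupportedEncodingException;
import java.util.ArrayList;

public static String[] SplitStringByByteLength(String src, String encoding, int maxsize) throws UnsupportedEncodingException
{
    ArrayList<String> splitted = new ArrayList<String>();
    StringBuilder builder = new StringBuilder();
    // Java strings are not '\0'-terminated, so iterate up to length()
    for (int i = 0; i < src.length(); ++i)
    {
        String tmp = builder.toString();     // current chunk before appending the next char
        char c = src.charAt(i);
        builder.append(c);
        if (builder.toString().getBytes(encoding).length > maxsize)
        {
            splitted.add(tmp);
            builder = new StringBuilder();
            builder.append(c);               // the char that did not fit starts the next chunk
        }
    }
    if (builder.length() > 0)
        splitted.add(builder.toString());    // do not lose the final chunk
    // Note: this works char by char, so it can still cut a surrogate pair in half.
    return splitted.toArray(new String[splitted.size()]);
}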

Is this the only way to solve this problem?

Ser*_*sta 8

The CharsetEncoder class already covers your requirement. Extract from the Javadoc of the encode method:

public final CoderResult encode(CharBuffer in,
                            ByteBuffer out,
                            boolean endOfInput)

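Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer...

In addition to reading characters from the input buffer and writing bytes to the output buffer, this method returns a CoderResult object to describe its reason for termination:

...

CoderResult.OVERFLOW indicates that there is insufficient space in the output buffer to encode any more characters. This method should be invoked again with an output buffer that has more remaining bytes. This is typically done by draining any encoded bytes from the output buffer.

A possible implementation could be: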
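To see the OVERFLOW behaviour in isolation, here is a minimal sketch that encodes into a deliberately small buffer and reports how many chars were consumed. The class `OverflowDemo` and the method `charsThatFit` are hypothetical names used only for this illustration, and UTF-8 is assumed as the target encoding.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class OverflowDemo {
    // Encodes as much of `text` as fits into `capacity` bytes of UTF-8 and
    // returns the number of chars consumed from the input.
    public static int charsThatFit(String text, int capacity) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(capacity);
        CoderResult result = encoder.encode(in, out, true);
        if (result.isError()) {
            // malformed or unmappable input, e.g. a lone surrogate
            throw new IllegalArgumentException("cannot encode input: " + result);
        }
        // On OVERFLOW the input position stops at the first character
        // that did not fit; on UNDERFLOW all input was consumed.
        return in.position();
    }

    public static void main(String[] args) {
        String text = "a\u20acb";                       // 1 + 3 + 1 bytes in UTF-8
        System.out.println(charsThatFit(text, 3));      // only "a" fits
        System.out.println(charsThatFit(text, 4));      // "a" and the euro sign fit
        System.out.println(charsThatFit(text, 5));      // everything fits
    }
}
```

The encoder does not write a partial character: with 3 bytes of space, the 3-byte euro sign is left unconsumed after the 1-byte `a`, which is what makes the input position usable as a split point.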

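import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.util.ArrayList;
import java.util.List;

public static String[] SplitStringByByteLength(String src, String encoding, int maxsize) {
    Charset cs = Charset.forName(encoding);
    CharsetEncoder coder = cs.newEncoder();
    ByteBuffer out = ByteBuffer.allocate(maxsize);  // output buffer of required size
    CharBuffer in = CharBuffer.wrap(src);
    List<String> ss = new ArrayList<>();            // a list to store the chunks
    int pos = 0;
    while (true) {
        CoderResult cr = coder.encode(in, out, true); // try to encode as much as possible
        if (cr.isError()) {
            // malformed or unmappable input; stopping silently would drop the rest of src
            throw new IllegalArgumentException("cannot encode input: " + cr);
        }
        int newpos = src.length() - in.length();    // in.length() is the number of chars not yet consumed
        if (cr.isOverflow() && newpos == pos) {
            // maxsize cannot hold even one character; without this check the loop would never end
            throw new IllegalArgumentException("maxsize too small: " + maxsize);
        }
        String s = src.substring(pos, newpos);
        ss.add(s);                                  // add what has been encoded to the list
        pos = newpos;                               // store new input position
        out.clear();                                // and reset the output buffer for the next chunk
        if (!cr.isOverflow()) {
            break;                                  // everything has been encoded
        }
    }
    return ss.toArray(new String[0]);
}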

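This splits the original string into chunks that, when encoded to bytes, fit as closely as possible into byte arrays of the given size (assuming, of course, that maxsize is not ridiculously small).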
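As a way to check the two conditions from the question against this approach, here is a self-contained sketch. It restates the encoder-based split in compact form so that it compiles on its own; the class `SplitCheck`, the methods `split` and `isNice`, and the use of a `Charset` parameter instead of an encoding name are assumptions made for this illustration. It is only exercised with UTF-8 here; stateful encodings may need extra care.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitCheck {
    // Splits src into chunks whose encoded form is at most maxBytes long.
    public static List<String> split(String src, Charset cs, int maxBytes) {
        CharsetEncoder encoder = cs.newEncoder();
        CharBuffer in = CharBuffer.wrap(src);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (in.hasRemaining()) {
            CoderResult cr = encoder.encode(in, out, true);
            if (cr.isError()) {
                throw new IllegalArgumentException("cannot encode input: " + cr);
            }
            int end = in.position();
            if (end == start) {
                // not even one character fits into maxBytes
                throw new IllegalArgumentException("maxBytes too small at index " + start);
            }
            chunks.add(src.substring(start, end));
            start = end;
            out.clear();    // discard the encoded bytes, only the split points matter
        }
        return chunks;
    }

    // Checks the two conditions: byte limit per chunk and lossless rejoin.
    public static boolean isNice(String src, Charset cs, int maxBytes) {
        StringBuilder joined = new StringBuilder();
        for (String chunk : split(src, cs, maxBytes)) {
            if (chunk.getBytes(cs).length > maxBytes) {
                return false;
            }
            joined.append(chunk);
        }
        return src.equals(joined.toString());
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b";   // 'a', one emoji (surrogate pair), 'b'
        System.out.println(split(text, StandardCharsets.UTF_8, 4).size());
        System.out.println(isNice(text, StandardCharsets.UTF_8, 4));
    }
}
```

With a limit of 4 bytes the surrogate pair stays together in its own chunk, since the encoder treats it as one 4-byte character. A limit smaller than the widest character in the input makes `split` throw rather than loop forever.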