在包含 UTF-8 数据的字节数组中查找最近的安全拆分

Siz*_*e43 3 c# string encoding utf-8

我想拆分大量 UTF-8 编码数据,以便可以并行化将其解码为字符。

似乎没有办法找出Encoding.GetCharCount读取了多少字节。我也无法使用,GetByteCount(GetChars(...))因为它无论如何都会解码整个数组,这是我试图避免的。

410*_*one 6

UTF-8 具有明确定义的字节序列,被认为是自同步的,这意味着给定任何位置,bytes您都可以找到该位置的字符开始的位置。

UTF-8 规范(维基百科是最简单的链接)定义了以下字节序列:

0_______ : ASCII (0-127) char
10______ : Continuation
110_____ : Two-byte character
1110____ : Three-byte character
11110___ : Four-byte character
Run Code Online (Sandbox Code Playgroud)

因此,以下方法(或类似方法)应该会得到您的结果:

  1. 获取bytes( bytes.Length, et. al.)的字节数
  2. 确定要分成多少个部分
  3. 选择字节 byteCount / sectionCount
  4. 针对表的测试字节:
    1. 如果byte & 0x80 == 0x00那么您可以将此字节作为任一部分的一部分
    2. 如果byte & 0xE0 == 0xC0那么您需要向前查找一个字节,并将其保留在当前部分
    3. 如果byte & 0xF0 == 0xE0那么您需要向前搜索两个字节,并将其保留在当前部分
    4. 如果byte & 0xF8 == 0xF0那么您需要向前搜索三个字节,并将其保留在当前部分
    5. 如果byte & 0xC0 == 0x80那时您处于延续状态,并且应该向前搜索直到第一个不适合的字节val & 0xB0 == 0x80,则在当前部分中保持(但不包括)此值
  5. 选择byteStart通过byteCount + offsetwhereoffset可以通过上面的测试来定义
  6. 对每个部分重复。

当然,如果我们将测试重新定义为返回当前字符的起始位置,我们有两种情况: 1. 如果(byte[i] & 0xC0) == 0x80我们需要在数组中移动 2. 否则,返回当前i(因为它不是延续)

这为我们提供了以下方法:

public static int GetCharStart(ref byte[] arr, int index) =>
    (arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
Run Code Online (Sandbox Code Playgroud)

接下来,我们要获取每个部分。最简单的方法是使用状态机(或滥用,取决于您如何看待它)来返回部分:

public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
    var sectionStart = 0;
    var sectionEnd = 0;

    for (var i = 0; i < sectionCount; i++)
    {
        sectionEnd = i == (sectionCount - 1) ? utf8Array.Length : GetCharStart(ref utf8Array, (int)Math.Round((double)utf8Array.Length / sectionCount * i));
        yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
        sectionStart = sectionEnd;
    }
}
Run Code Online (Sandbox Code Playgroud)

现在我以这种方式构建它是因为我想用它Parallel.ForEach来演示结果,如果我们有一个,这使得它变得非常容易IEnumerable,并且它还允许我在处理时非常懒惰:我们只在需要时继续收集部分,这意味着我们可以懒惰地处理它并按需进行,这是一件好事,不是吗?

最后,我们需要能够获得一段字节,所以我们有这个GetSection方法:

public static byte[] GetSection(ref byte[] array, int start, int end)
{
    var result = new byte[end - start];
    for (var i = 0; i < result.Length; i++)
    {
        result[i] = array[i + start];
    }
    return result;
}
Run Code Online (Sandbox Code Playgroud)

最后,演示:

var sourceText = "Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.";
var source = Encoding.UTF8.GetBytes(sourceText);
Console.WriteLine(sourceText);

var results = new ConcurrentBag<string>();
Parallel.ForEach(GetByteSections(source, 10),
                    new ParallelOptions { MaxDegreeOfParallelism = 1 },
                    x => { Console.WriteLine(Encoding.UTF8.GetString(x)); results.Add(Encoding.UTF8.GetString(x)); });

Console.WriteLine();
Console.WriteLine("Assemble the result: ");
Console.WriteLine(string.Join("", results.Reverse()));
Console.ReadLine();
Run Code Online (Sandbox Code Playgroud)

结果:

Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.

Some test ???, ??
?? string that should b
e decoded in parallel, thi
s demonstrates that we work
 flawlessly with Parallel.
ForEach. The only downside
to using `Parallel.ForEach`
 the way I demonstrate is
that it doesn't take order into account, but oh-well.

Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.
Run Code Online (Sandbox Code Playgroud)

不完美,但它可以完成工作。如果我们更改MaxDegreesOfParallelism为更高的值,我们的字符串会变得混乱:

Some test ???, ??
e decoded in parallel, thi
 flawlessly with Parallel.
?? string that should b
to using `Parallel.ForEach`
ForEach. The only downside
that it doesn't take order into account, but oh-well.
s demonstrates that we work
 the way I demonstrate is
Run Code Online (Sandbox Code Playgroud)

所以,如你所见,超级简单。您将需要进行修改以允许正确的顺序重新组装,但这应该证明了这个技巧。

如果我们修改GetByteSections方法如下,最后一部分不再是剩余部分的 2 倍大小:

public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
    var sectionStart = 0;
    var sectionEnd = 0;
    var sectionSize = (int)Math.Ceiling((double)utf8Array.Length / sectionCount);

    for (var i = 0; i < sectionCount; i++)
    {
        if (i == (sectionCount - 1))
        {
            var lengthRem = utf8Array.Length - i * sectionSize;
            sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
            sectionStart = sectionEnd;
            sectionEnd = utf8Array.Length;
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
        }
        else
        {
            sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
            sectionStart = sectionEnd;
        }
    }
}
Run Code Online (Sandbox Code Playgroud)

结果:

Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.

Some test ???, ???? string that should be de
coded in parallel, this demonstrates that we work flawless
ly with Parallel.ForEach. The only downside to using `Para
llel.ForEach` the way I demonstrate is that it doesn't tak
e order into account, but oh-well. We can continue to incr
ease the length of this string to demonstrate that the las
t section is usually about double the size of the other se
ctions, we could fix that if we really wanted to. In fact,
 with a small modification it does so, we just have to rem
ember that we'll end up with `sectionCount + 1` results.

Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.
Run Code Online (Sandbox Code Playgroud)

最后,如果由于某种原因,与输入大小相比,您分成了异常多的部分(我的输入大小为 250 个字符的 ~578 字节证明了这一点),您将遇到IndexOutOfRangeExceptionin GetCharStart,以下版本修复了该问题:

public static int GetCharStart(ref byte[] arr, int index)
{
    if (index > arr.Length)
    {
        index = arr.Length - 1;
    }

    return (arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
}
Run Code Online (Sandbox Code Playgroud)

当然,这会给您留下一堆空结果,但是当您重新组合字符串时不会改变,所以我什至不打算在这里发布完整的场景测试。(我把它留给你去试验。)