Most efficient way to find a pattern in a byte array

Question

I have the following code:

var file = //Memory stream with a file in it
var bytes = file.ToArray();

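I need to search bytes for the first occurrence (if any) of a specific byte sequence: 0xff, 0xd8. (The goal is to find images embedded in files.)

So if, for example, bytes[6501] contains 0xff and bytes[6502] contains 0xd8, that is a match, and I need either the index of the match (6501) or a new array that is a copy of the byte array without the elements below index 6501.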

My current solution is a loop:

 for (var index = 0; index < bytes.Length; index++)
 {
     if((new byte[] {0xff, 0xd8}).SequenceEqual(bytes.Skip(index).Take(2)))
    ...

But it is very slow when working with larger files.

Is there a more efficient way to handle this?

Answer 1

Cyg*_*gon 9

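If this is time-critical code: I have found that C# compilers (both Mono's implementation and Microsoft's) have special logic to optimize simple scan loops.

So, based on profiling experience, I would implement the sequence search with a hard-coded search for the first element, like this: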

/// <summary>Looks for the next occurrence of a sequence in a byte array</summary>
/// <param name="array">Array that will be scanned</param>
/// <param name="start">Index in the array at which scanning will begin</param>
/// <param name="sequence">Sequence the array will be scanned for</param>
/// <returns>
///   The index of the next occurrence of the sequence, or -1 if not found
/// </returns>
private static int findSequence(byte[] array, int start, byte[] sequence) {
  int end = array.Length - sequence.Length; // past here no match is possible
  byte firstByte = sequence[0]; // cached to tell compiler there's no aliasing

  while(start <= end) {
    // scan for first byte only. compiler-friendly.
    if(array[start] == firstByte) {
      // scan for rest of sequence
      for (int offset = 1;; ++offset) {
        if(offset == sequence.Length) { // full sequence matched?
          return start;
        } else if(array[start + offset] != sequence[offset]) {
          break;
        }
      }
    }
    ++start;
  }

  // end of array reached without match
  return -1;
}

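This is quite a bit longer than the other suggestions and prone to off-by-one errors, but if you are scanning a large amount of data or doing this for frequent device I/O, this setup avoids feeding the garbage collector and optimizes very well.

Edit 2019-10-03: Fixed the issue pointed out by Warren Rox. Thanks! Test: https://ideone.com/mmACYj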
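
For the scenario in the question, the routine might be used roughly as follows. This is only a sketch: the sample bytes are made up, `FindSequence` repeats the scanning logic from above so the block compiles on its own, and `CopyFrom` is a hypothetical helper name, not something from the question or from any library.

```csharp
using System;

public static class MarkerSearchDemo
{
    // Same scanning strategy as findSequence above, repeated for self-containment.
    public static int FindSequence(byte[] array, int start, byte[] sequence)
    {
        int end = array.Length - sequence.Length; // past here no match is possible
        byte firstByte = sequence[0];

        while (start <= end)
        {
            if (array[start] == firstByte)
            {
                for (int offset = 1; ; ++offset)
                {
                    if (offset == sequence.Length) return start;          // full match
                    if (array[start + offset] != sequence[offset]) break; // mismatch
                }
            }
            ++start;
        }
        return -1;
    }

    // Hypothetical helper: copies everything from `index` to the end of `bytes`.
    public static byte[] CopyFrom(byte[] bytes, int index)
    {
        byte[] tail = new byte[bytes.Length - index];
        Array.Copy(bytes, index, tail, 0, tail.Length);
        return tail;
    }

    public static void Main()
    {
        byte[] bytes = { 0x00, 0xff, 0x10, 0xff, 0xd8, 0xff, 0xe0 }; // made-up sample data
        byte[] marker = { 0xff, 0xd8 };

        int index = FindSequence(bytes, 0, marker);
        if (index >= 0)
        {
            byte[] image = CopyFrom(bytes, index);
            Console.WriteLine($"marker at {index}, {image.Length} bytes from there on");
        }
        else
        {
            Console.WriteLine("marker not found");
        }
    }
}
```

Only the final copy allocates; the scan itself touches no heap memory, which is the point made above about not feeding the garbage collector.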

Answer 2

Dbl*_*Dbl 2

You want to use a for loop to check your array. The reason your code is slow is simple.

Decompiling shows why:

public static IEnumerable<TSource> Skip<TSource>(this IEnumerable<TSource> source, int count)
{
  if (source == null)
    throw Error.ArgumentNull("source");
  else
    return Enumerable.SkipIterator<TSource>(source, count);
}

private static IEnumerable<TSource> SkipIterator<TSource>(IEnumerable<TSource> source, int count)
{
  using (IEnumerator<TSource> enumerator = source.GetEnumerator())
  {
    while (count > 0 && enumerator.MoveNext())
      --count;
    if (count <= 0)
    {
      while (enumerator.MoveNext())
        yield return enumerator.Current;
    }
  }
}

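On every iteration of your for loop you perform a Skip, which walks the array again from the beginning, so the total work grows roughly with the square of the array length.

Some LINQ operations contain optimizations that use the indexer where possible. Unfortunately, in the implementation decompiled above, Skip is not one of them.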
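
The cost can be made visible by counting how many elements are pulled from the source. The sketch below is illustrative only: `CountingSource` is a hypothetical wrapper that hides the array behind a plain iterator, on the assumption that some newer runtimes may special-case arrays and lists in Skip, so the wrapper forces the enumerate-and-discard path shown in the decompiled code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SkipCostDemo
{
    // Hypothetical wrapper: exposes the bytes only through an iterator
    // and counts every element that is pulled from it.
    public sealed class CountingSource
    {
        private readonly byte[] data;
        public long Pulled;

        public CountingSource(byte[] data) { this.data = data; }

        public IEnumerable<byte> Items()
        {
            foreach (byte b in data)
            {
                Pulled++;
                yield return b;
            }
        }
    }

    // Runs the question's loop over `length` zero bytes and returns
    // the total number of elements enumerated along the way.
    public static long CountPulls(int length)
    {
        var source = new CountingSource(new byte[length]);
        var pattern = new byte[] { 0xff, 0xd8 };

        for (int index = 0; index < length; index++)
        {
            // Result ignored on purpose; only the enumeration cost matters here.
            pattern.SequenceEqual(source.Items().Skip(index).Take(2));
        }
        return source.Pulled;
    }

    public static void Main()
    {
        foreach (int n in new[] { 1000, 2000, 4000 })
        {
            Console.WriteLine($"{n} bytes: {CountPulls(n)} elements enumerated");
        }
    }
}
```

Doubling the input should roughly quadruple the count, whereas an indexed for loop reads each byte only a constant number of times.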

P.S.:

If I were you, I would change your code to something like this:

var search = new byte[] {0xff, 0xd8};
var current = new byte[2];
var maxSearchRange = bytes.Length -1;
for (var index = 0; index < maxSearchRange; index++)
{
   current[0] = bytes[index];
   current[1] = bytes[index+1];

   if((search).SequenceEqual(current))
       ...
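
One possible way to finish the loop above, sketched under the assumption that only the index of the first match is needed. It is a variant rather than the snippet itself: it compares the two bytes directly instead of filling a scratch array and calling SequenceEqual. `IndexOfPair` is a hypothetical name.

```csharp
using System;

public static class TwoByteSearch
{
    // Returns the index of the first position where `first` is immediately
    // followed by `second`, or -1 if the pair does not occur.
    public static int IndexOfPair(byte[] bytes, byte first, byte second)
    {
        // Stop one short of the end so that bytes[index + 1] stays in range.
        for (int index = 0; index < bytes.Length - 1; index++)
        {
            if (bytes[index] == first && bytes[index + 1] == second)
            {
                return index;
            }
        }
        return -1;
    }

    public static void Main()
    {
        byte[] bytes = { 0x12, 0xff, 0x00, 0xff, 0xd8, 0x34 }; // made-up sample data
        Console.WriteLine(IndexOfPair(bytes, 0xff, 0xd8));
    }
}
```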
