查找indexOf另一个字节数组中的字节数组

Question

查找indexOf另一个字节数组中的字节数组

CL2*_*L22 28 java search bytearray

给定一个字节数组,我怎样才能在其中找到(较小的)字节数组的位置？

这个文档看起来很有前途,ArrayUtils但是如果我是正确的,它只会让我在数组中找到要搜索的单个字节.

(我看不出来有问题,但以防万一:有时搜索字节数组将是常规ASCII字符,有时它将是控制字符或扩展的ASCII字符.因此使用字符串操作并不总是合适的)

大数组可以在10到大约10000字节之间,较小的数组可以在10左右.在某些情况下,我将在一次搜索中在较大的数组中找到几个较小的数组.而且我有时想要找到实例的最后一个索引而不是第一个索引.

Answer 1

mor*_*s05 36

最简单的方法是比较每个元素:

public int indexOf(byte[] outerArray, byte[] smallerArray) {
    for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
        boolean found = true;
        for(int j = 0; j < smallerArray.length; ++j) {
           if (outerArray[i+j] != smallerArray[j]) {
               found = false;
               break;
           }
        }
        if (found) return i;
     }
   return -1;  
}

Run Code Online (Sandbox Code Playgroud)

一些测试:

@Test
public void testIndexOf() {
  byte[] outer = {1, 2, 3, 4};
  assertEquals(0, indexOf(outer, new byte[]{1, 2}));
  assertEquals(1, indexOf(outer, new byte[]{2, 3}));
  assertEquals(2, indexOf(outer, new byte[]{3, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}

Run Code Online (Sandbox Code Playgroud)

当你更新你的问题时:Java Strings是UTF-16字符串,他们不关心扩展的ASCII集,所以你可以使用string.indexOf()

Answer 2

ant*_*lma 22

Google的Guava提供了Bytes.indexOf(byte []数组,byte []目标).

如果你在类路径上有它,你知道它为什么不使用它？ (8认同)
...实现为双[for-loop](https://github.com/google/guava/blob/master/guava/src/com/google/common/primitives/Bytes.java#L113). ..不需要图书馆,是吗？ (2认同)

Answer 3

Mon*_*ded 7

这是你想要的？

public class KPM {
    /**
     * Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
     * @param data
     * @param start First index in data
     * @param stop Last index in data so that stop-start = length
     * @param pattern What is being searched. '*' can be used as wildcard for "ANY character"
     * @return
     */
    public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
        if( data == null || pattern == null) return -1;

        int[] failure = computeFailure(pattern);

        int j = 0;

        for( int i = start; i < stop; i++) {
            while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
                j = failure[j - 1];
            }
            if (pattern[j] == '*' || pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j>0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }

        return failure;
    }
}

Run Code Online (Sandbox Code Playgroud)

Answer 4

小智 6

从java.lang.String复制几乎相同。

indexOf(char[],int,int,char[]int,int,int)

static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    byte first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first)
                ;
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
                ;

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}

Run Code Online (Sandbox Code Playgroud)

Answer 5

184*_*615 5

为了节省测试时间:

http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence

给你的代码,如果你使computeFailure()静态:

public class KPM {
    /**
     * Search the data byte array for the first occurrence 
     * of the byte array pattern.
     */
    public static int indexOf(byte[] data, byte[] pattern) {
    int[] failure = computeFailure(pattern);

    int j = 0;

    for (int i = 0; i < data.length; i++) {
        while (j > 0 && pattern[j] != data[i]) {
            j = failure[j - 1];
        }
        if (pattern[j] == data[i]) { 
            j++; 
        }
        if (j == pattern.length) {
            return i - pattern.length + 1;
        }
    }
    return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
    int[] failure = new int[pattern.length];

    int j = 0;
    for (int i = 1; i < pattern.length; i++) {
        while (j>0 && pattern[j] != pattern[i]) {
            j = failure[j - 1];
        }
        if (pattern[j] == pattern[i]) {
            j++;
        }
        failure[i] = j;
    }

    return failure;
    }
}

Run Code Online (Sandbox Code Playgroud)

由于测试您借用的代码总是明智的,因此您可以从以下开始:

public class Test {
    public static void main(String[] args) {
        do_test1();
    }
    static void do_test1() {
      String[] ss = { "",
                    "\r\n\r\n",
                    "\n\n",
                    "\r\n\r\nthis is a test",
                    "this is a test\r\n\r\n",
                    "this is a test\r\n\r\nthis si a test",
                    "this is a test\r\n\r\nthis si a test\r\n\r\n",
                    "this is a test\n\r\nthis si a test",
                    "this is a test\r\nthis si a test\r\n\r\n",
                    "this is a test"
                };
      for (String s: ss) {
        System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
      }

    }
}

Run Code Online (Sandbox Code Playgroud)

Answer 6

Bul*_*aza 5

使用Knuth–Morris–Pratt algorithm是最有效的方法。

StreamSearcher.java是它的一个实现，是Twitter's elephant-birdproject 的一部分。

不建议包含这个库，因为它对于只使用一个类来说相当大。

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher
{
    private byte[] pattern_;
    private int[] borders_;

    // An upper bound on pattern length for searching. Results are undefined for longer patterns.
    @SuppressWarnings("unused")
    public static final int MAX_PATTERN_LENGTH = 1024;

    StreamSearcher(byte[] pattern)
    {
        setPattern(pattern);
    }

    /**
     * Sets a new pattern for this StreamSearcher to use.
     *
     * @param pattern the pattern the StreamSearcher will look for in future calls to search(...)
     */
    public void setPattern(byte[] pattern)
    {
        pattern_ = Arrays.copyOf(pattern, pattern.length);
        borders_ = new int[pattern_.length + 1];
        preProcess();
    }

    /**
     * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
     * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
     * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
     * another reasonable default, i.e. leave the stream unchanged.
     *
     * @return bytes consumed if found, -1 otherwise.
     */
    long search(InputStream stream) throws IOException
    {
        long bytesRead = 0;

        int b;
        int j = 0;

        while ((b = stream.read()) != -1)
        {
            bytesRead++;

            while (j >= 0 && (byte) b != pattern_[j])
            {
                j = borders_[j];
            }
            // Move to the next character in the pattern.
            ++j;

            // If we've matched up to the full pattern length, we found it.  Return,
            // which will automatically save our position in the InputStream at the point immediately
            // following the pattern match.
            if (j == pattern_.length)
            {
                return bytesRead;
            }
        }

        // No dice, Note that the stream is now completely consumed.
        return -1;
    }

    /**
     * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
     * and aids in implementation of the Knuth-Moore-Pratt string search.
     * <p>
     * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
     */
    private void preProcess()
    {
        int i = 0;
        int j = -1;
        borders_[i] = j;
        while (i < pattern_.length)
        {
            while (j >= 0 && pattern_[i] != pattern_[j])
            {
                j = borders_[j];
            }
            borders_[++i] = ++j;
        }
    }
}

Run Code Online (Sandbox Code Playgroud)

Answer 7

das*_*ght 0

Java 字符串由 16 位组成char，而不是由 8 位组成byte。Achar可以保存 a byte，因此您始终可以将字节数组转换为字符串，并使用indexOf: ASCII 字符、控制字符，甚至零字符都可以正常工作。

\n\n

这是一个演示：

\n\n

byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};\nbyte[] small = new byte[] {7,0,8,9,0,0,1};\nString bigStr = new String(big, StandardCharsets.UTF_8);\nString smallStr = new String(small, StandardCharsets.UTF_8);\nSystem.out.println(bigStr.indexOf(smallStr));\n

Run Code Online (Sandbox Code Playgroud)\n\n

这打印7.

\n\n

但是，考虑到您的大数组可能高达 10,000 字节，而小数组只有 10 个字节，此解决方案可能不是最有效的，原因有两个：

\n\n

它需要将大数组复制到两倍大的数组中（相同容量，但用char代替byte）。这将使您的内存需求增加三倍。
Java 的字符串搜索算法并不是最快的算法。如果您实现一种高级算法（例如Knuth\xe2\x80\x93Morris\xe2\x80\x93Pratt算法），您的速度可能会足够快。这可能会使执行速度降低多达十倍（小字符串的长度），并且需要与小字符串的长度成正比的额外内存，而不是与大字符串的长度成正比的内存。

\n

不要这样做；这不是编码的工作原理。这个答案有两个问题： (8认同)
首先：并非所有字节序列都是有效的 UTF-8 流，在这些情况下此代码将失败。例如，尝试在字符串“{-16,-112,40,-68}”中查找“{-61, 40}”：您的代码返回“0”，因为这两个都是 Java 替换的无效序列UTF-8 的默认替换字符。 (6认同)
其次，即使对于有效的 UTF-8 流，也不会返回正确的结果：例如，当您解码 ASCII 字符序列时，Java 将填充每个字节以获取字符；它不会为每个字符打包两个字节，除非它们是单个 Unicode 字符编码的一部分。这意味着当您搜索的模式在smallStr和largeStr中以不同的方式分割为多个字符时，代码将是错误的。（从概念上讲，您会在包含“xy, z”的序列中查找“x, y, z”，打包为“x, yz”）。 (6认同)
值得注意的是，单 byte[] 构造函数使用平台默认字符集。这“可能”是某种 UTF-8，但你不能太确定。最好使用[允许您指定它的其他构造函数](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#String%28byte[],% 20java.lang.String%29) IE `new String(bytes, "UTF-8")`。 (2认同)

归档时间：	11 年，9 月前
查看次数：	37757 次
最近记录：	7 年，3 月前