查找indexOf另一个字节数组中的字节数组

CL2*_*L22 28 java search bytearray

给定一个字节数组,我怎样才能在其中找到(较小的)字节数组的位置?

这个文档看起来很有前途,ArrayUtils但是如果我是正确的,它只会让我在数组中找到要搜索的单个字节.

(我看不出来有问题,但以防万一:有时搜索字节数组将是常规ASCII字符,有时它将是控制字符或扩展的ASCII字符.因此使用字符串操作并不总是合适的)

大数组可以在10到大约10000字节之间,较小的数组可以在10左右.在某些情况下,我将在一次搜索中在较大的数组中找到几个较小的数组.而且我有时想要找到实例的最后一个索引而不是第一个索引.

mor*_*s05 36

最简单的方法是比较每个元素:

public int indexOf(byte[] outerArray, byte[] smallerArray) {
    for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
        boolean found = true;
        for(int j = 0; j < smallerArray.length; ++j) {
           if (outerArray[i+j] != smallerArray[j]) {
               found = false;
               break;
           }
        }
        if (found) return i;
     }
   return -1;  
}  
Run Code Online (Sandbox Code Playgroud)

一些测试:

@Test
public void testIndexOf() {
  byte[] outer = {1, 2, 3, 4};
  assertEquals(0, indexOf(outer, new byte[]{1, 2}));
  assertEquals(1, indexOf(outer, new byte[]{2, 3}));
  assertEquals(2, indexOf(outer, new byte[]{3, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
  assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}
Run Code Online (Sandbox Code Playgroud)

当你更新你的问题时:Java Strings是UTF-16字符串,他们不关心扩展的ASCII集,所以你可以使用string.indexOf()


ant*_*lma 22

Google的Guava提供了Bytes.indexOf(byte []数组,byte []目标).

  • 如果你在类路径上有它,你知道它为什么不使用它? (8认同)
  • ...实现为双[for-loop](https://github.com/google/guava/blob/master/guava/src/com/google/common/primitives/Bytes.java#L113). ..不需要图书馆,是吗? (2认同)

Mon*_*ded 7

这是你想要的?

public class KPM {
    /**
     * Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
     * @param data
     * @param start First index in data
     * @param stop Last index in data so that stop-start = length
     * @param pattern What is being searched. '*' can be used as wildcard for "ANY character"
     * @return
     */
    public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
        if( data == null || pattern == null) return -1;

        int[] failure = computeFailure(pattern);

        int j = 0;

        for( int i = start; i < stop; i++) {
            while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
                j = failure[j - 1];
            }
            if (pattern[j] == '*' || pattern[j] == data[i]) {
                j++;
            }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j>0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }

        return failure;
    }
}
Run Code Online (Sandbox Code Playgroud)


小智 6

java.lang.String复制几乎相同。

indexOf(char[],int,int,char[]int,int,int)

static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
    if (fromIndex >= sourceCount) {
        return (targetCount == 0 ? sourceCount : -1);
    }
    if (fromIndex < 0) {
        fromIndex = 0;
    }
    if (targetCount == 0) {
        return fromIndex;
    }

    byte first = target[targetOffset];
    int max = sourceOffset + (sourceCount - targetCount);

    for (int i = sourceOffset + fromIndex; i <= max; i++) {
        /* Look for first character. */
        if (source[i] != first) {
            while (++i <= max && source[i] != first)
                ;
        }

        /* Found first character, now look at the rest of v2 */
        if (i <= max) {
            int j = i + 1;
            int end = j + targetCount - 1;
            for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
                ;

            if (j == end) {
                /* Found whole string. */
                return i - sourceOffset;
            }
        }
    }
    return -1;
}
Run Code Online (Sandbox Code Playgroud)


184*_*615 5

为了节省测试时间:

http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence

给你的代码,如果你使computeFailure()静态:

public class KPM {
    /**
     * Search the data byte array for the first occurrence 
     * of the byte array pattern.
     */
    public static int indexOf(byte[] data, byte[] pattern) {
    int[] failure = computeFailure(pattern);

    int j = 0;

    for (int i = 0; i < data.length; i++) {
        while (j > 0 && pattern[j] != data[i]) {
            j = failure[j - 1];
        }
        if (pattern[j] == data[i]) { 
            j++; 
        }
        if (j == pattern.length) {
            return i - pattern.length + 1;
        }
    }
    return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
    int[] failure = new int[pattern.length];

    int j = 0;
    for (int i = 1; i < pattern.length; i++) {
        while (j>0 && pattern[j] != pattern[i]) {
            j = failure[j - 1];
        }
        if (pattern[j] == pattern[i]) {
            j++;
        }
        failure[i] = j;
    }

    return failure;
    }
}
Run Code Online (Sandbox Code Playgroud)

由于测试您借用的代码总是明智的,因此您可以从以下开始:

public class Test {
    public static void main(String[] args) {
        do_test1();
    }
    static void do_test1() {
      String[] ss = { "",
                    "\r\n\r\n",
                    "\n\n",
                    "\r\n\r\nthis is a test",
                    "this is a test\r\n\r\n",
                    "this is a test\r\n\r\nthis si a test",
                    "this is a test\r\n\r\nthis si a test\r\n\r\n",
                    "this is a test\n\r\nthis si a test",
                    "this is a test\r\nthis si a test\r\n\r\n",
                    "this is a test"
                };
      for (String s: ss) {
        System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
      }

    }
}
Run Code Online (Sandbox Code Playgroud)


Bul*_*aza 5

使用Knuth–Morris–Pratt algorithm是最有效的方法。

StreamSearcher.java是它的一个实现,是Twitter's elephant-birdproject 的一部分。

不建议包含这个库,因为它对于只使用一个类来说相当大。

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher
{
    private byte[] pattern_;
    private int[] borders_;

    // An upper bound on pattern length for searching. Results are undefined for longer patterns.
    @SuppressWarnings("unused")
    public static final int MAX_PATTERN_LENGTH = 1024;

    StreamSearcher(byte[] pattern)
    {
        setPattern(pattern);
    }

    /**
     * Sets a new pattern for this StreamSearcher to use.
     *
     * @param pattern the pattern the StreamSearcher will look for in future calls to search(...)
     */
    public void setPattern(byte[] pattern)
    {
        pattern_ = Arrays.copyOf(pattern, pattern.length);
        borders_ = new int[pattern_.length + 1];
        preProcess();
    }

    /**
     * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
     * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
     * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
     * another reasonable default, i.e. leave the stream unchanged.
     *
     * @return bytes consumed if found, -1 otherwise.
     */
    long search(InputStream stream) throws IOException
    {
        long bytesRead = 0;

        int b;
        int j = 0;

        while ((b = stream.read()) != -1)
        {
            bytesRead++;

            while (j >= 0 && (byte) b != pattern_[j])
            {
                j = borders_[j];
            }
            // Move to the next character in the pattern.
            ++j;

            // If we've matched up to the full pattern length, we found it.  Return,
            // which will automatically save our position in the InputStream at the point immediately
            // following the pattern match.
            if (j == pattern_.length)
            {
                return bytesRead;
            }
        }

        // No dice, Note that the stream is now completely consumed.
        return -1;
    }

    /**
     * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
     * and aids in implementation of the Knuth-Moore-Pratt string search.
     * <p>
     * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
     */
    private void preProcess()
    {
        int i = 0;
        int j = -1;
        borders_[i] = j;
        while (i < pattern_.length)
        {
            while (j >= 0 && pattern_[i] != pattern_[j])
            {
                j = borders_[j];
            }
            borders_[++i] = ++j;
        }
    }
}
Run Code Online (Sandbox Code Playgroud)


das*_*ght 0

Java 字符串由 16 位组成char,而不是由 8 位组成byte。Achar可以保存 a byte,因此您始终可以将字节数组转换为字符串,并使用indexOf: ASCII 字符、控制字符,甚至零字符都可以正常工作。

\n\n

这是一个演示:

\n\n
byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};\nbyte[] small = new byte[] {7,0,8,9,0,0,1};\nString bigStr = new String(big, StandardCharsets.UTF_8);\nString smallStr = new String(small, StandardCharsets.UTF_8);\nSystem.out.println(bigStr.indexOf(smallStr));\n
Run Code Online (Sandbox Code Playgroud)\n\n

这打印7.

\n\n

但是,考虑到您的大数组可能高达 10,000 字节,而小数组只有 10 个字节,此解决方案可能不是最有效的,原因有两个:

\n\n
    \n
  • 它需要将大数组复制到两倍大的数组中(相同容量,但用char代替byte)。这将使您的内存需求增加三倍。
  • \n
  • Java 的字符串搜索算法并不是最快的算法。如果您实现一种高级算法(例如Knuth\xe2\x80\x93Morris\xe2\x80\x93Pratt算法),您的速度可能会足够快。这可能会使执行速度降低多达十倍(小字符串的长度),并且需要与小字符串的长度成正比的额外内存,而不是与大字符串的长度成正比的内存。
  • \n
\n

  • 不要这样做;这不是编码的工作原理。这个答案有两个问题: (8认同)
  • 首先:并非所有字节序列都是有效的 UTF-8 流,在这些情况下此代码将失败。例如,尝试在字符串“{-16,-112,40,-68}”中查找“{-61, 40}”:您的代码返回“0”,因为这两个都是 Java 替换的无效序列UTF-8 的默认替换字符。 (6认同)
  • 其次,即使对于有效的 UTF-8 流,也不会返回正确的结果:例如,当您解码 ASCII 字符序列时,Java 将填充每个字节以获取字符;它不会为每个字符打包两个字节,除非它们是单个 Unicode 字符编码的一部分。这意味着当您搜索的模式在smallStr和largeStr中以不同的方式分割为多个字符时,代码将是错误的。(从概念上讲,您会在包含“xy, z”的序列中查找“x, y, z”,打包为“x, yz”)。 (6认同)
  • 值得注意的是,单 byte[] 构造函数使用平台默认字符集。这“可能”是某种 UTF-8,但你不能太确定。最好使用[允许您指定它的其他构造函数](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#String%28byte[],% 20java.lang.String%29) IE `new String(bytes, "UTF-8")`。 (2认同)