CL2*_*L22 28 java search bytearray
给定一个字节数组,我怎样才能在其中找到(较小的)字节数组的位置?
这个文档看起来很有前途,ArrayUtils但是如果我是正确的,它只会让我在数组中找到要搜索的单个字节.
(我看不出来有问题,但以防万一:有时搜索字节数组将是常规ASCII字符,有时它将是控制字符或扩展的ASCII字符.因此使用字符串操作并不总是合适的)
大数组可以在10到大约10000字节之间,较小的数组可以在10左右.在某些情况下,我将在一次搜索中在较大的数组中找到几个较小的数组.而且我有时想要找到实例的最后一个索引而不是第一个索引.
mor*_*s05 36
最简单的方法是比较每个元素:
public int indexOf(byte[] outerArray, byte[] smallerArray) {
for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
boolean found = true;
for(int j = 0; j < smallerArray.length; ++j) {
if (outerArray[i+j] != smallerArray[j]) {
found = false;
break;
}
}
if (found) return i;
}
return -1;
}
Run Code Online (Sandbox Code Playgroud)
一些测试:
@Test
public void testIndexOf() {
byte[] outer = {1, 2, 3, 4};
assertEquals(0, indexOf(outer, new byte[]{1, 2}));
assertEquals(1, indexOf(outer, new byte[]{2, 3}));
assertEquals(2, indexOf(outer, new byte[]{3, 4}));
assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}
Run Code Online (Sandbox Code Playgroud)
当你更新你的问题时:Java Strings是UTF-16字符串,他们不关心扩展的ASCII集,所以你可以使用string.indexOf()
ant*_*lma 22
Google的Guava提供了Bytes.indexOf(byte []数组,byte []目标).
这是你想要的?
public class KPM {
/**
* Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
* @param data
* @param start First index in data
* @param stop Last index in data so that stop-start = length
* @param pattern What is being searched. '*' can be used as wildcard for "ANY character"
* @return
*/
public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
if( data == null || pattern == null) return -1;
int[] failure = computeFailure(pattern);
int j = 0;
for( int i = start; i < stop; i++) {
while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
j = failure[j - 1];
}
if (pattern[j] == '*' || pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}
Run Code Online (Sandbox Code Playgroud)
小智 6
从java.lang.String复制几乎相同。
indexOf(char[],int,int,char[]int,int,int)
static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
byte first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first)
;
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
;
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}
Run Code Online (Sandbox Code Playgroud)
为了节省测试时间:
http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence
给你的代码,如果你使computeFailure()静态:
public class KPM {
/**
* Search the data byte array for the first occurrence
* of the byte array pattern.
*/
public static int indexOf(byte[] data, byte[] pattern) {
int[] failure = computeFailure(pattern);
int j = 0;
for (int i = 0; i < data.length; i++) {
while (j > 0 && pattern[j] != data[i]) {
j = failure[j - 1];
}
if (pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}
Run Code Online (Sandbox Code Playgroud)
由于测试您借用的代码总是明智的,因此您可以从以下开始:
public class Test {
public static void main(String[] args) {
do_test1();
}
static void do_test1() {
String[] ss = { "",
"\r\n\r\n",
"\n\n",
"\r\n\r\nthis is a test",
"this is a test\r\n\r\n",
"this is a test\r\n\r\nthis si a test",
"this is a test\r\n\r\nthis si a test\r\n\r\n",
"this is a test\n\r\nthis si a test",
"this is a test\r\nthis si a test\r\n\r\n",
"this is a test"
};
for (String s: ss) {
System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
}
}
}
Run Code Online (Sandbox Code Playgroud)
使用Knuth–Morris–Pratt algorithm是最有效的方法。
StreamSearcher.java是它的一个实现,是Twitter's elephant-birdproject 的一部分。
不建议包含这个库,因为它对于只使用一个类来说相当大。
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
/**
* An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
* For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
public class StreamSearcher
{
private byte[] pattern_;
private int[] borders_;
// An upper bound on pattern length for searching. Results are undefined for longer patterns.
@SuppressWarnings("unused")
public static final int MAX_PATTERN_LENGTH = 1024;
StreamSearcher(byte[] pattern)
{
setPattern(pattern);
}
/**
* Sets a new pattern for this StreamSearcher to use.
*
* @param pattern the pattern the StreamSearcher will look for in future calls to search(...)
*/
public void setPattern(byte[] pattern)
{
pattern_ = Arrays.copyOf(pattern, pattern.length);
borders_ = new int[pattern_.length + 1];
preProcess();
}
/**
* Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
* that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
* byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
* another reasonable default, i.e. leave the stream unchanged.
*
* @return bytes consumed if found, -1 otherwise.
*/
long search(InputStream stream) throws IOException
{
long bytesRead = 0;
int b;
int j = 0;
while ((b = stream.read()) != -1)
{
bytesRead++;
while (j >= 0 && (byte) b != pattern_[j])
{
j = borders_[j];
}
// Move to the next character in the pattern.
++j;
// If we've matched up to the full pattern length, we found it. Return,
// which will automatically save our position in the InputStream at the point immediately
// following the pattern match.
if (j == pattern_.length)
{
return bytesRead;
}
}
// No dice, Note that the stream is now completely consumed.
return -1;
}
/**
* Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
* and aids in implementation of the Knuth-Moore-Pratt string search.
* <p>
* For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
private void preProcess()
{
int i = 0;
int j = -1;
borders_[i] = j;
while (i < pattern_.length)
{
while (j >= 0 && pattern_[i] != pattern_[j])
{
j = borders_[j];
}
borders_[++i] = ++j;
}
}
}
Run Code Online (Sandbox Code Playgroud)
Java 字符串由 16 位组成char,而不是由 8 位组成byte。Achar可以保存 a byte,因此您始终可以将字节数组转换为字符串,并使用indexOf: ASCII 字符、控制字符,甚至零字符都可以正常工作。
这是一个演示:
\n\nbyte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};\nbyte[] small = new byte[] {7,0,8,9,0,0,1};\nString bigStr = new String(big, StandardCharsets.UTF_8);\nString smallStr = new String(small, StandardCharsets.UTF_8);\nSystem.out.println(bigStr.indexOf(smallStr));\nRun Code Online (Sandbox Code Playgroud)\n\n\n\n但是,考虑到您的大数组可能高达 10,000 字节,而小数组只有 10 个字节,此解决方案可能不是最有效的,原因有两个:
\n\nchar代替byte)。这将使您的内存需求增加三倍。| 归档时间: |
|
| 查看次数: |
37757 次 |
| 最近记录: |