用于查找2个字符串之间任意长度的所有共享子串的算法,然后计算字符串2中的出现次数?

Bra*_*cal 15 c string algorithm tree nodes

我遇到了一个不寻常的挑战,到目前为止我无法确定最有效的攻击方法.


给出以下2个字符串作为示例,查找任意长度的2个字符串之间的所有共享子字符串,并计算字符串2中所有这些共享子字符串的出现次数.您的算法还需要能够计算之间的共享子字符串包含最大100MB或更大的字符串的文件.

例:

字符串1:ABCDE512ABC361EG51D

字符串2:ADE5AHDW4131EG1DG5C

给定这2个字符串,该算法将找到以下共享子串:A,C,D,E,5,1,3,G,DE,E5,EG,G5,1D,DE5,1EG

然后从这些共享的子串中,我们可以发现字符串2中每个子串的出现次数.

答:字符串2中出现2次

C:字符串2中出现1次

D:字符串2中出现3次

等等..


我用来解决这个问题的第一种方法是粗暴地强迫我使用2个嵌套for循环计算公共共享子串 - 显然效率最低但是这是一种快速而肮脏的方式来了解预期的输出应该使用较小的测试输入和最慢的运行时间,大约2分钟来计算包含ascii字符串大小为50kb的2个文件之间的所有常见共享子字符串.将大小增加到1mb使得由于计算此次必须发生的大量嵌套迭代而导致这种情况急剧停止.

接下来的方法是使用树 - 看看我可以用多少内存来优化计算时间.这种方法要快得多.使用蛮力方法花费2分钟的两个50kb文件几乎是即时的.对1mb文件运行速度非常快(秒)但是当我继续测试越来越大的文件大小时,由于树的大小,我很快就开始遇到内存问题.


注意:字符串文件只包含ASCII字符!


编辑:

我正在进一步升级,请看:

https://gist.github.com/braydo25/f7a9ce7ce7ad7c5fb11ec511887789bc

Nic*_*ler 9

这里有一些代码说明了我在上面的评论中提出的想法.虽然它是可运行的C++代码,但在使用的数据结构肯定不是最优的意义上它是更多的伪代码,但它们允许清楚地查看算法.

struct Occurrence
{
    //The vectors contain indices to the first character of the occurrence in ...
    std::vector<size_t> s1;  // ... string 1 and ...
    std::vector<size_t> s2;  // ... string 2.
};

int main()
{
    //If you cannot load the entire strings in memory, a memory-mapped file might be
    //worth considering
    std::string s1 = "ABCDE512ABC361EG51D";
    std::string s2 = "ADE5AHDW4131EG1DG5C";

    //These vectors store the occurrences of substrings for the current and next length
    std::vector<Occurrence> occurrences, nextOccurrences;
    int length = 1;

    std::map<char, Occurrence> occurrenceMap;
    //Initialize occurrences
    for (int i = 0; i < s1.length(); ++i)
        occurrenceMap[s1[i]].s1.push_back(i);
    for (int i = 0; i < s2.length(); ++i)
        occurrenceMap[s2[i]].s2.push_back(i);

    for (auto& pair : occurrenceMap)
    {
        if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
            occurrences.push_back(std::move(pair.second));
    }

    do
    {
        nextOccurrences.clear();

        std::cout << "Length " << length << std::endl;
        for(auto& o : occurrences)
        {
            std::cout << std::string(s1.c_str() + o.s1[0], length) << " occurred "
                      << o.s1.size() << " / " << o.s2.size() << " times." << std::endl;

            //Expand the occurrence
            occurrenceMap.clear();
            for (auto p : o.s1)
            {
                if (p + length < s1.length())
                    occurrenceMap[s1[p + length]].s1.push_back(p);
            }                   
            for (auto p : o.s2)
            {
                if (p + length < s2.length())
                occurrenceMap[s2[p + length]].s2.push_back(p);
            }
            for (auto& pair : occurrenceMap)
            {
                if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
                    nextOccurrences.push_back(std::move(pair.second));
            }
        }

        ++length;
        std::swap(occurrences, nextOccurrences);

    } while (!occurrences.empty());


    return 0;
}
Run Code Online (Sandbox Code Playgroud)

输出:

Length 1
1 occurred 3 / 3 times.
3 occurred 1 / 1 times.
5 occurred 2 / 2 times.
A occurred 2 / 2 times.
C occurred 2 / 1 times.
D occurred 2 / 3 times.
E occurred 2 / 2 times.
G occurred 1 / 2 times.
Length 2
1D occurred 1 / 1 times.
1E occurred 1 / 1 times.
DE occurred 1 / 1 times.
E5 occurred 1 / 1 times.
EG occurred 1 / 1 times.
G5 occurred 1 / 1 times.
Length 3
1EG occurred 1 / 1 times.
DE5 occurred 1 / 1 times.
Run Code Online (Sandbox Code Playgroud)

初始化期间将使用最多的内存,因为两个输入字符串的每个字符都有一个条目.如果您知道字符串的大致长度,则可以选择比其更合适的索引数据类型size_t.所需的内存量按输入大小的顺序排列.因此,对于普通计算机来说,两个100 MB的文件应该没问题.在初始化之后(更具体地说,在循环的第一次迭代之后),大多数这些数据将被删除,因为不再需要它们.


Dav*_*tat 5

下面是基于在最长公共前缀数组的帮助下遍历输入串联的后缀数组的 C 实现。您可以将编程竞赛级 (O(n log^2 n)) 后缀数组实现替换为真正的后缀数组实现(O(n) 或 O(n log n)),以大幅提高性能。(编辑:这样做了,还有一些其他更改反映了提问者的新要求: https: //github.com/eisenstatdavid/commonsub。)

#include <inttypes.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef int_fast32_t I32;

#define Constrain(expression) _Static_assert(expression, #expression)
Constrain(CHAR_BIT == 8);
#define InputMaxBytes 80000000
Constrain(InputMaxBytes <= (INT_LEAST32_MAX - 2) / 2);
#define MaxLen (2 * InputMaxBytes + 2)
Constrain(MaxLen <= INT_FAST32_MAX / 2);

static I32 Len;
static I32 Begin2;
static signed char Buf[MaxLen];
static int_least32_t SufArr[MaxLen];
static int_least32_t SufRank[MaxLen];
static int_least32_t NewRank[MaxLen];
static int_least32_t *const LongCommPre = NewRank;  // aliased to save space
static uint_least64_t Bitmap2[(MaxLen >> 6) + 1];
static int_least32_t SparseCount2[(MaxLen >> 6) + 1];
static int_least32_t *const Stack = SufRank;  // aliased to save space

static void Slurp(const char *filename) {
  FILE *stream = fopen(filename, "r");
  if (stream == NULL) goto fail;
  I32 n = fread(Buf + Len, sizeof *Buf, InputMaxBytes + 1, stream);
  if (ferror(stream)) goto fail;
  if (n > InputMaxBytes) {
    fprintf(stderr, "%s: file is too large; increase InputMaxBytes\n",
            filename);
    exit(EXIT_FAILURE);
  }
  for (I32 i = 0; i < n; i++) {
    if (Buf[Len + i] < 0) {
      fprintf(stderr,
              "%s: file contains non-ASCII byte at offset %" PRIdFAST32 "\n",
              filename, i);
      exit(EXIT_FAILURE);
    }
  }
  Len += n;
  if (fclose(stream) == EOF) goto fail;
  return;
fail:
  perror(filename);
  exit(EXIT_FAILURE);
}

static I32 Radix;

static int CompareRankPairs(const void *iPtr, const void *jPtr) {
  I32 i = *(const int_least32_t *)iPtr;
  I32 j = *(const int_least32_t *)jPtr;
  if (SufRank[i] < SufRank[j]) return -1;
  if (SufRank[i] > SufRank[j]) return 1;
  I32 iRank = i + Radix < Len ? SufRank[i + Radix] : -2;
  I32 jRank = j + Radix < Len ? SufRank[j + Radix] : -2;
  if (iRank < jRank) return -1;
  if (iRank > jRank) return 1;
  return 0;
}

static void BuildSuffixArray(void) {
  for (I32 i = 0; i < Len; i++) {
    SufArr[i] = i;
    SufRank[i] = Buf[i];
  }
  for (Radix = 1; true; Radix *= 2) {
    qsort(SufArr, Len, sizeof *SufArr, CompareRankPairs);
    NewRank[0] = 0;
    for (I32 i = 1; i < Len; i++) {
      NewRank[i] = CompareRankPairs(&SufArr[i - 1], &SufArr[i]) == 0
                       ? NewRank[i - 1]
                       : NewRank[i - 1] + 1;
    }
    for (I32 i = 0; i < Len; i++) {
      SufRank[SufArr[i]] = NewRank[i];
    }
    if (NewRank[Len - 1] == Len - 1) break;
  }

  I32 lenCommPre = 0;
  for (I32 i = 0; i < Len; i++) {
    if (SufRank[i] == Len - 1) {
      LongCommPre[SufRank[i]] = -1;
      continue;
    }
    while (Buf[i + lenCommPre] == Buf[SufArr[SufRank[i] + 1] + lenCommPre]) {
      lenCommPre++;
    }
    LongCommPre[SufRank[i]] = lenCommPre;
    if (lenCommPre > 0) lenCommPre--;
  }
}

static I32 PopCount(uint_fast64_t x) {
  I32 v = 0;
  while (x != 0) {
    x &= x - 1;
    v++;
  }
  return v;
}

static void BuildCumCount2(void) {
  for (I32 i = 0; i < Len; i++) {
    if (SufArr[i] >= Begin2) {
      Bitmap2[i >> 6] |= UINT64_C(1) << (i & 63);
      SparseCount2[i >> 6]++;
    }
  }
  for (I32 i = 0; i < (Len >> 6); i++) {
    SparseCount2[i + 1] += SparseCount2[i];
  }
}

static I32 CumCount2(I32 i) {
  return SparseCount2[i >> 6] - PopCount(Bitmap2[i >> 6] >> (i & 63));
}

static void FindCommonStrings(void) {
  I32 lenCommPre = -1;
  for (I32 i = 0; i < Len; i++) {
    while (lenCommPre > LongCommPre[i]) {
      I32 begin = Stack[lenCommPre];
      I32 end = i + 1;
      I32 count2 = CumCount2(end) - CumCount2(begin);
      if (count2 > 0 && count2 < end - begin && lenCommPre > 0) {
        printf("%" PRIdFAST32 "\t%.*s\n", count2, (int)lenCommPre,
               Buf + SufArr[begin]);
      }
      lenCommPre--;
    }
    while (lenCommPre < LongCommPre[i]) {
      lenCommPre++;
      Stack[lenCommPre] = i;
    }
  }
}

int main(int argc, char *argv[]) {
  if (argc != 3) {
    fputs("usage: commonsub needle haystack\n", stderr);
    exit(EXIT_FAILURE);
  }
  Len = 0;
  Slurp(argv[1]);
  Buf[Len] = -1;
  Len++;
  Begin2 = Len;
  Slurp(argv[2]);
  Buf[Len] = -2;  // sentinel
  BuildSuffixArray();
  if (false) {
    for (I32 i = 0; i < Len; i++) {
      printf("%" PRIdFAST32 "\t%" PRIdLEAST32 "\t%" PRIdLEAST32 "\t%.*s\n", i,
             SufArr[i], LongCommPre[i], (int)(Len - SufArr[i]),
             Buf + SufArr[i]);
    }
  }
  BuildCumCount2();
  FindCommonStrings();
}
Run Code Online (Sandbox Code Playgroud)