pap*_*uma 2 c undefined-behavior boyer-moore
我正在尝试编写没有前缀函数的博耶摩尔算法的更简单版本。它必须打印与模式进行比较的符号的所有位置。在本地它通过了测试,但是当我将其提交到gitlab时它失败了。我在这里无法发现未定义的行为。我最后有垃圾。
#include <stdio.h>
#define MAX_PATTERN_LEN 16
#define BUF_SIZE 69
#define ALPH_SIZE 128
int read_str(int max_len, unsigned char *str_place) {
int count_read = 0;
for (int i = 0, ch; i < max_len; i++) {
if ((ch = getchar()) == '\n') {
str_place[i] = '\0';
break;
}
str_place[i] = (char)ch;
count_read++;
}
return count_read;
}
void calculate_shifts(const unsigned char *str, int len_str, int *badchar) {
for (int i = 0; i < ALPH_SIZE; i++)
badchar[i] = len_str;
for (int i = 0; i < len_str - 1; i++)
badchar[str[i]] = len_str - 1 - i;
}
void search(const unsigned char *str, const unsigned char *patt, int len_patt, int len_str) {
int badchar[ALPH_SIZE];
calculate_shifts(patt, len_patt, badchar);
int shift = 0;
while (shift <= (len_str - len_patt)) {
int j = len_patt - 1;
for (; j >= 0 && patt[j] == str[shift + j]; j--)
printf("%d ", shift + j + 1);
if (j < 0) {
shift += ((shift + len_patt) < len_str) ? badchar[patt[len_patt - 1]] : 1;
} else {
printf("%d ", shift + j + 1);
int shift_addition = badchar[str[shift + j]];
if ((shift_addition == len_patt) && (j < len_patt - 1) && (patt[len_patt - 1] == patt[0]))
shift_addition--;
shift += shift_addition;
}
}
}
int main(void) {
unsigned char str[BUF_SIZE + 1];
unsigned char patt[MAX_PATTERN_LEN + 1];
int len_patt = read_str(MAX_PATTERN_LEN + 1, patt);
int len_str = read_str(BUF_SIZE + 1, str);
if (!len_patt || !len_str)
return 0;
search(str, patt, len_patt, len_str);
return 0;
}
Run Code Online (Sandbox Code Playgroud)
考试:
example
this is simple example
Run Code Online (Sandbox Code Playgroud)
正确的输出:
7 14 13 12 11 10 20 22 21 20 19 18 17 16
Run Code Online (Sandbox Code Playgroud)
实际输出:
7 14 13 12 11 10 20 22 21 20 19 18 17 16 28 ..
Run Code Online (Sandbox Code Playgroud)
差异(输出末尾的额外数字 28)可能是由于\n输入最后一行末尾缺少换行符 ( ) 造成的。我能够在本地重现您的两个输出(带和不带 28)。
$ gcc -g -O2 -W -Wall -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16
$ printf 'example\nthis is simple example; ' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16 28
Run Code Online (Sandbox Code Playgroud)
TL;DR(ch = getchar()) == '\n'通过更改为(ch = getchar()) == EOF || ch == '\n'和 更改#define ALPH_SIZE 128为 来修复代码中的错误#define ALPH_SIZE 256。
可能还有别的bug,我没检查过。如果您遇到任何其他错误,请在 StackOverflow 上的单独问题中提出。
这个答案的其余部分解释了我如何诊断这个错误。
gcc -fsanitize=address如果换行符丢失,AddressSanitizer )会显示内存访问问题:
$ gcc -g -O2 -W -Wall -fsanitize=address -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16
$ printf 'example\nthis is simple example; ' | ./t; echo
ASAN:DEADLYSIGNAL
=================================================================
==19294==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffe25fbe264 (pc 0x563d6542b204 bp 0x7ffe25fbe264 sp 0x7ffe9a1bbc70 T0)
==19294==The signal is caused by a READ memory access.
#0 0x563d6542b203 in search /tmp/t.c:33
#1 0x563d6542acb1 in main /tmp/t.c:54
#2 0x7f617ff96c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
#3 0x563d6542ad99 in _start (/tmp/t+0xd99)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/t.c:33 in search
==19294==ABORTING
Run Code Online (Sandbox Code Playgroud)
请注意,我必须多次重新运行它才能触发 AddressSanitizer 错误消息。
大多数情况下,这是由程序中的错误引起的,并且通常是未定义的行为。in search /tmp/t.c:33AddressSanitizer 输出中的子句指示源代码中发生错误读取的位置(即第 33 行, function )search。
我在您的代码中添加了一些额外的printfs 来查看第 33 行中发生的情况。输出为:
$ printf 'example\nthis is simple example; ' | ./t; echo
patt[6] == 101
str[6] == 115
patt[6] == 101
str[13] == 101
patt[5] == 108
str[12] == 108
patt[4] == 112
str[11] == 112
patt[3] == 109
str[10] == 109
patt[2] == 97
str[9] == 105
patt[6] == 101
str[19] == 112
patt[6] == 101
str[21] == 101
patt[5] == 108
str[20] == 108
patt[4] == 112
str[19] == 112
patt[3] == 109
str[18] == 109
patt[2] == 97
str[17] == 97
patt[1] == 120
str[16] == 120
patt[0] == 101
str[15] == 101
patt[6] == 101
str[27] == 255
patt[6] == 101
str[-369011759] == ??
ASAN:DEADLYSIGNAL
=================================================================
==19906==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffc6e5c0c71 (pc 0x5621e2363094 bp 0x0000ea0153d1 sp 0x7ffc845ab540 T0)
==19906==The signal is caused by a READ memory access.
#0 0x5621e2363093 in gett /tmp/t.c:33
#1 0x5621e236349b in search /tmp/t.c:42
#2 0x5621e2362e61 in main /tmp/t.c:63
#3 0x7f7063ee8c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
#4 0x5621e2362f49 in _start (/tmp/t+0xf49)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/t.c:33 in gett
==19906==ABORTING
Run Code Online (Sandbox Code Playgroud)
罪魁祸首是最后一次读取访问:str[-369011759] == ??,这表明shift + j变为-369011759,这显然是不正确的,因为正确的值是小的非负整数。
请注意,我必须多次重新运行它才能触发 Address Sanitizer 错误消息,并且数字 -369011759 每次都不同。
它看起来j有一个合理的值 6。要找出实际的错误,下一步是检查从哪里shift得到它的大负值。
此时输出线str[27] == 255对我来说是可疑的。我检查了你的代码是否正确处理这么大的值str,然后我发现了一个错误。
修复这个新发现的错误的一种方法是ALPH_SIZE从更改128为256. (我已经验证它可以使地址消毒器错误消失。)原因如下。第 39 行包含 的读取badchar[str[shift + j]]。为了使其有效,我们需要0 <= str[shift + j] && str[shift + j] < ALPH_SIZE,因为ALPH_SIZE是数组的大小badchar。然而, 的最大值str[...]是 255(因为它是一个无符号字符),因此我们需要ALPH_SIZE >= 256.
事实上,越界数组访问(读或写)在 C 中是未定义的行为。为了证明这确实发生,我将第 39 行更改为
printf("%d ", shift + j + 1); fprintf(stderr, "RBC %d == %d\n", shift + j, str[shift + j]);
Run Code Online (Sandbox Code Playgroud)
,我还更改了第 28 行以使用 malloc:
int *badchar = malloc(sizeof(int) * ALPH_SIZE);
Run Code Online (Sandbox Code Playgroud)
,我free(badchar)后来也补充了。凭借我所拥有的一切:
$ gcc -g -O2 -W -Wall -include stdlib.h -fsanitize=address -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
RBC 6 == 115
RBC 9 == 105
RBC 19 == 112
7 14 13 12 11 10 20 22 21 20 19 18 17 16
$ printf 'example\nthis is simple example; ' | ./t; echo
RBC 6 == 115
RBC 9 == 105
RBC 19 == 112
RBC 27 == 255
=================================================================
==23463==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61500000047c at pc 0x563915b3754a bp 0x7fff5f340710 sp 0x7fff5f340700
READ of size 4 at 0x61500000047c thread T0
#0 0x563915b37549 in search /tmp/t.c:39
#1 0x563915b36dc1 in main /tmp/t.c:54
#2 0x7f1a63d6dc86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
#3 0x563915b36ea9 in _start (/tmp/t+0xea9)
Address 0x61500000047c is a wild pointer.
SUMMARY: AddressSanitizer: heap-buffer-overflow /tmp/t.c:39 in search
Shadow bytes around the buggy address:
0x0c2a7fff8030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2a7fff8040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2a7fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8070: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c2a7fff8080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]
0x0c2a7fff8090: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff80a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff80b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff80c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff80d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==23463==ABORTING
Run Code Online (Sandbox Code Playgroud)
事实上,AddressSanitizer 的输出表明了第 39 行中的错误,并且额外调用的输出fprintf确认了它是越界数组访问:读取数组的第 255 个元素badchar,该数组只有 128 个元素。
我们最终如何str[27] == 255(因此阅读badchar[255])?这很简单:如果读取 时没有\n在文件末尾终止str,则getchar()返回 EOF,当分配给unsigned char. (如上所述,明显的修复方法是在 EOF 处停止,并更改ALPH_SIZE为 256。)
badchar为什么在更改为使用(报告为堆缓冲区溢出)之前,AddressSanitizer 没有在第 39 行报告读取越界数组索引问题( malloc作为堆栈缓冲区溢出)?我还不太明白这一点。AddressSanitizer 应该能够可靠地报告堆栈缓冲区溢出。可能这取决于 GCC 和 Clang 版本,因为更改命令gcc可以clang使堆栈缓冲区溢出错误可靠地出现。升级到最新的 GCC 和 Clang 可能会有所帮助。Valgrind 在检测堆栈缓冲区溢出读取方面不如 AddressSanitizer 可靠。