调试C程序(Redis服务器)

ant*_*rez 24 c debugging assembly gdb redis

前提

嗨,

I received multiple reports from a Redis user that experienced server crashes, using a Redis stable release (latest, 2.4.6). The bug is strange since the user is not doing esoteric things, just working a lot with the sorted set type, and only with the ZADD, ZREM, and ZREVRANK commands. However it is strange that a bug like that, causing crashes after a few billion operations executed, was only experienced by a single user. Fortunately the user in question is extremely helpful and collaborated a lot in the tracking of the issue, so I was able to obtain many times logs with the exact sequence of operations performed by Redis, that I re-played locally without result, I also tried to write scripts to closely mimic the kind of work load, to perform in-depth code reviews of the skip list implementation, and so forth.

Even after all this efforts no way to reproduce the issue locally. It is also worth to mention that at some point the the user started sending the exact same traffic to another box running the same Redis version, but compiled with another gcc, and running in different hardware: so far no issues in this second instance. Still I want to understand what is happening.

So finally I setup a different strategy with the user and asked him to run Redis using gdb, in order to obtain a core file. Finally Redis crashed again and I now have both the core file and the executable. Unfortunately I forgot to ask the user to compile Redis without optimizations.

I need the stack overflow community help since with GDB I reach some conclusion but I've no really idea what could be happening here: at some point a function computes a pointer, and when it calls another function magically that pointer is different, pointing to a memory location that does not hold the right kind of data.

GDB session

The original executable was compiled with GCC 4.4.5-8, this is a GDB session that shows my investigation:

gdb ./redis-server core.16525
GNU gdb (GDB) 7.1-ubuntu
[snip]
Program terminated with signal 11, Segmentation fault.
#0  0x00007f3d9ecd216c in __pthread_rwlock_tryrdlock (rwlock=0x1)
    at pthread_rwlock_tryrdlock.c:46
46      pthread_rwlock_tryrdlock.c: No such file or directory.
        in pthread_rwlock_tryrdlock.c
Run Code Online (Sandbox Code Playgroud)

Actually the strack trace shown is about a secondary thread doing nothing (you can safely consider Redis a single-threaded app, the other threads are only used to perform things like fsync() against a file descriptor without blocking), let's select the right one.

(gdb) info threads
  3 Thread 16525  zslGetRank (zsl=0x7f3d8d71c360, score=19.498544884710096, 
    o=0x7f3d4cab5760) at t_zset.c:335
  2 Thread 16527  0x00007f3d9ecd216c in __pthread_rwlock_tryrdlock (
    rwlock=0x6b7f5) at pthread_rwlock_tryrdlock.c:46
* 1 Thread 16526  0x00007f3d9ecd216c in __pthread_rwlock_tryrdlock (rwlock=0x1)
    at pthread_rwlock_tryrdlock.c:46
(gdb) thread 3
[Switching to thread 3 (Thread 16525)]#0  zslGetRank (zsl=0x7f3d8d71c360, 
    score=19.498544884710096, o=0x7f3d4cab5760) at t_zset.c:335
335     t_zset.c: No such file or directory.
        in t_zset.c
(gdb) bt
#0  zslGetRank (zsl=0x7f3d8d71c360, score=19.498544884710096, o=0x7f3d4cab5760)
    at t_zset.c:335
#1  0x000000000042818b in zrankGenericCommand (c=0x7f3d9dcdc000, reverse=1)
    at t_zset.c:2046
#2  0x00000000004108d4 in call (c=0x7f3d9dcdc000) at redis.c:1024
#3  0x0000000000410c1c in processCommand (c=0x7f3d9dcdc000) at redis.c:1130
#4  0x0000000000419d3f in processInputBuffer (c=0x7f3d9dcdc000)
    at networking.c:865
#5  0x0000000000419e1c in readQueryFromClient (el=<value optimized out>, 
    fd=<value optimized out>, privdata=0x7f3d9dcdc000, 
    mask=<value optimized out>) at networking.c:908
#6  0x000000000040d4a3 in aeProcessEvents (eventLoop=0x7f3d9dc47000, 
    flags=<value optimized out>) at ae.c:342
#7  0x000000000040d6ee in aeMain (eventLoop=0x7f3d9dc47000) at ae.c:387
#8  0x0000000000412a4f in main (argc=2, argv=<value optimized out>)
    at redis.c:1719
Run Code Online (Sandbox Code Playgroud)

We also generated a backtrace. As you can see call() is dispatching the ZREVRANK command, so the zrankGenericCommand() is called with the client structure and the reverse=1 (since it is REV rank) argument. We can easily investigate to check what are the arguments of the ZREVRANK command.

(gdb) up
#1  0x000000000042818b in zrankGenericCommand (c=0x7f3d9dcdc000, reverse=1)
    at t_zset.c:2046
2046    in t_zset.c
(gdb) print c->argc
$8 = 3
(gdb) print (redisClient*)c->argc
$9 = (redisClient *) 0x3
(gdb) print (char*)(redisClient*)c->argv[0]->ptr
$10 = 0x7f3d8267ce28 "zrevrank"
(gdb) print (char*)(redisClient*)c->argv[1]->ptr
$11 = 0x7f3d8267ce48 "pc_stat.hkperc"
(gdb) print (long)(redisClient*)c->argv[2]->ptr
$12 = 282472606
Run Code Online (Sandbox Code Playgroud)

So the actual command generating the crash was: ZREVRANK pc_stat.hkperc 282472606 This is consistent with the client logs obtained by the user. Note that I casted the pointer to a long integer for the latest argument, since Redis encodes integers this way to save memory when possible.

Now that's fine, it is now time to investigate the zrankGenericCommand() that called zslGetRan() that caused the actual crash. This is the C source code of zrankGenericCommand around like 2046:

  2036      } else if (zobj->encoding == REDIS_ENCODING_SKIPLIST) {
  2037          zset *zs = zobj->ptr;
  2038          zskiplist *zsl = zs->zsl;
  2039          dictEntry *de;
  2040          double score;
  2041  
  2042          ele = c->argv[2] = tryObjectEncoding(c->argv[2]);
  2043          de = dictFind(zs->dict,ele);
  2044          if (de != NULL) {
  2045              score = *(double*)dictGetEntryVal(de);
  2046              rank = zslGetRank(zsl,score,ele);
  2047              redisAssert(rank); /* Existing elements always have a rank. */
  2048              if (reverse)
  2049                  addReplyLongLong(c,llen-rank);
  2050              else
  2051                  addReplyLongLong(c,rank-1);
  2052          } else {
  2053              addReply(c,shared.nullbulk);
  2054          }
  2055      }
Run Code Online (Sandbox Code Playgroud)

Ok this is how it works:

  • We lookup a Redis key, containing a sorted set data type (lookup not included in the code). The Redis Object associated with the key is stored in the zobj local variable.
  • The zobj ptr field is a pointer to a structure of type zset representing the sorted set.
  • In turn the zset structure has two pointers, one points to an hash table, and one to a skip list. This is needed since We both provide element-to-score lookups in O(1) for which we need an hash table, but also we take the elements ordered so we use a skip list. In line 2038 the pointer to the skip list (represented by a zskiplist structure) is assigned to the zsl variable.
  • Later we encode the third argument (line 2042), this is why we casted the value to a long to print it from the client structure.
  • In line 2043 we lookup the element from the dictionary, and the operation succeeds since we know that the function zslGetRank() side the if branch gets called.
  • Finally in line 2046 we call zslGetRank() with three arguments: the pointer to the skip list, the score of the element, and the element itself.

Fine... now what is the pointer that zslGetRank() should receive in theory? We can easily investigate this manually looking up the Redis hash table. I hashed manually the key and it maps to bucket 62 of the hash table, let's see if it is true:

(gdb) print (char*)c->db->dict->ht->table[62]->key
$13 = 0x7f3d9dc0f6c8 "pc_stat.hkperc"
Run Code Online (Sandbox Code Playgroud)

Exactly as expected. Let's check the object associated:

(gdb) print *(robj*)c->db->dict->ht->table[62]->val
$16 = {type = 3, storage = 0, encoding = 7, lru = 557869, refcount = 1, 
  ptr = 0x7f3d9de574b0}
Run Code Online (Sandbox Code Playgroud)

Type = 3, Encoding = 7, it means: it is a sorted set, encoded as a skip list. Fine again. The sorted set address (ptr field) is 0x7f3d9de574b0, so we can inspect this as well:

(gdb) print *(zset*)0x7f3d9de574b0
$17 = {dict = 0x7f3d9dcf6c20, zsl = 0x7f3d9de591c0}
Run Code Online (Sandbox Code Playgroud)

So we have:

  • The object associated to the key that points to a sorted set that is stored at address 0x7f3d9de574b0
  • In turn this sorted set is implemented with a skiplist, at address 0x7f3d9de591c0 (zsl field)

Now let's check if our two variables are set to the right values:

2037            zset *zs = zobj->ptr;
2038            zskiplist *zsl = zs->zsl;

(gdb) info locals
zs = 0x7f3d9de574b0
zsl = 0x7f3d9de591c0
de = <value optimized out>
ele = <value optimized out>
zobj = <value optimized out>
llen = 165312
rank = <value optimized out>
Run Code Online (Sandbox Code Playgroud)

Everything is perfect so far: the variable zs is set to 0x7f3d9de574b0 as expected, and so is the variable zsl pointing to the skiplist, that is set to 0x7f3d9de591c0.

Now this variables are no touched in the course of the code execution:

This are the only lines of code between the assignment of the variables and the call to the zslGetRank() function:

2042            ele = c->argv[2] = tryObjectEncoding(c->argv[2]);
2043            de = dictFind(zs->dict,ele);
2044            if (de != NULL) {
2045                score = *(double*)dictGetEntryVal(de);
2046                rank = zslGetRank(zsl,score,ele);
Run Code Online (Sandbox Code Playgroud)

Nobody is touching zsl, however if we check the stack trace we see that the zslGetRank() function gets called not with the address 0x7f3d9de591c0 as first argument, but with a different one:

#0  zslGetRank (zsl=0x7f3d8d71c360, score=19.498544884710096, o=0x7f3d4cab5760)
    at t_zset.c:335
Run Code Online (Sandbox Code Playgroud)

If you read all this you are an hero, and the reward is very small, consisting in this question: do you have an idea, even considering that hardware failure is an option, about how this argument gets modified? It seems very unlikely that the object encoding function or the hash table lookup can corrupt the stack of the caller (but apparently the argument is inside registers already at the time of the call). My assembler is not great, so if you have some clue... it is very welcomed. I'll left you with an info registers output and a disassemble:

(gdb) info registers
rax            0x6      6
rbx            0x7f3d9dcdc000   139902617239552
rcx            0xf742d0b6       4148351158
rdx            0x7f3d95efada0   139902485245344
rsi            0x7f3d4cab5760   139901256030048
rdi            0x7f3d8d71c360   139902342775648
rbp            0x7f3d4cab5760   0x7f3d4cab5760
rsp            0x7fffe61a8040   0x7fffe61a8040
r8             0x7fffe61a7fd9   140737053884377
r9             0x1      1
r10            0x7f3d9dcd4ff0   139902617210864
r11            0x6      6
r12            0x1      1
r13            0x7f3d9de574b0   139902618793136
r14            0x7f3d9de591c0   139902618800576
r15            0x7f3d8267c9e0   139902157572576
rip            0x42818b 0x42818b <zrankGenericCommand+251>
eflags         0x10206  [ PF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
(gdb) disassemble zrankGenericCommand
Dump of assembler code for function zrankGenericCommand:
   0x0000000000428090 <+0>:     mov    %rbx,-0x30(%rsp)
   0x0000000000428095 <+5>:     mov    %r12,-0x20(%rsp)
   0x000000000042809a <+10>:    mov    %esi,%r12d
   0x000000000042809d <+13>:    mov    %r14,-0x10(%rsp)
   0x00000000004280a2 <+18>:    mov    %rbp,-0x28(%rsp)
   0x00000000004280a7 <+23>:    mov    %rdi,%rbx
   0x00000000004280aa <+26>:    mov    %r13,-0x18(%rsp)
   0x00000000004280af <+31>:    mov    %r15,-0x8(%rsp)
   0x00000000004280b4 <+36>:    sub    $0x58,%rsp
   0x00000000004280b8 <+40>:    mov    0x28(%rdi),%rax
   0x00000000004280bc <+44>:    mov    0x23138d(%rip),%rdx        # 0x659450 <shared+80>
   0x00000000004280c3 <+51>:    mov    0x8(%rax),%rsi
   0x00000000004280c7 <+55>:    mov    0x10(%rax),%rbp
   0x00000000004280cb <+59>:    callq  0x41d370 <lookupKeyReadOrReply>
   0x00000000004280d0 <+64>:    test   %rax,%rax
   0x00000000004280d3 <+67>:    mov    %rax,%r14
   0x00000000004280d6 <+70>:    je     0x4280ec <zrankGenericCommand+92>
   0x00000000004280d8 <+72>:    mov    $0x3,%edx
   0x00000000004280dd <+77>:    mov    %rax,%rsi
   0x00000000004280e0 <+80>:    mov    %rbx,%rdi
   0x00000000004280e3 <+83>:    callq  0x41b270 <checkType>
   0x00000000004280e8 <+88>:    test   %eax,%eax
   0x00000000004280ea <+90>:    je     0x428110 <zrankGenericCommand+128>
   0x00000000004280ec <+92>:    mov    0x28(%rsp),%rbx
   0x00000000004280f1 <+97>:    mov    0x30(%rsp),%rbp
   0x00000000004280f6 <+102>:   mov    0x38(%rsp),%r12
   0x00000000004280fb <+107>:   mov    0x40(%rsp),%r13
   0x0000000000428100 <+112>:   mov    0x48(%rsp),%r14
   0x0000000000428105 <+117>:   mov    0x50(%rsp),%r15
   0x000000000042810a <+122>:   add    $0x58,%rsp
   0x000000000042810e <+126>:   retq   
   0x000000000042810f <+127>:   nop
   0x0000000000428110 <+128>:   mov    %r14,%rdi
   0x0000000000428113 <+131>:   callq  0x426250 <zsetLength>
   0x0000000000428118 <+136>:   testw  $0x3c0,0x0(%rbp)
   0x000000000042811e <+142>:   jne    0x4282b7 <zrankGenericCommand+551>
   0x0000000000428124 <+148>:   mov    %eax,%eax
   0x0000000000428126 <+150>:   mov    %rax,0x8(%rsp)
   0x000000000042812b <+155>:   movzwl (%r14),%eax
   0x000000000042812f <+159>:   and    $0x3c0,%ax
   0x0000000000428133 <+163>:   cmp    $0x140,%ax
   0x0000000000428137 <+167>:   je     0x4281c8 <zrankGenericCommand+312>
   0x000000000042813d <+173>:   cmp    $0x1c0,%ax
   0x0000000000428141 <+177>:   jne    0x428299 <zrankGenericCommand+521>
   0x0000000000428147 <+183>:   mov    0x28(%rbx),%r15
   0x000000000042814b <+187>:   mov    0x8(%r14),%r13
   0x000000000042814f <+191>:   mov    0x10(%r15),%rdi
   0x0000000000428153 <+195>:   mov    0x8(%r13),%r14
   0x0000000000428157 <+199>:   callq  0x41bcc0 <tryObjectEncoding>
   0x000000000042815c <+204>:   mov    0x0(%r13),%rdi
   0x0000000000428160 <+208>:   mov    %rax,0x10(%r15)
   0x0000000000428164 <+212>:   mov    %rax,%rsi
   0x0000000000428167 <+215>:   mov    %rax,%rbp
   0x000000000042816a <+218>:   callq  0x40ede0 <dictFind>
   0x000000000042816f <+223>:   test   %rax,%rax
   0x0000000000428172 <+226>:   je     0x428270 <zrankGenericCommand+480>
   0x0000000000428178 <+232>:   mov    0x8(%rax),%rax
   0x000000000042817c <+236>:   mov    %rbp,%rsi
   0x000000000042817f <+239>:   mov    %r14,%rdi
   0x0000000000428182 <+242>:   movsd  (%rax),%xmm0
   0x0000000000428186 <+246>:   callq  0x427fd0 <zslGetRank>
=> 0x000000000042818b <+251>:   test   %rax,%rax
   0x000000000042818e <+254>:   je     0x4282d5 <zrankGenericCommand+581>
   0x0000000000428194 <+260>:   test   %r12d,%r12d
   0x0000000000428197 <+263>:   je     0x4281b0 <zrankGenericCommand+288>
   0x0000000000428199 <+265>:   mov    0x8(%rsp),%rsi
   0x000000000042819e <+270>:   mov    %rbx,%rdi
   0x00000000004281a1 <+273>:   sub    %rax,%rsi
   0x00000000004281a4 <+276>:   callq  0x41a430 <addReplyLongLong>
   0x00000000004281a9 <+281>:   jmpq   0x4280ec <zrankGenericCommand+92>
   0x00000000004281ae <+286>:   xchg   %ax,%ax
   0x00000000004281b0 <+288>:   lea    -0x1(%rax),%rsi
   0x00000000004281b4 <+292>:   mov    %rbx,%rdi
   0x00000000004281b7 <+295>:   callq  0x41a430 <addReplyLongLong>
   0x00000000004281bc <+300>:   nopl   0x0(%rax)
   0x00000000004281c0 <+304>:   jmpq   0x4280ec <zrankGenericCommand+92>
   0x00000000004281c5 <+309>:   nopl   (%rax)
   0x00000000004281c8 <+312>:   mov    0x8(%r14),%r14
   0x00000000004281cc <+316>:   xor    %esi,%esi
   0x00000000004281ce <+318>:   mov    %r14,%rdi
   0x00000000004281d1 <+321>:   callq  0x417600 <ziplistIndex>
   0x00000000004281d6 <+326>:   test   %rax,%rax
   0x00000000004281d9 <+329>:   mov    %rax,0x18(%rsp)
   0x00000000004281de <+334>:   je     0x428311 <zrankGenericCommand+641>
   0x00000000004281e4 <+340>:   mov    %rax,%rsi
   0x00000000004281e7 <+343>:   mov    %r14,%rdi
   0x00000000004281ea <+346>:   callq  0x4175c0 <ziplistNext>
   0x00000000004281ef <+351>:   test   %rax,%rax
   0x00000000004281f2 <+354>:   mov    %rax,0x10(%rsp)
   0x00000000004281f7 <+359>:   je     0x4282f3 <zrankGenericCommand+611>
   0x00000000004281fd <+365>:   mov    0x18(%rsp),%rdi
   0x0000000000428202 <+370>:   mov    $0x1,%r13d
   0x0000000000428208 <+376>:   lea    0x10(%rsp),%r15
   0x000000000042820d <+381>:   test   %rdi,%rdi
   0x0000000000428210 <+384>:   jne    0x428236 <zrankGenericCommand+422>
   0x0000000000428212 <+386>:   jmp    0x428270 <zrankGenericCommand+480>
   0x0000000000428214 <+388>:   nopl   0x0(%rax)
   0x0000000000428218 <+392>:   lea    0x18(%rsp),%rsi
   0x000000000042821d <+397>:   mov    %r14,%rdi
   0x0000000000428220 <+400>:   mov    %r15,%rdx
   0x0000000000428223 <+403>:   callq  0x425610 <zzlNext>
   0x0000000000428228 <+408>:   mov    0x18(%rsp),%rdi
   0x000000000042822d <+413>:   test   %rdi,%rdi
   0x0000000000428230 <+416>:   je     0x428270 <zrankGenericCommand+480>
   0x0000000000428232 <+418>:   add    $0x1,%r13
   0x0000000000428236 <+422>:   mov    0x8(%rbp),%rsi
   0x000000000042823a <+426>:   movslq -0x8(%rsi),%rdx
   0x000000000042823e <+430>:   callq  0x417a40 <ziplistCompare>
   0x0000000000428243 <+435>:   test   %eax,%eax
   0x0000000000428245 <+437>:   je     0x428218 <zrankGenericCommand+392>
   0x0000000000428247 <+439>:   cmpq   $0x0,0x18(%rsp)
   0x000000000042824d <+445>:   je     0x428270 <zrankGenericCommand+480>
   0x000000000042824f <+447>:   test   %r12d,%r12d
   0x0000000000428252 <+450>:   je     0x428288 <zrankGenericCommand+504>
   0x0000000000428254 <+452>:   mov    0x8(%rsp),%rsi
   0x0000000000428259 <+457>:   mov    %rbx,%rdi
   0x000000000042825c <+460>:   sub    %r13,%rsi
   0x000000000042825f <+463>:   callq  0x41a430 <addReplyLongLong>
   0x0000000000428264 <+468>:   jmpq   0x4280ec <zrankGenericCommand+92>
   0x0000000000428269 <+473>:   nopl   0x0(%rax)
   0x0000000000428270 <+480>:   mov    0x2311d9(%rip),%rsi        # 0x659450 <shared+80>
   0x0000000000428277 <+487>:   mov    %rbx,%rdi
   0x000000000042827a <+490>:   callq  0x419f60 <addReply>
   0x000000000042827f <+495>:   jmpq   0x4280ec <zrankGenericCommand+92>
   0x0000000000428284 <+500>:   nopl   0x0(%rax)
   0x0000000000428288 <+504>:   lea    -0x1(%r13),%rsi
   0x000000000042828c <+508>:   mov    %rbx,%rdi
   0x000000000042828f <+511>:   callq  0x41a430 <addReplyLongLong>
   0x0000000000428294 <+516>:   jmpq   0x4280ec <zrankGenericCommand+92>
   0x0000000000428299 <+521>:   mov    $0x44939f,%edi
   0x000000000042829e <+526>:   mov    $0x808,%edx
   0x00000000004282a3 <+531>:   mov    $0x44a674,%esi
   0x00000000004282a8 <+536>:   callq  0x432010 <_redisPanic>
   0x00000000004282ad <+541>:   mov    $0x1,%edi
   0x00000000004282b2 <+546>:   callq  0x40c3a0 <_exit@plt>
   0x00000000004282b7 <+551>:   mov    $0x44a7d0,%edi
   0x00000000004282bc <+556>:   mov    $0x7da,%edx
   0x00000000004282c1 <+561>:   mov    $0x44a674,%esi
   0x00000000004282c6 <+566>:   callq  0x432090 <_redisAssert>
   0x00000000004282cb <+571>:   mov    $0x1,%edi
   0x00000000004282d0 <+576>:   callq  0x40c3a0 <_exit@plt>
   0x00000000004282d5 <+581>:   mov    $0x448982,%edi
   0x00000000004282da <+586>:   mov    $0x7ff,%edx
   0x00000000004282df <+591>:   mov    $0x44a674,%esi
   0x00000000004282e4 <+596>:   callq  0x432090 <_redisAssert>
   0x00000000004282e9 <+601>:   mov    $0x1,%edi
   0x00000000004282ee <+606>:   callq  0x40c3a0 <_exit@plt>
   0x00000000004282f3 <+611>:   mov    $0x44a6e5,%edi
   0x00000000004282f8 <+616>:   mov    $0x7e2,%edx
   0x00000000004282fd <+621>:   mov    $0x44a674,%esi
   0x0000000000428302 <+626>:   callq  0x432090 <_redisAssert>
   0x0000000000428307 <+631>:   mov    $0x1,%edi
   0x000000000042830c <+636>:   callq  0x40c3a0 <_exit@plt>
   0x0000000000428311 <+641>:   mov    $0x44a6bd,%edi
   0x0000000000428316 <+646>:   mov    $0x7e0,%edx
   0x000000000042831b <+651>:   mov    $0x44a674,%esi
   0x0000000000428320 <+656>:   callq  0x432090 <_redisAssert>
   0x0000000000428325 <+661>:   mov    $0x1,%edi
   0x000000000042832a <+666>:   callq  0x40c3a0 <_exit@plt>
   End of assembler dump.
Run Code Online (Sandbox Code Playgroud)

As requested, this is the tryObjectEncoding function:

/* Try to encode a string object in order to save space */
robj *tryObjectEncoding(robj *o) {
    long value;
    sds s = o->ptr;

    if (o->encoding != REDIS_ENCODING_RAW)
        return o; /* Already encoded */

    /* It's not safe to encode shared objects: shared objects can be shared
     * everywhere in the "object space" of Redis. Encoded objects can only
     * appear as "values" (and not, for instance, as keys) */
     if (o->refcount > 1) return o;

    /* Currently we try to encode only strings */
    redisAssert(o->type == REDIS_STRING);

    /* Check if we can represent this string as a long integer */
    if (!string2l(s,sdslen(s),&value)) return o;

    /* Ok, this object can be encoded...
     *
     * Can I use a shared object? Only if the object is inside a given
     * range and if this is the main thread, since when VM is enabled we
     * have the constraint that I/O thread should only handle non-shared
     * objects, in order to avoid race conditions (we don't have per-object
     * locking).
     *
     * Note that we also avoid using shared integers when maxmemory is used
     * because very object needs to have a private LRU field for the LRU
     * algorithm to work well. */
    if (server.maxmemory == 0 && value >= 0 && value < REDIS_SHARED_INTEGERS &&
        pthread_equal(pthread_self(),server.mainthread)) {
        decrRefCount(o);
        incrRefCount(shared.integers[value]);
        return shared.integers[value];
    } else {
        o->encoding = REDIS_ENCODING_INT;
        sdsfree(o->ptr);
        o->ptr = (void*) value;
        return o;
    }
}
Run Code Online (Sandbox Code Playgroud)

ant*_*rez 10

我想我现在可以回答我自己的问题......

基本上这就是发生的事情.zslGetRank()由zrankGenericCommand()调用,第一个参数进入%rdi寄存器.但是稍后此函数将使用%rdi寄存器来设置对象(实际上%rdi寄存器设置为有效的对象):

(gdb) print *(robj*)0x7f3d8d71c360
$1 = {type = 0, storage = 0, encoding = 1, lru = 517611, refcount = 2, 
ptr = 0x1524db19}
Run Code Online (Sandbox Code Playgroud)

在崩溃时指令指针实际指向zslGetRank + 64,我在发布原始问题之前对gdb做了一些错误并修改了寄存器.

另外如何验证zslGetRank()获取正确的地址作为第一个参数?因为%r14由zslGetRank()保存在堆栈中,所以我们可以检查堆栈以检查是否有正确的位置.所以我们在堆栈指针附近转储:

0x7fffe61a8000: 0x40337fa0a3376aff      0x00007f3d9dcdc000
0x7fffe61a8010: 0x00007f3d9dcdc000      0x00007f3d4cab5760
0x7fffe61a8020: 0x0000000000000001      0x00007f3d9de574b0
---> 0x7fffe61a8030: 0x00007f3d9de591c0      0x000000000042818b
0x7fffe61a8040: 0x0000000000000000      0x00000000000285c0
0x7fffe61a8050: 0x0000000000000000      0x00007f3d9dcdc000
0x7fffe61a8060: 0x0000000000000000      0x00007f3d9dcdc000
0x7fffe61a8070: 0x0000000000000000      0x0004b6b413e12d9a
0x7fffe61a8080: 0x00000000000003d8      0x0000000000000001
Run Code Online (Sandbox Code Playgroud)

正如您所看到的,正确的地址位于堆栈中.

总而言之,使用正确的地址调用函数,只是gdb无法转储正确的堆栈跟踪,因为%rdi寄存器被修改并用于函数内部的其他内容.

所以这可能是一个内存损坏的事情.我现在要做的是手动遍历排序集,模拟zslGetRank()的工作,这样我就可以隔离破坏的节点,并希望以哪种方式检查它是否已损坏.

谢谢你的帮助.

编辑:在这里你可以找到一个手动注释的反汇编版本的zslGetRank()函数 - > https://gist.github.com/1641112(我用它来学习更多汇编程序并使我的检查更简单).


ant*_*rez 1

更新:这个盒子里的 RAM 坏了,我们在这之后发现用户的 RAM 有很多问题,现在 Redis 甚至实现了 --test-memory 选项...谢谢。