Wednesday 20 January 2016

How to debug stack corruption on FreeRTOS

When a stack is corrupted, the saved LR is usually among the overwritten values, and that garbage is loaded into the PC when the function returns. The processor then enters an exception handler because it cannot execute code from, say, "0xff80dddd" or some other garbage address. When debugging such issues, you usually get a useless backtrace like this:

(gdb) bt
#0  _hang () at startup/startup.s:136
#1  <signal handler called>
#2  0x0001c918 in prvPortStartFirstTask () at portable/GCC/ARM_CM4F/port.c:303
#3  0x0001ca02 in xPortStartScheduler () at portable/GCC/ARM_CM4F/port.c:395
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

How to approach such issues?

Let's make some assumptions before going further:
  • The processor is a 32-bit ARM.
  • You're using the GCC toolchain.
  • The OS is FreeRTOS.
  • The stack is configured to grow downwards (the standard way).
  • You don't have a memory dumping mechanism and/or true post-mortem analysis tools/scripts.
  • You can catch the exception using GDB.

With these assumptions in mind, this is what I do to find the root cause:

1. Connect through GDB and wait until the bug reproduces.

2. When GDB catches the exception, print the current task name:

(gdb) p pxCurrentTCB->pcTaskName
$6 = "Bug task\000\000"

3. Find in the source code what stack size was allocated for this task. Example:

#define BUG_TASK_SIZE 256
xTaskCreate(bug_task, "BUG", BUG_TASK_SIZE, NULL, 1, NULL);

To find out how many bytes are reserved for this task's stack, multiply "BUG_TASK_SIZE" by the word size: FreeRTOS takes the stack depth in words, not bytes. On a 32-bit ARM a word is 4 bytes, so the actual stack size is 256 * 4 = 1024 bytes (1 kB).

4. Find the lowest possible stack address:

(gdb) p pxCurrentTCB->pxStack
$7 = (StackType_t *) 0x20010400

5. Add the stack size to get the stack range:

0x20010400 + 0x400 (1 kB) = 0x20010800

The stack of this task lies between 0x20010400 and 0x20010800.

6. Read the current top of the stack:

(gdb) p pxCurrentTCB->pxTopOfStack
$8 = (volatile StackType_t *) 0x2001073c

So far, we get:

0x20010800   <- beginning of stack
|
|
0x2001073c   <- current top of stack
...
0x20010400   <- end of stack

7. Calculate how many bytes of the stack are in use:

0x20010800 - 0x2001073c = 0xC4 (196 bytes which are 49 words)
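
The arithmetic from steps 3 to 7 is easy to get wrong by hand, so here is a minimal Python sketch of it. The function name `stack_usage` is made up for this example, and the 4-byte word size is the assumption stated at the top of this post; the three input values are the ones read from GDB and the source code above.

```python
WORD_SIZE = 4  # bytes per stack word on a 32-bit ARM (assumption from this post)

def stack_usage(px_stack, px_top_of_stack, depth_words, word_size=WORD_SIZE):
    """Return (lowest address, highest address, bytes in use, words in use)."""
    size_bytes = depth_words * word_size
    stack_start = px_stack + size_bytes          # the stack grows down from here
    used_bytes = stack_start - px_top_of_stack   # distance from start to current top
    return px_stack, stack_start, used_bytes, used_bytes // word_size

# Values from the GDB session above: pxStack, pxTopOfStack, BUG_TASK_SIZE
low, high, used_bytes, used_words = stack_usage(0x20010400, 0x2001073C, 256)
print(f"stack range: {low:#x}..{high:#x}")
print(f"in use: {used_bytes} bytes ({used_words} words)")
print(f"(gdb) x/{used_words}wx {0x2001073C:#x}")
```

The last line prints the GDB command that dumps exactly the used part of the stack.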

8. Dump the stack:

(gdb) x/49wx 0x2001073c
0x2001073c:     0x2000e41c      0x200107a8      0x2000e3f8      0x00000000
0x2001074c:     0x00000000      0x20010758      0x20010758      0x00000000
0x2001075c:     0x00000001      0x200107a8      0x00000000      0x00000000
0x2001076c:     0x00000000      0x20015f90      0x2001626c      0x0102fea9
0x2001077c:     0x20015ed4      0x0000000a      0x20010798      0x20010798
0x2001078c:     0x00021869      0x00020cf8      0x01000000      0x0102fea9
0x2001079c:     0x00000000      0x00000000      0x00000000      0x200107b0
0x200107ac:     0x00020e89      0x200107b8      0x00020f99      0x00020e71
0x200107bc:     0x00020e81      0x200107c8      0x00020bf5      0x00020e71
0x200107cc:     0x00020e81      0x00000000      0x00000000      0x00020f71
0x200107dc:     0x02000000      0x200107e8      0x0002101b      0x00020e71
0x200107ec:     0x00020e81      0x00000000      0x0001c8a5      0x00000000
0x200107fc:     0x00000000

9. Pass it to arm-none-eabi-addr2line:

You can pass each value one by one, write a small script, or format the values as one column and paste them:

arm-none-eabi-addr2line -e <path-to-elf>
<paste stack data>
0x2000e41c
0x200107a8
0x2000e3f8
0x00000000
0x00000000
0x20010758
0x20010758
0x00000000
0x00000001
0x200107a8
0x00000000
0x00000000
0x00000000
0x20015f90
0x2001626c
0x0102fea9
0x20015ed4
0x0000000a
0x20010798
0x20010798
0x00021869
0x00020cf8
0x01000000
0x0102fea9
0x00000000
0x00000000
0x00000000
0x200107b0
0x00020e89
0x200107b8
0x00020f99
0x00020e71
0x00020e81
0x200107c8
0x00020bf5
0x00020e71
0x00020e81
0x00000000
0x00000000
0x00020f71
0x02000000
0x200107e8
0x0002101b
0x00020e71
0x00020e81
0x00000000
0x0001c8a5
0x00000000
0x00000000
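
Formatting the dump as one column can be scripted. What follows is only a sketch: the function name `dump_to_column` is made up, and it assumes the dump looks exactly like the `x/49wx` output shown above (row address, a colon, then the data words).

```python
import re

def dump_to_column(gdb_dump):
    """Extract the data words from `x/Nwx` output, one per line."""
    words = []
    for line in gdb_dump.splitlines():
        # Each row looks like "0x2001073c:  0x2000e41c  0x200107a8 ...";
        # everything before the colon is the address of the row itself.
        _, sep, data = line.partition(":")
        if sep:
            words.extend(re.findall(r"0x[0-9a-fA-F]+", data))
    return "\n".join(words)

# First and last rows of the dump above, used as sample input
sample = """\
0x2001073c:     0x2000e41c      0x200107a8      0x2000e3f8      0x00000000
0x200107fc:     0x00000000
"""
print(dump_to_column(sample))
```

Since addr2line reads addresses from standard input when none are given on the command line, the printed column can be piped straight into `arm-none-eabi-addr2line -e <path-to-elf>`.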

The addr2line tool will try to interpret each value as a code address. Some of those values are data, so you'll get garbage for them, which you can ignore. Values that fall inside one of your source files are printed with a file name and line number. All in all, the output looks something like this:

heap_4.c:?
heap_4.c:?
:?
:?
heap_4.c:?
heap_4.c:?
:?
:?
heap_4.c:?
:?
:?
:?
zzzz_sd.c:?
zzzz_sd.c:?
??:0
main.c:?
:?
heap_4.c:?
heap_4.c:?
/home/yyy/devel/xxx/app/zz/src/zzzz_sd.c:331   <<--- CHECK THIS LINE
heap_4.c:?
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:395
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:324
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:329
heap_4.c:?
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:227
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:324
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:329
:?
:?
/home/yyy/devel/xxx/app/zzz/src/zzz_sd.c:387
??:0
heap_4.c:?
/home/yyy/devel/xxx/app/zzz/src/zzz_sd.c:422
/home/yyy/devel/xxx/app/zzz/src/zzz_sd.c:324
/home/yyy/devel/xxx/app/zzz/src/zzz_sd.c:329
:?
/home/yyy/devel/xxx/portable/GCC/ARM_CM4F/port.c:269
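
Much of the noise in this output comes from data words. On a Cortex-M core, which runs Thumb code only, a saved LR has bit 0 set, so one possible pre-filter keeps only odd values that fall inside the code region. The sketch below is an illustration, not part of the original procedure: the function name is made up, and `CODE_START` and `CODE_END` are placeholder values chosen only to cover the addresses in this example. Take the real bounds from your linker script or map file.

```python
def likely_return_addresses(words, code_start, code_end):
    """Keep words that look like Thumb return addresses inside the code region."""
    return [w for w in words if code_start <= w < code_end and w & 1]

# Placeholder bounds (assumption): replace with your real flash/code range
CODE_START = 0x00010000
CODE_END = 0x00030000

# A few words taken from the stack dump above
words = [0x2000E41C, 0x00000001, 0x0102FEA9, 0x00021869,
         0x00020CF8, 0x00020E89, 0x0001C8A5]

for w in likely_return_addresses(words, CODE_START, CODE_END):
    print(f"{w:#010x}")
```

This is a heuristic: a stale value left on the stack by an earlier call, or a data word that happens to be odd and in range, will still pass the filter.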

10. Analyze:

There are no strict rules on how to proceed from here. Check the source lines that addr2line resolved (note that these addresses are LR values, not PC values); they should lead you as close as possible to the offending function. There is a good chance that the topmost resolved source line comes right after a misused memcpy or memset.

For example, one of the last LRs pushed onto the stack was resolved by addr2line as:

/home/yyy/devel/xxx/app/zz/src/zzzz_sd.c:331

If you check that line in zzzz_sd.c, you might see something like this:

328: void xxx_get_ipv4_addr_raw(char *addr)
329: {
330:     memcpy(addr, m_xxx_iface->ip_addr, IPV6_LENGTH);
331: }

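The address points to the instruction right after the branch to memcpy. Now look at the memcpy call above and the bug becomes obvious: a function that returns an IPv4 address copies IPV6_LENGTH bytes, so it can write past the end of a buffer sized for IPv4 and into the caller's stack. Got it!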


