When the stack is corrupted, the value that overwrote the saved LR usually ends up in the PC when the function returns. The processor then enters the exception handler because it cannot execute code from, say, "0xff80dddd" or some other garbage address. When debugging such issues, you usually get a useless backtrace like this:
(gdb) bt
#0 _hang () at startup/startup.s:136
#1 <signal handler called>
#2 0x0001c918 in prvPortStartFirstTask () at portable/GCC/ARM_CM4F/port.c:303
#3 0x0001ca02 in xPortStartScheduler () at portable/GCC/ARM_CM4F/port.c:395
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
How to approach such issues?
Let's make some assumptions before going further:
- The processor is a 32-bit ARM.
- You're using the GCC toolchain.
- The OS is FreeRTOS.
- The stack grows downwards (the standard configuration).
- You don't have a memory-dumping mechanism or true post-mortem analysis tools/scripts.
- You can catch the exception using GDB.
With these assumptions in mind, this is what I do to find the root cause:
1. Connect through GDB and wait until the bug reproduces.
2. When GDB catches the exception, print the current task name:
(gdb) p pxCurrentTCB->pcTaskName
$6 = "Bug task\000\000"
3. Find in the source code what stack size was allocated for this task. Example:
#define BUG_TASK_SIZE 256
xTaskCreate(bug_task, "Bug task", BUG_TASK_SIZE, NULL, 1, NULL);
To find out how many bytes FreeRTOS reserves for this task's stack, multiply the "BUG_TASK_SIZE" value by the word size. A 32-bit ARM has a 4-byte word, so the actual stack size is 256 * 4 = 1024 bytes (1 kB).
4. Find the lowest possible stack address:
(gdb) p pxCurrentTCB->pxStack
$7 = (StackType_t *) 0x20010400
5. Add the stack size to get the stack range:
0x20010400 + 0x400 (1 kB) = 0x20010800
The stack of this task is between 0x20010400 and 0x20010800.
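The arithmetic from steps 3 to 5 can be sketched as a small Python helper. This is only an illustration: `stack_bounds` and its parameter names are made up for this post, and the default word size of 4 assumes a 32-bit ARM port.

```python
def stack_bounds(px_stack, depth_words, word_size=4):
    """Return (lowest, highest) address of a task stack.

    px_stack    -- value of pxCurrentTCB->pxStack (lowest stack address)
    depth_words -- stack depth passed to xTaskCreate, counted in words
    word_size   -- bytes per stack word; assumed to be 4 on 32-bit ARM
    """
    size_bytes = depth_words * word_size
    return px_stack, px_stack + size_bytes


low, high = stack_bounds(0x20010400, 256)
print(hex(low), hex(high))  # 0x20010400 0x20010800
```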
6. Read the current top of the stack:
(gdb) p pxCurrentTCB->pxTopOfStack
$8 = (volatile StackType_t *) 0x2001073c
So far, we have:
0x20010800 <- beginning of stack
|
|
0x2001073c <- current top of stack
...
0x20010400 <- end of stack
7. Calculate how many bytes of the stack are in use:
0x20010800 - 0x2001073c = 0xC4 (196 bytes, i.e. 49 words)
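If you repeat this often, the subtraction and the matching GDB dump command can be produced by a few lines of Python. A minimal sketch, assuming a downward-growing stack and 4-byte words; the function name `used_stack` is hypothetical.

```python
def used_stack(stack_high, top_of_stack, word_size=4):
    """Return (used_bytes, used_words) for a stack that grows downwards."""
    used_bytes = stack_high - top_of_stack
    if used_bytes < 0 or used_bytes % word_size:
        # Top of stack above the stack area, or not word aligned.
        raise ValueError("top of stack is outside the stack or misaligned")
    return used_bytes, used_bytes // word_size


top = 0x2001073C
used_bytes, used_words = used_stack(0x20010800, top)
print(used_bytes, used_words)         # 196 49
print(f"x/{used_words}wx {top:#x}")   # x/49wx 0x2001073c
```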
8. Dump the stack:
(gdb) x/49wx 0x2001073c
0x2001073c: 0x2000e41c 0x200107a8 0x2000e3f8 0x00000000
0x2001074c: 0x00000000 0x20010758 0x20010758 0x00000000
0x2001075c: 0x00000001 0x200107a8 0x00000000 0x00000000
0x2001076c: 0x00000000 0x20015f90 0x2001626c 0x0102fea9
0x2001077c: 0x20015ed4 0x0000000a 0x20010798 0x20010798
0x2001078c: 0x00021869 0x00020cf8 0x01000000 0x0102fea9
0x2001079c: 0x00000000 0x00000000 0x00000000 0x200107b0
0x200107ac: 0x00020e89 0x200107b8 0x00020f99 0x00020e71
0x200107bc: 0x00020e81 0x200107c8 0x00020bf5 0x00020e71
0x200107cc: 0x00020e81 0x00000000 0x00000000 0x00020f71
0x200107dc: 0x02000000 0x200107e8 0x0002101b 0x00020e71
0x200107ec: 0x00020e81 0x00000000 0x0001c8a5 0x00000000
0x200107fc: 0x00000000
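To turn such a dump into one value per line, a short script is enough. The sketch below assumes the usual `x/<n>wx` layout (an address label, a colon, then up to four words per row); `dump_to_column` is a name I picked for illustration.

```python
import re


def dump_to_column(dump_text):
    """Flatten GDB `x/<n>wx` output into a list of words, in memory order."""
    words = []
    for line in dump_text.splitlines():
        # Everything before the last colon is the address label.
        _, sep, data = line.rpartition(":")
        if not sep:
            continue  # not a dump row
        words.extend(re.findall(r"0x[0-9a-fA-F]+", data))
    return words


dump = """\
0x200107ec: 0x00020e81 0x00000000 0x0001c8a5 0x00000000
0x200107fc: 0x00000000
"""
print("\n".join(dump_to_column(dump)))
```

The printed column can be pasted straight into a tool that reads one address per line.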
9. Pass the values to arm-none-eabi-addr2line:
You can pass the values one by one, write a small script, or format them as a single column and paste them:
arm-none-eabi-addr2line -e <path-to-elf>
<paste stack data>
0x2000e41c
0x200107a8
0x2000e3f8
0x00000000
0x00000000
0x20010758
0x20010758
0x00000000
0x00000001
0x200107a8
0x00000000
0x00000000
0x00000000
0x20015f90
0x2001626c
0x0102fea9
0x20015ed4
0x0000000a
0x20010798
0x20010798
0x00021869
0x00020cf8
0x01000000
0x0102fea9
0x00000000
0x00000000
0x00000000
0x200107b0
0x00020e89
0x200107b8
0x00020f99
0x00020e71
0x00020e81
0x200107c8
0x00020bf5
0x00020e71
0x00020e81
0x00000000
0x00000000
0x00020f71
0x02000000
0x200107e8
0x0002101b
0x00020e71
0x00020e81
0x00000000
0x0001c8a5
0x00000000
0x00000000
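Instead of pasting, the addresses can also be given to addr2line as command-line arguments. The following sketch only builds the argument list; the helper name and the "firmware.elf" path are placeholders, and `-f` (print function names) is an optional extra that is not used elsewhere in this post.

```python
import shlex


def addr2line_command(elf_path, addresses, functions=False,
                      tool="arm-none-eabi-addr2line"):
    """Build an addr2line invocation for a list of hex address strings."""
    cmd = [tool, "-e", elf_path]
    if functions:
        cmd.append("-f")  # also print the enclosing function name
    return cmd + list(addresses)


cmd = addr2line_command("firmware.elf", ["0x00021869", "0x00020e89"])
print(shlex.join(cmd))
# arm-none-eabi-addr2line -e firmware.elf 0x00021869 0x00020e89
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` if you want to collect the output from a script.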
The addr2line tool will try to interpret each value as a code address. Some of the values are data, so you'll get garbage that you can ignore. Values that fall inside one of your source files are printed with a file name and line number. All in all, the output looks something like this:
heap_4.c:?
heap_4.c:?
:?
:?
heap_4.c:?
heap_4.c:?
:?
:?
heap_4.c:?
:?
:?
:?
zzzz_sd.c:?
zzzz_sd.c:?
??:0
main.c:?
:?
heap_4.c:?
heap_4.c:?
/home/yyy/devel/xxx/app/zz/src/zzzz_sd.c:331 <<--- CHECK THIS LINE
heap_4.c:?
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:395
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:324
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:329
heap_4.c:?
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:227
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:324
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:329
:?
:?
/home/yyy/devel/xxx/app/zzz/src/zzz_sd.c:387
??:0
heap_4.c:?
/home/yyy/devel/xxx/app/zzz/src/zzz_sd.c:422
/home/yyy/devel/xxx/app/zzz/src/zzz_sd.c:324
/home/yyy/devel/xxx/app/zzz/src/zzz_sd.c:329
:?
/home/yyy/devel/xxx/portable/GCC/ARM_CM4F/port.c:269
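Much of the noise in this output comes from stack words that are data, not return addresses. A rough pre-filter can reduce it. The sketch below is a heuristic under two assumptions: the core runs Thumb code (as on a Cortex-M4), so saved LR values have bit 0 set, and you know the address range of your code from the linker script. The range 0x0 to 0x00100000 used in the example is hypothetical.

```python
def likely_return_addresses(words, text_start, text_end):
    """Keep words that look like Thumb return addresses inside the code region."""
    hits = []
    for word in words:
        value = int(word, 16)
        in_text = text_start <= value < text_end
        thumb_bit = value & 1  # saved LR values have bit 0 set in Thumb state
        if in_text and thumb_bit:
            hits.append(word)
    return hits


words = ["0x2000e41c", "0x00021869", "0x00020cf8", "0x0102fea9", "0x00020e89"]
print(likely_return_addresses(words, 0x0, 0x00100000))
# ['0x00021869', '0x00020e89']
```

This can drop a genuine hit in unusual cases, so treat it as a way to prioritize candidates and fall back to the full list if nothing useful shows up.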
10. Analyze:
At this point there are no strict rules on how to proceed. Check the source lines reported by addr2line (note that the addresses are LR values, not PC values); they should point you as close as possible to the offending function. There is a good chance that the top-most resolved source line is just after some wrongly used memcpy/memset.
For example, one of the last LRs pushed on the stack was resolved by addr2line as:
/home/yyy/devel/xxx/app/zz/src/zzzz_sd.c:331
If you check that line in zzzz_sd.c, you might see, for instance:
328: void xxx_get_ipv4_addr_raw(char *addr)
329: {
330: memcpy(addr, m_xxx_iface->ip_addr, IPV6_LENGTH);
331: }
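The address points to the instruction right after the branch to memcpy. Now look at the memcpy call above and the bug becomes obvious: a function that returns an IPv4 address copies IPV6_LENGTH bytes, which presumably overruns a caller's buffer sized for an IPv4 address. Got it!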