Fix Bug Fix

Wednesday, 27 April 2016

Who will get up to turn off the light?

I guess there are hundreds of IoT projects that enable wireless connectivity with your lamp so you can toggle it using a smartphone. I did mine as well. The lamp itself looks like this:

Unfortunately, I don't have much time to describe the project in full detail (not to mention it's not fully finished yet). Anyway, I would like to put some information here for the record.

As the WiFi chip I chose the well-known ESP8266. I have it as an "ESP-03" module, so there is at least one GPIO free for use and enough Flash and RAM to implement all control features directly on the ESP (no need for an external MCU). As the actual control element I used a triac instead of a relay, for the following reasons:

In the future I would like to make a dimmer.
It's smaller.
It makes no sound.

The circuit for the triac is pretty much standard and looks more or less like this:

The prototype was made on the breadboard:

In the picture above, there is a power supply converter for breadboards connected to the bench-top power supply. However currently I supply the module directly from mains. I disassembled an old wall charger, which is 5V/500mA, and connected a 3.3V regulator to it. Works very well. I also ordered a 3.3V 600mA AC-DC Power Supply Buck Converter Step Down Module which should be a better solution (no need for breaking wall chargers). I haven't tested it though.

By the way: I connect a 60W bulb, so there is absolutely no need for a heat sink for the triac (even if there is a cover). Without the cover, in theory it's safe to connect even 180W of resistive load to it. However, for anything above 150W I would probably consider adding some piece of metal.

In terms of software: on the ESP8266 I created a _very_very_ simple HTTP server that parses GET requests for the "light_on" and "light_off" resources (yep, it should be the POST or PUT method). With the HTTP server in place I can control the light from the browser. I also made a simple widget for Android.
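The request handling described above can be sketched in plain C. This is not the code running on the lamp: the function name `parse_light_request` and the `light_cmd_t` codes are hypothetical, and the sketch only looks at the request line, assuming it arrives as a NUL-terminated string.

```c
#include <string.h>

/* Hypothetical command codes; the names are illustrative only. */
typedef enum { LIGHT_CMD_NONE, LIGHT_CMD_ON, LIGHT_CMD_OFF } light_cmd_t;

/* Return the command encoded in an HTTP request line such as
 * "GET /light_on HTTP/1.1". Anything else yields LIGHT_CMD_NONE. */
light_cmd_t parse_light_request(const char *request)
{
    static const char prefix[] = "GET /";
    const char *path;
    size_t len;

    if (request == NULL || strncmp(request, prefix, sizeof(prefix) - 1) != 0) {
        return LIGHT_CMD_NONE;
    }

    path = request + sizeof(prefix) - 1;
    len = strcspn(path, " ?\r\n");  /* resource name ends at a space, query or EOL */

    if (len == strlen("light_on") && strncmp(path, "light_on", len) == 0) {
        return LIGHT_CMD_ON;
    }
    if (len == strlen("light_off") && strncmp(path, "light_off", len) == 0) {
        return LIGHT_CMD_OFF;
    }
    return LIGHT_CMD_NONE;
}
```

Comparing the length as well as the characters keeps a resource like "light_offx" from being treated as "light_off".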

There are a lot of TODOs regarding this project:

Make a dimmer.
Fix Android widget (currently it needs to be reloaded sometimes).
Make a plastic cover.
Implement more HTTP methods.
Enable mDNS.
Implement some kind of WiFi provisioning (currently router credentials are hard-coded).

That's it, YAIP (Yet Another IoT Project)!

Wednesday, 20 January 2016

How to debug stack corruption on FreeRTOS

When the stack is corrupted, usually the value from the overwritten LR is loaded into the PC when the function returns. In that case, the processor enters the exception handler because it cannot execute code from, let's say, "0xff80dddd" or some other garbage address. When debugging such issues, you usually get the following useless backtrace:

(gdb) bt
#0  _hang () at startup/startup.s:136
#1  <signal handler called>
#2  0x0001c918 in prvPortStartFirstTask () at portable/GCC/ARM_CM4F/port.c:303
#3  0x0001ca02 in xPortStartScheduler () at portable/GCC/ARM_CM4F/port.c:395
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

How to approach such issues?

Let's make some assumptions before going further:

The processor is a 32-bit ARM.
You're using the GCC toolchain.
The OS is FreeRTOS.
The stack is configured to grow downwards (the standard way).
You don't have a memory dumping mechanism and/or true post-mortem analysis tools/scripts.
You can catch the exception using GDB.

With the above assumptions in mind, this is what I do to find the root cause:

1. Connect through GDB and wait until the bug reproduces.

2. When GDB catches the exception, print the current task name:

(gdb) p pxCurrentTCB->pcTaskName
$6 = "Bug task\000\000"

3. Find in the source code what stack size was allocated for this task. Example:

#define BUG_TASK_SIZE 256
xTaskCreate(bug_task, "BUG", BUG_TASK_SIZE, NULL, 1, NULL);

To find out how many bytes are reserved for this task's stack in FreeRTOS, the "BUG_TASK_SIZE" value must be multiplied by the word size. 32-bit ARM has a 4-byte word, so the actual stack size is 256 * 4 = 1024 bytes (1 kB).

4. Find the lowest possible stack address:

(gdb) p pxCurrentTCB->pxStack
$7 = (StackType_t *) 0x20010400

5. Add the stack size to get stack range:

0x20010400 + 0x400 (1kB) = 0x20010800;

The stack of this task is between 0x20010400 and 0x20010800.

6. Read current top of the stack:

(gdb) p pxCurrentTCB->pxTopOfStack
$8 = (volatile StackType_t *) 0x2001073c

So far, we get:

0x20010800   <- beginning of stack
|
|
0x2001073c   <- current top of stack
...
0x20010400   <- end of stack

7. Calculate how many bytes of the stack were used:

0x20010800 - 0x2001073c = 0xC4 (196 bytes which are 49 words)
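The arithmetic from steps 3 to 7 can be captured in a small host-side helper, which is handy when you repeat the procedure for several tasks. This is only a sketch: `stack_usage_t` and `stack_usage` are hypothetical names, not part of FreeRTOS, and a 4-byte stack word is assumed, as on 32-bit ARM.

```c
#include <stdint.h>

/* Hypothetical helper type, not part of FreeRTOS. */
typedef struct {
    uint32_t lowest;      /* pxStack: lowest address of the stack buffer */
    uint32_t highest;     /* one past the buffer; the stack starts here and grows down */
    uint32_t used_bytes;  /* bytes between the start of the stack and pxTopOfStack */
    uint32_t used_words;  /* the same amount expressed in 4-byte words */
} stack_usage_t;

/* Compute the stack range and usage from the values printed by GDB.
 * depth_words is the stack depth passed to xTaskCreate. */
stack_usage_t stack_usage(uint32_t px_stack, uint32_t depth_words,
                          uint32_t px_top_of_stack)
{
    stack_usage_t u;

    u.lowest = px_stack;
    u.highest = px_stack + depth_words * (uint32_t)sizeof(uint32_t);
    u.used_bytes = u.highest - px_top_of_stack;
    u.used_words = u.used_bytes / (uint32_t)sizeof(uint32_t);
    return u;
}
```

Calling `stack_usage(0x20010400, 256, 0x2001073c)` reproduces the numbers above: a range ending at 0x20010800 and 0xC4 bytes (49 words) in use, which is the word count to pass to the `x/<n>wx` command.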

8. Dump the stack:

(gdb) x/49wx 0x2001073c
0x2001073c:     0x2000e41c      0x200107a8      0x2000e3f8      0x00000000
0x2001074c:     0x00000000      0x20010758      0x20010758      0x00000000
0x2001075c:     0x00000001      0x200107a8      0x00000000      0x00000000
0x2001076c:     0x00000000      0x20015f90      0x2001626c      0x0102fea9
0x2001077c:     0x20015ed4      0x0000000a      0x20010798      0x20010798
0x2001078c:     0x00021869      0x00020cf8      0x01000000      0x0102fea9
0x2001079c:     0x00000000      0x00000000      0x00000000      0x200107b0
0x200107ac:     0x00020e89      0x200107b8      0x00020f99      0x00020e71
0x200107bc:     0x00020e81      0x200107c8      0x00020bf5      0x00020e71
0x200107cc:     0x00020e81      0x00000000      0x00000000      0x00020f71
0x200107dc:     0x02000000      0x200107e8      0x0002101b      0x00020e71
0x200107ec:     0x00020e81      0x00000000      0x0001c8a5      0x00000000
0x200107fc:     0x00000000

9. Pass it to arm-none-eabi-addr2line:

You can pass each value one by one, create some sort of script or format it as one column and just paste:

arm-none-eabi-addr2line -e <path-to-elf>
<paste stack data>
0x2000e41c
0x200107a8
0x2000e3f8
0x00000000
0x00000000
0x20010758
0x20010758
0x00000000
0x00000001
0x200107a8
0x00000000
0x00000000
0x00000000
0x20015f90
0x2001626c
0x0102fea9
0x20015ed4
0x0000000a
0x20010798
0x20010798
0x00021869
0x00020cf8
0x01000000
0x0102fea9
0x00000000
0x00000000
0x00000000
0x200107b0
0x00020e89
0x200107b8
0x00020f99
0x00020e71
0x00020e81
0x200107c8
0x00020bf5
0x00020e71
0x00020e81
0x00000000
0x00000000
0x00020f71
0x02000000
0x200107e8
0x0002101b
0x00020e71
0x00020e81
0x00000000
0x0001c8a5
0x00000000
0x00000000

The addr2line tool will try to interpret each value as a code address. Some of those values are data, so you'll get garbage that you can ignore. Other values, which match one of your source files, will be printed with a specific line number. All in all, the output will look something like this:

heap_4.c:?
heap_4.c:?
:?
:?
heap_4.c:?
heap_4.c:?
:?
:?
heap_4.c:?
:?
:?
:?
zzzz_sd.c:?
zzzz_sd.c:?
??:0
main.c:?
:?
heap_4.c:?
heap_4.c:?
/home/yyy/devel/xxx/app/zz/src/zzzz_sd.c:331   <<--- CHECK THIS LINE
heap_4.c:?
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:395
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:324
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:329
heap_4.c:?
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:227
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:324
/home/yyy/devel/xxx/app/zzz/src/zzzz_sd.c:329
:?
:?
/home/yyy/devel/xxx/app/zzz/src/zzz_sd.c:387
??:0
heap_4.c:?
/home/yyy/devel/xxx/app/zzz/src/zzz_sd.c:422
/home/yyy/devel/xxx/app/zzz/src/zzz_sd.c:324
/home/kowyyyalmic/devel/xxx/app/zzz/src/zzz_sd.c:329
:?
/home/kowalmic/devel/xxx/portable/GCC/ARM_CM4F/port.c:269
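Most of the dumped words are data, so it can help to pre-filter them before feeding addr2line. The sketch below is a heuristic, not part of the original procedure: it keeps only words that fall inside an assumed code region (0x00000000 to 0x00100000 here, adjust to your memory map) and have bit 0 set, as saved return addresses do on Thumb-only cores. The function names are hypothetical, and false positives such as a plain data value of 1 are still possible.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed code (flash) region; adjust to the target's memory map. */
#define CODE_START 0x00000000u
#define CODE_END   0x00100000u

/* A saved LR on a Thumb-only core points into code and has bit 0 set. */
int looks_like_return_address(uint32_t word)
{
    /* Unsigned subtraction folds both range bounds into one comparison. */
    return (word - CODE_START) < (CODE_END - CODE_START) && (word & 1u) != 0u;
}

/* Copy the candidate return addresses from 'words' to 'out', which must
 * have room for 'count' entries. Returns the number of candidates. */
size_t filter_return_addresses(const uint32_t *words, size_t count, uint32_t *out)
{
    size_t n = 0;

    for (size_t i = 0; i < count; i++) {
        if (looks_like_return_address(words[i])) {
            out[n++] = words[i];
        }
    }
    return n;
}
```

With the dump above, values such as 0x2000e41c (RAM), 0x00020cf8 (even) and 0x0102fea9 (outside the assumed region) are dropped, while 0x00021869 is kept.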

10. Analyze:

Now, at this point there are no strict rules on how to proceed. However, check the source lines resolved by addr2line (note that the addresses are LR values, not PC values); this should point you as close as possible to the offending function. There is a pretty good chance that the top-most resolved source line is just after some kind of wrongly used memcpy/memset.

For example, one of the last LRs put on stack was parsed by addr2line as:

/home/yyy/devel/xxx/app/zz/src/zzzz_sd.c:331

If you check the source line of zzzz_sd.c you can see for instance:

328: void xxx_get_ipv4_addr_raw(char *addr)
329: {
330:     memcpy(addr, m_xxx_iface->ip_addr, IPV6_LENGTH);
331: }

The address points to the next instruction after branch to the memcpy. Now, just look into the memcpy call above and the bug becomes obvious. Got it!
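One way to make this class of bug harder to write is to pass the destination size along with the pointer. The sketch below is an assumption-laden illustration, not the project's actual fix: `struct iface`, `get_ipv4_addr_raw` and the value of `IPV4_LENGTH` are hypothetical, since the original definitions are not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed size; the original code used IPV6_LENGTH (16 bytes) here by mistake. */
#define IPV4_LENGTH 4

/* Hypothetical interface descriptor holding the raw IPv4 address. */
struct iface {
    uint8_t ip_addr[IPV4_LENGTH];
};

/* Copy the raw IPv4 address into 'addr'. Taking the destination size as a
 * parameter lets the function refuse a buffer that is too small instead of
 * overrunning it. Returns 0 on success, -1 on invalid arguments. */
int get_ipv4_addr_raw(const struct iface *iface, uint8_t *addr, size_t addr_size)
{
    if (iface == NULL || addr == NULL || addr_size < IPV4_LENGTH) {
        return -1;
    }
    memcpy(addr, iface->ip_addr, IPV4_LENGTH);
    return 0;
}
```

A caller would typically pass `sizeof(buffer)` as the last argument, so a mismatch between buffer and address length shows up as an error code rather than as a corrupted stack.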

Sunday, 3 January 2016

The Pong Year

It just so happens that every time I want to check out a new technology, I end up using pong as an example. Some time ago, when I was learning DirectX and C#, I created pong in 3D. I don't have this project anymore, but the game had an isometric view, and the pads and the ball were actually moving in two dimensions.

During this year, however, I created two more pong implementations. The first one is based on the ncurses library. Actually, the idea was to create a generic framework for terminal-based games and use it to implement a pong game as an example.

Here are the screenshots:

Another goal of this project was to add network gameplay. I started it, but for now it's abandoned. Although the multiplayer mode is not ready, the project still has features that are valuable to me:

Simple image-to-ascii converter
Doxygen documentation
Check unit tests
API for creating simple menus

You can check out the full project here. See readme for build instructions.

Later this year, I decided to learn Unity 3D. The goal was to create an Android game and push it to Google Play to see what the process looks like. Again, I failed with network gameplay. I have a working implementation for WiFi-based LAN, but I decided to exclude it from the final release because of numerous bugs related to re-connection handling.

Although the multiplayer mode wasn't released, the project still had some benefits. At the time I was creating it, Unity 3D didn't have support for LAN discovery. I wrote my own module, which provides the ability to advertise and search for the host. The module is based on broadcasting UDP packets. It uses one quite ugly hack for Android: it checks for the wlan0 or eth0 interfaces to determine if the connection is available. In most cases, having one of those interfaces up means the WiFi (or Ethernet) is enabled, but in theory it doesn't always have to be true. You can check out the module here. I don't develop the LAN discovery module anymore, because the latest Unity 3D has native support for it. On the other hand, local discovery is planned to be available only in the premium version of Unity in the future.

In terms of a single player mode, I'm quite satisfied with the AI algorithm. The speed of the pad is randomized (in range that can be chosen by parameters) and after each strike it moves back to the middle of the room. I've also created (in one or two evenings) short soundtrack music using Reaper, MT Power Drum Kit, and 4Front Bass. All of those tools are really great.

Unity itself seems to be a good engine. However, for small Android games it has quite a significant size overhead. There are two native libraries provided: one for ARM and one for the Intel architecture. If both are included in a final build, the size of an empty application will be almost 20MB. If you drop the Intel targets, you'll start with a ~9MB application. AFAIK it can be tuned in the premium version.

Here is what the end result looks like:

You can try it yourself on Google Play.

The conclusion from those projects for me is: no more pongs! I'm really bored creating pong implementations, I need to come up with a different template theme :)

Tuesday, 22 December 2015

ARM: bit fields under the hood

In this article I would like to share my observations about what happens under the hood of bit fields. I'll use the ARMv6-M and ARMv7-M architectures and the GCC ARM Embedded toolchain.

First, let's quickly recap the main data access instructions available on both architectures:

LDR - Loads a word from memory, and writes it to a register.
LDRH - Loads a halfword from memory, zero-extends it to form a 32-bit word, and writes it to a register.
LDRB - Loads a byte from memory, zero-extends it to form a 32-bit word, and writes it to a register.
STR - Stores a word from a register to memory.
STRH - Stores a halfword from a register to memory.
STRB - Stores a byte from a register to memory.

Other variants of those instructions exist, but for the purpose of this experiment let's stick to the basic ones listed above.

Note, there is a difference between ARMv6-M and ARMv7-M related to the alignment support.

ARMv6-M:

"ARMv6-M always generates a fault when an unaligned access occurs."

ARMv7-M:

"The system architecture can choose one of two policies for alignment checking in ARMv7-M:
• Support the unaligned access
• Generate a fault when an unaligned access occurs.
The policy varies with the type of access. An implementation can be configured to force alignment faults for all unaligned accesses."

OK, so on ARMv6-M things are pretty simple:

Access using LDR/STR must be word aligned.
Access using LDRH/STRH must be halfword aligned.
Byte access can be achieved using LDRB and STRB instructions.
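The three rules boil down to one check: the address must be a multiple of the access size. A minimal sketch (the helper name `is_aligned` is made up for this illustration, and the access size is assumed to be 1, 2 or 4):

```c
#include <stdbool.h>
#include <stdint.h>

/* True when 'addr' is a multiple of 'size'. 'size' must be a power of two:
 * 4 for LDR/STR, 2 for LDRH/STRH, 1 for LDRB/STRB. */
bool is_aligned(uint32_t addr, uint32_t size)
{
    return (addr & (size - 1u)) == 0u;
}
```

For example, 0x10034 passes the check for a word access, while 0x10031 is only valid for a byte access.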

ARMv7-M however can be configured to use hardware support for an unaligned access. When it's enabled, even LDR/STR instruction will not generate an exception while accessing the unaligned address. The drawback here will be a more complex bus access. See this article for more details (it refers to ARM compiler, not the GCC, but it's not a problem in this case). Having that in mind, let's move on.

Let's see a "normal" structure without any bit fields specified. Consider the following example:

struct {
    unsigned int a;
    unsigned int b;
    unsigned int c;
    unsigned int d;
    unsigned int e;
} data = {6, 3, 1, 6, 57672};

int _start()
{
    volatile unsigned int a = data.a;
    volatile unsigned int b = data.b;
    volatile unsigned int c = data.c;
    volatile unsigned int d = data.d;
    volatile unsigned int e = data.e;

    return 0;
}

Compile it (for now without optimizations):

arm-none-eabi-gcc -nostdlib -mthumb -O0 -march=armv6-m main.c -o test

And inspect:

mk@mk-VirtualBox:~/test/bitfields$ arm-none-eabi-objdump -D test 

test:     file format elf32-littlearm


Disassembly of section .text:

00008000 <_start>:
    8000: b580       push {r7, lr}
    8002: b086       sub sp, #24
    8004: af00       add r7, sp, #0
    8006: 4b0a       ldr r3, [pc, #40] ; (8030 <_start+0x30>)
    8008: 681b       ldr r3, [r3, #0]
    800a: 617b       str r3, [r7, #20]
    800c: 4b08       ldr r3, [pc, #32] ; (8030 <_start+0x30>)
    800e: 685b       ldr r3, [r3, #4]
    8010: 613b       str r3, [r7, #16]
    8012: 4b07       ldr r3, [pc, #28] ; (8030 <_start+0x30>)
    8014: 689b       ldr r3, [r3, #8]
    8016: 60fb       str r3, [r7, #12]
    8018: 4b05       ldr r3, [pc, #20] ; (8030 <_start+0x30>)
    801a: 68db       ldr r3, [r3, #12]
    801c: 60bb       str r3, [r7, #8]
    801e: 4b04       ldr r3, [pc, #16] ; (8030 <_start+0x30>)
    8020: 691b       ldr r3, [r3, #16]
    8022: 607b       str r3, [r7, #4]
    8024: 2300       movs r3, #0
    8026: 1c18       adds r0, r3, #0
    8028: 46bd       mov sp, r7
    802a: b006       add sp, #24
    802c: bd80       pop {r7, pc}
    802e: 46c0       nop   ; (mov r8, r8)
    8030: 00010034  andeq r0, r1, r4, lsr r0

Disassembly of section .data:

00010034 <__data_start>:
   10034: 00000006  
   10038: 00000003  
   1003c: 00000001  
   10040: 00000006  
   10044: 0000e148

You can do the same for ARMv7-M (by passing the -march=armv7-m flag) to see minor differences in the generated assembly, but they are not important for the purpose of this discussion.

What we need to notice are two things*:

The "data" structure occupies 20 bytes (5 words) in memory (addresses 0x10034 to 0x10044 in the .data section).
Accessing fields (for example at 0x8008 or 0x800e) is done using the ldr instruction.

* Note, I'm not discussing padding between fields in a structure (not happening here because all fields are 32 bits anyway).

So far so good. All fields in the structure are integers which are 4 bytes each on both architectures. The "data" variable is a global, so it starts on a word aligned address. The whole word can be read using ldr instruction.
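The layout seen in the disassembly can also be checked at compile time. The following sketch repeats the structure under a hypothetical tag (`struct plain`) and assumes a 4-byte unsigned int, which holds for both architectures discussed here and for most desktop compilers.

```c
#include <stddef.h>

/* Same members as the example structure; the tag name is hypothetical. */
struct plain {
    unsigned int a;
    unsigned int b;
    unsigned int c;
    unsigned int d;
    unsigned int e;
};

/* Compile-time checks (C11): five words in total, and each field sits at
 * the offset used by the corresponding ldr in the disassembly. */
_Static_assert(sizeof(struct plain) == 20, "structure should occupy five words");
_Static_assert(offsetof(struct plain, a) == 0, "a is read with ldr r3, [r3, #0]");
_Static_assert(offsetof(struct plain, b) == 4, "b is read with ldr r3, [r3, #4]");
_Static_assert(offsetof(struct plain, e) == 16, "e is read with ldr r3, [r3, #16]");
```

If any assumption about the layout is wrong for a given target, the build fails instead of the code misbehaving at run time.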

Now, suppose the structure represents a 32 bit register and its fields "a", "b", "c", "d" and "e" are respectively 4, 4, 1, 7 and 16 bits long:

To implement such structure we can use bit fields:

struct 
{
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
} data = {6, 3, 1, 6, 57672};

int _start()
{
    volatile unsigned int a = data.a;
    volatile unsigned int b = data.b;
    volatile unsigned int c = data.c;
    volatile unsigned int d = data.d;
    volatile unsigned int e = data.e;

    return 0;
}

Without optimizations, GCC will now produce the following code:

mk@mk-VirtualBox:~/test/bitfields$ arm-none-eabi-objdump -D test 

test:     file format elf32-littlearm


Disassembly of section .text:

00008000 <_start>:
    8000: b580       push {r7, lr}
    8002: b086       sub sp, #24
    8004: af00       add r7, sp, #0
    8006: 4b10       ldr r3, [pc, #64] ; (8048 <_start+0x48>)
    8008: 781b       ldrb r3, [r3, #0]
    800a: 071b       lsls r3, r3, #28
    800c: 0f1b       lsrs r3, r3, #28
    800e: b2db       uxtb r3, r3
    8010: 617b       str r3, [r7, #20]
    8012: 4b0d       ldr r3, [pc, #52] ; (8048 <_start+0x48>)
    8014: 781b       ldrb r3, [r3, #0]
    8016: 061b       lsls r3, r3, #24
    8018: 0f1b       lsrs r3, r3, #28
    801a: b2db       uxtb r3, r3
    801c: 613b       str r3, [r7, #16]
    801e: 4b0a       ldr r3, [pc, #40] ; (8048 <_start+0x48>)
    8020: 785b       ldrb r3, [r3, #1]
    8022: 07db       lsls r3, r3, #31
    8024: 0fdb       lsrs r3, r3, #31
    8026: b2db       uxtb r3, r3
    8028: 60fb       str r3, [r7, #12]
    802a: 4b07       ldr r3, [pc, #28] ; (8048 <_start+0x48>)
    802c: 785b       ldrb r3, [r3, #1]
    802e: 061b       lsls r3, r3, #24
    8030: 0e5b       lsrs r3, r3, #25
    8032: b2db       uxtb r3, r3
    8034: 60bb       str r3, [r7, #8]
    8036: 4b04       ldr r3, [pc, #16] ; (8048 <_start+0x48>)
    8038: 885b       ldrh r3, [r3, #2]
    803a: 607b       str r3, [r7, #4]
    803c: 2300       movs r3, #0
    803e: 1c18       adds r0, r3, #0
    8040: 46bd       mov sp, r7
    8042: b006       add sp, #24
    8044: bd80       pop {r7, pc}
    8046: 46c0       nop   ; (mov r8, r8)
    8048: 0001004c  andeq r0, r1, ip, asr #32

Disassembly of section .data:

0001004c <__data_start>:
   1004c: e1480d36

So, what's happening? Observations:

To get the "bit parts", the processor reads the smallest possible chunk of data (see for instance the ldrb at 0x8008), then shifts it left (lsls at 0x800a) and right (lsrs at 0x800c) to get rid of unwanted bits.
The generated code also uses the uxtb instruction, "Unsigned Extend Byte" (extracts an 8-bit value from a register, zero-extends it to 32 bits, and writes the result to the destination register).
Where possible, it uses instructions that read more than one byte (see the ldrh at 0x8038).
Because the sum of our bit fields doesn't exceed the word size (4 + 4 + 1 + 7 + 16 <= 32), we use only 4 bytes of data (the single word at 0x1004c).
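The shift pair from the first observation is easy to reproduce by hand. The helper below is a sketch written for this explanation (`extract_bits` is a made-up name); it works on the whole 32-bit word rather than on a single byte, and it assumes the field layout shown above, with "a" in the lowest bits.

```c
#include <stdint.h>

/* Extract 'width' bits starting at bit 'lsb', using the same left-then-right
 * shift pair the compiler emits (lsls followed by lsrs).
 * Requires width >= 1 and lsb + width <= 32. */
uint32_t extract_bits(uint32_t word, unsigned lsb, unsigned width)
{
    uint32_t tmp = word << (32u - lsb - width);  /* drop the bits above the field */

    return tmp >> (32u - width);                 /* drop the bits below the field */
}
```

Applied to the stored word 0xe1480d36, `extract_bits(word, 0, 4)` gives 6, `extract_bits(word, 4, 4)` gives 3, `extract_bits(word, 8, 1)` gives 1, `extract_bits(word, 9, 7)` gives 6 and `extract_bits(word, 16, 16)` gives 57672, which are exactly the initializer values.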

According to the last observation: if we add at least one more field, we'll need a whole new word to store it:

struct 
{
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
    unsigned int f : 1;
} data = {6, 3, 1, 6, 57672, 1};

int _start()
{
    volatile unsigned int a = data.a;
    volatile unsigned int b = data.b;
    volatile unsigned int c = data.c;
    volatile unsigned int d = data.d;
    volatile unsigned int e = data.e;
    volatile unsigned int f = data.f;

    return 0;
}

With the additional one-bit field "f", a new word is allocated (at 0x1005c):

Disassembly of section .text:

00008000 <_start>:
    8000: b580       push {r7, lr}
    8002: b086       sub sp, #24
    8004: af00       add r7, sp, #0
    8006: 4b13       ldr r3, [pc, #76] ; (8054 <_start+0x54>)
    8008: 781b       ldrb r3, [r3, #0]
    800a: 071b       lsls r3, r3, #28
    800c: 0f1b       lsrs r3, r3, #28
(..)
    803c: 4b05       ldr r3, [pc, #20] ; (8054 <_start+0x54>)
    803e: 791b       ldrb r3, [r3, #4]
    8040: 07db       lsls r3, r3, #31
    8042: 0fdb       lsrs r3, r3, #31
    8044: b2db       uxtb r3, r3
    8046: 603b       str r3, [r7, #0]
    8048: 2300       movs r3, #0
    804a: 1c18       adds r0, r3, #0
    804c: 46bd       mov sp, r7
    804e: b006       add sp, #24
    8050: bd80       pop {r7, pc}
    8052: 46c0       nop   ; (mov r8, r8)
    8054: 00010058  andeq r0, r1, r8, asr r0

Disassembly of section .data:

00010058 <__data_start>:
   10058: e1480d36  
   1005c: 00000001
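The size difference can be confirmed without looking at the disassembly. Bit-field layout is implementation-defined, so treat the following as a sketch of what GCC does on ARM (and on common desktop ABIs) rather than a guarantee; the structure tags are hypothetical.

```c
/* 4 + 4 + 1 + 7 + 16 = 32 bits: fits exactly in one unsigned int. */
struct fields32 {
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
};

/* One more bit does not fit, so a second storage unit is allocated. */
struct fields33 {
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
    unsigned int f : 1;
};

/* Compile-time checks (C11) for the sizes observed in the .data section. */
_Static_assert(sizeof(struct fields32) == sizeof(unsigned int),
               "32 bits of fields should fit in one word");
_Static_assert(sizeof(struct fields33) == 2 * sizeof(unsigned int),
               "the 33rd bit should spill into a second word");
```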

OK, so this is how it works. Just for reference, let's have a look at the generated assembly with optimizations enabled:

arm-none-eabi-gcc -nostdlib -mthumb -Os -march=armv6-m main.c -o test

mk@mk-VirtualBox:~/test/bitfields$ arm-none-eabi-objdump -D test 

test:     file format elf32-littlearm


Disassembly of section .text:

00008000 <_start>:
    8000: 4b0c       ldr r3, [pc, #48] ; (8034 <_start+0x34>)
    8002: b086       sub sp, #24
    8004: 781a       ldrb r2, [r3, #0]
    8006: 2000       movs r0, #0
    8008: 0711       lsls r1, r2, #28
    800a: 0f09       lsrs r1, r1, #28
    800c: 0912       lsrs r2, r2, #4
    800e: 9100       str r1, [sp, #0]
    8010: 9201       str r2, [sp, #4]
    8012: 785a       ldrb r2, [r3, #1]
    8014: 07d1       lsls r1, r2, #31
    8016: 0fc9       lsrs r1, r1, #31
    8018: b2c9       uxtb r1, r1
    801a: 0852       lsrs r2, r2, #1
    801c: 9102       str r1, [sp, #8]
    801e: 9203       str r2, [sp, #12]
    8020: 885a       ldrh r2, [r3, #2]
    8022: 791b       ldrb r3, [r3, #4]
    8024: 9204       str r2, [sp, #16]
    8026: 07db       lsls r3, r3, #31
    8028: 0fdb       lsrs r3, r3, #31
    802a: b2db       uxtb r3, r3
    802c: 9305       str r3, [sp, #20]
    802e: b006       add sp, #24
    8030: 4770       bx lr
    8032: 46c0       nop   ; (mov r8, r8)
    8034: 00010038  andeq r0, r1, r8, lsr r0

Disassembly of section .data:

00010038 <__data_start>:
   10038: e1480d36  
   1003c: 00000001

Just out of curiosity, let's also compile for ARMv7-M:

arm-none-eabi-gcc -nostdlib -mthumb -Os -march=armv7-m main.c -o test

mk@mk-VirtualBox:~/test/bitfields$ arm-none-eabi-objdump -D test 

test:     file format elf32-littlearm


Disassembly of section .text:

00008000 <_start>:
    8000: 4b0b       ldr r3, [pc, #44] ; (8030 <_start+0x30>)
    8002: b086       sub sp, #24
    8004: 781a       ldrb r2, [r3, #0]
    8006: 2000       movs r0, #0
    8008: f002 010f  and.w r1, r2, #15
    800c: 0912       lsrs r2, r2, #4
    800e: 9100       str r1, [sp, #0]
    8010: 9201       str r2, [sp, #4]
    8012: 785a       ldrb r2, [r3, #1]
    8014: f002 0101  and.w r1, r2, #1
    8018: 0852       lsrs r2, r2, #1
    801a: 9102       str r1, [sp, #8]
    801c: 9203       str r2, [sp, #12]
    801e: 885a       ldrh r2, [r3, #2]
    8020: 791b       ldrb r3, [r3, #4]
    8022: 9204       str r2, [sp, #16]
    8024: f003 0301  and.w r3, r3, #1
    8028: 9305       str r3, [sp, #20]
    802a: b006       add sp, #24
    802c: 4770       bx lr
    802e: bf00       nop
    8030: 00010034  andeq r0, r1, r4, lsr r0

Disassembly of section .data:

00010034 <__data_start>:
   10034: e1480d36  
   10038: 00000001

No major differences between them. In both cases we see that after optimization there are fewer actual data read instructions (the ldrb and ldrh lines: four loads for six fields), but, of course, there is still the "bit shuffle": shift pairs on ARMv6-M, and masks on ARMv7-M.

Now, our example "data" variable was aligned by the compiler. But in real life, a variable that stores bit fields can be wrongly cast or simply moved to an unaligned address by mistake. Consider the following dummy code:
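The only visible difference between the two optimized listings is how the low bits of a loaded byte are isolated: ARMv6-M uses an lsls/lsrs pair, while ARMv7-M uses a single and.w with an immediate. Both give the same result, as this small sketch shows (the function names are made up, and the field width is assumed to be between 1 and 31 bits):

```c
#include <stdint.h>

/* ARMv6-M style: shift the unwanted upper bits out, then shift back. */
uint32_t low_bits_shift(uint32_t value, unsigned width)
{
    return (value << (32u - width)) >> (32u - width);
}

/* ARMv7-M style: a single AND with a mask of 'width' ones. */
uint32_t low_bits_mask(uint32_t value, unsigned width)
{
    return value & ((1u << width) - 1u);
}
```

For the first byte of the structure, 0x36, both functions return 6 for a width of 4, which is the value of field "a".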

struct data_t
{
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
    unsigned int f : 1;
} data = {6, 3, 1, 6, 57672, 1};

int _start()
{

    volatile struct data_t *some_mem = (struct data_t *)0x10031;
    volatile unsigned int s = some_mem->a;

    return 0;
}

We deliberately pointed to an unaligned address. Although the field is only 4 bits long, the compiler will not use the ldrb instruction, because it assumes the beginning of the structure is aligned. Instead, it will use the ldr instruction, causing a hard fault exception due to unaligned access (at 0x8004):

mk@mk-VirtualBox:~/test/bitfields$ arm-none-eabi-objdump -D test 

test:     file format elf32-littlearm


Disassembly of section .text:

00008000 <_start>:
    8000: 4b04       ldr r3, [pc, #16] ; (8014 <_start+0x14>)
    8002: b082       sub sp, #8
    8004: 681b       ldr r3, [r3, #0]
    8006: 2000       movs r0, #0
    8008: 071b       lsls r3, r3, #28
    800a: 0f1b       lsrs r3, r3, #28
    800c: 9301       str r3, [sp, #4]
    800e: b002       add sp, #8
    8010: 4770       bx lr
    8012: 46c0       nop   ; (mov r8, r8)
    8014: 00010031  andeq r0, r1, r1, lsr r0

Disassembly of section .data:

00010018 <__data_start>:
   10018: e1480d36  
   1001c: 00000001

Conclusions? It's good to be aware of data access alignment issues. I started looking into this problem after investigating a bug that originated from a wrong cast of a structure that used bit fields (on a Cortex-M0). Although a specific bit field may look as if it could be accessed with a byte-wide instruction, that's not always the case. Sometimes nothing bad will happen, because the accessed field happens to be aligned by accident. What's more likely, however: sooner or later you'll get a hard fault exception because of an unaligned access. It's also worth noting that on ARMv7-M (as opposed to ARMv6-M) the unaligned access can be hidden from the programmer, causing more bus accesses but no hard faults.
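When the data really does live at an arbitrary address (for instance inside a received byte buffer), one common way to stay safe is to copy it into a properly aligned local variable first. The sketch below shows the idea for ordinary RAM buffers; it is not suitable for memory-mapped hardware registers, and the function name `read_a_unaligned` is hypothetical.

```c
#include <string.h>

struct data_t {
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
    unsigned int f : 1;
};

/* Read field 'a' from a structure image stored at a possibly unaligned
 * address. memcpy makes no alignment assumption about 'src', and the
 * local copy is aligned by the compiler, so the bit-field access is safe. */
unsigned int read_a_unaligned(const void *src)
{
    struct data_t copy;

    memcpy(&copy, src, sizeof(copy));
    return copy.a;
}
```

The cost is an extra copy of the structure, which is usually negligible compared to debugging a hard fault.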

Friday, 4 December 2015

LwIP IPv6 on K64F

A couple of weeks ago I did a bring-up of IPv6 connectivity over Ethernet on the Freedom K64F board. I used FreeRTOS combined with LwIP as the main components. Generally, everything went smoothly except for one thing: because IPv6 uses multicast during Neighbor Discovery instead of broadcast as in the old ARP, the Ethernet controller needs to accept a specific multicast MAC address. By default (if not in promiscuous mode), all frames with destination MAC addresses that are not on the "whitelist" are dropped by the HW network controller. We need to make an exception for the multicast MAC needed by the ICMPv6 protocol. Otherwise, even pings will not work, because devices cannot resolve each other's link-layer (MAC) addresses. The whole "hey! who has >ipv6< address?" protocol will not work without it.
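To make the address in question concrete: Neighbor Solicitations are sent to the solicited-node multicast group, which maps to the Ethernet address 33:33:ff followed by the last three bytes of the target IPv6 address. A minimal sketch of that mapping (the function name `solicited_node_mac` is made up and is not part of LwIP or the Kinetis SDK):

```c
#include <stdint.h>

/* Build the Ethernet multicast MAC for the solicited-node group of 'ipv6'
 * (16 bytes, network byte order): 33:33:ff followed by the last three
 * bytes of the address. */
void solicited_node_mac(const uint8_t ipv6[16], uint8_t mac[6])
{
    mac[0] = 0x33;
    mac[1] = 0x33;
    mac[2] = 0xff;
    mac[3] = ipv6[13];
    mac[4] = ipv6[14];
    mac[5] = ipv6[15];
}
```

When the IPv6 address is derived from the interface MAC (EUI-64 style), its last three bytes equal the last three bytes of the MAC, so the multicast address can also be built directly from the MAC address.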

As a quick solution I've fixed it on a driver layer:

enet_status_t ENET_DRV_Init(enet_dev_if_t * enetIfPtr, const enet_user_config_t* userConfig)
 {   
     enet_status_t result;
     uint32_t  frequency; 
     ENET_Type * base;
     uint32_t statusMask = 0;
     enet_cur_status_t curStatus;
     const enet_mac_config_t* macCfgPtr = userConfig->macCfgPtr;
     const enet_buff_config_t* buffCfgPtr = userConfig->buffCfgPtr;
+    uint32_t hash = 0;
+    uint8_t ipv6_multicast[6] = {0};
+    ipv6_multicast[0] = 0x33;
+    ipv6_multicast[1] = 0x33;
+    ipv6_multicast[2] = 0xff;
+    ipv6_multicast[3] = macCfgPtr->macAddr[3];
+    ipv6_multicast[4] = macCfgPtr->macAddr[4];
+    ipv6_multicast[5] = macCfgPtr->macAddr[5];
     
     enet_bd_config bdConfig = {0};
     /* Check the input parameters*/
     if ((!enetIfPtr) || (!macCfgPtr) || (!buffCfgPtr))
     {
         return kStatus_ENET_InvalidInput;
     }
 #if !ENET_RECEIVE_ALL_INTERRUPT
     /* POLL mode needs the extended buffer for data buffer update*/
     if((!buffCfgPtr->extRxBuffQue) || (!buffCfgPtr->extRxBuffNum))
     {
         return kStatus_ENET_InvalidInput;
     }
 #endif
     base = g_enetBase[enetIfPtr->deviceNumber];
 
     /* Store the global ENET structure for ISR input parameter*/
     enetIfHandle[enetIfPtr->deviceNumber] = enetIfPtr;
 
     /* Turn on ENET module clock gate */
     CLOCK_SYS_EnableEnetClock( 0U);
     frequency = CLOCK_SYS_GetSystemClockFreq();
     bdConfig.rxBds = buffCfgPtr->rxBdPtrAlign;
     bdConfig.rxBuffer = buffCfgPtr->rxBufferAlign;
     bdConfig.rxBdNumber = buffCfgPtr->rxBdNumber;
     bdConfig.rxBuffSizeAlign = buffCfgPtr->rxBuffSizeAlign;
     bdConfig.txBds = buffCfgPtr->txBdPtrAlign;
     bdConfig.txBuffer = buffCfgPtr->txBufferAlign;
     bdConfig.txBdNumber = buffCfgPtr->txBdNumber;
     bdConfig.txBuffSizeAlign = buffCfgPtr->txBuffSizeAlign;
     /* Init ENET MAC to reset status*/
     ENET_HAL_Init(base);
     /* Configure MAC controller*/
     ENET_HAL_Config(base, macCfgPtr, frequency, &bdConfig);
+    /* Add IPv6 multicast */
+    ENET_DRV_AddMulticastGroup(enetIfPtr->deviceNumber, ipv6_multicast, &hash);
(..)

A couple more minor fixes were needed as well. You can see the whole project here.

Monday, 30 November 2015

Minimal GCC setup for K64F

Introduction

Usually, the vendor of a hardware development kit (for small embedded applications) delivers a ready-to-use SDK. See for instance ST, Nordic or Freedom. Inside such an SDK there are a number of modules: startup code, drivers, libraries, toolchains; sometimes an OS and/or network stack is already ported as well. Usually the same SDK is shared between a variety of products of the same vendor (for instance different boards based on the same SoC). However, vendors can favor a toolchain that you're not using at all. SDKs are indispensable in general, but what if you just want to light an LED on your board and don't want to dig into the details of a complex SDK? What if the IDE for which the projects are configured by default is not your favorite? Let's see what the minimal possible GCC setup for running the K64F (Cortex-M4) actually is, without all the legacy startup code like this (example from the Kinetis SDK):

116 void SystemInit (void) {
117 #if ((__FPU_PRESENT == 1) && (__FPU_USED == 1))
118   SCB->CPACR |= ((3UL << 10*2) | (3UL << 11*2));    /* set CP10, CP11 Full Access */
119 #endif /* ((__FPU_PRESENT == 1) && (__FPU_USED == 1)) */
120 #if (DISABLE_WDOG)
121   /* WDOG->UNLOCK: WDOGUNLOCK=0xC520 */
122   WDOG->UNLOCK = WDOG_UNLOCK_WDOGUNLOCK(0xC520); /* Key 1 */
123   /* WDOG->UNLOCK: WDOGUNLOCK=0xD928 */
124   WDOG->UNLOCK = WDOG_UNLOCK_WDOGUNLOCK(0xD928); /* Key 2 */
125   /* WDOG->STCTRLH: ?=0,DISTESTWDOG=0,BYTESEL=0,TESTSEL=0,TESTWDOG=0,?=0,?=1,WAITEN=1,STOPEN=1,DBGEN=0,ALLOWUPDATE=1,WINEN=0,IRQRSTEN=0,CLKSRC=1,WDOGEN=0     */
126   WDOG->STCTRLH = WDOG_STCTRLH_BYTESEL(0x00) |
127                  WDOG_STCTRLH_WAITEN_MASK |
128                  WDOG_STCTRLH_STOPEN_MASK |
129                  WDOG_STCTRLH_ALLOWUPDATE_MASK |
130                  WDOG_STCTRLH_CLKSRC_MASK |
131                  0x0100U;
132 #endif /* (DISABLE_WDOG) */
133 #ifdef CLOCK_SETUP
134   if((RCM->SRS0 & RCM_SRS0_WAKEUP_MASK) != 0x00U)
135   {
136     if((PMC->REGSC & PMC_REGSC_ACKISO_MASK) != 0x00U)
137     {
138        PMC->REGSC |= PMC_REGSC_ACKISO_MASK; /* Release hold with ACKISO:  Only has an effect if recovering from VLLSx.*/
139     }
140   } else {
141 #ifdef SYSTEM_RTC_CR_VALUE
142     SIM_SCGC6 |= SIM_SCGC6_RTC_MASK;
143     if ((RTC_CR & RTC_CR_OSCE_MASK) == 0x00U) { /* Only if the OSCILLATOR is not already enabled */
144       RTC_CR = (uint32_t)((RTC_CR & (uint32_t)~(uint32_t)(RTC_CR_SC2P_MASK | RTC_CR_SC4P_MASK | RTC_CR_SC8P_MASK | RTC_CR_SC16P_MASK)) | (uint32_t)SYSTEM_RTC_CR_VALUE);
145       RTC_CR |= (uint32_t)RTC_CR_OSCE_MASK;
146       RTC_CR &= (uint32_t)~(uint32_t)RTC_CR_CLKO_MASK;
147     }
148 #endif
149   }
150 
151   /* Power mode protection initialization */
152 #ifdef SYSTEM_SMC_PMPROT_VALUE
153   SMC->PMPROT = SYSTEM_SMC_PMPROT_VALUE;
154 #endif
155 
156   /* System clock initialization */
157   /* Internal reference clock trim initialization */
158 #if defined(SLOW_TRIM_ADDRESS)
159   if ( *((uint8_t*)SLOW_TRIM_ADDRESS) != 0xFFU) {                              /* Skip if non-volatile flash memory is erased */
160     MCG->C3 = *((uint8_t*)SLOW_TRIM_ADDRESS);
161   #endif /* defined(SLOW_TRIM_ADDRESS) */
162   #if defined(SLOW_FINE_TRIM_ADDRESS)
163     MCG->C4 = (MCG->C4 & ~(MCG_C4_SCFTRIM_MASK)) | ((*((uint8_t*) SLOW_FINE_TRIM_ADDRESS)) & MCG_C4_SCFTRIM_MASK);
164   #endif
165   #if defined(FAST_TRIM_ADDRESS)
166     MCG->C4 = (MCG->C4 & ~(MCG_C4_FCTRIM_MASK)) |((*((uint8_t*) FAST_TRIM_ADDRESS)) & MCG_C4_FCTRIM_MASK);
167   #endif
168   #if defined(FAST_FINE_TRIM_ADDRESS)
169     MCG->C2 = (MCG->C2 & ~(MCG_C2_FCFTRIM_MASK)) | ((*((uint8_t*)FAST_TRIM_ADDRESS)) & MCG_C2_FCFTRIM_MASK);
170   #endif /* defined(FAST_FINE_TRIM_ADDRESS) */
171 #if defined(SLOW_TRIM_ADDRESS)
172   }
173   #endif /* defined(SLOW_TRIM_ADDRESS) */
174 
175   /* Set system prescalers and clock sources */
176   SIM->CLKDIV1 = SYSTEM_SIM_CLKDIV1_VALUE; /* Set system prescalers */
177   SIM->SOPT1 = ((SIM->SOPT1) & (uint32_t)(~(SIM_SOPT1_OSC32KSEL_MASK))) | ((SYSTEM_SIM_SOPT1_VALUE) & (SIM_SOPT1_OSC32KSEL_MASK)); /* Set 32 kHz clock source (ERCLK32K) */
178   SIM->SOPT2 = ((SIM->SOPT2) & (uint32_t)(~(SIM_SOPT2_PLLFLLSEL_MASK))) | ((SYSTEM_SIM_SOPT2_VALUE) & (SIM_SOPT2_PLLFLLSEL_MASK)); /* Selects the high frequency clock for various peripheral clocking options. */
179 #if ((MCG_MODE == MCG_MODE_FEI) || (MCG_MODE == MCG_MODE_FBI) || (MCG_MODE == MCG_MODE_BLPI))
180   /* Set MCG and OSC */
181 #if  ((((SYSTEM_OSC_CR_VALUE) & OSC_CR_ERCLKEN_MASK) != 0x00U) || ((((SYSTEM_MCG_C5_VALUE) & MCG_C5_PLLCLKEN0_MASK) != 0x00U) && (((SYSTEM_MCG_C7_VALUE) & MCG_C7_OSCSEL_MASK) == 0x00U)))
182   /* SIM_SCGC5: PORTA=1 */
183   SIM_SCGC5 |= SIM_SCGC5_PORTA_MASK;
184   /* PORTA_PCR18: ISF=0,MUX=0 */
185   PORTA_PCR18 &= (uint32_t)~(uint32_t)((PORT_PCR_ISF_MASK | PORT_PCR_MUX(0x07)));
186   if (((SYSTEM_MCG_C2_VALUE) & MCG_C2_EREFS_MASK) != 0x00U) {
187   /* PORTA_PCR19: ISF=0,MUX=0 */
188   PORTA_PCR19 &= (uint32_t)~(uint32_t)((PORT_PCR_ISF_MASK | PORT_PCR_MUX(0x07)));
189   }
190 #endif
191   MCG->SC = SYSTEM_MCG_SC_VALUE;       /* Set SC (fast clock internal reference divider) */
192   MCG->C1 = SYSTEM_MCG_C1_VALUE;       /* Set C1 (clock source selection, FLL ext. reference divider, int. reference enable etc.) */
193   /* Check that the source of the FLL reference clock is the requested one. */
194   if (((SYSTEM_MCG_C1_VALUE) & MCG_C1_IREFS_MASK) != 0x00U) {
195     while((MCG->S & MCG_S_IREFST_MASK) == 0x00U) {
196     }
197   } else {
198     while((MCG->S & MCG_S_IREFST_MASK) != 0x00U) {
199     }
200   }
201   MCG->C2 = (MCG->C2 & (uint8_t)(~(MCG_C2_FCFTRIM_MASK))) | (SYSTEM_MCG_C2_VALUE & (uint8_t)(~(MCG_C2_LP_MASK))); /* Set C2 (freq. range, ext. and int. reference selection etc. excluding trim bits; low power bit is set later) */
202   MCG->C4 = ((SYSTEM_MCG_C4_VALUE) & (uint8_t)(~(MCG_C4_FCTRIM_MASK | MCG_C4_SCFTRIM_MASK))) | (MCG->C4 & (MCG_C4_FCTRIM_MASK | MCG_C4_SCFTRIM_MASK)); /* Set C4 (FLL output; trim values not changed) */
203   OSC->CR = SYSTEM_OSC_CR_VALUE;       /* Set OSC_CR (OSCERCLK enable, oscillator capacitor load) */
204   MCG->C7 = SYSTEM_MCG_C7_VALUE;       /* Set C7 (OSC Clock Select) */
205   #if (MCG_MODE == MCG_MODE_BLPI)
206   /* BLPI specific */
207   MCG->C2 |= (MCG_C2_LP_MASK);         /* Disable FLL and PLL in bypass mode */
208   #endif
209 
210 #else /* MCG_MODE */
211   /* Set MCG and OSC */
212 #if  (((SYSTEM_OSC_CR_VALUE) & OSC_CR_ERCLKEN_MASK) != 0x00U) || (((SYSTEM_MCG_C7_VALUE) & MCG_C7_OSCSEL_MASK) == 0x00U)
213   /* SIM_SCGC5: PORTA=1 */
214   SIM_SCGC5 |= SIM_SCGC5_PORTA_MASK;
215   /* PORTA_PCR18: ISF=0,MUX=0 */
216   PORTA_PCR18 &= (uint32_t)~(uint32_t)((PORT_PCR_ISF_MASK | PORT_PCR_MUX(0x07)));
217   if (((SYSTEM_MCG_C2_VALUE) & MCG_C2_EREFS_MASK) != 0x00U) {
218   /* PORTA_PCR19: ISF=0,MUX=0 */
219   PORTA_PCR19 &= (uint32_t)~(uint32_t)((PORT_PCR_ISF_MASK | PORT_PCR_MUX(0x07)));
220   }
221 #endif
222   MCG->SC = SYSTEM_MCG_SC_VALUE;       /* Set SC (fast clock internal reference divider) */
223   MCG->C2 = (MCG->C2 & (uint8_t)(~(MCG_C2_FCFTRIM_MASK))) | (SYSTEM_MCG_C2_VALUE & (uint8_t)(~(MCG_C2_LP_MASK))); /* Set C2 (freq. range, ext. and int. reference selection etc. excluding trim bits; low power bit is set later) */
224   OSC->CR = SYSTEM_OSC_CR_VALUE;       /* Set OSC_CR (OSCERCLK enable, oscillator capacitor load) */
225   MCG->C7 = SYSTEM_MCG_C7_VALUE;       /* Set C7 (OSC Clock Select) */
226   #if (MCG_MODE == MCG_MODE_PEE)
227   MCG->C1 = (SYSTEM_MCG_C1_VALUE) | MCG_C1_CLKS(0x02); /* Set C1 (clock source selection, FLL ext. reference divider, int. reference enable etc.) - PBE mode*/
228   #else
229   MCG->C1 = SYSTEM_MCG_C1_VALUE;       /* Set C1 (clock source selection, FLL ext. reference divider, int. reference enable etc.) */
230   #endif
231   if ((((SYSTEM_MCG_C2_VALUE) & MCG_C2_EREFS_MASK) != 0x00U) && (((SYSTEM_MCG_C7_VALUE) & MCG_C7_OSCSEL_MASK) == 0x00U)) {
232     while((MCG->S & MCG_S_OSCINIT0_MASK) == 0x00U) { /* Check that the oscillator is running */
233     }
234   }
235   /* Check that the source of the FLL reference clock is the requested one. */
236   if (((SYSTEM_MCG_C1_VALUE) & MCG_C1_IREFS_MASK) != 0x00U) {
237     while((MCG->S & MCG_S_IREFST_MASK) == 0x00U) {
238     }
239   } else {
240     while((MCG->S & MCG_S_IREFST_MASK) != 0x00U) {
241     }
242   }
243   MCG->C4 = ((SYSTEM_MCG_C4_VALUE)  & (uint8_t)(~(MCG_C4_FCTRIM_MASK | MCG_C4_SCFTRIM_MASK))) | (MCG->C4 & (MCG_C4_FCTRIM_MASK | MCG_C4_SCFTRIM_MASK)); /* Set C4 (FLL output; trim values not changed) */
244 #endif /* MCG_MODE */
245 
246   /* Common for all MCG modes */
247 
248   /* PLL clock can be used to generate clock for some devices regardless of clock generator (MCGOUTCLK) mode. */
249   MCG->C5 = (SYSTEM_MCG_C5_VALUE) & (uint8_t)(~(MCG_C5_PLLCLKEN0_MASK)); /* Set C5 (PLL settings, PLL reference divider etc.) */
250   MCG->C6 = (SYSTEM_MCG_C6_VALUE) & (uint8_t)~(MCG_C6_PLLS_MASK); /* Set C6 (PLL select, VCO divider etc.) */
251   if ((SYSTEM_MCG_C5_VALUE) & MCG_C5_PLLCLKEN0_MASK) {
252     MCG->C5 |= MCG_C5_PLLCLKEN0_MASK;  /* PLL clock enable in mode other than PEE or PBE */
253   }
254   /* BLPE, PEE and PBE MCG mode specific */
255 
256 #if (MCG_MODE == MCG_MODE_BLPE)
257   MCG->C2 |= (MCG_C2_LP_MASK);         /* Disable FLL and PLL in bypass mode */
258 #elif ((MCG_MODE == MCG_MODE_PBE) || (MCG_MODE == MCG_MODE_PEE))
259   MCG->C6 |= (MCG_C6_PLLS_MASK);       /* Set C6 (PLL select, VCO divider etc.) */
260   while((MCG->S & MCG_S_LOCK0_MASK) == 0x00U) { /* Wait until PLL is locked*/
261   }
262   #if (MCG_MODE == MCG_MODE_PEE)
263   MCG->C1 &= (uint8_t)~(MCG_C1_CLKS_MASK);
264   #endif
265 #endif
266 #if ((MCG_MODE == MCG_MODE_FEI) || (MCG_MODE == MCG_MODE_FEE))
267   while((MCG->S & MCG_S_CLKST_MASK) != 0x00U) { /* Wait until output of the FLL is selected */
268   }
269   /* Use LPTMR to wait for 1ms dor FLL clock stabilization */
270   SIM_SCGC5 |= SIM_SCGC5_LPTMR_MASK;   /* Alow software control of LPMTR */
271   LPTMR0->CMR = LPTMR_CMR_COMPARE(0);  /* Default 1 LPO tick */
272   LPTMR0->CSR = (LPTMR_CSR_TCF_MASK | LPTMR_CSR_TPS(0x00));
273   LPTMR0->PSR = (LPTMR_PSR_PCS(0x01) | LPTMR_PSR_PBYP_MASK); /* Clock source: LPO, Prescaler bypass enable */
274   LPTMR0->CSR = LPTMR_CSR_TEN_MASK;    /* LPMTR enable */
275   while((LPTMR0_CSR & LPTMR_CSR_TCF_MASK) == 0u) {
276   }
277   LPTMR0_CSR = 0x00;                   /* Disable LPTMR */
278   SIM_SCGC5 &= (uint32_t)~(uint32_t)SIM_SCGC5_LPTMR_MASK;
279 #elif ((MCG_MODE == MCG_MODE_FBI) || (MCG_MODE == MCG_MODE_BLPI))
280   while((MCG->S & MCG_S_CLKST_MASK) != 0x04U) { /* Wait until internal reference clock is selected as MCG output */
281   }
282 #elif ((MCG_MODE == MCG_MODE_FBE) || (MCG_MODE == MCG_MODE_PBE) || (MCG_MODE == MCG_MODE_BLPE))
283   while((MCG->S & MCG_S_CLKST_MASK) != 0x08U) { /* Wait until external reference clock is selected as MCG output */
284   }
285 #elif (MCG_MODE == MCG_MODE_PEE)
286   while((MCG->S & MCG_S_CLKST_MASK) != 0x0CU) { /* Wait until output of the PLL is selected */
287   }
288 #endif
289 #if (((SYSTEM_SMC_PMCTRL_VALUE) & SMC_PMCTRL_RUNM_MASK) == (0x02U << SMC_PMCTRL_RUNM_SHIFT))
290   SMC->PMCTRL = (uint8_t)((SYSTEM_SMC_PMCTRL_VALUE) & (SMC_PMCTRL_RUNM_MASK)); /* Enable VLPR mode */
291   while(SMC->PMSTAT != 0x04U) {        /* Wait until the system is in VLPR mode */
292   }
293 #endif
294 
295 #if defined(SYSTEM_SIM_CLKDIV2_VALUE)
296   SIM->CLKDIV2 = ((SIM->CLKDIV2) & (uint32_t)(~(SIM_CLKDIV2_USBFRAC_MASK | SIM_CLKDIV2_USBDIV_MASK))) | ((SYSTEM_SIM_CLKDIV2_VALUE) & (SIM_CLKDIV2_USBFRAC_MASK | SIM_CLKDIV2_USBDIV_MASK)); /* Selects the USB clock divider. */
297 #endif
298 
299   /* PLL loss of lock interrupt request initialization */
300   if (((SYSTEM_MCG_C6_VALUE) & MCG_C6_LOLIE0_MASK) != 0U) {
301     NVIC_EnableIRQ(MCG_IRQn);          /* Enable PLL loss of lock interrupt request */
302   }
303 #endif
304 }

Seriously, this startup code scares me. I know that most of its parts are wrapped in #ifdefs, but the number of "magic" values and the generally messy code style really discourage me. Do I need all that stuff? Chances are that for a large project I do. However, I doubt I need it to light one LED. Let's start everything from scratch.

Getting started

Most of the steps below are specific to the K64F, but you may also find them helpful as a general approach to bringing up any board.

Assuming you have the hardware already:

Download the Reference Manual for the Freedom K64 Sub-Family.
Inspect "Table 4-1. System memory map":
Notice that program code and read-only data (including exception vectors) are located between 0x00000000 and 0x07FFFFFF. RAM is split into two regions: 0x1FFF0000-0x1FFFFFFF and 0x20000000-0x2002FFFF.
According to the specs, the board physically has 1MB of flash and 256KB of RAM. This makes 0x000FFFFF the last address actually available in flash, and indeed 0x2002FFFF the last address in RAM.
You can read more about the SRAM split in the Reference Manual. For the purposes of this article, we stick to the upper region (the one starting at 0x20000000), which is 192KB long.
In many cases this would be all the information we need. But in the case of the K64 family, we need to notice two more things. The first one is the "Flash configuration field". Refer to "29.3.1 Flash configuration field description" for details. In short: the flash addresses between 0x00000400 and 0x0000040F are very special. The values stored there configure other subsystems, so you cannot overwrite them with your application data or code. The other thing is the watchdog: see "24.3.1 Unlocking and updating the watchdog", which contains the following statement:

"Write 0xC520 followed by 0xD928 within 20 bus clock cycles to a specific unlock register (WDOG_UNLOCK)".

We'll need this information later.
Now, if you don't know what a vector table is, download the ARM ARM for Cortex-M4 (ARMv7-M) and see "B1.5.3 The vector table":

"The vector table contains the initialization value for the stack pointer, and the entry point addresses of each exception handler."

The K64F expects the vector table to be at address 0x00000000 by default.
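The address arithmetic from the memory map step above is easy to double-check on a PC. This is only a host-side sketch, not firmware: the helper names region_last and in_flash_cfg_field are made up for illustration, and the 0x00000400-0x0000040F range of the flash configuration field is my reading of the Reference Manual.

```c
#include <stdint.h>
#include <stdbool.h>

/* Last valid address of a region that starts at `base` and is `size` bytes long. */
uint32_t region_last(uint32_t base, uint32_t size)
{
    return base + size - 1u;
}

/* True if `addr` falls inside the 16-byte flash configuration field
 * (assumed here to occupy 0x00000400..0x0000040F). */
bool in_flash_cfg_field(uint32_t addr)
{
    return addr >= 0x00000400u && addr <= 0x0000040Fu;
}
```

With 1MB of flash starting at 0 and 192KB of upper SRAM starting at 0x20000000, region_last gives the last usable address of each region.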

Linker script

Because we're starting the project from scratch, we need to create our own linker script. Let's name it, for example, k64f.ld and start editing it:

MEMORY
{
    ROM_VECTORS (rx) : ORIGIN = 0x00000000, LENGTH = 0x00000400
    ROM_FLASH_CFG (rx) : ORIGIN = 0x00000400, LENGTH = 0x00000010
    ROM_TEXT (rx) : ORIGIN = 0x00000410, LENGTH = 512K
    RAM (rw) : ORIGIN = 0x20000000, LENGTH = 192K
}

This part of the linker script defines our target memory layout. In this example I chose 512K as the size of ROM_TEXT, but remember that you can increase it up to 1MB minus the lengths of ROM_VECTORS and ROM_FLASH_CFG. The layout has three regions in flash (vectors, configuration and code) and one region in RAM, which matches our observations from the K64 Reference Manual. The names "ROM_VECTORS", "ROM_FLASH_CFG" etc. are chosen arbitrarily.
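To see how far ROM_TEXT can grow, the subtraction can be spelled out. The sketch below is plain host-side C under the assumption of 1MB of program flash; the macro and function names are mine, only the lengths come from the MEMORY block above.

```c
#include <stdint.h>

#define FLASH_SIZE        (1024u * 1024u) /* 1MB of program flash (assumed) */
#define ROM_VECTORS_LEN   0x00000400u     /* LENGTH of ROM_VECTORS */
#define ROM_FLASH_CFG_LEN 0x00000010u     /* LENGTH of ROM_FLASH_CFG */

/* Largest LENGTH that ROM_TEXT can take without running past the flash. */
uint32_t rom_text_max_len(void)
{
    return FLASH_SIZE - ROM_VECTORS_LEN - ROM_FLASH_CFG_LEN;
}

/* Two regions are back to back if the first ends exactly where the second starts. */
int regions_adjacent(uint32_t origin_a, uint32_t len_a, uint32_t origin_b)
{
    return origin_a + len_a == origin_b;
}
```

The result is 0xFFBF0 bytes, so under these assumptions LENGTH = 0xFFBF0 would be the upper limit for ROM_TEXT.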

Now we need to define which input sections from the input files go to which output sections of the ELF file:

"You use input section descriptions to tell the linker how to map the input files into your memory layout."

By default, the compiler implicitly creates the following input sections:

text - for program code.
rodata - for read-only data like constants.
data - for initialized global variables.
bss and COMMON - for uninitialized global variables.

These are the very basic sections that we can expect when we're not linking with the standard library. Of course, we can explicitly add our own custom sections for special purposes (we'll see how later). Our custom input sections will be:

vectors - for storing vector table.
flash_config - for storing K64F specific configuration data.

The main things we do in the linker script are mapping input sections to output sections and creating our own symbols. An example that places the input sections "text" and "rodata" in the output section called "text" looks like this:

    .text :
    {
        . = ALIGN(4);
        *(.text*)
        *(.rodata*)
        . = ALIGN(4);
    } > ROM_TEXT

This script also says that the output section "text" should be placed in the ROM_TEXT region (which we have already defined). The example also shows that we can align the location counter to 4 bytes before and after processing the input sections. The location counter (dot) is explained below. The main conclusion is that all input sections named text* and rodata* will be placed in the text output section.
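If the effect of ALIGN(4) on the counter (dot) is not obvious, this small host-side helper mimics it. The name align_up is my own; the linker of course does this internally.

```c
#include <stdint.h>

/* Round `addr` up to the next multiple of `align` (which must be a power
 * of two). This mirrors what `. = ALIGN(align);` does to the counter. */
uint32_t align_up(uint32_t addr, uint32_t align)
{
    return (addr + align - 1u) & ~(align - 1u);
}
```

An address that is already aligned stays where it is; anything else moves forward to the next 4-byte boundary.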

Besides mapping input sections to output sections, we can also declare global symbols in the linker script. Those symbols can be very useful. See, for instance, the following example:

    .bss :
    {
        . = ALIGN(4);
        __bss_start__ = . ;
        *(.bss*)
        *(COMMON)
        __bss_end__ = . ;
        . = ALIGN(4);
    } > RAM

We see two custom symbols created here: __bss_start__ and __bss_end__ (names chosen arbitrarily). Those symbols can be accessed from C code. No storage is reserved for them, so their value is undetermined. However, the address of each symbol is defined and equals the value assigned to it in the linker script. In the above example, the assigned value was "." (dot). Dot is a special symbol in linker script syntax, the location counter, which holds the current address in the memory layout being processed. For example, before the linker processes the bss section, dot could be equal to 0x20000000. After that, depending on how many uninitialized global variables there are in the input files, the bss and COMMON sections "stretch" the address space accordingly. Let's say bss is 256 bytes long and COMMON is 256 bytes long as well. After processing those two sections, dot will have the value 0x20000200 (512 bytes from 0x20000000). It means that the symbol __bss_start__ will be created at address 0x20000000 and the symbol __bss_end__ at address 0x20000200. In C code (or through a debugger), if you print the value of __bss_start__ you'll get garbage. If you print &__bss_start__ you'll get 0x20000000.
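The 256 + 256 byte example can be replayed with a toy model of the dot. This is just a host-side illustration: struct layout, mark and place are hypothetical names, not anything the linker exposes.

```c
#include <stdint.h>

/* Toy model of the linker's counter while it lays out one output section. */
struct layout {
    uint32_t dot; /* current address, i.e. the value of "." */
};

/* Record the current address, like `symbol = .;` in the script. */
uint32_t mark(const struct layout *l)
{
    return l->dot;
}

/* Place `size` bytes of input-section content; the counter moves forward. */
void place(struct layout *l, uint32_t size)
{
    l->dot += size;
}
```

Starting with dot at 0x20000000, marking the start, placing two 256-byte sections and marking the end gives 0x20000000 and 0x20000200, the same addresses as in the text.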

OK, so we know that in the linker script we can map input sections to output sections and that we can create our own symbols. We also know how to create a memory layout. Here's the complete linker script for our example project:

  1 MEMORY
  2 {
  3     ROM_VECTORS (rx) : ORIGIN = 0x00000000, LENGTH = 0x00000400
  4     ROM_FLASH_CFG (rx) : ORIGIN = 0x00000400, LENGTH = 0x00000010
  5     ROM_TEXT (rx) : ORIGIN = 0x00000410, LENGTH = 512K
  6     RAM (rw) : ORIGIN = 0x20000000, LENGTH = 192K
  7 }
  8 
  9 SECTIONS
 10 {
 11     .vectors :
 12     {
 13         . = ALIGN(4);
 14         *(.vectors)
 15         . = ALIGN(4);
 16     } > ROM_VECTORS
 17 
 18     .flash_cfg :
 19     {
 20         . = ALIGN(4);
 21         *(.flash_config)
 22         . = ALIGN(4);
 23     } > ROM_FLASH_CFG
 24 
 25     .text :
 26     {
 27         . = ALIGN(4);
 28         *(.text*)
 29         *(.rodata*)
 30         . = ALIGN(4);
 31     } > ROM_TEXT
 32 
 33     _sfdata = LOADADDR(.data);
 34     .data :
 35     {
 36         . = ALIGN(4);
 37         _sdata = .;
 38         *(.data*)
 39         _edata = .;
 40         . = ALIGN(4);
 41     } > RAM AT> ROM_TEXT
 42 
 43     .bss :
 44     {
 45         . = ALIGN(4);
 46         __bss_start__ = . ;
 47         *(.bss*)
 48         *(COMMON)
 49         __bss_end__ = . ;
 50         . = ALIGN(4);
 51     } > RAM
 52 
 53     _stack_top = ORIGIN(RAM) + LENGTH(RAM);
 54 }

It's a very simplified linker script, without the sections needed by the standard library. One more thing is worth mentioning: because our output must be a binary file and there is no ELF bootloader on the board, the linker cannot place the data section directly in RAM. If we instruct it to do so, the output file will be ~512MB in size, because the whole address space between the code (around 0x00000000) and RAM (around 0x20000000) would be included as well. This is why we redirect the section's load address with "AT> ROM_TEXT" (line 41), and also why we create the _sfdata symbol. The whole concept is to store the data in flash and copy it into RAM at startup. We don't need to do the same with the bss section because it never occupies any space in the output file. Instead, we need to zero the address space between __bss_start__ and __bss_end__ manually at startup. The last thing worth mentioning is that we have created the _stack_top symbol at the end of RAM (the first address past the region, which is fine because the stack grows downwards). We'll need it later.
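Where the ~512MB comes from can be shown with a rough model of a flat binary: it spans from the lowest load address to the end of the highest loaded chunk, gaps included. The sketch below is host-side C with made-up names (struct chunk, flat_binary_size) and made-up section sizes; it only approximates what objcopy does.

```c
#include <stdint.h>
#include <stddef.h>

/* One chunk of the image: where it is loaded and how many bytes it has. */
struct chunk {
    uint32_t load_addr;
    uint32_t size;
};

/* Size of a flat binary that spans from the lowest load address
 * to the end of the highest chunk, gaps included. */
uint32_t flat_binary_size(const struct chunk *chunks, size_t count)
{
    uint32_t lowest = UINT32_MAX;
    uint32_t highest_end = 0u;

    for (size_t i = 0; i < count; i++) {
        uint32_t end = chunks[i].load_addr + chunks[i].size;
        if (chunks[i].load_addr < lowest) {
            lowest = chunks[i].load_addr;
        }
        if (end > highest_end) {
            highest_end = end;
        }
    }
    return (count == 0u) ? 0u : highest_end - lowest;
}
```

With a 16-byte data section loaded at 0x20000000 the model reports a bit over 0x20000000 bytes; with the same section loaded right after the code in flash it reports only the sum of the real contents.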

Startup code

As we know from the ARM ARM, the processor does two things on startup:

Load the Stack Pointer with the value stored at the beginning of the vector table.
Execute the reset handler. The address of the reset handler is stored just after the initial stack pointer value in the vector table.
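The two steps above amount to reading two 32-bit words from the start of flash. Here is a host-side sketch that decodes them from the first bytes of an image; the struct and function names are invented for this example, and the sample values in the usage note are assumptions, not taken from a real build.

```c
#include <stdint.h>

/* What the core fetches at reset: the first two words of the vector table. */
struct reset_state {
    uint32_t sp; /* initial stack pointer */
    uint32_t pc; /* address where execution starts */
};

/* Read a little-endian 32-bit word from a byte image. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the first two vector table entries of a flash image.
 * Bit 0 of the reset vector marks Thumb state and is not part of the address. */
struct reset_state decode_reset(const uint8_t *image)
{
    struct reset_state s;
    s.sp = read_le32(image);
    s.pc = read_le32(image + 4) & ~1u;
    return s;
}
```

For an image starting with the bytes 00 00 03 20 11 04 00 00 this yields SP = 0x20030000 and a handler at 0x00000410.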

Our task is to prepare the vector table and the reset handler. In our example, we don't care about any exceptions besides reset. In a real-life scenario the whole vector table must be implemented. Let's create the file startup.s:

.cpu cortex-m4
.thumb

.section .vectors, "a"
    .word _stack_top        @ initial stack pointer
    .word _reset            @ reset handler

.section .flash_config, "a"
    .long 0xFFFFFFFF
    .long 0xFFFFFFFF
    .long 0xFFFFFFFF
    .long 0xFFFFFFFE

.section .text
.thumb_func
.global _reset
_reset:
    bl init
    bl main
    b .                     @ park here if main ever returns

Above we can see how to create the custom input sections that we were talking about earlier. We've created the vectors (line 4) and flash_config (line 8) input sections. As we see, the vector table contains only two entries. The first one is the initial SP value, which is generated by our linker script. The second one is the address of the reset handler, which is defined in the same file at line 17. The flash_config section contains values specific to the K64F. You can decode them using the Reference Manual. Note that the least significant byte of the last word is FE (line 12); since the core is little-endian, this byte is stored at address 0x0000040C.
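To see which address each configuration byte ends up at, the four .long values can be expanded into bytes on a PC. This is a host-side sketch with invented helper names. My reading of the Reference Manual is that 0x0000040C holds the flash security byte (FSEC) and that its two lowest bits being 10 keep the chip unsecured, but treat that as an assumption and verify it against the manual.

```c
#include <stdint.h>

#define FLASH_CFG_BASE 0x00000400u
#define FLASH_CFG_SIZE 16u

/* Lay out four 32-bit words the way `.long` emits them on a little-endian
 * Cortex-M4: least significant byte at the lowest address. */
void flash_cfg_bytes(const uint32_t words[4], uint8_t out[FLASH_CFG_SIZE])
{
    for (unsigned i = 0; i < 4u; i++) {
        for (unsigned b = 0; b < 4u; b++) {
            out[i * 4u + b] = (uint8_t)(words[i] >> (8u * b));
        }
    }
}

/* Byte stored at absolute flash address `addr` inside the field. */
uint8_t flash_cfg_byte_at(const uint8_t bytes[FLASH_CFG_SIZE], uint32_t addr)
{
    return bytes[addr - FLASH_CFG_BASE];
}
```

Feeding it the four words from startup.s shows FE at 0x0000040C and FF everywhere else in the field.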

So, after power is connected, the processor loads SP with the address of the _stack_top symbol and branches to _reset, which calls the init function. Let's create the startup.c file:

#define WDOG_STCTRLH (*(volatile unsigned short *)0x40052000u)
#define WDOG_UNLOCK (*(volatile unsigned short *)0x4005200Eu)

#define WDOG_UNLOCK_WDOGUNLOCK_MASK 0xFFFFu
#define WDOG_UNLOCK_WDOGUNLOCK_SHIFT 0
#define WDOG_UNLOCK_WDOGUNLOCK_WIDTH 16
#define WDOG_UNLOCK_WDOGUNLOCK(x) (((unsigned short)(((unsigned short)(x))<<WDOG_UNLOCK_WDOGUNLOCK_SHIFT))&WDOG_UNLOCK_WDOGUNLOCK_MASK)

#define WDOG_STCTRLH_WAITEN_MASK 0x80u
#define WDOG_STCTRLH_STOPEN_MASK 0x40u
#define WDOG_STCTRLH_ALLOWUPDATE_MASK 0x10u
#define WDOG_STCTRLH_CLKSRC_MASK 0x2u

#define WDOG_STCTRLH_BYTESEL_MASK 0x3000u
#define WDOG_STCTRLH_BYTESEL_SHIFT 12
#define WDOG_STCTRLH_BYTESEL(x) (((unsigned short)(((unsigned short)(x))<<WDOG_STCTRLH_BYTESEL_SHIFT))&WDOG_STCTRLH_BYTESEL_MASK)

extern unsigned int _sfdata;
extern unsigned int _edata;
extern unsigned int _sdata;
extern unsigned int __bss_start__;
extern unsigned int __bss_end__;

void init(void)
{
    unsigned int *src, *dst;

    /* Unlock the watchdog, then write its control register with WDOGEN cleared. */
    WDOG_UNLOCK = WDOG_UNLOCK_WDOGUNLOCK(0xC520);
    WDOG_UNLOCK = WDOG_UNLOCK_WDOGUNLOCK(0xD928);
    WDOG_STCTRLH = WDOG_STCTRLH_BYTESEL(0x00) |
        WDOG_STCTRLH_WAITEN_MASK |
        WDOG_STCTRLH_STOPEN_MASK |
        WDOG_STCTRLH_ALLOWUPDATE_MASK |
        WDOG_STCTRLH_CLKSRC_MASK |
        0x0100U;

    /* Copy the initialized data from its load address in flash to RAM. */
    src = &_sfdata;

    for(dst = &_sdata; dst < &_edata;)
    {
        *(dst++) = *(src++);
    }

    /* Zero the bss section. */
    for(src = &__bss_start__; src < &__bss_end__;)
    {
        *(src++) = 0;
    }
}

What's happening here? Three things:

Disable watchdog
Copy data sections to RAM
Zero bss section

That's all we need.

The values in lines 1-16 can be found in the Reference Manual. They describe a few registers that need to be written in a specific order to deactivate the watchdog. I mentioned this at the beginning.
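To make the "magic" a bit less magic, here is the value that ends up in WDOG_STCTRLH, computed on a PC. The flag masks are the ones used in startup.c; WDOG_STCTRLH_WDOGEN_MASK (bit 0) is not defined there, so take it as my assumption based on the register description, and wdog_disable_value is just an illustrative name.

```c
#include <stdint.h>

#define WDOG_STCTRLH_WDOGEN_MASK      0x0001u /* assumed: watchdog enable bit */
#define WDOG_STCTRLH_CLKSRC_MASK      0x0002u
#define WDOG_STCTRLH_ALLOWUPDATE_MASK 0x0010u
#define WDOG_STCTRLH_STOPEN_MASK      0x0040u
#define WDOG_STCTRLH_WAITEN_MASK      0x0080u

/* Value written to WDOG_STCTRLH by init(): every flag the startup code sets,
 * plus the extra 0x0100 constant it ORs in, with WDOGEN left at 0. */
uint16_t wdog_disable_value(void)
{
    return (uint16_t)(WDOG_STCTRLH_WAITEN_MASK |
                      WDOG_STCTRLH_STOPEN_MASK |
                      WDOG_STCTRLH_ALLOWUPDATE_MASK |
                      WDOG_STCTRLH_CLKSRC_MASK |
                      0x0100u);
}
```

The result is 0x01D2, and the important part is that bit 0 stays cleared.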

The next thing we do is use the _sfdata, _sdata and _edata symbols, all of which we created in the linker script. _sfdata is placed at the flash address where the stored copy of the data section begins. _sdata is a symbol at the address where the data section should be placed in RAM. _edata is a symbol at the address where the data section should end.

The other symbols created in the linker script (__bss_start__ and __bss_end__) are used as markers for the address range that needs to be zeroed. If we don't do this, our uninitialized global variables will have random values instead of the expected zeros.
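The same copy-and-zero logic can be tried on a PC with ordinary arrays standing in for flash and RAM. This is a sketch only: init_memory is a hypothetical helper that takes pointers as parameters instead of using the linker symbols.

```c
#include <stdint.h>

/* Copy the initialized data image from its load address to its run address,
 * then clear the zero-initialized area. Ranges are [start, end). */
void init_memory(const uint32_t *load,
                 uint32_t *data_start, uint32_t *data_end,
                 uint32_t *bss_start, uint32_t *bss_end)
{
    for (uint32_t *dst = data_start; dst < data_end; ) {
        *dst++ = *load++;
    }
    for (uint32_t *dst = bss_start; dst < bss_end; ) {
        *dst++ = 0u;
    }
}
```

On the target, the arguments would correspond to &_sfdata, &_sdata, &_edata, &__bss_start__ and &__bss_end__.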

Application

As we saw at line 19 of the startup.s file, after the init function returns we branch to the main function. Create the main.c file:

#define SIM_SCGC5 (*(volatile int *)0x40048038)
#define SIM_SCGC5_PORTB 10

#define PORTB_PCR21 (*(volatile int *)0x4004A054)
#define PORTB_PCR21_MUX 8

#define GPIOB_PDDR (*(volatile int *)0x400FF054)
#define PIN_N 21

int main()
{
    /* Enable clocks. */
    SIM_SCGC5 |= 1 << SIM_SCGC5_PORTB;
    /* Configure pin 21 as GPIO. */
    PORTB_PCR21 |= 1 << PORTB_PCR21_MUX;
    /* Configure GPIO pin 21 as output.
     * It will have a default output value set
     * to 0, so LED will light (negative logic).
     */
    GPIOB_PDDR |= 1 << PIN_N;

    while(1);

    return 0;
}

Here we actually light the LED. Instead of using includes from the SDK, we just defined the register addresses in place. Note that in this particular example the volatile keyword is not crucial. In general, however, you expect the compiler to always generate direct load/store instructions for those addresses instead of keeping the values in registers, because this memory can be changed by the hardware or by an exception handler.
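The register updates in main.c follow a read-modify-write pattern that can be tested on a PC by pointing the helpers at ordinary variables. The helper names below are my own. The 3-bit MUX field at bits 8-10 matches the PORT_PCR_MUX(0x07) mask seen in the SDK code; clearing the field before setting it is a slightly more defensive variant of what main.c does, not what the original code does.

```c
#include <stdint.h>

#define PORT_PCR_MUX_SHIFT 8u
#define PORT_PCR_MUX_MASK  (0x7u << PORT_PCR_MUX_SHIFT)

/* Set one bit in a memory-mapped register (read-modify-write). */
void reg_set_bit(volatile uint32_t *reg, unsigned bit)
{
    *reg |= (uint32_t)1u << bit;
}

/* Select a pin function: clear the 3-bit MUX field first, then write the
 * new value, so the result does not depend on what was there before. */
void pcr_set_mux(volatile uint32_t *pcr, uint32_t mux)
{
    *pcr = (*pcr & ~PORT_PCR_MUX_MASK) |
           ((mux << PORT_PCR_MUX_SHIFT) & PORT_PCR_MUX_MASK);
}
```

On the board, the same calls would take the addresses behind SIM_SCGC5 and PORTB_PCR21 instead of test variables.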

Makefile

So, we have almost everything done. We have the following files: k64f.ld, startup.s, startup.c and main.c. Now let's use k64f.ld as our linker script and compile startup.s, startup.c and main.c together. The K64F expects the output file to be in binary format. We'll create the ELF file and then translate it into a bin file using a tool called objcopy (I assume you have GCC ARM Embedded installed). Create the Makefile:

CC=arm-none-eabi-gcc
OBJCPY=arm-none-eabi-objcopy
CFLAGS=-Wall -Wextra -mthumb -mcpu=cortex-m4 -nostdlib -g

.PHONY: all clean

all:
	$(CC) startup.s startup.c main.c $(CFLAGS) -T k64f.ld -o simple.elf
	$(OBJCPY) -O binary simple.elf simple.bin

clean:
	rm -f simple.elf simple.bin

The -nostdlib option instructs the linker not to link the standard startup files and libraries. With the -T option we point to our custom linker script. Note that the recipe lines in the Makefile must start with a tab. We get two output files: simple.elf, which can be used for debugging, and simple.bin, which can be uploaded to the board through the standard OpenSDA interface. I encourage you to check for yourself how the generated simple.elf file looks inside by running the arm-none-eabi-objdump -D simple.elf command.

Summary

That's it. The minimal working GCC setup for the K64F consists of just a couple of small files. It's a good starting point for developing more complex projects, as well as a good exercise before analyzing large SDKs.

The project is available on bitbucket.