Tuesday 22 December 2015

ARM: bit fields under the hood

In this article I would like to share my observations about what happens under the hood of bit fields. I'll use the ARMv6-M and ARMv7-M architectures and the GCC ARM Embedded toolchain.

First, let's quickly recap the main data access instructions available on both architectures:
  • LDR - Loads a word from memory, and writes it to a register.
  • LDRH - Loads a halfword from memory, zero-extends it to form a 32-bit word, and writes it to a register.
  • LDRB - Loads a byte from memory, zero-extends it to form a 32-bit word, and writes it to a register.
  • STR - Stores a word from a register to memory.
  • STRH - Stores a halfword from a register to memory.
  • STRB - Stores a byte from a register to memory.
Other variants of these instructions exist, but for the purpose of this experiment let's stick to the basic ones listed above.

Note that ARMv6-M and ARMv7-M differ in how they handle unaligned accesses.

ARMv6-M:
"ARMv6-M always generates a fault when an unaligned access occurs."
ARMv7-M:
"The system architecture can choose one of two policies for alignment checking in ARMv7-M:
    • Support the unaligned access
    • Generate a fault when an unaligned access occurs.
The policy varies with the type of access. An implementation can be configured to force alignment faults for all unaligned accesses."
OK, so on ARMv6-M things are pretty simple:
  • Accesses using LDR/STR must be word aligned.
  • Accesses using LDRH/STRH must be halfword aligned.
  • Byte accesses use the LDRB and STRB instructions, which have no alignment requirement.
ARMv7-M, however, can be configured to support unaligned accesses in hardware. When that support is enabled, even an LDR/STR instruction will not generate an exception when it accesses an unaligned address. The drawback is a more complex bus access. See this article for more details (it refers to the ARM compiler, not GCC, but that's not a problem in this case). With that in mind, let's move on.
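
The alignment rules above can be sketched in plain C. This is only an illustration: the helper names is_aligned and read_u32_le are made up for this example, and the byte-wise reader assumes little-endian data, as on the targets discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero when addr is a multiple of size.
 * size must be a power of two (1, 2 or 4 for byte, halfword, word). */
int is_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}

/* Assembles a little-endian 32-bit word from four byte-sized reads.
 * Written this way, the source never asks for a word access at p,
 * so p may point to any address. */
uint32_t read_u32_le(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

With these helpers, is_aligned(0x10034, 4) holds while is_aligned(0x10031, 4) does not, which is exactly the difference between a word access that works on ARMv6-M and one that faults.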

Let's see a "normal" structure without any bit fields specified. Consider the following example:

struct {
    unsigned int a;
    unsigned int b;
    unsigned int c;
    unsigned int d;
    unsigned int e;
} data = {6, 3, 1, 6, 57672};

int _start()
{
    volatile unsigned int a = data.a;
    volatile unsigned int b = data.b;
    volatile unsigned int c = data.c;
    volatile unsigned int d = data.d;
    volatile unsigned int e = data.e;

    return 0;
}

Compile it (for now without optimizations):

arm-none-eabi-gcc -nostdlib -mthumb -O0 -march=armv6-m main.c -o test

And inspect:

mk@mk-VirtualBox:~/test/bitfields$ arm-none-eabi-objdump -D test 

test:     file format elf32-littlearm


Disassembly of section .text:

00008000 <_start>:
    8000: b580       push {r7, lr}
    8002: b086       sub sp, #24
    8004: af00       add r7, sp, #0
    8006: 4b0a       ldr r3, [pc, #40] ; (8030 <_start+0x30>)
    8008: 681b       ldr r3, [r3, #0]
    800a: 617b       str r3, [r7, #20]
    800c: 4b08       ldr r3, [pc, #32] ; (8030 <_start+0x30>)
    800e: 685b       ldr r3, [r3, #4]
    8010: 613b       str r3, [r7, #16]
    8012: 4b07       ldr r3, [pc, #28] ; (8030 <_start+0x30>)
    8014: 689b       ldr r3, [r3, #8]
    8016: 60fb       str r3, [r7, #12]
    8018: 4b05       ldr r3, [pc, #20] ; (8030 <_start+0x30>)
    801a: 68db       ldr r3, [r3, #12]
    801c: 60bb       str r3, [r7, #8]
    801e: 4b04       ldr r3, [pc, #16] ; (8030 <_start+0x30>)
    8020: 691b       ldr r3, [r3, #16]
    8022: 607b       str r3, [r7, #4]
    8024: 2300       movs r3, #0
    8026: 1c18       adds r0, r3, #0
    8028: 46bd       mov sp, r7
    802a: b006       add sp, #24
    802c: bd80       pop {r7, pc}
    802e: 46c0       nop   ; (mov r8, r8)
    8030: 00010034  andeq r0, r1, r4, lsr r0

Disassembly of section .data:

00010034 <__data_start>:
   10034: 00000006  
   10038: 00000003  
   1003c: 00000001  
   10040: 00000006  
   10044: 0000e148  

You can do the same for ARMv7-M (by passing the -march=armv7-m flag) to see minor differences in the generated assembly, but they are not important for this discussion.

We need to notice two things*:
  • The "data" structure occupies 20 bytes (5 words) in memory (addresses 10034 to 10044 in the .data section, lines 38-42 of the listing).
  • Fields are accessed using the ldr instruction (for example at 8008 or 800e, lines 13 and 16 of the listing).
* Note, I'm not discussing padding between fields in a structure (there is none here because all fields are 32 bits wide anyway).

So far so good. All fields in the structure are integers which are 4 bytes each on both architectures. The "data" variable is a global, so it starts on a word aligned address. The whole word can be read using ldr instruction.
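
The same layout can be checked at compile time. The sketch below assumes a 32-bit unsigned int (true for the ARM targets used here); the tag plain_data is a name invented for this example, since the original structure is anonymous.

```c
#include <assert.h>
#include <stddef.h>

/* Same members as the "data" structure above, given a tag so that
 * sizeof and offsetof can refer to it. */
struct plain_data {
    unsigned int a;
    unsigned int b;
    unsigned int c;
    unsigned int d;
    unsigned int e;
};

/* The example only makes sense with 4-byte unsigned int. */
static_assert(sizeof(unsigned int) == 4, "example assumes 32-bit unsigned int");

/* Five words, no padding: 20 bytes in total. */
static_assert(sizeof(struct plain_data) == 20, "expected 5 words");

/* The offsets match the immediates in the ldr instructions: #0, #4, #8, #12, #16. */
static_assert(offsetof(struct plain_data, a) == 0, "a at offset 0");
static_assert(offsetof(struct plain_data, b) == 4, "b at offset 4");
static_assert(offsetof(struct plain_data, e) == 16, "e at offset 16");
```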

Now, suppose the structure represents a 32 bit register and its fields "a", "b", "c", "d" and "e" are respectively 4, 4, 1, 7 and 16 bits long:


To implement such structure we can use bit fields:

struct 
{
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
} data = {6, 3, 1, 6, 57672};

int _start()
{
    volatile unsigned int a = data.a;
    volatile unsigned int b = data.b;
    volatile unsigned int c = data.c;
    volatile unsigned int d = data.d;
    volatile unsigned int e = data.e;

    return 0;
}

Without optimizations, GCC will now produce the following code:

mk@mk-VirtualBox:~/test/bitfields$ arm-none-eabi-objdump -D test 

test:     file format elf32-littlearm


Disassembly of section .text:

00008000 <_start>:
    8000: b580       push {r7, lr}
    8002: b086       sub sp, #24
    8004: af00       add r7, sp, #0
    8006: 4b10       ldr r3, [pc, #64] ; (8048 <_start+0x48>)
    8008: 781b       ldrb r3, [r3, #0]
    800a: 071b       lsls r3, r3, #28
    800c: 0f1b       lsrs r3, r3, #28
    800e: b2db       uxtb r3, r3
    8010: 617b       str r3, [r7, #20]
    8012: 4b0d       ldr r3, [pc, #52] ; (8048 <_start+0x48>)
    8014: 781b       ldrb r3, [r3, #0]
    8016: 061b       lsls r3, r3, #24
    8018: 0f1b       lsrs r3, r3, #28
    801a: b2db       uxtb r3, r3
    801c: 613b       str r3, [r7, #16]
    801e: 4b0a       ldr r3, [pc, #40] ; (8048 <_start+0x48>)
    8020: 785b       ldrb r3, [r3, #1]
    8022: 07db       lsls r3, r3, #31
    8024: 0fdb       lsrs r3, r3, #31
    8026: b2db       uxtb r3, r3
    8028: 60fb       str r3, [r7, #12]
    802a: 4b07       ldr r3, [pc, #28] ; (8048 <_start+0x48>)
    802c: 785b       ldrb r3, [r3, #1]
    802e: 061b       lsls r3, r3, #24
    8030: 0e5b       lsrs r3, r3, #25
    8032: b2db       uxtb r3, r3
    8034: 60bb       str r3, [r7, #8]
    8036: 4b04       ldr r3, [pc, #16] ; (8048 <_start+0x48>)
    8038: 885b       ldrh r3, [r3, #2]
    803a: 607b       str r3, [r7, #4]
    803c: 2300       movs r3, #0
    803e: 1c18       adds r0, r3, #0
    8040: 46bd       mov sp, r7
    8042: b006       add sp, #24
    8044: bd80       pop {r7, pc}
    8046: 46c0       nop   ; (mov r8, r8)
    8048: 0001004c  andeq r0, r1, ip, asr #32

Disassembly of section .data:

0001004c <__data_start>:
   1004c: e1480d36  
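
The single data word in the dump, e1480d36, can be reproduced by hand. The sketch below packs the five initializer values starting from the least significant bit, which is the order this toolchain appears to use on little-endian ARM (bit-field ordering is implementation-defined in C, so treat this as an observation, not a guarantee). The function name pack_fields is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Packs a:4, b:4, c:1, d:7, e:16 into one 32-bit word, LSB first. */
uint32_t pack_fields(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t e)
{
    /* Each value must fit in its field width. */
    assert(a < 16 && b < 16 && c < 2 && d < 128 && e < 65536);

    return a            /* bits  0..3  */
         | (b << 4)     /* bits  4..7  */
         | (c << 8)     /* bit   8     */
         | (d << 9)     /* bits  9..15 */
         | (e << 16);   /* bits 16..31 */
}
```

Calling pack_fields(6, 3, 1, 6, 57672) gives 0xe1480d36, the value stored at 1004c.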

So, what's happening? Observations:
  • To get the "bit parts", the compiler reads the smallest possible chunk of data (see for instance the ldrb at 8008, line 13), then shifts it left (lsls at 800a, line 14) and right (lsrs at 800c, line 15) to get rid of the unwanted bits.
  • The generated code also uses the uxtb instruction, "Unsigned Extend Byte" (it extracts an 8-bit value from a register, zero-extends it to 32 bits, and writes the result to the destination register).
  • Where possible, it uses instructions that read more than one byte (see the ldrh at 8038, line 37).
  • Because the sum of our bit field widths doesn't exceed the word size (4 + 4 + 1 + 7 + 16 <= 32), we use only 4 bytes of data (the single word at 1004c, line 50).
Following from the last observation: if we add at least one more field, we'll need a whole new word to store it:

struct 
{
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
    unsigned int f : 1;
} data = {6, 3, 1, 6, 57672, 1};

int _start()
{
    volatile unsigned int a = data.a;
    volatile unsigned int b = data.b;
    volatile unsigned int c = data.c;
    volatile unsigned int d = data.d;
    volatile unsigned int e = data.e;
    volatile unsigned int f = data.f;

    return 0;
}

With the additional one-bit field "f", a new word is allocated (at 1005c, line 30):

Disassembly of section .text:

00008000 <_start>:
    8000: b580       push {r7, lr}
    8002: b086       sub sp, #24
    8004: af00       add r7, sp, #0
    8006: 4b13       ldr r3, [pc, #76] ; (8054 <_start+0x54>)
    8008: 781b       ldrb r3, [r3, #0]
    800a: 071b       lsls r3, r3, #28
    800c: 0f1b       lsrs r3, r3, #28
(..)
    803c: 4b05       ldr r3, [pc, #20] ; (8054 <_start+0x54>)
    803e: 791b       ldrb r3, [r3, #4]
    8040: 07db       lsls r3, r3, #31
    8042: 0fdb       lsrs r3, r3, #31
    8044: b2db       uxtb r3, r3
    8046: 603b       str r3, [r7, #0]
    8048: 2300       movs r3, #0
    804a: 1c18       adds r0, r3, #0
    804c: 46bd       mov sp, r7
    804e: b006       add sp, #24
    8050: bd80       pop {r7, pc}
    8052: 46c0       nop   ; (mov r8, r8)
    8054: 00010058  andeq r0, r1, r8, asr r0

Disassembly of section .data:

00010058 <__data_start>:
   10058: e1480d36  
   1005c: 00000001  
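
The jump from one word to two can also be confirmed with sizeof. The struct tags below are invented for this example, and the expected sizes assume a 32-bit unsigned int and the layout GCC produces here. Other compilers or ABIs may lay bit fields out differently.

```c
#include <assert.h>

/* 4 + 4 + 1 + 7 + 16 = 32 bits: fits exactly into one unsigned int. */
struct five_fields {
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
};

/* One more bit does not fit, so a second unsigned int is started. */
struct six_fields {
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
    unsigned int f : 1;
};

static_assert(sizeof(struct five_fields) == 4, "expected one word");
static_assert(sizeof(struct six_fields) == 8, "expected two words");
```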

OK, so this is how it works. Just for reference, let's have a look at the generated assembly with optimizations enabled:


arm-none-eabi-gcc -nostdlib -mthumb -Os -march=armv6-m main.c -o test


mk@mk-VirtualBox:~/test/bitfields$ arm-none-eabi-objdump -D test 

test:     file format elf32-littlearm


Disassembly of section .text:

00008000 <_start>:
    8000: 4b0c       ldr r3, [pc, #48] ; (8034 <_start+0x34>)
    8002: b086       sub sp, #24
    8004: 781a       ldrb r2, [r3, #0]
    8006: 2000       movs r0, #0
    8008: 0711       lsls r1, r2, #28
    800a: 0f09       lsrs r1, r1, #28
    800c: 0912       lsrs r2, r2, #4
    800e: 9100       str r1, [sp, #0]
    8010: 9201       str r2, [sp, #4]
    8012: 785a       ldrb r2, [r3, #1]
    8014: 07d1       lsls r1, r2, #31
    8016: 0fc9       lsrs r1, r1, #31
    8018: b2c9       uxtb r1, r1
    801a: 0852       lsrs r2, r2, #1
    801c: 9102       str r1, [sp, #8]
    801e: 9203       str r2, [sp, #12]
    8020: 885a       ldrh r2, [r3, #2]
    8022: 791b       ldrb r3, [r3, #4]
    8024: 9204       str r2, [sp, #16]
    8026: 07db       lsls r3, r3, #31
    8028: 0fdb       lsrs r3, r3, #31
    802a: b2db       uxtb r3, r3
    802c: 9305       str r3, [sp, #20]
    802e: b006       add sp, #24
    8030: 4770       bx lr
    8032: 46c0       nop   ; (mov r8, r8)
    8034: 00010038  andeq r0, r1, r8, lsr r0

Disassembly of section .data:

00010038 <__data_start>:
   10038: e1480d36  
   1003c: 00000001  

Just out of curiosity, let's also compile for ARMv7-M:


arm-none-eabi-gcc -nostdlib -mthumb -Os -march=armv7-m main.c -o test


mk@mk-VirtualBox:~/test/bitfields$ arm-none-eabi-objdump -D test 

test:     file format elf32-littlearm


Disassembly of section .text:

00008000 <_start>:
    8000: 4b0b       ldr r3, [pc, #44] ; (8030 <_start+0x30>)
    8002: b086       sub sp, #24
    8004: 781a       ldrb r2, [r3, #0]
    8006: 2000       movs r0, #0
    8008: f002 010f  and.w r1, r2, #15
    800c: 0912       lsrs r2, r2, #4
    800e: 9100       str r1, [sp, #0]
    8010: 9201       str r2, [sp, #4]
    8012: 785a       ldrb r2, [r3, #1]
    8014: f002 0101  and.w r1, r2, #1
    8018: 0852       lsrs r2, r2, #1
    801a: 9102       str r1, [sp, #8]
    801c: 9203       str r2, [sp, #12]
    801e: 885a       ldrh r2, [r3, #2]
    8020: 791b       ldrb r3, [r3, #4]
    8022: 9204       str r2, [sp, #16]
    8024: f003 0301  and.w r3, r3, #1
    8028: 9305       str r3, [sp, #20]
    802a: b006       add sp, #24
    802c: 4770       bx lr
    802e: bf00       nop
    8030: 00010034  andeq r0, r1, r4, lsr r0

Disassembly of section .data:

00010034 <__data_start>:
   10034: e1480d36  
   10038: 00000001  

No major differences between them. In both cases we see that after optimization there are fewer actual data read instructions (the ldrb and ldrh loads), but, of course, the "bit shuffle" is still there: ARMv6-M does it with shift operations, while ARMv7-M replaces some of the shift pairs with and.w masks.
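
The two extraction styles seen in the listings compute the same thing. Here is a minimal C sketch of both. The function names are made up, and lsb and width describe the position of the field within a 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Shift pair, as in the ARMv6-M code (lsls followed by lsrs):
 * push the field to the top of the register, then back down to bit 0. */
uint32_t extract_shift(uint32_t word, unsigned lsb, unsigned width)
{
    assert(width >= 1 && width <= 32 && lsb + width <= 32);
    return (word << (32 - lsb - width)) >> (32 - width);
}

/* Shift and mask, as in the ARMv7-M code (lsrs and and.w):
 * move the field down to bit 0, then clear everything above it. */
uint32_t extract_mask(uint32_t word, unsigned lsb, unsigned width)
{
    assert(width >= 1 && width <= 32 && lsb + width <= 32);
    uint32_t mask = (width == 32) ? 0xffffffffu : ((1u << width) - 1u);
    return (word >> lsb) & mask;
}
```

For field "a" (lsb 0, width 4) the shift pair becomes a left shift by 28 and a right shift by 28, which matches the lsls/lsrs pair in the ARMv6-M listing.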

Now, our example "data" variable was aligned by the compiler. But in real life, a variable that stores bit fields can be wrongly cast or simply placed at an unaligned address by mistake. Consider the following dummy code:

struct data_t
{
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
    unsigned int f : 1;
} data = {6, 3, 1, 6, 57672, 1};

int _start()
{

    volatile struct data_t *some_mem = (struct data_t *)0x10031;
    volatile unsigned int s = some_mem->a;

    return 0;
}

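We deliberately pointed to an unaligned address. Although the field is only 4 bits long, the compiler will not use the ldrb instruction, because it assumes the beginning of the structure is aligned. (The structure is also accessed through a volatile pointer here, in which case GCC on ARM most likely reads the bit field through its whole declared type, an unsigned int.) Instead, it uses the ldr instruction (at 8004, line 11), which causes a hard fault exception due to the unaligned access: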

mk@mk-VirtualBox:~/test/bitfields$ arm-none-eabi-objdump -D test 

test:     file format elf32-littlearm


Disassembly of section .text:

00008000 <_start>:
    8000: 4b04       ldr r3, [pc, #16] ; (8014 <_start+0x14>)
    8002: b082       sub sp, #8
    8004: 681b       ldr r3, [r3, #0]
    8006: 2000       movs r0, #0
    8008: 071b       lsls r3, r3, #28
    800a: 0f1b       lsrs r3, r3, #28
    800c: 9301       str r3, [sp, #4]
    800e: b002       add sp, #8
    8010: 4770       bx lr
    8012: 46c0       nop   ; (mov r8, r8)
    8014: 00010031  andeq r0, r1, r1, lsr r0

Disassembly of section .data:

00010018 <__data_start>:
   10018: e1480d36  
   1001c: 00000001  
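
One possible way to avoid the faulting word load is to copy the bytes into a properly aligned object first and only then touch the bit fields. The sketch below is an assumption-laden illustration, not what the original code did. The function load_data is a made-up name, memcpy needs a C library or an equivalent byte-copy loop (the examples here are built with -nostdlib), and this approach is not suitable for volatile hardware registers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct data_t {
    unsigned int a : 4;
    unsigned int b : 4;
    unsigned int c : 1;
    unsigned int d : 7;
    unsigned int e : 16;
    unsigned int f : 1;
};

/* Copies sizeof(struct data_t) bytes from raw, which may be unaligned,
 * into a local object that the compiler aligns correctly. All bit-field
 * accesses then happen on the aligned copy. */
struct data_t load_data(const void *raw)
{
    struct data_t copy;

    assert(raw != NULL);
    memcpy(&copy, raw, sizeof copy);
    return copy;
}
```

The caller then reads load_data(ptr).a instead of dereferencing a cast pointer, at the cost of copying the whole structure.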

Conclusions? It's good to be aware of data access alignment issues. I started looking into this problem while investigating a bug that originated from a wrong cast of a structure that used bit fields (on a Cortex-M0). Although a specific bit field may look as if it could be accessed with a byte-sized instruction, that's not always the case. Sometimes nothing bad happens because the accessed field happens to be aligned by accident. More likely, however, sooner or later you'll get a hard fault exception because of an unaligned access. It's also worth noting that on ARMv7-M (as opposed to ARMv6-M) the unaligned access can be hidden from the programmer: it causes more bus accesses but no hard fault.

Friday 4 December 2015

LwIP IPv6 on K64F

A couple of weeks ago I brought up IPv6 connectivity over Ethernet on the Freedom K64F board. I used FreeRTOS combined with LwIP as the main components. Generally, everything went smoothly apart from one thing: because IPv6 uses multicast during Neighbor Discovery instead of the broadcast used by the old ARP, the Ethernet controller needs to accept a specific multicast MAC address. By default (if not in promiscuous mode) all frames with destination MAC addresses that are not on the "whitelist" are dropped by the hardware network controller. We need to make an exception for the multicast MAC address needed by the ICMPv6 protocol. Otherwise, even pings will not work, because devices cannot resolve each other's MAC addresses. The whole "hey! who has >ipv6< address?" exchange will not work without it.

As a quick solution I fixed it in the driver layer:


enet_status_t ENET_DRV_Init(enet_dev_if_t * enetIfPtr, const enet_user_config_t* userConfig)
 {   
     enet_status_t result;
     uint32_t  frequency; 
     ENET_Type * base;
     uint32_t statusMask = 0;
     enet_cur_status_t curStatus;
     const enet_mac_config_t* macCfgPtr = userConfig->macCfgPtr;
     const enet_buff_config_t* buffCfgPtr = userConfig->buffCfgPtr;
+    uint32_t hash = 0;
+    uint8_t ipv6_multicast[6] = {0};
+    ipv6_multicast[0] = 0x33;
+    ipv6_multicast[1] = 0x33;
+    ipv6_multicast[2] = 0xff;
+    ipv6_multicast[3] = macCfgPtr->macAddr[3];
+    ipv6_multicast[4] = macCfgPtr->macAddr[4];
+    ipv6_multicast[5] = macCfgPtr->macAddr[5];
     
     enet_bd_config bdConfig = {0};
     /* Check the input parameters*/
     if ((!enetIfPtr) || (!macCfgPtr) || (!buffCfgPtr))
     {
         return kStatus_ENET_InvalidInput;
     }
 #if !ENET_RECEIVE_ALL_INTERRUPT
     /* POLL mode needs the extended buffer for data buffer update*/
     if((!buffCfgPtr->extRxBuffQue) || (!buffCfgPtr->extRxBuffNum))
     {
         return kStatus_ENET_InvalidInput;
     }
 #endif
     base = g_enetBase[enetIfPtr->deviceNumber];
 
     /* Store the global ENET structure for ISR input parameter*/
     enetIfHandle[enetIfPtr->deviceNumber] = enetIfPtr;
 
     /* Turn on ENET module clock gate */
     CLOCK_SYS_EnableEnetClock( 0U);
     frequency = CLOCK_SYS_GetSystemClockFreq();
     bdConfig.rxBds = buffCfgPtr->rxBdPtrAlign;
     bdConfig.rxBuffer = buffCfgPtr->rxBufferAlign;
     bdConfig.rxBdNumber = buffCfgPtr->rxBdNumber;
     bdConfig.rxBuffSizeAlign = buffCfgPtr->rxBuffSizeAlign;
     bdConfig.txBds = buffCfgPtr->txBdPtrAlign;
     bdConfig.txBuffer = buffCfgPtr->txBufferAlign;
     bdConfig.txBdNumber = buffCfgPtr->txBdNumber;
     bdConfig.txBuffSizeAlign = buffCfgPtr->txBuffSizeAlign;
     /* Init ENET MAC to reset status*/
     ENET_HAL_Init(base);
     /* Configure MAC controller*/
     ENET_HAL_Config(base, macCfgPtr, frequency, &bdConfig);
+    /* Add IPv6 multicast */
+    ENET_DRV_AddMulticastGroup(enetIfPtr->deviceNumber, ipv6_multicast, &hash);
(..)

There were a couple more minor fixes needed as well. You can see the whole project here.
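
For reference, the multicast MAC address built in the patch follows the usual IPv6-over-Ethernet mapping: Neighbor Discovery sends solicitations to the solicited-node group of the target address, and that group maps to 33:33:ff followed by the last three bytes of the IPv6 address. The sketch below derives it from the IPv6 address itself. The function name solicited_node_mac is made up for this example and is not part of LwIP or the Freescale driver. The patch above takes the last three bytes from the MAC address instead, which gives the same result when the IPv6 address is derived from the MAC (EUI-64).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fills mac with the Ethernet multicast address of the solicited-node
 * group for the given 16-byte IPv6 unicast address (network byte order). */
void solicited_node_mac(const uint8_t ipv6[16], uint8_t mac[6])
{
    assert(ipv6 != NULL && mac != NULL);

    mac[0] = 0x33;          /* fixed prefix for IPv6 multicast */
    mac[1] = 0x33;
    mac[2] = 0xff;          /* solicited-node groups end in ffXX:XXXX */
    mac[3] = ipv6[13];      /* low 24 bits of the unicast address */
    mac[4] = ipv6[14];
    mac[5] = ipv6[15];
}
```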