Thursday 26 November 2015

Android: memory management insights, part I

TLDR:

Article is a series of Linux memory management experiments. Author realizes that allocating memory on stack is equivalent to just moving Stack Pointer and that acquiring dynamic memory means using brk() or mmap() system calls. Also, the overcommit mechanism has been presented.


Introduction

This article is about how memory allocation works on ARM based Android OS. Android uses modified Linux Kernel, however some of the patches specific to Android are already in the mainline. The list of Kernel features added by Android can be found here. We see there components such as Binder (interprocess communication mechanism and the remote method invocation system), Ashmem (Android shared memory), Logger (kernel part of logcat), Wakelocks (used for power management), Out of Memory handling (lowmemorykiller).
From memory management perspective, Android uses standard Linux Kernel mechanisms with addition of custom out-of-memory handling. One more thing worth noticing is Android-based devices usually don't have swap partition. This means all standard Kernel features related to swapping memory into non-volatile storage are not in use. The exception is when zRAM features are enabled. zRAM is usually enabled on latest Android devices with less than 1GB of RAM. Currently zRAM is not in scope of this article.
From userspace point of view, Android uses Bionic as standard C library. Up to Kitkat dlmalloc was used as a default malloc() implementation. From Lollipop it is the jemalloc. On top of the native part there is a Java virtual machine (although not officially “JVM “due to license reasons) implementation called ART (until L it was Dalvik) with its own managed heap.

The ultimate purpose of this article is to track by examples the memory allocation process up from Dalvik/ART through malloc() and Linux Kernel down to MMU of the processor. Insight of what's happening under the hood should help with understanding memory usage, memory leaks and memory corruptions.

There is a lot of abstraction layers to traverse but lets start in the middle of the road which is a native userspace memory allocation. This means we'll firstly look how a native application can acquire some memory without help of standard C library or any specific malloc() implementation.

Virtual memory

We'll start with creating a reference template program and we'll gradually add more features to see how it affects OS internals. We'll use GCC ARM Embedded toolchain and add the '''-nostdlib''' flag to linker:

$>arm-eabi-gcc -nostdlib -O0 test_app.c -o test_app

Here is our first program:

1
2
3
4
void main()
{
    while(1);
}

Using objdump we can take a look into generated asm:

$>arm-eabi-objdump -d test_app

test_app:     file format elf32-littlearm

Disassembly of section .text:

00008000 <main>:
    8000: e52db004  push {fp}  ; (str fp, [sp, #-4]!)
    8004: e28db000  add fp, sp, #0
    8008: eafffffe  b 8008 <main+0x8>

Beside expected branch instruction (the infinite loop) we can see two additional instructions related to the maintenance of the frame pointer. We'll discuss FP later while explaining how the stack works.

For now, we don't have any variables, so after running our program not so much about memory allocation will happen despite loading the actual code into the memory. We'll not discuss loading ELF file into memory at this point.
Note, we're not using startup files from stdlib, so we have a starting point of our program defaulted to 0x00008000 which is the beginning of .text section. We're informed about this during linking;
“arm-eabi/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000” 
At the beginning of .text section our main() function resides. This is why we don't need to explicitly set the _start symbol for now. If you want to get rid of this warning, just rename main() to _start().

Let's run the program and inspect the actual memory layout that is created by Kernel for our process. At this point, lets quickly recap what actually is a virtual address space of the process. For detailed information you can refer to the Linux Kernel Development by Robert Love or to Anatomy of a program in memory (this is actually a highly recommended reading). In short words: every time the new process is spawned, Kernel “creates” for it a kind-of sandbox of virtual addresses range in which the process operates (performs reads, writes or executes the code). “Creates” means allocates and initialize its internal structures describing the process (for example task_struct, vm_area_struct, mm_struct). On 32bit architecture, each process could have an address space ranging from 0x00000000 to 0xFFFFFFFF. However for performance reasons Kernel needs to has its own mapping available no matter what process is scheduled on. It depends on Kernel configuration how much of the virtual space Kernel needs. Standard split is “3:1” which means that from 4GB address space of 32bit architecture, 3GB is dedicated for userspace program and 1GB is always mapped for Kernel use. That leaves us with addresses from 0x00000000 to 0xbfffffff available for userspace. Anything above cannot be accessed by userspace program. There are exceptions however of this restriction: VDSO and/or ARM Kernel User Helpers which are special mappings that userspace can use for performance purposes (mostly because of userspace implementation of system calls like gettimeofday() or clock_gettime()).
The virtual address space available for user space process (so up to 0xbfffffff) is logically divided into different sections with different permissions. Example section can be a read-only, executable .text or read-write .heap. Because actual data is stored in the physical memory at some point we of course need to do a transition from the virtual address to the physical address. Transition is done with use of structures called page tables.  The page tables are architecture-specific and their location must be provided to MMU during initialization of the system. Page tables are also cached in a TLB of the MMU to speedup access to the given mapping. Every time the process is scheduled by Kernel, the whole address space for the process is switched, which means that processor needs to use a different page table while accessing memory. Later we'll explain more about access to the physical memory. For now, we should assume that each virtual address '''can''' (but not necessarily for all the time) refer to the address in the underlying physical memory and the transition is handled transparently by the OS.

Lets run our program and look into the example virtual space. We can get needed information from /proc/<pid>/maps and for more detailed view from /proc/<pid>/smaps. Note, all commands are performed on ARM based Android device, not on a host machine.

Example of /proc/<pid>/maps:

00008000-00009000 r-xp 00008000 b3:1a 24439      /data/test_app 
b6f98000-b6f99000 r-xp 00000000 00:00 0          [sigpage] 
be7eb000-be80c000 rwxp 00000000 00:00 0          [stack] 
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

Corresponding /proc/<pid>/smaps:

00008000-00009000 r-xp 00008000 b3:1a 24439      /data/test_app 
Size:                  4 kB 
Rss:                   4 kB 
Pss:                   4 kB 
Shared_Clean:          0 kB 
Shared_Dirty:          0 kB 
Private_Clean:         4 kB 
Private_Dirty:         0 kB 
Referenced:            4 kB 
Anonymous:             0 kB 
AnonHugePages:         0 kB 
Swap:                  0 kB 
KernelPageSize:        4 kB 
MMUPageSize:           4 kB 
Locked:                0 kB 
b6f98000-b6f99000 r-xp 00000000 00:00 0          [sigpage] 
Size:                  4 kB 
Rss:                   0 kB 
Pss:                   0 kB 
Shared_Clean:          0 kB 
Shared_Dirty:          0 kB 
Private_Clean:         0 kB 
Private_Dirty:         0 kB 
Referenced:            0 kB 
Anonymous:             0 kB 
AnonHugePages:         0 kB 
Swap:                  0 kB 
KernelPageSize:        4 kB 
MMUPageSize:           4 kB 
Locked:                0 kB 
be7eb000-be80c000 rwxp 00000000 00:00 0          [stack] 
Size:                136 kB 
Rss:                   4 kB 
Pss:                   4 kB 
Shared_Clean:          0 kB 
Shared_Dirty:          0 kB 
Private_Clean:         0 kB 
Private_Dirty:         4 kB 
Referenced:            4 kB 
Anonymous:             4 kB 
AnonHugePages:         0 kB 
Swap:                  0 kB 
KernelPageSize:        4 kB 
MMUPageSize:           4 kB 
Locked:                0 kB 
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors] 
Size:                  4 kB 
Rss:                   0 kB 
Pss:                   0 kB 
Shared_Clean:          0 kB 
Shared_Dirty:          0 kB 
Private_Clean:         0 kB 
Private_Dirty:         0 kB 
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB

Above output will be our reference memory layout. After playing around with the memory allocation we'll see how additional sections appear, how and when the memory usage is changing.  This is our starting point. Let's explain then what we see so far.

After running the program we see in /proc/maps 4 virtual memory areas:
  • data/test_app : it is our .text section (set of processor instructions)
  • [sigpage]: a special mapping for signal handling.
  • [stack] : virtual memory area reserved for the program stack. Each local variable will use this space as a memory, some arguments and registers (especially LR) will be also stored here. It grows downward (so the SP is firstly around be80c000 in our example) and its size can be increased (for userspace processes) by Kernel up to RLIMIT_STACK.
  • [vectors]: Processor exception table is mapped here. Also it's used by the special mechanism called “ARM Kernel User Helpers”. You can refer to Kernel documentation for more details about it.
We don't have global variables, memory-mapped files or dynamic allocation on the heap so far.

Next, we've inspected smaps. Example output looks like this (VM area of the process' stack area):

be7eb000-be80c000 rwxp 00000000 00:00 0          [stack]
Size:                136 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB

Now, we can go through each field and explain what it's telling us.
  • Size – It's a virtual memory size of the area. In above example you can subtract 0xbe7eb000 from 0xbe80c000 to get this value. It is also called “vss” - virtual set size . It's a total virtual space accessible by this area and it is '''not''' related to RAM usage at all (or nearly at all). In worst case (when we physically allocate it fully), the RAM usage will be equal to this value, so it's an upper limit. However, this value can change at runtime, for example when the stack grows or we add memory mapped files.  We'll see later how to allocate this Virtual memory and why Vss is not good for determining memory usage.
  • Rss – Resident set size, it's the total memory actually held in RAM. It doesn't distinguish between shared and private memory. This means, that not only this particular process is using memory reported by RSS but also other processes can use part of this memory.
  • Pss – Proportional set size. Quoting from lwn.net: “The "proportional set size" (PSS) of a process is the count of pages it has in memory, where each page is divided by the number of processes sharing it. So if a process has 1000 pages all to itself, and 1000 shared with one other process, its PSS will be 1500. “
At this point, we should mention that from the Kernel memory management perspective, the smallest managed unit is called page. In most Android platforms (32bit ARM) the page size is 4KB. Userspace program cannot request from OS less than page_size of the memory.

Before explaining next fields we need to know about kernel page cache mechanism. If some data is read from a non-volatile storage it is cached (buffered) in the RAM, so each next read is done from RAM not from the non-volatile storage. This is of course possible until we modify that data. If the underlying page is not modified, then it has flag “clean”. If the page has been written, it has flag “dirty”. When the system (or user) performs sync() operation, dirty pages from caches are written physically into the storage.
Note, Kernel holds information about each physical page of the system in the global mem_map variable, which is an array of structures called page. You can take a look on declaration of struct page at kernel/include/linux/mm_types.h.

Getting back to the output of smaps:
  • Shared_Clean– How many shared pages are marked as “clean” (not modified). "Shared" means the physical page can be mapped by more than one process. Example can be a code of shared library.
  • Shared_Dirty – The same as above, but page has been modified.
  • Private_Clean – How many private pages are marked as “clean”. Private page is accessible only by given process.
  • Private_Drity – The same as above, but page has been modified. Note: page doesn't need to be a part of cache mechanism to be marked as dirty. All pages modified in RAM become “dirty” no matter if it's a memory-mapped file or anonymous page. In case of anonymous pages they can be “clean” only if they were no modified at all (so for example they are marked as read-only).
  • Referenced – Number of pages that have been referenced since last LRU list enqueue/requeue. This field is related to Page Frame Reclamation mechanism, which in Android devices is not important because usually there is no swap.
  • Anonymous – How many pages are marked as “anonymous”, which means there is no underlying file to which given page is mapped.
  • AnonHugePages – This field is related to Huge Pages mechanism. It's not valid for most Android devices.
  • Swap – as mentioned before, also not valid for most Android devices.
  • KernelPageSize – Size of the page managed by Kernel. It's a constant value, usually 4KB.
  • MMUPageSize – Page size on which operates MMU.
  • Locked – Number of pages that are exclusively locked, so they cannot be swapped.  Pages can be locked and unlocked by mlock() and munlock() calls.


Global variables

We'll now add couple of different global variables and see in what sections they will be stored:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
static int a;
int b;
static int c = 12345;
int d = 3333;

static char tmp[24] = "TEST";

void main()
{
    static int e;
    static int f = 666;

    while(1);
}

This is how new symbols appear in ELF:

arm-eabi-nm test_app
00010030 b a   <--
00010038 B b   <--
0001003c B __bss_end__
0001003c B _bss_end__
00010030 B __bss_start
00010030 B __bss_start__
0001000c d c  <--
00010010 D d  <--
0001000c D __data_start
00010034 b e.4064 <--
00010030 D _edata
0001003c B _end
0001003c B __end__
0001002c d f.4065 <--
00008000 T main
00080000 N _stack
         U _start
00010014 d tmp <--

“nm” shows offset of the symbol in ELF file, type of the symbol and name of the symbol. If symbol type is uppercase then symbol is global (external). If it's lowercase then it's local (not exported) symbol. We can see that static variables are local ones and not initialized global variables are stored in .bss section.  After loading and running the program we can observe following virtual memory map:

oot@D6503:/proc/8189 # cat maps
00008000-00009000 r-xp 00008000 b3:19 210        /data/test_app
00010000-00011000 rwxp 00008000 b3:19 210        /data/test_app <-- new one
b6f40000-b6f41000 r-xp 00000000 00:00 0          [sigpage]
be887000-be8a8000 rwxp 00000000 00:00 0          [stack]
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

We see additional section in which our variables are placed. Here are details of this mapping:

00010000-00011000 rwxp 00008000 b3:19 210        /data/test_app
Size:                  4 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB


Stack

Firstly, we need to quickly describe so-called frame pointer. The stack pointer is saved into frame pointer each time the function is called, so frame pointer can be used as a reference for accessing local variables. The compiler however can generate code without use of frame pointer (-fomit-frame-pointer gcc option). If frame pointer is not used, then compiler needs to track the stack in other way. It depends on the architecture if it can be done other way or not (or how costly it is). In ARM mode the R11 is used as FP and in Thumb it's a R7. Sometimes (refer to gcc manual for more details) there is no possibility to omit the FP at all. In ARM however it's possible to not use a frame pointer (so the R11 can be utilized for other general purposes). FP is also used by debuggers and postmortem-analysis tools, so on some architectures disabling FP will make debugging impossible.

Here is an example program to show how stack works on ARM. Since we have more functions, we created symbol _start instead of main() to tell compiler where our entry point is. Otherwise it would start execution from the top-most function (in our example foo):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
static int foo(int a, int b, int c)
{
    int d = a + b + c;
    int e = d * 2;

    return d;
}

static int bar(int a, int b, int c, int d, int e, int f, int g)
{
    int h = a + b + c + d + e + f + g;
    int i;

    i = foo(a, h, g);

    return i;
}

static void wrapper()
{
    int a = 4;
    int buf[24];

    bar(a, 2, 3, 4, 5, 6, 7);

    return;
}

void  _start()
{
    wrapper();  
    while(1);
}

To get general idea how stack works, read firstly Journey to the stack.
To run our test program through GDB you should issue following commands (assuming test_app is pushed into /data/):

on target:
gdbserver :5039 /data/test_app
on host:
adb forward tcp:5039 tcp:5039
gdbclient test_app
nice to set in GDB:
display/i $pc

you can then step instruction-by-instruction with si command

After running our program through GDB we can see in /proc/<pid>/maps our stack area starting (for example, because this address is randomized) at 0xbe9b2000 and ends at 0xbe9d3000:

be9b2000 - be9d3000 rwxp 00000000 00:00 0          [stack]

Type “info registers” in GDB to see where the SP is just after running the program:

(gdb) info registers
r0             0x0 0
r1             0xbe9d2a17               ← points to argv[0]
r2             0xbe9d2a22               ← points to argv[1] (if present)
r3             0x0 0
r4             0x0 0
r5             0x0 0
r6             0x0 0
r7             0x0 0
r8             0x0 0
r9             0x0 0
r10            0x0 0
r11            0x0 0
r12            0x0 0
sp             0xbe9d28f0               ← points to argc
lr             0x0 0               ← link register
pc             0x8114 0x8114 <_start> ← program counter
cpsr           0x10 16              ← Program status register

As we see, SP is not placed directly at the top of the .stack area. This is because at the beginning of  stack (don't forget it grows downwards) Kernel places environment variables and arguments passed to the program. Example:

(gdb) x/64sb 0xbe9d2a17
0xbe9d2a17: "test_stack"
0xbe9d2a22: "testarg1"
0xbe9d2a2b: "testarg2"
0xbe9d2a34: "_=/system/bin/gdbserver"
0xbe9d2a4c: "PATH=/sbin:/vendor/bin:/system/sbin:/system/bin:/system/xbin"
0xbe9d2a89: "LOOP_MOUNTPOINT=/mnt/obb"
0xbe9d2aa2: "ANDROID_ROOT=/system"
0xbe9d2ab7: "SHELL=/system/bin/sh"
0xbe9d2acc: "ANDROID_ASSETS=/system/app"
0xbe9d2ae7: "TERM=vt100"
0xbe9d2af2: "ANDROID_PROPERTY_WORKSPACE=10,0"
0xbe9d2b12: "ANDROID_BOOTLOGO=1"
0xbe9d2b25: "HOSTNAME=D6503"
0xbe9d2b34: "ANDROID_DATA=/data"
0xbe9d2b47: "BOOTCLASSPATH=/system/framework/core-libart.jar:/system/framework/conscrypt.jar:/system/framework/okhttp.jar:/system/framework/core-junit.jar:/system/framework/bouncycastle.jar:/system/framework/ext.j"...
0xbe9d2c0f: "ar:/system/framework/framework.jar:/system/framework/telephony-common.jar:/system/framework/voip-common.jar:/system/framework/ims-common.jar:/system/framework/mms-common.jar:/system/framework/android."...
0xbe9d2cd7: "policy.jar:/system/framework/apache-xml.jar:/system/framework/qcmediaplayer.jar:/system/framework/WfdCommon.jar:/system/framework/oem-services.jar:/system/framework/org.codeaurora.Performance.jar:/sys"...
0xbe9d2d9f: "tem/framework/vcard.jar:/system/framework/tcmiface.jar:/system/framework/com.xxxyyy.uxp.jar"
0xbe9d2e01: "EMULATED_STORAGE_SOURCE=/mnt/shell/emulated"
0xbe9d2e2d: "ANDROID_SOCKET_adbd=11"
0xbe9d2e44: "EMULATED_STORAGE_TARGET=/storage/emulated"
0xbe9d2e6e: "ANDROID_STORAGE=/storage"
0xbe9d2e87: "MKSH=/system/bin/sh"
0xbe9d2e9b: "EXTERNAL_STORAGE=/storage/emulated/legacy"
0xbe9d2ec5: "LD_PRELOAD=libsigchain.so:libNimsWrap.so"
0xbe9d2eee: "EXTERNAL_STORAGE_USB=/storage/usbdisk"
0xbe9d2f14: "RANDOM=5811"
0xbe9d2f20: "ASEC_MOUNTPOINT=/mnt/asec"
0xbe9d2f3a: "SECONDARY_STORAGE=/storage/sdcard1"
0xbe9d2f5d: "USER=root"
0xbe9d2f67: "HOME=/data"
0xbe9d2f72: "SYSTEMSERVERCLASSPATH=/system/framework/services.jar:/system/framework/ethernet-service.jar:/system/framework/wifi-service.jar"
0xbe9d2ff1: "test_stack"
0xbe9d2ffc: ""

Quick note about CPSR: it's a Current Program Status Register and it contains condition code flags, interrupt disable bits, current processor mode, and other status and control information. Refer to ARM ARM for more details. In our example CPSR has value “16” which means “processor mode is >>User<<”.

Lets proceed with our program and observe SP, FP and the stack content. We'll also look into ARM procedure call standard a little (how parameters are passed and where the return value is stored).

=> 0x00008114 <+0>: push {r11, lr}
   0x00008118 <+4>: add r11, sp, #4
   0x0000811c <+8>: bl 0x80c8 <wrapper>

(note, R11 is a FP)

This is the simplest example – function without arguments or return value. In such case the Link Register and previous Frame Pointer are firstly written to the stack and the FP is then set to SP plus 4.

Here's how the memory around SP looks before issuing any instruction of our program:


After pushing two values SP is automatically set to 8 bytes (two words) lower:


Next instruction:

add r11, sp, #4

writes 0xbe9d28ec to r11, so it points to the beginning of the stack frame (usually where LR is stored - for now it's zero anyway).

Finally we're doing Branch and Link instruction to the address where our function “wrapper” is located:

bl 0x80c8 <wrapper>

This instruction modifies PC and LR:

lr             0x8120 33056
pc             0x80c8 0x80c8 <wrapper>

LR is pointing to the next instruction after the Branch and Link in _start function and the PC points to the first instruction in wrapper() function. Stack in this example is not touched at this point.
On the prologue of the wrapper() function we see the same two instruction as before: push FP and LR to the stack and update FP register:

=> 0x000080c8 <+0>: push {r11, lr}
   0x000080cc <+4>: add r11, sp, #4

At this point we have registers as follows:

r11            0xbe9d28e4
sp             0xbe9d28e0
lr             0x8120
pc             0x80d0

And stack around SP:


If you get back to source code of function “wrapper”, you'll see it creates two local variables: integer “a” and the array of 24 integers. It also calls function “bar” which takes more than 4 arguments. In ARM, first 4 arguments are stored in registers before calling the function. If you need more arguments, they will be stored on the stack. All in all we need 24 words for the array + 1 word for the integer + 3 words for additional arguments that we want to pass to the next function. It gives us 112 bytes of memory that we want to use on the stack. This memory will be acquired by simple moving the Stack Pointer:

0x000080d0 <+8>: sub sp, sp, #120

Hurray. We've allocated memory. It's actually 120 bytes instead of 112 because of alignments required by processor.

Because FP was stored at the beginning of the function, compiler can refer to it while finding addresses on the stack. Our integer “a” should have value “4” and this is achieved with following instructions:

 0x000080d4 <+12>: mov r3, #4
 0x000080d8 <+16>: str r3, [r11, #-8]

Here's the stack after those instructions:


Because “allocating” memory for our buf[24] is just moving the Stack Pointer and because “deleting” local variables is also just moving Stack Pointer back, you should always initialize local variables. In our example, buf[24] gets a fresh, not used before stack so it's zeroed. In general case there will be leftovers from previous stack operations so the content of the uninitialized buf[24] should be concerned as garbage.

Now we finished with two local variables (although buf[24] is not used anywhere). Next, compiler needs to prepare arguments for calling the “bar” function:

0x000080dc <+20>: mov r3, #5
0x000080e0 <+24>: str r3, [sp]
0x000080e4 <+28>: mov r3, #6
0x000080e8 <+32>: str r3, [sp, #4]
0x000080ec <+36>: mov r3, #7
0x000080f0 <+40>: str r3, [sp, #8]
0x000080f4 <+44>: ldr r0, [r11, #-8]

Those instruction will save “5”, “6” and “7” on the stack, just before Stack Pointer:


Next step is to prepare arguments in registers r0-r3:

0x000080f4 <+44>: ldr r0, [r11, #-8]
0x000080f8 <+48>: mov r1, #2
0x000080fc <+52>: mov r2, #3
0x00008100 <+56>: mov r3, #4

Variable “a” is taken from the stack as expected (address is calculated in relation to FP). Other values are of course immediate in our simple example. To sum up: before issuing next Branch and Link instruction, compiler saved first 4 arguments in registers r0-r4 and put other arguments on the stack. It's time for the next frame:

0x00008104 <+60>: bl 0x8050 <bar>

Function “bar” is doing exactly the same thing: stores the frame pointer and link register on the stack, allocates some bytes by moving SP and prepare arguments in registers r0-r2 to call function “foo”. Lets go forward and assume we're returning from function “bar”. We'll get back to “foo” after that because it's slightly different.

Here's the ending of the function:

0x000080bc <+108>: mov r0, r3
0x000080c0 <+112>: sub sp, r11, #4
0x000080c4 <+116>: pop {r11, pc}

Return value is stored in r0. Then, Stack Pointer is simply moved back to the beginning of the stack frame with a help of the frame pointer. Finally, previous Frame Pointer is restored from the stack as well as Link Register is loaded into Program Counter. “pop” instruction will also increase SP accordingly to how many values are taken from the stack (in our example 2 words).

Regarding the “foo()”: compiler noticed that link register will not be touched douring call of the function, so the prologue is simplified:

Dump of assembler code for function foo:

   0x00008000 <+0>: push {r11}  ; (str r11, [sp, #-4]!)
=> 0x00008004 <+4>: add r11, sp, #0

It saves only frame pointer on the stack and updates it with current SP. At the exit following instructions are executed:

0x00008040 <+64>: mov r0, r3
0x00008044 <+68>: sub sp, r11, #0
0x00008048 <+72>: pop {r11}  ; (ldr r11, [sp], #4)
0x0000804c <+76>: bx lr

It stores return value in r0, moves SP back, restores frame pointer and issues Branch and Exchange instruction to the address pointed by Link Register. Branch and Exchange is a branch instruction with optional mode switch to Thumb. Lets see how the LR looks before this instruction:

lr             0x80b4 32948

The Branch and Exchange will switch to Thumb only if the less significant bit of the address is “1”. In our case 0x80b4 has “0” at last bit, so it will do just a normal ARM branch.
Here's how stack looks around SP before exiting from the “foo” function:

After the "0x00008044 <+68>: sub sp, r11, #0":


After the "0x00008048 <+72>: pop {r11}":


And finally we just branch back to the Link Registers and so we land one instruction after the original branch:

   0x000080b0 <+96>: bl 0x8000 <foo>
=> 0x000080b4 <+100>: str r0, [r11, #-12]  ← we continue here

If we would compile it with -fomit-frame-pointer, prologues will be little simplified but all addresses will be calculated in relation to SP. See example:

function wrapper with frame pointer:

000080c8 <wrapper>:
    80c8: e92d4800  push {fp, lr}
    80cc: e28db004  add fp, sp, #4
    80d0: e24dd078  sub sp, sp, #120 ; 0x78
    80d4: e3a03004  mov r3, #4
    80d8: e50b3008  str r3, [fp, #-8]
    80dc: e3a03005  mov r3, #5
    80e0: e58d3000  str r3, [sp]
    80e4: e3a03006  mov r3, #6
    80e8: e58d3004  str r3, [sp, #4]
    80ec: e3a03007  mov r3, #7
    80f0: e58d3008  str r3, [sp, #8]
    80f4: e51b0008  ldr r0, [fp, #-8]
    80f8: e3a01002  mov r1, #2
    80fc: e3a02003  mov r2, #3
    8100: e3a03004  mov r3, #4
    8104: ebffffd1  bl 8050 <bar>
    8108: e1a00000  nop               ; (mov r0, r0)
    810c: e24bd004  sub sp, fp, #4
    8110: e8bd8800  pop {fp, pc}

the same function without frame pointer:

000080b8 <wrapper>:
    80b8: e52de004  push {lr}  ; (str lr, [sp, #-4]!)
    80bc: e24dd07c  sub sp, sp, #124 ; 0x7c
    80c0: e3a03004  mov r3, #4
    80c4: e58d3074  str r3, [sp, #116] ; 0x74
    80c8: e3a03005  mov r3, #5
    80cc: e58d3000  str r3, [sp]
    80d0: e3a03006  mov r3, #6
    80d4: e58d3004  str r3, [sp, #4]
    80d8: e3a03007  mov r3, #7
    80dc: e58d3008  str r3, [sp, #8]
    80e0: e59d0074  ldr r0, [sp, #116] ; 0x74
    80e4: e3a01002  mov r1, #2
    80e8: e3a02003  mov r2, #3
    80ec: e3a03004  mov r3, #4
    80f0: ebffffd3  bl 8044 <bar>
    80f4: e1a00000  nop   ; (mov r0, r0)
    80f8: e28dd07c  add sp, sp, #124 ; 0x7c
    80fc: e49df004  pop {pc}  ; (ldr pc, [sp], #4)

To sum up: allocating memory on the stack means toggling back and forward the stack pointer and (optionally) the frame pointer. Stack grows downwards and “freeing memory” by moving the stack pointer does absolutely nothing with the underlying data (what makes it also a quick operation). We also observed how the arguments are passed to the functions and where the return value is stored.


Dynamic memory

Now, we can start allocating and accessing some dynamic memory. Because we're not using standard library for now, we need to handle '''system calls''' manually. We definitely need those system calls, since as a userspace application we don't have full access to the underlying hardware (RAM) and we need to ask OS for it (refer to ARM processor modes/rings).  Lets see how system calls are implemented on ARM. Example implementation of brk() system call looks as following:

0x400942dc <+0>: mov r12, r7
0x400942e0 <+4>: mov r7, #45 ; 0x2d   <--- store syscall number in r7
0x400942e4 <+8>: svc 0x00000000       <--- generate an exception
0x400942e8 <+12>:       mov r7, r12
0x400942ec <+16>: cmn r0, #4096 
0x400942f0 <+20>: bxls lr
0x400942f4 <+24>: rsb r0, r0, #0
0x400942f8 <+28>: b 0x400b1b40       <--- errno handling

Each system call have a specific number. Linux Kernel ABI for ARM specifies that we need to store  the system call number in register R7. In our example we're storing number “45” which is sys_brk().  All system calls numbers can be read here: kernel/arch/arm/kernel/calls.S. After storing the system call number, the actual transition to the Kernel space happens with usage of “SVC” instruction. It's a “Supervisor Call” - formerly this instruction was called “SWI” like “Software Interrupt” (only name has changed).  It generates a processor exception, so the CPSR is modified to enter supervisor mode and control flow is passed to the vector table and from there to vector_swi in kernel/arch/arm/kernel/entry-common.S. According to ARMv7-A specification, the vector_swi is a handler registered in exception vector + 0x08 offset. Every time a given exception is triggered, processor changes the control flow to the “vector table”. Base address of this table can be configured during initialization in SCTLR (System Control Register). In case of Android device I'm working on, the exception vector is at 0xFFFF0000. It can be configured to other address in Secure mode (TrustZone). You can read more about it in ARM-ARM  “B1.8.1 Exception vectors and the exception base address”.
Anyway, in our example we can see that in kernel/arch/arm/kernel/entry-armv.S there is a __vectors_start symbol which at offset 0x08 has a branch instruction to the vector_swi: "(ldr) pc, __vectors_start + 0x1000". So, after issuing “svc” instruction, processor executes that branch instruction and moves to vector_swi in kernel/arch/arm/kernel/entry-common.S. It does there couple more things related to maintenance of registers (like saving CPSR). Next, from vector_swi we branch to the specific sys_* handler:

ldrcc   pc, [r8, r7, lsl #2]
(in r8 there is our sys_call_table from calls.S and in r7 there is our system call number)

For “brk” system call  it means we'll execute the SYSCALL_DEFINE1(brk, unsigned long, brk) in kernel/mm/mmap.c .  Here actual kernel memory management happens. For now, we just assume it should increase heap area for our process. Quoting from brk() manual:

“brk()  and  sbrk()  change  the location of the program break, which defines the end of the process's data segment (i.e., the program break is the first location after the end of the uninitialized data segment).  Increasing the program break has the effect of allocating memory to the process; decreasing the break deallocates memory.”

Our first goal is to check if it actually works.  Firstly, we need to call brk() with “zero” as an argument. It will give us current address of brk in our process (read again this blog if you don't know what program break is). After that, we can increment this address by some value and call brk() again. As a result, the heap area should be created for our process:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
void main()                                                                                                                                                 
{
    asm volatile (
        "mov r0, #0\n"                 // first brk needs zero as an argument
        "mov r7, #45\n"                // 45 is a brk syscall number
        "swi #0\n"                     // execute brk syscalli
        "mov r1, r0\n"                 // save return address in r1
        "add r0, r1, #8192\n"          // increase brk for 2 pages
        "swi #0\n"                     // increase heap (execute brk again, “45” is already in r7)
    };
    while(1);                          // of course instead of loop we can use GDB to set a breakpoint
}

This simple program will ask Kernel for 2 pages (8192 bytes) of memory. We can see now:


root@SGP621:/proc/4679 # cat maps 
00008000-00009000 r-xp 00008000 b3:1a 24441      /data/brk_test 
00fd0000-00fd2000 rwxp 00000000 00:00 0          [heap] 
b6f7a000-b6f7b000 r-xp 00000000 00:00 0          [sigpage] 
bee40000-bee61000 rwxp 00000000 00:00 0          [stack] 
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors] 

New area is created, called “heap”. Looking into more detailed view (in smaps):


00fd0000-00fd2000 rwxp 00000000 00:00 0          [heap] 
Size:                  8 kB 
Rss:                   0 kB 
Pss:                   0 kB 
Shared_Clean:          0 kB 
Shared_Dirty:          0 kB 
Private_Clean:         0 kB 
Private_Dirty:         0 kB 
Referenced:            0 kB 
Anonymous:             0 kB 
AnonHugePages:         0 kB 
Swap:                  0 kB 
KernelPageSize:        4 kB 
MMUPageSize:           4 kB 
Locked:                0 kB 

We see in field “size” that Kernel created for us 8KB of virtual memory space. Because we're using memory over-committing, Kernel only reserved the memory for us, but did NOT allocate anything. For now, it only updated Kernel internal structures like mm_struct or vm_area_struct. We see, that “Rss” value is still zero. The actual memory allocation will happen when we will try to use the memory we've asked for. Let's try writing example data (number “255”) into both pages we've requested by brk():


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
void main()                                                                                                                                                  
{ 
    asm volatile ( 
        "mov r0, #0\n"                 // first brk needs null as an argument 
        "mov r7, #45\n"                // 45 is a brk syscall number 
        "swi #0\n"                     // execute brk syscalli
        "mov r1, r0\n"                 // save return address in r1 
        "add r0, r1, #8192\n"          // increase brk for 2 pages 
        "swi #0\n"                     // increase heap (execute brk again)
        "mov r2, #255\n"               // write "255" to tmp register                                                                                         
        "str r2, [r1]\n"               // write "255" to memory to the beginning of the heap (page fault will happen)
        "str r2, [r1, #32]\n"          // write "255" to memory to the 32 bytes from beginning (no page fault)
        "mov r3, #4096\n"              // use r3 as a tmp register to calculate offset 
        "str r2, [r1, r3]\n"           // write "255" to the second page (page fault)
    };
    while(1);
}

The result is:


01b01000-01b03000 rwxp 00000000 00:00 0          [heap] 
Size:                  8 kB 
Rss:                   8 kB  <-- increased
Pss:                   8 kB  <-- increased
Shared_Clean:          0 kB 
Shared_Dirty:          0 kB 
Private_Clean:         0 kB  
Private_Dirty:         8 kB  <-- increased
Referenced:            8 kB 
Anonymous:             8 kB 
AnonHugePages:         0 kB 
Swap:                  0 kB 
KernelPageSize:        4 kB 
MMUPageSize:           4 kB 
Locked:                0 kB

So, what happened? When for the first time we executed “store” instruction, the MMU in processor couldn't find any entry in its page table that corresponds to the virtual address we were referring to. In that case, the MMU throws the data_abort exception which is handled by kernel as minor page fault and results with starting Kernel memory management mechanism which allocates the actual memory (see kernel/arch/arm/mm/fault.c). We'll explain Kernel internals later. For now, we need to assume, that after allocating this memory, new entry in page table for our process is created and the “store” instruction is re-tried. Now, the entry is present, and the instruction succeeds. Because we're operating on page_size blocks, any other access inside this page will also succeed, but the another page fault will happen for access inside the second page we've requested. To see it better, you can use GDB and go instruction-by-instruction.

The conclusion from above example is that (assuming memory over-commit is set to “1”, which is true in most cases) Kernel will always give us virtual memory, but it doesn't mean that there is a physical resource available at all. We can get Out of Memory situation while referencing this memory.

In the same way you can check how stack area works and how global variables are handled. We'll now move to other way of getting dynamic memory from kernel: using mmap. We're inspecting brk() and mmap() because those are main system calls used by dlmalloc() adn jemalloc().

This is the moment in which you should read manual of mmap(). In short words, mmap() allows you to map a file into the virtual memory so the file can be accessed like it would be in the RAM. Of course, the big advantage of mmaping file (because of demand paging mechanism) is that not entire file is physically copied at once to the memory, but only when particular data is needed. It's also useful when you have many processes that read data from the same file. Kernel can share such mmaped file between all processes and hence save the RAM. The obvious example are shared libraries.
However, mmap() can be also used to just create (reserve) some virtual memory area without having actual file underneath. This is called anonymous mapping and it's heavily used by dlmalloc() and jemalloc(). So, the mmap can be used to create a memory area that programmer can use as dynamic allocated memory. The big advantage of mmap() over brk() is that you can unmap any mmaped space and re-use it. In case of brk() you can only decrease or increase the heap, so if you release some memory in the bottom of your brk()-created area, then you have an unused hole in you virtual memory area (memory fragmentation). You can't use it again with brk() unless you decrement the whole brk() to that point (discarding other memory). That way, the brk()-based heap is actually kind of stack. In dlmalloc() only small allocations are done using brk() call (which is in other hand quite fast). Everything above certain trigger is allocated with usage of mmap().

Declaration of mmap is following:


1
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

We want to call it with those arguments:


1
addr = mmap(0, 8192, PROT_WRITE | PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1. 0);

It should return the virtual address from which we can use 2 pages (8192 bytes) of read/write private memory.

The system call number for mmap is 192. Implementation:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
void main()                                                                                                                                                   
{ 
    asm volatile (        
        "mov r0, #0\n"                // void *addr = 0 
        "mov r1, #8192\n"             // size_t length = 2 pages 
        "mov r2, #3\n"                // int prot = PROT_WRITE | PROT_READ 
        "mov r3, #34\n"               // int flags = MAP_PRIVATE | MAP_ANONYMOUS 
        "mov r4, #-1\n"               // int fd = -1 
        "mov r5, #0\n"                // off_t offset = 0 
        "push {r4, r5}\n"             // push arguments onto stack 
        "mov r7, #192\n"              // 192 is a mmap2 syscall 
        "swi #0\n"                    // call mmap2 
        "mov r2, #255\n"              // write example value "255" to tmp register 
        "str r2, [r0]\n"              // write "255" to the mmpaped memory 
    );  
 
    while(1); 
}

After building it and running through GDB, we can have a look at /proc//maps for the reference:


00008000-00009000 r-xp 00008000 b3:19 215        /data/mkmmap 
b6f64000-b6f65000 r-xp 00000000 00:00 0          [sigpage] 
bed62000-bed83000 rwxp 00000000 00:00 0          [stack] 
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors] 

Now. move instruction by instruction observing the /proc/<pid>/maps in other shell. Move by issuing “ni” command in GDB until swi instruction (remember to add to the GDB config “display/i $pc “).  Now, execute the swi and see the /proc<pid>maps output:


root@D5803:/proc/8186 # cat maps                                               
00008000-00009000 r-xp 00008000 b3:19 215        /data/mkmmap 
b6f62000-b6f64000 rwxp 00000000 00:00 0                         <-- new area
b6f64000-b6f65000 r-xp 00000000 00:00 0          [sigpage] 
bed62000-bed83000 rwxp 00000000 00:00 0          [stack] 
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

New area appeared without any name. Lets see details in /proc/<pid>/smaps:


b6f62000-b6f64000 rwxp 00000000 00:00 0 
Size:                  8 kB 
Rss:                   0 kB 
Pss:                   0 kB 
Shared_Clean:          0 kB 
Shared_Dirty:          0 kB 
Private_Clean:         0 kB 
Private_Dirty:         0 kB 
Referenced:            0 kB 
Anonymous:             0 kB 
AnonHugePages:         0 kB 
Swap:                  0 kB 
KernelPageSize:        4 kB 
MMUPageSize:           4 kB 
Locked:                0 kB 

It has 8kB size, so 2 pages as we requested. Note, that it's only virtual area reserved, there is 0kB of “referenced” memory. Lets see how many minor page faults our process has so far (10th entry in stat):


root@D5803:/proc/8186 # cat stat                                               
8186 (mkmmap) t 8183 8186 7923 34816 8186 4194304 >>63<< 0 0 0 0 0 0 0 20 0 1 0 85958 155648 2 4294967295 32768 32824 3201837312 3201837300 32812 0 0 268435456 0 3222961244 0 0 17 0 0 0 0 0 0 32768 32824 2453504

It's 63 page faults (so far, the number of page faults is related to loading the program itself for example by issuing execve() system call by the shell). Now, lets try to write sample value “255” to our newly created vm area by executing next two instructions (the mov and str).

After that we can see smaps:


6f62000-b6f64000 rwxp 00000000 00:00 0 
Size:                  8 kB 
Rss:                   4 kB   <-- updated
Pss:                   4 kB   <-- updated
Shared_Clean:          0 kB 
Shared_Dirty:          0 kB 
Private_Clean:         0 kB   
Private_Dirty:         4 kB   <-- updated
Referenced:            4 kB   <-- updated
Anonymous:             4 kB   <-- updated
AnonHugePages:         0 kB 
Swap:                  0 kB 
KernelPageSize:        4 kB 
MMUPageSize:           4 kB 
Locked:                0 kB

stat (10th entry = 68):


8186 (mkmmap) t 8183 8186 7923 34816 8186 4194304 >>68<< 0 0 0 0 0 0 0 20 0 1 0 85958 155648 2 4294967295 32768 32824 3201837312 32

Again (as in previous example about brk), when we issued “str” instruction (store in the memory), the processor asked MMU to find in its page table the physical address corresponding to our virtual address returned in r0 by mmap system call. MMU didn't find anything, so it generated an exception, Exception was handled as “minor page fault” which triggered actual memory allocation in the Kernel. The proper physical address was found and page table was updated. After that, the “str” instruction was repeated and now succeeded. Value “255” has been written to the RAM. Note, that we see 5 more minor page faults in the stat. Using GDB each instruction generates 2 minor page faults. We executed two instruction, but the “str” gave us 3 minor page faults. The additional one was related to our actual memory allocation.  In smaps we also see, that Rss and Pss fields were updated as well as Referenced and “Anonymous”. As we expected, now we have 4kB of new private, anonymous memory for our process. When we “touch” second page of our 8kB mmap area the same mechanism will be triggered.

Note, the '''major page fault''' would happen if we mmap actual file instead of anonymous mapping. In such case, Kernel will need to access the flash memory to copy page to the cache.

To sum up dynamic memory allocation or rather the process of requesting Kernel for dynamic memory: the main system calls that userspace application can use are brk() and mmap(), however there are number of other system calls related to memory management like madvise() or mprotect(). The smallest chunk of memory that OS can give to userspace is of '''PAGE_SIZE'''. When the memory overcommit is enabled, actual memory allocation will happen during accessing given address with usage of processor exceptions and page fault handling.

No comments:

Post a Comment