Friday, 27 November 2015

Why does FUSE on Android suck?


FUSE (Filesystem in Userspace) is a very useful mechanism in many applications. The thing is, those applications should not be focused on the performance of actual data transfers. FUSE has many advantages that come from userspace sandboxing, but performance certainly wasn't the main design consideration. I'm not saying that it is a bad design or that something is wrong with FUSE itself. It is just focused on other aspects, like security, stability and ease of development. The problem I'd like to discuss here is that Google decided to use FUSE as a frontend to the actual data stored in non-volatile memory.

FUSE was introduced in Android 4.4 to handle "emulated" storage. Before that, the "emulated" storage path was mounted as VFAT. Here's how it looked on old ICS (output of the mount command):

/dev/block/vold/179:14 /mnt/sdcard vfat rw,dirsync,nosuid,nodev,noexec,relatime,uid=1000,gid=1015,fmask=0702,dmask=0702,allow_utime=0020,codepage=cp437,iocharset=iso8859-1,shortname=mixed,utf8,errors=remount-ro 0 0

Don't be confused by the "sdcard" directory name. It is still internal flash. External storage is usually mounted as something like "sdcard1".

This kind of partition was needed for compatibility reasons. Applications should be able to store data in the same way no matter whether it ends up on internal or external flash. When data is stored on an external SD card, the system usually has to deal with a FAT32 filesystem. FAT32 is quite different from EXT4, which Android uses internally. For instance, it is not case sensitive and doesn't support discretionary access control.
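
To illustrate what "not case sensitive" means for an emulation layer, here is a minimal Python sketch of a case-insensitive name lookup on top of a case-sensitive filesystem such as EXT4. The function name find_case_insensitive is hypothetical and this is not the sdcard daemon's code. It only shows the idea of scanning a directory for a name that matches regardless of case.

```python
import os


def find_case_insensitive(directory, name):
    """Return the path of the entry in `directory` that matches `name`
    regardless of case, or None if nothing matches.
    Hypothetical helper, for illustration only."""
    exact = os.path.join(directory, name)
    if os.path.lexists(exact):
        # Fast path: the exact spelling exists.
        return exact
    wanted = name.lower()
    # Slow path: compare every directory entry in lower case.
    for entry in sorted(os.listdir(directory)):
        if entry.lower() == wanted:
            return os.path.join(directory, entry)
    return None
```

Note that the slow path has to list the whole directory, so a lookup like this costs more than a plain lookup on the underlying filesystem.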

Because more Android-specific permissions had to be added, Google decided to use FUSE to emulate FAT32:

/dev/fuse /mnt/shell/emulated fuse rw,nosuid,nodev,noexec,relatime,user_id=1023,group_id=1023,default_permissions,allow_other 0 0
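
To check which mount points on a device go through FUSE, the mount table can be parsed with a few lines of Python. This is a minimal sketch: the helper names parse_mount_line and fuse_mounts are made up for this example, and the option list of the VFAT sample line is shortened.

```python
def parse_mount_line(line):
    """Split one line of mount table output into named fields."""
    device, mount_point, fs_type, options = line.split()[:4]
    return {
        "device": device,
        "mount_point": mount_point,
        "fs_type": fs_type,
        "options": options.split(","),
    }


def fuse_mounts(mount_table):
    """Return the mount points of all FUSE filesystems in the given text."""
    lines = [l for l in mount_table.splitlines() if l.strip()]
    entries = [parse_mount_line(l) for l in lines]
    return [e["mount_point"] for e in entries if e["fs_type"].startswith("fuse")]


# Sample input based on the two mount lines quoted in this article
# (the VFAT option list is shortened here).
SAMPLE = """\
/dev/block/vold/179:14 /mnt/sdcard vfat rw,dirsync,nosuid 0 0
/dev/fuse /mnt/shell/emulated fuse rw,nosuid,nodev,noexec,relatime,user_id=1023,group_id=1023,default_permissions,allow_other 0 0
"""

print(fuse_mounts(SAMPLE))
```

On a real device the same functions could be fed with the contents of /proc/mounts instead of the sample text.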


So, how does it work on Android?

First of all, FUSE support is enabled in the kernel. Complementing it, there is a userspace daemon called "sdcard". On boot, the sdcard daemon mounts the /dev/fuse device on the emulated directory:

static int fuse_setup(struct fuse* fuse, gid_t gid, mode_t mask) {
    char opts[256];

    /* Open the FUSE device; kernel requests will arrive on this descriptor. */
    fuse->fd = open("/dev/fuse", O_RDWR);
    if (fuse->fd == -1) {
        ERROR("failed to open fuse device: %s\n", strerror(errno));
        return -1;
    }

    /* Detach anything that is still mounted on the destination path. */
    umount2(fuse->dest_path, MNT_DETACH);

    /* Pass the descriptor and the owner uid/gid to the kernel as mount options. */
    snprintf(opts, sizeof(opts),
            "fd=%i,rootmode=40000,default_permissions,allow_other,user_id=%d,group_id=%d",
            fuse->fd, fuse->global->uid, fuse->global->gid);
    if (mount("/dev/fuse", fuse->dest_path, "fuse", MS_NOSUID | MS_NODEV | MS_NOEXEC |
            MS_NOATIME, opts) != 0) {
        ERROR("failed to mount fuse filesystem: %s\n", strerror(errno));
        return -1;
    }

    fuse->gid = gid;
    fuse->mask = mask;

    return 0;
}

After that, it blocks in read() on the FUSE device, waiting for messages from the kernel:

static void handle_fuse_requests(struct fuse_handler* handler)
{
    struct fuse* fuse = handler->fuse;
    for (;;) {
        /* Block until the kernel queues the next request. */
        ssize_t len = TEMP_FAILURE_RETRY(read(fuse->fd,
                handler->request_buffer, sizeof(handler->request_buffer)));
        if (len < 0) {
            if (errno == ENODEV) {
                /* The filesystem has been unmounted. */
                ERROR("[%d] someone stole our marbles!\n", handler->token);
                exit(2);
            }
            ERROR("[%d] handle_fuse_requests: errno=%d\n", handler->token, errno);
            continue;
        }
        /* (..) the rest of the loop is not shown here */
    }
}

From now on, every file operation inside the directory mounted through FUSE will be handled in a specific way. For example, let's say we'd like to read the file "test.txt" located at /sdcard/test.txt. Note again: "sdcard" means internal flash.

root@android: # cd /sdcard
root@android:/sdcard # cat test.txt

We expect cat to issue open(), read() and close() system calls during that operation. Let's have a look at what we get using strace:

root@android:/sdcard # strace -f -e open,openat,read,close cat test.txt
>>stripped output related to loading "cat" by shell<<
(..)                             = 0
openat(AT_FDCWD, "test.txt", O_RDONLY)  = 3
read(3, "1234\n", 1024)                 = 5
read(3, "", 1024)                       = 0
close(3)                                = 0

Looks OK, but hey, what is the sdcard daemon doing in the meantime? Let's strace sdcard at the same time:

root@android: # ps | grep sdcard
media_rw  714   1     23096  1528  ffffffff 81ca6254 S /system/bin/sdcard
root@android: # strace -f -p 714 
Process 714 attached with 3 threads
[pid   916] read(3,  <unfinished ...>
[pid   915] read(3,  <unfinished ...>
[pid   714] read(4,  <unfinished ...>
[pid   916] <... read resumed> "1\0\0\0\1\0\0\0\2\234\3\0\0\0\0\0\200\200@\200\177\0\0\0\0\0\0\0\0\0\0\0"..., 262224) = 49
[pid   916] faccessat(AT_FDCWD, "/data/media/0/test.txt", F_OK) = 0
[pid   916] newfstatat(AT_FDCWD, "/data/media/0/test.txt", {st_mode=S_IFREG|0664, st_size=5, ...}, AT_SYMLINK_NOFOLLOW) = 0
[pid   916] writev(3, [{"\220\0\0\0\0\0\0\0\2\234\3\0\0\0\0\0", 16}, {"\200\261\317\200\177\0\0\0\223(\0\0\0\0\0\0\n\0\0\0\0\0\0\0\n\0\0\0\0\0\0\0"..., 128}], 2) = 144
[pid   915] <... read resumed> "0\0\0\0\16\0\0\0\3\234\3\0\0\0\0\0\200\261\317\200\177\0\0\0\0\0\0\0\0\0\0\0"..., 262224) = 48
[pid   916] read(3,  <unfinished ...>
[pid   915] openat(AT_FDCWD, "/data/media/0/test.txt", O_RDONLY|O_LARGEFILE) = 5
[pid   915] writev(3, [{" \0\0\0\0\0\0\0\3\234\3\0\0\0\0\0", 16}, {"\260p\300\200\177\0\0\0\0\0\0\0\0\0\0\0", 16}], 2 <unfinished ...>
[pid   916] <... read resumed> "P\0\0\0\17\0\0\0\4\234\3\0\0\0\0\0\200\261\317\200\177\0\0\0\0\0\0\0\0\0\0\0"..., 262224) = 80
[pid   915] <... writev resumed> )      = 32
[pid   916] pread64(5,  <unfinished ...>
[pid   915] read(3,  <unfinished ...>
[pid   916] <... pread64 resumed> "1234\n", 4096, 0) = 5
[pid   916] writev(3, [{"\25\0\0\0\0\0\0\0\4\234\3\0\0\0\0\0", 16}, {"1234\n", 5}], 2) = 21
[pid   915] <... read resumed> "8\0\0\0\3\0\0\0\5\234\3\0\0\0\0\0\200\261\317\200\177\0\0\0\0\0\0\0\0\0\0\0"..., 262224) = 56
[pid   916] read(3,  <unfinished ...>
[pid   915] newfstatat(AT_FDCWD, "/data/media/0/test.txt", {st_mode=S_IFREG|0664, st_size=5, ...}, AT_SYMLINK_NOFOLLOW) = 0
[pid   915] writev(3, [{"x\0\0\0\0\0\0\0\5\234\3\0\0\0\0\0", 16}, {"\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\224(\0\0\0\0\0\0\5\0\0\0\0\0\0\0"..., 104}], 2) = 120
[pid   916] <... read resumed> "@\0\0\0\31\0\0\0\6\234\3\0\0\0\0\0\200\261\317\200\177\0\0\0\0\0\0\0\0\0\0\0"..., 262224) = 64
[pid   916] write(3, "\20\0\0\0\0\0\0\0\6\234\3\0\0\0\0\0", 16) = 16
[pid   916] read(3, "@\0\0\0\22\0\0\0\7\234\3\0\0\0\0\0\200\261\317\200\177\0\0\0\0\0\0\0\0\0\0\0"..., 262224) = 64
[pid   916] close(5)                    = 0
[pid   916] write(3, "\20\0\0\0\0\0\0\0\7\234\3\0\0\0\0\0", 16) = 16
[pid   916] read(3,  <unfinished ...>
[pid   915] read(3, ^CProcess 714 detached
Process 915 detached

A lot is happening. This is because each file operation now works in the following way:
  1. Userspace application issues a system call that will be handled by the FUSE driver in the kernel (we see it in the first strace output)
  2. FUSE driver in the kernel notifies the userspace daemon (sdcard) about the new request
  3. Userspace daemon reads /dev/fuse
  4. Userspace daemon parses the command and recognizes the file operation (e.g. open)
  5. Userspace daemon issues a system call to the actual filesystem (EXT4)
  6. Kernel handles physical data access and sends the data back to userspace
  7. Userspace daemon modifies (or not) the data and passes it through /dev/fuse to the kernel again
  8. Kernel completes the original system call and moves the data to the actual userspace application (in our example cat)
Uff, that's a lot, isn't it?
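
The byte strings that strace printed for the sdcard daemon can be decoded to see which requests the kernel actually sent. The following Python sketch is a rough illustration, assuming a little-endian device and the standard Linux FUSE request header, which starts with a 32-bit length, a 32-bit opcode and a 64-bit unique request id. The function name decode_request is made up for this example, and only the opcodes that appear in the trace are listed.

```python
import struct

# Opcode numbers from the Linux FUSE protocol (only those seen in the trace).
FUSE_OPCODES = {
    1: "LOOKUP",
    3: "GETATTR",
    14: "OPEN",
    15: "READ",
    18: "RELEASE",
    25: "FLUSH",
}


def decode_request(raw):
    """Decode the first 16 bytes of a FUSE request: length, opcode, unique id."""
    length, opcode, unique = struct.unpack_from("<IIQ", raw)
    return length, FUSE_OPCODES.get(opcode, "opcode %d" % opcode), unique


# First 16 bytes of each read(3, ...) result in the sdcard trace, in order.
REQUESTS = [
    b"1\0\0\0\1\0\0\0\2\234\3\0\0\0\0\0",
    b"0\0\0\0\16\0\0\0\3\234\3\0\0\0\0\0",
    b"P\0\0\0\17\0\0\0\4\234\3\0\0\0\0\0",
    b"8\0\0\0\3\0\0\0\5\234\3\0\0\0\0\0",
    b"@\0\0\0\31\0\0\0\6\234\3\0\0\0\0\0",
    b"@\0\0\0\22\0\0\0\7\234\3\0\0\0\0\0",
]

for raw in REQUESTS:
    length, name, unique = decode_request(raw)
    print("%-8s len=%d unique=%#x" % (name, length, unique))
```

Decoded this way, the trace shows six requests (LOOKUP, OPEN, READ, GETATTR, FLUSH, RELEASE), and the decoded lengths match the return values of the corresponding read() calls. Each of them is a separate round trip between the kernel and the daemon.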


Let's see what side effects such an approach has. The obvious one is the performance overhead of each additional system call. Here are the numbers (all tests were performed several times, and similar results were observed each time):

Test #1: Copy big file within one partition.


root@android:/data # echo 3 > /proc/sys/vm/drop_caches
root@android:/data # dd of=bigbuck.out bs=1m                      
691+1 records in
691+1 records out
725106140 bytes transferred in 10.779 secs (67270260 bytes/sec)


root@android:/sdcard # echo 3 > /proc/sys/vm/drop_caches                      
root@android:/sdcard # dd of=bigbuck.out bs=1m                  
691+1 records in
691+1 records out
725106140 bytes transferred in 13.031 secs (55644704 bytes/sec)


In this test, throughput through FUSE is about 17% lower (the copy takes about 21% longer).
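
The percentage can be checked directly from the two dd summaries above. A small Python sketch of the arithmetic follows (the function names are made up for this example). Note that "slower" can be expressed either as lower throughput or as longer duration, and the two give different percentages.

```python
def throughput_drop(direct_bps, fuse_bps):
    """Fraction by which throughput drops when going through FUSE."""
    return 1.0 - fuse_bps / direct_bps


def time_increase(direct_secs, fuse_secs):
    """Fraction by which the copy takes longer when going through FUSE."""
    return fuse_secs / direct_secs - 1.0


# Numbers taken from the two dd runs: /data (EXT4) first, /sdcard (FUSE) second.
print("throughput: %.1f%% lower" % (100 * throughput_drop(67270260, 55644704)))
print("duration:   %.1f%% longer" % (100 * time_increase(10.779, 13.031)))
```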

Test #2: Copy a lot of small files within one partition. There were 10,000 files of 5 kB each.


root@android:/data # echo 3 > /proc/sys/vm/drop_caches
root@android:/data # time cp small/* small2/                                  
    0m17.27s real     0m0.32s user     0m6.07s system


root@android:/sdcard # echo 3 > /proc/sys/vm/drop_caches                      
root@android:/sdcard # time cp small/* small2/                                
    1m3.03s real     0m1.05s user     0m9.59s system


I think no comment is needed. It took over 1 minute (!) to copy ~50MB of small files on the FUSE-mounted partition, compared to ~17 seconds on the EXT4 FS. That is roughly 3.6 times longer.
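
For anyone who wants to repeat this kind of test, here is a rough Python sketch that creates a set of small files and times copying them. It is not the procedure used for the numbers above (those come from cp and the shell's time). The function names are hypothetical, the demo at the bottom uses only 100 files in a temporary directory, and the page cache is not dropped, which requires root.

```python
import os
import shutil
import tempfile
import time


def make_small_files(directory, count=10000, size=5 * 1024):
    """Create `count` files of `size` bytes each in `directory`."""
    os.makedirs(directory, exist_ok=True)
    payload = os.urandom(size)
    for i in range(count):
        with open(os.path.join(directory, "file%05d" % i), "wb") as f:
            f.write(payload)


def timed_copy(src, dst):
    """Copy every file from src to dst and return the elapsed seconds."""
    os.makedirs(dst, exist_ok=True)
    start = time.monotonic()
    for name in os.listdir(src):
        shutil.copyfile(os.path.join(src, name), os.path.join(dst, name))
    return time.monotonic() - start


if __name__ == "__main__":
    # Small demo run in a temporary directory; point `base` at a directory
    # on the filesystem under test to get meaningful numbers.
    base = tempfile.mkdtemp()
    try:
        make_small_files(os.path.join(base, "small"), count=100)
        secs = timed_copy(os.path.join(base, "small"), os.path.join(base, "small2"))
        print("copied 100 files in %.3f s" % secs)
    finally:
        shutil.rmtree(base)
```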

Double caching

Another implication is double caching of data. The Linux kernel uses the page cache to keep recently accessed data in memory, specifically data from non-volatile storage. This greatly improves data access performance. However, we don't want to have the same data cached twice. Unfortunately, this is what happens because of the way FUSE is used on Android.

Observing the double caching behavior caused by FUSE is very simple:
  1. Create a file with a known size
  2. Copy it into the /sdcard folder on the phone (/sdcard is a symlink to /storage/emulated/legacy, which is a symlink to /mnt/shell/emulated/0, which is mounted as FUSE)
  3. Drop the page cache -> take a snapshot of page cache usage -> read the test file -> take another snapshot of page cache usage -> see the difference between page cache usage before and after reading the file:
root@android: # echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; cat /proc/meminfo | grep Cache ; cat /sdcard/test_file > /dev/null ; cat /proc/meminfo | grep Cache

If the size of the file is, for example, 10MB, we'll get something like this:

before file operation:  

Cached: 241864 kB

after file operation: 

Cached: 263072 kB

The expected result would be 10MB more than 241MB in cache, so something around 251MB. Instead, we see 263MB in cache after reading 10MB of data, an increase of about 21MB. It means the kernel cached twice as much as needed. The same test performed directly on the EXT4 FS (for instance in the /data folder) will show, as expected, 10MB more of cached pages.
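
The measurement above can also be scripted. The following Python sketch parses the "Cached:" line of /proc/meminfo and computes the growth between two snapshots. The function names are made up for this example, and measure() assumes a Linux system where the page cache has just been dropped (which requires root). The last lines simply replay the two values quoted above.

```python
import re


def cached_kb(meminfo_text):
    """Return the value of the 'Cached:' line of /proc/meminfo in kB."""
    match = re.search(r"^Cached:\s+(\d+) kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Cached:' line found")
    return int(match.group(1))


def cache_growth_kb(before_text, after_text):
    """Difference in page cache usage between two /proc/meminfo snapshots."""
    return cached_kb(after_text) - cached_kb(before_text)


def measure(path):
    """Read `path` completely and return the page cache growth in kB."""
    with open("/proc/meminfo") as f:
        before = f.read()
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    with open("/proc/meminfo") as f:
        after = f.read()
    return cache_growth_kb(before, after)


# Replay the two snapshots quoted in the article.
growth = cache_growth_kb("Cached: 241864 kB\n", "Cached: 263072 kB\n")
print("%d kB, i.e. %.1f times the size of a 10 MB file" % (growth, growth / (10 * 1024)))
```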

So, we have the same data cached twice: once on behalf of the user application that issued the original open/read system calls, and once on behalf of the "sdcard" daemon. The first copy is cached by FUSE, the second one by the EXT4 FS.

When I first noticed it, I tried to force FUSE to skip caching. Here are my notes from that time:
We can skip the fuse cache by providing FOPEN_DIRECT_IO inside the kernel. I tested this solution, however it affected performance significantly. Although caching works OK (meaning there is only one copy of data in cache and subsequent reads don't generate I/O to the flash), there is additional overhead from switching more often between the sdcard daemon and the fuse fs in the kernel. Maybe it can be tweaked more.
There is FOPEN_KEEP_CACHE in fuse that might be useful; it needs more investigation.
Another solution is to pass the O_DIRECT flag in the sdcard daemon when it opens ext4fs files. We then bypass the ext4fs caches and should be able to use the page cache created by fuse. However, using O_DIRECT requires user buffers to be aligned in memory to the block size. The size of data chunks should be aligned as well. The sdcard daemon has been prepared by Google for external O_DIRECT requests. A possible solution would be to enable KEEP_CACHE in fuse on the kernel side and use O_DIRECT for all sdcard daemon requests. I did it for the ‘read’ case and it works, however there is a significant overhead for the first read of data. Subsequent reads are much faster than originally (due to caching in fuse). Using it for writes may be tricky though.
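
The alignment requirement of O_DIRECT is easy to get wrong, so here is a minimal Python sketch of an aligned direct read. It is not what the sdcard daemon does. It assumes Linux, a filesystem that supports O_DIRECT, and a block size that does not exceed the page size (the buffer comes from an anonymous mmap, which is page aligned). The function names round_up and read_direct are made up for this example.

```python
import mmap
import os


def round_up(n, block_size):
    """Round n up to the next multiple of block_size."""
    return (n + block_size - 1) // block_size * block_size


def read_direct(path, block_size=4096):
    """Read a whole file with O_DIRECT into an aligned buffer."""
    size = os.stat(path).st_size
    # Both the buffer address and the requested length must be aligned.
    buf_len = max(round_up(size, block_size), block_size)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, buf_len)  # anonymous mapping, page aligned
        try:
            got = os.readv(fd, [buf])  # a single read is enough for this sketch
            return buf[:got]
        finally:
            buf.close()
    finally:
        os.close(fd)
```

Because the read bypasses the page cache of the underlying filesystem, the first read always goes to the flash, which fits the overhead described in the notes above.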

Another way to solve it is to use POSIX_FADV_DONTNEED fadvise in the sdcard daemon. I tested it as well, however, again, it affects performance too much.
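
For reference, this is roughly what the fadvise approach looks like from userspace, as a Python sketch using os.posix_fadvise. It is not the change made to the sdcard daemon. The function name read_and_drop_cache is made up for this example, and the advice is only a hint that the kernel may ignore.

```python
import os


def read_and_drop_cache(path, chunk_size=1 << 20):
    """Read a file, then ask the kernel to drop its cached pages.

    Returns the number of bytes read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        # offset=0 and len=0 mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```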
Basically, the most important conclusion from the above investigation was: get rid of FUSE and implement the FAT32 emulation layer inside the kernel.

Other issues

Besides performance and double caching, there are other problems with FUSE on Android. For instance, not all FAT32 features are implemented in the sdcard daemon. There were issues with the utime() system call and with the lack of full support for the O_DIRECT flag.

I don't want to blame only Google. As officially stated:
"Devices may provide external storage by emulating a case-insensitive, permissionless filesystem backed by internal storage. One possible implementation is provided by the FUSE daemon in system/core/sdcard, which can be added as a device-specific init.rc service". 
The FUSE daemon is only an example implementation, the one that is easiest to maintain, but it also has a lot of drawbacks.
What's more interesting, some mobile vendors (Samsung, Motorola) have already realized this and replaced FUSE with their own in-kernel (or mixed) implementations. Samsung has created a driver based on WrapFS called "sdcardfs". In my opinion it's the best approach: use WrapFS to implement the FAT32 emulation layer inside the kernel. Whether Samsung implemented it correctly is another question, but from what I saw in the officially published Samsung kernel sources, it's not so bad.


To sum up why FUSE on Android sucks:
  • Performance
  • Double caching
  • Several other minor defects, like missing allow_utime flag
Note that Android as an operating system doesn't access files via FUSE internally. However, high-level applications do. Use cases like saving photos from the camera, recording videos or reading offline maps will suffer the most from the FUSE drawbacks described in this article.
