During a performance evaluation, an unfortunate interaction of the STACKLEAK plugin with the RAP plugin was noticed that led to unnecessary code bloat. This blog post highlights the steps our new team member, Mathias Krause, took to resolve the source of the problem.

Background on STACKLEAK and RAP

First, some background. You can skip this section if you're familiar with the purpose of the STACKLEAK and RAP plugins.

STACKLEAK

STACKLEAK was introduced back in 2011 as a coarse-grained countermeasure to address stack-based infoleaks. Its main idea is to wipe the kernel stack on syscall exit (sometimes also on entry) to prevent leaking any sensitive information from previous syscalls via uninitialized stack variables.
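
In its naive form, the idea boils down to something like the following sketch (purely illustrative: the helpers are real kernel interfaces, but the poison value is a placeholder and details such as reserved areas at the stack boundaries are ignored):

    /* Naive exit-time wipe (sketch): poison the entire currently
     * unused part of the kernel stack. The stack grows down, so
     * everything between the start of the stack area and the current
     * stack pointer may hold leftovers from deeper call chains.
     */
    static void erase_kstack_naive(void)
    {
            unsigned long *p   = (unsigned long *)task_stack_page(current);
            unsigned long *end = (unsigned long *)current_stack_pointer;

            while (p < end)
                    *p++ = 0xdeadbeefdeadbeefUL; /* placeholder poison value */
    }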

As wiping the full kernel stack on every syscall is quite a performance killer — it's 16k on x86-64 these days — the plugin tries to track how much stack space is actually used. A function — pax_track_stack() — regularly stores the current stack pointer value into the task-specific variable lowest_stack that gets used on syscall exit as a start address to wipe from. The plugin will inject calls to pax_track_stack() into the generated code for eligible functions, either ones with a large enough stack frame or users of alloca().
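
Conceptually, the tracking function is tiny. Here's a sketch of what it boils down to (the actual PaX implementation differs in detail; only lowest_stack is a real field name):

    /* Sketch of pax_track_stack(): remember the deepest (lowest)
     * stack pointer value seen during this syscall, with a sanity
     * check that it still points into the task's stack area. The
     * exit-time wipe can then start at lowest_stack instead of the
     * very bottom of the stack.
     */
    void pax_track_stack(void)
    {
            unsigned long sp = (unsigned long)&sp; /* approximates the stack pointer */

            if (sp < current->lowest_stack &&
                sp >= (unsigned long)task_stack_page(current) + sizeof(unsigned long))
                    current->lowest_stack = sp;
    }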

Technically, this instrumentation happens in two passes: the first pass injects a call to pax_track_stack() into the function prologue of every function. The second one removes the call again for functions that fail the eligibility test, i.e. make no use of alloca() and have a small enough stack frame. The reason for doing it in two passes is that the final stack frame size is only known very late in the compilation phase. However, at that stage, no new calls to pax_track_stack() can be injected any more.
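
In plugin pseudocode, the second pass's eligibility test amounts to something like this (a sketch; cfun->calls_alloca and get_frame_size() are real gcc internals, the remaining names are illustrative):

    /* Pass 2, a late RTL pass (sketch): remove the tracking call
     * injected by pass 1 again if the function turned out to be
     * ineligible after all.
     */
    static unsigned int stackleak_cleanup_execute(void)
    {
            if (!cfun->calls_alloca && get_frame_size() < track_frame_size)
                    remove_track_stack_call(); /* drop the CALL insn from pass 1 */
            return 0;
    }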

RAP

The RAP plugin was released to the public in April 2016. It provides a security property commonly known as Control Flow Integrity (CFI), preventing exploit techniques that try to execute code out of intended order or even try to execute unintended instructions (return-to-libc, borrowed code chunks, ROP, etc.). To achieve this goal, it weaves identifiers into the code stream that mark legitimate call sites and return locations, which we call RAP hashes. These identifiers are 64-bit values that are generated by hashing the type information of the called function.

As RAP hash values rarely represent valid instructions, let alone instructions without side effects, these identifiers need to be skipped during execution of the code. For call sites, this is as easy as placing the RAP hash directly in front of the function. The function symbol still refers to the first instruction, so the RAP hash check in the caller just needs to look 8 bytes in front of it to find the hash.
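
In C-like pseudocode, the forward-edge check at an indirect call site conceptually looks as follows (illustrative only: RAP emits this inline as machine code, and rap_violation() merely stands in for the trapping instruction):

    #include <stdint.h>

    extern void rap_violation(void); /* placeholder for the trapping instruction */

    /* Conceptual forward-edge check: the hash matching the function
     * pointer's type must be found 8 bytes in front of the target's
     * first instruction.
     */
    static void rap_checked_call(void (*fn)(void), uint64_t expected_hash)
    {
            if (*(const uint64_t *)((uintptr_t)fn - 8) != expected_hash)
                    rap_violation();
            fn();
    }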

The return hash, on the other hand, needs to be located near the call instruction, as it's the return address that needs to be verified. Therefore the RAP hash needs to be part of the actual instruction stream that gets executed, without actually getting executed itself. Keeping the RAP return hash from being executed is as simple as jumping over it. Here's an example:

ffffffff81000000 <_stext>:
[...]
ffffffff81000014:       jmp    ffffffff81000023 <_stext+0x23>
ffffffff81000016:       movabs $0xffffffffba01b6dd,%rax
ffffffff81000020:       int3
ffffffff81000021:       int3
ffffffff81000022:       int3
ffffffff81000023:       callq  ffffffff81000220 <__startup_64>
ffffffff81000028: [...]

The JMP instruction skips the embedded RAP return hash (0xffffffffba01b6dd) and continues execution at the CALL instruction. The MOVABS following the JMP will never get executed; it's just there to please disassemblers attempting to decode the RAP hash as instructions. The INT3 instructions are just padding to ensure the RAP hash will always be at a fixed offset from the return address (-16 bytes in this case).
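
The corresponding backward-edge check happens in the callee's epilogue, right before returning. Conceptually (again illustrative, with rap_violation() standing in for the trapping instruction):

    #include <stdint.h>

    extern void rap_violation(void); /* placeholder for the trapping instruction */

    /* Conceptual backward-edge check, performed right before the
     * return: the expected return hash must sit 16 bytes in front of
     * the return address (the -0x10(%rdx) offset visible in the
     * disassembly further below).
     */
    static void rap_check_return(uintptr_t retaddr, uint64_t expected_hash)
    {
            if (*(const uint64_t *)(retaddr - 16) != expected_hash)
                    rap_violation();
    }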

Control Flow Integrity is achieved by instrumenting all calls, indirect jumps and returns so they provide, and check, these hash values. Calls to pax_track_stack(), generated by the STACKLEAK plugin, will be instrumented as well. This is where it starts to get interesting.

The Issue

The STACKLEAK plugin adds calls to pax_track_stack() that the RAP plugin later instruments by embedding return hashes. However, the STACKLEAK plugin might remove the call in a later pass if the function is deemed to not make use of alloca() and has a stack frame size smaller than the threshold. In this case, however, the RAP return hash remains. This leads to code like the following:

ffffffff81215e10 <rap_sys_mmap_pgoff>:
ffffffff81215e10:       push   %r15
ffffffff81215e12:       mov    %r9,%r15
ffffffff81215e15:       push   %r14
ffffffff81215e17:       mov    %r8,%r14
ffffffff81215e1a:       push   %r13
ffffffff81215e1c:       mov    %rcx,%r13
ffffffff81215e1f:       push   %r12
ffffffff81215e21:       mov    %rdx,%r12
ffffffff81215e24:       push   %rbp
ffffffff81215e25:       mov    %rsi,%rbp
ffffffff81215e28:       push   %rbx
ffffffff81215e29:       mov    %rdi,%rbx
ffffffff81215e2c: ==>   jmp    ffffffff81215e40 <rap_sys_mmap_pgoff+0x30>
ffffffff81215e2e: ==>   movabs $0xffffffffdb9d6e07,%rax
ffffffff81215e38: ==>   int3
ffffffff81215e39: ==>   int3
ffffffff81215e3a: ==>   int3
ffffffff81215e3b: ==>   int3
ffffffff81215e3c: ==>   int3
ffffffff81215e3d: ==>   int3
ffffffff81215e3e: ==>   int3
ffffffff81215e3f: ==>   int3
ffffffff81215e40:       mov    %r15,%r9
ffffffff81215e43:       mov    %r14,%r8
ffffffff81215e46:       mov    %r13,%rcx
ffffffff81215e49:       mov    %r12,%rdx
ffffffff81215e4c:       mov    %rbp,%rsi
ffffffff81215e4f:       mov    %rbx,%rdi
ffffffff81215e52:       jmp    ffffffff81215e61 <rap_sys_mmap_pgoff+0x51>
ffffffff81215e54:       movabs $0xffffffffd6c086f5,%rax
ffffffff81215e5e:       int3
ffffffff81215e5f:       int3
ffffffff81215e60:       int3
ffffffff81215e61:       callq  ffffffff81215af0 <sys_mmap_pgoff>
ffffffff81215e66:       mov    0x30(%rsp),%rdx
ffffffff81215e6b:       cmpq   $0xffffffffd6c086f5,-0x10(%rdx)
ffffffff81215e73:       jne    ffffffff81215e88 <rap_sys_mmap_pgoff+0x78>
ffffffff81215e75:       pop    %rbx
ffffffff81215e76:       pop    %rbp
ffffffff81215e77:       pop    %r12
ffffffff81215e79:       pop    %r13
ffffffff81215e7b:       pop    %r14
ffffffff81215e7d:       pop    %r15
ffffffff81215e7f:       btsq   $0x3f,(%rsp)
ffffffff81215e85:       retq
ffffffff81215e86:       ud2
ffffffff81215e88:       ud1    (%rax),%edx
ffffffff81215e8b:       nopl   0x0(%rax,%rax,1)

Two things stand out:

  1. The lines with arrows next to them above contain an "empty" RAP return hash code fragment. Empty, because there's no subsequent call that would check the RAP hash woven into the code.

  2. The function prologue moves all function arguments (per System V AMD64 ABI passed in registers RDI, RSI, RDX, RCX, R8 and R9) to a new set of registers for no obvious reason — they get moved back immediately after the empty RAP return hash sequence.

Problem 1 arises because the STACKLEAK plugin removed its call to pax_track_stack() but not the RAP return hash, since it doesn't know it's there.

Problem 2 is, again, a remnant of the removed function call. The aforementioned registers are so-called caller-saved registers, meaning a called function is free to clobber them; a caller that still needs their values afterwards has to preserve them itself. (On x86-64, per the System V ABI, RAX, RCX, RDX, RSI, RDI and R8-R11 are caller-saved, while RBX, RBP and R12-R15 are callee-saved.) Thus the compiler had to move the arguments somewhere safe across the call to pax_track_stack(). In this case it did so by storing them into callee-saved registers, which are guaranteed to be preserved across function calls by the ABI. However, as the subsequent call to sys_mmap_pgoff() expects the very same arguments, it had to move them all back to the original registers.

While Problem 1 is specific to the interaction of STACKLEAK and RAP, Problem 2 is inherent to STACKLEAK and can therefore even be seen in the upstream version. Here's a quick additional example of a function built without upstream STACKLEAK:

0000000000000900 <wake_page_function>:
     900:       callq  905 <wake_page_function+0x5>     901: R_X86_64_PLT32     __fentry__-0x4
     905:       mov    (%rcx),%rax
     908:       cmp    %rax,-0x10(%rdi)
     90c:       je     911 <wake_page_function+0x11>
     90e:       xor    %eax,%eax
     910:       retq   
     911:       movl   $0x1,0xc(%rcx)
     918:       movslq 0x8(%rcx),%r8
     91c:       cmp    %r8d,-0x8(%rdi)
     920:       jne    90e <wake_page_function+0xe>
     922:       bt     %r8,(%rax)
     926:       jb     92d <wake_page_function+0x2d>
     928:       jmpq   92d <wake_page_function+0x2d>    929: R_X86_64_PLT32     autoremove_wake_function-0x4
     92d:       mov    $0xffffffff,%eax
     932:       retq   

And the same function with upstream STACKLEAK (again, with the tracking call eliminated), which increases the function's size by 45% and its number of instructions by 56%:

0000000000000a50 <wake_page_function>:
     a50:       callq  a55 <wake_page_function+0x5>     a51: R_X86_64_PLT32     __fentry__-0x4
     a55:       push   %r13
     a57:       mov    %edx,%r13d
     a5a:       push   %r12
     a5c:       mov    %esi,%r12d
     a5f:       push   %rbp
     a60:       mov    %rdi,%rbp
     a63:       push   %rbx
     a64:       mov    %rcx,%rbx
     a67:       mov    (%rbx),%rax
     a6a:       cmp    %rax,-0x10(%rbp)
     a6e:       je     a79 <wake_page_function+0x29>
     a70:       xor    %eax,%eax
     a72:       pop    %rbx
     a73:       pop    %rbp
     a74:       pop    %r12
     a76:       pop    %r13
     a78:       retq   
     a79:       movl   $0x1,0xc(%rbx)
     a80:       movslq 0x8(%rbx),%rdx
     a84:       cmp    %edx,-0x8(%rbp)
     a87:       jne    a70 <wake_page_function+0x20>
     a89:       bt     %rdx,(%rax)
     a8d:       jb     aa6 <wake_page_function+0x56>
     a8f:       mov    %rbx,%rcx
     a92:       mov    %r13d,%edx
     a95:       pop    %rbx
     a96:       mov    %r12d,%esi
     a99:       mov    %rbp,%rdi
     a9c:       pop    %rbp
     a9d:       pop    %r12
     a9f:       pop    %r13
     aa1:       jmpq   aa6 <wake_page_function+0x56>    aa2: R_X86_64_PLT32     autoremove_wake_function-0x4
     aa6:       mov    $0xffffffff,%eax
     aab:       jmp    a72 <wake_page_function+0x22>
     aad:       nopl   (%rax)

All of this leads to unfortunate, unnecessary code bloat.

The Fix

Fixing the spurious register spilling is rather easy. We just need to tell the compiler that pax_track_stack() follows a special calling convention that preserves all register values, so the caller doesn't need to preserve any caller-saved registers itself. Unfortunately there's no gcc function attribute to do that on a per-function level. But we can change this on a per-compilation-unit level with the help of the -fcall-saved-* family of compiler switches, as sketched below.
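
For illustration, the compilation unit containing pax_track_stack() can be built like this (a sketch; the exact file name and flag set used in grsecurity may differ):

$ gcc -fcall-saved-rax -fcall-saved-rcx -fcall-saved-rdx \
      -fcall-saved-rsi -fcall-saved-rdi -fcall-saved-r8  \
      -fcall-saved-r9  -fcall-saved-r10 -fcall-saved-r11 \
      [...] -c stackleak_track.c

With these switches, gcc generates code for pax_track_stack() that saves and restores every register it touches, effectively turning all normally caller-saved registers into callee-saved ones for this one compilation unit.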

Now that pax_track_stack() itself will take care of preserving all register values, we need to fix up the callers, which don't know about the special calling convention. They would still preserve and restore caller-saved registers, which would make the code bloat problem even worse, as now two entities would be preserving the register values instead of just one.

Luckily, there are no direct callers of pax_track_stack(). The only callers are the ones generated by the STACKLEAK plugin itself. So the lack of a gcc function attribute is no real issue, as we can control the call sites by modifying the plugin.

To achieve this goal (and solve the second problem at the same time) we need to hide the call from the compiler. If it's unaware that we're calling a function, it won't try to emit code that would preserve caller-saved registers. Basically, what we want to do is change the GIMPLE call to the equivalent inline assembly construct "asm volatile ("call pax_track_stack")". However, we can do better by using pax_direct_call, a macro provided by the RAP plugin that will take care of embedding the RAP return hash for backward edge checks. Moreover, this ASM statement can now be completely removed in a later pass in case the STACKLEAK plugin decided no call to pax_track_stack() is needed — removing the embedded RAP return hash as well.
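
As a sketch, the transformation amounts to the following (simplified: the real code uses the pax_direct_call macro so the RAP return hash gets embedded as well, which this form omits):

    /* Before: a regular call the compiler knows about. It assumes all
     * caller-saved registers get clobbered and spills any live values
     * around the call.
     */
    pax_track_stack();

    /* After (conceptually): the call hidden inside an asm statement.
     * The compiler no longer assumes any register gets clobbered.
     * This is only safe because pax_track_stack() now preserves all
     * registers thanks to the -fcall-saved-* switches above.
     */
    asm volatile ("call pax_track_stack" : : : "memory");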

Going back to the initial example with RAP enabled, this is what gets generated with these fixes in place:

ffffffff811ecfe0 <rap_sys_mmap_pgoff>:
ffffffff811ecfe0:       jmp    ffffffff811ecfef <rap_sys_mmap_pgoff+0xf>
ffffffff811ecfe2:       movabs $0xffffffffd6c086f5,%rax
ffffffff811ecfec:       int3
ffffffff811ecfed:       int3
ffffffff811ecfee:       int3
ffffffff811ecfef:       callq  ffffffff811eccc0 <sys_mmap_pgoff>
ffffffff811ecff4:       mov    (%rsp),%rdx
ffffffff811ecff8:       cmpq   $0xffffffffd6c086f5,-0x10(%rdx)
ffffffff811ed000:       jne    ffffffff811ed00b <rap_sys_mmap_pgoff+0x2b>
ffffffff811ed002:       btsq   $0x3f,(%rsp)
ffffffff811ed008:       retq
ffffffff811ed009:       ud2
ffffffff811ed00b:       ud1    (%rax),%edx
ffffffff811ed00e:       xchg   %ax,%ax

The above is much shorter — no unneeded RAP return hash anymore and no unneeded register shuffling.

Real Case Impact

These optimizations are an improvement, but do they bring any tangible benefit when calls to pax_track_stack() are actually needed? Let's take a look at a function with an actual call to pax_track_stack().

With the above modifications, the code looks like this:

ffffffff81180120 <__bpf_prog_run384>:
ffffffff81180120:       sub    $0x1e0,%rsp
ffffffff81180127:       jmp    ffffffff81180136 <__bpf_prog_run384+0x16>
ffffffff81180129:       movabs $0xffffffffdb9d6e07,%rax
ffffffff81180133:       int3
ffffffff81180134:       int3
ffffffff81180135:       int3
ffffffff81180136:       callq  ffffffff810729a0 <pax_track_stack>
ffffffff8118013b:       lea    0x1e0(%rsp),%rax
ffffffff81180143:       mov    %rdi,0x8(%rsp)
ffffffff81180148:       lea    0x60(%rsp),%rdx
ffffffff8118014d:       mov    %rsp,%rdi
ffffffff81180150:       mov    %rax,0x50(%rsp)
ffffffff81180155:       jmp    ffffffff81180164 <__bpf_prog_run384+0x44>
ffffffff81180157:       movabs $0xffffffffc45cf82b,%rax
ffffffff81180161:       int3
ffffffff81180162:       int3
ffffffff81180163:       int3
ffffffff81180164:       callq  ffffffff8117e850 <___bpf_prog_run>
ffffffff81180169:       mov    0x1e0(%rsp),%rdx
ffffffff81180171:       cmpq   $0xffffffffba9431ed,-0x10(%rdx)
ffffffff81180179:       jne    ffffffff8118018b <__bpf_prog_run384+0x6b>
ffffffff8118017b:       add    $0x1e0,%rsp
ffffffff81180182:       btsq   $0x3f,(%rsp)
ffffffff81180188:       retq
ffffffff81180189:       ud2
ffffffff8118018b:       ud1    (%rax),%edx
ffffffff8118018e:       xchg   %ax,%ax

The old code looks as follows (additional instructions marked with arrows):

ffffffff811a1eb0 <__bpf_prog_run384>:
ffffffff811a1eb0:  ==>  push   %rbp
ffffffff811a1eb1:  ==>  mov    %rdi,%rbp
ffffffff811a1eb4:  ==>  push   %rbx
ffffffff811a1eb5:  ==>  mov    %rsi,%rbx
ffffffff811a1eb8:       sub    $0x1e0,%rsp
ffffffff811a1ebf:       jmp    ffffffff811a1ece <__bpf_prog_run384+0x1e>
ffffffff811a1ec1:       movabs $0xffffffffdb9d6e07,%rax
ffffffff811a1ecb:       int3
ffffffff811a1ecc:       int3
ffffffff811a1ecd:       int3
ffffffff811a1ece:       callq  ffffffff81269c40 <pax_track_stack>
ffffffff811a1ed3:  ==>  mov    %rbx,%rsi
ffffffff811a1ed6:       lea    0x1e0(%rsp),%rax
ffffffff811a1ede:       lea    0x60(%rsp),%rdx
ffffffff811a1ee3:       mov    %rsp,%rdi
ffffffff811a1ee6:       mov    %rbp,0x8(%rsp)
ffffffff811a1eeb:       mov    %rax,0x50(%rsp)
ffffffff811a1ef0:       jmp    ffffffff811a1eff <__bpf_prog_run384+0x4f>
ffffffff811a1ef2:       movabs $0xffffffffc45cf82b,%rax
ffffffff811a1efc:       int3
ffffffff811a1efd:       int3
ffffffff811a1efe:       int3
ffffffff811a1eff:       callq  ffffffff811a0590 <___bpf_prog_run>
ffffffff811a1f04:       mov    0x1f0(%rsp),%rdx
ffffffff811a1f0c:       cmpq   $0xffffffffba9431ed,-0x10(%rdx)
ffffffff811a1f14:       jne    ffffffff811a1f28 <__bpf_prog_run384+0x78>
ffffffff811a1f16:       add    $0x1e0,%rsp
ffffffff811a1f1d:  ==>  pop    %rbx
ffffffff811a1f1e:  ==>  pop    %rbp
ffffffff811a1f1f:       btsq   $0x3f,(%rsp)
ffffffff811a1f25:       retq
ffffffff811a1f26:       ud2
ffffffff811a1f28:       ud1    (%rax),%edx
ffffffff811a1f2b:       nopl   0x0(%rax,%rax,1)

Not as dramatic a change as for rap_sys_mmap_pgoff(), but still less code.

The new version of __bpf_prog_run384() skips preserving RDI and RSI for the call to pax_track_stack() as those are no longer clobbered by that function. This saves us 7 instructions.

vmlinux size

The overall impact can be seen below by comparing the sizes of a defconfig kernel build:

$ size vmlinux-*
   text    data     bss     dec     hex filename
28418610    9145961 2775180 40339751    2678927 vmlinux-4.14-grsec
26120133    8367721 2759628 37247482    23859fa vmlinux-4.14-grsec+patch

2MB less kernel code (-8%) and ~760KB less data (-8.5%). Great! But wait, less data? This was all about reducing code size, right? Well, let's take a deeper look:

$ size -A vmlinux-*
vmlinux-4.14-grsec  :
section                     size                   addr
.text                   16130514   18446744071578845184
[...]
.orc_unwind_ip           3555016   18446744071603335328
.orc_unwind              5332524   18446744071606890344
.orc_lookup               252044   18446744071612222868
[...]
.init.begin              2015232   18446744071612481536
[...]
Total                   40340163

vmlinux-4.14-grsec+patch  :
section                     size                   addr
.text                   15135186   18446744071578845184
[...]
.orc_unwind_ip           3033716   18446744071603335328
.orc_unwind              4550574   18446744071606369044
.orc_lookup               236492   18446744071610919620
[...]
.init.begin              1236992   18446744071611162624
[...]
Total                   37247894

In fact, the actual text size reduction is only ~972KB. The remainder comes from the ORC unwinder tables, which shrank by ~1.2MB (-14.4%) as there are fewer instructions to take care of (the JMP, MOVABS and INT3 instructions accompanying each RAP return hash).

The data size reduction, however, doesn't stem from any real data being dropped. Even though the .init.begin section is smaller, it's merely the enforced alignment in arch/x86/kernel/vmlinux.lds.S that causes the difference:

    .init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
        BYTE(0)

#ifdef CONFIG_PAX_KERNEXEC
        . = ALIGN(HPAGE_SIZE);
#else
        . = ALIGN(PAGE_SIZE);
#endif

        __init_begin = .; /* paired with __init_end */
    } :init.begin

$ nm -n vmlinux-4.14-grsec | grep -v __rap_hash_ | grep -C1 __init_begin
ffffffff83013080 D vsyscall_gtod_data
ffffffff83200000 T __init_begin
ffffffff83200000 A init_per_cpu__irq_stack_union

$ nm -n vmlinux-4.14-grsec+patch | grep -v __rap_hash_ | grep -C1 __init_begin
ffffffff82ed1080 D vsyscall_gtod_data
ffffffff83000000 T __init_begin
ffffffff83000000 A init_per_cpu__irq_stack_union

The vmlinux-4.14-grsec kernel was just "unlucky" to require lots of padding to reach the HPAGE_SIZE alignment (2MB on x86-64), while the vmlinux-4.14-grsec+patch one, with its smaller text and data, has __init_begin land one huge page earlier. The 778240 byte difference in .init.begin size accounts exactly for the ~760KB data delta seen above.

Availability

These enhancements are available in all grsecurity stable patches as of 02/18/2020.