It's well known that syscalls are expensive, and that software mitigations against CPU bugs (such as Meltdown) have made them even more expensive. But how expensive are they really? To begin to answer this question I wrote a small micro-benchmark that measures the minimal cost of a syscall: the cost one always has to pay, whether or not a context switch happens and even when the work done in the kernel is minuscule, i.e. the cost of switching from user mode to kernel mode and back.
Methods
The user-kernel mode-switch micro-benchmark uses Google's benchmark library for the measurements and is available in a git repository. The repository also contains some helper scripts, e.g. a playbook for distributing it to and executing it on a bunch of hosts. The benchmark library repeats each case until the result is considered stable and the playbook allows for repeated executions of the test cases. In the following sections the median value of 100 repetitions is reported (real time in nanoseconds).
For the benchmark, a bunch of syscalls that are expected to be very cheap is called, such as getting the user ID (UID) or the process ID (PID), closing an invalid file descriptor, invoking a non-existent syscall, etc. Thus, a measurement should really just comprise the two mode switches. As controls, a few cases don't call a syscall at all but do other cheap work.
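A benchmark case then boils down to something like the following sketch (illustrative only; the names and structure are assumptions, not the repository's actual code):

```cpp
#include <benchmark/benchmark.h>
#include <sys/syscall.h>
#include <unistd.h>

// Measures the round-trip cost of a trivial syscall: two mode
// switches plus a minuscule amount of work in the kernel.
static void BM_getuid(benchmark::State& state) {
    for (auto _ : state) {
        // Go through syscall(2) to bypass any glibc-level caching.
        benchmark::DoNotOptimize(syscall(SYS_getuid));
    }
}
BENCHMARK(BM_getuid);
BENCHMARK_MAIN();
```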
I ran the benchmark on a heterogeneous set of hosts, i.e. hosts with different kernels, operating systems and configurations. For more details see the Hosts section.
Results and Discussion
The following table shows the real time (ns) for the different cases:
name | 5i4250u | 7i6600u | ac3758 | x2643 | x2667h | x2667s | x2687w | x2689 | x2690 | xg6144 | xg6148 | xg6246 | xg6256 | xg6256b | xs4110 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
assign | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
clock_gettime | 23 | 21 | 31 | 19 | 16 | 16 | 17 | 14 | 21 | 14 | 24 | 14 | 13 | 13 | 24 |
clock_gettime_mono | 23 | 22 | 32 | 16 | 16 | 16 | 17 | 14 | 21 | 15 | 24 | 14 | 13 | 13 | 24 |
clock_gettime_mono_raw | 23 | 22 | 33 | 542 | 544 | 350 | 582 | 332 | 762 | 660 | 427 | 274 | 122 | 290 | 218 |
clock_gettime_tai | 23 | 21 | 32 | 542 | 546 | 352 | 587 | 333 | 762 | 660 | 427 | 274 | 122 | 292 | 218 |
close | 568 | 262 | 275 | 484 | 495 | 283 | 514 | 277 | 668 | 610 | 356 | 243 | 93 | 257 | 145 |
getpid | 558 | 257 | 255 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 2 |
getuid | 560 | 231 | 259 | 464 | 473 | 276 | 505 | 259 | 649 | 592 | 347 | 224 | 78 | 239 | 137 |
nothing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
pthread_cond_signal | 3 | 2 | 6 | 14 | 13 | 12 | 14 | 11 | 15 | 10 | 17 | 10 | 10 | 10 | 17 |
sched_yield | 706 | 374 | 430 | 569 | 560 | 414 | 634 | 346 | 773 | 694 | 454 | 280 | 126 | 300 | 232 |
sqrt | 6 | 2 | 15 | 4 | 4 | 4 | 4 | 1 | 7 | 1 | 3 | 1 | 1 | 1 | 3 |
sqrtrec | 4 | 4 | 15 | 2 | 3 | 2 | 3 | 2 | 3 | 3 | 5 | 3 | 3 | 3 | 5 |
syscall | 560 | 252 | 265 | 440 | 460 | 269 | 497 | 243 | 620 | 579 | 345 | 221 | 76 | 233 | 136 |
Controls
The cases used as controls are `nothing`, which literally does nothing, `assign`, which just assigns to a variable, `sqrt`, which computes the square root of a small constant, and `sqrtrec`, which stacks a bunch of `sqrt` calls. The results for these are plausible: doing nothing is really measured as 10**-7 ns or so, the assignment costs 0.5 ns or so, and computing the square root takes only a few ns. Perhaps the most remarkable result is that computing the square root on an Atom CPU (ac3758) is pretty constant over the two cases, whereas on the other hosts its runtime depends on its argument.
Clock Gettime
Looking at the syscalls, one relation that holds on all hosts is that the `clock_gettime(CLOCK_REALTIME)` syscall is much faster than `getuid()` or `close()`. This can be explained by the fact that on Linux, `clock_gettime(CLOCK_REALTIME)` and a few other syscalls are implemented via the efficient vDSO mechanism, meaning that no mode switch happens when they are called!
`clock_gettime()` supports different clocks and not all of them are vDSO-optimized on all kernels. The table shows that on RHEL 7 querying `CLOCK_MONOTONIC_RAW` and `CLOCK_TAI` invokes a real syscall, while on Fedora 33 kernels (5.12/5.13) these clock readings are also implemented via the vDSO.
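One way to observe this (a sketch, not part of the benchmark) is to read each clock in a tiny program and run it under strace: the vDSO-backed readings don't show up in the trace, while the fallback ones appear as real `clock_gettime` syscalls.

```cpp
#include <time.h>
#include <cstdio>

int main() {
    // Depending on the kernel, only a subset of these clocks is
    // served by the vDSO; the rest fall back to a real syscall.
    const clockid_t clocks[] = { CLOCK_REALTIME, CLOCK_MONOTONIC,
                                 CLOCK_MONOTONIC_RAW, CLOCK_TAI };
    const char *names[]      = { "REALTIME", "MONOTONIC",
                                 "MONOTONIC_RAW", "TAI" };
    for (unsigned i = 0; i < 4; ++i) {
        struct timespec ts;
        clock_gettime(clocks[i], &ts);
        std::printf("%-14s %lld.%09ld\n", names[i],
                    (long long)ts.tv_sec, ts.tv_nsec);
    }
}
```

For example, `strace -e trace=clock_gettime ./a.out` should show syscalls only for the clocks that aren't vDSO-optimized on the kernel at hand.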
Dummy Signaling
Similarly, the dummy `pthread_cond_signal()` case, which signals without anybody listening, is much cheaper than a real syscall, since the C library doesn't have to perform a real syscall but can just execute a relatively cheap atomic operation.
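In essence, the case measures something like this minimal sketch: signaling a condition variable that never has any waiters, which glibc can handle entirely in user space.

```cpp
#include <pthread.h>

int main() {
    // Nobody ever waits on this condition variable, so
    // pthread_cond_signal() has no thread to wake: glibc detects
    // this with a cheap atomic check and returns without a syscall.
    pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    for (int i = 0; i < 1000000; ++i)
        pthread_cond_signal(&cond);
}
```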
Getpid
The `getpid()` syscall is surprisingly fast on RHEL 7. It turns out that RHEL 7 ships an older glibc version which caches the ID of a process! Which arguably is a curious optimization: what's the point? How often do you have to call `getpid()` in a program, really? At some point (around Fedora 26) this feature was removed since it apparently caused more trouble than it was worth. Perhaps unsurprisingly, that removal even broke somebody's workflow.
Real Syscalls
So looking at the real syscalls, the user-kernel mode switches cost on the order of a few hundred nanoseconds on all hosts. The higher costs on some hosts can be explained by CPU bug mitigations being enabled (they are enabled by default) and/or somewhat older or lower-end hardware. See also the Hosts section for some details.
The fastest host is `xg6256`, which manages to switch modes in less than 100 ns. It has a fast CPU with good single-core performance (Xeon Gold 6256), has frequency scaling disabled and runs at a constant 4.1 GHz, above its base frequency (i.e. a frequency between the base and turbo frequency).
Sched Yield
The `sched_yield()` syscall could be considered a minimal-work syscall, e.g. when there is nothing to yield to. Also, the benchmark process is running under the standard scheduling policy, and on Linux `sched_yield()` is described as:

> `sched_yield()` is intended for use with real-time scheduling policies (i.e., `SCHED_FIFO` or `SCHED_RR`). Use of `sched_yield()` with nondeterministic scheduling policies such as `SCHED_OTHER` is unspecified and very likely means your application design is broken.
So unspecified could mean that the syscall just bails out once the process' scheduling policy compares equal to `SCHED_OTHER`.
On most hosts `sched_yield()` is 150 ns or so more expensive than a really minimal syscall such as `getuid()`, which indicates some additional overhead, but not necessarily a context switch.
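To double-check the scheduling-policy assumption, one can query it directly; a minimal sketch:

```cpp
#include <sched.h>
#include <cstdio>

int main() {
    // 0 selects the calling process; under the default policy this
    // prints SCHED_OTHER, the case the man page calls unspecified.
    int policy = sched_getscheduler(0);
    std::printf("policy: %s\n",
                policy == SCHED_OTHER ? "SCHED_OTHER" : "other");
    // Under SCHED_OTHER this may be little more than the plain
    // syscall overhead measured above.
    sched_yield();
}
```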
Nanosleep
A bit out of the competition is the `nanosleep()` syscall:
name | 5i4250u | 7i6600u | ac3758 | x2643 | x2667h | x2667s | x2687w | x2689 | x2690 | xg6144 | xg6148 | xg6246 | xg6256 | xg6256b | xs4110 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
nanosleep0 | 52632 | 50474 | 52620 | 50588 | 50003 | 50011 | 50000 | 50014 | 50312 | 50018 | 50000 | 50014 | 50000 | 50000 | 54866 |
nanosleep0_slack1 | 4355 | 2836 | 7076 | 3247 | 2736 | 2483 | 2835 | 3908 | 3401 | 2762 | 2870 | 2446 | 1974 | 2248 | 3837 |
nanosleep1_slack1 | 4348 | 2840 | 7102 | 3252 | 2736 | 2486 | 2834 | 3908 | 3410 | 2767 | 2871 | 2446 | 1975 | 2246 | 3836 |
Calling `nanosleep()` to sleep for 0 ns or 1 ns seemingly also is a very cheap syscall or even a null operation. However, in the first case it takes 50 µs on all hosts. Incidentally, 50 µs is also the default timer slack value for a normally scheduled process on Linux. The timer slack mechanism extends timer expirations by up to the slack value in order to group multiple timers, since this reduces wake-ups and thus saves energy. And since `nanosleep()` creates a timer, it's also affected by this mechanism.
Thus, the other nanosleep cases set a minimal timer slack of 1 ns, which reduces the runtime, as expected. However, it's still much more expensive than the other syscalls. Of course, a timer expiration has limited accuracy, but with 0 ns or 1 ns no timer really has to expire. It turns out that calling `nanosleep()` unconditionally yields a (voluntary) context switch, even on isolated cores, where the scheduler happily switches to the swapper kernel thread. Thus, the last two nanosleep cases really measure the context switch costs, which are higher than the costs of a simple mode switch.
The costs of a context switch match what others have measured (modulo division by two).
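The slack-1 cases presumably do something along these lines (a sketch; the repository's actual code may differ): lower the timer slack via prctl() before sleeping.

```cpp
#include <sys/prctl.h>
#include <time.h>

int main() {
    // Reduce the calling thread's timer slack from the 50 us default
    // to 1 ns, so the tiny sleep isn't stretched up to the slack value.
    prctl(PR_SET_TIMERSLACK, 1UL, 0, 0, 0);
    struct timespec ts = { 0, 1 };  // sleep for 1 ns
    // Still incurs a voluntary context switch, even with minimal slack.
    nanosleep(&ts, nullptr);
}
```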
Hosts
The following table shows the hosts under benchmark:
host | CPU | mitigations | poll | os | kernel |
---|---|---|---|---|---|
5i4250u | Core i5-4250U | yes | yes | Fedora 33 | 5.12 |
7i6600u | Core i7-6600U | yes | yes | Fedora 33 | 5.13 |
ac3758 | Atom C3758 | no | yes | Fedora 33 | 5.13 |
x2643 | Xeon E5-2643 v2 | yes | no | RHEL 7 | 3.10 |
x2667h | Xeon E5-2667 v3 | yes | yes | RHEL 7 | 3.10 |
x2667s | Xeon E5-2667 v3 | yes | yes | RHEL 7 | 3.10 |
x2687w | Xeon E5-2687W v3 | yes | yes | RHEL 7 | 3.10 |
x2689 | Xeon E5-2689 v4 | yes | yes | RHEL 7 | 3.10 |
x2690 | Xeon E5-2690 0 | yes | yes | RHEL 7 | 3.10 |
xg6144 | Xeon Gold 6144 | yes | yes | RHEL 7 | 3.10 |
xg6148 | Xeon Gold 6148 | no | yes | RHEL 7 | 3.10 |
xg6246 | Xeon Gold 6246 | yes | yes | RHEL 7 | 3.10 |
xg6256 | Xeon Gold 6256 | no | yes | RHEL 7 | 3.10 |
xg6256b | Xeon Gold 6256 | yes | yes | RHEL 7 | 3.10 |
xs4110 | Xeon Silver 4110 | no | yes | RHEL 7 | 3.10 |
Notes:
- the kernels are the ones packaged by the distributions
- most RHEL hosts are on RHEL 7.9
- CPU mitigations are disabled via the `mitigations=off` kernel parameter or similar parameters
- polling means that CPU frequency scaling and power saving are disabled via kernel parameters and tuned PM QoS settings, so the host's CPU runs at a fixed frequency; where possible this frequency is set slightly above the base frequency, e.g. on the Xeon Gold 6256 CPU it's set to 4.1 GHz
- the Atom CPU doesn't support Hyper-Threading and Hyper-Threading is disabled on all Xeon hosts
- all hosts have SELinux and/or auditing enabled (on Fedora/RHEL these features are enabled by default), which adds some syscall overhead
Terminology
There are basically two important terms to distinguish in the above discussion:
- Mode Switch (or Mode Transition)
- Context Switch
The definitions of these terms may vary across the literature and between operating systems. Also, in other contexts (no pun intended!) one might describe different modes as different contexts. However, the definitions given in the linked Wikipedia articles are widely used and apply to Linux.
Basically, a mode transition denotes the switch between user mode and kernel mode (or between user space and kernel space), whereas a context switch denotes a switch between different tasks, which is facilitated by the kernel. A context switch requires more work than a mode switch and is thus more expensive.