It's well known that syscalls are expensive, and that software mitigations against CPU bugs (such as Meltdown) have made them even more expensive. But how expensive are they really? To begin to answer this question I wrote a small micro-benchmark that measures the minimal cost of a syscall: the cost one always has to pay, whether or not a context switch happens and even when the work done in the kernel is minuscule, i.e. the cost of switching from user mode to kernel mode and back.
Methods
The user-kernel mode-switch micro-benchmark uses Google's benchmark library for the measurements and is available in a git repository. The repository also contains some helper scripts, e.g. a playbook for distributing it to and executing it on a bunch of hosts. The benchmark library repeats each case until the result is considered stable and the playbook allows for repeated executions of the test cases. In the following sections the median value of 100 repetitions is reported (real time in nanoseconds).
For the benchmark, a bunch of syscalls that are expected to be very cheap is called, such as getting the user ID (UID) or the process ID (PID), closing an invalid file descriptor, invoking a non-existent syscall, etc. Thus, a measurement should really just comprise the two mode switches. As controls, a few cases don't call a syscall at all but do other cheap work.
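A benchmark case then boils down to something like the following sketch (illustrative only; the names and structure are assumptions, not the repository's actual code):

```cpp
#include <benchmark/benchmark.h>
#include <sys/syscall.h>
#include <unistd.h>

// Measures the round-trip cost of a trivial syscall: two mode
// switches plus a minuscule amount of work in the kernel.
static void BM_getuid(benchmark::State& state) {
    for (auto _ : state) {
        // Go through syscall(2) to bypass any glibc-level caching.
        benchmark::DoNotOptimize(syscall(SYS_getuid));
    }
}
BENCHMARK(BM_getuid);
BENCHMARK_MAIN();
```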
I ran the benchmark on a heterogeneous set of hosts, i.e. hosts with different kernels, operating systems and configurations. For more details see the Hosts section.
Results and Discussion
The following table shows the real time (ns) for the different cases:
name | 5i4250u | 7i6600u | ac3758 | x2643 | x2667h | x2667s | x2687w | x2689 | x2690 | xg6144 | xg6148 | xg6246 | xg6256 | xg6256b | xs4110 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
assign | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
clock_gettime | 23 | 21 | 31 | 19 | 16 | 16 | 17 | 14 | 21 | 14 | 24 | 14 | 13 | 13 | 24 |
clock_gettime_mono | 23 | 22 | 32 | 16 | 16 | 16 | 17 | 14 | 21 | 15 | 24 | 14 | 13 | 13 | 24 |
clock_gettime_mono_raw | 23 | 22 | 33 | 542 | 544 | 350 | 582 | 332 | 762 | 660 | 427 | 274 | 122 | 290 | 218 |
clock_gettime_tai | 23 | 21 | 32 | 542 | 546 | 352 | 587 | 333 | 762 | 660 | 427 | 274 | 122 | 292 | 218 |
close | 568 | 262 | 275 | 484 | 495 | 283 | 514 | 277 | 668 | 610 | 356 | 243 | 93 | 257 | 145 |
getpid | 558 | 257 | 255 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 2 |
getuid | 560 | 231 | 259 | 464 | 473 | 276 | 505 | 259 | 649 | 592 | 347 | 224 | 78 | 239 | 137 |
nothing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
pthread_cond_signal | 3 | 2 | 6 | 14 | 13 | 12 | 14 | 11 | 15 | 10 | 17 | 10 | 10 | 10 | 17 |
sched_yield | 706 | 374 | 430 | 569 | 560 | 414 | 634 | 346 | 773 | 694 | 454 | 280 | 126 | 300 | 232 |
sqrt | 6 | 2 | 15 | 4 | 4 | 4 | 4 | 1 | 7 | 1 | 3 | 1 | 1 | 1 | 3 |
sqrtrec | 4 | 4 | 15 | 2 | 3 | 2 | 3 | 2 | 3 | 3 | 5 | 3 | 3 | 3 | 5 |
syscall | 560 | 252 | 265 | 440 | 460 | 269 | 497 | 243 | 620 | 579 | 345 | 221 | 76 | 233 | 136 |
Controls
The cases used as controls are `nothing`, which literally does nothing, `assign`, which just assigns to a variable, `sqrt`, which computes the square root of a small constant, and `sqrtrec`, which stacks a bunch of `sqrt` calls. The results for these are plausible: doing nothing is really measured as 10**-7 ns or so, the assignment costs 0.5 ns or so, and computing the square root takes only a few ns. Perhaps the most remarkable result is that computing the square root on an Atom CPU (ac3758) is pretty constant over the two cases, whereas on the other hosts its runtime depends on its argument.
Clock Gettime
Looking at the syscalls, one relation that holds on all hosts is that the `clock_gettime(CLOCK_REALTIME)` syscall is much faster than `getuid()` or `close()`. This can be explained by the fact that on Linux, `clock_gettime(CLOCK_REALTIME)` and a few other syscalls are implemented via the efficient vDSO mechanism, meaning that no mode switch happens when they are called!
`clock_gettime()` supports different clocks and not all of them are vDSO-optimized on all kernels. The table shows that on RHEL 7 querying `CLOCK_MONOTONIC_RAW` and `CLOCK_TAI` invokes a real syscall, while on Fedora 33 kernels (5.12/5.13) these clock readings are also implemented via the vDSO.
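One way to observe this (a sketch, not part of the benchmark) is to read each clock in a tiny program and run it under strace: the vDSO-backed readings don't show up in the trace, while the fallback ones appear as real `clock_gettime` syscalls.

```cpp
#include <time.h>
#include <cstdio>

int main() {
    // Depending on the kernel, only a subset of these clocks is
    // served by the vDSO; the rest fall back to a real syscall.
    const clockid_t clocks[] = { CLOCK_REALTIME, CLOCK_MONOTONIC,
                                 CLOCK_MONOTONIC_RAW, CLOCK_TAI };
    const char *names[]      = { "REALTIME", "MONOTONIC",
                                 "MONOTONIC_RAW", "TAI" };
    for (unsigned i = 0; i < 4; ++i) {
        struct timespec ts;
        clock_gettime(clocks[i], &ts);
        std::printf("%-14s %lld.%09ld\n", names[i],
                    (long long)ts.tv_sec, ts.tv_nsec);
    }
}
```

For example, `strace -e trace=clock_gettime ./a.out` should show syscalls only for the clocks that aren't vDSO-optimized on the kernel at hand.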
Dummy Signaling
Similarly, the dummy `pthread_cond_signal()` case, which signals without anybody listening, is much cheaper than a real syscall, since the C library doesn't have to perform a real syscall but can just execute a relatively cheap atomic operation.
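In essence, the case measures something like this minimal sketch: signaling a condition variable that never has any waiters, which glibc can handle entirely in user space.

```cpp
#include <pthread.h>

int main() {
    // Nobody ever waits on this condition variable, so
    // pthread_cond_signal() has no thread to wake: glibc detects
    // this with a cheap atomic check and returns without a syscall.
    pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    for (int i = 0; i < 1000000; ++i)
        pthread_cond_signal(&cond);
}
```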
Getpid
The `getpid()` syscall is surprisingly fast on RHEL 7. It turns out that RHEL 7 ships an older glibc version which caches the ID of a process! Which arguably is a curious optimization: what's the point? How often do you have to call `getpid()` in a program, really? At some point (around Fedora 26) this feature was removed since it apparently caused more trouble than it was worth. Perhaps unsurprisingly, that removal even broke somebody's workflow.
Real Syscalls
So looking at the real syscalls, the user-kernel mode switches cost on the order of a few hundred nanoseconds on all hosts. The higher costs on some hosts can be explained by CPU bug mitigations being enabled (they are enabled by default) and/or somewhat older or lower-end hardware. See also the Hosts section for some details.
The fastest host is `xg6256`, which manages to switch modes in less than 100 ns. It has a fast CPU with good single-core performance (Xeon Gold 6256), has frequency scaling disabled and runs at a constant 4.1 GHz, above its base frequency (i.e. a frequency between the base and turbo frequency).
Sched Yield
The `sched_yield()` syscall could be considered a minimal-work syscall, e.g. when there is nothing to yield to. Also, the benchmark process is running under the standard scheduling policy, and on Linux `sched_yield()` is described as:

> `sched_yield()` is intended for use with real-time scheduling policies (i.e., `SCHED_FIFO` or `SCHED_RR`). Use of `sched_yield()` with nondeterministic scheduling policies such as `SCHED_OTHER` is unspecified and very likely means your application design is broken.
So unspecified could mean that the syscall just bails out once the process' scheduling policy compares equal to `SCHED_OTHER`.
On most hosts `sched_yield()` is 150 ns or so more expensive than a really minimal syscall such as `getuid()`, which indicates some additional overhead, but not necessarily a context switch.
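To double-check the scheduling-policy assumption, one can query it directly; a minimal sketch:

```cpp
#include <sched.h>
#include <cstdio>

int main() {
    // 0 selects the calling process; under the default policy this
    // prints SCHED_OTHER, the case the man page calls unspecified.
    int policy = sched_getscheduler(0);
    std::printf("policy: %s\n",
                policy == SCHED_OTHER ? "SCHED_OTHER" : "other");
    // Under SCHED_OTHER this may be little more than the plain
    // syscall overhead measured above.
    sched_yield();
}
```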
Nanosleep
A bit out of the competition is the `nanosleep()` syscall:
name | 5i4250u | 7i6600u | ac3758 | x2643 | x2667h | x2667s | x2687w | x2689 | x2690 | xg6144 | xg6148 | xg6246 | xg6256 | xg6256b | xs4110 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
nanosleep0 | 52632 | 50474 | 52620 | 50588 | 50003 | 50011 | 50000 | 50014 | 50312 | 50018 | 50000 | 50014 | 50000 | 50000 | 54866 |
nanosleep0_slack1 | 4355 | 2836 | 7076 | 3247 | 2736 | 2483 | 2835 | 3908 | 3401 | 2762 | 2870 | 2446 | 1974 | 2248 | 3837 |
nanosleep1_slack1 | 4348 | 2840 | 7102 | 3252 | 2736 | 2486 | 2834 | 3908 | 3410 | 2767 | 2871 | 2446 | 1975 | 2246 | 3836 |
Calling `nanosleep()` to sleep for 0 ns or 1 ns seemingly also is a very cheap syscall or even a null operation. However, in the first case it takes 50 µs on all hosts. Incidentally, 50 µs is also the default timer slack value for a normally scheduled process on Linux. The timer slack mechanism extends timer expirations by up to the slack value in order to group multiple timers, since this reduces wake-ups and thus saves energy. And since `nanosleep()` creates a timer, it's also affected by this mechanism.
Thus, the other nanosleep cases set a minimal timer slack of 1 ns, which reduces the runtime, as expected. However, it's still much more expensive than the other syscalls. Of course, a timer expiration has limited accuracy, but with 0 ns or 1 ns no timer really has to expire. It turns out that calling `nanosleep()` unconditionally yields a (voluntary) context switch, even on isolated cores, where the scheduler happily switches to the swapper kernel thread. Thus, the last two nanosleep cases really measure the context switch costs, which are higher than the costs of a simple mode switch.
The costs of a context switch match what others have measured (modulo division by two).
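The slack-1 cases presumably do something along these lines (a sketch; the repository's actual code may differ): lower the timer slack via prctl() before sleeping.

```cpp
#include <sys/prctl.h>
#include <time.h>

int main() {
    // Reduce the calling thread's timer slack from the 50 us default
    // to 1 ns, so the tiny sleep isn't stretched up to the slack value.
    prctl(PR_SET_TIMERSLACK, 1UL, 0, 0, 0);
    struct timespec ts = { 0, 1 };  // sleep for 1 ns
    // Still incurs a voluntary context switch, even with minimal slack.
    nanosleep(&ts, nullptr);
}
```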
Hosts
The following table shows the hosts under benchmark:
host | CPU | mitigations | poll | os | kernel |
---|---|---|---|---|---|
5i4250u | Core i5-4250U | yes | yes | Fedora 33 | 5.12 |
7i6600u | Core i7-6600U | yes | yes | Fedora 33 | 5.13 |
ac3758 | Atom C3758 | no | yes | Fedora 33 | 5.13 |
x2643 | Xeon E5-2643 v2 | yes | no | RHEL 7 | 3.10 |
x2667h | Xeon E5-2667 v3 | yes | yes | RHEL 7 | 3.10 |
x2667s | Xeon E5-2667 v3 | yes | yes | RHEL 7 | 3.10 |
x2687w | Xeon E5-2687W v3 | yes | yes | RHEL 7 | 3.10 |
x2689 | Xeon E5-2689 v4 | yes | yes | RHEL 7 | 3.10 |
x2690 | Xeon E5-2690 0 | yes | yes | RHEL 7 | 3.10 |
xg6144 | Xeon Gold 6144 | yes | yes | RHEL 7 | 3.10 |
xg6148 | Xeon Gold 6148 | no | yes | RHEL 7 | 3.10 |
xg6246 | Xeon Gold 6246 | yes | yes | RHEL 7 | 3.10 |
xg6256 | Xeon Gold 6256 | no | yes | RHEL 7 | 3.10 |
xg6256b | Xeon Gold 6256 | yes | yes | RHEL 7 | 3.10 |
xs4110 | Xeon Silver 4110 | no | yes | RHEL 7 | 3.10 |
Notes:
- the kernels are the ones packaged by the distributions
- most RHEL hosts are on RHEL 7.9
- CPU mitigations are disabled via the `mitigations=off` kernel parameter or similar parameters
- polling means that CPU frequency scaling and power saving are disabled via kernel parameters and tuned PM QoS settings, so the host's CPU runs at a fixed frequency; where possible this frequency is set slightly above the base frequency, e.g. on the Xeon Gold 6256 CPU it's set to 4.1 GHz
- the Atom CPU doesn't support Hyper-Threading and Hyper-Threading is disabled on all Xeon hosts
- all hosts have SELinux and/or auditing enabled (on Fedora/RHEL these features are enabled by default), which adds some syscall overhead
Terminology
There are basically two important terms to distinguish in the above discussion:
- Mode Switch (or Mode Transition)
- Context Switch
The definitions of these terms may vary across the literature and between operating systems. Also, in other contexts (no pun intended!) one might describe different modes as different contexts. However, the definitions given in the linked Wikipedia articles are widely used and apply to Linux.
Basically, a mode transition denotes the switch between user mode and kernel mode (or between user space and kernel space), whereas a context switch denotes a switch between different tasks, which is facilitated by the kernel. A context switch requires more work than a mode switch and is thus more expensive.