The curious case of nanosleep vs. hr_sleep

This weekend, I finally got around to reading a bookmarked paper that proposes an improved nanosleep:

Faltelli, Marco & Belocchi, Giacomo & Quaglia, Francesco & Pontarelli, Salvatore & Bianchi, Giuseppe. (2021). Metronome: adaptive and precise intermittent packet retrieval in DPDK.

In the following I document a few notes.

The Papers

On page 4 I asked myself whether the authors ever posted their code to the LKML. They did, and the paper received devastating criticism from Thomas Gleixner, the Linux kernel maintainer of the affected subsystem and a well-known expert in realtime computing in general and high resolution timers in particular:

Thomas Gleixner. Re: [PATCH] kernel/time: Feedback reply for hr_sleep syscall, a fine-grained sleep service (2021-04-08) (mirror)

It turns out I was reading the first version of a preprint; version 3 was released after that LKML post. That version apparently corrects a few of the serious flaws pointed out by Thomas Gleixner, but:

  • the paper's authors don't reply on the LKML to the posted criticism
  • they don't acknowledge Thomas Gleixner for his review in the acknowledgements section in the new version of their paper, although they certainly profited from his explanations and independent tests
  • in the revised preprint (3rd version) the authors refer to their old, identically named conference paper for more details, although it still contains the criticized flaws that were removed from the 3rd preprint revision
  • the ACM conference paper page doesn't mention any corrections
  • the paper's accompanying github repository doesn't reference any of the criticism

Also, in the 3rd preprint revision (the latest version as of 2024-07-09), Section 3 A has some new issues:

  • 'This factor can be controlled using the prctl() system call, putting it to the minimal value of 1.'
    Thus, the reader is left wondering what unit the minimal value might have ... (yes, it's nanoseconds).
  • 'These data have been collected by running the thread issuing the sleep request as a classical SCHED_OTHER (normal) priority thread and—as hinted before—with the timer slack of nanosleep() set to 1µs.'
    Either they mixed up the units and they actually used the minimal value of 1 ns or they used an unnecessarily high timer slack value to make their proposed improvement look a bit better.
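For reference, the timer slack discussed in these quotes is a per-thread value in nanoseconds that can be set and read back with prctl(). A minimal sketch, assuming Linux and glibc; the two wrapper names are made up for illustration:

```c
#include <sys/prctl.h>

/* Hypothetical wrapper: set the calling thread's timer slack.
 * The unit is nanoseconds. Returns 0 on success, -1 on error. */
int set_timer_slack_ns(unsigned long slack_ns)
{
    return prctl(PR_SET_TIMERSLACK, slack_ns, 0, 0, 0);
}

/* Hypothetical wrapper: read back the calling thread's current
 * timer slack in nanoseconds (-1 on error). */
int get_timer_slack_ns(void)
{
    return prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
}
```

Calling set_timer_slack_ns(1) before nanosleep() requests the minimal slack of 1 ns, whereas set_timer_slack_ns(1000) would correspond to the 1 µs mentioned in the second quote. According to prctl(2), passing 0 resets the slack to the thread's default value.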

The Good

Scrutinizing the API of syscalls that are parametrized with a timespec struct is certainly a valuable contribution.

Passing a user pointer into a syscall, which then has to be dereferenced in the kernel (across the user/kernel address space boundary), has some overhead.

Also, such an indirection makes tracing and debugging these syscalls somewhat harder.

Looking at the Linux syscall calling conventions, it's clear that there are more than enough call argument registers available to pass all of struct timespec's fields directly via registers. Also, in the case of nanosleep, which optionally returns the remaining waiting time via a second argument pointer, at least some architectures such as x86-64 would allow returning these values via registers as well (on x86-64, syscall return values are put into register rax and optionally register rdx, which are both 64 bits wide).

Hence, it isn't even necessary to limit the sleep time specification to a single nanoseconds argument like hr_sleep() does. FWIW, on 64-bit architectures, the Linux kernel has no problem converting up to 2^63-1 nanoseconds into its internal ktime_t, and for all practical purposes a maximum sleep time of up to 292 years or so seems to be sufficient. Also FWIW, when using a POSIX API such as nanosleep(), the valid range of timespec::tv_nsec is 'just' [0, 999999999].
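The arithmetic behind these numbers can be checked with a small sketch. It assumes a 64-bit time_t and a non-negative nanosecond count; both helper names are hypothetical and only illustrate how a single nanoseconds argument maps onto a normalized timespec:

```c
#include <stdint.h>
#include <time.h>

/* Nanoseconds per second, as a 64-bit constant. */
static const int64_t NS_PER_SEC = 1000000000LL;

/* Hypothetical helper: split a non-negative nanosecond count into a
 * normalized timespec, i.e. tv_nsec ends up in [0, 999999999]. */
struct timespec ns_to_timespec(int64_t ns)
{
    struct timespec ts;
    ts.tv_sec  = (time_t)(ns / NS_PER_SEC);
    ts.tv_nsec = (long)(ns % NS_PER_SEC);
    return ts;
}

/* Hypothetical helper: approximate length of a nanosecond count in
 * years, assuming years of 365.25 days. */
double ns_to_years(int64_t ns)
{
    return (double)ns / 1e9 / (365.25 * 24.0 * 3600.0);
}
```

For the largest signed 64-bit value, ns_to_timespec(INT64_MAX) yields tv_sec = 9223372036 and tv_nsec = 854775807, and ns_to_years(INT64_MAX) comes out at a bit more than 292 years.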

However, measuring the likely tiny overhead of copying a timespec struct from user to kernel space requires careful and rigorous work.
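To give an idea of the shape of such a measurement, here is a rough sketch that records how much longer than requested a relative clock_nanosleep() takes, as seen from user space. It is not the paper's benchmark and it does not isolate the timespec copy at all; it only measures total wake-up overshoot, which is dominated by timer slack, scheduling and hardware effects. The function name and its interface are made up for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Difference b - a in nanoseconds. */
static int64_t timespec_diff_ns(const struct timespec *a,
                                const struct timespec *b)
{
    return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (b->tv_nsec - a->tv_nsec);
}

/* Hypothetical helper: sleep `iterations` times for `request_ns`
 * nanoseconds and report the smallest and largest overshoot
 * (elapsed time minus requested time) in nanoseconds.
 * Returns 0 on success, -1 on invalid arguments or a failed call. */
int measure_sleep_overshoot(int64_t request_ns, int iterations,
                            int64_t *min_ns, int64_t *max_ns)
{
    if (iterations <= 0 || request_ns < 0)
        return -1;

    struct timespec req;
    req.tv_sec  = (time_t)(request_ns / 1000000000LL);
    req.tv_nsec = (long)(request_ns % 1000000000LL);

    *min_ns = INT64_MAX;
    *max_ns = INT64_MIN;

    for (int i = 0; i < iterations; i++) {
        struct timespec t0, t1;

        if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
            return -1;
        /* Relative sleep; a non-zero result (e.g. EINTR) aborts the run. */
        if (clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL) != 0)
            return -1;
        if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
            return -1;

        int64_t overshoot = timespec_diff_ns(&t0, &t1) - request_ns;
        if (overshoot < *min_ns)
            *min_ns = overshoot;
        if (overshoot > *max_ns)
            *max_ns = overshoot;
    }
    return 0;
}
```

A serious comparison would additionally have to pin the thread to an isolated CPU, fix the CPU frequency, document the timer slack and the kernel configuration, and look at full latency distributions rather than just the extremes.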

Of course, it's highly questionable whether eliminating the expected user-to-kernel timespec copy overhead would really justify adding another syscall.

Addendum

Apparently, the hr_sleep() authors initially contacted the kernel community only after their conference paper was published (the first LKML post is dated 2021-01-15, whereas the conference paper was published on 2020-11-24 and presented at the conference in the first week of December 2020). At that time, the first post received some feedback from another kernel developer, Andy Lutomirski. As with the second posting, the hr_sleep authors didn't bother to reply directly to any of the issues mentioned in the review.

It seems that the paper was submitted to a proper conference and hence it was peer reviewed, but perhaps the academic reviewers weren't sufficiently familiar with the relevant parts of the Linux kernel and thus missed the issues that were discussed on the LKML.


On 2024-07-07, I opened an issue (archive) on the hr_sleep github repository, asking for clarification.

It was deleted quickly without any comment:

This issue has been deleted.

For the sake of completeness I'm reproducing my github post below:

Clarifications nanosleep vs. hr_sleep measurements #5

Reading through Section 3 A of your preprint (3rd revision) I noticed a few issues:

  1. 'This factor can be controlled using the prctl() system call, putting it to the minimal value of 1.' Perhaps you want to add the unit of that minimal value.
  2. 'These data have been collected by running the thread issuing the sleep request as a classical SCHED_OTHER (normal) priority thread and—as hinted before—with the timer slack of nanosleep() set to 1µs.' Please clarify, did you really set the timer slack to 1 µs? Or did you set it to its minimal value, i.e. 1 ns?
  3. 'We remand the reader to [14] for an extended evaluation of this implementation.In [..]' Firstly, there is a space missing after the full stop. Secondly, why do you continue to refer to an evaluation whose flaws were pointed out to you by the maintainer of the relevant Linux subsystems while apparently you tried to address some of them in that 3rd version of your preprint?
  4. 'Figure 1' You don't mention any details regarding the system you measured those latencies on. Thus, the results are hard to reproduce. The text states 'The tests have been conducted on an isolated NUMA node equipped with Intel Xeon Silver 2.1 GHz cores. The server is running Linux kernel 5.4.' But why don't you mention the exact CPU model and exact kernel version? Also relevant but completely missing: whether relevant kernel parameters were supplied, such as turning some/all kernel CPU bug mitigations off, whether Hyperthreading was enabled, whether frequency scaling, energy saving modes, turbo boost, etc. etc. were enabled ...

BTW, google searches still return the ResearchGate page for the first preprint version at relatively high rank for me - perhaps you want to remove it from ResearchGate to avoid further confusion.

Also, the ACM page of the conference paper version doesn't contain any hints regarding published corrections.

Similarly, in the README of this repository you still link to your conference paper without mentioning that the analysis of hr_sleep vs. nanosleep therein is deeply flawed.

Finally, since you profited from the LKML review of your paper and apparently that review prompted a significant rework of Section 3 A in the 3rd revision of your preprint, why didn't you acknowledge that Linux kernel maintainer in the Acknowledgement Section in your 3rd revision?