Better Datamash Build Story with Meson

This article is a small case study on introducing the Meson build tool into a legacy Autotools- and Gnulib-centric code base, namely the GNU Datamash project. With Meson, the project builds more than twice as fast and configures more than an order of magnitude faster, while requiring only a few hundred lines of Meson build files. As a side effect of this experiment, the code size of the executables is roughly halved.

On Datamash

Datamash is a command line program for running various statistics on tabular data, such as computing the count, average or median of a column.

Its sweet spot seems to be ad hoc analyses on the command line, on systems where installing a more heavyweight alternative such as R or DuckDB would be too inconvenient.

In the following comparisons, I compare Datamash 1.9, released in the first half of 2025, against my Datamash 'main' branch. That branch adds Meson support and a few other fixes and improvements, and is based on the upstream 'master' branch as of the end of 2025, i.e. release 1.9 plus just a few minor upstream commits.

On Meson

Meson is a modern build tool that is an alternative to Autotools and CMake. See also my 2021 summary of Meson.

Motivation

The Meson design improves on pain points many users often have with Autotools, such as:

  • configure wastes time checking for tons of things that have been true everywhere for decades (and does so single-threaded)
  • the build wastes time with inefficient auto-generated recursive makefiles
  • layers of m4 and shell scripting result in a big ball of mud that breaks easily and is hard to fix or work on in general
  • all the Autotools-generated, project-specific code is hard to review and is an invitation for supply chain attackers such as Jia Tan

The idea with Meson is to eliminate these issues: it speeds up the build, makes the project-specific build files easier to review, reduces maintenance effort and simplifies working with the build system in general.

Configure

Using the Datamash 1.9 release, running ./configure takes 13 s or so on a modern laptop (12th Gen Intel CPU, NVMe storage, Fedora 43). That configure run triggers 488 checks, which seems excessive.

Some things are checked redundantly, which autoconf tries to mitigate by caching the results, as with the checks for round():

checking whether round is declared... yes
checking whether round works... yes
checking whether roundl is declared... yes
checking whether round is declared... (cached) yes
checking whether round works... (cached) yes
checking whether roundl is declared... (cached) yes

For some reason, other checks are actually executed twice before the cache kicks in:

checking for stdint.h... yes
checking whether stdint.h conforms to C99... yes
checking whether stdint.h works without ISO C predefines... yes
checking whether stdint.h has UINTMAX_WIDTH etc.... yes
checking for stdint.h... yes
checking for stdint.h... (cached) yes
checking for stdint.h... (cached) yes

In total there are 30 cached checks.
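Counts like these can be reproduced from a saved copy of the configure output. The following is a minimal sketch in Python, not part of Datamash: the function name and the SAMPLE_LOG constant are my own, the sample lines are the round() checks quoted above, and it assumes every check is printed on a single line in the usual "checking ...... result" form.

```python
from collections import Counter

# Sample lines taken from the configure output quoted above.
SAMPLE_LOG = """\
checking whether round is declared... yes
checking whether round works... yes
checking whether roundl is declared... yes
checking whether round is declared... (cached) yes
checking whether round works... (cached) yes
checking whether roundl is declared... (cached) yes
"""


def summarize_configure_log(text):
    """Return (total checks, cached checks, descriptions seen more than once)."""
    checks = [line for line in text.splitlines() if line.startswith("checking ")]
    cached = sum(1 for line in checks if "... (cached) " in line)
    # The description is everything before the final "... <result>" part.
    seen = Counter(line.rsplit("... ", 1)[0] for line in checks)
    repeated = sorted(desc for desc, count in seen.items() if count > 1)
    return len(checks), cached, repeated


if __name__ == "__main__":
    total, cached, repeated = summarize_configure_log(SAMPLE_LOG)
    print(f"{total} checks, {cached} cached, {len(repeated)} repeated")
```

Feeding the complete configure output into the same function, instead of the six sample lines, yields the totals for a whole run and lists every check that shows up more than once.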

Some more examples of surprising checks:

checking whether strdup is declared... yes
checking whether strnlen is declared... yes
checking for strnlen... yes
checking for working strnlen... yes
checking for strtold... yes
checking for strtoumax... yes
checking whether strtoumax is declared... yes
checking whether strtod obeys C99... yes
checking whether strtold obeys POSIX... yes
checking for strtoll... yes
checking whether strtoll works... yes
checking for strtoull... yes
checking whether strtoull works... yes

These have been covered since C89, C99, POSIX.1-2001 or POSIX.1-2008. I mean, how realistic is it that there are actually Datamash users who want to install the latest and greatest Datamash, but run such a broken and vulnerable operating system that it lacks support for these decades-old standards?

Sure, retrocomputing can be fun, but isn't the point of retrocomputing to run old software?


In comparison, the Meson checks concentrate on the essentials, so the setup finishes in 0.5 s or so, i.e. more than an order of magnitude faster. Also, the output volume is lower and thus more useful:

$ meson setup --buildtype=debugoptimized ..
The Meson build system
Version: 1.8.5
Source dir: /home/juser/del/datamash
Build dir: /home/juser/del/datamash/build
Build type: native build
Project name: datamash
Project version: undefined
C compiler for the host machine: cc (gcc 15.2.1 "cc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7)")
C linker for the host machine: cc ld.bfd 2.45.1-1
Host machine cpu family: x86_64
Host machine cpu: x86_64
Library m found: YES
Found pkg-config: YES (/usr/bin/pkg-config) 2.3.0
Run-time dependency nettle found: YES 3.10.1
Compiler for C supports function attribute alloc_size: YES
Compiler for C supports function attribute cold: YES
Compiler for C supports function attribute const: YES
Fetching value of define "__has_c_attribute(fallthrough)" : 202311
Compiler for C supports function attribute format: YES
Compiler for C supports function attribute malloc: YES
Compiler for C supports function attribute warn_unused_result: YES
Compiler for C supports function attribute pure: YES
Compiler for C supports function attribute returns_nonnull: YES
Compiler for C supports function attribute sentinel: YES
Checking for function "strtoumax" : YES
Configuring config.h using configuration
Program ./tests/datamash-tests.pl found: YES (/home/juser/del/datamash/./tests/datamash-tests.pl)
[..]
Program ./tests/decorate-sort-tests.pl found: YES (/usr/bin/env perl /home/juser/del/datamash/./tests/decorate-sort-tests.pl)
Program perl found: YES (/usr/bin/perl)
Program sh found: YES (/usr/bin/sh)
Program msgfmt found: YES (/usr/bin/msgfmt)
Program msginit found: YES (/usr/bin/msginit)
Program msgmerge found: YES (/usr/bin/msgmerge)
Program xgettext found: YES (/usr/bin/xgettext)
Program help2man found: YES (/usr/bin/help2man)
Configuring version.texi using configuration
Program makeinfo found: YES (/usr/bin/makeinfo)
Build targets in project: 22

The check for strtoumax is included as an example of how easy it is to implement such a check in Meson, if you really need it.

Build

Allowing make to run on 2 cores in parallel, the Datamash Autotools build finishes in 4 s or so.

Looking at the build messages, there are a few oddities:

gcc -DLOCALEDIR=\"/home/juser/local/datamash-1.9/share/locale\" -DHAVE_CONFIG_H -I.
    -Ilib -I./lib -Isrc -I./src  -Wall -Wextra -Wformat-security -Wswitch-enum
    -Wswitch-default -Wunused-parameter -Wfloat-equal -fdiagnostics-show-option
    -funit-at-a-time -Wmissing-format-attribute -Wstrict-overflow -Wsuggest-attribute=const
    -Wsuggest-attribute=pure   
    -g -O2 -MT src/datamash-column-headers.o -MD -MP -MF src/.deps/datamash-column-headers.Tpo
    -c -o src/datamash-column-headers.o `test -f 'src/column-headers.c' || echo './'`src/column-headers.c

There is an extra shell invocation for each translation unit, which seems pointless, as you end up with either

gcc ... -c -o src/foo.o src/foo.c

or

gcc ... -c -o src/foo.o ./src/foo.c

There is also some redundancy in linking to -lm:

gcc -Wall -Wextra -Wformat-security -Wswitch-enum -Wswitch-default -Wunused-parameter
    -Wfloat-equal -fdiagnostics-show-option -funit-at-a-time -Wmissing-format-attribute
    -Wstrict-overflow -Wsuggest-attribute=const -Wsuggest-attribute=pure
    -g -O2   -o datamash
    src/datamash-text-options.o src/datamash-utils.o src/datamash-randutils.o src/datamash-text-lines.o
    src/datamash-column-headers.o src/datamash-op-defs.o src/datamash-op-scanner.o
    src/datamash-op-parser.o src/datamash-field-ops.o src/datamash-crosstab.o
    src/datamash-double-format.o src/datamash-datamash.o lib/libdatamash.a
    -lm -lm -lm -lm -lm -lm -lm -lm              -lm -lm   -lm -lm -lm -lm -lm -lm

That is, for unknown reasons the Autotools build links to libm 16 times. Arguably, this is a good example of how even Autotools fans struggle with its complexity and are overwhelmed by it.


In comparison, Meson builds everything in 1.5 s or so (i.e. more than twice as fast) and its default output is more reasonable. That means that by default the user isn't spammed with overly long and cryptic low-level command invocations, but sees useful high-level progress:

$ ninja -j2 
[1/50] Compiling C object datamash.p/src_column-headers.c.o
[2/50] Compiling C object datamash.p/src_crosstab.c.o
[..]
[15/50] Compiling C object datamash.p/_usr_share_gnulib_lib_hashcode-mem.c.o
[16/50] Compiling C object datamash.p/_usr_share_gnulib_lib_hashcode-string2.c.o
[..]
[31/50] Linking target datamash
[32/50] Building translation po/da/LC_MESSAGES/datamash-da.mo
[33/50] Building translation po/de/LC_MESSAGES/datamash-de.mo
[34/50] Compiling C object decorate.p/_usr_share_gnulib_lib_xmalloc.c.o
[35/50] Building translation po/eo/LC_MESSAGES/datamash-eo.mo
[..]
[40/50] Linking target decorate
[48/50] Generating datamash manual with a custom command (wrapped by meson to set env)
[49/50] Generating decorate manual with a custom command (wrapped by meson to set env)
[50/50] Generating gen-info with a custom command

Of course, when a command errors out, the full command line and error message are presented to the user. Alternatively, one can run the build in verbose mode by adding -v to the ninja command.

The verbose mode doesn't slow down the build noticeably. Enabling it shows that Meson doesn't have the same issues with linking libm:

cc  -o datamash
    datamash.p/src_column-headers.c.o datamash.p/src_crosstab.c.o datamash.p/src_datamash.c.o
    datamash.p/src_double-format.c.o datamash.p/src_field-ops.c.o datamash.p/src_op-defs.c.o
    datamash.p/src_op-parser.c.o datamash.p/src_op-scanner.c.o datamash.p/src_randutils.c.o
    datamash.p/src_text-lines.c.o datamash.p/src_text-options.c.o datamash.p/src_utils.c.o
    datamash.p/modern_system.c.o datamash.p/_usr_share_gnulib_lib_exitfail.c.o
    datamash.p/_usr_share_gnulib_lib_hash.c.o datamash.p/_usr_share_gnulib_lib_hashcode-mem.c.o
    datamash.p/_usr_share_gnulib_lib_hashcode-string2.c.o datamash.p/_usr_share_gnulib_lib_linebuffer.c.o
    datamash.p/_usr_share_gnulib_lib_next-prime.c.o datamash.p/_usr_share_gnulib_lib_version-etc.c.o
    datamash.p/_usr_share_gnulib_lib_xalloc-die.c.o datamash.p/_usr_share_gnulib_lib_xmalloc.c.o
    -Wl,--as-needed -Wl,--no-undefined -Wl,--start-group -lm /usr/lib64/libnettle.so -Wl,--end-group

In contrast to the Autotools status quo, I decided against creating a static library of the Gnulib parts: the effort doesn't pay off, because it would only be used by two executables. Instead, I directly link only those Gnulib translation units which are essential for each binary.

Code Size

Comparing the sizes of executables in the Datamash 1.9 vs. Meson branch shows substantial savings with Meson (all sizes in bytes, via size -G):

branch  filename   text   data   bss   total
1.9     datamash  97370  42120   968  140458
meson   datamash  46736  35756   640   83132
1.9     decorate  27348  22068  1576   50992
meson   decorate  15520  16224  1248   32992

That means that the code size (the text column) of the datamash executable is reduced by over 50 percent.
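The percentages behind this claim can be recomputed from the size output. The following is a minimal sketch in Python, with the byte counts copied from the table above; the names SIZES and reduction_percent are mine and purely illustrative.

```python
# Section sizes in bytes, copied from the `size -G` table above.
SIZES = {
    ("1.9", "datamash"): {"text": 97370, "data": 42120, "bss": 968},
    ("meson", "datamash"): {"text": 46736, "data": 35756, "bss": 640},
    ("1.9", "decorate"): {"text": 27348, "data": 22068, "bss": 1576},
    ("meson", "decorate"): {"text": 15520, "data": 16224, "bss": 1248},
}


def reduction_percent(before, after):
    """Relative saving in percent when going from `before` to `after` bytes."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for program in ("datamash", "decorate"):
        old, new = SIZES[("1.9", program)], SIZES[("meson", program)]
        text = reduction_percent(old["text"], new["text"])
        total = reduction_percent(sum(old.values()), sum(new.values()))
        print(f"{program}: text -{text:.1f} %, total -{total:.1f} %")
```

For datamash this gives a saving of about 52 % in the text section and about 41 % in the total, so the "over 50 percent" figure refers to the code (text) section rather than to the sum of all sections.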

These space savings are due to linking only those parts of Gnulib that are essential. Datamash 1.9 links the following 120 Gnulib translation units:

af_alg.c                        hash-pjw-bare.c                 stat-time.c
arpa_inet.c                     hash-pjw.c                      stdlib.c
asnprintf.c                     ialloc.c                        striconv.c
base64.c                        imaxtostr.c                     stripslash.c
basename.c                      inttostr.c                      strnlen1.c
basename-lgpl.c                 linebuffer.c                    sys_socket.c
bitrotate.c                     localcharset.c                  trim.c
c32isalnum.c                    localename.c                    u64.c
c32isalpha.c                    localename-environ.c            uinttostr.c
c32isblank.c                    localename-table.c              umaxtostr.c
c32iscntrl.c                    localename-unsafe.c             unicase/tolower.c
c32isdigit.c                    malloca.c                       unictype/ctype_alnum.c
c32isgraph.c                    math.c                          unictype/ctype_alpha.c
c32islower.c                    mbchar.c                        unictype/ctype_blank.c
c32isprint.c                    mbrtoc32.c                      unictype/ctype_cntrl.c
c32ispunct.c                    mbrtowc.c                       unictype/ctype_digit.c
c32isspace.c                    mbslen.c                        unictype/ctype_graph.c
c32isupper.c                    mbsstr.c                        unictype/ctype_lower.c
c32isxdigit.c                   mbszero.c                       unictype/ctype_print.c
c32tolower.c                    mbuiter.c                       unictype/ctype_punct.c
c32width.c                      mbuiterf.c                      unictype/ctype_space.c
c-ctype.c                       md5.c                           unictype/ctype_upper.c
cloexec.c                       md5-stream.c                    unictype/ctype_xdigit.c
closeout.c                      offtostr.c                      unistd.c
close-stream.c                  printf-args.c                   unistr/u8-mbtoucr.c
c-strcasecmp.c                  printf-parse.c                  unistr/u8-uctomb-aux.c
dirname.c                       progname.c                      unistr/u8-uctomb.c
dirname-lgpl.c                  propername.c                    uniwidth/width.c
exitfail.c                      quotearg.c                      vasnprintf.c
fcntl.c                         reallocarray.c                  version-etc.c
fd-hook.c                       realloc.c                       vsnzprintf.c
fpurge.c                        setlocale_null.c                wctype-h.c
freading.c                      setlocale_null-unlocked.c       xalloc-die.c
getlocalename_l-unsafe.c        sha1.c                          xmalloc.c
getprogname.c                   sha1-stream.c                   xsize.c
glthread/lock.c                 sha256.c                        xstriconv.c
glthread/once.c                 sha256-stream.c                 xstrtol.c
glthread/threadlib.c            sha512.c                        xstrtol-error.c
hard-locale.c                   sha512-stream.c                 xstrtoul.c
hash.c                          sh-quote.c                      xstrtoumax.c

In contrast, the Meson branch links only the following 9 Gnulib translation units:

exitfail.c
hash.c
hashcode-mem.c
hashcode-string2.c
linebuffer.c
next-prime.c
version-etc.c
xalloc-die.c
xmalloc.c

The following measures make this reduction possible:

  1. Identifying Gnulib functionality that is widely available in the standard library or in quasi-standard libraries and using that instead of Gnulib.
  2. Removal of accidental code bloat.

For example, after fixing the percentile computation, I was able to replace the separate median implementation with an alias that saves 200 bytes or so in code size.
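The idea of treating the median as a special case of the percentile can be sketched as follows. This is an illustration in Python and not the Datamash C code: the function names are mine, and the sketch assumes linear interpolation between the two closest ranks, which may differ from the percentile definition Datamash actually uses.

```python
def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) of a non-empty sequence.

    Assumption: linear interpolation between the two closest ranks.
    """
    if not values:
        raise ValueError("percentile of empty input")
    if not 0 <= p <= 100:
        raise ValueError("p must be within [0, 100]")
    data = sorted(values)
    pos = (len(data) - 1) * p / 100.0  # fractional index into the sorted data
    lo = int(pos)                      # pos >= 0, so int() rounds down
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def median(values):
    """The median is nothing more than the 50th percentile."""
    return percentile(values, 50)


if __name__ == "__main__":
    print(median([1, 2, 3, 4]))
```

Once the percentile code handles the p = 50 case correctly for both odd and even input lengths, a separate median routine carries no extra information and can be dropped.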

Since libnettle is widely available and often pre-installed as a core system dependency, it suggests itself for the cryptographic hash and base64 computations in Datamash, instead of vendoring in the Gnulib versions. Switching to libnettle is a moderate source code change but saves several kilobytes in the executable.

Another source of Gnulib overuse is quoting. After replacing a popen invocation with an idiomatic direct fork and exec sequence, we can get rid of Gnulib's sh-quote.c translation unit and its dependency hell. Similarly, I prefer a minimal quoting implementation in the Datamash decorate command, as it's more size-efficient and easier to review.
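Why a direct fork and exec makes shell quoting unnecessary can be shown in a few lines. The following is a minimal sketch in Python using the POSIX calls exposed by the os module; it is not the Datamash code, which is written in C, and run_and_capture, ECHO_CMD and TRICKY are hypothetical names chosen for the illustration. With popen, the command line is handed to a shell as one string, so every argument has to be quoted; with exec, the arguments are passed as a vector and reach the child verbatim.

```python
import os
import sys

# Hypothetical demo command: a tiny child process that prints its first argument verbatim.
ECHO_CMD = [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])"]
# An argument that a shell would split, expand or even partly execute.
TRICKY = "it's a file; echo oops $HOME"


def run_and_capture(argv):
    """Run argv without a shell and return (stdout bytes, exit code)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: route stdout into the pipe, then replace the process image.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)
            os.close(write_fd)
            os.execvp(argv[0], argv)  # does not return on success
        finally:
            os._exit(127)  # only reached if something above failed
    # Parent: close the write end so that read() sees EOF when the child is done.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        output = pipe.read()
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return output, exit_code


if __name__ == "__main__":
    print(run_and_capture(ECHO_CMD + [TRICKY]))
```

The argument containing a quote, a semicolon and a dollar sign arrives in the child unchanged, without any quoting helper being involved.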

One more code bloat source is how Datamash 1.9 quotes trivial arguments in error messages, i.e. via a Gnulib function that quotes in a localized fashion. I don't think that this is worth it and thus switched the branch to a more minimal version.

In another example, I changed the code to invoke the standardized strtoumax instead of using the closely named but non-conforming Gnulib version.

Besides code size, Gnulib versions are always suspected of being less battle-tested and optimized. Also, when bugs are fixed in them, it's easy to miss the update, because Gnulib is designed to be vendored, in contrast to a shared system library.

Of course, the code changes aren't really Meson-specific, but I argue that the way I integrated the Gnulib dependencies into the Meson build file simplified those changes, compared with having to hack the Autotools Gnulib integration to link only the essential translation units.

On Gnulib

Gnulib describes itself as a source-code library, and a user is supposed to vendor the Gnulib parts they need into a project (i.e. copy and bundle Gnulib source code files with their project).

Although the Gnulib manual has a few paragraphs on its philosophy and design, it doesn't give any rationale for that odd design decision. It just states:

Classical libraries are installed as binary object code. Gnulib is different: It is used as a source code library. Each package that uses Gnulib thus ships with part of the Gnulib source code. The used portion of Gnulib is tailored to the package: A build tool, called gnulib-tool, is provided that copies a tailored subset of Gnulib into the package.

In my opinion, Gnulib is different here for no good reason. There is no reason why Gnulib couldn't provide the same functionality, i.e. the portability wrapper and glue code functions, in a secure, reliable and structured way as a normal shared library, like, say, libbsd.

Even worse, this introduces several disadvantages:

  1. Similar to all the generated build file cruft Autotools projects come with, Gnulib code from some arbitrary version dumped into a lib/ sub-directory is just another great opportunity for a supply chain attacker to hide malicious code.
  2. As with static linking, and in contrast to shared linking, updates to the library require rebuilding all users.
  3. Worse than with static linking, since the code is bundled, the dependency isn't obvious, and thus easy to miss. For example, it cannot be searched for with a distribution's package manager.
  4. Bundling Gnulib multiplies reviewing efforts, since with each copy it has to be checked whether it can be traced to the upstream Gnulib repository or contains malicious modifications.

Curiously, when it comes to supply chain security, Gnulib tries to talk down other language package ecosystems, while completely ignoring its own fundamental issues:

Many programming languages nowadays have an ecosystem of reusable source code packages, available through a central site, together with a tool that downloads dependencies from this central site. [.. Python example ..] Most of them are vulnerable to supply chain attacks. [..] While some mitigations exist, they are often cumbersome to put in place. [..] Gnulib is not vulnerable to such attacks, because all of its code is managed in a single repository, with a limited set of committers and with established code review practices.

This misses the point very hard.

If you take their Python example, the stuff Gnulib provides is part of the Python standard library, since it exactly fits Python's batteries included philosophy. Thus, there is really no need to install some random package via pip for basic functionality, as the Gnulib team is trying to insinuate here.

Having a limited set of committers and 'established code review practices' conveys almost no information. Many projects could claim that. Even the xz project had a limited set of committers.

The main supply chain risk the Gnulib team should worry about is the vendoring of their 'library'. Instead of encouraging developers to bundle their questionable wrappers, where better standardized wrappers or common libraries have existed for decades, Gnulib could look into releasing a proper shared library.

Source Overhead

Another important metric is the size of the build system in a project. The more files and lines of code there are, the higher the effort for review, maintenance and debugging when something goes wrong.

With Datamash 1.9 the Autotools build system is quite large. There are at least:

$ wc -l aclocal.m4 cfg.mk config.in  configure  configure.ac GNUmakefile init.cfg maint.mk Makefile.am Makefile.in    
   1926 aclocal.m4
    193 cfg.mk
   3047 config.in
  56440 configure
    229 configure.ac
    130 GNUmakefile
     79 init.cfg
   1950 maint.mk
    347 Makefile.am
  12206 Makefile.in
  76547 total
$ du -ch aclocal.m4 cfg.mk config.in  configure  configure.ac GNUmakefile init.cfg maint.mk Makefile.am Makefile.in  | tail -1
2.5M    total

The configure script is auto-generated from other files and is part of the release source archive, to allow users to compile without having to install autoconf.

After running ./configure another large Makefile is generated:

$ wc -l Makefile
12206 Makefile
$ du -h Makefile
800K    Makefile

For comparison, the actual Datamash source code is less than 400 kilobytes in size:

$ du -h src
316K    src

In addition, the vendored Gnulib sources come with over 30 thousand lines (or over 1 megabyte) of m4 macros and over 4 MiB of bundled C source code:

$ wc -l m4/* | tail -1
  33272 total
$ du -h m4
1.7M    m4
$ du -h lib
4.3M    lib

When building from the Datamash git repository, one also has to consider at least one more largish shell script:

$ wc -l bootstrap*
 1087 bootstrap
  177 bootstrap.conf
 1264 total

This script generates Autotools files which aren't part of the repository, such as the large configure script. It also updates the Gnulib bundle and downloads translation files. Again, this is another good target for a supply chain attacker to hide malicious code. Often such boilerplate code is just copied and pasted from some upstream and doesn't need to be customized. Still, someone has to review it, check whether it was modified and, if so, whether the modification is legitimate. That is wasteful busy work which isn't necessary when using a better build tool.

Having all those bits and pieces bundled also increases the risk of updating only some of them and thus ending up with a collection of moving parts and state that is very likely to be untested.


In comparison, with the Meson branch there is just:

$ wc -l meson.build meson.options po/meson.build 
  404 meson.build
    6 meson.options
    5 po/meson.build
  415 total
$ du -ch meson.build meson.options po/meson.build | tail -1
28K     total

The 404-line meson.build is structured like this:

#lines  category
    56  comments
    55  empty lines
    91  configuration data macro settings
    53  test suite definitions
   149  main build target definitions

With Meson the megabytes of Autotools bloat listed above can be dropped. Since the Meson build file uses a system-wide installed Gnulib, a few more megabytes can be dropped, removing the bundled copy. Removing those files would reduce the size of the release .tar.gz archive by almost 50 %, i.e. from 2.6 MiB to 1.4 MiB or so.
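To put the two build systems side by side, the line counts and archive sizes reported in this section can be turned into ratios. The following is a minimal sketch in Python; all numbers are copied from the wc output and the archive sizes quoted above, and the variable names are mine.

```python
# Line counts copied from the `wc -l` listings in this section.
AUTOTOOLS_LINES = {
    "aclocal.m4": 1926, "cfg.mk": 193, "config.in": 3047, "configure": 56440,
    "configure.ac": 229, "GNUmakefile": 130, "init.cfg": 79, "maint.mk": 1950,
    "Makefile.am": 347, "Makefile.in": 12206,
}
MESON_LINES = {"meson.build": 404, "meson.options": 6, "po/meson.build": 5}

# Release archive sizes in MiB as stated in the text (approximate values).
ARCHIVE_BEFORE_MIB = 2.6
ARCHIVE_AFTER_MIB = 1.4

autotools_total = sum(AUTOTOOLS_LINES.values())
meson_total = sum(MESON_LINES.values())
line_ratio = autotools_total / meson_total
archive_reduction = 100.0 * (ARCHIVE_BEFORE_MIB - ARCHIVE_AFTER_MIB) / ARCHIVE_BEFORE_MIB

if __name__ == "__main__":
    print(f"Autotools: {autotools_total} lines, Meson: {meson_total} lines")
    print(f"ratio: about {line_ratio:.0f}x, archive: about -{archive_reduction:.0f} %")
```

Note that the Autotools total includes generated files such as configure and Makefile.in, which ship in the release archive and therefore still have to be reviewed; it excludes the m4 directory, the bundled Gnulib sources and the bootstrap script.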

Test Suite

Datamash comes with a test suite, i.e. a bunch of (mostly Perl) scripts that invoke the datamash executables with various test inputs.

With Autotools a test suite invocation runs like this:

$ make check
if test -d ./.git                               \
        && git --version >/dev/null 2>&1; then                  \
  cd . &&                                               \
  git submodule --quiet foreach                                 \
      'test "$(git rev-parse "$sha1")"                  \
          = "$(git merge-base origin "$sha1")"'         \
    || { echo 'maint.mk: found non-public submodule commit' >&2;        \
         exit 1; };                                             \
else                                                            \
  : ;                                                           \
fi
make  check-recursive
make[1]: Entering directory '/home/juser/del/datamash-1.9'
Making check in po
make[2]: Entering directory '/home/juser/del/datamash-1.9/po'
make[2]: Nothing to be done for 'check'.
make[2]: Leaving directory '/home/juser/del/datamash-1.9/po'
make[2]: Entering directory '/home/juser/del/datamash-1.9'
make  check-TESTS
make[3]: Entering directory '/home/juser/del/datamash-1.9'
make[4]: Entering directory '/home/juser/del/datamash-1.9'
PASS: tests/datamash-show-env.sh
PASS: tests/datamash-tests.pl
PASS: tests/datamash-tests-deprecated.pl
PASS: tests/datamash-tests-2.pl
[..]
PASS: tests/decorate-sort-tests.pl
============================================================================
Testsuite summary for GNU datamash 1.9
============================================================================
# TOTAL: 30
# PASS:  28
# SKIP:  2
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
make[4]: Leaving directory '/home/juser/del/datamash-1.9'
make[3]: Leaving directory '/home/juser/del/datamash-1.9'
make[2]: Leaving directory '/home/juser/del/datamash-1.9'
make[1]: Leaving directory '/home/juser/del/datamash-1.9'

As is common with Autotools, there is a lot of extra noise and echoing that distracts from the important information.


In comparison the Meson output is more to the point and useful:

$ meson test                                                 
ninja: Entering directory `/home/juser/del/datamash/build'
ninja: no work to do.
 1/25 deprecated_test                     OK              0.08s
 2/25 tests2                              OK              0.40s
 3/25 deprecated2_test                    OK              0.07s
 4/25 tests                               OK              0.66s
 5/25 parser_test                         OK              0.20s
 6/25 md5_test                            OK              0.04s
[..]
25/25 deco_sort                           OK              0.15s

Ok:                25  
Fail:              0   

Full log written to /home/juser/del/datamash/build/meson-logs/testlog.txt


Side Effects

As a side effect of introducing Meson, a few other things are now possible or much easier than with Autotools.

In particular, the first-class support of out-of-tree builds simplifies having multiple build trees around for different target configurations, e.g. one with release optimization enabled, one with special debugging flags etc.

Also, Meson supports complicated build modes such as LTO and instrumentation out of the box via simple setup flags.

Nowadays Autotools supports out-of-tree builds as well, but usually this isn't documented well, and for each project that uses Autotools there is a risk that it doesn't fully support this mode. Similarly, passing LTO and instrumentation flags is also possible with Autotools, but arguably this is more tedious and error-prone.

See also

My Meson branch (named main) contains all the changes that are discussed in this article. My changes are the first commits in 2026 and that series contains 9 consecutive commits or so.


Conclusion

Porting Datamash from Autotools to Meson required relatively little effort, and that effort quickly pays off: build setup is more than ten times faster, the build is more than twice as fast and the code size of the executables is roughly halved, while the Meson build configuration is simpler and more compact and thus easier to review and maintain.

It shows that Meson has already been adopted by many open source projects: its support for configuration macros and tests is good. Last but not least, it helps that Meson is better documented than Autotools.