on build systems
Recently I have been thinking about what makes for a good build system. I want to analyze the major pain points I have encountered building software and identify where these systems go wrong. Since C is, in my experience, one of the most frustrating languages to build, I will use it as a case study.
definition
I think of build systems as a very broad category of software whose goal is to automate the process of building other software. This typically involves several tasks, mainly:
- linting
- formatting
- interpreting
- compiling
- linking
- packaging
- deploying
- executing
Software that performs one or more of these tasks is a build tool. Any number of build tools can make up a build system.
c and c++
The C family of languages has a quite complicated ecosystem of competing build systems. To start with, there are the compilers themselves: GCC and Clang. A typical invocation of either looks like this:
cc main.c foo.c -Iinclude/ -Llib/ -O2 -oProgram
This is quite verbose as far as CLI tools go. The path of every source file must be specified, as well as the location of libraries to be linked. Path variables also play a role in the linking process, adding a layer of hidden complexity. Most software also makes use of numerous compiler flags, all of which must be typed every time.
To compile a C project using only the compiler requires first learning its structure, wrangling each of its dependencies manually, and reading documentation to find the appropriate build flags for your platform. This is not a reasonable ask for any developer.
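To make the repetition concrete, here is a rough sketch in C that stores the sources and flags once and assembles the command line from them. The helper names (`append_word`, `build_command`) are my own invention for this example, and the fixed `cc` driver name is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append `word` to the string in `cmd` (capacity `cap`), preceded by a
 * space if `cmd` is not empty. Returns 0 on success, -1 if it won't fit. */
static int append_word(char *cmd, size_t cap, const char *word)
{
    size_t used = strlen(cmd);
    size_t need = strlen(word) + (used > 0 ? 1 : 0);
    if (used + need + 1 > cap)
        return -1;
    if (used > 0)
        cmd[used++] = ' ';
    strcpy(cmd + used, word);
    return 0;
}

/* Hypothetical helper: assemble "cc <sources> <flags> -o <output>". */
int build_command(char *cmd, size_t cap,
                  const char *const *sources, size_t n_sources,
                  const char *const *flags, size_t n_flags,
                  const char *output)
{
    assert(cmd != NULL && output != NULL);
    if (cap == 0)
        return -1;
    cmd[0] = '\0';
    if (append_word(cmd, cap, "cc") != 0)
        return -1;
    for (size_t i = 0; i < n_sources; i++)
        if (append_word(cmd, cap, sources[i]) != 0)
            return -1;
    for (size_t i = 0; i < n_flags; i++)
        if (append_word(cmd, cap, flags[i]) != 0)
            return -1;
    if (append_word(cmd, cap, "-o") != 0 || append_word(cmd, cap, output) != 0)
        return -1;
    return 0;
}
```

Even this toy version has to know every file and every flag up front, which is exactly the bookkeeping nobody wants to redo by hand on each build.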
make
Compiler developers have decided that these problems are out of scope, and so another layer of abstraction is necessary. Make is a rudimentary scripting language primarily used to build C projects.
# Example Makefile taken from:
# https://www.cs.colby.edu/maxwell/courses/tutorials/maketutor/
CC=gcc
CFLAGS=-I.
DEPS=hellomake.h
OBJ=hellomake.o hellofunc.o
%.o: %.c $(DEPS)
	$(CC) -c -o $@ $< $(CFLAGS)
hellomake: $(OBJ)
	$(CC) -o $@ $^ $(CFLAGS)
Make attempts to abstract away the complexity of the C compilation process.
Variables and pattern matching of file names are particularly well suited for
managing compiler flags and object files. However, by far the most attractive
feature of Make is the ability to simply type make to compile the entire
program.
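For readers unfamiliar with pattern rules, the `%.o: %.c` line in the Makefile above amounts to a file name substitution. Here is a minimal sketch of that one step in C; `object_name` is a hypothetical helper written for this post, not anything Make exposes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the `%.o: %.c` pattern: given "name.c",
 * write "name.o" into `out`. Returns 0 on success, -1 if `source` does
 * not match the pattern or `out` is too small. */
int object_name(const char *source, char *out, size_t cap)
{
    assert(source != NULL && out != NULL);
    size_t len = strlen(source);
    /* The stem matched by % must be non-empty, so ".c" alone is rejected. */
    if (len < 3 || strcmp(source + len - 2, ".c") != 0)
        return -1;
    if (len + 1 > cap)
        return -1;
    memcpy(out, source, len + 1);  /* copy including the terminator */
    out[len - 1] = 'o';            /* swap the extension: .c -> .o */
    return 0;
}
```

Given hellomake.c this produces hellomake.o, while hellomake.h is rejected because it does not match the pattern.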
As software becomes more complex, so too does the task of building it. The limitations of C make this problem particularly egregious, given its fragile dependency resolution and lack of meta-programming. Make is an attempt to bridge this gap, and is a Turing-complete language in its own right. The Makefile which builds the Linux kernel is over 2000 lines as of writing. The massive demands placed on this intermediary language have exposed its weak points, mainly that it is stringly typed and full of cryptic, unintuitive syntax. Maintaining complex Makefiles contributes to the difficulty of building software almost as much as it helps.
cmake
Just as Make acts as an abstraction over C compilers, CMake acts as an abstraction over Makefiles. CMake is a great example of what happens to software development when there are no adults in the room, so to speak. Compiling a C program should be a simple task, ideally one that requires nothing more than a C compiler. Failing that, a simple build scripting language should be more than enough to handle even industrial use cases. When our build system needs a build system, we have completely lost the plot and need to reevaluate the problem from square one.
CMake -> Makefile -> gcc/clang -> Assembly
compile targets
Imagine a world where the Make language was more expressive, functional, and well-thought-out. Suddenly the idea of CMake becomes silly; clearly introducing another language into the mix would only slow down development and introduce an entirely new category of bugs. CMake can only exist because Make failed to accomplish its goal. The same could be said for the GCC and Clang compilation syntax. Rather than fix the underlying issue, we treat the failed product as a new compilation target and build a new thing to abstract away (never replace!) the old thing.
Developers are not (generally) stupid; this pattern exists for a reason. In the case of C, it is sometimes necessary to execute arbitrary code at build-time. The obvious solution is to create a new language to handle this need, but why is the original language not sufficient? Make is written in C, so by definition C can do anything Make can do. The issue is that C source files do not contain enough information for the compiler to build the entire program. This information must be embedded in another nonstandard format, which itself must be parsed and executed by a nonstandard build tool.
shebang
Developers have become overly complacent with build systems. Look at any project
today, and in the root directory you will see a layer of congealed fat:
package.json, CMakeLists.txt, Cargo.toml, build.gradle, maybe a Python
virtual environment, along with any ignore files, linter configs, etc. Every
new tool, language, and config file means another program to install and another
step in the build process. Every one of these dependencies makes the project
more fragile and less portable. Meanwhile, we are not making good use of the
tools we already have. We ought to be demanding more from language designers.
The build process should not be an afterthought left for developers to figure
out; it should be a core consideration when designing grammar and syntax.
If you have ever used a scripting language, you are probably familiar with the shebang line.
#!/usr/bin/env python3
print("Hello World!")
This wonderfully useful one-liner captures what I mean by making use of existing
tools, and treating the build process as a grammar concern. This Python file
describes how to run itself to whatever invokes it. Since the # character is
used as a comment in Python, the line can be safely ignored by any other
programs or tools that read the file. This system is not perfect: the name or
path of the python executable may vary between systems, and the shebang relies
on the operating system to interpret it (technically a build system). However,
expanding on this concept may help alleviate our build system woes.
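As a rough illustration of how little machinery the shebang needs, the sketch below pulls the interpreter path out of a file's first line. The helper name `parse_shebang` is my own invention for this example; real systems apply extra rules, such as length limits and argument handling, that are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if `line` starts with "#!", copy the interpreter
 * path (the first whitespace-delimited word after "#!") into `out`.
 * Returns 0 on success, -1 if there is no shebang or `out` is too small. */
int parse_shebang(const char *line, char *out, size_t cap)
{
    assert(line != NULL && out != NULL);
    if (strncmp(line, "#!", 2) != 0)
        return -1;
    const char *start = line + 2;
    start += strspn(start, " \t");           /* skip blanks after "#!" */
    size_t len = strcspn(start, " \t\r\n");  /* length of the path */
    if (len == 0 || len + 1 > cap)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

Given the first line of the Python file above, this yields /usr/bin/env; a line without the #! prefix is rejected.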
doing better
Let's look at how C syntax could be changed to adopt some of these ideals, starting with a simple example:
#!/usr/bin/gcc -E
// Warn or error if specific compiler not used
#compiler gcc 12.2.0
#semver 1.0.0
#ifdef RELEASE
#opt o2
#endif
#warn all
#libs lib/
#output build/MyProgram
// Warn or error if library semver does not match
#include "mylib.h" "0.2.*"
int main() {
    /* ... */
    return 0;
}
Here, I have replaced command-line flags with a special #-prefixed syntax.
Since all the compiler directives are in-line with the source code itself,
we can take advantage of the shebang line just like scripting languages do.
Using the #ifdef directive, we can even conditionally enable flags for
release mode. Let's see what we can do with an even more radical approach:
// build.c
// compile.h and link.h are imagined, compiler-provided headers
#include <compile.h>
#include <link.h>
#include <string.h> // strcmp
#define RELEASE 0
int main(int argc, char* argv[]) {
    // Struct representing the invoked compiler
    compiler_t compiler = get_compiler();
    // Require gcc 12 or newer
    if (strcmp("gcc", compiler.name) != 0
        || compiler.semver.major < 12) {
        // Abort build with error
        emit_error("Incompatible compiler version!");
        return 1;
    }
    semver_t version = {.major=1, .minor=0, .patch=0};
    int opt_level;
    // We could easily check for a flag
    // in argv here instead
    if (RELEASE) {
        opt_level = 2;
    } else {
        opt_level = 0;
    }
    // A realistic function would probably take
    // some structure containing compile directives
    artifact_t executable = compile(
        &compiler, "main.c", opt_level, version
    );
    artifact_t mylib = load_dylib("lib/mylib.so");
    link(&executable, &mylib);
    write_artifact(&executable, "bin/MyProgram");
    return 0;
}
In this example, we create a new file build.c which acts as a pseudo-Makefile.
The compile.h and link.h includes are compiler-implemented, and so do not
need to be linked from the system's libc. All flags passed to the compiler are
handed off to the main() function. It is easy to imagine an equivalent to
make clean that erases all build artifacts, or a caching system that only
rebuilds modified files.
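The caching idea can be sketched with nothing more than file timestamps. The following is a minimal, POSIX-only sketch; `needs_rebuild` is a hypothetical helper, not part of the imagined compile.h, and it assumes modification times are a good enough signal that a file has changed.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if `target` must be rebuilt, 0 if it is
 * up to date, and -1 if a dependency cannot be inspected. */
int needs_rebuild(const char *target, const char *const *deps, size_t n_deps)
{
    assert(target != NULL);
    struct stat target_info;
    if (stat(target, &target_info) != 0)
        return 1;                   /* no artifact yet: build it */

    for (size_t i = 0; i < n_deps; i++) {
        struct stat dep_info;
        if (stat(deps[i], &dep_info) != 0)
            return -1;              /* a missing input is an error */
        if (dep_info.st_mtime > target_info.st_mtime)
            return 1;               /* input changed after the last build */
    }
    return 0;
}
```

A build.c could call this before each compile() and skip the work when it returns 0. Comparing content hashes instead of timestamps would be more robust, at the cost of reading every file.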
going further
I am not a C developer by any stretch, and so I will spare you any more
pseudocode. I hope these examples show that replacing Makefiles with pure
C is not such an unreasonable idea. Still, we can go even further; imagine
if we split the compile() function into lexing, parsing, and IR-generating
intermediary functions. This would make meta-programming simple and
straightforward, and even allow for the introduction of program-specific syntax.
Developers could create libraries for common build tasks such as cloning git
repositories, running tests, or submitting binaries to package managers.
setbacks
Comparing my pseudocode to the Makefile example, it is obvious which is more idiomatic and understandable. This is partially due to my lack of creativity and skill as a C developer. However, I imagine Make will always have an advantage here, at least when it comes to small projects. While our build system is now in C, it is still a separate entity from the project itself. The minimum possible C program is now two files rather than just one. So far I have tried to conform as closely as possible to standard C syntax and grammar, but this approach will always feel like a hack more than a well-thought-out language feature.
the ouroboros
Most languages draw a very strong distinction between compile-time and run-time code. Typically, compile-time execution may happen only within macros or constant functions, if it is even allowed at all. This habit can be traced back to assembly programmers who deemed self-modifying code a dangerous antipattern. This mindset is what I believe drives us to create these leaning towers of build systems.
What would a language built around meta-programming look like? I suspect that a language with a truly infinite degree of self-reflection is possible. Such a language could be far more expressive than its peers while using less syntax. Imagine if a library could implement a new language-wide keyword, and even implement that keyword using the same keyword. Perhaps concepts as basic as structs, enumerated types, and integers could be defined within the language itself. The line between compilation and execution disappears. The line between language and program grows thin. I imagine this process as if the language were eating itself, like an ouroboros.
If such a language existed, it follows that every other language would simply be a strict subset of this language (let's call it "every-lang"). For example, we could write an every-lang library which implements every piece of Lua syntax and grammar on a meta-program level. A user that imports this library could then simply write code in Lua, and compile the program using the every-lang compiler. This library would effectively be a Lua build system, that is also an every-lang build system, that is also an "every language" build system.
-- A Lua program? or an every-lang program?
require "everylang"
function fact(n)
  if n == 0 then
    return 1
  else
    return n * fact(n-1)
  end
end
print(fact(5))
on crashing and burning
This every-lang is, to put it lightly, a little far-fetched. Such a language would be nearly impossible to implement or reason about. A practically useful language must make compromises with its users and with the fundamental laws of computation. I believe that the next frontier for language design will be pushing the boundary on this front: how close can we get to every-lang without crashing and burning?