Codegen in Databases

Just-in-time compilation is usually associated with managed languages like Java and C#, or scripting languages like JavaScript. As detailed in the previous post, many other applications benefit from ad hoc code generation as well. This post is a tutorial on code generation in relational databases. Relational databases are commonly accessed with the standardized query language SQL. The query optimizer generates an optimized query plan and passes it to the query execution engine for processing. In modern systems, the execution engine generates machine code for faster execution.

SIGMOD Programming Contest 2018

SIGMOD is one of the most popular database conferences. Each year, a programming contest is held, and the top finalists present their implementations at the venue. The contest is all about performance: the submissions are ranked by their evaluation time, and the winner with the fastest program usually organizes the contest in the following year.

In 2018, the task was to write a prototype of a query execution engine. The queries are quite simple: they consist only of joins with an equivalence predicate (equi-joins) and simple filters (a single predicate, no complex expressions). Furthermore, there is only one data type: an 8-byte integer. This limits the number of operators the execution engine must support considerably, which makes it a nice task for a programming contest, and for a tutorial such as this one.

The queries come in an easy-to-parse text format. Let’s have a look at an example query.

Example query: 0 2 4|0.1=1.2&1.0=2.1&0.1>3000|0.0 1.1

There are three parts separated by |:

  1. The list of relations participating in the query.
  2. All equi-joins and filters combined in one conjunction.
  3. The list of columns to project on. The projection implicitly sums up all elements and returns a single tuple containing the sum of each projected column.

In SQL, it would look like the following query.

SELECT SUM("0".c0), SUM("1".c1)
FROM r0 "0", r2 "1", r4 "2"
WHERE "0".c1="1".c2 and "1".c0="2".c1 and "0".c1>3000

The implicit sum in the projection is quite interesting: it means that the result set is always a single tuple. If we can keep the number of intermediate results as low as possible before condensing them to a single list of sums in the projection, we are golden. With code generation, we can avoid most of the intermediate results by just keeping track of the position in the table (the row id). As we will see in the next section, additional optimizations eliminate the overhead inherent to interpretation, resulting in a small number of instructions doing only the necessary work.

Many developers perceive code generation as very complex and hard to grasp for anyone who is not a compiler developer. I argue it is not that hard if one gets into the right mindset and uses the right tools. I hope that you will agree with me by the end of this tutorial.

Data-centric Execution Pipeline

We follow the approach of data-centric code generation introduced in HyPer with the paper “Efficiently Compiling Efficient Query Plans for Modern Hardware” by Thomas Neumann. Check out the list of publications on the HyPer website if you want to dive deeper into the topic.

We focus purely on the query execution engine in this tutorial. A database performs a lot of optimizations beforehand, during query planning. The order in which operators are executed, especially the join order, has a large impact on the query execution time. For simplicity, we assume that planning is already done and returns a left-deep query plan with one long operator pipeline, from a scan all the way to the projection.

[Figure: a left-deep query plan of the example query (left) and the resulting operator pipeline (right): Scan R → Filter σ R1>3000 → Hashjoin R1=S2 → Hashjoin S0=T1 → Projection/Sum π R0,S1]

The figure shows a left-deep query plan of the example query on the left. For clarity, the participating tables were renamed to R, S and T. The long operator pipeline is marked in red. The sequence of operators in the pipeline is illustrated on the right, from bottom to top. The results of an operator are pushed into the next operator until the projection is reached, where they are aggregated into one sum per projected column, here R0 and S1.

Our goal is to generate code to execute the pipeline. We gracefully ignore the build phase of the hashjoins, which constructs the hashtables, and assume that it is already taken care of. The following handwritten C++ code serves as the template for the code generation, illustrating how the operators are merged into a single function.

// result variables for projection
uint64_t num=0, p0=0, p1=0; // 64-bit sums, the columns hold 8-byte values

// scan of R, loop over all row ids
for(uint32_t R_index=0; R_index<R.size; ++R_index){

	// filter on R1
	if(R1[R_index] > 3000){

		// hashjoin R1=S2, prepared hashtable of S2 containing row ids of S
		auto [it,itend] = HT_S2.equal_range(R1[R_index]);
		// loop over all join partners
		for(; it!=itend; it++){
			// set row id for this join partner
			uint32_t S_index = *it;

			// hash join S0=T1
			auto [it2,itend2] = HT_T1.equal_range(S0[S_index]);
			for(; it2!=itend2; it2++){
				// row id is unused, semi-join
				uint32_t T_index = *it2;

				// projection
				++num;             // number of rows in aggregate
				p0 += R0[R_index]; // first sum
				p1 += S1[S_index]; // second sum
			}
		}
	}
}

At the beginning, we declare and initialize the result variables of the projection. The first operator is a scan of the table R. As we want to keep the number of intermediate results as low as possible, we simply loop over all row ids of the table and only access the elements of the table when needed.

Next, we apply the filter on R1. The tables are in columnar storage, i.e., a table is a collection of columns, and each column is an array of integers. We access the column R1 at the row id coming from the scan and compare the value with the filter predicate.

Afterwards, we join tables R and S on R1 being equal to S2. The hashtable HT_S2 was constructed beforehand, in the build phase. It is a multimap which maps each distinct value of S2 to the list of row ids where the value appears in the column. We probe the hashtable with the current value of R1 to get the row ids of the join partners in S and loop over all join partners, considering them one after the other. The next hashjoin works in the same manner.

Finally, in the innermost loop, we have the code for the implicit sum of the projection. We simply access each projected column at the current row id of the corresponding table and add the value to the sum. Additionally, we keep track of the number of values added to the sum to distinguish an empty result set from a sum which is equal to zero.

As we can see, the operators boil down to just a few statements and are seamlessly nested into each other. There are no operator boundaries which would incur call overhead; the code is completely inlined. Furthermore, the use of row ids to avoid intermediate results feels quite natural. The code is clean and easy to follow.

In the next section, we will discuss some variants of code generation before jumping into the actual implementation.

Variants of Code Generation

There are multiple ways to generate code in a database. They differ in which language or tool we target, which in turn does the heavy lifting of generating the native machine code for the CPU. The following figure summarizes the alternatives.

[Figure: variants of code generation in a database, targeting different stages of a compiler framework: a) source code (e.g., C++, Java, Fortran) into a front end, b) IR into the middle end (program analysis, optimizations), c) IR directly into a back end (e.g., x86, MIPS, SPARC), d) assembly instructions, using only the back end’s register allocator to produce native code]

Modern compiler frameworks like LLVM are modular, consisting of three main components: language front ends which parse source code, the middle end containing most optimizations, and multiple back ends supporting various architectures. Each component can be targeted by the code generation of the database.

a) We can generate source code in a high-level programming language like C++. The result would look similar to the example source code in the previous section. The main advantage is ease of use: developers are familiar with source code, and it is not very complex to generate source code for each operator in a nested fashion. The main disadvantage is the high compilation latency. The generated source code must pass through all stages of the compiler before it is ready for execution. For short-running queries, compilation takes far longer than evaluating the query would take, even with interpretation.

b) We can reduce the compilation overhead by skipping the language front end and passing the intermediate representation (IR) directly to the compiler. We still leverage all the optimizations the compiler has in store for us, but IR is quite complex to generate. The IR of the LLVM compiler framework resembles an architecture-agnostic assembly language in static single assignment (SSA) form, which has useful properties for the optimization passes but is not easy to write.

c) We can shed even more compilation overhead by sending the IR straight to the back end and doing an unoptimized build. Depending on the query, this can have a large negative impact on the execution time of the generated code.

d) The last variant skips most of the compiler stages by generating assembly instructions for the target architecture directly. It only leverages the register allocator of the back end, which assigns virtual registers to physical registers. The use of virtual registers makes the generation of nested code fragments much easier. Nevertheless, we are writing assembly, which is a tedious and error-prone process, and not portable at all.

To make our lives easier, we will use COAT, an EDSL for C++ which simplifies the implementation of code generation. For details, see the previous post introducing it. Variants b) and c) are available with COAT’s LLVM backend, variant d) with the AsmJit backend. The complexity is hidden from sight behind the EDSL, and we reap the benefits without sacrificing ease of use.

Morsel-Driven Parallelism

The parallelization scheme we will use is very simple: we partition the input of the pipeline into equal-sized batches, called morsels. Each morsel is independent of every other morsel and is processed in parallel without any synchronization. When a morsel is fully processed, the resulting sums are added atomically to global sums which store the result of the whole query.

This parallelization scheme is straightforward and only changes the code generation of the scan operator slightly, as we will see in the next section.
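
To make the scheme concrete, here is a sketch of the morsel dispatch around the generated pipeline function. It is not the contest code: the function pointer signature, the morsel size and the helper names are assumptions for illustration, matching the generated function we build in the next section (lower bound, upper bound and result pointer as arguments, tuple count as return value).

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// assumed signature of the generated pipeline function: processes row ids
// in [lower, upper), writes the partial sums to `sums`, returns tuple count
using pipeline_fn = uint64_t (*)(uint64_t lower, uint64_t upper, uint64_t *sums);

void runMorsels(pipeline_fn fn, uint64_t table_size, size_t num_sums,
                std::vector<std::atomic<uint64_t>> &global_sums,
                std::atomic<uint64_t> &global_count){
	const uint64_t morsel_size = 16384; // hypothetical batch size
	std::atomic<uint64_t> next{0};      // next unprocessed row id
	auto worker = [&]{
		std::vector<uint64_t> local_sums(num_sums);
		uint64_t lower;
		// grab morsels until the input is exhausted; no other synchronization
		while((lower = next.fetch_add(morsel_size)) < table_size){
			uint64_t upper = std::min(lower + morsel_size, table_size);
			std::fill(local_sums.begin(), local_sums.end(), 0);
			uint64_t count = fn(lower, upper, local_sums.data());
			// merge the morsel result atomically into the global result
			global_count += count;
			for(size_t i=0; i<num_sums; ++i) global_sums[i] += local_sums[i];
		}
	};
	std::vector<std::thread> threads;
	unsigned n = std::max(1u, std::thread::hardware_concurrency());
	for(unsigned i=0; i<n; ++i) threads.emplace_back(worker);
	for(auto &t : threads) t.join();
}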

Operator Implementations

In this section, I will explain how the code generation is done in the operators. The whole project is available in a git repository on GitHub. Check out the code to see all the details.

To recap, the programming contest of 2018 requires only a few operators: scan, filter, equi-join and projection/sum. Moreover, there is only one data type we have to support. Everything is an integer. A pipeline is a sequence of operators. For this workload, the pipeline always starts with a scan and ends in a projection.

For simplicity, we link all operators together in a linked list. Each operator, except the projection of course, calls the next operator in the list at the code position where the nested code should be placed. Code can be generated before and after the code of the nested operators.

Here’s a simple abstract class we use as the base class for all operators.

class Operator{
protected:
	Operator *next=nullptr;

public:
	Operator(){}
	virtual ~Operator(){
		delete next;
	}

	void setNext(Operator *next){
		this->next = next;
	}

	// code generation with coat, for each backend, chosen at runtime
	virtual void codegen(Fn_asmjit&, CodegenContext<Fn_asmjit>&)=0;
	virtual void codegen(Fn_llvmjit&, CodegenContext<Fn_llvmjit>&)=0;
};

There are two pure virtual member functions codegen: one for the LLVM backend and one for the AsmJit backend. CodegenContext is a struct holding information each operator can use when generating code, e.g., a list of row ids, one for each participating table.
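
Conceptually, the context could look like the following sketch. The member names match their use in the operators below, but the types here are simplified stand-ins: the real struct in the repository holds COAT virtual registers, not plain integers.

#include <cstdint>
#include <tuple>
#include <vector>

// Conceptual sketch only; in the real CodegenContext these members are
// COAT virtual registers, simplified to plain types for illustration.
using Value = uint64_t;  // stand-in for a coat::Value
using Ptr   = uint64_t*; // stand-in for a COAT pointer wrapper

struct CodegenContextSketch {
	// arguments of the generated function: morsel lower bound,
	// morsel upper bound, and the address to store the result sums to
	std::tuple<Value, Value, Ptr> arguments;
	std::vector<Value> rowids;  // current row id, one per participating table
	std::vector<Value> results; // running sum, one per projected column
	Value amount;               // number of tuples added to the sums
};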

The scan operator is called first. It iterates over all row ids in the morsel.

class ScanOperator final : public Operator{
private:
	template<class Fn>
	void codegen_impl(Fn &fn, CodegenContext<Fn> &ctx){
		// get lower bound of morsel
		// do not make a copy, just take the virtual register from arguments
		ctx.rowids[0] = std::move(std::get<0>(ctx.arguments));
		// get upper bound of morsel
		auto &upper = std::get<1>(ctx.arguments);
		// iterate over morsel
		coat::do_while(fn, [&]{
			// call next operator
			next->codegen(fn, ctx);
			// next row id
			++ctx.rowids[0];
		}, ctx.rowids[0] < upper);
	}

public:
	ScanOperator(const Relation &relation) {}

	void codegen(Fn_asmjit &fn, CodegenContext<Fn_asmjit> &ctx) override { codegen_impl(fn, ctx); }
	void codegen(Fn_llvmjit &fn, CodegenContext<Fn_llvmjit> &ctx) override { codegen_impl(fn, ctx); }
};

Both COAT backends share the same operator implementation in codegen_impl. All other operators follow the same pattern, which is why only codegen_impl is shown from now on.

First, we get the start value of the morsel from the first argument and reuse it as the row id of the first table. Next, we get the upper bound of the morsel from the second argument. Finally, we use coat::do_while() to generate a loop which iterates as long as there are row ids left in the morsel.

The lambda we pass is used as the loop body. It calls codegen on the next operator to insert the nested code into the loop body, and then increments the row id of the morsel.

That’s all there is to a scan operator. The filter operator is straightforward as well.

template<class Fn>
void codegen_impl(Fn &fn, CodegenContext<Fn> &ctx){
	// read from column, depends on column type
	auto val = loadValue(fn, column, ctx.rowids[relid]);
	// conditionally generate code depending on comparison type
	switch(comparison){
		case Filter::Comparison::Less: {
			coat::if_then(fn, val < constant, [&]{
				next->codegen(fn, ctx);
			});
			break;
		}
		case Filter::Comparison::Greater: {
			coat::if_then(fn, val > constant, [&]{
				next->codegen(fn, ctx);
			});
			break;
		}
		case Filter::Comparison::Equal: {
			coat::if_then(fn, val == constant, [&]{
				next->codegen(fn, ctx);
			});
			break;
		}
	}
}

First, we read the value from the filtered column at the position of the current row id of its table. Next, depending on the comparison type of the filter, we generate code with the corresponding condition by calling coat::if_then(). The lambda we pass is the then-branch; it just calls the next operator to insert its code inside the then-branch.

For reading the value, we use the helper function loadValue. It generates slightly different code to fetch the value depending on the type of the column. During data loading, columns with small values were compacted to a denser 32-bit or 16-bit representation instead of the initial 64 bits. Thus, we have three column types to handle. More advanced compression schemes could be added as well.

template<class Fn, class CC>
coat::Value<CC,uint64_t> loadValue(Fn &fn, const column_t &col, coat::Value<CC,uint64_t> &idx){
	coat::Value<CC, uint64_t> loaded(fn, "loaded");
	switch(col.index()){
		case 0: {
			auto vr_col = fn.embedValue(std::get<uint64_t*>(col), "col");
			// fetch 64 bit value from column
			loaded = vr_col[idx];
			break;
		}
		case 1: {
			auto vr_col = fn.embedValue(std::get<uint32_t*>(col), "col");
			// fetch 32 bit value from column and extend to 64 bit
			loaded.widen(vr_col[idx]);
			break;
		}
		case 2: {
			auto vr_col = fn.embedValue(std::get<uint16_t*>(col), "col");
			// fetch 16 bit value from column and extend to 64 bit
			loaded.widen(vr_col[idx]);
			break;
		}

		default:
			fprintf(stderr, "unknown type in column_t: %lu\n", col.index());
			abort();
	}
	return loaded;
}

Similarly, the equi-join operator is quite simple. Most of the work happens in the “hash table”, which is actually a concise array table (similar to CSR, for the graph people among us).

template<class Fn>
void codegen_impl(Fn &fn, CodegenContext<Fn> &ctx){
	// fetch value from probed column
	auto val = loadValue(fn, probeColumn, ctx.rowids[probeRelation]);
	// embed pointer to hashtable in the generated code
	auto ht = fn.embedValue(hashtable, "hashtable");
	// iterate over all join partners
	ht.iterate(val, [&](auto &ele){
		// set rowid of joined relation
		ctx.rowids[buildRelation] = ele;
		next->codegen(fn, ctx);
	});
}

First, we fetch the value from the probed column. Next, we embed the pointer to the “hash table” into the generated code and get back the coat::Struct object ht, initialized with the pointer. The address the pointer holds is stored in the generated code as an immediate value. coat::Struct is a wrapper for a pointer to a struct/class and provides access to member variables when they are marked in a special way.

template<typename T>
class MultiArrayTable final {

#define MEMBERS(x) \
	x(T, min) \
	x(T, max) \
	x(T*, offsets) \
	x(T*, rows)

DECLARE_PRIVATE(MEMBERS)
#undef MEMBERS

	// ... construction code and further members omitted
};

Yes, macros. C++ lacks compile-time reflection, so COAT uses macros to add a bit of metadata, making the list of members and their types available to other templates like coat::Struct. This way, we get access to each member from the generated code.

The heavy lifting in the equi-join operator is done by ht.iterate(). It is a custom function added to the wrapper object with CRTP. Hence, the implementation details of the “hash table” can be encapsulated inside its header file, even for code generation. As users of the data structure, we do not have to know how to iterate over all entries with the same key; we use the provided convenience function. For details, check out the file include/MultiArrayTable.h.
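
To get an intuition for what ht.iterate() generates code for, here is the lookup logic as a plain C++ sketch, inferred from the member list above (min, max, offsets, rows). It is neither the generated code nor the exact repository implementation.

#include <cstdint>

// Plain C++ sketch of a concise array table: a dense mapping from each key
// in [min, max] to its list of row ids, with offsets delimiting the lists,
// similar to row offsets in CSR.
struct ConciseArrayTableSketch {
	uint64_t min, max; // smallest and largest key in the build column
	uint64_t *offsets; // offsets[k-min]..offsets[k-min+1] delimit key k's rows
	uint64_t *rows;    // concatenated row id lists of all keys

	template<typename Callback>
	void iterate(uint64_t key, Callback cb) const {
		if(key < min || key > max) return; // no join partner
		uint64_t idx = key - min;
		for(uint64_t pos = offsets[idx]; pos < offsets[idx+1]; ++pos)
			cb(rows[pos]); // invoke callback with each matching row id
	}
};

The build phase fills offsets and rows once; probing is then just two array lookups per key, which is why the generated loop stays so small.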

Just for fun, I implemented two additional join variants: a join on a unique column using a simple array table, and a semi-join using a bitset to check whether a join partner exists. You can see this as a form of strength reduction: we replace the generic join operator with cheaper versions when the data allows it. The result is a minor speedup. Follow the links and check out the code if you are curious; a sketch of the bitset variant follows.
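
The bitset variant reduces the probe to a single bit test. The following sketch is an assumption of how such a table could look; the actual code is in the repository.

#include <cstdint>
#include <vector>

// Hedged sketch of a bitset-based semi-join table: one bit per possible
// key in [min, max], set if the key occurs in the build column.
struct BitsetTableSketch {
	uint64_t min, max;          // key range of the build column
	std::vector<uint64_t> bits; // (max - min + 64) / 64 words

	bool contains(uint64_t key) const {
		if(key < min || key > max) return false; // no join partner
		uint64_t idx = key - min;
		return (bits[idx / 64] >> (idx % 64)) & 1;
	}
};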

The last operator in each pipeline is the projection operator.

template<class Fn>
void codegen_impl(Fn &fn, CodegenContext<Fn> &ctx){
	// iterate over all projected columns
	for(size_t i=0; i<size; ++i){
		auto [column, relid] = projections[i];
		// fetch value from projected column
		auto val = loadValue(fn, *column, ctx.rowids[relid]);
		// add value to implicit sum on the projected column
		ctx.results[i] += val;
	}
	// count number of elements in sum
	++ctx.amount;
}

For each projected column, we fetch the value and add it to the implicit sum on this projected column. Additionally, we count the number of elements we have in the sum. This counter is used to distinguish an empty result set from a sum which just happens to be zero.

To return the results to the caller of the generated function, we must store the sums to memory. After all operators in the pipeline have generated their code, the following function is called to generate an epilogue.

template<class Fn>
void codegen_save_impl(Fn &fn, CodegenContext<Fn> &ctx){
	// get memory location to store result tuple to from an argument
	auto &projaddr = std::get<2>(ctx.arguments);
	// iterate over all projected columns
	for(size_t i=0; i<size; ++i){
		// store sum to memory
		projaddr[i] = ctx.results[i];
	}
}

It stores the sums to contiguous memory. The pointer to this memory was passed as an argument to the generated function. The counter for the number of elements in the projection is returned to the caller as the return value of the generated function.

And, that’s it! Code generation can be that easy.

Experimental Evaluation

If you build the prototype and run the workload provided by the contest, you will see that execution with the AsmJit backend is much faster than with the LLVM backend. Compilation latency makes the difference here: the workload consists of many short-running queries which benefit from the very low compilation latency of AsmJit. Even disabling optimizations in LLVM does not close the gap in compilation latency, while the execution time suffers considerably from it.

Here are some measurements from my workstation. The numbers are sums over all queries in the workload. Depending on your machine, you might get quite different numbers, but the trend should be similar.

Back End      Compilation Latency   Execution Time
AsmJit        5 ms                  550 ms
LLVM -O0      341 ms                768 ms
LLVM -O1      650 ms                513 ms
LLVM -O2      663 ms                521 ms
LLVM -O3      660 ms                567 ms

In database research, using LLVM as a JIT compiler seems to be the standard. JIT assemblers are usually dismissed as too hard to use and as producing inefficient code; furthermore, they are not portable. Well, COAT makes them easy to use, and as the results show, the code efficiency is not bad either. Portability remains an issue. I still like to have them on the table as a baseline, especially as a lower bound for compilation latency. A JIT assembler shows me what is possible with a simple 1:1 mapping of C++ expressions to assembly instructions, and how much a JIT compiler can improve on that if I invest time in optimizations. An unoptimized build with LLVM does not provide the same lower bound.

These results should be taken with a grain of salt: this is just a simple prototype running a single workload. It remains to be shown whether a JIT assembler like AsmJit is a viable alternative to LLVM for short-running queries in a more realistic setting. COAT makes it easily accessible, so why not try it out?

Conclusion

I hope this tutorial was informative and not too long. Code generation comes with a small learning curve: one must get into the right mindset of writing code now which generates code that is executed later, and keep in mind what runs when. COAT helps us concentrate on the control flow of the generated code by hiding the gritty details of the compiler APIs. It streamlines the process to the point where code generation is not hard anymore.

In databases, more and more researchers and companies are adopting JIT compilation to squeeze more performance out of the hardware. Now is a good time to jump in.

References