postgresql/src/backend/jit/README

What is Just-in-Time Compilation?
=================================

Just-in-Time compilation (JIT) is the process of turning some form of
interpreted program evaluation into a native program, and doing so at
runtime.

For example, instead of using a facility that can evaluate arbitrary
SQL expressions to evaluate an SQL predicate like WHERE a.col = 3, it
is possible to generate a function than can be natively executed by
the CPU that just handles that expression, yielding a speedup.

This is JIT, rather than ahead-of-time (AOT) compilation, because it
is done at query execution time, and perhaps only in cases where the
relevant task is repeated a number of times. Given the way JIT
compilation is used in PostgreSQL, the lines between interpretation,
AOT and JIT are somewhat blurry.

Note that the interpreted program turned into a native program does
not necessarily have to be a program in the classical sense. E.g. it
is highly beneficial to JIT compile tuple deforming into a native
function just handling a specific type of table, despite tuple
deforming not commonly being understood as a "program".


Why JIT?
========

Parts of PostgreSQL are commonly bottlenecked by comparatively small
pieces of CPU intensive code. In a number of cases that is because the
relevant code has to be very generic (e.g. handling arbitrary SQL
level expressions, over arbitrary tables, with arbitrary extensions
installed). This often leads to a large number of indirect jumps and
unpredictable branches, and generally a high number of instructions
for a given task. E.g. just evaluating an expression comparing a
column in a database to an integer ends up needing several hundred
cycles.

By generating native code large numbers of indirect jumps can be
removed by either making them into direct branches (e.g. replacing the
indirect call to an SQL operator's implementation with a direct call
to that function), or by removing it entirely (e.g. by evaluating the
branch at compile time because the input is constant). Similarly a lot
of branches can be entirely removed (e.g. by again evaluating the
branch at compile time because the input is constant). The latter is
particularly beneficial for removing branches during tuple deforming.


How to JIT
==========

PostgreSQL, by default, uses LLVM to perform JIT. LLVM was chosen
because it is developed by several large corporations and therefore
unlikely to be discontinued, because it has a license compatible with
PostgreSQL, and because its IR can be generated from C using the Clang
compiler.


Shared Library Separation
-------------------------

To avoid the main PostgreSQL binary directly depending on LLVM, which
would prevent LLVM support being independently installed by OS package
managers, the LLVM dependent code is located in a shared library that
is loaded on-demand.

An additional benefit of doing so is that it is relatively easy to
evaluate JIT compilation that does not use LLVM, by changing out the
shared library used to provide JIT compilation.

To achieve this, code intending to perform JIT (e.g. expression evaluation)
calls an LLVM independent wrapper located in jit.c to do so. If the
shared library providing JIT support can be loaded (i.e. PostgreSQL was
compiled with LLVM support and the shared library is installed), the task
of JIT compiling an expression gets handed off to the shared library. This
obviously requires that the function in jit.c is allowed to fail in case
no JIT provider can be loaded.

Which shared library is loaded is determined by the jit_provider GUC,
defaulting to "llvmjit".

Cloistering code performing JIT into a shared library unfortunately
also means that code doing JIT compilation for various parts of code
has to be located separately from the code doing so without
JIT. E.g. the JIT version of execExprInterp.c is located in jit/llvm/
rather than executor/.


JIT Context
-----------

For performance and convenience reasons it is useful to allow JITed
functions to be emitted and deallocated together. It is e.g. very
common to create a number of functions at query initialization time,
use them during query execution, and then deallocate all of them
together at the end of the query.

Lifetimes of JITed functions are managed via JITContext. Exactly one
such context should be created for work in which all created JITed
function should have the same lifetime. E.g. there's exactly one
JITContext for each query executed, in the query's EState.  Only the
release of a JITContext is exposed to the provider independent
facility, as the creation of one is done on-demand by the JIT
implementations.

Emitting individual functions separately is more expensive than
emitting several functions at once, and emitting them together can
provide additional optimization opportunities. To facilitate that, the
LLVM provider separates defining functions from optimizing and
emitting functions in an executable manner.

Creating functions into the current mutable module (a module
essentially is LLVM's equivalent of a translation unit in C) is done
using
  extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);
in which it then can emit as much code using the LLVM APIs as it
wants. Whenever a function actually needs to be called
  extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
returns a pointer to it.

E.g. in the expression evaluation case this setup allows most
functions in a query to be emitted during ExecInitNode(), delaying the
function emission to the time the first time a function is actually
used.


Error Handling
--------------

There are two aspects of error handling.  Firstly, generated (LLVM IR)
and emitted functions (mmap()ed segments) need to be cleaned up both
after a successful query execution and after an error. This is done by
registering each created JITContext with the current resource owner,
and cleaning it up on error / end of transaction. If it is desirable
to release resources earlier, jit_release_context() can be used.

The second, less pretty, aspect of error handling is OOM handling
inside LLVM itself. The above resowner based mechanism takes care of
cleaning up emitted code upon ERROR, but there's also the chance that
LLVM itself runs out of memory. LLVM by default does *not* use any C++
exceptions. Its allocations are primarily funneled through the
standard "new" handlers, and some direct use of malloc() and
mmap(). For the former a 'new handler' exists:
http://en.cppreference.com/w/cpp/memory/new/set_new_handler
For the latter LLVM provides callbacks that get called upon failure
(unfortunately mmap() failures are treated as fatal rather than OOM errors).
What we've chosen to do for now is have two functions that LLVM using code
must use:
extern void llvm_enter_fatal_on_oom(void);
extern void llvm_leave_fatal_on_oom(void);
before interacting with LLVM code.

When a libstdc++ new or LLVM error occurs, the handlers set up by the
above functions trigger a FATAL error. We have to use FATAL rather
than ERROR, as we *cannot* reliably throw ERROR inside a foreign
library without risking corrupting its internal state.

Users of the above sections do *not* have to use PG_TRY/CATCH blocks,
the handlers instead are reset on toplevel sigsetjmp() level.

Using a relatively small enter/leave protected section of code, rather
than setting up these handlers globally, avoids negative interactions
with extensions that might use C++ such as PostGIS. As LLVM code
generation should never execute arbitrary code, just setting these
handlers temporarily ought to suffice.


Type Synchronization
--------------------

To be able to generate code that can perform tasks done by "interpreted"
PostgreSQL, it obviously is required that code generation knows about at
least a few PostgreSQL types.  While it is possible to inform LLVM about
type definitions by recreating them manually in C code, that is failure
prone and labor intensive.

Instead there is one small file (llvmjit_types.c) which references each of
the types required for JITing. That file is translated to bitcode at
compile time, and loaded when LLVM is initialized in a backend.

That works very well to synchronize the type definition, but unfortunately
it does *not* synchronize offsets as the IR level representation doesn't
know field names.  Instead, required offsets are maintained as defines in
the original struct definition, like so:
#define FIELDNO_TUPLETABLESLOT_NVALID 9
        int                     tts_nvalid;             /* # of valid values in tts_values */
While that still needs to be defined, it's only required for a
relatively small number of fields, and it's bunched together with the
struct definition, so it's easily kept synchronized.


Inlining
--------

One big advantage of JITing expressions is that it can significantly
reduce the overhead of PostgreSQL's extensible function/operator
mechanism, by inlining the body of called functions/operators.

It obviously is undesirable to maintain a second implementation of
commonly used functions, just for inlining purposes. Instead we take
advantage of the fact that the Clang compiler can emit LLVM IR.

The ability to do so allows us to get the LLVM IR for all operators
(e.g. int8eq, float8pl etc), without maintaining two copies.  These
bitcode files get installed into the server's
  $pkglibdir/bitcode/postgres/
Using existing LLVM functionality (for parallel LTO compilation),
additionally an index is over these is stored to
$pkglibdir/bitcode/postgres.index.bc

Similarly extensions can install code into
  $pkglibdir/bitcode/[extension]/
accompanied by
  $pkglibdir/bitcode/[extension].index.bc

just alongside the actual library.  An extension's index will be used
to look up symbols when located in the corresponding shared
library. Symbols that are used inside the extension, when inlined,
will be first looked up in the main binary and then the extension's.


Caching
-------

Currently it is not yet possible to cache generated functions, even
though that'd be desirable from a performance point of view. The
problem is that the generated functions commonly contain pointers into
per-execution memory. The expression evaluation machinery needs to
be redesigned a bit to avoid that. Basically all per-execution memory
needs to be referenced as an offset to one block of memory stored in
an ExprState, rather than absolute pointers into memory.

Once that is addressed, adding an LRU cache that's keyed by the
generated LLVM IR will allow the usage of optimized functions even for
faster queries.

A longer term project is to move expression compilation to the planner
stage, allowing e.g. to tie compiled expressions to prepared
statements.

An even more advanced approach would be to use JIT with few
optimizations initially, and build an optimized version in the
background. But that's even further off.


What to JIT
===========

Currently expression evaluation and tuple deforming are JITed. Those
were chosen because they commonly are major CPU bottlenecks in
analytics queries, but are by no means the only potentially beneficial cases.

For JITing to be beneficial a piece of code first and foremost has to
be a CPU bottleneck. But also importantly, JITing can only be
beneficial if overhead can be removed by doing so. E.g. in the tuple
deforming case the knowledge about the number of columns and their
types can remove a significant number of branches, and in the
expression evaluation case a lot of indirect jumps/calls can be
removed.  If neither of these is the case, JITing is a waste of
resources.

Future avenues for JITing are tuple sorting, COPY parsing/output
generation, and later compiling larger parts of queries.


When to JIT
===========

Currently there are a number of GUCs that influence JITing:

- jit_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost
  get JITed, *without* optimization (expensive part), corresponding to
  -O0. This commonly already results in significant speedups if
  expression/deforming is a bottleneck (removing dynamic branches
  mostly).
- jit_optimize_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost
  get JITed, *with* optimization (expensive part).
- jit_inline_above_cost = -1, 0-DBL_MAX - inlining is tried if query has
  higher cost.

Whenever a query's total cost is above these limits, JITing is
performed.

Alternative costing models, e.g. by generating separate paths for
parts of a query with lower cpu_* costs, are also a possibility, but
it's doubtful the overhead of doing so is sufficient.  Another
alternative would be to count the number of times individual
expressions are estimated to be evaluated, and perform JITing of these
individual expressions.

The obvious seeming approach of JITing expressions individually after
a number of execution turns out not to work too well. Primarily
because emitting many small functions individually has significant
overhead. Secondarily because the time until JITing occurs causes
relative slowdowns that eat into the gain of JIT compilation.