What is Just-in-Time Compilation?
=================================

Just-in-Time compilation (JIT) is the process of turning some form of
interpreted program evaluation into a native program, and doing so at
runtime.

For example, instead of using a facility that can evaluate arbitrary
SQL expressions to evaluate an SQL predicate like WHERE a.col = 3, it
is possible to generate a function that can be natively executed by
the CPU that just handles that expression, yielding a speedup.

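As a rough illustration of the difference, the sketch below contrasts
a generic evaluator with a function specialized for "col = 3". All
names (ToyExpr, toy_eval_generic, toy_eval_col0_eq_3) are hypothetical
and not part of PostgreSQL, and the specialized function is written by
hand here, whereas a JIT would generate equivalent code at runtime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical comparison operator signature. */
typedef bool (*toy_cmp_fn) (int32_t left, int32_t right);

/* Hypothetical, heavily simplified expression: column <op> constant. */
typedef struct ToyExpr
{
    int         column;     /* index of the column to read */
    int32_t     constant;   /* value to compare against */
    toy_cmp_fn  op;         /* operator, reached via an indirect call */
} ToyExpr;

static bool
toy_int4eq(int32_t left, int32_t right)
{
    return left == right;
}

/* Generic path: handles any ToyExpr, at the price of an indirect call. */
static bool
toy_eval_generic(const ToyExpr *expr, const int32_t *row, int ncols)
{
    assert(expr != NULL && row != NULL);
    assert(expr->column >= 0 && expr->column < ncols);
    return expr->op(row[expr->column], expr->constant);
}

/* Specialized path: column, operator and constant are baked in. */
static bool
toy_eval_col0_eq_3(const int32_t *row)
{
    return row[0] == 3;
}
```

Both functions return the same answer for the same row; the
specialized one simply has nothing left to look up or dispatch on.
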
This is JIT, rather than ahead-of-time (AOT) compilation, because it
is done at query execution time, and perhaps only in cases where the
relevant task is repeated a number of times. Given the way JIT
compilation is used in PostgreSQL, the lines between interpretation,
AOT and JIT are somewhat blurry.

Note that the interpreted program turned into a native program does
not necessarily have to be a program in the classical sense. E.g. it
is highly beneficial to JIT compile tuple deforming into a native
function just handling a specific type of table, despite tuple
deforming not commonly being understood as a "program".


Why JIT?
========

Parts of PostgreSQL are commonly bottlenecked by comparatively small
pieces of CPU-intensive code. In a number of cases that is because the
relevant code has to be very generic (e.g. handling arbitrary SQL-level
expressions, over arbitrary tables, with arbitrary extensions
installed). This often leads to a large number of indirect jumps and
unpredictable branches, and generally a high number of instructions
for a given task. E.g. just evaluating an expression comparing a
column in a database to an integer ends up needing several hundred
cycles.

By generating native code, large numbers of indirect jumps can be
removed by either making them into direct branches (e.g. replacing the
indirect call to an SQL operator's implementation with a direct call
to that function), or by removing them entirely (e.g. by evaluating the
branch at compile time because the input is constant). Similarly, a lot
of branches can be entirely removed (e.g. by again evaluating the
branch at compile time because the input is constant). The latter is
particularly beneficial for removing branches during tuple deforming.


How to JIT
==========

PostgreSQL, by default, uses LLVM to perform JIT. LLVM was chosen
because it is developed by several large corporations and therefore
unlikely to be discontinued, because it has a license compatible with
PostgreSQL, and because its IR can be generated from C using the Clang
compiler.


Shared Library Separation
-------------------------

To avoid the main PostgreSQL binary directly depending on LLVM, which
would prevent LLVM support being independently installed by OS package
managers, the LLVM dependent code is located in a shared library that
is loaded on-demand.

An additional benefit of doing so is that it is relatively easy to
evaluate JIT compilation that does not use LLVM, by changing out the
shared library used to provide JIT compilation.

To achieve this, code intending to perform JIT (e.g. expression evaluation)
calls an LLVM independent wrapper located in jit.c to do so. If the
shared library providing JIT support can be loaded (i.e. PostgreSQL was
compiled with LLVM support and the shared library is installed), the task
of JIT compiling an expression gets handed off to the shared library. This
obviously requires that the function in jit.c is allowed to fail in case
no JIT provider can be loaded.

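The wrapper pattern can be sketched roughly as follows. This is a toy
model, not the code in jit.c: ToyJitState, ToyProviderCallbacks and the
toy_* functions are hypothetical names, and the loader is passed in as
a function pointer so the example stays self-contained instead of
calling the dynamic linker. Remembering a failed load so it is not
retried is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback table that a provider library fills in. */
typedef struct ToyProviderCallbacks
{
    bool        (*compile_expr) (void *expr);
} ToyProviderCallbacks;

/* Hypothetical per-backend state of the provider-independent wrapper. */
typedef struct ToyJitState
{
    bool        loaded;     /* provider successfully loaded */
    bool        failed;     /* a load was attempted and failed */
    ToyProviderCallbacks callbacks;
} ToyJitState;

/* Stand-in for loading the shared library and asking it for callbacks. */
typedef bool (*toy_loader) (ToyProviderCallbacks *cb);

static bool
toy_provider_init(ToyJitState *state, toy_loader load)
{
    assert(state != NULL);
    if (state->failed)
        return false;       /* assumption: do not retry a failed load */
    if (state->loaded)
        return true;
    if (load == NULL || !load(&state->callbacks))
    {
        state->failed = true;
        return false;
    }
    state->loaded = true;
    return true;
}

/* Returns false when the caller has to fall back to interpretation. */
static bool
toy_jit_compile_expr(ToyJitState *state, void *expr, toy_loader load)
{
    if (!toy_provider_init(state, load))
        return false;
    return state->callbacks.compile_expr(expr);
}

/* Two sample loaders: one provider that works, one that is missing. */
static bool
toy_compile_always(void *expr)
{
    (void) expr;
    return true;
}

static bool
toy_load_ok(ToyProviderCallbacks *cb)
{
    cb->compile_expr = toy_compile_always;
    return true;
}

static bool
toy_load_missing(ToyProviderCallbacks *cb)
{
    (void) cb;
    return false;
}
```

The important property is that a caller only ever sees a boolean: with
toy_load_missing the wrapper reports failure and the caller keeps
using the interpreted path.
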
Which shared library is loaded is determined by the jit_provider GUC,
defaulting to "llvmjit".

Cloistering code performing JIT into a shared library unfortunately
also means that code doing JIT compilation for various parts of code
has to be located separately from the code doing so without
JIT. E.g. the JIT version of execExprInterp.c is located in jit/llvm/
rather than executor/.


JIT Context
-----------

For performance and convenience reasons it is useful to allow JITed
functions to be emitted and deallocated together. It is e.g. very
common to create a number of functions at query initialization time,
use them during query execution, and then deallocate all of them
together at the end of the query.

Lifetimes of JITed functions are managed via JITContext. Exactly one
such context should be created for work in which all created JITed
functions should have the same lifetime. E.g. there's exactly one
JITContext for each query executed, in the query's EState. Only the
release of a JITContext is exposed to the provider independent
facility, as the creation of one is done on-demand by the JIT
implementations.

Emitting individual functions separately is more expensive than
emitting several functions at once, and emitting them together can
provide additional optimization opportunities. To facilitate that, the
LLVM provider separates defining functions from optimizing and
emitting functions in an executable manner.

Creating functions into the current mutable module (a module
essentially is LLVM's equivalent of a translation unit in C) is done
using
  extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);
in which the caller can then emit as much code using the LLVM APIs as
it wants. Whenever a function actually needs to be called
  extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
returns a pointer to it.

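The split between defining and emitting can be modelled without LLVM.
In the toy sketch below (hypothetical names throughout; ordinary C
function pointers stand in for generated IR), defining a function is
cheap, and the first lookup emits everything defined so far as one
batch. It only illustrates the batching idea and does not reflect how
the LLVM provider is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_FUNCS 8

typedef int (*toy_fn) (int);

typedef struct ToyFunc
{
    const char *name;
    toy_fn      body;       /* stands in for generated IR */
    int         emitted;    /* has "machine code" been produced? */
} ToyFunc;

typedef struct ToyJitContext
{
    ToyFunc     funcs[TOY_MAX_FUNCS];
    int         nfuncs;
    int         emit_batches;   /* how often the expensive step ran */
} ToyJitContext;

/* Define a function in the current "module"; nothing is emitted yet. */
static int
toy_define(ToyJitContext *ctx, const char *name, toy_fn body)
{
    assert(ctx != NULL && name != NULL && body != NULL);
    if (ctx->nfuncs >= TOY_MAX_FUNCS)
        return 0;
    ctx->funcs[ctx->nfuncs].name = name;
    ctx->funcs[ctx->nfuncs].body = body;
    ctx->funcs[ctx->nfuncs].emitted = 0;
    ctx->nfuncs++;
    return 1;
}

/* Look up a function; pending definitions are emitted together first. */
static toy_fn
toy_get_function(ToyJitContext *ctx, const char *name)
{
    int         pending = 0;
    int         i;

    assert(ctx != NULL && name != NULL);
    for (i = 0; i < ctx->nfuncs; i++)
    {
        if (!ctx->funcs[i].emitted)
        {
            ctx->funcs[i].emitted = 1;
            pending = 1;
        }
    }
    if (pending)
        ctx->emit_batches++;

    for (i = 0; i < ctx->nfuncs; i++)
    {
        if (strcmp(ctx->funcs[i].name, name) == 0)
            return ctx->funcs[i].body;
    }
    return NULL;
}

/* Two sample "generated" functions. */
static int
toy_add_one(int x)
{
    return x + 1;
}

static int
toy_double(int x)
{
    return x * 2;
}
```

Defining both sample functions and then looking each of them up runs
the emit step once rather than twice, which is the saving the
paragraph above describes.
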
E.g. in the expression evaluation case this setup allows most
functions in a query to be defined during ExecInitNode(), delaying the
actual function emission until the first time a function is actually
used.


Error Handling
--------------

There are two aspects of error handling. Firstly, generated (LLVM IR)
and emitted functions (mmap()ed segments) need to be cleaned up both
after a successful query execution and after an error. This is done by
registering each created JITContext with the current resource owner,
and cleaning it up on error / end of transaction. If it is desirable
to release resources earlier, jit_release_context() can be used.

The second, less pretty, aspect of error handling is OOM handling
inside LLVM itself. The above resowner based mechanism takes care of
cleaning up emitted code upon ERROR, but there's also the chance that
LLVM itself runs out of memory. LLVM by default does *not* use any C++
exceptions. Its allocations are primarily funneled through the
standard "new" handlers, and some direct use of malloc() and
mmap(). For the former a 'new handler' exists:
  http://en.cppreference.com/w/cpp/memory/new/set_new_handler
For the latter LLVM provides callbacks that get called upon failure
(unfortunately mmap() failures are treated as fatal rather than OOM errors).
What we've chosen to do for now is to have two functions that
LLVM-using code must use:
  extern void llvm_enter_fatal_on_oom(void);
  extern void llvm_leave_fatal_on_oom(void);
before interacting with LLVM code.

When a libstdc++ new or LLVM error occurs, the handlers set up by the
above functions trigger a FATAL error. We have to use FATAL rather
than ERROR, as we *cannot* reliably throw ERROR inside a foreign
library without risking corrupting its internal state.

Users of the above sections do *not* have to use PG_TRY/CATCH blocks;
the handlers are instead reset at the toplevel sigsetjmp() level.

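One way to picture such an enter/leave section is a depth counter that
installs the handler on the outermost enter and restores the previous
one on the outermost leave. The sketch below is a plain C model with
hypothetical names; the nesting behaviour, the single handler slot and
the counting stand-in for a FATAL report are all assumptions made for
illustration, not a description of the LLVM provider's code.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*toy_oom_handler) (void);

/* Stands in for the library-wide handler slot (e.g. a new handler). */
static toy_oom_handler toy_current_handler = NULL;
static toy_oom_handler toy_saved_handler = NULL;
static int  toy_fatal_depth = 0;
static int  toy_fatal_reports = 0;

/* Real code would report FATAL and never return; here we just count. */
static void
toy_fatal_on_oom(void)
{
    toy_fatal_reports++;
}

static void
toy_enter_fatal_on_oom(void)
{
    if (toy_fatal_depth == 0)
    {
        /* outermost section: remember and replace the handler */
        toy_saved_handler = toy_current_handler;
        toy_current_handler = toy_fatal_on_oom;
    }
    toy_fatal_depth++;
}

static void
toy_leave_fatal_on_oom(void)
{
    assert(toy_fatal_depth > 0);
    toy_fatal_depth--;
    if (toy_fatal_depth == 0)
        toy_current_handler = toy_saved_handler;
}

/* What a toplevel error-recovery path could call instead of PG_TRY. */
static void
toy_reset_after_error(void)
{
    if (toy_fatal_depth != 0)
    {
        toy_current_handler = toy_saved_handler;
        toy_fatal_depth = 0;
    }
}
```

Because the reset happens in one central place, code inside a
protected section that is interrupted by an error does not need its
own cleanup block to restore the handler.
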
Using a relatively small enter/leave protected section of code, rather
than setting up these handlers globally, avoids negative interactions
with extensions that might use C++ such as PostGIS. As LLVM code
generation should never execute arbitrary code, just setting these
handlers temporarily ought to suffice.


Type Synchronization
--------------------

To be able to generate code that can perform tasks done by "interpreted"
PostgreSQL, it obviously is required that code generation knows about at
least a few PostgreSQL types. While it is possible to inform LLVM about
type definitions by recreating them manually in C code, that is
error-prone and labor intensive.

Instead there is one small file (llvmjit_types.c) which references each of
the types required for JITing. That file is translated to bitcode at
compile time, and loaded when LLVM is initialized in a backend.

That works very well to synchronize the type definition, but unfortunately
it does *not* synchronize offsets as the IR level representation doesn't
know field names. Instead, required offsets are maintained as defines in
the original struct definition, like so:
  #define FIELDNO_TUPLETABLESLOT_NVALID 9
      int         tts_nvalid;     /* # of valid values in tts_values */
While that still needs to be defined, it's only required for a
relatively small number of fields, and it's bunched together with the
struct definition, so it's easily kept synchronized.


Inlining
--------

One big advantage of JITing expressions is that it can significantly
reduce the overhead of PostgreSQL's extensible function/operator
mechanism, by inlining the body of called functions/operators.

It obviously is undesirable to maintain a second implementation of
commonly used functions, just for inlining purposes. Instead we take
advantage of the fact that the Clang compiler can emit LLVM IR.

The ability to do so allows us to get the LLVM IR for all operators
(e.g. int8eq, float8pl etc), without maintaining two copies. These
bitcode files get installed into the server's
  $pkglibdir/bitcode/postgres/
Using existing LLVM functionality (for parallel LTO compilation),
an index over these is additionally stored in
  $pkglibdir/bitcode/postgres.index.bc

Similarly extensions can install code into
  $pkglibdir/bitcode/[extension]/
accompanied by
  $pkglibdir/bitcode/[extension].index.bc

just alongside the actual library. An extension's index will be used
to look up symbols when located in the corresponding shared
library. Symbols that are used inside the extension, when inlined,
will be first looked up in the main binary and then the extension's.


Caching
-------

Currently it is not yet possible to cache generated functions, even
though that'd be desirable from a performance point of view. The
problem is that the generated functions commonly contain pointers into
per-execution memory. The expression evaluation machinery needs to
be redesigned a bit to avoid that. Basically all per-execution memory
needs to be referenced as an offset to one block of memory stored in
an ExprState, rather than absolute pointers into memory.

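The difference between the two addressing schemes can be shown with a
small C sketch. ToyExecState and the ToyStep* types are hypothetical
stand-ins for per-execution memory and a generated function's
references into it; nothing here mirrors the actual ExprState layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-execution block; every execution gets a fresh one. */
typedef struct ToyExecState
{
    int32_t     input;
    int32_t     result;
} ToyExecState;

/* Not reusable: remembers absolute addresses belonging to one execution. */
typedef struct ToyStepAbsolute
{
    int32_t    *input;
    int32_t    *result;
} ToyStepAbsolute;

/* Reusable: only stores offsets into whichever block is passed in. */
typedef struct ToyStepRelative
{
    size_t      input_off;
    size_t      result_off;
} ToyStepRelative;

static void
toy_run_absolute(const ToyStepAbsolute *step)
{
    assert(step != NULL);
    *step->result = *step->input + 1;
}

static void
toy_run_relative(const ToyStepRelative *step, void *base)
{
    int32_t     in;
    int32_t     out;

    assert(step != NULL && base != NULL);
    memcpy(&in, (char *) base + step->input_off, sizeof(in));
    out = in + 1;
    memcpy((char *) base + step->result_off, &out, sizeof(out));
}
```

A ToyStepRelative built once with offsetof() works against any number
of ToyExecState blocks, whereas a ToyStepAbsolute keeps reading and
writing the block it was created for, which is why code of the latter
kind cannot simply be cached and reused.
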
Once that is addressed, adding an LRU cache that's keyed by the
generated LLVM IR will allow the usage of optimized functions even for
faster queries.

A longer term project is to move expression compilation to the planner
stage, allowing e.g. to tie compiled expressions to prepared
statements.

An even more advanced approach would be to use JIT with few
optimizations initially, and build an optimized version in the
background. But that's even further off.


What to JIT
===========

Currently expression evaluation and tuple deforming are JITed. Those
were chosen because they commonly are major CPU bottlenecks in
analytics queries, but are by no means the only potentially beneficial cases.

For JITing to be beneficial a piece of code first and foremost has to
be a CPU bottleneck. But also importantly, JITing can only be
beneficial if overhead can be removed by doing so. E.g. in the tuple
deforming case the knowledge about the number of columns and their
types can remove a significant number of branches, and in the
expression evaluation case a lot of indirect jumps/calls can be
removed. If neither of these is the case, JITing is a waste of
resources.

Future avenues for JITing are tuple sorting, COPY parsing/output
generation, and later compiling larger parts of queries.


When to JIT
===========

Currently there are a number of GUCs that influence JITing:

- jit_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost
  get JITed, *without* optimization (expensive part), corresponding to
  -O0. This commonly already results in significant speedups if
  expression/deforming is a bottleneck (removing dynamic branches
  mostly).
- jit_optimize_above_cost = -1, 0-DBL_MAX - all queries with a higher
  total cost get JITed, *with* optimization (expensive part).
- jit_inline_above_cost = -1, 0-DBL_MAX - inlining is tried if the query
  has a higher cost.

Whenever a query's total cost is above these limits, JITing is
performed.

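The threshold logic can be sketched as a small decision function. The
flag names and ToyJitSettings are hypothetical. Treating -1 as
"disabled" and only considering optimization and inlining once
jit_above_cost itself has been exceeded are assumptions of this sketch
rather than statements taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits describing what should be done for a query. */
#define TOY_JIT_NONE        0
#define TOY_JIT_PERFORM     (1 << 0)
#define TOY_JIT_OPTIMIZE    (1 << 1)
#define TOY_JIT_INLINE      (1 << 2)

/* Hypothetical container for the three cost GUCs. */
typedef struct ToyJitSettings
{
    double      jit_above_cost;
    double      jit_optimize_above_cost;
    double      jit_inline_above_cost;
} ToyJitSettings;

static int
toy_jit_flags(const ToyJitSettings *s, double total_cost)
{
    int         flags = TOY_JIT_NONE;

    assert(s != NULL);

    /* assumption: a negative threshold disables the feature */
    if (s->jit_above_cost < 0 || total_cost <= s->jit_above_cost)
        return flags;
    flags |= TOY_JIT_PERFORM;

    if (s->jit_optimize_above_cost >= 0 &&
        total_cost > s->jit_optimize_above_cost)
        flags |= TOY_JIT_OPTIMIZE;

    if (s->jit_inline_above_cost >= 0 &&
        total_cost > s->jit_inline_above_cost)
        flags |= TOY_JIT_INLINE;

    return flags;
}
```

With sample thresholds of 100000, 500000 and 500000, a query costing
200000 would be JITed without optimization or inlining, while one
costing 600000 would get all three.
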
Alternative costing models, e.g. by generating separate paths for
parts of a query with lower cpu_* costs, are also a possibility, but
it's doubtful the benefit would justify the overhead of doing so.
Another alternative would be to count the number of times individual
expressions are estimated to be evaluated, and perform JITing of these
individual expressions.

The obvious seeming approach of JITing expressions individually after
a number of executions turns out not to work too well, primarily
because emitting many small functions individually has significant
overhead, and secondarily because the time until JITing occurs causes
relative slowdowns that eat into the gain of JIT compilation.