src/backend/utils/mmgr/README

Notes About Memory Allocation Redesign
======================================

Up through version 7.0, Postgres had serious problems with memory leakage
during large queries that process a lot of pass-by-reference data. There
was no provision for recycling memory until end of query. This needed to be
fixed, even more so with the advent of TOAST which allows very large chunks
of data to be passed around in the system. This document describes the new
memory management system implemented in 7.1.


Background
----------

We already do most of our memory allocation in "memory contexts", which
are usually AllocSets as implemented by backend/utils/mmgr/aset.c. What
we need to do is create more contexts and define proper rules about when
they can be freed.

The basic operations on a memory context are:

* create a context

* allocate a chunk of memory within a context (equivalent of standard
  C library's malloc())

* delete a context (including freeing all the memory allocated therein)

* reset a context (free all memory allocated in the context, but not the
  context object itself)

Given a chunk of memory previously allocated from a context, one can
free it or reallocate it larger or smaller (corresponding to standard
library's free() and realloc() routines). These operations return memory
to or get more memory from the same context the chunk was originally
allocated in.

At all times there is a "current" context denoted by the
CurrentMemoryContext global variable. The backend macro palloc()
implicitly allocates space in that context. The MemoryContextSwitchTo()
operation selects a new current context (and returns the previous context,
so that the caller can restore the previous context before exiting).

The main advantage of memory contexts over plain use of malloc/free is
that the entire contents of a memory context can be freed easily, without
having to request freeing of each individual chunk within it. This is
both faster and more reliable than per-chunk bookkeeping. We already use
this fact to clean up at transaction end: by resetting all the active
contexts, we reclaim all memory. What we need are additional contexts
that can be reset or deleted at strategic times within a query, such as
after each tuple.
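
As a rough illustration of the operations above, here is a toy model in
standalone C. It is not the aset.c implementation: ToyContext, toy_create,
toy_alloc, toy_reset, toy_delete, toy_switch_to, toy_palloc and toy_demo
are hypothetical names invented for this sketch, and each chunk is simply
malloc'd and linked into its context. The point is only that reset and
delete release every chunk at once, and that the "current" context is
switched and then restored by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical chunk link; the payload follows it in the same malloc block. */
typedef union ToyChunk
{
    union ToyChunk *next;   /* next chunk in the same context */
    max_align_t align;      /* forces worst-case alignment of the payload */
} ToyChunk;

typedef struct ToyContext
{
    ToyChunk *chunks;       /* every chunk allocated in this context */
    size_t nchunks;
} ToyContext;

static ToyContext *toy_current = NULL;  /* plays the role of CurrentMemoryContext */

static ToyContext *toy_create(void)
{
    ToyContext *cxt = calloc(1, sizeof(ToyContext));
    if (cxt == NULL)
        abort();
    return cxt;
}

static void *toy_alloc(ToyContext *cxt, size_t size)
{
    ToyChunk *chunk = malloc(sizeof(ToyChunk) + size);
    if (chunk == NULL)
        abort();            /* this toy never returns NULL */
    chunk->next = cxt->chunks;
    cxt->chunks = chunk;
    cxt->nchunks++;
    return chunk + 1;
}

/* Free all chunks but keep the context object itself. */
static void toy_reset(ToyContext *cxt)
{
    while (cxt->chunks != NULL)
    {
        ToyChunk *next = cxt->chunks->next;
        free(cxt->chunks);
        cxt->chunks = next;
    }
    cxt->nchunks = 0;
}

static void toy_delete(ToyContext *cxt)
{
    toy_reset(cxt);
    free(cxt);
}

/* Select a new current context and hand back the previous one. */
static ToyContext *toy_switch_to(ToyContext *cxt)
{
    ToyContext *old = toy_current;
    toy_current = cxt;
    return old;
}

static void *toy_palloc(size_t size)
{
    assert(toy_current != NULL);
    return toy_alloc(toy_current, size);
}

/* Allocate in a short-lived context, then restore the caller's context. */
static size_t toy_demo(void)
{
    ToyContext *work = toy_create();
    ToyContext *old = toy_switch_to(work);
    size_t before_reset;

    (void) toy_palloc(64);      /* both chunks land in "work" */
    (void) toy_palloc(128);
    before_reset = work->nchunks;

    toy_switch_to(old);         /* restore before exiting */
    toy_reset(work);            /* frees both chunks, keeps the context */
    assert(work->nchunks == 0);
    toy_delete(work);           /* frees the context object too */
    return before_reset;
}
```

Note that toy_demo() never frees the two chunks individually; the single
toy_reset() call is what reclaims them.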


Some Notes About the palloc API Versus Standard C Library
---------------------------------------------------------

The behavior of palloc and friends is similar to the standard C library's
malloc and friends, but there are some deliberate differences too. Here
are some notes to clarify the behavior.

* If out of memory, palloc and repalloc exit via elog(ERROR). They never
  return NULL, and it is not necessary or useful to test for such a result.

* palloc(0) is explicitly a valid operation. It does not return a NULL
  pointer, but a valid chunk of which no bytes may be used. (However, the
  chunk might later be repalloc'd larger; it can also be pfree'd without
  error.) (Note: this behavior is new in Postgres 8.0; earlier versions
  disallowed palloc(0). It seems more consistent to allow it, however.)
  Similarly, repalloc allows realloc'ing to zero size.

* pfree and repalloc do not accept a NULL pointer. This is intentional.
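
The three rules above can be imitated on top of plain malloc, which may
help when comparing the two APIs. This is only a sketch: demo_palloc,
demo_repalloc and demo_pfree are hypothetical names, exit() stands in for
elog(ERROR), and rounding a zero-size request up to one byte is an
assumption of this toy, not a statement about how palloc does it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for elog(ERROR): report and give up instead of returning NULL. */
static void demo_out_of_memory(void)
{
    fprintf(stderr, "out of memory\n");
    exit(EXIT_FAILURE);
}

/* Never returns NULL; a zero-size request yields a valid, freeable chunk. */
static void *demo_palloc(size_t size)
{
    void *p = malloc(size > 0 ? size : 1);
    if (p == NULL)
        demo_out_of_memory();
    return p;
}

/* Unlike realloc(), a NULL input pointer is a caller bug, not a request. */
static void *demo_repalloc(void *p, size_t size)
{
    void *q;

    assert(p != NULL);
    q = realloc(p, size > 0 ? size : 1);
    if (q == NULL)
        demo_out_of_memory();
    return q;
}

/* Unlike free(), a NULL pointer is rejected. */
static void demo_pfree(void *p)
{
    assert(p != NULL);
    free(p);
}
```

Callers of these wrappers need no NULL checks after allocation, which is
the practical consequence of the first rule.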


pfree/repalloc No Longer Depend On CurrentMemoryContext
-------------------------------------------------------

In this proposal, pfree() and repalloc() can be applied to any chunk
whether it belongs to CurrentMemoryContext or not --- the chunk's owning
context will be invoked to handle the operation, regardless. This is a
change from the old requirement that CurrentMemoryContext must be set
to the same context the memory was allocated from before one can use
pfree() or repalloc(). The old coding requirement is obviously fairly
error-prone, and will become more so the more context-switching we do;
so I think it's essential to use CurrentMemoryContext only for palloc.
We can avoid needing it for pfree/repalloc by putting restrictions on
context managers as discussed below.

We could even consider getting rid of CurrentMemoryContext entirely,
instead requiring the target memory context for allocation to be specified
explicitly. But I think that would be too much notational overhead ---
we'd have to pass an appropriate memory context to called routines in
many places. For example, the copyObject routines would need to be passed
a context, as would function execution routines that return a
pass-by-reference datatype. And what of routines that temporarily
allocate space internally, but don't return it to their caller? We
certainly don't want to clutter every call in the system with "here is
a context to use for any temporary memory allocation you might want to
do". So there'd still need to be a global variable specifying a suitable
temporary-allocation context. That might as well be CurrentMemoryContext.


Additions to the Memory-Context Mechanism
-----------------------------------------

If we are going to have more contexts, we need more mechanism for keeping
track of them; else we risk leaking whole contexts under error conditions.

We can do this by creating trees of "parent" and "child" contexts. When
creating a memory context, the new context can be specified to be a child
of some existing context. A context can have many children, but only one
parent. In this way the contexts form a forest (not necessarily a single
tree, since there could be more than one top-level context).

We then say that resetting or deleting any particular context resets or
deletes all its direct and indirect children as well. This feature allows
us to manage a lot of contexts without fear that some will be leaked; we
only need to keep track of one top-level context that we are going to
delete at transaction end, and make sure that any shorter-lived contexts
we create are descendants of that context. Since the tree can have
multiple levels, we can deal easily with nested lifetimes of storage,
such as per-transaction, per-statement, per-scan, per-tuple. Storage
lifetimes that only partially overlap can be handled by allocating
from different trees of the context forest (there are some examples
in the next section).

For convenience we will also want operations like "reset/delete all
children of a given context, but don't reset or delete that context
itself".
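
A minimal sketch of the parent/child bookkeeping, in standalone C, may make
the deletion rule concrete. TreeContext, tree_create, tree_delete,
tree_delete_children and the tree_live counter are hypothetical names for
this illustration; the toy tracks only the context nodes and allocates no
chunks inside them.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct TreeContext
{
    struct TreeContext *parent;      /* NULL for a top-level context */
    struct TreeContext *firstchild;  /* head of the list of children */
    struct TreeContext *nextchild;   /* next child of the same parent */
    const char *name;                /* for debugging only */
} TreeContext;

static int tree_live = 0;            /* number of contexts not yet deleted */

static TreeContext *tree_create(TreeContext *parent, const char *name)
{
    TreeContext *cxt = calloc(1, sizeof(TreeContext));
    if (cxt == NULL)
        abort();
    cxt->parent = parent;
    cxt->name = name;
    if (parent != NULL)
    {
        /* Push onto the front of the parent's child list. */
        cxt->nextchild = parent->firstchild;
        parent->firstchild = cxt;
    }
    tree_live++;
    return cxt;
}

static void tree_delete(TreeContext *cxt);

/* Delete all direct and indirect children, but keep cxt itself. */
static void tree_delete_children(TreeContext *cxt)
{
    /* tree_delete() unlinks the child, so firstchild advances each time. */
    while (cxt->firstchild != NULL)
        tree_delete(cxt->firstchild);
}

/* Delete cxt and everything below it. */
static void tree_delete(TreeContext *cxt)
{
    tree_delete_children(cxt);
    if (cxt->parent != NULL)
    {
        /* Unlink cxt from its parent's child list. */
        TreeContext **link = &cxt->parent->firstchild;
        while (*link != cxt)
            link = &(*link)->nextchild;
        *link = cxt->nextchild;
    }
    free(cxt);
    tree_live--;
}
```

With this shape, code that creates a per-scan or per-tuple context only has
to name the right parent; an error path that deletes the ancestor cleans up
the rest.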


Globally Known Contexts
-----------------------

There will be several widely-known contexts that will typically be
referenced through global variables. At any instant the system may
contain many additional contexts, but all other contexts should be direct
or indirect children of one of these contexts to ensure they are not
leaked in the event of an error.

TopMemoryContext --- this is the actual top level of the context tree;
every other context is a direct or indirect child of this one. Allocating
here is essentially the same as "malloc", because this context will never
be reset or deleted. This is for stuff that should live forever, or for
stuff that the controlling module will take care of deleting at the
appropriate time. An example is fd.c's tables of open files, as well as
the context management nodes for memory contexts themselves. Avoid
allocating stuff here unless really necessary, and especially avoid
running with CurrentMemoryContext pointing here.

PostmasterContext --- this is the postmaster's normal working context.
After a backend is spawned, it can delete PostmasterContext to free its
copy of memory the postmaster was using that it doesn't need. (Anything
that has to be passed from postmaster to backends will be passed in
TopMemoryContext. The postmaster will have only TopMemoryContext,
PostmasterContext, and ErrorContext --- the remaining top-level contexts
will be set up in each backend during startup.)

CacheMemoryContext --- permanent storage for relcache, catcache, and
related modules. This will never be reset or deleted, either, so it's
not truly necessary to distinguish it from TopMemoryContext. But it
seems worthwhile to maintain the distinction for debugging purposes.
(Note: CacheMemoryContext will have child-contexts with shorter lifespans.
For example, a child context is the best place to keep the subsidiary
storage associated with a relcache entry; that way we can free rule
parsetrees and so forth easily, without having to depend on constructing
a reliable version of freeObject().)

MessageContext --- this context holds the current command message from the
frontend, as well as any derived storage that need only live as long as
the current message (for example, in simple-Query mode the parse and plan
trees can live here). This context will be reset, and any children
deleted, at the top of each cycle of the outer loop of PostgresMain. This
is kept separate from per-transaction and per-portal contexts because a
query string might need to live either a longer or shorter time than any
single transaction or portal.

TopTransactionContext --- this holds everything that lives until end of the
top-level transaction. This context will be reset, and all its children
deleted, at conclusion of each top-level transaction cycle. In most cases
you don't want to allocate stuff directly here, but in CurTransactionContext;
what does belong here is control information that exists explicitly to manage
status across multiple subtransactions. Note: this context is NOT cleared
immediately upon error; its contents will survive until the transaction block
is exited by COMMIT/ROLLBACK.

CurTransactionContext --- this holds data that has to survive until the end
of the current transaction, and in particular will be needed at top-level
transaction commit. When we are in a top-level transaction this is the same
as TopTransactionContext, but in subtransactions it points to a child context.
It is important to understand that if a subtransaction aborts, its
CurTransactionContext is thrown away after finishing the abort processing;
but a committed subtransaction's CurTransactionContext is kept until top-level
commit (unless of course one of the intermediate levels of subtransaction
aborts). This ensures that we do not keep data from a failed subtransaction
longer than necessary. Because of this behavior, you must be careful to clean
up properly during subtransaction abort --- the subtransaction's state must be
delinked from any pointers or lists kept in upper transactions, or you will
have dangling pointers leading to a crash at top-level commit. An example of
data kept here is pending NOTIFY messages, which are sent at top-level commit,
but only if the generating subtransaction did not abort.

PortalContext --- this is not actually a separate context, but a
global variable pointing to the per-portal context of the currently active
execution portal. This can be used if it's necessary to allocate storage
that will live just as long as the execution of the current portal requires.

ErrorContext --- this permanent context will be switched into for error
recovery processing, and then reset on completion of recovery. We'll
arrange to have, say, 8K of memory available in it at all times. In this
way, we can ensure that some memory is available for error recovery even
if the backend has run out of memory otherwise. This allows out-of-memory
to be treated as a normal ERROR condition, not a FATAL error.


Contexts For Prepared Statements And Portals
--------------------------------------------

A prepared-statement object has an associated private context, in which
the parse and plan trees for its query are stored. Because these trees
are read-only to the executor, the prepared statement can be re-used many
times without further copying of these trees.

An execution-portal object has a private context that is referenced by
PortalContext when the portal is active. In the case of a portal created
by DECLARE CURSOR, this private context contains the query parse and plan
trees (there being no other object that can hold them). Portals created
from prepared statements simply reference the prepared statements' trees,
and won't actually need any storage allocated in their private contexts.


Transient Contexts During Execution
-----------------------------------

When creating a prepared statement, the parse and plan trees will be built
in a temporary context that's a child of MessageContext (so that it will
go away automatically upon error). On success, the finished plan is
copied to the prepared statement's private context, and the temp context
is released; this allows planner temporary space to be recovered before
execution begins. (In simple-Query mode we'll not bother with the extra
copy step, so the planner temp space stays around till end of query.)

The top-level executor routines, as well as most of the "plan node"
execution code, will normally run in a context that is created by
ExecutorStart and destroyed by ExecutorEnd; this context also holds the
"plan state" tree built during ExecutorStart. Most of the memory
allocated in these routines is intended to live until end of query,
so this is appropriate for those purposes. The executor's top context
is a child of PortalContext, that is, the per-portal context of the
portal that represents the query's execution.

The main improvement needed in the executor is that expression evaluation
--- both for qual testing and for computation of targetlist entries ---
needs to not leak memory. To do this, each ExprContext (expression-eval
context) created in the executor will now have a private memory context
associated with it, and we'll arrange to switch into that context when
evaluating expressions in that ExprContext. The plan node that owns the
ExprContext is responsible for resetting the private context to empty
when it no longer needs the results of expression evaluations. Typically
the reset is done at the start of each tuple-fetch cycle in the plan node.

Note that this design gives each plan node its own expression-eval memory
context. This appears necessary to handle nested joins properly, since
an outer plan node might need to retain expression results it has computed
while obtaining the next tuple from an inner node --- but the inner node
might execute many tuple cycles and many expressions before returning a
tuple. The inner node must be able to reset its own expression context
more often than once per outer tuple cycle. Fortunately, memory contexts
are cheap enough that giving one to each plan node doesn't seem like a
problem.
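
The reset-at-start convention and the need for one context per plan node
can be sketched with a toy in standalone C. Everything here is
hypothetical (Arena, ToyNode, toy_node_next); the fixed-size byte buffer
merely stands in for a per-tuple memory context, and real plan nodes do far
more than format a string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for a per-tuple context; byte-aligned, so char data only. */
typedef struct Arena
{
    char buf[256];
    size_t used;
} Arena;

static void arena_reset(Arena *a)
{
    a->used = 0;
}

static void *arena_alloc(Arena *a, size_t size)
{
    void *p;

    assert(a->used + size <= sizeof(a->buf));
    p = a->buf + a->used;
    a->used += size;
    return p;
}

/* Hypothetical plan node that owns its own per-tuple arena. */
typedef struct ToyNode
{
    Arena per_tuple;
    int next_value;
} ToyNode;

/*
 * Fetch the next "tuple".  The arena is reset at the START of the cycle,
 * so the result returned by the previous call stays valid until now.
 */
static const char *toy_node_next(ToyNode *node)
{
    char *scratch;
    char *result;

    arena_reset(&node->per_tuple);
    scratch = arena_alloc(&node->per_tuple, 32);   /* temporary work space */
    result = arena_alloc(&node->per_tuple, 32);    /* value handed back */
    snprintf(scratch, 32, "%d", node->next_value++);
    snprintf(result, 32, "tuple-%s", scratch);
    return result;
}
```

If an outer node fetches a value and then drives an inner node through many
cycles, the outer value survives because the inner node resets only its own
arena, and neither arena grows beyond one cycle's worth of allocations.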

A problem with running index accesses and sorts in a query-lifespan context
is that these operations invoke datatype-specific comparison functions,
and if the comparators leak any memory then that memory won't be recovered
till end of query. The comparator functions all return bool or int32,
so there's no problem with their result data, but there can be a problem
with leakage of internal temporary data. In particular, comparator
functions that operate on TOAST-able data types will need to be careful
not to leak detoasted versions of their inputs. This is annoying, but
it appears a lot easier to make the comparators conform than to fix the
index and sort routines, so that's what I propose to do for 7.1. Further
cleanup can be left for another day.

There will be some special cases, such as aggregate functions. nodeAgg.c
needs to remember the results of evaluation of aggregate transition
functions from one tuple cycle to the next, so it can't just discard
all per-tuple state in each cycle. The easiest way to handle this seems
to be to have two per-tuple contexts in an aggregate node, and to
ping-pong between them, so that at each tuple one is the active allocation
context and the other holds any results allocated by the prior cycle's
transition function.
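
The ping-pong idea can be sketched as follows, assuming a running sum
whose state is passed by reference. LongArena, sum_transition and
pingpong_sum are hypothetical names for this standalone toy; it is not
taken from nodeAgg.c.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a per-tuple context that hands out long-sized slots. */
typedef struct LongArena
{
    long slots[8];
    size_t used;
} LongArena;

static void long_arena_reset(LongArena *a)
{
    a->used = 0;
}

static long *long_arena_alloc(LongArena *a)
{
    assert(a->used < 8);
    return &a->slots[a->used++];
}

/*
 * Hypothetical transition function: reads the previous state and returns a
 * freshly allocated new state, leaving some scratch allocation behind.
 */
static long *sum_transition(LongArena *cxt, const long *state, long input)
{
    long *scratch = long_arena_alloc(cxt);
    long *newstate = long_arena_alloc(cxt);

    *scratch = input;
    *newstate = (state != NULL ? *state : 0) + *scratch;
    return newstate;
}

static long pingpong_sum(const long *inputs, size_t n)
{
    LongArena cxts[2] = {{{0}, 0}, {{0}, 0}};
    int active = 0;
    const long *state = NULL;

    for (size_t i = 0; i < n; i++)
    {
        /* The active arena holds only data from two cycles ago: clear it. */
        long_arena_reset(&cxts[active]);
        /* The prior state lives in the other arena and is still readable. */
        state = sum_transition(&cxts[active], state, inputs[i]);
        active = 1 - active;
    }
    return state != NULL ? *state : 0;
}
```

Each arena never holds more than one cycle's allocations, yet the state
produced by the prior cycle is always intact when the transition function
reads it.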

Executor routines that switch the active CurrentMemoryContext may need
to copy data into their caller's current memory context before returning.
I think there will be relatively little need for that, because of the
convention of resetting the per-tuple context at the *start* of an
execution cycle rather than at its end. With that rule, an execution
node can return a tuple that is palloc'd in its per-tuple context, and
the tuple will remain good until the node is called for another tuple
or told to end execution. This is pretty much the same state of affairs
that exists now, since a scan node can return a direct pointer to a tuple
in a disk buffer that is only guaranteed to remain good that long.

A more common reason for copying data will be to transfer a result from
per-tuple context to per-run context; for example, a Unique node will
save the last distinct tuple value in its per-run context, requiring a
copy step.

Another interesting special case is VACUUM, which needs to allocate
working space that will survive its forced transaction commits, yet
be released on error. Currently it does that through a "portal",
which is essentially a child context of TopMemoryContext. While that
way still works, it's ugly since xact abort needs special processing
to delete the portal. Better would be to use a context that's a child
of PortalContext and hence is certain to go away as part of normal
processing. (Eventually we might have an even better solution from
nested transactions, but this'll do fine for now.)


Mechanisms to Allow Multiple Types of Contexts
----------------------------------------------

We may want several different types of memory contexts with different
allocation policies but similar external behavior. To handle this,
memory allocation functions will be accessed via function pointers,
and we will require all context types to obey the conventions given here.
(This is not very different from the existing code.)

A memory context will be represented by an object like

    typedef struct MemoryContextData
    {
        NodeTag type;                          /* identifies exact kind of context */
        MemoryContextMethods methods;
        struct MemoryContextData *parent;      /* NULL if no parent (toplevel context) */
        struct MemoryContextData *firstchild;  /* head of linked list of children */
        struct MemoryContextData *nextchild;   /* next child of same parent */
        char *name;                            /* context name (just for debugging) */
    } MemoryContextData, *MemoryContext;

This is essentially an abstract superclass, and the "methods" pointer is
its virtual function table. Specific memory context types will use
derived structs having these fields as their first fields. All the
contexts of a specific type will have methods pointers that point to the
same static table of function pointers, which will look like

    typedef struct MemoryContextMethodsData
    {
        Pointer (*alloc) (MemoryContext c, Size size);
        void (*free_p) (Pointer chunk);
        Pointer (*realloc) (Pointer chunk, Size newsize);
        void (*reset) (MemoryContext c);
        void (*delete) (MemoryContext c);
    } MemoryContextMethodsData, *MemoryContextMethods;

Alloc, reset, and delete requests will take a MemoryContext pointer
as parameter, so they'll have no trouble finding the method pointer
to call. Free and realloc are trickier. To make those work, we will
require all memory context types to produce allocated chunks that
are immediately preceded by a standard chunk header, which has the
layout

    typedef struct StandardChunkHeader
    {
        MemoryContext mycontext;   /* Link to owning context object */
        Size size;                 /* Allocated size of chunk */
    } StandardChunkHeader;

It turns out that the existing aset.c memory context type does this
already, and probably any other kind of context would need to have the
same data available to support realloc, so this is not really creating
any additional overhead. (Note that if a context type needs more
per-allocated-chunk information than this, it can make an additional
nonstandard header that precedes the standard header. So we're not
constraining context-type designers very much.)

Given this, the pfree routine will look something like

    StandardChunkHeader *header =
        (StandardChunkHeader *) ((char *) p - sizeof(StandardChunkHeader));

    (*header->mycontext->methods->free_p) (p);

We could do it as a macro, but the macro would have to evaluate its
argument twice, which seems like a bad idea (the current pfree macro
does not do that). This is already saving two levels of function call
compared to the existing code, so I think we're doing fine without
squeezing out that last little bit ...
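
To show the header-based dispatch end to end, here is a self-contained toy
in standard C. The names (ToyContext, ToyMethods, ToyChunkHeader,
counting_alloc, counting_free, toy_alloc, toy_pfree, toy_chunk_context) are
hypothetical and only mirror the shape of the structs above; the max_align_t
padding and the live_chunks counter are additions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct ToyContext ToyContext;

/* Per-type table of function pointers, shared by all contexts of a type. */
typedef struct ToyMethods
{
    void *(*alloc) (ToyContext *cxt, size_t size);
    void (*free_p) (void *chunk);
} ToyMethods;

struct ToyContext
{
    const ToyMethods *methods;
    const char *name;
    int live_chunks;            /* bookkeeping for this demo only */
};

/* Header placed immediately before every chunk handed to callers. */
typedef union ToyChunkHeader
{
    struct
    {
        ToyContext *mycontext;  /* link to owning context */
        size_t size;            /* allocated size of chunk */
    } h;
    max_align_t align;          /* keeps the payload maximally aligned */
} ToyChunkHeader;

static ToyChunkHeader *toy_header(void *chunk)
{
    return (ToyChunkHeader *) ((char *) chunk - sizeof(ToyChunkHeader));
}

static void *counting_alloc(ToyContext *cxt, size_t size)
{
    ToyChunkHeader *hdr = malloc(sizeof(ToyChunkHeader) + size);

    if (hdr == NULL)
        abort();
    hdr->h.mycontext = cxt;
    hdr->h.size = size;
    cxt->live_chunks++;
    return hdr + 1;
}

static void counting_free(void *chunk)
{
    ToyChunkHeader *hdr = toy_header(chunk);

    hdr->h.mycontext->live_chunks--;
    free(hdr);
}

static const ToyMethods counting_methods = { counting_alloc, counting_free };

static void *toy_alloc(ToyContext *cxt, size_t size)
{
    return cxt->methods->alloc(cxt, size);
}

/* Needs no "current context": the owner is found through the header. */
static void toy_pfree(void *chunk)
{
    toy_header(chunk)->h.mycontext->methods->free_p(chunk);
}

/* The same header also answers "which context owns this chunk?". */
static ToyContext *toy_chunk_context(void *chunk)
{
    return toy_header(chunk)->h.mycontext;
}
```

Two contexts initialized with &counting_methods can hand out chunks that
are later freed in any order through toy_pfree(), with each context's
counter adjusted correctly.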


More Control Over aset.c Behavior
---------------------------------

Currently, aset.c allocates an 8K block upon the first allocation in
a context, and doubles that size for each successive block request.
That's good behavior for a context that might hold *lots* of data, and
the overhead wasn't bad when we had only a few contexts in existence.
With dozens if not hundreds of smaller contexts in the system, we will
want to be able to fine-tune things a little better.

The creator of a context will be able to specify an initial block size
and a maximum block size. Selecting smaller values will prevent wastage
of space in contexts that aren't expected to hold very much (an example is
the relcache's per-relation contexts).

Also, it will be possible to specify a minimum context size. If this
value is greater than zero then a block of that size will be grabbed
immediately upon context creation, and cleared but not released during
context resets. This feature is needed for ErrorContext (see above),
but will most likely not be used for other contexts.

We expect that per-tuple contexts will be reset frequently and typically
will not allocate very much space per tuple cycle. To make this usage
pattern cheap, the first block allocated in a context is not given
back to malloc() during reset, but just cleared. This avoids malloc
thrashing.
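
One plausible reading of the initial/maximum block size parameters is that
block sizes keep doubling, as they do today, but stop growing at the
maximum. The sketch below assumes exactly that; BlockPolicy, policy_init
and policy_next_block are hypothetical names, and the capping rule is an
assumption of this example rather than something spelled out above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-context block sizing parameters. */
typedef struct BlockPolicy
{
    size_t init_block_size;   /* size of the first block */
    size_t max_block_size;    /* blocks never grow beyond this */
    size_t next_block_size;   /* size the next block request will use */
} BlockPolicy;

static void policy_init(BlockPolicy *p, size_t init_size, size_t max_size)
{
    assert(init_size > 0 && init_size <= max_size);
    p->init_block_size = init_size;
    p->max_block_size = max_size;
    p->next_block_size = init_size;
}

/* Return the size for the next block, then double it up to the maximum. */
static size_t policy_next_block(BlockPolicy *p)
{
    size_t size = p->next_block_size;

    if (p->next_block_size <= p->max_block_size / 2)
        p->next_block_size *= 2;
    else
        p->next_block_size = p->max_block_size;
    return size;
}
```

With an initial size of 1024 and a maximum of 8192, successive requests
yield 1024, 2048, 4096, 8192, and then 8192 from there on, so a context
that stays small never asks malloc() for more than its first block.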


Other Notes
-----------

The original version of this proposal suggested that functions returning
pass-by-reference datatypes should be required to return a value freshly
palloc'd in their caller's memory context, never a pointer to an input
value. I've abandoned that notion since it clearly is prone to error.
In the current proposal, it is possible to discover which context a
chunk of memory is allocated in (by checking the required standard chunk
header), so nodeAgg can determine whether or not it's safe to reset
its working context; it doesn't have to rely on the transition function
to do what it's expecting.