1073 lines
42 KiB
Plaintext
1073 lines
42 KiB
Plaintext
<!-- doc/src/sgml/storage.sgml -->
|
|
|
|
<chapter id="storage">
|
|
|
|
<title>Database Physical Storage</title>
|
|
|
|
<para>
|
|
This chapter provides an overview of the physical storage format used by
|
|
<productname>PostgreSQL</productname> databases.
|
|
</para>
|
|
|
|
<sect1 id="storage-file-layout">
|
|
|
|
<title>Database File Layout</title>
|
|
|
|
<para>
|
|
This section describes the storage format at the level of files and
|
|
directories.
|
|
</para>
|
|
|
|
<para>
|
|
Traditionally, the configuration and data files used by a database
|
|
cluster are stored together within the cluster's data
|
|
directory, commonly referred to as <varname>PGDATA</varname> (after the name of the
|
|
environment variable that can be used to define it). A common location for
|
|
<varname>PGDATA</varname> is <filename>/var/lib/pgsql/data</filename>. Multiple clusters,
|
|
managed by different server instances, can exist on the same machine.
|
|
</para>
|
|
|
|
<para>
|
|
The <varname>PGDATA</varname> directory contains several subdirectories and control
|
|
files, as shown in <xref linkend="pgdata-contents-table"/>. In addition to
|
|
these required items, the cluster configuration files
|
|
<filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
|
|
<filename>pg_ident.conf</filename> are traditionally stored in
|
|
<varname>PGDATA</varname>, although it is possible to place them elsewhere.
|
|
</para>
|
|
|
|
<table tocentry="1" id="pgdata-contents-table">
|
|
<title>Contents of <varname>PGDATA</varname></title>
|
|
<tgroup cols="2">
|
|
<thead>
|
|
<row>
|
|
<entry>
|
|
Item
|
|
</entry>
|
|
<entry>Description</entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
|
|
<row>
|
|
<entry><filename>PG_VERSION</filename></entry>
|
|
<entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>base</filename></entry>
|
|
<entry>Subdirectory containing per-database subdirectories</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>current_logfiles</filename></entry>
|
|
<entry>File recording the log file(s) currently written to by the logging
|
|
collector</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>global</filename></entry>
|
|
<entry>Subdirectory containing cluster-wide tables, such as
|
|
<structname>pg_database</structname></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_commit_ts</filename></entry>
|
|
<entry>Subdirectory containing transaction commit timestamp data</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_dynshmem</filename></entry>
|
|
<entry>Subdirectory containing files used by the dynamic shared memory
|
|
subsystem</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_logical</filename></entry>
|
|
<entry>Subdirectory containing status data for logical decoding</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_multixact</filename></entry>
|
|
<entry>Subdirectory containing multitransaction status data
|
|
(used for shared row locks)</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_notify</filename></entry>
|
|
<entry>Subdirectory containing LISTEN/NOTIFY status data</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_replslot</filename></entry>
|
|
<entry>Subdirectory containing replication slot data</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_serial</filename></entry>
|
|
<entry>Subdirectory containing information about committed serializable transactions</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_snapshots</filename></entry>
|
|
<entry>Subdirectory containing exported snapshots</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_stat</filename></entry>
|
|
<entry>Subdirectory containing permanent files for the statistics
|
|
subsystem</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_stat_tmp</filename></entry>
|
|
<entry>Subdirectory containing temporary files for the statistics
|
|
subsystem</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_subtrans</filename></entry>
|
|
<entry>Subdirectory containing subtransaction status data</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_tblspc</filename></entry>
|
|
<entry>Subdirectory containing symbolic links to tablespaces</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_twophase</filename></entry>
|
|
<entry>Subdirectory containing state files for prepared transactions</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_wal</filename></entry>
|
|
<entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>pg_xact</filename></entry>
|
|
<entry>Subdirectory containing transaction commit status data</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>postgresql.auto.conf</filename></entry>
|
|
<entry>A file used for storing configuration parameters that are set by
|
|
<command>ALTER SYSTEM</command></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>postmaster.opts</filename></entry>
|
|
<entry>A file recording the command-line options the server was
|
|
last started with</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><filename>postmaster.pid</filename></entry>
|
|
<entry>A lock file recording the current postmaster process ID (PID),
|
|
cluster data directory path,
|
|
postmaster start timestamp,
|
|
port number,
|
|
Unix-domain socket directory path (empty on Windows),
|
|
first valid listen_address (IP address or <literal>*</literal>, or empty if
|
|
not listening on TCP),
|
|
and shared memory segment ID
|
|
(this file is not present after server shutdown)</entry>
|
|
</row>
|
|
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
For each database in the cluster there is a subdirectory within
|
|
<varname>PGDATA</varname><filename>/base</filename>, named after the database's OID in
|
|
<structname>pg_database</structname>. This subdirectory is the default location
|
|
for the database's files; in particular, its system catalogs are stored
|
|
there.
|
|
</para>
|
|
|
|
<para>
|
|
Note that the following sections describe the behavior of the builtin
|
|
<literal>heap</literal> <link linkend="tableam">table access method</link>,
|
|
and the builtin <link linkend="indexam">index access methods</link>. Due
|
|
to the extensible nature of <productname>PostgreSQL</productname>, other
|
|
access methods might work differently.
|
|
</para>
|
|
|
|
<para>
|
|
Each table and index is stored in a separate file. For ordinary relations,
|
|
these files are named after the table or index's <firstterm>filenode</firstterm> number,
|
|
which can be found in <structname>pg_class</structname>.<structfield>relfilenode</structfield>. But
|
|
for temporary relations, the file name is of the form
|
|
<literal>t<replaceable>BBB</replaceable>_<replaceable>FFF</replaceable></literal>, where <replaceable>BBB</replaceable>
|
|
is the backend ID of the backend which created the file, and <replaceable>FFF</replaceable>
|
|
is the filenode number. In either case, in addition to the main file (a/k/a
|
|
main fork), each table and index has a <firstterm>free space map</firstterm> (see <xref
|
|
linkend="storage-fsm"/>), which stores information about free space available in
|
|
the relation. The free space map is stored in a file named with the filenode
|
|
number plus the suffix <literal>_fsm</literal>. Tables also have a
|
|
<firstterm>visibility map</firstterm>, stored in a fork with the suffix <literal>_vm</literal>,
|
|
to track which pages are known to have no dead tuples. The visibility map is
|
|
described further in <xref linkend="storage-vm"/>. Unlogged tables and indexes
|
|
have a third fork, known as the initialization fork, which is stored in a fork
|
|
with the suffix <literal>_init</literal> (see <xref linkend="storage-init"/>).
|
|
</para>
|
|
|
|
<caution>
|
|
<para>
|
|
Note that while a table's filenode often matches its OID, this is
|
|
<emphasis>not</emphasis> necessarily the case; some operations, like
|
|
<command>TRUNCATE</command>, <command>REINDEX</command>, <command>CLUSTER</command> and some forms
|
|
of <command>ALTER TABLE</command>, can change the filenode while preserving the OID.
|
|
Avoid assuming that filenode and table OID are the same.
|
|
Also, for certain system catalogs including <structname>pg_class</structname> itself,
|
|
<structname>pg_class</structname>.<structfield>relfilenode</structfield> contains zero. The
|
|
actual filenode number of these catalogs is stored in a lower-level data
|
|
structure, and can be obtained using the <function>pg_relation_filenode()</function>
|
|
function.
|
|
</para>
|
|
</caution>
|
|
|
|
<para>
|
|
When a table or index exceeds 1 GB, it is divided into gigabyte-sized
|
|
<firstterm>segments</firstterm>. The first segment's file name is the same as the
|
|
filenode; subsequent segments are named filenode.1, filenode.2, etc.
|
|
This arrangement avoids problems on platforms that have file size limitations.
|
|
(Actually, 1 GB is just the default segment size. The segment size can be
|
|
adjusted using the configuration option <option>--with-segsize</option>
|
|
when building <productname>PostgreSQL</productname>.)
|
|
In principle, free space map and visibility map forks could require multiple
|
|
segments as well, though this is unlikely to happen in practice.
|
|
</para>
|
|
|
|
<para>
|
|
A table that has columns with potentially large entries will have an
|
|
associated <firstterm>TOAST</firstterm> table, which is used for out-of-line storage of
|
|
field values that are too large to keep in the table rows proper.
|
|
<structname>pg_class</structname>.<structfield>reltoastrelid</structfield> links from a table to
|
|
its <acronym>TOAST</acronym> table, if any.
|
|
See <xref linkend="storage-toast"/> for more information.
|
|
</para>
|
|
|
|
<para>
|
|
The contents of tables and indexes are discussed further in
|
|
<xref linkend="storage-page-layout"/>.
|
|
</para>
|
|
|
|
<para>
|
|
Tablespaces make the scenario more complicated. Each user-defined tablespace
|
|
has a symbolic link inside the <varname>PGDATA</varname><filename>/pg_tblspc</filename>
|
|
directory, which points to the physical tablespace directory (i.e., the
|
|
location specified in the tablespace's <command>CREATE TABLESPACE</command> command).
|
|
This symbolic link is named after
|
|
the tablespace's OID. Inside the physical tablespace directory there is
|
|
a subdirectory with a name that depends on the <productname>PostgreSQL</productname>
|
|
server version, such as <literal>PG_9.0_201008051</literal>. (The reason for using
|
|
this subdirectory is so that successive versions of the database can use
|
|
the same <command>CREATE TABLESPACE</command> location value without conflicts.)
|
|
Within the version-specific subdirectory, there is
|
|
a subdirectory for each database that has elements in the tablespace, named
|
|
after the database's OID. Tables and indexes are stored within that
|
|
directory, using the filenode naming scheme.
|
|
The <literal>pg_default</literal> tablespace is not accessed through
|
|
<filename>pg_tblspc</filename>, but corresponds to
|
|
<varname>PGDATA</varname><filename>/base</filename>. Similarly, the <literal>pg_global</literal>
|
|
tablespace is not accessed through <filename>pg_tblspc</filename>, but corresponds to
|
|
<varname>PGDATA</varname><filename>/global</filename>.
|
|
</para>
|
|
|
|
<para>
|
|
The <function>pg_relation_filepath()</function> function shows the entire path
|
|
(relative to <varname>PGDATA</varname>) of any relation. It is often useful
|
|
as a substitute for remembering many of the above rules. But keep in
|
|
mind that this function just gives the name of the first segment of the
|
|
main fork of the relation — you may need to append a segment number
|
|
and/or <literal>_fsm</literal>, <literal>_vm</literal>, or <literal>_init</literal> to find all
|
|
the files associated with the relation.
|
|
</para>
|
|
|
|
<para>
|
|
Temporary files (for operations such as sorting more data than can fit in
|
|
memory) are created within <varname>PGDATA</varname><filename>/base/pgsql_tmp</filename>,
|
|
or within a <filename>pgsql_tmp</filename> subdirectory of a tablespace directory
|
|
if a tablespace other than <literal>pg_default</literal> is specified for them.
|
|
The name of a temporary file has the form
|
|
<filename>pgsql_tmp<replaceable>PPP</replaceable>.<replaceable>NNN</replaceable></filename>,
|
|
where <replaceable>PPP</replaceable> is the PID of the owning backend and
|
|
<replaceable>NNN</replaceable> distinguishes different temporary files of that backend.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="storage-toast">
|
|
|
|
<title>TOAST</title>
|
|
|
|
<indexterm>
|
|
<primary>TOAST</primary>
|
|
</indexterm>
|
|
<indexterm><primary>sliced bread</primary><see>TOAST</see></indexterm>
|
|
|
|
<para>
|
|
This section provides an overview of <acronym>TOAST</acronym> (The
|
|
Oversized-Attribute Storage Technique).
|
|
</para>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> uses a fixed page size (commonly
|
|
8 kB), and does not allow tuples to span multiple pages. Therefore, it is
|
|
not possible to store very large field values directly. To overcome
|
|
this limitation, large field values are compressed and/or broken up into
|
|
multiple physical rows. This happens transparently to the user, with only
|
|
small impact on most of the backend code. The technique is affectionately
|
|
known as <acronym>TOAST</acronym> (or <quote>the best thing since sliced bread</quote>).
|
|
The <acronym>TOAST</acronym> infrastructure is also used to improve handling of
|
|
large data values in-memory.
|
|
</para>
|
|
|
|
<para>
|
|
Only certain data types support <acronym>TOAST</acronym> — there is no need to
|
|
impose the overhead on data types that cannot produce large field values.
|
|
To support <acronym>TOAST</acronym>, a data type must have a variable-length
|
|
(<firstterm>varlena</firstterm>) representation, in which, ordinarily, the first
|
|
four-byte word of any stored value contains the total length of the value in
|
|
bytes (including itself). <acronym>TOAST</acronym> does not constrain the rest
|
|
of the data type's representation. The special representations collectively
|
|
called <firstterm><acronym>TOAST</acronym>ed values</firstterm> work by modifying or
|
|
reinterpreting this initial length word. Therefore, the C-level functions
|
|
supporting a <acronym>TOAST</acronym>-able data type must be careful about how they
|
|
handle potentially <acronym>TOAST</acronym>ed input values: an input might not
|
|
actually consist of a four-byte length word and contents until after it's
|
|
been <firstterm>detoasted</firstterm>. (This is normally done by invoking
|
|
<function>PG_DETOAST_DATUM</function> before doing anything with an input value,
|
|
but in some cases more efficient approaches are possible.
|
|
See <xref linkend="xtypes-toast"/> for more detail.)
|
|
</para>
|
|
|
|
<para>
|
|
<acronym>TOAST</acronym> usurps two bits of the varlena length word (the high-order
|
|
bits on big-endian machines, the low-order bits on little-endian machines),
|
|
thereby limiting the logical size of any value of a <acronym>TOAST</acronym>-able
|
|
data type to 1 GB (2<superscript>30</superscript> - 1 bytes). When both bits are zero,
|
|
the value is an ordinary un-<acronym>TOAST</acronym>ed value of the data type, and
|
|
the remaining bits of the length word give the total datum size (including
|
|
length word) in bytes. When the highest-order or lowest-order bit is set,
|
|
the value has only a single-byte header instead of the normal four-byte
|
|
header, and the remaining bits of that byte give the total datum size
|
|
(including length byte) in bytes. This alternative supports space-efficient
|
|
storage of values shorter than 127 bytes, while still allowing the data type
|
|
to grow to 1 GB at need. Values with single-byte headers aren't aligned on
|
|
any particular boundary, whereas values with four-byte headers are aligned on
|
|
at least a four-byte boundary; this omission of alignment padding provides
|
|
additional space savings that is significant compared to short values.
|
|
As a special case, if the remaining bits of a single-byte header are all
|
|
zero (which would be impossible for a self-inclusive length), the value is
|
|
a pointer to out-of-line data, with several possible alternatives as
|
|
described below. The type and size of such a <firstterm>TOAST pointer</firstterm>
|
|
are determined by a code stored in the second byte of the datum.
|
|
Lastly, when the highest-order or lowest-order bit is clear but the adjacent
|
|
bit is set, the content of the datum has been compressed and must be
|
|
decompressed before use. In this case the remaining bits of the four-byte
|
|
length word give the total size of the compressed datum, not the
|
|
original data. Note that compression is also possible for out-of-line data
|
|
but the varlena header does not tell whether it has occurred —
|
|
the content of the <acronym>TOAST</acronym> pointer tells that, instead.
|
|
</para>
|
|
|
|
<para>
|
|
As mentioned, there are multiple types of <acronym>TOAST</acronym> pointer datums.
|
|
The oldest and most common type is a pointer to out-of-line data stored in
|
|
a <firstterm><acronym>TOAST</acronym> table</firstterm> that is separate from, but
|
|
associated with, the table containing the <acronym>TOAST</acronym> pointer datum
|
|
itself. These <firstterm>on-disk</firstterm> pointer datums are created by the
|
|
<acronym>TOAST</acronym> management code (in <filename>access/heap/tuptoaster.c</filename>)
|
|
when a tuple to be stored on disk is too large to be stored as-is.
|
|
Further details appear in <xref linkend="storage-toast-ondisk"/>.
|
|
Alternatively, a <acronym>TOAST</acronym> pointer datum can contain a pointer to
|
|
out-of-line data that appears elsewhere in memory. Such datums are
|
|
necessarily short-lived, and will never appear on-disk, but they are very
|
|
useful for avoiding copying and redundant processing of large data values.
|
|
Further details appear in <xref linkend="storage-toast-inmemory"/>.
|
|
</para>
|
|
|
|
<para>
|
|
The compression technique used for either in-line or out-of-line compressed
|
|
data is a fairly simple and very fast member
|
|
of the LZ family of compression techniques. See
|
|
<filename>src/common/pg_lzcompress.c</filename> for the details.
|
|
</para>
|
|
|
|
<sect2 id="storage-toast-ondisk">
|
|
<title>Out-of-line, on-disk TOAST storage</title>
|
|
|
|
<para>
|
|
If any of the columns of a table are <acronym>TOAST</acronym>-able, the table will
|
|
have an associated <acronym>TOAST</acronym> table, whose OID is stored in the table's
|
|
<structname>pg_class</structname>.<structfield>reltoastrelid</structfield> entry. On-disk
|
|
<acronym>TOAST</acronym>ed values are kept in the <acronym>TOAST</acronym> table, as
|
|
described in more detail below.
|
|
</para>
|
|
|
|
<para>
|
|
Out-of-line values are divided (after compression if used) into chunks of at
|
|
most <symbol>TOAST_MAX_CHUNK_SIZE</symbol> bytes (by default this value is chosen
|
|
so that four chunk rows will fit on a page, making it about 2000 bytes).
|
|
Each chunk is stored as a separate row in the <acronym>TOAST</acronym> table
|
|
belonging to the owning table. Every
|
|
<acronym>TOAST</acronym> table has the columns <structfield>chunk_id</structfield> (an OID
|
|
identifying the particular <acronym>TOAST</acronym>ed value),
|
|
<structfield>chunk_seq</structfield> (a sequence number for the chunk within its value),
|
|
and <structfield>chunk_data</structfield> (the actual data of the chunk). A unique index
|
|
on <structfield>chunk_id</structfield> and <structfield>chunk_seq</structfield> provides fast
|
|
retrieval of the values. A pointer datum representing an out-of-line on-disk
|
|
<acronym>TOAST</acronym>ed value therefore needs to store the OID of the
|
|
<acronym>TOAST</acronym> table in which to look and the OID of the specific value
|
|
(its <structfield>chunk_id</structfield>). For convenience, pointer datums also store the
|
|
logical datum size (original uncompressed data length) and physical stored size
|
|
(different if compression was applied). Allowing for the varlena header bytes,
|
|
the total size of an on-disk <acronym>TOAST</acronym> pointer datum is therefore 18
|
|
bytes regardless of the actual size of the represented value.
|
|
</para>
|
|
|
|
<para>
|
|
The <acronym>TOAST</acronym> management code is triggered only
|
|
when a row value to be stored in a table is wider than
|
|
<symbol>TOAST_TUPLE_THRESHOLD</symbol> bytes (normally 2 kB).
|
|
The <acronym>TOAST</acronym> code will compress and/or move
|
|
field values out-of-line until the row value is shorter than
|
|
<symbol>TOAST_TUPLE_TARGET</symbol> bytes (also normally 2 kB, adjustable)
|
|
or no more gains can be had. During an UPDATE
|
|
operation, values of unchanged fields are normally preserved as-is; so an
|
|
UPDATE of a row with out-of-line values incurs no <acronym>TOAST</acronym> costs if
|
|
none of the out-of-line values change.
|
|
</para>
|
|
|
|
<para>
|
|
The <acronym>TOAST</acronym> management code recognizes four different strategies
|
|
for storing <acronym>TOAST</acronym>-able columns on disk:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<literal>PLAIN</literal> prevents either compression or
|
|
out-of-line storage; furthermore it disables use of single-byte headers
|
|
for varlena types.
|
|
This is the only possible strategy for
|
|
columns of non-<acronym>TOAST</acronym>-able data types.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>EXTENDED</literal> allows both compression and out-of-line
|
|
storage. This is the default for most <acronym>TOAST</acronym>-able data types.
|
|
Compression will be attempted first, then out-of-line storage if
|
|
the row is still too big.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>EXTERNAL</literal> allows out-of-line storage but not
|
|
compression. Use of <literal>EXTERNAL</literal> will
|
|
make substring operations on wide <type>text</type> and
|
|
<type>bytea</type> columns faster (at the penalty of increased storage
|
|
space) because these operations are optimized to fetch only the
|
|
required parts of the out-of-line value when it is not compressed.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>MAIN</literal> allows compression but not out-of-line
|
|
storage. (Actually, out-of-line storage will still be performed
|
|
for such columns, but only as a last resort when there is no other
|
|
way to make the row small enough to fit on a page.)
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
Each <acronym>TOAST</acronym>-able data type specifies a default strategy for columns
|
|
of that data type, but the strategy for a given table column can be altered
|
|
with <link linkend="sql-altertable"><command>ALTER TABLE ... SET STORAGE</command></link>.
|
|
</para>
|
|
|
|
<para>
|
|
<symbol>TOAST_TUPLE_TARGET</symbol> can be adjusted for each table using
|
|
<link linkend="sql-altertable"><command>ALTER TABLE ... SET (toast_tuple_target = N)</command></link>
|
|
</para>
|
|
|
|
<para>
|
|
This scheme has a number of advantages compared to a more straightforward
|
|
approach such as allowing row values to span pages. Assuming that queries are
|
|
usually qualified by comparisons against relatively small key values, most of
|
|
the work of the executor will be done using the main row entry. The big values
|
|
of <acronym>TOAST</acronym>ed attributes will only be pulled out (if selected at all)
|
|
at the time the result set is sent to the client. Thus, the main table is much
|
|
smaller and more of its rows fit in the shared buffer cache than would be the
|
|
case without any out-of-line storage. Sort sets shrink also, and sorts will
|
|
more often be done entirely in memory. A little test showed that a table
|
|
containing typical HTML pages and their URLs was stored in about half of the
|
|
raw data size including the <acronym>TOAST</acronym> table, and that the main table
|
|
contained only about 10% of the entire data (the URLs and some small HTML
|
|
pages). There was no run time difference compared to an un-<acronym>TOAST</acronym>ed
|
|
comparison table, in which all the HTML pages were cut down to 7 kB to fit.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="storage-toast-inmemory">
|
|
<title>Out-of-line, in-memory TOAST storage</title>
|
|
|
|
<para>
|
|
<acronym>TOAST</acronym> pointers can point to data that is not on disk, but is
|
|
elsewhere in the memory of the current server process. Such pointers
|
|
obviously cannot be long-lived, but they are nonetheless useful. There
|
|
are currently two sub-cases:
|
|
pointers to <firstterm>indirect</firstterm> data and
|
|
pointers to <firstterm>expanded</firstterm> data.
|
|
</para>
|
|
|
|
<para>
|
|
Indirect <acronym>TOAST</acronym> pointers simply point at a non-indirect varlena
|
|
value stored somewhere in memory. This case was originally created merely
|
|
as a proof of concept, but it is currently used during logical decoding to
|
|
avoid possibly having to create physical tuples exceeding 1 GB (as pulling
|
|
all out-of-line field values into the tuple might do). The case is of
|
|
limited use since the creator of the pointer datum is entirely responsible
|
|
that the referenced data survives for as long as the pointer could exist,
|
|
and there is no infrastructure to help with this.
|
|
</para>
|
|
|
|
<para>
|
|
Expanded <acronym>TOAST</acronym> pointers are useful for complex data types
|
|
whose on-disk representation is not especially suited for computational
|
|
purposes. As an example, the standard varlena representation of a
|
|
<productname>PostgreSQL</productname> array includes dimensionality information, a
|
|
nulls bitmap if there are any null elements, then the values of all the
|
|
elements in order. When the element type itself is variable-length, the
|
|
only way to find the <replaceable>N</replaceable>'th element is to scan through all the
|
|
preceding elements. This representation is appropriate for on-disk storage
|
|
because of its compactness, but for computations with the array it's much
|
|
nicer to have an <quote>expanded</quote> or <quote>deconstructed</quote>
|
|
representation in which all the element starting locations have been
|
|
identified. The <acronym>TOAST</acronym> pointer mechanism supports this need by
|
|
allowing a pass-by-reference Datum to point to either a standard varlena
|
|
value (the on-disk representation) or a <acronym>TOAST</acronym> pointer that
|
|
points to an expanded representation somewhere in memory. The details of
|
|
this expanded representation are up to the data type, though it must have
|
|
a standard header and meet the other API requirements given
|
|
in <filename>src/include/utils/expandeddatum.h</filename>. C-level functions
|
|
working with the data type can choose to handle either representation.
|
|
Functions that do not know about the expanded representation, but simply
|
|
apply <function>PG_DETOAST_DATUM</function> to their inputs, will automatically
|
|
receive the traditional varlena representation; so support for an expanded
|
|
representation can be introduced incrementally, one function at a time.
|
|
</para>
|
|
|
|
<para>
|
|
<acronym>TOAST</acronym> pointers to expanded values are further broken down
|
|
into <firstterm>read-write</firstterm> and <firstterm>read-only</firstterm> pointers.
|
|
The pointed-to representation is the same either way, but a function that
|
|
receives a read-write pointer is allowed to modify the referenced value
|
|
in-place, whereas one that receives a read-only pointer must not; it must
|
|
first create a copy if it wants to make a modified version of the value.
|
|
This distinction and some associated conventions make it possible to avoid
|
|
unnecessary copying of expanded values during query execution.
|
|
</para>
|
|
|
|
<para>
|
|
For all types of in-memory <acronym>TOAST</acronym> pointer, the <acronym>TOAST</acronym>
|
|
management code ensures that no such pointer datum can accidentally get
|
|
stored on disk. In-memory <acronym>TOAST</acronym> pointers are automatically
|
|
expanded to normal in-line varlena values before storage — and then
|
|
possibly converted to on-disk <acronym>TOAST</acronym> pointers, if the containing
|
|
tuple would otherwise be too big.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="storage-fsm">
|
|
|
|
<title>Free Space Map</title>
|
|
|
|
<indexterm>
|
|
<primary>Free Space Map</primary>
|
|
</indexterm>
|
|
<indexterm><primary>FSM</primary><see>Free Space Map</see></indexterm>
|
|
|
|
<para>
|
|
Each heap and index relation, except for hash indexes, has a Free Space Map
|
|
(FSM) to keep track of available space in the relation. It's stored
|
|
alongside the main relation data in a separate relation fork, named after the
|
|
filenode number of the relation, plus a <literal>_fsm</literal> suffix. For example,
|
|
if the filenode of a relation is 12345, the FSM is stored in a file called
|
|
<filename>12345_fsm</filename>, in the same directory as the main relation file.
|
|
</para>
|
|
|
|
<para>
|
|
The Free Space Map is organized as a tree of <acronym>FSM</acronym> pages. The
|
|
bottom level <acronym>FSM</acronym> pages store the free space available on each
|
|
heap (or index) page, using one byte to represent each such page. The upper
|
|
levels aggregate information from the lower levels.
|
|
</para>
|
|
|
|
<para>
|
|
Within each <acronym>FSM</acronym> page is a binary tree, stored in an array with
|
|
one byte per node. Each leaf node represents a heap page, or a lower level
|
|
<acronym>FSM</acronym> page. In each non-leaf node, the higher of its children's
|
|
values is stored. The maximum value in the leaf nodes is therefore stored
|
|
at the root.
|
|
</para>
|
|
|
|
<para>
|
|
See <filename>src/backend/storage/freespace/README</filename> for more details on
|
|
how the <acronym>FSM</acronym> is structured, and how it's updated and searched.
|
|
The <xref linkend="pgfreespacemap"/> module
|
|
can be used to examine the information stored in free space maps.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="storage-vm">
|
|
|
|
<title>Visibility Map</title>
|
|
|
|
<indexterm>
|
|
<primary>Visibility Map</primary>
|
|
</indexterm>
|
|
<indexterm><primary>VM</primary><see>Visibility Map</see></indexterm>
|
|
|
|
<para>
|
|
Each heap relation has a Visibility Map
|
|
(VM) to keep track of which pages contain only tuples that are known to be
|
|
visible to all active transactions; it also keeps track of which pages contain
|
|
only frozen tuples. It's stored
|
|
alongside the main relation data in a separate relation fork, named after the
|
|
filenode number of the relation, plus a <literal>_vm</literal> suffix. For example,
|
|
if the filenode of a relation is 12345, the VM is stored in a file called
|
|
<filename>12345_vm</filename>, in the same directory as the main relation file.
|
|
Note that indexes do not have VMs.
|
|
</para>
|
|
|
|
<para>
|
|
The visibility map stores two bits per heap page. The first bit, if set,
|
|
indicates that the page is all-visible, or in other words that the page does
|
|
not contain any tuples that need to be vacuumed.
|
|
This information can also be used
|
|
by <link linkend="indexes-index-only-scans"><firstterm>index-only
|
|
scans</firstterm></link> to answer queries using only the index tuple.
|
|
The second bit, if set, means that all tuples on the page have been frozen.
|
|
That means that even an anti-wraparound vacuum need not revisit the page.
|
|
</para>
|
|
|
|
<para>
|
|
The map is conservative in the sense that we make sure that whenever a bit is
|
|
set, we know the condition is true, but if a bit is not set, it might or
|
|
might not be true. Visibility map bits are only set by vacuum, but are
|
|
cleared by any data-modifying operations on a page.
|
|
</para>
|
|
|
|
<para>
|
|
The <xref linkend="pgvisibility"/> module can be used to examine the
|
|
information stored in the visibility map.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="storage-init">
|
|
|
|
<title>The Initialization Fork</title>
|
|
|
|
<indexterm>
|
|
<primary>Initialization Fork</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
Each unlogged table, and each index on an unlogged table, has an initialization
|
|
fork. The initialization fork is an empty table or index of the appropriate
|
|
type. When an unlogged table must be reset to empty due to a crash, the
|
|
initialization fork is copied over the main fork, and any other forks are
|
|
erased (they will be recreated automatically as needed).
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="storage-page-layout">
|
|
|
|
<title>Database Page Layout</title>
|
|
|
|
<para>
|
|
This section provides an overview of the page format used within
|
|
<productname>PostgreSQL</productname> tables and indexes.<footnote>
|
|
<para>
|
|
Actually, use of this page format is not required for either table or
|
|
index access methods. The <literal>heap</literal> table access method
|
|
always uses this format. All the existing index methods also use the
|
|
basic format, but the data kept on index metapages usually doesn't follow
|
|
the item layout rules.
|
|
</para>
|
|
</footnote>
|
|
Sequences and <acronym>TOAST</acronym> tables are formatted just like a regular table.
|
|
</para>
|
|
|
|
<para>
|
|
In the following explanation, a
|
|
<firstterm>byte</firstterm>
|
|
is assumed to contain 8 bits. In addition, the term
|
|
<firstterm>item</firstterm>
|
|
refers to an individual data value that is stored on a page. In a table,
|
|
an item is a row; in an index, an item is an index entry.
|
|
</para>
|
|
|
|
<para>
|
|
Every table and index is stored as an array of <firstterm>pages</firstterm> of a
|
|
fixed size (usually 8 kB, although a different page size can be selected
|
|
when compiling the server). In a table, all the pages are logically
|
|
equivalent, so a particular item (row) can be stored in any page. In
|
|
indexes, the first page is generally reserved as a <firstterm>metapage</firstterm>
|
|
holding control information, and there can be different types of pages
|
|
within the index, depending on the index access method.
|
|
</para>
|
|
|
|
<para>
|
|
<xref linkend="page-table"/> shows the overall layout of a page.
|
|
There are five parts to each page.
|
|
</para>
|
|
|
|
<table tocentry="1" id="page-table">
|
|
<title>Overall Page Layout</title>
|
|
<titleabbrev>Page Layout</titleabbrev>
|
|
<tgroup cols="2">
|
|
<thead>
|
|
<row>
|
|
<entry>
|
|
Item
|
|
</entry>
|
|
<entry>Description</entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
|
|
<row>
|
|
<entry>PageHeaderData</entry>
|
|
<entry>24 bytes long. Contains general information about the page, including
|
|
free space pointers.</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry>ItemIdData</entry>
|
|
<entry>Array of item identifiers pointing to the actual items. Each
|
|
entry is an (offset,length) pair. 4 bytes per item.</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry>Free space</entry>
|
|
<entry>The unallocated space. New item identifiers are allocated from
|
|
the start of this area, new items from the end.</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry>Items</entry>
|
|
<entry>The actual items themselves.</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry>Special space</entry>
|
|
<entry>Index access method specific data. Different methods store different
|
|
data. Empty in ordinary tables.</entry>
|
|
</row>
|
|
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
|
|
The first 24 bytes of each page consists of a page header
|
|
(<structname>PageHeaderData</structname>). Its format is detailed in <xref
|
|
linkend="pageheaderdata-table"/>. The first field tracks the most
|
|
recent WAL entry related to this page. The second field contains
|
|
the page checksum if <xref linkend="app-initdb-data-checksums"/> are
|
|
enabled. Next is a 2-byte field containing flag bits. This is followed
|
|
by three 2-byte integer fields (<structfield>pd_lower</structfield>,
|
|
<structfield>pd_upper</structfield>, and
|
|
<structfield>pd_special</structfield>). These contain byte offsets
|
|
from the page start to the start of unallocated space, to the end of
|
|
unallocated space, and to the start of the special space. The next 2
|
|
bytes of the page header, <structfield>pd_pagesize_version</structfield>,
|
|
store both the page size and a version indicator. Beginning with
|
|
<productname>PostgreSQL</productname> 8.3 the version number is 4;
|
|
<productname>PostgreSQL</productname> 8.1 and 8.2 used version number 3;
|
|
<productname>PostgreSQL</productname> 8.0 used version number 2;
|
|
<productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
|
|
prior releases used version number 0.
|
|
(The basic page layout and header format has not changed in most of these
|
|
versions, but the layout of heap row headers has.) The page size
|
|
is basically only present as a cross-check; there is no support for having
|
|
more than one page size in an installation.
|
|
The last field is a hint that shows whether pruning the page is likely
|
|
to be profitable: it tracks the oldest un-pruned XMAX on the page.
|
|
|
|
</para>
|
|
|
|
<table tocentry="1" id="pageheaderdata-table">
|
|
<title>PageHeaderData Layout</title>
|
|
<titleabbrev>PageHeaderData Layout</titleabbrev>
|
|
<tgroup cols="4">
|
|
<thead>
|
|
<row>
|
|
<entry>Field</entry>
|
|
<entry>Type</entry>
|
|
<entry>Length</entry>
|
|
<entry>Description</entry>
|
|
</row>
|
|
</thead>
|
|
<tbody>
|
|
<row>
|
|
<entry>pd_lsn</entry>
|
|
<entry>PageXLogRecPtr</entry>
|
|
<entry>8 bytes</entry>
|
|
<entry>LSN: next byte after last byte of WAL record for last change
|
|
to this page</entry>
|
|
</row>
|
|
<row>
|
|
<entry>pd_checksum</entry>
|
|
<entry>uint16</entry>
|
|
<entry>2 bytes</entry>
|
|
<entry>Page checksum</entry>
|
|
</row>
|
|
<row>
|
|
<entry>pd_flags</entry>
|
|
<entry>uint16</entry>
|
|
<entry>2 bytes</entry>
|
|
<entry>Flag bits</entry>
|
|
</row>
|
|
<row>
|
|
<entry>pd_lower</entry>
|
|
<entry>LocationIndex</entry>
|
|
<entry>2 bytes</entry>
|
|
<entry>Offset to start of free space</entry>
|
|
</row>
|
|
<row>
|
|
<entry>pd_upper</entry>
|
|
<entry>LocationIndex</entry>
|
|
<entry>2 bytes</entry>
|
|
<entry>Offset to end of free space</entry>
|
|
</row>
|
|
<row>
|
|
<entry>pd_special</entry>
|
|
<entry>LocationIndex</entry>
|
|
<entry>2 bytes</entry>
|
|
<entry>Offset to start of special space</entry>
|
|
</row>
|
|
<row>
|
|
<entry>pd_pagesize_version</entry>
|
|
<entry>uint16</entry>
|
|
<entry>2 bytes</entry>
|
|
<entry>Page size and layout version number information</entry>
|
|
</row>
|
|
<row>
|
|
<entry>pd_prune_xid</entry>
|
|
<entry>TransactionId</entry>
|
|
<entry>4 bytes</entry>
|
|
<entry>Oldest unpruned XMAX on page, or zero if none</entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
All the details can be found in
|
|
<filename>src/include/storage/bufpage.h</filename>.
|
|
</para>
|
|
|
|
<para>
|
|
Following the page header are item identifiers
|
|
(<type>ItemIdData</type>), each requiring four bytes.
|
|
An item identifier contains a byte-offset to
|
|
the start of an item, its length in bytes, and a few attribute bits
|
|
which affect its interpretation.
|
|
New item identifiers are allocated
|
|
as needed from the beginning of the unallocated space.
|
|
The number of item identifiers present can be determined by looking at
|
|
<structfield>pd_lower</structfield>, which is increased to allocate a new identifier.
|
|
Because an item
|
|
identifier is never moved until it is freed, its index can be used on a
|
|
long-term basis to reference an item, even when the item itself is moved
|
|
around on the page to compact free space. In fact, every pointer to an
|
|
item (<type>ItemPointer</type>, also known as
|
|
<type>CTID</type>) created by
|
|
<productname>PostgreSQL</productname> consists of a page number and the
|
|
index of an item identifier.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
The items themselves are stored in space allocated backwards from the end
|
|
of unallocated space. The exact structure varies depending on what the
|
|
table is to contain. Tables and sequences both use a structure named
|
|
<type>HeapTupleHeaderData</type>, described below.
|
|
|
|
</para>
|
|
|
|
<para>
|
|
|
|
The final section is the <quote>special section</quote> which can
|
|
contain anything the access method wishes to store. For example,
|
|
b-tree indexes store links to the page's left and right siblings,
|
|
as well as some other data relevant to the index structure.
|
|
Ordinary tables do not use a special section at all (indicated by setting
|
|
<structfield>pd_special</structfield> to equal the page size).
|
|
|
|
</para>
|
|
|
|
<para>
|
|
<xref linkend="storage-page-layout-figure"/> illustrates how these parts are
|
|
laid out in a page.
|
|
</para>
|
|
|
|
<figure id="storage-page-layout-figure">
|
|
<title>Page Layout</title>
|
|
<mediaobject>
|
|
<imageobject>
|
|
<imagedata fileref="images/pagelayout.svg" format="SVG" width="100%"/>
|
|
</imageobject>
|
|
</mediaobject>
|
|
</figure>
|
|
|
|
<sect2 id="storage-tuple-layout">
|
|
|
|
<title>Table Row Layout</title>
|
|
|
|
<para>
|
|
|
|
All table rows are structured in the same way. There is a fixed-size
|
|
header (occupying 23 bytes on most machines), followed by an optional null
|
|
bitmap, an optional object ID field, and the user data. The header is
|
|
detailed
|
|
in <xref linkend="heaptupleheaderdata-table"/>. The actual user data
|
|
(columns of the row) begins at the offset indicated by
|
|
<structfield>t_hoff</structfield>, which must always be a multiple of the MAXALIGN
|
|
distance for the platform.
|
|
The null bitmap is
|
|
only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
|
|
<structfield>t_infomask</structfield>. If it is present it begins just after
|
|
the fixed header and occupies enough bytes to have one bit per data column
|
|
(that is, <structfield>t_natts</structfield> bits altogether). In this list of bits, a
|
|
1 bit indicates not-null, a 0 bit is a null. When the bitmap is not
|
|
present, all columns are assumed not-null.
|
|
The object ID is only present if the <firstterm>HEAP_HASOID_OLD</firstterm> bit
|
|
is set in <structfield>t_infomask</structfield>. If present, it appears just
|
|
before the <structfield>t_hoff</structfield> boundary. Any padding needed to make
|
|
<structfield>t_hoff</structfield> a MAXALIGN multiple will appear between the null
|
|
bitmap and the object ID. (This in turn ensures that the object ID is
|
|
suitably aligned.)
|
|
|
|
</para>
|
|
|
|
<table tocentry="1" id="heaptupleheaderdata-table">
|
|
<title>HeapTupleHeaderData Layout</title>
|
|
<titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
|
|
<tgroup cols="4">
|
|
<thead>
|
|
<row>
|
|
<entry>Field</entry>
|
|
<entry>Type</entry>
|
|
<entry>Length</entry>
|
|
<entry>Description</entry>
|
|
</row>
|
|
</thead>
|
|
<tbody>
|
|
<row>
|
|
<entry>t_xmin</entry>
|
|
<entry>TransactionId</entry>
|
|
<entry>4 bytes</entry>
|
|
<entry>insert XID stamp</entry>
|
|
</row>
|
|
<row>
|
|
<entry>t_xmax</entry>
|
|
<entry>TransactionId</entry>
|
|
<entry>4 bytes</entry>
|
|
<entry>delete XID stamp</entry>
|
|
</row>
|
|
<row>
|
|
<entry>t_cid</entry>
|
|
<entry>CommandId</entry>
|
|
<entry>4 bytes</entry>
|
|
<entry>insert and/or delete CID stamp (overlays with t_xvac)</entry>
|
|
</row>
|
|
<row>
|
|
<entry>t_xvac</entry>
|
|
<entry>TransactionId</entry>
|
|
<entry>4 bytes</entry>
|
|
<entry>XID for VACUUM operation moving a row version</entry>
|
|
</row>
|
|
<row>
|
|
<entry>t_ctid</entry>
|
|
<entry>ItemPointerData</entry>
|
|
<entry>6 bytes</entry>
|
|
<entry>current TID of this or newer row version</entry>
|
|
</row>
|
|
<row>
|
|
<entry>t_infomask2</entry>
|
|
<entry>uint16</entry>
|
|
<entry>2 bytes</entry>
|
|
<entry>number of attributes, plus various flag bits</entry>
|
|
</row>
|
|
<row>
|
|
<entry>t_infomask</entry>
|
|
<entry>uint16</entry>
|
|
<entry>2 bytes</entry>
|
|
<entry>various flag bits</entry>
|
|
</row>
|
|
<row>
|
|
<entry>t_hoff</entry>
|
|
<entry>uint8</entry>
|
|
<entry>1 byte</entry>
|
|
<entry>offset to user data</entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
All the details can be found in
|
|
<filename>src/include/access/htup_details.h</filename>.
|
|
</para>
|
|
|
|
<para>
|
|
|
|
Interpreting the actual data can only be done with information obtained
|
|
from other tables, mostly <structname>pg_attribute</structname>. The
|
|
key values needed to identify field locations are
|
|
<structfield>attlen</structfield> and <structfield>attalign</structfield>.
|
|
There is no way to directly get a
|
|
particular attribute, except when there are only fixed width fields and no
|
|
null values. All this trickery is wrapped up in the functions
|
|
<firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
|
|
and <firstterm>heap_getsysattr</firstterm>.
|
|
|
|
</para>
|
|
<para>
|
|
|
|
To read the data you need to examine each attribute in turn. First check
|
|
whether the field is NULL according to the null bitmap. If it is, go to
|
|
the next. Then make sure you have the right alignment. If the field is a
|
|
fixed width field, then all the bytes are simply placed. If it's a
|
|
variable length field (attlen = -1) then it's a bit more complicated.
|
|
All variable-length data types share the common header structure
|
|
<type>struct varlena</type>, which includes the total length of the stored
|
|
value and some flag bits. Depending on the flags, the data can be either
|
|
inline or in a <acronym>TOAST</acronym> table;
|
|
it might be compressed, too (see <xref linkend="storage-toast"/>).
|
|
|
|
</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
</chapter>
|