postgresql/src/bin
Tom Lane a45c78e328 Rearrange pg_dump's handling of large objects for better efficiency.
Commit c0d5be5d6 caused pg_dump to create a separate BLOB metadata TOC
entry for each large object (blob), but it did not touch the ancient
decision to put all the blobs' data into a single "BLOBS" TOC entry.
This is bad for a few reasons: for databases with millions of blobs,
the TOC becomes unreasonably large, causing performance issues;
selective restore of just some blobs is quite impossible; and we
cannot parallelize either dump or restore of the blob data, since our
architecture for that relies on farming out whole TOC entries to
worker processes.

To improve matters, let's group multiple blobs into each blob metadata
TOC entry, and then make corresponding per-group blob data TOC entries.
Selective restore using pg_restore's -l/-L switches is then possible,
though only at the group level.  (Perhaps we should provide a switch
to allow forcing one-blob-per-group for users who need precise
selective restore and don't have huge numbers of blobs.  This patch
doesn't do that, instead just hard-wiring the maximum number of blobs
per entry at 1000.)

The blobs in a group must all have the same owner, since the TOC entry
format only allows one owner to be named.  In this implementation
we also require them to all share the same ACL (grants); the archive
format wouldn't require that, but pg_dump's representation of
DumpableObjects does.  It seems unlikely that either restriction
will be problematic for databases with huge numbers of blobs.

The metadata TOC entries now have a "desc" string of "BLOB METADATA",
and their "defn" string is just a newline-separated list of blob OIDs.
The restore code has to generate creation commands, ALTER OWNER
commands, and drop commands (for --clean mode) from that.  We would
need special-case code for ALTER OWNER and drop in any case, so the
alternative of keeping the "defn" as directly executable SQL code
for creation wouldn't buy much, and it seems like it'd bloat the
archive to little purpose.

Since we require the blobs of a metadata group to share the same ACL,
we can furthermore store only one copy of that ACL, and then make
pg_restore regenerate the appropriate commands for each blob.  This
saves space in the dump file not only by removing duplicative SQL
command strings, but by not needing a separate TOC entry for each
blob's ACL.  In turn, that reduces client-side memory requirements for
handling many blobs.

ACL TOC entries that need this special processing are labeled as
"ACL"/"LARGE OBJECTS nnn..nnn".  If we have a blob with a unique ACL,
continue to label it as "ACL"/"LARGE OBJECT nnn".  We don't actually
have to make such a distinction, but it saves a few cycles during
restore for the easy case, and it seems like a good idea to not change
the TOC contents unnecessarily.

The data TOC entries ("BLOBS") are exactly the same as before,
except that now there can be more than one, so we'd better give them
identifying tag strings.

Also, commit c0d5be5d6 put the new BLOB metadata TOC entries into
SECTION_PRE_DATA, which perhaps is defensible in some ways, but
it's a rather odd choice considering that we go out of our way to
treat blobs as data.  Moreover, because parallel restore handles
the PRE_DATA section serially, this means we'd only get part of the
parallelism speedup we could hope for.  Move these entries into
SECTION_DATA, letting us parallelize the lo_create calls not just the
data loading when there are many blobs.  Add dependencies to ensure
that we won't try to load data for a blob we've not yet created.

As this stands, we still generate a separate TOC entry for any comment
or security label attached to a blob.  I feel comfortable in believing
that comments and security labels on blobs are rare, so this patch
should be enough to get most of the useful TOC compression for blobs.

We have to bump the archive file format version number, since existing
versions of pg_restore wouldn't know they need to do something special
for BLOB METADATA, plus they aren't going to work correctly with
multiple BLOBS entries or multiple-large-object ACL entries.

The directory and tar-file format handlers need some work
for multiple BLOBS entries: they used to hard-wire the file name
as "blobs.toc", which is replaced here with "blobs_<dumpid>.toc".
The 002_pg_dump.pl test script also knows about that and requires
minor updates.  (I had to drop the test for manually-compressed
blobs.toc files with LZ4, because lz4's obtuse command line
design requires explicit specification of the output file name
which seems impractical here.  I don't think we're losing any
useful test coverage thereby; that test stanza seems completely
duplicative with the gzip and zstd cases anyway.)

In passing, centralize management of the lo_buf used to hold data
while restoring blobs.  The code previously had each format handler
create lo_buf, which seems rather pointless given that the format
handlers all make it the same way.  Moreover, the format handlers
never use lo_buf directly, making this setup a failure from a
separation-of-concerns standpoint.  Let's move the responsibility into
pg_backup_archiver.c, which is the only module concerned with lo_buf.
The reason to do this in this patch is that it allows a centralized
fix for the now-false assumption that we never restore blobs in
parallel.  Also, get rid of dead code in DropLOIfExists: it's been a
long time since we had any need to be able to restore to a pre-9.0
server.

Discussion: https://postgr.es/m/a9f9376f1c3343a6bb319dce294e20ac@EX13D05UWC001.ant.amazon.com
2024-04-01 16:25:56 -04:00
..
initdb Support C.UTF-8 locale in the new builtin collation provider. 2024-03-19 15:24:41 -07:00
pg_amcheck Escape output of pg_amcheck test 2024-01-13 20:32:18 +01:00
pg_archivecleanup Activate perlcritic InputOutput::RequireCheckedSyscalls and fix resulting warnings 2024-03-19 07:09:31 +01:00
pg_basebackup Fix assorted resource leaks in new pg_createsubscriber code. 2024-04-01 13:47:49 -04:00
pg_checksums Skip .DS_Store files in server side utils 2024-02-13 13:47:12 +01:00
pg_combinebackup Add destroyStringInfo function for cleaning up StringInfos 2024-03-16 23:18:28 +01:00
pg_config Update copyright for 2024 2024-01-03 20:49:05 -05:00
pg_controldata Update copyright for 2024 2024-01-03 20:49:05 -05:00
pg_ctl Activate perlcritic InputOutput::RequireCheckedSyscalls and fix resulting warnings 2024-03-19 07:09:31 +01:00
pg_dump Rearrange pg_dump's handling of large objects for better efficiency. 2024-04-01 16:25:56 -04:00
pg_resetwal Activate perlcritic InputOutput::RequireCheckedSyscalls and fix resulting warnings 2024-03-19 07:09:31 +01:00
pg_rewind Allow dbname to be written as part of connstring via pg_basebackup's -R option. 2024-03-21 10:50:33 +05:30
pg_test_fsync Update copyright for 2024 2024-01-03 20:49:05 -05:00
pg_test_timing Update copyright for 2024 2024-01-03 20:49:05 -05:00
pg_upgrade Fix random failure in 004_subscription. 2024-03-27 14:15:03 +05:30
pg_verifybackup Support json_errdetail in FRONTEND code 2024-03-17 23:56:15 +01:00
pg_waldump Update copyright for 2024 2024-01-03 20:49:05 -05:00
pg_walsummary Revert "Temporary patch to help debug pg_walsummary test failures." 2024-03-20 11:34:00 -05:00
pgbench Adjust pgbench option for debug mode. 2024-03-25 11:08:53 -05:00
pgevent Update copyright for 2024 2024-01-03 20:49:05 -05:00
psql Add new COPY option LOG_VERBOSITY. 2024-04-01 15:25:25 +09:00
scripts reindexdb: Fix warning about uninitialized indices_tables_cell 2024-03-25 11:40:25 +02:00
Makefile Add new pg_walsummary tool. 2024-01-11 12:48:27 -05:00
meson.build Add new pg_walsummary tool. 2024-01-11 12:48:27 -05:00