postgresql

History

Tom Lane a45c78e328 Rearrange pg_dump's handling of large objects for better efficiency. Commit `c0d5be5d6` caused pg_dump to create a separate BLOB metadata TOC entry for each large object (blob), but it did not touch the ancient decision to put all the blobs' data into a single "BLOBS" TOC entry. This is bad for a few reasons: for databases with millions of blobs, the TOC becomes unreasonably large, causing performance issues; selective restore of just some blobs is quite impossible; and we cannot parallelize either dump or restore of the blob data, since our architecture for that relies on farming out whole TOC entries to worker processes. To improve matters, let's group multiple blobs into each blob metadata TOC entry, and then make corresponding per-group blob data TOC entries. Selective restore using pg_restore's -l/-L switches is then possible, though only at the group level. (Perhaps we should provide a switch to allow forcing one-blob-per-group for users who need precise selective restore and don't have huge numbers of blobs. This patch doesn't do that, instead just hard-wiring the maximum number of blobs per entry at 1000.) The blobs in a group must all have the same owner, since the TOC entry format only allows one owner to be named. In this implementation we also require them to all share the same ACL (grants); the archive format wouldn't require that, but pg_dump's representation of DumpableObjects does. It seems unlikely that either restriction will be problematic for databases with huge numbers of blobs. The metadata TOC entries now have a "desc" string of "BLOB METADATA", and their "defn" string is just a newline-separated list of blob OIDs. The restore code has to generate creation commands, ALTER OWNER commands, and drop commands (for --clean mode) from that. We would need special-case code for ALTER OWNER and drop in any case, so the alternative of keeping the "defn" as directly executable SQL code for creation wouldn't buy much, and it seems like it'd bloat the archive to little purpose. Since we require the blobs of a metadata group to share the same ACL, we can furthermore store only one copy of that ACL, and then make pg_restore regenerate the appropriate commands for each blob. This saves space in the dump file not only by removing duplicative SQL command strings, but by not needing a separate TOC entry for each blob's ACL. In turn, that reduces client-side memory requirements for handling many blobs. ACL TOC entries that need this special processing are labeled as "ACL"/"LARGE OBJECTS nnn..nnn". If we have a blob with a unique ACL, continue to label it as "ACL"/"LARGE OBJECT nnn". We don't actually have to make such a distinction, but it saves a few cycles during restore for the easy case, and it seems like a good idea to not change the TOC contents unnecessarily. The data TOC entries ("BLOBS") are exactly the same as before, except that now there can be more than one, so we'd better give them identifying tag strings. Also, commit `c0d5be5d6` put the new BLOB metadata TOC entries into SECTION_PRE_DATA, which perhaps is defensible in some ways, but it's a rather odd choice considering that we go out of our way to treat blobs as data. Moreover, because parallel restore handles the PRE_DATA section serially, this means we'd only get part of the parallelism speedup we could hope for. Move these entries into SECTION_DATA, letting us parallelize the lo_create calls not just the data loading when there are many blobs. Add dependencies to ensure that we won't try to load data for a blob we've not yet created. As this stands, we still generate a separate TOC entry for any comment or security label attached to a blob. I feel comfortable in believing that comments and security labels on blobs are rare, so this patch should be enough to get most of the useful TOC compression for blobs. We have to bump the archive file format version number, since existing versions of pg_restore wouldn't know they need to do something special for BLOB METADATA, plus they aren't going to work correctly with multiple BLOBS entries or multiple-large-object ACL entries. The directory and tar-file format handlers need some work for multiple BLOBS entries: they used to hard-wire the file name as "blobs.toc", which is replaced here with "blobs_<dumpid>.toc". The 002_pg_dump.pl test script also knows about that and requires minor updates. (I had to drop the test for manually-compressed blobs.toc files with LZ4, because lz4's obtuse command line design requires explicit specification of the output file name which seems impractical here. I don't think we're losing any useful test coverage thereby; that test stanza seems completely duplicative with the gzip and zstd cases anyway.) In passing, centralize management of the lo_buf used to hold data while restoring blobs. The code previously had each format handler create lo_buf, which seems rather pointless given that the format handlers all make it the same way. Moreover, the format handlers never use lo_buf directly, making this setup a failure from a separation-of-concerns standpoint. Let's move the responsibility into pg_backup_archiver.c, which is the only module concerned with lo_buf. The reason to do this in this patch is that it allows a centralized fix for the now-false assumption that we never restore blobs in parallel. Also, get rid of dead code in DropLOIfExists: it's been a long time since we had any need to be able to restore to a pre-9.0 server. Discussion: https://postgr.es/m/a9f9376f1c3343a6bb319dce294e20ac@EX13D05UWC001.ant.amazon.com		2024-04-01 16:25:56 -04:00
..
initdb	Support C.UTF-8 locale in the new builtin collation provider.	2024-03-19 15:24:41 -07:00
pg_amcheck	Escape output of pg_amcheck test	2024-01-13 20:32:18 +01:00
pg_archivecleanup	Activate perlcritic InputOutput::RequireCheckedSyscalls and fix resulting warnings	2024-03-19 07:09:31 +01:00
pg_basebackup	Fix assorted resource leaks in new pg_createsubscriber code.	2024-04-01 13:47:49 -04:00
pg_checksums	Skip .DS_Store files in server side utils	2024-02-13 13:47:12 +01:00
pg_combinebackup	Add destroyStringInfo function for cleaning up StringInfos	2024-03-16 23:18:28 +01:00
pg_config	Update copyright for 2024	2024-01-03 20:49:05 -05:00
pg_controldata	Update copyright for 2024	2024-01-03 20:49:05 -05:00
pg_ctl	Activate perlcritic InputOutput::RequireCheckedSyscalls and fix resulting warnings	2024-03-19 07:09:31 +01:00
pg_dump	Rearrange pg_dump's handling of large objects for better efficiency.	2024-04-01 16:25:56 -04:00
pg_resetwal	Activate perlcritic InputOutput::RequireCheckedSyscalls and fix resulting warnings	2024-03-19 07:09:31 +01:00
pg_rewind	Allow dbname to be written as part of connstring via pg_basebackup's -R option.	2024-03-21 10:50:33 +05:30
pg_test_fsync	Update copyright for 2024	2024-01-03 20:49:05 -05:00
pg_test_timing	Update copyright for 2024	2024-01-03 20:49:05 -05:00
pg_upgrade	Fix random failure in 004_subscription.	2024-03-27 14:15:03 +05:30
pg_verifybackup	Support json_errdetail in FRONTEND code	2024-03-17 23:56:15 +01:00
pg_waldump	Update copyright for 2024	2024-01-03 20:49:05 -05:00
pg_walsummary	Revert "Temporary patch to help debug pg_walsummary test failures."	2024-03-20 11:34:00 -05:00
pgbench	Adjust pgbench option for debug mode.	2024-03-25 11:08:53 -05:00
pgevent	Update copyright for 2024	2024-01-03 20:49:05 -05:00
psql	Add new COPY option LOG_VERBOSITY.	2024-04-01 15:25:25 +09:00
scripts	reindexdb: Fix warning about uninitialized indices_tables_cell	2024-03-25 11:40:25 +02:00
Makefile	Add new pg_walsummary tool.	2024-01-11 12:48:27 -05:00
meson.build	Add new pg_walsummary tool.	2024-01-11 12:48:27 -05:00