/*-------------------------------------------------------------------------
 *
 * fd.h
 *    Virtual file descriptor definitions.
 *
 *
 * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/storage/fd.h
 *
 *-------------------------------------------------------------------------
 */

/*
 * calls:
 *
 *    File {Close, Read, Write, Size, Sync}
 *    {Path Name Open, Allocate, Free} File
 *
 * These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
 * Use them for all file activity...
 *
 *    File fd;
 *    fd = PathNameOpenFile("foo", O_RDONLY);
 *
 *    AllocateFile();
 *    FreeFile();
 *
 * Use AllocateFile, not fopen, if you need a stdio file (FILE*); then
 * use FreeFile, not fclose, to close it.  AVOID using stdio for files
 * that you intend to hold open for any length of time, since there is
 * no way for them to share kernel file descriptors with other files.
 *
 * Likewise, use AllocateDir/FreeDir, not opendir/closedir, to allocate
 * open directories (DIR*), and OpenTransientFile/CloseTransientFile for an
 * unbuffered file descriptor.
 *
 * If you really can't use any of the above, at least call AcquireExternalFD
 * or ReserveExternalFD to report any file descriptors that are held for any
 * length of time.  Failure to do so risks unnecessary EMFILE errors.
*/
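/*
 * Illustration only, not part of fd.h.  The comment above says that virtual
 * Files can share kernel file descriptors while stdio files cannot.  The
 * standalone sketch below shows one way such sharing can work: each virtual
 * slot remembers its path and flags, and the least recently used kernel FD
 * is closed and later reopened on demand.  Every toy_* name and both limits
 * are hypothetical, the sketch only handles files opened for reading, and
 * the real fd.c is considerably more involved.
 */

```c
#define _XOPEN_SOURCE 700		/* for pread() */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define TOY_MAX_VFDS 8			/* hypothetical size of the virtual table */
#define TOY_MAX_KERNEL_FDS 2	/* hypothetical cap on open kernel FDs */

typedef struct ToyVfd
{
	char		path[256];		/* remembered so the file can be reopened */
	int			flags;			/* open(2) flags used when reopening */
	int			fd;				/* kernel FD, or -1 while closed */
	unsigned long last_used;	/* logical clock value for LRU eviction */
	int			in_use;			/* nonzero if this slot is allocated */
} ToyVfd;

static ToyVfd toy_vfds[TOY_MAX_VFDS];
static int	toy_open_count = 0;
static unsigned long toy_clock = 0;

/* Close the least recently used kernel FD; its virtual slot stays valid. */
static void
toy_evict_lru(void)
{
	int			victim = -1;

	for (int i = 0; i < TOY_MAX_VFDS; i++)
	{
		if (toy_vfds[i].in_use && toy_vfds[i].fd >= 0 &&
			(victim < 0 ||
			 toy_vfds[i].last_used < toy_vfds[victim].last_used))
			victim = i;
	}
	assert(victim >= 0);
	close(toy_vfds[victim].fd);
	toy_vfds[victim].fd = -1;
	toy_open_count--;
}

/* Ensure slot v has a kernel FD, evicting another slot if necessary. */
static int
toy_access(int v)
{
	ToyVfd	   *vfd = &toy_vfds[v];

	if (vfd->fd < 0)
	{
		if (toy_open_count >= TOY_MAX_KERNEL_FDS)
			toy_evict_lru();
		vfd->fd = open(vfd->path, vfd->flags);
		if (vfd->fd < 0)
			return -1;
		toy_open_count++;
	}
	vfd->last_used = ++toy_clock;
	return 0;
}

/* Open a virtual file; returns a slot number, or -1 on failure. */
int
toy_open(const char *path, int flags)
{
	if (strlen(path) >= sizeof(toy_vfds[0].path))
		return -1;
	for (int v = 0; v < TOY_MAX_VFDS; v++)
	{
		if (toy_vfds[v].in_use)
			continue;
		strcpy(toy_vfds[v].path, path);
		toy_vfds[v].flags = flags;
		toy_vfds[v].fd = -1;
		toy_vfds[v].in_use = 1;
		if (toy_access(v) < 0)
		{
			toy_vfds[v].in_use = 0;
			return -1;
		}
		return v;
	}
	return -1;
}

/* Positioned read, so no seek position must survive a close and reopen. */
ssize_t
toy_read(int v, void *buf, size_t len, off_t offset)
{
	if (toy_access(v) < 0)
		return -1;
	return pread(toy_vfds[v].fd, buf, len, offset);
}

/* Number of kernel FDs the toy table currently holds open. */
int
toy_kernel_fds_open(void)
{
	return toy_open_count;
}
```

/*
 * With the limits above, a caller can toy_open() three files and toy_read()
 * from each in turn while toy_kernel_fds_open() never exceeds two.  A FILE*
 * obtained from fopen() has no comparable way to give up its descriptor
 * temporarily, which is the reason for the AVOID advice above.
 */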
#ifndef FD_H
#define FD_H

#include <dirent.h>

struct iovec;                   /* avoid including port/pg_iovec.h here */

typedef int File;


/* GUC parameter */
extern PGDLLIMPORT int max_files_per_process;
extern PGDLLIMPORT bool data_sync_retry;

/*
 * This is private to fd.c, but exported for save/restore_backend_variables()
 */
extern int max_safe_fds;

/*
 * On Windows, we have to interpret EACCES as possibly meaning the same as
 * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
 * that's what you get.  Ugh.  This code is designed so that we don't
 * actually believe these cases are okay without further evidence (namely,
 * a pending fsync request getting canceled ... see ProcessSyncRequests).
 */
#ifndef WIN32
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
#else
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
#endif
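
/*
 * Illustration only, not part of fd.h: a minimal standalone sketch of how a
 * caller might classify an open failure with FILE_POSSIBLY_DELETED.  The
 * macro is repeated so that the sketch compiles on its own; probe_file() is
 * a hypothetical helper.  As the comment above warns, a "possibly deleted"
 * result is only a hint and should not be trusted without further evidence.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Same definition as in fd.h, repeated to keep the sketch self-contained. */
#ifndef WIN32
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
#else
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
#endif

/*
 * Hypothetical helper: returns 1 if the file could be opened, 0 if the
 * failure looks like "file is gone", and -1 for any other failure.
 */
int
probe_file(const char *path)
{
	FILE	   *f = fopen(path, "rb");
	int			save_errno;

	if (f != NULL)
	{
		fclose(f);
		return 1;
	}

	/* Save errno right away, before any later call can overwrite it. */
	save_errno = errno;
	return FILE_POSSIBLY_DELETED(save_errno) ? 0 : -1;
}
```

/*
 * A caller would typically skip the file quietly when probe_file() returns
 * 0 and report an error when it returns -1.
 */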

/*
 * prototypes for functions in fd.c
 */

/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
extern int FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
extern char *FilePathName(File file);
extern int FileGetRawDesc(File file);
extern int FileGetRawFlags(File file);
extern mode_t FileGetRawMode(File file);

/* Operations used for sharing named temporary files */
extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
extern File PathNameOpenTemporaryFile(const char *path, int mode);
extern bool PathNameDeleteTemporaryFile(const char *name, bool error_on_failure);
extern void PathNameCreateTemporaryDir(const char *base, const char *name);
extern void PathNameDeleteTemporaryDir(const char *name);
extern void TempTablespacePath(char *path, Oid tablespace);

/* Operations that allow use of regular stdio --- USE WITH CAUTION */
extern FILE *AllocateFile(const char *name, const char *mode);
extern int FreeFile(FILE *file);

/* Operations that allow use of pipe streams (popen/pclose) */
extern FILE *OpenPipeStream(const char *command, const char *mode);
extern int ClosePipeStream(FILE *file);

/* Operations to allow use of the <dirent.h> library routines */
extern DIR *AllocateDir(const char *dirname);
extern struct dirent *ReadDir(DIR *dir, const char *dirname);
extern struct dirent *ReadDirExtended(DIR *dir, const char *dirname,
                                      int elevel);
extern int FreeDir(DIR *dir);

/* Operations to allow use of a plain kernel FD, with automatic cleanup */
extern int OpenTransientFile(const char *fileName, int fileFlags);
extern int OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
extern int CloseTransientFile(int fd);

/* If you've really really gotta have a plain kernel FD, use this */
extern int BasicOpenFile(const char *fileName, int fileFlags);
extern int BasicOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);

/* Use these for other cases, and also for long-lived BasicOpenFile FDs */
extern bool AcquireExternalFD(void);
extern void ReserveExternalFD(void);
extern void ReleaseExternalFD(void);
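
/*
 * Illustration only, not part of fd.h: a toy standalone model of the
 * bookkeeping behind the three functions above, to show the intended calling
 * pattern.  The toy_* names, the value of toy_max_safe_fds and the
 * one-third cap are assumptions made for this sketch; consult fd.c for the
 * actual policy.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for state that lives inside fd.c. */
static int	toy_max_safe_fds = 30;
static int	toy_num_external_fds = 0;

/* Unconditionally count one more externally managed FD. */
void
toy_reserve_external_fd(void)
{
	toy_num_external_fds++;
}

/*
 * Count one more external FD only if the (assumed) cap allows it.  On
 * refusal, report EMFILE so the caller can treat it like a failed open.
 */
bool
toy_acquire_external_fd(void)
{
	if (toy_num_external_fds < toy_max_safe_fds / 3)
	{
		toy_reserve_external_fd();
		return true;
	}
	errno = EMFILE;
	return false;
}

/* Give back one previously counted external FD. */
void
toy_release_external_fd(void)
{
	assert(toy_num_external_fds > 0);
	toy_num_external_fds--;
}
```

/*
 * The pattern for a long-lived descriptor created outside fd.c would be to
 * acquire first, create the descriptor, release again if creation fails,
 * and release once more when the descriptor is eventually closed.
 */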

/* Make a directory with default permissions */
extern int MakePGDirectory(const char *directoryName);

/* Miscellaneous support routines */
extern void InitFileAccess(void);
extern void set_max_safe_fds(void);
extern void closeAllVfds(void);
extern void SetTempTablespaces(Oid *tableSpaces, int numSpaces);
extern bool TempTablespacesAreSet(void);
extern int GetTempTablespaces(Oid *tableSpaces, int numSpaces);
extern Oid GetNextTempTableSpace(void);
extern void AtEOXact_Files(bool isCommit);
extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
                              SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
                                   bool unlink_all);
extern bool looks_like_temp_rel_name(const char *name);

extern int pg_fsync(int fd);
extern int pg_fsync_no_writethrough(int fd);
extern int pg_fsync_writethrough(int fd);
extern int pg_fdatasync(int fd);
extern void pg_flush_data(int fd, off_t offset, off_t amount);
extern ssize_t pg_pwritev_with_retry(int fd,
                                     const struct iovec *iov,
                                     int iovcnt,
                                     off_t offset);
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
extern void SyncDataDirectory(void);
extern int data_sync_elevel(int elevel);

/* Filename components */
#define PG_TEMP_FILES_DIR "pgsql_tmp"
#define PG_TEMP_FILE_PREFIX "pgsql_tmp"

#endif                          /* FD_H */