/*-------------------------------------------------------------------------
 *
 * fd.c
 *	  Virtual file descriptor code.
 *
 * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  src/backend/storage/file/fd.c
 *
 * NOTES:
 *
 * This code manages a cache of 'virtual' file descriptors (VFDs).
 * The server opens many file descriptors for a variety of reasons,
 * including base tables, scratch files (e.g., sort and hash spool
 * files), and random calls to C library routines like system(3); it
 * is quite easy to exceed system limits on the number of open files a
 * single process can have.  (This is around 256 on many modern
 * operating systems, but can be as low as 32 on others.)
 *
 * VFDs are managed as an LRU pool, with actual OS file descriptors
 * being opened and closed as needed.  Obviously, if a file is
 * opened using these interfaces, all subsequent operations must also
 * be through these interfaces (the File type is not a real file
 * descriptor).
 *
 * For this scheme to work, most (if not all) routines throughout the
 * server should use these interfaces instead of calling the C library
 * routines (e.g., open(2) and fopen(3)) themselves.  Otherwise, we
 * may find ourselves short of real file descriptors anyway.
 *
 * INTERFACE ROUTINES
 *
 * PathNameOpenFile and OpenTemporaryFile are used to open virtual files.
 * A File opened with OpenTemporaryFile is automatically deleted when the
 * File is closed, either explicitly or implicitly at end of transaction or
 * process exit. PathNameOpenFile is intended for files that are held open
 * for a long time, like relation files. It is the caller's responsibility
 * to close them; there is no automatic mechanism in fd.c for that.
 *
 * AllocateFile, AllocateDir, OpenPipeStream and OpenTransientFile are
 * wrappers around fopen(3), opendir(3), popen(3) and open(2), respectively.
 * They behave like the corresponding native functions, except that the handle
 * is registered with the current subtransaction, and will be automatically
 * closed at abort. These are intended mainly for short operations like
 * reading a configuration file; there is a limit on the number of files that
 * can be opened using these functions at any one time.
 *
 * Finally, BasicOpenFile is just a thin wrapper around open() that can
 * release file descriptors in use by the virtual file descriptors if
 * necessary. There is no automatic cleanup of file descriptors returned by
 * BasicOpenFile; it is solely the caller's responsibility to close the file
 * descriptor by calling close(2).
 *
 *-------------------------------------------------------------------------
 */
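The BasicOpenFile behaviour described in the header (open(), and if descriptors run out, release one held by the VFD pool and try again) can be sketched as below. This is a minimal illustration, not fd.c's implementation: `cached_fds`, `release_one_cached_fd()` and `basic_open_sketch()` are hypothetical, and the "pool" is just a small array of kernel descriptors closed oldest-first.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical pool of kernel fds held by a VFD-like cache, oldest first. */
#define POOL_MAX 16
static int cached_fds[POOL_MAX];
static int n_cached = 0;

/* Hypothetical helper: close the oldest cached fd; false if none is left. */
static bool release_one_cached_fd(void)
{
    if (n_cached == 0)
        return false;
    close(cached_fds[0]);
    for (int i = 1; i < n_cached; i++)
        cached_fds[i - 1] = cached_fds[i];
    n_cached--;
    return true;
}

/*
 * BasicOpenFile-style wrapper: try open(2); if the process (EMFILE) or the
 * system (ENFILE) is out of descriptors, give up one cached fd and retry.
 */
static int basic_open_sketch(const char *path, int flags, int mode)
{
    for (;;)
    {
        int fd = open(path, flags, mode);

        if (fd >= 0)
            return fd;
        if (errno != EMFILE && errno != ENFILE)
            return -1;              /* unrelated failure, e.g. ENOENT */

        int save_errno = errno;     /* close() below may change errno */
        if (!release_one_cached_fd())
        {
            errno = save_errno;     /* nothing left to free: report original error */
            return -1;
        }
    }
}
```

Only descriptor exhaustion triggers a retry; any other error is returned immediately so the caller sees the real cause.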

#include "postgres.h"

#include <sys/file.h>
#include <sys/param.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#ifdef HAVE_SYS_RESOURCE_H
#include <sys/resource.h>		/* for getrlimit */
#endif

#include "miscadmin.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/pg_tablespace.h"
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "utils/guc.h"
#include "utils/resowner_private.h"
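The header warns that per-process open-file limits vary widely, and `<sys/resource.h>` is included for getrlimit. Below is a hedged sketch of two ways a process can learn how many descriptors it may use: reading the soft `RLIMIT_NOFILE` limit, and probing by duplicating a descriptor until `dup()` fails. It assumes a POSIX system; `soft_fd_limit()` and `count_openable_fds()` are hypothetical names, and fd.c's own startup probe is not part of this excerpt.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Soft RLIMIT_NOFILE limit, or -1 if it cannot be read or is unlimited. */
static long soft_fd_limit(void)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_NOFILE, &rlim) != 0 || rlim.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long) rlim.rlim_cur;
}

/*
 * Count how many additional descriptors can actually be opened, up to
 * max_to_probe, by dup()ing one /dev/null descriptor until dup() fails.
 * Every probe descriptor is closed again before returning.
 */
static int count_openable_fds(int max_to_probe)
{
    int *fds;
    int base;
    int n = 0;

    if (max_to_probe <= 0)
        return 0;
    fds = malloc(sizeof(int) * (size_t) max_to_probe);
    if (fds == NULL)
        return -1;
    base = open("/dev/null", O_RDONLY);
    if (base < 0)
    {
        free(fds);
        return -1;
    }
    while (n < max_to_probe)
    {
        int fd = dup(base);

        if (fd < 0)
            break;              /* typically EMFILE: per-process limit hit */
        fds[n++] = fd;
    }
    for (int i = 0; i < n; i++)
        close(fds[i]);
    close(base);
    free(fds);
    return n;
}
```

The probed count can be lower than the rlimit, because descriptors the process already holds count against the same limit.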

/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
#if defined(HAVE_SYNC_FILE_RANGE)
#define PG_FLUSH_DATA_WORKS 1
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
#define PG_FLUSH_DATA_WORKS 1
#endif
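PG_FLUSH_DATA_WORKS is defined when either sync_file_range() or posix_fadvise() with POSIX_FADV_DONTNEED is available. The following is a hedged sketch of the posix_fadvise branch only. `flush_hint()` is a hypothetical name, and the Linux-specific sync_file_range() path is omitted. Where the platform lacks the advice constant, the sketch compiles to a no-op, mirroring how the macro above is simply left undefined.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Advise the kernel that cached pages for [offset, offset + nbytes) will not
 * be needed soon (nbytes == 0 means "to end of file").  Returns 0 on success
 * or when the hint is unavailable; otherwise the error number, because
 * posix_fadvise() returns its error rather than setting errno.
 */
static int flush_hint(int fd, off_t offset, off_t nbytes)
{
#if defined(POSIX_FADV_DONTNEED)
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#else
    (void) fd;
    (void) offset;
    (void) nbytes;
    return 0;                   /* hint not supported on this platform */
#endif
}
```

POSIX_FADV_DONTNEED is only advice and does not make data durable. fsync() is still required for that.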

/*
 * We must leave some file descriptors free for system(), the dynamic loader,
 * and other code that tries to open files without consulting fd.c.  This
 * is the number left free.  (While we can be pretty sure we won't get
 * EMFILE, there's never any guarantee that we won't get ENFILE due to
 * other processes chewing up FDs.  So it's a bad idea to try to open files
 * without consulting fd.c.  Nonetheless we cannot control all code.)
 *
 * Because this is just a fixed setting, we are effectively assuming that
 * no such code will leave FDs open over the long term; otherwise the slop
 * is likely to be insufficient.  Note in particular that we expect that
 * loading a shared library does not result in any permanent increase in
 * the number of open files.  (This appears to be true on most if not
 * all platforms as of Feb 2004.)
 */
#define NUM_RESERVED_FDS		10

/*
 * If we have fewer than this many usable FDs after allowing for the reserved
 * ones, choke.
 */
#define FD_MINFREE				10
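Here is a small worked sketch of how these two constants interact. It assumes that the usable count is first capped by max_files_per_process minus the descriptors already open, since the max_safe_fds comment says max_files_per_process is taken into account, and is then reduced by NUM_RESERVED_FDS. The function name and the exact formula are illustrative assumptions, not a copy of set_max_safe_fds().

```c
#include <assert.h>

#define NUM_RESERVED_FDS 10
#define FD_MINFREE       10

/*
 * usable_fds:   descriptors a startup probe managed to open
 * already_open: descriptors held before the probe ran
 * Returns the count the VFD layer may use, or -1 if below FD_MINFREE.
 */
static int compute_safe_fds(int usable_fds, int already_open,
                            int max_files_per_process)
{
    int safe = usable_fds;

    /* never exceed the DBA-configured per-process cap */
    if (max_files_per_process - already_open < safe)
        safe = max_files_per_process - already_open;

    /* leave slack for system(), the dynamic loader, etc. */
    safe -= NUM_RESERVED_FDS;

    return (safe < FD_MINFREE) ? -1 : safe;
}
```

For example, if a probe finds 1000 usable descriptors, 3 are already open, and max_files_per_process = 1000, the result is min(1000, 997) - 10 = 987. If the probe finds only 15, just 5 remain, which is below FD_MINFREE, so the code would choke.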

/*
 * A number of platforms allow individual processes to open many more files
 * than they can really support when *many* processes do the same thing.
 * This GUC parameter lets the DBA limit max_safe_fds to something less than
 * what the postmaster's initial probe suggests will work.
 */
int			max_files_per_process = 1000;

/*
 * Maximum number of file descriptors to open for either VFD entries or
 * AllocateFile/AllocateDir/OpenPipeStream/OpenTransientFile operations.
 * This is initialized to a conservative value, and remains that way
 * indefinitely in bootstrap or standalone-backend cases.  In normal
 * postmaster operation, the postmaster calls set_max_safe_fds() late in
 * initialization to update the value, and that value is then inherited by
 * forked subprocesses.
 *
 * Note: the value of max_files_per_process is taken into account while
 * setting this variable, and so need not be tested separately.
 */
int			max_safe_fds = 32;	/* default if not changed */


/* Debugging.... */

#ifdef FDDEBUG
#define DO_DB(A) \
	do { \
		int			_do_db_save_errno = errno; \
		A; \
		errno = _do_db_save_errno; \
	} while (0)
#else
#define DO_DB(A) \
	((void) 0)
#endif
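In the FDDEBUG variant, DO_DB saves and restores errno around the debug statement. This prevents debug output from clobbering an error code that the caller is about to inspect. The standalone sketch below reproduces the same macro shape. `noisy_debug()` is a hypothetical debug hook that deliberately changes errno, as a logging call might.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Same shape as the FDDEBUG branch: run A, then put errno back. */
#define DO_DB(A) \
    do { \
        int _do_db_save_errno = errno; \
        A; \
        errno = _do_db_save_errno; \
    } while (0)

static int debug_calls = 0;

/* Hypothetical debug hook that clobbers errno. */
static void noisy_debug(const char *msg)
{
    debug_calls++;
    errno = EINVAL;
    fprintf(stderr, "DEBUG: %s\n", msg);
}
```

The `do { ... } while (0)` wrapper makes the macro behave as a single statement, so it stays safe after an unbraced `if`.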

#define VFD_CLOSED (-1)

#define FileIsValid(file) \
	((file) > 0 && (file) < (int) SizeVfdCache && VfdCache[file].fileName != NULL)

#define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)

#define FileUnknownPos ((off_t) -1)

/* these are the assigned bits in fdstate below: */
#define FD_TEMPORARY		(1 << 0)	/* T = delete when closed */
#define FD_XACT_TEMPORARY	(1 << 1)	/* T = delete at eoXact */
|
This patch implements holdable cursors, following the proposal
(materialization into a tuple store) discussed on pgsql-hackers earlier.
I've updated the documentation and the regression tests.
Notes on the implementation:
- I needed to change the tuple store API slightly -- it assumes that it
won't be used to hold data across transaction boundaries, so the temp
files that it uses for on-disk storage are automatically reclaimed at
end-of-transaction. I added a flag to tuplestore_begin_heap() to control
this behavior. Is changing the tuple store API in this fashion OK?
- in order to store executor results in a tuple store, I added a new
CommandDest. This works well for the most part, with one exception: the
current DestFunction API doesn't provide enough information to allow the
Executor to store results into an arbitrary tuple store (where the
particular tuple store to use is chosen by the call site of
ExecutorRun). To workaround this, I've temporarily hacked up a solution
that works, but is not ideal: since the receiveTuple DestFunction is
passed the portal name, we can use that to lookup the Portal data
structure for the cursor and then use that to get at the tuple store the
Portal is using. This unnecessarily ties the Portal code with the
tupleReceiver code, but it works...
The proper fix for this is probably to change the DestFunction API --
Tom suggested passing the full QueryDesc to the receiveTuple function.
In that case, callers of ExecutorRun could "subclass" QueryDesc to add
any additional fields that their particular CommandDest needed to get
access to. This approach would work, but I'd like to think about it for
a little bit longer before deciding which route to go. In the mean time,
the code works fine, so I don't think a fix is urgent.
- (semi-related) I added a NO SCROLL keyword to DECLARE CURSOR, and
adjusted the behavior of SCROLL in accordance with the discussion on
-hackers.
- (unrelated) Cleaned up some SGML markup in sql.sgml, copy.sgml
Neil Conway
2003-03-27 17:51:29 +01:00
|
|
|
|

typedef struct vfd
{
	int			fd;				/* current FD, or VFD_CLOSED if none */
	unsigned short fdstate;		/* bitflags for VFD's state */
	ResourceOwner resowner;		/* owner, for automatic cleanup */
	File		nextFree;		/* link to next free VFD, if in freelist */
	File		lruMoreRecently;	/* doubly linked recency-of-use list */
	File		lruLessRecently;
	off_t		seekPos;		/* current logical file position */
	off_t		fileSize;		/* current size of file (0 if not temporary) */
	char	   *fileName;		/* name of file, or NULL for unused VFD */
	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
	int			fileFlags;		/* open(2) flags for (re)opening the file */
	int			fileMode;		/* mode to pass to open(2) */
} Vfd;
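The lruMoreRecently/lruLessRecently fields suggest a circular doubly linked list threaded through the VfdCache array, using indexes as links. Slot 0 serves as the list header, as the VfdCache comment says. The minimal sketch below models only those two link fields. It assumes that the header's lruMoreRecently points at the least recently used entry and its lruLessRecently at the most recently used one. `RingEntry` and the `ring_*` functions are hypothetical and are not fd.c's own routines.

```c
#include <assert.h>

#define RING_SIZE 8

typedef struct RingEntry
{
    int lruMoreRecently;   /* index of the next more-recently-used entry */
    int lruLessRecently;   /* index of the next less-recently-used entry */
} RingEntry;

static RingEntry ring[RING_SIZE];

/* Empty ring: the header (slot 0) points at itself in both directions. */
static void ring_init(void)
{
    ring[0].lruMoreRecently = 0;
    ring[0].lruLessRecently = 0;
}

/* Unlink entry i from wherever it currently sits in the ring. */
static void ring_delete(int i)
{
    ring[ring[i].lruLessRecently].lruMoreRecently = ring[i].lruMoreRecently;
    ring[ring[i].lruMoreRecently].lruLessRecently = ring[i].lruLessRecently;
}

/* Link entry i in as the most recently used one. */
static void ring_insert(int i)
{
    ring[i].lruMoreRecently = 0;
    ring[i].lruLessRecently = ring[0].lruLessRecently;
    ring[ring[0].lruLessRecently].lruMoreRecently = i;
    ring[0].lruLessRecently = i;
}

/* Least recently used entry (the eviction victim), or 0 if empty. */
static int ring_oldest(void)
{
    return ring[0].lruMoreRecently;
}

/* Using an entry moves it to the most-recent end. */
static void ring_touch(int i)
{
    ring_delete(i);
    ring_insert(i);
}
```

When a new kernel descriptor is needed and none are free, the entry returned by `ring_oldest()` is the natural one to close. Every insert, delete and touch operation runs in constant time.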

/*
 * Virtual File Descriptor array pointer and size.  This grows as
 * needed.  'File' values are indexes into this array.
 * Note that VfdCache[0] is not a usable VFD, just a list header.
 */
static Vfd *VfdCache;
static Size SizeVfdCache = 0;
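VfdCache is a growable array indexed by File. Slot 0 is reserved, and the Vfd nextFree field can chain unused slots. The hedged sketch below allocates slots in that style and includes a validity test shaped like FileIsValid. `Slot`, `cache`, `slot_alloc()`, `slot_free()` and `slot_is_valid()` are hypothetical names, and the doubling growth policy is an assumption made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Slot
{
    int   nextFree;   /* next free slot index; 0 ends the free list */
    char *fileName;   /* NULL when the slot is unused */
} Slot;

static Slot  *cache = NULL;
static size_t cache_size = 0;

/* Return a free slot index (> 0), growing the array if needed; -1 on OOM. */
static int slot_alloc(void)
{
    if (cache == NULL)
    {
        cache = calloc(1, sizeof(Slot));   /* slot 0: free-list header */
        if (cache == NULL)
            return -1;
        cache_size = 1;
    }
    if (cache[0].nextFree == 0)
    {
        size_t new_size = cache_size * 2;
        Slot  *grown = realloc(cache, new_size * sizeof(Slot));

        if (grown == NULL)
            return -1;
        cache = grown;
        /* chain the newly added slots onto the free list */
        for (size_t i = cache_size; i < new_size; i++)
        {
            cache[i].fileName = NULL;
            cache[i].nextFree = (i + 1 < new_size) ? (int) (i + 1) : 0;
        }
        cache[0].nextFree = (int) cache_size;
        cache_size = new_size;
    }
    int file = cache[0].nextFree;
    cache[0].nextFree = cache[file].nextFree;
    return file;
}

/* Release slot 'file' and push it onto the head of the free list. */
static void slot_free(int file)
{
    free(cache[file].fileName);
    cache[file].fileName = NULL;
    cache[file].nextFree = cache[0].nextFree;
    cache[0].nextFree = file;
}

/* Same shape as FileIsValid: in range, not the header, and in use. */
static int slot_is_valid(int file)
{
    return file > 0 && (size_t) file < cache_size &&
        cache[file].fileName != NULL;
}
```

Callers hold indexes rather than pointers, so `realloc()` moving the array does not invalidate their handles.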

/*
 * Number of file descriptors known to be in use by VFD entries.
 */
static int	nfile = 0;

/*
 * Flag to tell whether it's worth scanning VfdCache looking for temp files
 * to close.
 */
static bool have_xact_temporary_files = false;

/*
 * Tracks the total size of all temporary files.  Note: when temp_file_limit
 * is being enforced, this cannot overflow since the limit cannot be more
 * than INT_MAX kilobytes.  When not enforcing, it could theoretically
 * overflow, but we don't care.
 */
static uint64 temporary_files_size = 0;
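temporary_files_size counts bytes, while temp_file_limit is expressed in kilobytes. The hedged sketch below shows a check that a write path could perform before extending a temporary file. `would_exceed_temp_limit()` is hypothetical, and treating a negative limit as "not enforced" is an assumption. The limit is converted to bytes in 64-bit arithmetic, so even INT_MAX kilobytes (about 2 TiB) fits without overflow, which matches the comment's reasoning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t temporary_files_size = 0;   /* bytes, across all temp files */

/*
 * Would growing one temp file from cur_size to new_end bytes push the
 * total past limit_kb kilobytes?  A negative limit means "not enforced".
 */
static bool would_exceed_temp_limit(int limit_kb, uint64_t cur_size,
                                    uint64_t new_end)
{
    uint64_t new_total;

    if (limit_kb < 0 || new_end <= cur_size)
        return false;                 /* no limit, or file is not growing */
    new_total = temporary_files_size + (new_end - cur_size);
    return new_total > (uint64_t) limit_kb * 1024u;
}
```

Only the growth beyond the file's current size is charged, so overwriting existing bytes never trips the limit.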

/*
 * List of OS handles opened with AllocateFile, AllocateDir, OpenPipeStream
 * and OpenTransientFile.
 */
typedef enum
{
	AllocateDescFile,
	AllocateDescPipe,
	AllocateDescDir,
	AllocateDescRawFD
} AllocateDescKind;
|
1999-05-09 02:52:08 +02:00
|
|
|
|
2004-08-29 07:07:03 +02:00
|
|
|
typedef struct
|
|
|
|
{
|
|
|
|
AllocateDescKind kind;
|
Remove fixed limit on the number of concurrent AllocateFile() requests.
AllocateFile(), AllocateDir(), and some sister routines share a small array
for remembering requests, so that the files can be closed on transaction
failure. Previously that array had a fixed size, MAX_ALLOCATED_DESCS (32).
While historically that had seemed sufficient, Steve Toutant pointed out
that this meant you couldn't scan more than 32 file_fdw foreign tables in
one query, because file_fdw depends on the COPY code which uses
AllocateFile(). There are probably other cases, or will be in the future,
where this nonconfigurable limit impedes users.
We can't completely remove any such limit, at least not without a lot of
work, since each such request requires a kernel file descriptor and most
platforms limit the number we can have. (In principle we could
"virtualize" these descriptors, as fd.c already does for the main VFD pool,
but not without an additional layer of overhead and a lot of notational
impact on the calling code.) But we can at least let the array size be
configurable. Hence, change the code to allow up to max_safe_fds/2
allocated file requests. On modern platforms this should allow several
hundred concurrent file_fdw scans, or more if one increases the value of
max_files_per_process. To go much further than that, we'd need to do some
more work on the data structure, since the current code for closing
requests has potentially O(N^2) runtime; but it should still be all right
for request counts in this range.
Back-patch to 9.1 where contrib/file_fdw was introduced.
2013-06-09 19:46:54 +02:00
|
|
|
SubTransactionId create_subid;
|
2004-08-29 07:07:03 +02:00
|
|
|
union
|
|
|
|
{
|
|
|
|
FILE *file;
|
|
|
|
DIR *dir;
|
Add OpenTransientFile, with automatic cleanup at end-of-xact.
Files opened with BasicOpenFile or PathNameOpenFile are not automatically
cleaned up on error. That puts unnecessary burden on callers that only want
to keep the file open for a short time. There is AllocateFile, but that
returns a buffered FILE * stream, which in many cases is not the nicest API
to work with. So add function called OpenTransientFile, which returns a
unbuffered fd that's cleaned up like the FILE* returned by AllocateFile().
This plugs a few rare fd leaks in error cases:
1. copy_file() - fixed by by using OpenTransientFile instead of BasicOpenFile
2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't
use OpenTransientFile here because the fd is supposed to persist over
transaction boundaries.
3. lo_import/lo_export - fixed by using OpenTransientFile instead of
PathNameOpenFile.
In addition to plugging those leaks, this replaces many BasicOpenFile() calls
with OpenTransientFile() that were not leaking, because the code meticulously
closed the file on error. That wasn't strictly necessary, but IMHO it's good
for robustness.
The same leaks exist in older versions, but given the rarity of the issues,
I'm not backpatching this. Not yet, anyway - it might be good to backpatch
later, after this mechanism has had some more testing in master branch.
2012-11-27 09:25:50 +01:00
|
|
|
int fd;
|
2004-08-29 07:07:03 +02:00
|
|
|
} desc;
|
2004-07-28 16:23:31 +02:00
|
|
|
} AllocateDesc;
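
/*
 * Illustrative sketch (not part of the original file): "kind" records which
 * member of the "desc" union is valid, so releasing an entry has to dispatch
 * on it.  This assumes that the AllocateDescKind values defined above are
 * AllocateDescFile, AllocateDescPipe, AllocateDescDir and AllocateDescRawFD.
 * Under that assumption the dispatch looks roughly like this (see FreeDesc()
 * for the actual code):
 *
 *	switch (desc->kind)
 *	{
 *		case AllocateDescFile:
 *			result = fclose(desc->desc.file);
 *			break;
 *		case AllocateDescPipe:
 *			result = pclose(desc->desc.file);
 *			break;
 *		case AllocateDescDir:
 *			result = closedir(desc->desc.dir);
 *			break;
 *		case AllocateDescRawFD:
 *			result = close(desc->desc.fd);
 *			break;
 *	}
 */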

static int numAllocatedDescs = 0;
static int maxAllocatedDescs = 0;
static AllocateDesc *allocatedDescs = NULL;

/*
 * Number of temporary files opened during the current session;
 * this is used in generation of tempfile names.
 */
static long tempFileCounter = 0;

/*
 * Array of OIDs of temp tablespaces.  When numTempTableSpaces is -1,
 * this has not been set in the current transaction.
 */
static Oid *tempTableSpaces = NULL;
static int numTempTableSpaces = -1;
static int nextTempTableSpace = 0;
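
/*
 * Illustrative sketch (not part of the original file): nextTempTableSpace is
 * an index into tempTableSpaces.  One plausible use, assumed here, is to
 * spread successive temporary files across the temp tablespaces in
 * round-robin order.  Assuming the caller has already checked that
 * numTempTableSpaces > 0, the next OID could be chosen like this:
 *
 *	if (++nextTempTableSpace >= numTempTableSpaces)
 *		nextTempTableSpace = 0;
 *	return tempTableSpaces[nextTempTableSpace];
 */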


/*--------------------
 *
 * Private Routines
 *
 * Delete          - delete a file from the Lru ring
 * LruDelete       - remove a file from the Lru ring and close its FD
 * Insert          - put a file at the front of the Lru ring
 * LruInsert       - put a file at the front of the Lru ring and open it
 * ReleaseLruFile  - Release an fd by closing the last entry in the Lru ring
 * ReleaseLruFiles - Release fd(s) until we're under the max_safe_fds limit
 * AllocateVfd     - grab a free (or new) file record (from VfdCache)
 * FreeVfd         - free a file record
 *
 * The Least Recently Used ring is a doubly linked list that begins and
 * ends on element zero.  Element zero is special -- it doesn't represent
 * a file and its "fd" field always == VFD_CLOSED.  Element zero is just an
 * anchor that shows us the beginning/end of the ring.
 * Only VFD elements that are currently really open (have an FD assigned) are
 * in the Lru ring.  Elements that are "virtually" open can be recognized
 * by having a non-null fileName field.
 *
 * example:
 *
 *    /--less----\                /---------\
 *    v           \               v          \
 *   #0 --more---> LeastRecentlyUsed --more-\ \
 *    ^\                                    | |
 *     \\less--> MostRecentlyUsedFile   <---/ |
 *      \more---/                    \--less--/
 *
 *--------------------
 */
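
/*
 * Illustrative sketch (not part of the original file) of the ring
 * manipulation described above.  It assumes the Vfd link fields are named
 * lruMoreRecently and lruLessRecently; compare Delete() and Insert() for the
 * actual code.  Unlinking entry "file" from the ring:
 *
 *	Vfd *vfdP = &VfdCache[file];
 *
 *	VfdCache[vfdP->lruLessRecently].lruMoreRecently = vfdP->lruMoreRecently;
 *	VfdCache[vfdP->lruMoreRecently].lruLessRecently = vfdP->lruLessRecently;
 *
 * Relinking it as the most recently used entry, i.e. the "less" neighbour of
 * anchor element 0:
 *
 *	vfdP->lruMoreRecently = 0;
 *	vfdP->lruLessRecently = VfdCache[0].lruLessRecently;
 *	VfdCache[0].lruLessRecently = file;
 *	VfdCache[vfdP->lruLessRecently].lruMoreRecently = file;
 *
 * Element 0 is always present, so neither operation needs a special case for
 * an empty ring or for the ends of the list.
 */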
static void Delete(File file);
static void LruDelete(File file);
static void Insert(File file);
static int LruInsert(File file);
static bool ReleaseLruFile(void);
static void ReleaseLruFiles(void);
static File AllocateVfd(void);
static void FreeVfd(File file);

static int FileAccess(File file);
static File OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError);
static bool reserveAllocatedDesc(void);
static int FreeDesc(AllocateDesc *desc);
static struct dirent *ReadDirExtended(DIR *dir, const char *dirname, int elevel);

static void AtProcExit_Files(int code, Datum arg);
static void CleanupTempFiles(bool isProcExit);
static void RemovePgTempFilesInDir(const char *tmpdirname);
static void RemovePgTempRelationFiles(const char *tsdirname);
static void RemovePgTempRelationFilesInDbspace(const char *dbspacedirname);
static bool looks_like_temp_rel_name(const char *name);

static void walkdir(const char *path,
		void (*action) (const char *fname, bool isdir, int elevel),
		bool process_symlinks,
		int elevel);
#ifdef PG_FLUSH_DATA_WORKS
static void pre_sync_fname(const char *fname, bool isdir, int elevel);
#endif
static void fsync_fname_ext(const char *fname, bool isdir, int elevel);


/*
 * pg_fsync --- do fsync with or without writethrough
 */
int
pg_fsync(int fd)
{
	/* #if is to skip the sync_method test if there's no need for it */
#if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
	if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
		return pg_fsync_writethrough(fd);
	else
#endif
		return pg_fsync_no_writethrough(fd);
}


/*
 * pg_fsync_no_writethrough --- same as fsync except does nothing if
 *	enableFsync is off
 */
int
pg_fsync_no_writethrough(int fd)
{
	if (enableFsync)
		return fsync(fd);
	else
		return 0;
}

/*
 * pg_fsync_writethrough --- fsync that also flushes the disk's write cache
 *	where the platform supports it: _commit() on Windows, fcntl(F_FULLFSYNC)
 *	where that is defined.  Elsewhere it fails with ENOSYS.  Does nothing if
 *	enableFsync is off.
 */
int
pg_fsync_writethrough(int fd)
{
	if (enableFsync)
	{
#ifdef WIN32
		return _commit(fd);
#elif defined(F_FULLFSYNC)
		return (fcntl(fd, F_FULLFSYNC, 0) == -1) ? -1 : 0;
#else
		errno = ENOSYS;
		return -1;
#endif
	}
	else
		return 0;
}

/*
 * pg_fdatasync --- same as fdatasync except does nothing if enableFsync is off
 *
 * Not all platforms have fdatasync; treat as fsync if not available.
 */
int
pg_fdatasync(int fd)
{
	if (enableFsync)
	{
#ifdef HAVE_FDATASYNC
		return fdatasync(fd);
#else
		return fsync(fd);
#endif
	}
	else
		return 0;
}

/*
 * pg_flush_data --- advise OS that the data described won't be needed soon
 *
 * Not all platforms have sync_file_range or posix_fadvise; treat as no-op
 * if not available.  Also, treat as no-op if enableFsync is off; this is
 * because the call isn't free, and some platforms such as Linux will actually
 * block the requestor until the write is scheduled.
 */
int
pg_flush_data(int fd, off_t offset, off_t amount)
{
#ifdef PG_FLUSH_DATA_WORKS
	if (enableFsync)
	{
#if defined(HAVE_SYNC_FILE_RANGE)
		return sync_file_range(fd, offset, amount, SYNC_FILE_RANGE_WRITE);
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
		return posix_fadvise(fd, offset, amount, POSIX_FADV_DONTNEED);
#else
#error PG_FLUSH_DATA_WORKS should not have been defined
#endif
	}
#endif
	return 0;
}
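
/*
 * Illustrative sketch (not part of the original file): a caller copying a
 * large file in chunks might call pg_flush_data() after each write.  The
 * presumed benefit is that the kernel can start writing the data out early
 * instead of accumulating a large amount of dirty data before the final
 * fsync.  The names dstfd, buffer, nbytes and offset are hypothetical.  A
 * short write that sets no errno is treated as ENOSPC, a common convention.
 * The result of pg_flush_data() is ignored because the call is only a hint.
 *
 *	errno = 0;
 *	if (write(dstfd, buffer, nbytes) != nbytes)
 *	{
 *		if (errno == 0)
 *			errno = ENOSPC;
 *		ereport(ERROR,
 *				(errcode_for_file_access(),
 *				 errmsg("could not write file: %m")));
 *	}
 *	(void) pg_flush_data(dstfd, offset, nbytes);
 *	offset += nbytes;
 */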


/*
 * fsync_fname -- fsync a file or directory, handling errors properly
 *
 * Try to fsync a file or directory.  When doing the latter, ignore errors that
 * indicate the OS just doesn't allow/require fsyncing directories.
 */
void
fsync_fname(char *fname, bool isdir)
{
	int fd;
	int returncode;

	/*
	 * Some OSs require directories to be opened read-only whereas other
	 * systems don't allow us to fsync files opened read-only; so we need both
	 * cases here.
	 */
	if (!isdir)
		fd = OpenTransientFile(fname,
							   O_RDWR | PG_BINARY,
							   S_IRUSR | S_IWUSR);
	else
		fd = OpenTransientFile(fname,
							   O_RDONLY | PG_BINARY,
							   S_IRUSR | S_IWUSR);

	/*
	 * Some OSs don't allow us to open directories at all (Windows returns
	 * EACCES)
	 */
	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
		return;

	else if (fd < 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", fname)));

	returncode = pg_fsync(fd);

	/* Some OSs don't allow us to fsync directories at all */
	if (returncode != 0 && isdir && errno == EBADF)
	{
		CloseTransientFile(fd);
		return;
	}

	if (returncode != 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\": %m", fname)));

	CloseTransientFile(fd);
}
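
/*
 * Illustrative sketch (not part of the original file): to make a newly
 * created file durable, a caller would typically fsync both the file and its
 * parent directory.  On many filesystems the directory entry itself is only
 * safely on disk once the directory has been fsync'd.  The paths are
 * hypothetical.
 *
 *	fsync_fname("pg_example/newfile", false);
 *	fsync_fname("pg_example", true);
 *
 * Either call reports failure with ereport(ERROR).  The one exception is the
 * directory call, which quietly returns on platforms that cannot open or
 * fsync directories.
 */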


/*
 * InitFileAccess --- initialize this module during backend startup
 *
 * This is called during either normal or standalone backend start.
 * It is *not* called in the postmaster.
 */
void
InitFileAccess(void)
{
	Assert(SizeVfdCache == 0);	/* call me only once */

	/* initialize cache header entry */
	VfdCache = (Vfd *) malloc(sizeof(Vfd));
	if (VfdCache == NULL)
		ereport(FATAL,
				(errcode(ERRCODE_OUT_OF_MEMORY),
				 errmsg("out of memory")));

	MemSet((char *) &(VfdCache[0]), 0, sizeof(Vfd));
	VfdCache->fd = VFD_CLOSED;

	SizeVfdCache = 1;

	/* register proc-exit hook to ensure temp files are dropped at exit */
	on_proc_exit(AtProcExit_Files, 0);
}

/*
 * count_usable_fds --- count how many FDs the system will let us open,
 *		and estimate how many are already open.
 *
 * We stop counting if usable_fds reaches max_to_probe.  Note: a small
 * value of max_to_probe might result in an underestimate of already_open;
 * we must fill in any "gaps" in the set of used FDs before the calculation
 * of already_open will give the right answer.  In practice, max_to_probe
 * of a couple of dozen should be enough to ensure good results.
 *
 * We assume stdin (FD 0) is available for dup'ing.
 */
static void
count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
{
	int *fd;
	int size;
	int used = 0;
	int highestfd = 0;
	int j;

#ifdef HAVE_GETRLIMIT
	struct rlimit rlim;
	int getrlimit_status;
#endif

	size = 1024;
	fd = (int *) palloc(size * sizeof(int));

#ifdef HAVE_GETRLIMIT
#ifdef RLIMIT_NOFILE			/* most platforms use RLIMIT_NOFILE */
	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
#else							/* but BSD doesn't ... */
	getrlimit_status = getrlimit(RLIMIT_OFILE, &rlim);
#endif   /* RLIMIT_NOFILE */
	if (getrlimit_status != 0)
		ereport(WARNING, (errmsg("getrlimit failed: %m")));
#endif   /* HAVE_GETRLIMIT */

	/* dup until failure or probe limit reached */
	for (;;)
	{
		int thisfd;

#ifdef HAVE_GETRLIMIT

		/*
		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
		 * some platforms
		 */
		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
			break;
#endif

		thisfd = dup(0);
		if (thisfd < 0)
		{
			/* Expect EMFILE or ENFILE, else it's fishy */
			if (errno != EMFILE && errno != ENFILE)
				elog(WARNING, "dup(0) failed after %d successes: %m", used);
			break;
		}

		if (used >= size)
		{
			size *= 2;
			fd = (int *) repalloc(fd, size * sizeof(int));
		}
		fd[used++] = thisfd;

		if (highestfd < thisfd)
			highestfd = thisfd;

		if (used >= max_to_probe)
			break;
	}

	/* release the files we opened */
	for (j = 0; j < used; j++)
		close(fd[j]);

	pfree(fd);

	/*
	 * Return results.  usable_fds is just the number of successful dups.  We
	 * assume that the system limit is highestfd+1 (remember 0 is a legal FD
	 * number) and so already_open is highestfd+1 - usable_fds.
	 */
	*usable_fds = used;
	*already_open = highestfd + 1 - used;
}
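
/*
 * Worked example (illustrative numbers, not measured).  Assume FDs 0, 1, 2
 * and 5 are already open, the soft RLIMIT_NOFILE limit is 16, and
 * max_to_probe is large.  Because dup() returns the lowest free descriptor,
 * it hands out 3, 4, 6, 7, ..., 15.  The loop stops once highestfd reaches
 * 15, giving
 *
 *	usable_fds   = 12				(3, 4 and 6..15)
 *	already_open = 15 + 1 - 12 = 4	(0, 1, 2 and 5)
 *
 * With max_to_probe = 2 instead, only 3 and 4 are obtained.  That gives
 * usable_fds = 2 and already_open = 4 + 1 - 2 = 3: FD 5 is missed, which is
 * the underestimate that the comment above warns about.
 */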

/*
 * set_max_safe_fds
 *		Determine number of file descriptors that fd.c is allowed to use
 */
void
set_max_safe_fds(void)
{
	int usable_fds;
	int already_open;

	/*----------
	 * We want to set max_safe_fds to
	 *			MIN(usable_fds, max_files_per_process - already_open)
	 * less the slop factor for files that are opened without consulting
	 * fd.c.  This ensures that we won't exceed either max_files_per_process
	 * or the experimentally-determined EMFILE limit.
	 *----------
	 */
	count_usable_fds(max_files_per_process,
					 &usable_fds, &already_open);

	max_safe_fds = Min(usable_fds, max_files_per_process - already_open);

	/*
	 * Take off the FDs reserved for system() etc.
	 */
	max_safe_fds -= NUM_RESERVED_FDS;

	/*
	 * Make sure we still have enough to get by.
	 */
	if (max_safe_fds < FD_MINFREE)
		ereport(FATAL,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
				 errmsg("insufficient file descriptors available to start server process"),
				 errdetail("System allows %d, we need at least %d.",
						   max_safe_fds + NUM_RESERVED_FDS,
						   FD_MINFREE + NUM_RESERVED_FDS)));

	elog(DEBUG2, "max_safe_fds = %d, usable_fds = %d, already_open = %d",
		 max_safe_fds, usable_fds, already_open);
}
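
/*
 * Worked example (illustrative numbers only).  NUM_RESERVED_FDS and
 * FD_MINFREE are defined earlier in this file; their values are assumed
 * here to be 10 each.  Suppose max_files_per_process = 1000, FDs 0-3 are
 * already open, and the kernel limit is well above 1004.  Then
 * count_usable_fds() stops after 1000 dups (FDs 4..1003), so
 *
 *	usable_fds   = 1000
 *	already_open = 1003 + 1 - 1000 = 4
 *	max_safe_fds = Min(1000, 1000 - 4) - 10 = 986
 *
 * If instead the kernel limit were 64, the dups would stop at FD 63.  That
 * gives usable_fds = 60, already_open = 4 and
 * max_safe_fds = Min(60, 996) - 10 = 50, which still passes the FD_MINFREE
 * check.
 */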

/*
 * BasicOpenFile --- same as open(2) except can free other FDs if needed
 *
 * This is exported for use by places that really want a plain kernel FD,
 * but need to be proof against running out of FDs.  Once an FD has been
 * successfully returned, it is the caller's responsibility to ensure that
 * it will not be leaked on ereport()!  Most users should *not* call this
 * routine directly, but instead use the VFD abstraction level, which
 * provides protection against descriptor leaks as well as management of
 * files that need to be open for more than a short period of time.
 *
 * Ideally this should be the *only* direct call of open() in the backend.
 * In practice, the postmaster calls open() directly, and there are some
 * direct open() calls done early in backend startup.  Those are OK since
 * this module wouldn't have any open files to close at that point anyway.
 */
int
BasicOpenFile(FileName fileName, int fileFlags, int fileMode)
{
	int			fd;

tryAgain:
	fd = open(fileName, fileFlags, fileMode);

	if (fd >= 0)
		return fd;				/* success! */

	if (errno == EMFILE || errno == ENFILE)
	{
		int			save_errno = errno;

		ereport(LOG,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
				 errmsg("out of file descriptors: %m; release and retry")));
		errno = 0;
		if (ReleaseLruFile())
			goto tryAgain;
		errno = save_errno;
	}

	return -1;					/* failure */
}
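To see the retry-after-release loop of BasicOpenFile() without exhausting real descriptors, the sketch below runs the same control flow against a toy descriptor table. Everything here (`struct toy_fds`, `toy_open()`, `toy_release_lru()`, `toy_basic_open()`) is hypothetical; `toy_release_lru()` plays the role of ReleaseLruFile(), and the `goto` is written as a loop.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy model of a process's descriptor table. */
struct toy_fds
{
	int			in_use;			/* descriptors currently open */
	int			limit;			/* simulated per-process limit (EMFILE) */
	int			closable;		/* open descriptors an LRU pool could close */
};

/* Stand-in for open(2): fails with EMFILE once the table is full. */
static int
toy_open(struct toy_fds *t)
{
	if (t->in_use >= t->limit)
	{
		errno = EMFILE;
		return -1;
	}
	return t->in_use++;
}

/* Stand-in for ReleaseLruFile(): close one pooled descriptor, if any. */
static bool
toy_release_lru(struct toy_fds *t)
{
	if (t->closable == 0)
		return false;
	t->closable--;
	t->in_use--;
	return true;
}

/* Same control flow as BasicOpenFile(). */
static int
toy_basic_open(struct toy_fds *t)
{
	assert(t->limit > 0);

	for (;;)
	{
		int			fd = toy_open(t);
		int			save_errno;

		if (fd >= 0)
			return fd;			/* success */
		if (errno != EMFILE && errno != ENFILE)
			return -1;			/* not a descriptor shortage: give up */

		save_errno = errno;
		if (!toy_release_lru(t))
		{
			errno = save_errno; /* nothing left to free */
			return -1;
		}
		/* freed one descriptor: retry the open */
	}
}
```

Only EMFILE and ENFILE trigger a retry; any other errno, or running out of closable entries, returns -1 with the original errno preserved.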

#if defined(FDDEBUG)

static void
_dump_lru(void)
{
	int			mru = VfdCache[0].lruLessRecently;
	Vfd		   *vfdP = &VfdCache[mru];
	char		buf[2048];

	snprintf(buf, sizeof(buf), "LRU: MOST %d ", mru);
	while (mru != 0)
	{
		mru = vfdP->lruLessRecently;
		vfdP = &VfdCache[mru];
		snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), "%d ", mru);
	}
	snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), "LEAST");
	elog(LOG, "%s", buf);
}
#endif   /* FDDEBUG */

static void
Delete(File file)
{
	Vfd		   *vfdP;

	Assert(file != 0);

	DO_DB(elog(LOG, "Delete %d (%s)",
			   file, VfdCache[file].fileName));
	DO_DB(_dump_lru());

	vfdP = &VfdCache[file];

	VfdCache[vfdP->lruLessRecently].lruMoreRecently = vfdP->lruMoreRecently;
	VfdCache[vfdP->lruMoreRecently].lruLessRecently = vfdP->lruLessRecently;

	DO_DB(_dump_lru());
}

static void
LruDelete(File file)
{
	Vfd		   *vfdP;

	Assert(file != 0);

	DO_DB(elog(LOG, "LruDelete %d (%s)",
			   file, VfdCache[file].fileName));

	vfdP = &VfdCache[file];

	/* delete the vfd record from the LRU ring */
	Delete(file);

	/* save the seek position */
	vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
	Assert(vfdP->seekPos != (off_t) -1);

	/* close the file */
	if (close(vfdP->fd))
		elog(ERROR, "could not close file \"%s\": %m", vfdP->fileName);

	--nfile;
	vfdP->fd = VFD_CLOSED;
}

static void
Insert(File file)
{
	Vfd		   *vfdP;

	Assert(file != 0);

	DO_DB(elog(LOG, "Insert %d (%s)",
			   file, VfdCache[file].fileName));
	DO_DB(_dump_lru());

	vfdP = &VfdCache[file];

	vfdP->lruMoreRecently = 0;
	vfdP->lruLessRecently = VfdCache[0].lruLessRecently;
	VfdCache[0].lruLessRecently = file;
	VfdCache[vfdP->lruLessRecently].lruMoreRecently = file;

	DO_DB(_dump_lru());
}
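Delete() and Insert() maintain a circular doubly linked list threaded through the VfdCache array, with entry 0 as a sentinel: `VfdCache[0].lruLessRecently` is the most recently used entry and `VfdCache[0].lruMoreRecently` the least recently used one (the victim ReleaseLruFile() picks). The sketch below reproduces only that link manipulation on a small fixed array; `RingEntry`, `ring_*()` and `RING_SIZE` are illustrative assumptions, not fd.c names.

```c
#include <assert.h>

#define RING_SIZE 8				/* hypothetical fixed capacity */

/* Same link fields as Vfd; index 0 is the sentinel, as with VfdCache[0]. */
typedef struct
{
	int			lruMoreRecently;
	int			lruLessRecently;
} RingEntry;

static RingEntry ring[RING_SIZE];

/* An empty ring: the sentinel points at itself in both directions. */
static void
ring_init(void)
{
	ring[0].lruMoreRecently = 0;
	ring[0].lruLessRecently = 0;
}

/* Unlink entry i from wherever it sits, mirroring Delete(). */
static void
ring_delete(int i)
{
	assert(i > 0 && i < RING_SIZE);
	ring[ring[i].lruLessRecently].lruMoreRecently = ring[i].lruMoreRecently;
	ring[ring[i].lruMoreRecently].lruLessRecently = ring[i].lruLessRecently;
}

/* Link entry i in as the most recently used, mirroring Insert(). */
static void
ring_insert(int i)
{
	assert(i > 0 && i < RING_SIZE);
	ring[i].lruMoreRecently = 0;
	ring[i].lruLessRecently = ring[0].lruLessRecently;
	ring[0].lruLessRecently = i;
	ring[ring[i].lruLessRecently].lruMoreRecently = i;
}

/* Most recently used entry (0 if the ring is empty). */
static int
ring_mru(void)
{
	return ring[0].lruLessRecently;
}

/* Least recently used entry, i.e. the next eviction victim. */
static int
ring_lru(void)
{
	return ring[0].lruMoreRecently;
}
```

Moving an entry to the front is a delete followed by an insert, which is what FileAccess() does for an already-open file that is not already at the head.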

/* returns 0 on success, -1 on re-open failure (with errno set) */
static int
LruInsert(File file)
{
	Vfd		   *vfdP;

	Assert(file != 0);

	DO_DB(elog(LOG, "LruInsert %d (%s)",
			   file, VfdCache[file].fileName));

	vfdP = &VfdCache[file];

	if (FileIsNotOpen(file))
	{
		/* Close excess kernel FDs. */
		ReleaseLruFiles();

		/*
		 * The open could still fail for lack of file descriptors, eg due to
		 * overall system file table being full.  So, be prepared to release
		 * another FD if necessary...
		 */
		vfdP->fd = BasicOpenFile(vfdP->fileName, vfdP->fileFlags,
								 vfdP->fileMode);
		if (vfdP->fd < 0)
		{
			DO_DB(elog(LOG, "RE_OPEN FAILED: %d", errno));
			return -1;
		}
		else
		{
			DO_DB(elog(LOG, "RE_OPEN SUCCESS"));
			++nfile;
		}

		/* seek to the right position */
		if (vfdP->seekPos != (off_t) 0)
		{
			off_t		returnValue PG_USED_FOR_ASSERTS_ONLY;

			returnValue = lseek(vfdP->fd, vfdP->seekPos, SEEK_SET);
			Assert(returnValue != (off_t) -1);
		}
	}

	/*
	 * put it at the head of the Lru ring
	 */

	Insert(file);

	return 0;
}
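LruDelete() and LruInsert() let a VFD give up its kernel descriptor and later get it back at the same file offset. A minimal POSIX sketch of that round trip follows; `MiniVfd`, `mini_park()` and `mini_unpark()` are hypothetical names, and unlike LruInsert(), which only asserts on a failed seek, this version reports it and closes the descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical cut-down VFD: just enough state to reopen a file. */
typedef struct
{
	int			fd;				/* -1 while parked (cf. VFD_CLOSED) */
	const char *path;
	int			flags;			/* flags safe for re-opening */
	off_t		seekPos;		/* offset saved when parked */
} MiniVfd;

/* Like LruDelete(): remember the offset, then give the kernel FD back. */
static int
mini_park(MiniVfd *v)
{
	assert(v->fd >= 0);
	v->seekPos = lseek(v->fd, (off_t) 0, SEEK_CUR);
	if (v->seekPos == (off_t) -1)
		return -1;
	if (close(v->fd) != 0)
		return -1;
	v->fd = -1;
	return 0;
}

/* Like LruInsert(): reopen with the saved flags and restore the offset. */
static int
mini_unpark(MiniVfd *v)
{
	assert(v->fd == -1);
	v->fd = open(v->path, v->flags);
	if (v->fd < 0)
		return -1;
	if (v->seekPos != 0 && lseek(v->fd, v->seekPos, SEEK_SET) == (off_t) -1)
	{
		close(v->fd);			/* don't leak the descriptor on failure */
		v->fd = -1;
		return -1;
	}
	return 0;
}
```

After parking a descriptor positioned at offset 3 and unparking it, the next read(2) returns the fourth byte of the file, just as if the descriptor had never been closed.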

/*
 * Release one kernel FD by closing the least-recently-used VFD.
 */
static bool
ReleaseLruFile(void)
{
	DO_DB(elog(LOG, "ReleaseLruFile. Opened %d", nfile));

	if (nfile > 0)
	{
		/*
		 * There are opened files and so there should be at least one used vfd
		 * in the ring.
		 */
		Assert(VfdCache[0].lruMoreRecently != 0);
		LruDelete(VfdCache[0].lruMoreRecently);
		return true;			/* freed a file */
	}
	return false;				/* no files available to free */
}

/*
 * Release kernel FDs as needed to get under the max_safe_fds limit.
 * After calling this, it's OK to try to open another file.
 */
static void
ReleaseLruFiles(void)
{
	while (nfile + numAllocatedDescs >= max_safe_fds)
	{
		if (!ReleaseLruFile())
			break;
	}
}
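ReleaseLruFiles() counts both VFD-managed descriptors (nfile) and AllocateFile()/AllocateDir() descriptors (numAllocatedDescs) against max_safe_fds, but it can only close the former. The simulation below, using hypothetical `sk_` counters in place of those globals, shows the consequence: the loop stops early once nothing closable is left, even if the total is still over budget.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters mirroring nfile, numAllocatedDescs, max_safe_fds. */
static int	sk_nfile;
static int	sk_num_allocated;
static int	sk_max_safe_fds;

/* Stand-in for ReleaseLruFile(): close one VFD-managed FD if any is open. */
static bool
sk_release_lru_file(void)
{
	if (sk_nfile > 0)
	{
		sk_nfile--;
		return true;
	}
	return false;
}

/* Same loop as ReleaseLruFiles(): leave room for one more open. */
static void
sk_release_lru_files(void)
{
	assert(sk_max_safe_fds > 0);
	while (sk_nfile + sk_num_allocated >= sk_max_safe_fds)
	{
		if (!sk_release_lru_file())
			break;				/* only allocated descs remain; can't help */
	}
}
```

With a budget of 10, six VFD files and four allocated descriptors, one VFD file is closed; with twelve allocated descriptors, every VFD file is closed and the loop still exits over budget.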

static File
AllocateVfd(void)
{
	Index		i;
	File		file;

	DO_DB(elog(LOG, "AllocateVfd. Size %zu", SizeVfdCache));

	Assert(SizeVfdCache > 0);	/* InitFileAccess not called? */

	if (VfdCache[0].nextFree == 0)
	{
		/*
		 * The free list is empty so it is time to increase the size of the
		 * array.  We choose to double it each time this happens. However,
		 * there's not much point in starting *real* small.
		 */
		Size		newCacheSize = SizeVfdCache * 2;
		Vfd		   *newVfdCache;

		if (newCacheSize < 32)
			newCacheSize = 32;

		/*
		 * Be careful not to clobber VfdCache ptr if realloc fails.
		 */
		newVfdCache = (Vfd *) realloc(VfdCache, sizeof(Vfd) * newCacheSize);
		if (newVfdCache == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
		VfdCache = newVfdCache;

		/*
		 * Initialize the new entries and link them into the free list.
		 */
		for (i = SizeVfdCache; i < newCacheSize; i++)
		{
			MemSet((char *) &(VfdCache[i]), 0, sizeof(Vfd));
			VfdCache[i].nextFree = i + 1;
			VfdCache[i].fd = VFD_CLOSED;
		}
		VfdCache[newCacheSize - 1].nextFree = 0;
		VfdCache[0].nextFree = SizeVfdCache;

		/*
		 * Record the new size
		 */
		SizeVfdCache = newCacheSize;
	}

	file = VfdCache[0].nextFree;

	VfdCache[0].nextFree = VfdCache[file].nextFree;

	return file;
}

static void
FreeVfd(File file)
{
	Vfd		   *vfdP = &VfdCache[file];

	DO_DB(elog(LOG, "FreeVfd: %d (%s)",
			   file, vfdP->fileName ? vfdP->fileName : ""));

	if (vfdP->fileName != NULL)
	{
		free(vfdP->fileName);
		vfdP->fileName = NULL;
	}
	vfdP->fdstate = 0x0;

	vfdP->nextFree = VfdCache[0].nextFree;
	VfdCache[0].nextFree = file;
}
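AllocateVfd() and FreeVfd() treat VfdCache as a growable array whose unused entries form a singly linked free list rooted at entry 0. The sketch below isolates that scheme: allocation pops the head, freeing pushes onto the head (so the most recently freed slot is reused first), and an empty list doubles the array with a floor of 32 entries. `Slot`, `slots_init()`, `slot_alloc()` and `slot_free()` are hypothetical; returning 0 on realloc failure replaces the ereport(ERROR) of the real code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical slot type; slot 0 is the free-list head, as in VfdCache[0]. */
typedef struct
{
	size_t		nextFree;		/* 0 terminates the free list */
	int			inUse;
} Slot;

static Slot *slots;
static size_t nslots;

/* Start with just the sentinel, like InitFileAccess() does. */
static void
slots_init(void)
{
	slots = calloc(1, sizeof(Slot));
	assert(slots != NULL);
	nslots = 1;
}

/* Pop a free slot, doubling the array (minimum 32) if the list is empty. */
static size_t
slot_alloc(void)
{
	size_t		i;
	size_t		file;

	if (slots[0].nextFree == 0)
	{
		size_t		newSize = nslots * 2;
		Slot	   *grown;

		if (newSize < 32)
			newSize = 32;
		grown = realloc(slots, sizeof(Slot) * newSize);
		if (grown == NULL)
			return 0;			/* old array is still valid */
		slots = grown;
		for (i = nslots; i < newSize; i++)
		{
			memset(&slots[i], 0, sizeof(Slot));
			slots[i].nextFree = i + 1;
		}
		slots[newSize - 1].nextFree = 0;
		slots[0].nextFree = nslots; /* first new slot heads the list */
		nslots = newSize;
	}
	file = slots[0].nextFree;
	slots[0].nextFree = slots[file].nextFree;
	slots[file].inUse = 1;
	return file;
}

/* Push a slot back onto the head of the free list, as FreeVfd() does. */
static void
slot_free(size_t file)
{
	assert(file != 0 && file < nslots);
	slots[file].inUse = 0;
	slots[file].nextFree = slots[0].nextFree;
	slots[0].nextFree = file;
}
```

Starting from a single sentinel entry, the first allocation grows the array to 32 slots and returns slot 1; with no frees, the 32nd allocation triggers the next doubling to 64.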

/* returns 0 on success, -1 on re-open failure (with errno set) */
static int
FileAccess(File file)
{
	int			returnValue;

	DO_DB(elog(LOG, "FileAccess %d (%s)",
			   file, VfdCache[file].fileName));

	/*
	 * Is the file open?  If not, open it and put it at the head of the LRU
	 * ring (possibly closing the least recently used file to get an FD).
	 */

	if (FileIsNotOpen(file))
	{
		returnValue = LruInsert(file);
		if (returnValue != 0)
			return returnValue;
	}
	else if (VfdCache[0].lruLessRecently != file)
	{
		/*
		 * We now know that the file is open and that it is not the last one
		 * accessed, so we need to move it to the head of the Lru ring.
		 */

		Delete(file);
		Insert(file);
	}

	return 0;
}
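The net effect of FileAccess() is that callers keep a stable File handle while kernel descriptors are opened and closed behind their backs. The toy model below makes that visible by counting kernel opens for a sequence of accesses under a tiny descriptor budget. It is a simplification under stated assumptions: it finds the LRU victim with a linear scan over timestamps rather than the constant-time ring used in fd.c, and the hypothetical `sim_capacity` stands in for max_safe_fds.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_FILES 8				/* hypothetical number of VFDs */

/* Toy VFD: the handle stays valid whether or not a kernel FD is held. */
typedef struct
{
	bool		isOpen;
	long		lastUse;		/* larger = more recently used */
} SimVfd;

static SimVfd sim[SIM_FILES];
static int	sim_open_count;
static int	sim_capacity;		/* stand-in for max_safe_fds */
static long sim_clock;
static int	sim_kernel_opens;	/* how many times we had to (re)open */

/* Close the least recently used open file (linear scan, not a ring). */
static void
sim_close_lru(void)
{
	int			victim = -1;

	for (int i = 0; i < SIM_FILES; i++)
		if (sim[i].isOpen &&
			(victim < 0 || sim[i].lastUse < sim[victim].lastUse))
			victim = i;
	assert(victim >= 0);
	sim[victim].isOpen = false;
	sim_open_count--;
}

/* Like FileAccess(): reopen lazily if needed, then mark as most recent. */
static void
sim_file_access(int file)
{
	assert(file >= 0 && file < SIM_FILES);
	assert(sim_capacity > 0);
	if (!sim[file].isOpen)
	{
		while (sim_open_count >= sim_capacity)
			sim_close_lru();
		sim[file].isOpen = true;
		sim_open_count++;
		sim_kernel_opens++;
	}
	sim[file].lastUse = ++sim_clock;
}
```

With `sim_capacity = 2`, the access sequence 0, 1, 0, 2, 0 costs three kernel opens: accessing file 2 evicts file 1, the least recently used, while file 0 stays open because each access refreshes it.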

/*
 * Called when we get a shared invalidation message on some relation.
 */
#ifdef NOT_USED
void
FileInvalidate(File file)
{
	Assert(FileIsValid(file));
	if (!FileIsNotOpen(file))
		LruDelete(file);
}
#endif
|
1996-07-09 08:22:35 +02:00
|
|
|
|
2005-07-04 06:51:52 +02:00
|
|
|
/*
|
|
|
|
* open a file in an arbitrary directory
|
|
|
|
*
|
|
|
|
* NB: if the passed pathname is relative (which it usually is),
|
|
|
|
* it will be interpreted relative to the process' working directory
|
|
|
|
* (which should always be $PGDATA when this code is running).
|
|
|
|
*/
|
|
|
|
File
|
|
|
|
PathNameOpenFile(FileName fileName, int fileFlags, int fileMode)
|
1996-07-09 08:22:35 +02:00
|
|
|
{
|
2003-07-25 00:04:15 +02:00
|
|
|
char *fnamecopy;
|
1997-09-08 04:41:22 +02:00
|
|
|
File file;
|
|
|
|
Vfd *vfdP;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2005-07-04 06:51:52 +02:00
|
|
|
DO_DB(elog(LOG, "PathNameOpenFile: %s %x %o",
|
1997-09-07 07:04:48 +02:00
|
|
|
fileName, fileFlags, fileMode));
|
|
|
|
|
2003-07-25 00:04:15 +02:00
|
|
|
/*
|
|
|
|
* We need a malloc'd copy of the file name; fail cleanly if no room.
|
|
|
|
*/
|
|
|
|
fnamecopy = strdup(fileName);
|
|
|
|
if (fnamecopy == NULL)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_OUT_OF_MEMORY),
|
|
|
|
errmsg("out of memory")));
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
file = AllocateVfd();
|
|
|
|
vfdP = &VfdCache[file];
|
|
|
|
|
Remove fixed limit on the number of concurrent AllocateFile() requests.
AllocateFile(), AllocateDir(), and some sister routines share a small array
for remembering requests, so that the files can be closed on transaction
failure. Previously that array had a fixed size, MAX_ALLOCATED_DESCS (32).
While historically that had seemed sufficient, Steve Toutant pointed out
that this meant you couldn't scan more than 32 file_fdw foreign tables in
one query, because file_fdw depends on the COPY code which uses
AllocateFile(). There are probably other cases, or will be in the future,
where this nonconfigurable limit impedes users.
We can't completely remove any such limit, at least not without a lot of
work, since each such request requires a kernel file descriptor and most
platforms limit the number we can have. (In principle we could
"virtualize" these descriptors, as fd.c already does for the main VFD pool,
but not without an additional layer of overhead and a lot of notational
impact on the calling code.) But we can at least let the array size be
configurable. Hence, change the code to allow up to max_safe_fds/2
allocated file requests. On modern platforms this should allow several
hundred concurrent file_fdw scans, or more if one increases the value of
max_files_per_process. To go much further than that, we'd need to do some
more work on the data structure, since the current code for closing
requests has potentially O(N^2) runtime; but it should still be all right
for request counts in this range.
Back-patch to 9.1 where contrib/file_fdw was introduced.
2013-06-09 19:46:54 +02:00
|
|
|
/* Close excess kernel FDs. */
|
|
|
|
ReleaseLruFiles();
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2000-06-02 05:58:34 +02:00
|
|
|
vfdP->fd = BasicOpenFile(fileName, fileFlags, fileMode);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
|
|
|
if (vfdP->fd < 0)
|
|
|
|
{
|
2013-05-16 21:04:31 +02:00
|
|
|
int save_errno = errno;
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
FreeVfd(file);
|
2003-07-25 00:04:15 +02:00
|
|
|
free(fnamecopy);
|
2013-05-16 21:04:31 +02:00
|
|
|
errno = save_errno;
|
1997-09-07 07:04:48 +02:00
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
++nfile;
|
2005-07-04 06:51:52 +02:00
|
|
|
DO_DB(elog(LOG, "PathNameOpenFile: success %d",
|
1997-09-07 07:04:48 +02:00
|
|
|
vfdP->fd));
|
|
|
|
|
|
|
|
Insert(file);
|
|
|
|
|
2003-07-25 00:04:15 +02:00
|
|
|
vfdP->fileName = fnamecopy;
|
2001-04-03 04:31:52 +02:00
|
|
|
/* Saved flags are adjusted to be OK for re-opening file */
|
|
|
|
vfdP->fileFlags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);
|
1997-09-07 07:04:48 +02:00
|
|
|
vfdP->fileMode = fileMode;
|
|
|
|
vfdP->seekPos = 0;
|
2011-07-17 20:19:31 +02:00
|
|
|
vfdP->fileSize = 0;
|
2002-08-06 04:36:35 +02:00
|
|
|
vfdP->fdstate = 0x0;
|
2009-12-03 12:03:29 +01:00
|
|
|
vfdP->resowner = NULL;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
|
|
|
return file;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
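PathNameOpenFile saves the open flags with O_CREAT, O_TRUNC and O_EXCL masked out, because the LRU pool may later close the kernel descriptor and silently reopen it. A reopen must neither truncate data already written nor fail because the file now exists. On failure it also saves errno across its cleanup (FreeVfd, free) so the caller sees the error from the open itself. The sketch below illustrates both points in standalone POSIX C. The helper names `reopen_flags` and `open_copy_name` are hypothetical and are not fd.c functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

/* Flags for a transparent reopen: drop the one-shot creation flags. */
static int
reopen_flags(int fileFlags)
{
	int			flags = fileFlags & ~(O_CREAT | O_TRUNC | O_EXCL);

	/* none of the creation-time flags may survive */
	assert((flags & (O_CREAT | O_TRUNC | O_EXCL)) == 0);
	return flags;
}

/*
 * Open 'path' and hand back a malloc'd copy of its name.  On failure,
 * release everything, but make sure the caller still sees open()'s errno.
 */
static int
open_copy_name(const char *path, int flags, char **name_out)
{
	char	   *copy = strdup(path);
	int			fd;

	*name_out = NULL;
	if (copy == NULL)
		return -1;				/* strdup has set errno (ENOMEM) */

	fd = open(path, flags, 0600);
	if (fd < 0)
	{
		int			save_errno = errno;

		free(copy);				/* cleanup must not clobber errno */
		errno = save_errno;
		return -1;
	}
	*name_out = copy;
	return fd;
}
```

Saving and restoring errno this way is defensive: older standards did not promise that cleanup routines such as free() leave errno untouched.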
|
|
|
|
|
1999-05-09 02:52:08 +02:00
|
|
|
/*
|
|
|
|
* Open a temporary file that will disappear when we close it.
|
|
|
|
*
|
|
|
|
* This routine takes care of generating an appropriate tempfile name.
|
|
|
|
* There's no need to pass in fileFlags or fileMode either, since only
|
|
|
|
* one setting makes any sense for a temp file.
|
This patch implements holdable cursors, following the proposal
(materialization into a tuple store) discussed on pgsql-hackers earlier.
I've updated the documentation and the regression tests.
Notes on the implementation:
- I needed to change the tuple store API slightly -- it assumes that it
won't be used to hold data across transaction boundaries, so the temp
files that it uses for on-disk storage are automatically reclaimed at
end-of-transaction. I added a flag to tuplestore_begin_heap() to control
this behavior. Is changing the tuple store API in this fashion OK?
- in order to store executor results in a tuple store, I added a new
CommandDest. This works well for the most part, with one exception: the
current DestFunction API doesn't provide enough information to allow the
Executor to store results into an arbitrary tuple store (where the
particular tuple store to use is chosen by the call site of
ExecutorRun). To work around this, I've temporarily hacked up a solution
that works, but is not ideal: since the receiveTuple DestFunction is
passed the portal name, we can use that to look up the Portal data
structure for the cursor and then use that to get at the tuple store the
Portal is using. This unnecessarily ties the Portal code with the
tupleReceiver code, but it works...
The proper fix for this is probably to change the DestFunction API --
Tom suggested passing the full QueryDesc to the receiveTuple function.
In that case, callers of ExecutorRun could "subclass" QueryDesc to add
any additional fields that their particular CommandDest needed to get
access to. This approach would work, but I'd like to think about it for
a little bit longer before deciding which route to go. In the meantime,
the code works fine, so I don't think a fix is urgent.
- (semi-related) I added a NO SCROLL keyword to DECLARE CURSOR, and
adjusted the behavior of SCROLL in accordance with the discussion on
-hackers.
- (unrelated) Cleaned up some SGML markup in sql.sgml, copy.sgml
Neil Conway
2003-03-27 17:51:29 +01:00
|
|
|
*
|
2009-12-03 12:03:29 +01:00
|
|
|
* Unless interXact is true, the file is remembered by CurrentResourceOwner
|
|
|
|
* to ensure it's closed and deleted when it's no longer needed, typically at
|
|
|
|
* the end of the transaction. In most cases, you don't want temporary files to
|
|
|
|
* outlive the transaction that created them, so this should be false -- but
|
|
|
|
* if you need "somewhat" temporary storage, this might be useful. In either
|
|
|
|
* case, the file is removed when the File is explicitly closed.
|
1999-05-09 02:52:08 +02:00
|
|
|
*/
|
|
|
|
File
|
2007-06-07 21:19:57 +02:00
|
|
|
OpenTemporaryFile(bool interXact)
|
1999-05-09 02:52:08 +02:00
|
|
|
{
|
2007-06-03 19:08:34 +02:00
|
|
|
File file = 0;
|
|
|
|
|
|
|
|
/*
|
2007-06-07 21:19:57 +02:00
|
|
|
* If some temp tablespace(s) have been given to us, try to use the next
|
2007-11-15 22:14:46 +01:00
|
|
|
* one. If a given tablespace can't be found, we silently fall back to
|
|
|
|
* the database's default tablespace.
|
2007-06-07 21:19:57 +02:00
|
|
|
*
|
|
|
|
* BUT: if the temp file is slated to outlive the current transaction,
|
2007-11-15 22:14:46 +01:00
|
|
|
* force it into the database's default tablespace, so that it will not
|
|
|
|
* pose a threat to possible tablespace drop attempts.
|
2007-06-03 19:08:34 +02:00
|
|
|
*/
|
2007-06-07 21:19:57 +02:00
|
|
|
if (numTempTableSpaces > 0 && !interXact)
|
|
|
|
{
|
2007-11-15 22:14:46 +01:00
|
|
|
Oid tblspcOid = GetNextTempTableSpace();
|
2007-06-07 21:19:57 +02:00
|
|
|
|
|
|
|
if (OidIsValid(tblspcOid))
|
|
|
|
file = OpenTemporaryFileInTablespace(tblspcOid, false);
|
|
|
|
}
|
2007-06-03 19:08:34 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If not, or if tablespace is bad, create in database's default
|
2014-05-06 18:12:18 +02:00
|
|
|
* tablespace. MyDatabaseTableSpace should normally be set before we get
|
2007-06-03 19:08:34 +02:00
|
|
|
* here, but just in case it isn't, fall back to pg_default tablespace.
|
|
|
|
*/
|
|
|
|
if (file <= 0)
|
|
|
|
file = OpenTemporaryFileInTablespace(MyDatabaseTableSpace ?
|
|
|
|
MyDatabaseTableSpace :
|
|
|
|
DEFAULTTABLESPACE_OID,
|
|
|
|
true);
|
|
|
|
|
|
|
|
/* Mark it for deletion at close */
|
|
|
|
VfdCache[file].fdstate |= FD_TEMPORARY;
|
|
|
|
|
2009-12-03 12:03:29 +01:00
|
|
|
/* Register it with the current resource owner */
|
2007-06-03 19:08:34 +02:00
|
|
|
if (!interXact)
|
|
|
|
{
|
|
|
|
VfdCache[file].fdstate |= FD_XACT_TEMPORARY;
|
2009-12-03 12:03:29 +01:00
|
|
|
|
|
|
|
ResourceOwnerEnlargeFiles(CurrentResourceOwner);
|
|
|
|
ResourceOwnerRememberFile(CurrentResourceOwner, file);
|
|
|
|
VfdCache[file].resowner = CurrentResourceOwner;
|
2008-09-19 06:57:10 +02:00
|
|
|
|
|
|
|
/* ensure cleanup happens at eoxact */
|
2012-10-17 18:37:08 +02:00
|
|
|
have_xact_temporary_files = true;
|
2007-06-03 19:08:34 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
return file;
|
|
|
|
}
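OpenTemporaryFile only consults the configured temp tablespaces when the file will not outlive the transaction; otherwise, or when no usable tablespace is found, it falls back to MyDatabaseTableSpace and finally to pg_default. The sketch below models just that choice as a pure function. It assumes round-robin selection over the configured list starting at the first entry (chosen here for determinism), and it simplifies the fallback: the real code also falls back when opening in the chosen tablespace fails, which this sketch does not model. All names are hypothetical stand-ins; the value 1663 is believed to be pg_default's OID.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;

#define InvalidOid ((Oid) 0)
#define DEFAULTTABLESPACE_OID ((Oid) 1663)	/* pg_default */
#define MAX_TEMP_TABLESPACES 8

static Oid	tempTableSpaces[MAX_TEMP_TABLESPACES];
static int	numTempTableSpaces = 0;
static int	nextTempTableSpace = 0;

static void
set_temp_tablespaces(const Oid *oids, int n)
{
	assert(n >= 0 && n <= MAX_TEMP_TABLESPACES);
	for (int i = 0; i < n; i++)
		tempTableSpaces[i] = oids[i];
	numTempTableSpaces = n;
	nextTempTableSpace = n - 1;	/* so the first pick is oids[0] */
}

/* Round-robin over the configured list */
static Oid
next_temp_tablespace(void)
{
	assert(numTempTableSpaces > 0);
	if (++nextTempTableSpace >= numTempTableSpaces)
		nextTempTableSpace = 0;
	return tempTableSpaces[nextTempTableSpace];
}

static Oid
choose_temp_tablespace(bool interXact, Oid myDatabaseTableSpace)
{
	/* long-lived files never go into temp tablespaces */
	if (numTempTableSpaces > 0 && !interXact)
	{
		Oid			tblspc = next_temp_tablespace();

		if (tblspc != InvalidOid)
			return tblspc;
	}
	return myDatabaseTableSpace != InvalidOid ?
		myDatabaseTableSpace : DEFAULTTABLESPACE_OID;
}
```

Keeping interXact files in the database's own tablespace means a DROP TABLESPACE on a temp tablespace is never blocked by a file that survives the transaction.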
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Open a temporary file in a specific tablespace.
|
|
|
|
* Subroutine for OpenTemporaryFile, which see for details.
|
|
|
|
*/
|
|
|
|
static File
|
|
|
|
OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
|
|
|
|
{
|
|
|
|
char tempdirpath[MAXPGPATH];
|
2003-04-29 05:21:30 +02:00
|
|
|
char tempfilepath[MAXPGPATH];
|
1999-05-25 18:15:34 +02:00
|
|
|
File file;
|
1999-05-09 02:52:08 +02:00
|
|
|
|
2007-06-03 19:08:34 +02:00
|
|
|
/*
|
|
|
|
* Identify the tempfile directory for this tablespace.
|
|
|
|
*
|
|
|
|
* If someone tries to specify pg_global, use pg_default instead.
|
|
|
|
*/
|
|
|
|
if (tblspcOid == DEFAULTTABLESPACE_OID ||
|
|
|
|
tblspcOid == GLOBALTABLESPACE_OID)
|
|
|
|
{
|
|
|
|
/* The default tablespace is {datadir}/base */
|
|
|
|
snprintf(tempdirpath, sizeof(tempdirpath), "base/%s",
|
|
|
|
PG_TEMP_FILES_DIR);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
/* All other tablespaces are accessed via symlinks */
|
2010-01-12 03:42:52 +01:00
|
|
|
snprintf(tempdirpath, sizeof(tempdirpath), "pg_tblspc/%u/%s/%s",
|
|
|
|
tblspcOid, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
|
2007-06-03 19:08:34 +02:00
|
|
|
}
|
|
|
|
|
1999-05-25 18:15:34 +02:00
|
|
|
/*
|
2007-03-06 03:06:15 +01:00
|
|
|
* Generate a tempfile name that should be unique within the current
|
|
|
|
* database instance.
|
1999-05-25 18:15:34 +02:00
|
|
|
*/
|
2007-06-03 19:08:34 +02:00
|
|
|
snprintf(tempfilepath, sizeof(tempfilepath), "%s/%s%d.%ld",
|
|
|
|
tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, tempFileCounter++);
|
1999-05-09 02:52:08 +02:00
|
|
|
|
2001-06-11 06:12:29 +02:00
|
|
|
/*
|
2005-10-15 04:49:52 +02:00
|
|
|
* Open the file. Note: we don't use O_EXCL, in case there is an orphaned
|
|
|
|
* temp file that can be reused.
|
2001-06-11 06:12:29 +02:00
|
|
|
*/
|
2007-06-03 19:08:34 +02:00
|
|
|
file = PathNameOpenFile(tempfilepath,
|
2001-06-11 06:12:29 +02:00
|
|
|
O_RDWR | O_CREAT | O_TRUNC | PG_BINARY,
|
|
|
|
0600);
|
1999-05-09 02:52:08 +02:00
|
|
|
if (file <= 0)
|
2001-06-11 06:12:29 +02:00
|
|
|
{
|
|
|
|
/*
|
2007-11-15 22:14:46 +01:00
|
|
|
* We might need to create the tablespace's tempfile directory, if no
|
|
|
|
* one has yet done so.
|
2001-06-11 06:12:29 +02:00
|
|
|
*
|
2005-11-22 19:17:34 +01:00
|
|
|
* Don't check for error from mkdir; it could fail if someone else
|
|
|
|
* just did the same thing. If it doesn't work then we'll bomb out on
|
|
|
|
* the second create attempt, instead.
|
2001-06-11 06:12:29 +02:00
|
|
|
*/
|
2007-06-03 19:08:34 +02:00
|
|
|
mkdir(tempdirpath, S_IRWXU);
|
2001-06-11 06:12:29 +02:00
|
|
|
|
2007-06-03 19:08:34 +02:00
|
|
|
file = PathNameOpenFile(tempfilepath,
|
2001-06-11 06:12:29 +02:00
|
|
|
O_RDWR | O_CREAT | O_TRUNC | PG_BINARY,
|
|
|
|
0600);
|
2007-06-03 19:08:34 +02:00
|
|
|
if (file <= 0 && rejectError)
|
2003-07-25 00:04:15 +02:00
|
|
|
elog(ERROR, "could not create temporary file \"%s\": %m",
|
2003-04-29 05:21:30 +02:00
|
|
|
tempfilepath);
|
2007-03-06 03:06:15 +01:00
|
|
|
}
|
1999-05-09 02:52:08 +02:00
|
|
|
|
|
|
|
return file;
|
|
|
|
}
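OpenTemporaryFileInTablespace builds the directory from the tablespace OID (pg_default and pg_global both map to `base/`) and makes the name unique from the backend PID plus a per-backend counter. The sketch below reproduces that path construction with truncation checks. It assumes PG_TEMP_FILES_DIR and PG_TEMP_FILE_PREFIX are both "pgsql_tmp" and OIDs 1663/1664 for pg_default/pg_global; TABLESPACE_VERSION_DIRECTORY is a made-up placeholder, and `build_temp_file_path` is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXPGPATH 1024
#define PG_TEMP_FILES_DIR "pgsql_tmp"		/* assumed value */
#define PG_TEMP_FILE_PREFIX "pgsql_tmp"		/* assumed value */
#define TABLESPACE_VERSION_DIRECTORY "PG_X_YYYYMMDDN"	/* placeholder */
#define DEFAULTTABLESPACE_OID 1663u
#define GLOBALTABLESPACE_OID 1664u

/* Returns 0 on success, -1 if either path would be truncated */
static int
build_temp_file_path(char *out, size_t outlen, unsigned int tblspcOid,
					 int pid, long counter)
{
	char		dir[MAXPGPATH];
	int			n;

	if (tblspcOid == DEFAULTTABLESPACE_OID || tblspcOid == GLOBALTABLESPACE_OID)
		n = snprintf(dir, sizeof(dir), "base/%s", PG_TEMP_FILES_DIR);
	else						/* other tablespaces live behind symlinks */
		n = snprintf(dir, sizeof(dir), "pg_tblspc/%u/%s/%s", tblspcOid,
					 TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
	if (n < 0 || (size_t) n >= sizeof(dir))
		return -1;

	/* pid + counter keeps names unique within one server instance */
	n = snprintf(out, outlen, "%s/%s%d.%ld", dir, PG_TEMP_FILE_PREFIX,
				 pid, counter);
	if (n < 0 || (size_t) n >= outlen)
		return -1;
	assert(strlen(out) == (size_t) n);
	return 0;
}
```

The original code does not check snprintf for truncation because MAXPGPATH comfortably fits these names; the checks here just make the sketch safe for arbitrary buffer sizes.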
|
|
|
|
|
|
|
|
/*
|
|
|
|
* close a file when done with it
|
|
|
|
*/
|
1996-07-09 08:22:35 +02:00
|
|
|
void
|
|
|
|
FileClose(File file)
|
|
|
|
{
|
2007-11-15 22:14:46 +01:00
|
|
|
Vfd *vfdP;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
1999-05-09 02:52:08 +02:00
|
|
|
Assert(FileIsValid(file));
|
|
|
|
|
Commit to match discussed elog() changes. Only update is that LOG is
now just below FATAL in server_min_messages. Added more text to
highlight ordering difference between it and client_min_messages.
---------------------------------------------------------------------------
REALLYFATAL => PANIC
STOP => PANIC
New INFO level that prints to client by default
New LOG level that prints to server log by default
Cause VACUUM information to print only to the client
NOTICE => INFO where purely information messages are sent
DEBUG => LOG for purely server status messages
DEBUG removed, kept as backward compatible
DEBUG5, DEBUG4, DEBUG3, DEBUG2, DEBUG1 added
DebugLvl removed in favor of new DEBUG[1-5] symbols
New server_min_messages GUC parameter with values:
DEBUG[5-1], INFO, NOTICE, ERROR, LOG, FATAL, PANIC
New client_min_messages GUC parameter with values:
DEBUG[5-1], LOG, INFO, NOTICE, ERROR, FATAL, PANIC
Server startup now logged with LOG instead of DEBUG
Remove debug_level GUC parameter
elog() numbers now start at 10
Add test to print error message if older elog() values are passed to elog()
Bootstrap mode now has a -d that requires an argument, like postmaster
2002-03-02 22:39:36 +01:00
|
|
|
DO_DB(elog(LOG, "FileClose: %d (%s)",
|
1997-09-07 07:04:48 +02:00
|
|
|
file, VfdCache[file].fileName));
|
|
|
|
|
2002-02-10 23:56:31 +01:00
|
|
|
vfdP = &VfdCache[file];
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
if (!FileIsNotOpen(file))
|
|
|
|
{
|
|
|
|
/* remove the file from the lru ring */
|
|
|
|
Delete(file);
|
|
|
|
|
|
|
|
/* close the file */
|
2002-02-10 23:56:31 +01:00
|
|
|
if (close(vfdP->fd))
|
2007-07-26 17:15:18 +02:00
|
|
|
elog(ERROR, "could not close file \"%s\": %m", vfdP->fileName);
|
1997-09-07 07:04:48 +02:00
|
|
|
|
|
|
|
--nfile;
|
2002-02-10 23:56:31 +01:00
|
|
|
vfdP->fd = VFD_CLOSED;
|
1997-09-07 07:04:48 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
Fix error handling in temp-file deletion with log_temp_files active.
The original coding in FileClose() reset the file-is-temp flag before
unlinking the file, so that if control came back through due to an error,
it wouldn't try to unlink the file twice. This was correct when written,
but when the log_temp_files feature was added, the logging action was put
in between those two steps. An error occurring during the logging action
--- such as a query cancel --- would result in the unlink not getting done
at all, as in recent report from Michael Glaesemann.
To fix this, make sure that we do both the stat and the unlink before doing
anything that could conceivably CHECK_FOR_INTERRUPTS. There is a judgment
call here, which is which log message to emit first: if you can see only
one, which should it be? I chose to log unlink failure at the risk of
losing the log_temp_files log message --- after all, if the unlink does
fail, the temp file is still there for you to see.
Back-patch to all versions that have log_temp_files. The code was OK
before that.
2010-11-09 04:14:48 +01:00
|
|
|
* Delete the file if it was temporary, and make a log entry if wanted
|
1997-09-07 07:04:48 +02:00
|
|
|
*/
|
2002-02-10 23:56:31 +01:00
|
|
|
if (vfdP->fdstate & FD_TEMPORARY)
|
2001-01-12 22:54:01 +01:00
|
|
|
{
|
2012-01-26 14:41:19 +01:00
|
|
|
struct stat filestats;
|
|
|
|
int stat_errno;
|
|
|
|
|
2010-11-09 04:14:48 +01:00
|
|
|
/*
|
|
|
|
* If we get an error, as could happen within the ereport/elog calls,
|
|
|
|
* we'll come right back here during transaction abort. Reset the
|
|
|
|
* flag to ensure that we can't get into an infinite loop. This code
|
2011-04-10 17:42:00 +02:00
|
|
|
* is arranged to ensure that the worst-case consequence is failing to
|
|
|
|
* emit log message(s), not failing to attempt the unlink.
|
2010-11-09 04:14:48 +01:00
|
|
|
*/
|
2002-02-10 23:56:31 +01:00
|
|
|
vfdP->fdstate &= ~FD_TEMPORARY;
|
2010-11-09 04:14:48 +01:00
|
|
|
|
2011-07-17 21:05:44 +02:00
|
|
|
/* Subtract its size from current usage (do first in case of error) */
|
|
|
|
temporary_files_size -= vfdP->fileSize;
|
|
|
|
vfdP->fileSize = 0;
|
|
|
|
|
2012-01-26 14:41:19 +01:00
|
|
|
/* first try the stat() */
|
|
|
|
if (stat(vfdP->fileName, &filestats))
|
|
|
|
stat_errno = errno;
|
|
|
|
else
|
|
|
|
stat_errno = 0;
|
2010-11-09 04:14:48 +01:00
|
|
|
|
2012-01-26 14:41:19 +01:00
|
|
|
/* in any case do the unlink */
|
|
|
|
if (unlink(vfdP->fileName))
|
|
|
|
elog(LOG, "could not unlink file \"%s\": %m", vfdP->fileName);
|
2010-11-09 04:14:48 +01:00
|
|
|
|
2012-01-26 14:41:19 +01:00
|
|
|
/* and last report the stat results */
|
|
|
|
if (stat_errno == 0)
|
|
|
|
{
|
|
|
|
pgstat_report_tempfile(filestats.st_size);
|
2010-11-09 04:14:48 +01:00
|
|
|
|
2012-01-26 14:41:19 +01:00
|
|
|
if (log_temp_files >= 0)
|
2007-01-09 22:31:17 +01:00
|
|
|
{
|
2010-07-07 00:55:26 +02:00
|
|
|
if ((filestats.st_size / 1024) >= log_temp_files)
|
2007-01-09 22:31:17 +01:00
|
|
|
ereport(LOG,
|
2007-12-13 12:55:44 +01:00
|
|
|
(errmsg("temporary file: path \"%s\", size %lu",
|
2007-07-26 17:15:18 +02:00
|
|
|
vfdP->fileName,
|
|
|
|
(unsigned long) filestats.st_size)));
|
2007-01-09 22:31:17 +01:00
|
|
|
}
|
2012-01-28 10:01:17 +01:00
|
|
|
}
|
|
|
|
else
|
|
|
|
{
|
|
|
|
errno = stat_errno;
|
|
|
|
elog(LOG, "could not stat file \"%s\": %m", vfdP->fileName);
|
2010-11-09 04:14:48 +01:00
|
|
|
}
|
2001-01-12 22:54:01 +01:00
|
|
|
}
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2009-12-03 12:03:29 +01:00
|
|
|
/* Unregister it from the resource owner */
|
|
|
|
if (vfdP->resowner)
|
|
|
|
ResourceOwnerForgetFile(vfdP->resowner, file);
|
|
|
|
|
1997-09-07 07:04:48 +02:00
|
|
|
/*
|
1999-05-09 02:52:08 +02:00
|
|
|
* Return the Vfd slot to the free list
|
1997-09-07 07:04:48 +02:00
|
|
|
*/
|
1999-05-09 02:52:08 +02:00
|
|
|
FreeVfd(file);
|
1996-07-09 08:22:35 +02:00
|
|
|
}
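FileClose deletes a temporary file in a deliberate order: stat() first (remembering only its errno), then an unconditional unlink(), and only then the logging, which could raise an error or be interrupted. The worst case is therefore a missing log line, never a leaked file. The sketch below shows that ordering plus the log_temp_files threshold test (kilobytes, with a negative value meaning "off", as in the code above). The helper names are hypothetical, and plain fprintf stands in for elog/ereport.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>				/* mkstemp() in the usage check */
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Mirrors the FileClose test: negative disables, else threshold in kB */
static bool
should_log_temp_file(long long size_bytes, int log_temp_files_kb)
{
	return log_temp_files_kb >= 0 && (size_bytes / 1024) >= log_temp_files_kb;
}

/* Returns unlink()'s result; *size_out is -1 if stat() failed */
static int
delete_temp_file(const char *path, long long *size_out)
{
	struct stat st;
	int			stat_errno;
	int			rc;

	assert(path != NULL && size_out != NULL);

	/* 1. stat() first, but only remember the outcome */
	stat_errno = (stat(path, &st) != 0) ? errno : 0;

	/* 2. unlink() unconditionally, before anything that could be interrupted */
	rc = unlink(path);
	if (rc != 0)
		fprintf(stderr, "could not unlink file \"%s\": %s\n",
				path, strerror(errno));

	/* 3. report the stat() result last */
	if (stat_errno == 0)
		*size_out = (long long) st.st_size;
	else
	{
		fprintf(stderr, "could not stat file \"%s\": %s\n",
				path, strerror(stat_errno));
		*size_out = -1;
	}
	return rc;
}
```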
|
|
|
|
|
2009-01-12 06:10:45 +01:00
|
|
|
/*
|
|
|
|
* FilePrefetch - initiate asynchronous read of a given range of the file.
|
|
|
|
* The logical seek position is unaffected.
|
|
|
|
*
|
|
|
|
* Currently the only implementation of this function uses posix_fadvise,
|
|
|
|
* which is the simplest standardized interface that accomplishes this.
|
|
|
|
* We could add an implementation using libaio in the future; but note that
|
|
|
|
* this API is inappropriate for libaio, which wants to have a buffer provided
|
|
|
|
* to read into.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
FilePrefetch(File file, off_t offset, int amount)
|
|
|
|
{
|
|
|
|
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
|
|
|
|
int returnCode;
|
|
|
|
|
|
|
|
Assert(FileIsValid(file));
|
2009-06-11 16:49:15 +02:00
|
|
|
|
2009-01-12 06:10:45 +01:00
|
|
|
DO_DB(elog(LOG, "FilePrefetch: %d (%s) " INT64_FORMAT " %d",
|
|
|
|
file, VfdCache[file].fileName,
|
|
|
|
(int64) offset, amount));
|
|
|
|
|
|
|
|
returnCode = FileAccess(file);
|
|
|
|
if (returnCode < 0)
|
|
|
|
return returnCode;
|
|
|
|
|
|
|
|
returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
|
|
|
|
POSIX_FADV_WILLNEED);
|
|
|
|
|
|
|
|
return returnCode;
|
|
|
|
#else
|
|
|
|
Assert(FileIsValid(file));
|
|
|
|
return 0;
|
|
|
|
#endif
|
|
|
|
}
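FilePrefetch is only a hint: with POSIX_FADV_WILLNEED available it asks the kernel to start reading the range, and otherwise it does nothing and reports success. One detail worth noting is that posix_fadvise() returns an error number directly rather than setting errno. The sketch below isolates that call on a plain descriptor; `prefetch_range` is a hypothetical name, and the VFD bookkeeping (FileAccess reopening the file first) is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>				/* mkstemp() in the usage check */
#include <sys/types.h>
#include <unistd.h>				/* close(), unlink() in the usage check */

static int
prefetch_range(int fd, off_t offset, off_t amount)
{
	assert(offset >= 0 && amount >= 0);
#if defined(POSIX_FADV_WILLNEED)
	/* returns 0 or an error number; errno is not used */
	return posix_fadvise(fd, offset, amount, POSIX_FADV_WILLNEED);
#else
	(void) fd;
	return 0;					/* no prefetch support: the hint is a no-op */
#endif
}
```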
|
|
|
|
|
1996-07-09 08:22:35 +02:00
|
|
|
int
|
|
|
|
FileRead(File file, char *buffer, int amount)
|
|
|
|
{
|
1997-09-08 04:41:22 +02:00
|
|
|
int returnCode;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
1999-05-09 02:52:08 +02:00
|
|
|
Assert(FileIsValid(file));
|
|
|
|
|
2008-03-10 21:06:27 +01:00
|
|
|
DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %d %p",
|
2001-02-17 02:00:04 +01:00
|
|
|
file, VfdCache[file].fileName,
|
2008-03-10 21:06:27 +01:00
|
|
|
(int64) VfdCache[file].seekPos,
|
|
|
|
amount, buffer));
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2004-05-31 05:48:10 +02:00
|
|
|
returnCode = FileAccess(file);
|
|
|
|
if (returnCode < 0)
|
|
|
|
return returnCode;
|
|
|
|
|
2005-12-01 21:24:18 +01:00
|
|
|
retry:
|
1997-09-07 07:04:48 +02:00
|
|
|
returnCode = read(VfdCache[file].fd, buffer, amount);
|
2005-12-01 21:24:18 +01:00
|
|
|
|
|
|
|
if (returnCode >= 0)
|
1997-09-07 07:04:48 +02:00
|
|
|
VfdCache[file].seekPos += returnCode;
|
2000-06-14 05:19:24 +02:00
|
|
|
else
|
2005-12-01 21:24:18 +01:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Windows may run out of kernel buffers and return "Insufficient
|
|
|
|
* system resources" error. Wait a bit and retry to solve it.
|
|
|
|
*
|
|
|
|
* It is rumored that EINTR is also possible on some Unix filesystems,
|
|
|
|
* in which case immediate retry is indicated.
|
|
|
|
*/
|
|
|
|
#ifdef WIN32
|
2006-10-04 02:30:14 +02:00
|
|
|
DWORD error = GetLastError();
|
2005-12-01 21:24:18 +01:00
|
|
|
|
|
|
|
switch (error)
|
|
|
|
{
|
|
|
|
case ERROR_NO_SYSTEM_RESOURCES:
|
|
|
|
pg_usleep(1000L);
|
|
|
|
errno = EINTR;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
_dosmaperr(error);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
/* OK to retry if interrupted */
|
|
|
|
if (errno == EINTR)
|
|
|
|
goto retry;
|
|
|
|
|
|
|
|
/* Trouble, so assume we don't know the file position anymore */
|
2000-06-14 05:19:24 +02:00
|
|
|
VfdCache[file].seekPos = FileUnknownPos;
|
2005-12-01 21:24:18 +01:00
|
|
|
}
|
1997-09-07 07:04:48 +02:00
|
|
|
|
|
|
|
return returnCode;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
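FileRead keeps a cached seek position in step with the kernel: a successful read advances it, EINTR triggers an immediate retry, and any other failure marks the position unknown so the next seek cannot trust the cache. The sketch below shows that pattern for a plain descriptor, assuming an "unknown" marker of -1 in the spirit of FileUnknownPos. The Windows ERROR_NO_SYSTEM_RESOURCES handling is left out, and, unlike the original, this sketch leaves an unknown position unknown after a successful read.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define UNKNOWN_POS ((off_t) -1)	/* stand-in for FileUnknownPos */

static ssize_t
read_tracking_pos(int fd, void *buf, size_t amount, off_t *pos)
{
	ssize_t		rc;

	assert(pos != NULL);
retry:
	rc = read(fd, buf, amount);
	if (rc >= 0)
	{
		if (*pos != UNKNOWN_POS)
			*pos += rc;			/* stay in step with the kernel */
	}
	else
	{
		if (errno == EINTR)
			goto retry;			/* interrupted: just try again */
		*pos = UNKNOWN_POS;		/* position can no longer be trusted */
	}
	return rc;
}
```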
|
|
|
|
|
|
|
|
int
|
|
|
|
FileWrite(File file, char *buffer, int amount)
|
|
|
|
{
|
1997-09-08 04:41:22 +02:00
|
|
|
int returnCode;
|
1997-09-07 07:04:48 +02:00
|
|
|
|
1999-05-09 02:52:08 +02:00
|
|
|
Assert(FileIsValid(file));
|
|
|
|
|
2008-03-10 21:06:27 +01:00
|
|
|
DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
|
2001-02-17 02:00:04 +01:00
|
|
|
file, VfdCache[file].fileName,
|
2008-03-10 21:06:27 +01:00
|
|
|
(int64) VfdCache[file].seekPos,
|
|
|
|
amount, buffer));
|
1997-09-07 07:04:48 +02:00
|
|
|
|
2004-05-31 05:48:10 +02:00
|
|
|
returnCode = FileAccess(file);
|
|
|
|
if (returnCode < 0)
|
|
|
|
return returnCode;
|
2001-06-06 19:07:46 +02:00
|
|
|
|
2011-07-17 20:19:31 +02:00
|
|
|
/*
|
|
|
|
* If enforcing temp_file_limit and it's a temp file, check to see if the
|
2014-05-06 18:12:18 +02:00
|
|
|
* write would overrun temp_file_limit, and throw error if so. Note: it's
|
2011-07-17 20:19:31 +02:00
|
|
|
* really a modularity violation to throw error here; we should set errno
|
|
|
|
* and return -1. However, there's no way to report a suitable error
|
|
|
|
* message if we do that. All current callers would just throw error
|
|
|
|
* immediately anyway, so this is safe at present.
|
|
|
|
*/
|
|
|
|
if (temp_file_limit >= 0 && (VfdCache[file].fdstate & FD_TEMPORARY))
|
|
|
|
{
|
2012-06-10 21:20:04 +02:00
|
|
|
off_t newPos = VfdCache[file].seekPos + amount;
|
2011-07-17 20:19:31 +02:00
|
|
|
|
|
|
|
if (newPos > VfdCache[file].fileSize)
|
|
|
|
{
|
2012-06-10 21:20:04 +02:00
|
|
|
uint64 newTotal = temporary_files_size;
|
2011-07-17 20:19:31 +02:00
|
|
|
|
|
|
|
newTotal += newPos - VfdCache[file].fileSize;
|
|
|
|
if (newTotal > (uint64) temp_file_limit * (uint64) 1024)
|
|
|
|
ereport(ERROR,
|
|
|
|
(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
|
2012-06-10 21:20:04 +02:00
|
|
|
errmsg("temporary file size exceeds temp_file_limit (%dkB)",
|
|
|
|
temp_file_limit)));
|
2011-07-17 20:19:31 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-12-01 21:24:18 +01:00
|
|
|
retry:
|
2001-06-06 19:07:46 +02:00
|
|
|
errno = 0;
|
1997-09-07 07:04:48 +02:00
|
|
|
returnCode = write(VfdCache[file].fd, buffer, amount);
|
2001-06-06 19:07:46 +02:00
|
|
|
|
|
|
|
/* if write didn't set errno, assume problem is no disk space */
|
|
|
|
if (returnCode != amount && errno == 0)
|
|
|
|
errno = ENOSPC;
|
|
|
|
|
2005-12-01 21:24:18 +01:00
|
|
|
if (returnCode >= 0)
|
2011-07-17 20:19:31 +02:00
|
|
|
{
|
1997-09-07 07:04:48 +02:00
|
|
|
VfdCache[file].seekPos += returnCode;
|
2011-07-17 20:19:31 +02:00
|
|
|
|
|
|
|
/* maintain fileSize and temporary_files_size if it's a temp file */
|
|
|
|
if (VfdCache[file].fdstate & FD_TEMPORARY)
|
|
|
|
{
|
2012-06-10 21:20:04 +02:00
|
|
|
off_t newPos = VfdCache[file].seekPos;
|
2011-07-17 20:19:31 +02:00
|
|
|
|
|
|
|
if (newPos > VfdCache[file].fileSize)
|
|
|
|
{
|
|
|
|
temporary_files_size += newPos - VfdCache[file].fileSize;
|
|
|
|
VfdCache[file].fileSize = newPos;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2000-07-05 23:10:05 +02:00
|
|
|
else
|
2005-12-01 21:24:18 +01:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* See comments in FileRead()
|
|
|
|
*/
|
|
|
|
#ifdef WIN32
|
2006-10-04 02:30:14 +02:00
|
|
|
DWORD error = GetLastError();
|
2005-12-01 21:24:18 +01:00
|
|
|
|
|
|
|
switch (error)
|
|
|
|
{
|
|
|
|
case ERROR_NO_SYSTEM_RESOURCES:
|
|
|
|
pg_usleep(1000L);
|
|
|
|
errno = EINTR;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
_dosmaperr(error);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
/* OK to retry if interrupted */
|
|
|
|
if (errno == EINTR)
|
|
|
|
goto retry;
|
|
|
|
|
|
|
|
/* Trouble, so assume we don't know the file position anymore */
|
2000-06-14 05:19:24 +02:00
|
|
|
VfdCache[file].seekPos = FileUnknownPos;
|
2005-12-01 21:24:18 +01:00
|
|
|
}
|
1997-09-07 07:04:48 +02:00
|
|
|
|
|
|
|
return returnCode;
|
1996-07-09 08:22:35 +02:00
|
|
|
}
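FileWrite charges temp_file_limit only for bytes that extend a temporary file: rewriting existing data is free, and the check compares the backend's running total plus the growth against the limit in kilobytes. After a successful write the same growth is added to the total. The sketch below separates the check from the accounting; the struct and function names are hypothetical, and a negative limit means "no limit", as in the code above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
	int64_t		seekPos;		/* current logical position */
	int64_t		fileSize;		/* largest offset written so far */
} TempFileState;

static uint64_t temporary_files_size = 0;	/* per-backend running total */

/* Would writing 'amount' bytes at seekPos exceed limit_kb? */
static bool
write_would_exceed(const TempFileState *f, int amount, int limit_kb)
{
	int64_t		newPos;

	assert(amount >= 0);
	if (limit_kb < 0)
		return false;			/* limit disabled */
	newPos = f->seekPos + amount;
	if (newPos <= f->fileSize)
		return false;			/* overwriting existing data costs nothing */
	return temporary_files_size + (uint64_t) (newPos - f->fileSize) >
		(uint64_t) limit_kb * 1024;
}

/* After a successful write, advance the position and charge any growth */
static void
account_write(TempFileState *f, int written)
{
	assert(written >= 0);
	f->seekPos += written;
	if (f->seekPos > f->fileSize)
	{
		temporary_files_size += (uint64_t) (f->seekPos - f->fileSize);
		f->fileSize = f->seekPos;
	}
}
```

The limit is inclusive: reaching exactly limit_kb * 1024 bytes is allowed, and only the byte beyond it triggers the error.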
|
|
|
|
|
2004-05-31 05:48:10 +02:00
|
|
|
int
|
|
|
|
FileSync(File file)
|
|
|
|
{
|
|
|
|
int returnCode;
|
|
|
|
|
|
|
|
Assert(FileIsValid(file));
|
|
|
|
|
|
|
|
DO_DB(elog(LOG, "FileSync: %d (%s)",
|
|
|
|
file, VfdCache[file].fileName));
|
|
|
|
|
|
|
|
returnCode = FileAccess(file);
|
|
|
|
if (returnCode < 0)
|
|
|
|
return returnCode;
|
|
|
|
|
|
|
|
return pg_fsync(VfdCache[file].fd);
|
|
|
|
}
|
|
|
|
|
2008-03-10 21:06:27 +01:00
|
|
|
off_t
FileSeek(File file, off_t offset, int whence)
{
	int			returnCode;

	Assert(FileIsValid(file));

	DO_DB(elog(LOG, "FileSeek: %d (%s) " INT64_FORMAT " " INT64_FORMAT " %d",
			   file, VfdCache[file].fileName,
			   (int64) VfdCache[file].seekPos,
			   (int64) offset, whence));

	if (FileIsNotOpen(file))
	{
		switch (whence)
		{
			case SEEK_SET:
				if (offset < 0)
					elog(ERROR, "invalid seek offset: " INT64_FORMAT,
						 (int64) offset);
				VfdCache[file].seekPos = offset;
				break;
			case SEEK_CUR:
				VfdCache[file].seekPos += offset;
				break;
			case SEEK_END:
				returnCode = FileAccess(file);
				if (returnCode < 0)
					return returnCode;
				VfdCache[file].seekPos = lseek(VfdCache[file].fd,
											   offset, whence);
				break;
			default:
				elog(ERROR, "invalid whence: %d", whence);
				break;
		}
	}
	else
	{
		switch (whence)
		{
			case SEEK_SET:
				if (offset < 0)
					elog(ERROR, "invalid seek offset: " INT64_FORMAT,
						 (int64) offset);
				if (VfdCache[file].seekPos != offset)
					VfdCache[file].seekPos = lseek(VfdCache[file].fd,
												   offset, whence);
				break;
			case SEEK_CUR:
				if (offset != 0 || VfdCache[file].seekPos == FileUnknownPos)
					VfdCache[file].seekPos = lseek(VfdCache[file].fd,
												   offset, whence);
				break;
			case SEEK_END:
				VfdCache[file].seekPos = lseek(VfdCache[file].fd,
											   offset, whence);
				break;
			default:
				elog(ERROR, "invalid whence: %d", whence);
				break;
		}
	}
	return VfdCache[file].seekPos;
}
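
/*
 * Illustrative sketch only (not used by fd.c): FileSeek() above skips the
 * kernel lseek() for SEEK_SET when the cached seekPos already matches the
 * requested offset, and a failed lseek() leaves -1 in the cache.  Assuming
 * FileUnknownPos is that same -1, the next seek cannot match and so always
 * goes back to the kernel.  The hypothetical SketchSeekCache type and
 * sketch_seek_set() isolate that idea for a plain POSIX descriptor; the
 * lseek_calls counter exists only to make the skipped calls visible.
 */
#include <sys/types.h>
#include <unistd.h>

/* Assumed to match fd.c's FileUnknownPos and lseek()'s failure value. */
#define SKETCH_UNKNOWN_POS ((off_t) -1)

typedef struct SketchSeekCache
{
	int			fd;				/* real kernel descriptor */
	off_t		pos;			/* cached offset, or SKETCH_UNKNOWN_POS */
	int			lseek_calls;	/* lseek() calls actually issued */
} SketchSeekCache;

/*
 * Seek to an absolute offset, reusing the cached position when it matches.
 * Callers must pass offset >= 0, as FileSeek() enforces with elog(ERROR).
 */
static off_t
sketch_seek_set(SketchSeekCache *c, off_t offset)
{
	if (c->pos != offset)
	{
		c->lseek_calls++;
		c->pos = lseek(c->fd, offset, SEEK_SET);	/* -1 on failure */
	}
	return c->pos;
}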

/*
 * XXX not actually used but here for completeness
 */
#ifdef NOT_USED
off_t
FileTell(File file)
{
	Assert(FileIsValid(file));
	DO_DB(elog(LOG, "FileTell %d (%s)",
			   file, VfdCache[file].fileName));
	return VfdCache[file].seekPos;
}
#endif

int
FileTruncate(File file, off_t offset)
{
	int			returnCode;

	Assert(FileIsValid(file));

	DO_DB(elog(LOG, "FileTruncate %d (%s)",
			   file, VfdCache[file].fileName));

	returnCode = FileAccess(file);
	if (returnCode < 0)
		return returnCode;

	returnCode = ftruncate(VfdCache[file].fd, offset);

	if (returnCode == 0 && VfdCache[file].fileSize > offset)
	{
		/* adjust our state for truncation of a temp file */
		Assert(VfdCache[file].fdstate & FD_TEMPORARY);
		temporary_files_size -= VfdCache[file].fileSize - offset;
		VfdCache[file].fileSize = offset;
	}

	return returnCode;
}
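
/*
 * Illustrative sketch only (not used by fd.c): FileTruncate() above adjusts
 * the temp-file accounting only when ftruncate() succeeded and the file
 * really shrank.  It subtracts the dropped bytes from temporary_files_size
 * and lowers fileSize.  sketch_account_truncate() is a hypothetical helper
 * that applies the same rule to plain counters, with truncate_ok standing
 * in for "ftruncate() returned 0".
 */
#include <sys/types.h>

static void
sketch_account_truncate(off_t *total, off_t *file_size, off_t new_size,
						int truncate_ok)
{
	/* Growing the file, or a failed ftruncate(), leaves both counters alone. */
	if (truncate_ok && *file_size > new_size)
	{
		*total -= *file_size - new_size;	/* release the dropped bytes */
		*file_size = new_size;
	}
}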

/*
 * Return the pathname associated with an open file.
 *
 * The returned string points to an internal buffer, which is valid until
 * the file is closed.
 */
char *
FilePathName(File file)
{
	Assert(FileIsValid(file));

	return VfdCache[file].fileName;
}

/*
 * Make room for another allocatedDescs[] array entry if needed and possible.
 * Returns true if an array element is available.
 */
static bool
reserveAllocatedDesc(void)
{
	AllocateDesc *newDescs;
	int			newMax;

	/* Quick out if array already has a free slot. */
	if (numAllocatedDescs < maxAllocatedDescs)
		return true;

	/*
	 * If the array hasn't yet been created in the current process, initialize
	 * it with FD_MINFREE / 2 elements. In many scenarios this is as many as
	 * we will ever need, anyway. We don't want to look at max_safe_fds
	 * immediately because set_max_safe_fds() may not have run yet.
	 */
	if (allocatedDescs == NULL)
	{
		newMax = FD_MINFREE / 2;
		newDescs = (AllocateDesc *) malloc(newMax * sizeof(AllocateDesc));
		/* Out of memory already? Treat as fatal error. */
		if (newDescs == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
		allocatedDescs = newDescs;
		maxAllocatedDescs = newMax;
		return true;
	}

	/*
	 * Consider enlarging the array beyond the initial allocation used above.
	 * By the time this happens, max_safe_fds should be known accurately.
	 *
	 * We mustn't let allocated descriptors hog all the available FDs, and in
	 * practice we'd better leave a reasonable number of FDs for VFD use. So
	 * set the maximum to max_safe_fds / 2. (This should certainly be at
	 * least as large as the initial size, FD_MINFREE / 2.)
	 */
	newMax = max_safe_fds / 2;
	if (newMax > maxAllocatedDescs)
	{
		newDescs = (AllocateDesc *) realloc(allocatedDescs,
											newMax * sizeof(AllocateDesc));
		/* Treat out-of-memory as a non-fatal error. */
		if (newDescs == NULL)
			return false;
		allocatedDescs = newDescs;
		maxAllocatedDescs = newMax;
		return true;
	}

	/* Can't enlarge allocatedDescs[] any more. */
	return false;
}
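
/*
 * Illustrative sketch only (not used by fd.c): the sizing policy of
 * reserveAllocatedDesc() modeled on a plain int array.  The first request
 * allocates a small array (fd.c uses FD_MINFREE / 2 because max_safe_fds
 * may not be set yet); a later overflow grows it once, to max_safe_fds / 2,
 * and nothing grows it past that.  SketchDescArray, sketch_reserve(), and
 * passing the two limits as parameters are assumptions of this example.
 * Unlike fd.c, a failed first malloc() here returns 0 instead of raising
 * ERROR.
 */
#include <stdlib.h>

typedef struct SketchDescArray
{
	int		   *items;			/* stand-in for AllocateDesc entries */
	int			used;			/* plays the role of numAllocatedDescs */
	int			cap;			/* plays the role of maxAllocatedDescs */
} SketchDescArray;

/* Returns 1 if a free slot is available afterwards, else 0. */
static int
sketch_reserve(SketchDescArray *a, int initial_cap, int max_safe_fds)
{
	int		   *grown;
	int			new_cap;

	if (a->used < a->cap)
		return 1;				/* quick out: a slot is already free */

	if (a->items == NULL)
	{
		a->items = malloc(initial_cap * sizeof(int));
		if (a->items == NULL)
			return 0;
		a->cap = initial_cap;
		return 1;
	}

	new_cap = max_safe_fds / 2;	/* leave the other half for VFDs */
	if (new_cap > a->cap)
	{
		grown = realloc(a->items, new_cap * sizeof(int));
		if (grown == NULL)
			return 0;			/* non-fatal, as in fd.c */
		a->items = grown;
		a->cap = new_cap;
		return 1;
	}

	return 0;					/* already at the cap */
}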

/*
 * Routines that want to use stdio (ie, FILE*) should use AllocateFile
 * rather than plain fopen(). This lets fd.c deal with freeing FDs if
 * necessary to open the file. When done, call FreeFile rather than fclose.
 *
 * Note that files that will be open for any significant length of time
 * should NOT be handled this way, since they cannot share kernel file
 * descriptors with other files; there is grave risk of running out of FDs
 * if anyone locks down too many FDs. Most callers of this routine are
 * simply reading a config file that they will read and close immediately.
 *
 * fd.c will automatically close all files opened with AllocateFile at
 * transaction commit or abort; this prevents FD leakage if a routine
 * that calls AllocateFile is terminated prematurely by ereport(ERROR).
 *
 * Ideally this should be the *only* direct call of fopen() in the backend.
 */
FILE *
AllocateFile(const char *name, const char *mode)
{
	FILE	   *file;

	DO_DB(elog(LOG, "AllocateFile: Allocated %d (%s)",
			   numAllocatedDescs, name));

	/* Can we allocate another non-virtual FD? */
	if (!reserveAllocatedDesc())
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
				 errmsg("exceeded maxAllocatedDescs (%d) while trying to open file \"%s\"",
						maxAllocatedDescs, name)));

	/* Close excess kernel FDs. */
	ReleaseLruFiles();

TryAgain:
	if ((file = fopen(name, mode)) != NULL)
	{
		AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];

		desc->kind = AllocateDescFile;
		desc->desc.file = file;
		desc->create_subid = GetCurrentSubTransactionId();
		numAllocatedDescs++;
		return desc->desc.file;
	}

	if (errno == EMFILE || errno == ENFILE)
	{
		int			save_errno = errno;

		ereport(LOG,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
				 errmsg("out of file descriptors: %m; release and retry")));
		errno = 0;
		if (ReleaseLruFile())
			goto TryAgain;
		errno = save_errno;
	}

	return NULL;
}
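
/*
 * Illustrative sketch only (not used by fd.c): the retry loop of
 * AllocateFile() written as a plain loop.  When fopen() fails with EMFILE
 * or ENFILE, one descriptor is released and the open is retried; once
 * nothing more can be released, the original errno is restored.  The
 * release_one callback is a hypothetical stand-in for ReleaseLruFile().
 * Unlike fd.c, this sketch neither logs each retry nor registers the FILE
 * in allocatedDescs[].
 */
#include <errno.h>
#include <stdio.h>

/* Returns nonzero if it managed to close some descriptor. */
typedef int (*SketchReleaseFn) (void *state);

static FILE *
sketch_fopen_with_retry(const char *name, const char *mode,
						SketchReleaseFn release_one, void *state)
{
	for (;;)
	{
		FILE	   *f = fopen(name, mode);
		int			save_errno;

		if (f != NULL)
			return f;
		if (errno != EMFILE && errno != ENFILE)
			return NULL;		/* not an FD shortage; errno is fopen's */
		save_errno = errno;
		if (!release_one(state))
		{
			errno = save_errno; /* nothing left to release */
			return NULL;
		}
	}
}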

/*
 * Like AllocateFile, but returns an unbuffered fd like open(2)
 */
int
OpenTransientFile(FileName fileName, int fileFlags, int fileMode)
{
	int			fd;

	DO_DB(elog(LOG, "OpenTransientFile: Allocated %d (%s)",
			   numAllocatedDescs, fileName));

	/* Can we allocate another non-virtual FD? */
	if (!reserveAllocatedDesc())
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
				 errmsg("exceeded maxAllocatedDescs (%d) while trying to open file \"%s\"",
						maxAllocatedDescs, fileName)));

	/* Close excess kernel FDs. */
	ReleaseLruFiles();

	fd = BasicOpenFile(fileName, fileFlags, fileMode);

	if (fd >= 0)
	{
		AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];

		desc->kind = AllocateDescRawFD;
		desc->desc.fd = fd;
		desc->create_subid = GetCurrentSubTransactionId();
		numAllocatedDescs++;

		return fd;
	}

	return -1;					/* failure */
}
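
/*
 * Illustrative sketch only (not used by fd.c): why OpenTransientFile()
 * records each raw fd in allocatedDescs[].  Because the fd is remembered,
 * end-of-transaction cleanup can close it even if the caller never reached
 * its own close(), for example because ereport(ERROR) unwound the stack.
 * The fixed-size registry and the sketch_* names are hypothetical; fd.c
 * uses the growable allocatedDescs[] array and FreeDesc() instead.
 */
#include <fcntl.h>
#include <unistd.h>

#define SKETCH_MAX_TRANSIENT 16

static int	sketch_transient_fds[SKETCH_MAX_TRANSIENT];
static int	sketch_num_transient = 0;

/* Open a file and remember its fd; returns -1 on failure or if full. */
static int
sketch_open_transient(const char *path, int flags)
{
	int			fd;

	if (sketch_num_transient >= SKETCH_MAX_TRANSIENT)
		return -1;
	fd = open(path, flags);
	if (fd >= 0)
		sketch_transient_fds[sketch_num_transient++] = fd;
	return fd;
}

/* Close every remembered fd, as end-of-transaction cleanup would. */
static int
sketch_cleanup_transient(void)
{
	int			closed = 0;

	while (sketch_num_transient > 0)
	{
		if (close(sketch_transient_fds[--sketch_num_transient]) == 0)
			closed++;
	}
	return closed;
}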

/*
 * Routines that want to initiate a pipe stream should use OpenPipeStream
 * rather than plain popen(). This lets fd.c deal with freeing FDs if
 * necessary. When done, call ClosePipeStream rather than pclose.
 */
FILE *
OpenPipeStream(const char *command, const char *mode)
{
	FILE	   *file;

	DO_DB(elog(LOG, "OpenPipeStream: Allocated %d (%s)",
			   numAllocatedDescs, command));

	/* Can we allocate another non-virtual FD? */
	if (!reserveAllocatedDesc())
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
				 errmsg("exceeded maxAllocatedDescs (%d) while trying to execute command \"%s\"",
						maxAllocatedDescs, command)));

	/* Close excess kernel FDs. */
	ReleaseLruFiles();

TryAgain:
	fflush(stdout);
	fflush(stderr);
	errno = 0;
	if ((file = popen(command, mode)) != NULL)
	{
		AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];

		desc->kind = AllocateDescPipe;
		desc->desc.file = file;
		desc->create_subid = GetCurrentSubTransactionId();
		numAllocatedDescs++;
		return desc->desc.file;
	}

	if (errno == EMFILE || errno == ENFILE)
	{
		int			save_errno = errno;

		ereport(LOG,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
				 errmsg("out of file descriptors: %m; release and retry")));
		errno = 0;
		if (ReleaseLruFile())
			goto TryAgain;
		errno = save_errno;
	}

	return NULL;
}

/*
 * Free an AllocateDesc of any type.
 *
 * The argument *must* point into the allocatedDescs[] array.
 */
static int
FreeDesc(AllocateDesc *desc)
{
	int			result;

	/* Close the underlying object */
	switch (desc->kind)
	{
		case AllocateDescFile:
			result = fclose(desc->desc.file);
			break;
		case AllocateDescPipe:
			result = pclose(desc->desc.file);
			break;
		case AllocateDescDir:
			result = closedir(desc->desc.dir);
			break;
		case AllocateDescRawFD:
			result = close(desc->desc.fd);
			break;
		default:
			elog(ERROR, "AllocateDesc kind not recognized");
			result = 0;			/* keep compiler quiet */
			break;
	}

	/* Compact storage in the allocatedDescs array */
	numAllocatedDescs--;
	*desc = allocatedDescs[numAllocatedDescs];

	return result;
}
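
/*
 * Illustrative sketch only (not used by fd.c): the "compact storage" step
 * of FreeDesc().  The freed slot is overwritten with the last element and
 * the count shrinks, so removal is O(1) but array order is not preserved.
 * That is acceptable for allocatedDescs[] because lookups such as the one
 * in FreeFile() scan every live entry anyway.  The int array and
 * sketch_remove_at() are hypothetical stand-ins.
 */
#include <assert.h>

static void
sketch_remove_at(int *items, int *n, int idx)
{
	assert(idx >= 0 && idx < *n);	/* like "must point into the array" */
	(*n)--;
	items[idx] = items[*n];		/* move the last element into the hole */
}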

/*
 * Close a file returned by AllocateFile.
 *
 * Note we do not check fclose's return value --- it is up to the caller
 * to handle close errors.
 */
int
FreeFile(FILE *file)
{
	int			i;

	DO_DB(elog(LOG, "FreeFile: Allocated %d", numAllocatedDescs));

	/* Remove file from list of allocated files, if it's present */
	for (i = numAllocatedDescs; --i >= 0;)
	{
		AllocateDesc *desc = &allocatedDescs[i];

		if (desc->kind == AllocateDescFile && desc->desc.file == file)
			return FreeDesc(desc);
	}

	/* Only get here if someone passes us a file not in allocatedDescs */
	elog(WARNING, "file passed to FreeFile was not obtained from AllocateFile");

	return fclose(file);
}
2012-11-27 09:25:50 +01:00
/*
 * Close a file returned by OpenTransientFile.
 *
 * Note we do not check close's return value --- it is up to the caller
 * to handle close errors.
 */
int
CloseTransientFile(int fd)
{
	int			i;

	DO_DB(elog(LOG, "CloseTransientFile: Allocated %d", numAllocatedDescs));

	/* Remove fd from list of allocated files, if it's present */
	for (i = numAllocatedDescs; --i >= 0;)
	{
		AllocateDesc *desc = &allocatedDescs[i];

		if (desc->kind == AllocateDescRawFD && desc->desc.fd == fd)
			return FreeDesc(desc);
	}

	/* Only get here if someone passes us a file not in allocatedDescs */
	elog(WARNING, "fd passed to CloseTransientFile was not obtained from OpenTransientFile");

	return close(fd);
}
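FreeFile, CloseTransientFile, FreeDir and ClosePipeStream all follow the same pattern. Each one scans the registry from newest to oldest for an entry of the right kind that holds the same handle, and releases it through the common path. If nothing matches, it warns and closes the handle anyway so the handle is not leaked. The sketch below shows this for raw POSIX descriptors. It assumes a hypothetical fixed-size registry (`tracked_fds`, `track_fd`, `close_tracked_fd`) in place of fd.c's growable `allocatedDescs`, and it prints to stderr where fd.c uses elog(WARNING).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_TRACKED 32			/* hypothetical fixed capacity */

static int	tracked_fds[MAX_TRACKED];
static int	num_tracked = 0;

/* Remember an fd so it can be released later (cf. OpenTransientFile). */
static int
track_fd(int fd)
{
	if (fd < 0 || num_tracked >= MAX_TRACKED)
		return -1;
	tracked_fds[num_tracked++] = fd;
	return fd;
}

/* Release a tracked fd, searching newest-first (cf. CloseTransientFile). */
static int
close_tracked_fd(int fd)
{
	for (int i = num_tracked; --i >= 0;)
	{
		if (tracked_fds[i] == fd)
		{
			/* swap-with-last compaction, as FreeDesc does */
			tracked_fds[i] = tracked_fds[--num_tracked];
			return close(fd);
		}
	}
	/* Not ours: complain, but still close so the descriptor is not leaked. */
	fprintf(stderr, "fd %d was not obtained from track_fd\n", fd);
	return close(fd);
}
```

Searching from the end finds recently opened handles first, and those are the ones most likely to be closed soon.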
2004-02-24 00:03:10 +01:00

/*
 * Routines that want to use <dirent.h> (ie, DIR*) should use AllocateDir
 * rather than plain opendir(). This lets fd.c deal with freeing FDs if
 * necessary to open the directory, and with closing it after an elog.
 * When done, call FreeDir rather than closedir.
 *
 * Ideally this should be the *only* direct call of opendir() in the backend.
 */
DIR *
AllocateDir(const char *dirname)
{
2004-08-29 07:07:03 +02:00
	DIR		   *dir;
2004-02-24 00:03:10 +01:00

2004-07-28 16:23:31 +02:00
	DO_DB(elog(LOG, "AllocateDir: Allocated %d (%s)",
			   numAllocatedDescs, dirname));
2004-02-24 00:03:10 +01:00
Remove fixed limit on the number of concurrent AllocateFile() requests.
AllocateFile(), AllocateDir(), and some sister routines share a small array
for remembering requests, so that the files can be closed on transaction
failure. Previously that array had a fixed size, MAX_ALLOCATED_DESCS (32).
While historically that had seemed sufficient, Steve Toutant pointed out
that this meant you couldn't scan more than 32 file_fdw foreign tables in
one query, because file_fdw depends on the COPY code which uses
AllocateFile(). There are probably other cases, or will be in the future,
where this nonconfigurable limit impedes users.
We can't completely remove any such limit, at least not without a lot of
work, since each such request requires a kernel file descriptor and most
platforms limit the number we can have. (In principle we could
"virtualize" these descriptors, as fd.c already does for the main VFD pool,
but not without an additional layer of overhead and a lot of notational
impact on the calling code.) But we can at least let the array size be
configurable. Hence, change the code to allow up to max_safe_fds/2
allocated file requests. On modern platforms this should allow several
hundred concurrent file_fdw scans, or more if one increases the value of
max_files_per_process. To go much further than that, we'd need to do some
more work on the data structure, since the current code for closing
requests has potentially O(N^2) runtime; but it should still be all right
for request counts in this range.
Back-patch to 9.1 where contrib/file_fdw was introduced.
2013-06-09 19:46:54 +02:00
	/* Can we allocate another non-virtual FD? */
	if (!reserveAllocatedDesc())
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
				 errmsg("exceeded maxAllocatedDescs (%d) while trying to open directory \"%s\"",
						maxAllocatedDescs, dirname)));

	/* Close excess kernel FDs. */
	ReleaseLruFiles();
2004-02-24 00:03:10 +01:00

TryAgain:
	if ((dir = opendir(dirname)) != NULL)
	{
2004-07-28 16:23:31 +02:00
		AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];

		desc->kind = AllocateDescDir;
		desc->desc.dir = dir;
2004-09-16 18:58:44 +02:00
		desc->create_subid = GetCurrentSubTransactionId();
2004-07-28 16:23:31 +02:00
		numAllocatedDescs++;
		return desc->desc.dir;
2004-02-24 00:03:10 +01:00
	}

	if (errno == EMFILE || errno == ENFILE)
	{
		int			save_errno = errno;

		ereport(LOG,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
2005-10-15 04:49:52 +02:00
				 errmsg("out of file descriptors: %m; release and retry")));
2004-02-24 00:03:10 +01:00
		errno = 0;
		if (ReleaseLruFile())
			goto TryAgain;
		errno = save_errno;
	}

	return NULL;
}

2005-06-19 23:34:03 +02:00
/*
 * Read a directory opened with AllocateDir, ereport'ing any error.
 *
 * This is easier to use than raw readdir() since it takes care of some
2014-05-06 18:12:18 +02:00
 * otherwise rather tedious and error-prone manipulation of errno. Also,
2005-06-19 23:34:03 +02:00
 * if you are happy with a generic error message for AllocateDir failure,
 * you can just do
 *
 *		dir = AllocateDir(path);
 *		while ((dirent = ReadDir(dir, path)) != NULL)
 *			process dirent;
2005-12-08 16:38:29 +01:00
 *		FreeDir(dir);
2005-06-19 23:34:03 +02:00
 *
 * since a NULL dir parameter is taken as indicating AllocateDir failed.
 * (Make sure errno hasn't been changed since AllocateDir if you use this
 * shortcut.)
 *
 * The pathname passed to AllocateDir must be passed to this routine too,
 * but it is only used for error reporting.
 */
struct dirent *
ReadDir(DIR *dir, const char *dirname)
Fix fsync-at-startup code to not treat errors as fatal.
Commit 2ce439f3379aed857517c8ce207485655000fc8e introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.
After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/. The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.
This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.
I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().
The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.
Back-patch to all active branches, like the prior commit.
Abhijit Menon-Sen and Tom Lane
2015-05-28 23:33:03 +02:00
{
	return ReadDirExtended(dir, dirname, ERROR);
}

/*
 * Alternate version that allows caller to specify the elevel for any
 * error report. If elevel < ERROR, returns NULL on any error.
 */
static struct dirent *
ReadDirExtended(DIR *dir, const char *dirname, int elevel)
2005-06-19 23:34:03 +02:00
{
	struct dirent *dent;

	/* Give a generic message for AllocateDir failure, if caller didn't */
	if (dir == NULL)
2015-05-28 23:33:03 +02:00
	{
		ereport(elevel,
2005-06-19 23:34:03 +02:00
				(errcode_for_file_access(),
				 errmsg("could not open directory \"%s\": %m",
						dirname)));
2015-05-28 23:33:03 +02:00
		return NULL;
	}
2005-06-19 23:34:03 +02:00

	errno = 0;
	if ((dent = readdir(dir)) != NULL)
		return dent;

	if (errno)
2015-05-28 23:33:03 +02:00
		ereport(elevel,
2005-06-19 23:34:03 +02:00
				(errcode_for_file_access(),
				 errmsg("could not read directory \"%s\": %m",
						dirname)));
	return NULL;
}
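readdir() returns NULL both at end of directory and on error. The only way to tell the two cases apart is to zero errno before the call and inspect it afterwards, which is what ReadDirExtended does. The sketch below does the same with plain POSIX calls. The hypothetical helper `count_entries` returns -1 on an open or read error instead of calling ereport.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <string.h>

/* Count entries other than "." and "..", or return -1 on error. */
static int
count_entries(const char *path)
{
	DIR		   *dir = opendir(path);
	struct dirent *de;
	int			n = 0;

	if (dir == NULL)
		return -1;				/* errno already set by opendir */

	for (;;)
	{
		errno = 0;				/* readdir leaves errno alone at end-of-dir */
		if ((de = readdir(dir)) == NULL)
			break;
		if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0)
			n++;
	}

	if (errno != 0)
	{
		int			save_errno = errno;

		closedir(dir);
		errno = save_errno;
		return -1;				/* genuine read error, not end-of-dir */
	}
	closedir(dir);
	return n;
}
```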

2004-02-24 00:03:10 +01:00
/*
 * Close a directory opened with AllocateDir.
 *
 * Note we do not check closedir's return value --- it is up to the caller
 * to handle close errors.
 */
int
FreeDir(DIR *dir)
{
	int			i;

2004-07-28 16:23:31 +02:00
	DO_DB(elog(LOG, "FreeDir: Allocated %d", numAllocatedDescs));
2004-02-24 00:03:10 +01:00

	/* Remove dir from list of allocated dirs, if it's present */
2004-07-28 16:23:31 +02:00
	for (i = numAllocatedDescs; --i >= 0;)
2004-02-24 00:03:10 +01:00
	{
2004-07-28 16:23:31 +02:00
		AllocateDesc *desc = &allocatedDescs[i];

		if (desc->kind == AllocateDescDir && desc->desc.dir == dir)
			return FreeDesc(desc);
2004-02-24 00:03:10 +01:00
	}
2004-07-28 16:23:31 +02:00

	/* Only get here if someone passes us a dir not in allocatedDescs */
	elog(WARNING, "dir passed to FreeDir was not obtained from AllocateDir");
2004-02-24 00:03:10 +01:00

	return closedir(dir);
}

Add support for piping COPY to/from an external program.
This includes backend "COPY TO/FROM PROGRAM '...'" syntax, and corresponding
psql \copy syntax. Like with reading/writing files, the backend version is
superuser-only, and in the psql version, the program is run in the client.
In passing, the psql \copy STDIN/STDOUT syntax is subtly changed: if
the stdin/stdout is quoted, it's now interpreted as a filename. For example,
"\copy foo from 'stdin'" now reads from a file called 'stdin', not from
standard input. Before this, there was no way to specify a filename called
stdin, stdout, pstdin or pstdout.
This creates a new function in pgport, wait_result_to_str(), which can
be used to convert the exit status of a process, as returned by wait(3),
to a human-readable string.
Etsuro Fujita, reviewed by Amit Kapila.
2013-02-27 17:17:21 +01:00
/*
 * Close a pipe stream returned by OpenPipeStream.
 */
int
ClosePipeStream(FILE *file)
{
	int			i;

	DO_DB(elog(LOG, "ClosePipeStream: Allocated %d", numAllocatedDescs));

	/* Remove file from list of allocated files, if it's present */
	for (i = numAllocatedDescs; --i >= 0;)
	{
		AllocateDesc *desc = &allocatedDescs[i];

		if (desc->kind == AllocateDescPipe && desc->desc.file == file)
			return FreeDesc(desc);
	}

	/* Only get here if someone passes us a file not in allocatedDescs */
	elog(WARNING, "file passed to ClosePipeStream was not obtained from OpenPipeStream");

	return pclose(file);
}

1999-05-09 02:52:08 +02:00
/*
 * closeAllVfds
 *
 * Force all VFDs into the physically-closed state, so that the fewest
 * possible number of kernel file descriptors are in use. There is no
 * change in the logical state of the VFDs.
 */
1996-07-09 08:22:35 +02:00
void
2000-08-27 23:48:00 +02:00
closeAllVfds(void)
1996-07-09 08:22:35 +02:00
{
1999-05-09 02:52:08 +02:00
	Index		i;

	if (SizeVfdCache > 0)
	{
1999-05-25 18:15:34 +02:00
		Assert(FileIsNotOpen(0));	/* Make sure ring not corrupted */
1999-05-09 02:52:08 +02:00
		for (i = 1; i < SizeVfdCache; i++)
		{
			if (!FileIsNotOpen(i))
				LruDelete(i);
		}
	}
}

2007-06-07 21:19:57 +02:00

/*
 * SetTempTablespaces
 *
 * Define a list (actually an array) of OIDs of tablespaces to use for
 * temporary files. This list will be used until end of transaction,
 * unless this function is called again before then. It is caller's
 * responsibility that the passed-in array has adequate lifespan (typically
 * it'd be allocated in TopTransactionContext).
 */
void
SetTempTablespaces(Oid *tableSpaces, int numSpaces)
{
	Assert(numSpaces >= 0);
	tempTableSpaces = tableSpaces;
	numTempTableSpaces = numSpaces;
2007-11-15 22:14:46 +01:00

2007-06-07 21:19:57 +02:00
	/*
2014-05-06 18:12:18 +02:00
	 * Select a random starting point in the list. This is to minimize
2007-11-15 22:14:46 +01:00
	 * conflicts between backends that are most likely sharing the same list
	 * of temp tablespaces. Note that if we create multiple temp files in the
	 * same transaction, we'll advance circularly through the list --- this
	 * ensures that large temporary sort files are nicely spread across all
	 * available tablespaces.
2007-06-07 21:19:57 +02:00
	 */
	if (numSpaces > 1)
		nextTempTableSpace = random() % numSpaces;
	else
		nextTempTableSpace = 0;
}

/*
 * TempTablespacesAreSet
 *
 * Returns TRUE if SetTempTablespaces has been called in current transaction.
 * (This is just so that tablespaces.c doesn't need its own per-transaction
 * state.)
 */
bool
TempTablespacesAreSet(void)
{
	return (numTempTableSpaces >= 0);
}

/*
 * GetNextTempTableSpace
 *
2014-05-06 18:12:18 +02:00
 * Select the next temp tablespace to use. A result of InvalidOid means
2007-06-07 21:19:57 +02:00
 * to use the current database's default tablespace.
 */
Oid
GetNextTempTableSpace(void)
{
	if (numTempTableSpaces > 0)
	{
		/* Advance nextTempTableSpace counter with wraparound */
		if (++nextTempTableSpace >= numTempTableSpaces)
			nextTempTableSpace = 0;
		return tempTableSpaces[nextTempTableSpace];
	}
	return InvalidOid;
}
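SetTempTablespaces picks a random starting index, and GetNextTempTableSpace pre-increments that index with wraparound, so successive temp files rotate through every listed tablespace. The sketch below reproduces that rotation in a self-contained form. In this sketch the caller passes in the starting index instead of drawing it from random(), which makes the behaviour deterministic. `TempSpaceList` and the use of 0 for InvalidOid are assumptions of the sketch.

```c
#include <assert.h>

typedef unsigned int Oid;
#define INVALID_OID ((Oid) 0)	/* assumption: mirrors InvalidOid */

typedef struct TempSpaceList
{
	const Oid  *spaces;
	int			num;			/* -1 means "not set this transaction" */
	int			next;
} TempSpaceList;

static void
temp_spaces_set(TempSpaceList *l, const Oid *spaces, int num, int start)
{
	assert(num >= 0 && start >= 0);
	l->spaces = spaces;
	l->num = num;
	/* fd.c uses random() % num here; the caller supplies it for testability */
	l->next = (num > 1) ? start % num : 0;
}

static Oid
temp_spaces_next(TempSpaceList *l)
{
	if (l->num > 0)
	{
		/* advance first, wrapping around, then return */
		if (++l->next >= l->num)
			l->next = 0;
		return l->spaces[l->next];
	}
	return INVALID_OID;			/* use the database's default tablespace */
}
```

The first call after setting the list returns the entry *after* the starting index, because the counter is advanced before it is read.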

2004-07-28 16:23:31 +02:00
/*
 * AtEOSubXact_Files
 *
 * Take care of subtransaction commit/abort. At abort, we close temp files
 * that the subtransaction may have opened. At commit, we reassign the
2004-09-16 18:58:44 +02:00
 * files that were opened to the parent subtransaction.
2004-07-28 16:23:31 +02:00
 */
void
2004-09-16 18:58:44 +02:00
AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
				  SubTransactionId parentSubid)
2004-07-28 16:23:31 +02:00
{
2004-08-29 07:07:03 +02:00
	Index		i;
2004-07-28 16:23:31 +02:00

	for (i = 0; i < numAllocatedDescs; i++)
	{
2004-09-16 18:58:44 +02:00
		if (allocatedDescs[i].create_subid == mySubid)
2004-07-28 16:23:31 +02:00
		{
			if (isCommit)
2004-09-16 18:58:44 +02:00
				allocatedDescs[i].create_subid = parentSubid;
2004-07-28 16:23:31 +02:00
			else
			{
				/* have to recheck the item after FreeDesc (ugly) */
				FreeDesc(&allocatedDescs[i--]);
			}
		}
	}
}
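At subtransaction commit, AtEOSubXact_Files reparents matching entries. At abort it frees them instead. Because FreeDesc moves the last entry into the freed slot, the loop must look at index i again, which is why the code uses `i--`. The sketch below runs the same loop over a hypothetical array of owning subtransaction IDs. The test data is arranged so that a moved entry also matches, which is the case the recheck exists for.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int SubXactId;	/* stand-in for SubTransactionId */

typedef struct Registry
{
	SubXactId	owner[16];		/* create_subid of each live entry */
	int			count;
} Registry;

/* Swap-with-last removal, as FreeDesc does. */
static void
registry_free(Registry *r, int idx)
{
	r->count--;
	r->owner[idx] = r->owner[r->count];
}

static void
at_eosubxact(Registry *r, bool is_commit, SubXactId my_subid,
			 SubXactId parent_subid)
{
	for (int i = 0; i < r->count; i++)
	{
		if (r->owner[i] != my_subid)
			continue;
		if (is_commit)
			r->owner[i] = parent_subid; /* hand ownership to the parent */
		else
			registry_free(r, i--);	/* slot i now holds a moved entry: recheck */
	}
}
```

Without the `i--`, aborting subtransaction 2 in `{1, 2, 3, 2}` would leave one of the two entries behind, because the second one is moved into slot 1 after the first has already been inspected.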

1999-05-09 02:52:08 +02:00
/*
 * AtEOXact_Files
 *
2003-04-29 05:21:30 +02:00
 * This routine is called during transaction commit or abort (it doesn't
 * particularly care which). All still-open per-transaction temporary file
2009-12-03 12:03:29 +01:00
 * VFDs are closed, which also causes the underlying files to be deleted
 * (although they should've been closed already by the ResourceOwner
2012-10-17 18:37:08 +02:00
 * cleanup). Furthermore, all "allocated" stdio files are closed. We also
 * forget any transaction-local temp tablespace list.
1999-05-09 02:52:08 +02:00
 */
void
AtEOXact_Files(void)
{
2003-04-29 05:21:30 +02:00
	CleanupTempFiles(false);
2007-06-07 21:19:57 +02:00
	tempTableSpaces = NULL;
	numTempTableSpaces = -1;
2003-04-29 05:21:30 +02:00
}

/*
 * AtProcExit_Files
 *
 * on_proc_exit hook to clean up temp files during backend shutdown.
 * Here, we want to clean up *all* temp files including interXact ones.
 */
static void
2003-12-12 19:45:10 +01:00
AtProcExit_Files(int code, Datum arg)
2003-04-29 05:21:30 +02:00
{
	CleanupTempFiles(true);
}

/*
2012-10-17 18:37:08 +02:00
 * Close temporary files and delete their underlying files.
2003-04-29 05:21:30 +02:00
 *
 * isProcExit: if true, this is being called as the backend process is
 * exiting. If that's the case, we should remove all temporary files; if
 * that's not the case, we are being called for transaction commit/abort
 * and should only remove transaction-local temp files. In either case,
2012-11-27 09:25:50 +01:00
 * also clean up "allocated" stdio files, dirs and fds.
2003-04-29 05:21:30 +02:00
 */
static void
CleanupTempFiles(bool isProcExit)
{
2003-08-04 02:43:34 +02:00
	Index		i;
1999-05-09 02:52:08 +02:00

2008-09-19 06:57:10 +02:00
	/*
	 * Careful here: at proc_exit we need extra cleanup, not just
	 * xact_temporary files.
	 */
2012-10-17 18:37:08 +02:00
	if (isProcExit || have_xact_temporary_files)
1999-05-09 02:52:08 +02:00
	{
1999-05-25 18:15:34 +02:00
		Assert(FileIsNotOpen(0));	/* Make sure ring not corrupted */
1999-05-09 02:52:08 +02:00
		for (i = 1; i < SizeVfdCache; i++)
		{
2003-04-29 05:21:30 +02:00
			unsigned short fdstate = VfdCache[i].fdstate;

2012-10-17 18:37:08 +02:00
			if ((fdstate & FD_TEMPORARY) && VfdCache[i].fileName != NULL)
2003-04-29 05:21:30 +02:00
			{
2012-10-17 18:37:08 +02:00
				/*
				 * If we're in the process of exiting a backend process, close
				 * all temporary files. Otherwise, only close temporary files
				 * local to the current transaction. They should be closed by
				 * the ResourceOwner mechanism already, so this is just a
				 * debugging cross-check.
				 */
				if (isProcExit)
					FileClose(i);
				else if (fdstate & FD_XACT_TEMPORARY)
2009-12-03 12:03:29 +01:00
				{
2012-10-17 18:37:08 +02:00
					elog(WARNING,
						 "temporary file %s not closed at end-of-transaction",
						 VfdCache[i].fileName);
					FileClose(i);
2009-12-03 12:03:29 +01:00
				}
2003-04-29 05:21:30 +02:00
			}
1999-05-09 02:52:08 +02:00
		}
2008-09-19 06:57:10 +02:00

2012-10-17 18:37:08 +02:00
		have_xact_temporary_files = false;
1999-05-09 02:52:08 +02:00
	}
2012-11-27 09:25:50 +01:00
	/* Clean up "allocated" stdio files, dirs and fds. */
2004-07-28 16:23:31 +02:00
	while (numAllocatedDescs > 0)
		FreeDesc(&allocatedDescs[0]);
1999-05-09 02:52:08 +02:00
}
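For each VFD, CleanupTempFiles makes its decision from two bit flags. At process exit, every FD_TEMPORARY file that still has a name is closed. At end of transaction, only files also marked FD_XACT_TEMPORARY are closed, and the code logs a warning because ResourceOwner cleanup should already have closed them. The sketch below isolates just that per-file decision and ignores the outer `have_xact_temporary_files` guard. The flag values are placeholders, not necessarily fd.c's, and `cleanup_action` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values; fd.c defines its own. */
#define SKETCH_FD_TEMPORARY			(1 << 0)
#define SKETCH_FD_XACT_TEMPORARY	(1 << 1)

typedef enum
{
	KEEP_OPEN,
	CLOSE_QUIETLY,
	CLOSE_WITH_WARNING
} CleanupAction;

static CleanupAction
cleanup_action(unsigned short fdstate, bool has_name, bool is_proc_exit)
{
	if (!(fdstate & SKETCH_FD_TEMPORARY) || !has_name)
		return KEEP_OPEN;		/* not a live temporary file */
	if (is_proc_exit)
		return CLOSE_QUIETLY;	/* backend exit: remove every temp file */
	if (fdstate & SKETCH_FD_XACT_TEMPORARY)
		return CLOSE_WITH_WARNING;	/* should have been closed already */
	return KEEP_OPEN;			/* interXact temp file survives the xact */
}
```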
2001-06-11 06:12:29 +02:00

/*
2010-08-13 22:10:54 +02:00
 * Remove temporary and temporary relation files left over from a prior
 * postmaster session
2001-06-11 06:12:29 +02:00
 *
 * This should be called during postmaster startup. It will forcibly
2010-08-13 22:10:54 +02:00
 * remove any leftover files created by OpenTemporaryFile and any leftover
 * temporary relation files created by mdcreate.
2003-04-29 05:21:30 +02:00
 *
 * NOTE: we could, but don't, call this during a post-backend-crash restart
 * cycle. The argument for not doing it is that someone might want to examine
 * the temp files for debugging purposes. This does however mean that
 * OpenTemporaryFile had better allow for collision with an existing temp
 * file name.
2001-06-11 06:12:29 +02:00
 */
void
RemovePgTempFiles(void)
{
2001-10-25 07:50:21 +02:00
	char		temp_path[MAXPGPATH];
2007-06-03 19:08:34 +02:00
	DIR		   *spc_dir;
	struct dirent *spc_de;

	/*
	 * First process temp files in pg_default ($PGDATA/base)
	 */
	snprintf(temp_path, sizeof(temp_path), "base/%s", PG_TEMP_FILES_DIR);
	RemovePgTempFilesInDir(temp_path);
2010-08-13 22:10:54 +02:00
	RemovePgTempRelationFiles("base");
2001-06-11 06:12:29 +02:00

	/*
2007-06-03 19:08:34 +02:00
	 * Cycle through temp directories for all non-default tablespaces.
2001-06-11 06:12:29 +02:00
	 */
2007-06-03 19:08:34 +02:00
	spc_dir = AllocateDir("pg_tblspc");
2004-12-29 22:36:09 +01:00

2007-06-03 19:08:34 +02:00
	while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
2004-12-29 22:36:09 +01:00
	{
2007-06-03 19:08:34 +02:00
		if (strcmp(spc_de->d_name, ".") == 0 ||
			strcmp(spc_de->d_name, "..") == 0)
2004-12-29 22:36:09 +01:00
			continue;

2010-01-12 03:42:52 +01:00
		snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s/%s",
2010-02-26 03:01:40 +01:00
			spc_de->d_name, TABLESPACE_VERSION_DIRECTORY, PG_TEMP_FILES_DIR);
2004-12-29 22:36:09 +01:00
		RemovePgTempFilesInDir(temp_path);
2010-08-13 22:10:54 +02:00

		snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
2011-04-10 17:42:00 +02:00
				 spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
2010-08-13 22:10:54 +02:00
		RemovePgTempRelationFiles(temp_path);
2004-12-29 22:36:09 +01:00
	}

2007-06-03 19:08:34 +02:00
	FreeDir(spc_dir);
2004-12-29 22:36:09 +01:00

	/*
2005-10-15 04:49:52 +02:00
	 * In EXEC_BACKEND case there is a pgsql_tmp directory at the top level of
	 * DataDir as well.
2004-12-29 22:36:09 +01:00
	 */
#ifdef EXEC_BACKEND
2005-07-04 06:51:52 +02:00
	RemovePgTempFilesInDir(PG_TEMP_FILES_DIR);
2003-12-20 18:31:21 +01:00
#endif
2004-12-29 22:36:09 +01:00
}
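RemovePgTempFiles builds each per-tablespace temp directory path with snprintf into a MAXPGPATH buffer and does not check whether the result was truncated. The sketch below builds the same path layout and adds that truncation check. The `version_dir` argument stands in for TABLESPACE_VERSION_DIRECTORY, whose real value depends on the server version. `pgsql_tmp` is assumed as the value of PG_TEMP_FILES_DIR. The tablespace OID `16384` and the directory name `PG_X` in the usage test are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_TEMP_FILES_DIR "pgsql_tmp"	/* assumed PG_TEMP_FILES_DIR value */

/*
 * Build "pg_tblspc/<spc_name>/<version_dir>/pgsql_tmp" into buf.
 * Returns false if the result did not fit (snprintf truncated it).
 */
static bool
tablespace_temp_path(char *buf, size_t buflen,
					 const char *spc_name, const char *version_dir)
{
	int			n = snprintf(buf, buflen, "pg_tblspc/%s/%s/%s",
							 spc_name, version_dir, SKETCH_TEMP_FILES_DIR);

	return n >= 0 && (size_t) n < buflen;
}
```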

/* Process one pgsql_tmp directory for RemovePgTempFiles */
static void
RemovePgTempFilesInDir(const char *tmpdirname)
{
	DIR		   *temp_dir;
	struct dirent *temp_de;
	char		rm_path[MAXPGPATH];

	temp_dir = AllocateDir(tmpdirname);
	if (temp_dir == NULL)
	{
		/* anything except ENOENT is fishy */
		if (errno != ENOENT)
			elog(LOG,
				 "could not open temporary-files directory \"%s\": %m",
				 tmpdirname);
		return;
	}

	while ((temp_de = ReadDir(temp_dir, tmpdirname)) != NULL)
	{
		if (strcmp(temp_de->d_name, ".") == 0 ||
			strcmp(temp_de->d_name, "..") == 0)
			continue;

		snprintf(rm_path, sizeof(rm_path), "%s/%s",
				 tmpdirname, temp_de->d_name);

		if (strncmp(temp_de->d_name,
					PG_TEMP_FILE_PREFIX,
					strlen(PG_TEMP_FILE_PREFIX)) == 0)
			unlink(rm_path);	/* note we ignore any error */
		else
			elog(LOG,
				 "unexpected file found in temporary-files directory: \"%s\"",
				 rm_path);
	}

	FreeDir(temp_dir);
}
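The sweep above can be reproduced outside the backend with plain POSIX calls. This is a minimal sketch, not the server's code. It uses `opendir`/`readdir`/`unlink` directly instead of `AllocateDir`/`ReadDir`, and it prints to stderr instead of calling `elog`. The function name `remove_prefixed_files` is hypothetical. The caller is assumed to pass the temp-file prefix, which in the server is `PG_TEMP_FILE_PREFIX`.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: unlink every entry in dirname whose name starts with
 * prefix, report any other entry on stderr, and return how many files were
 * actually removed.  Returns -1 if the directory cannot be opened.
 */
static int
remove_prefixed_files(const char *dirname, const char *prefix)
{
	DIR		   *dir = opendir(dirname);
	struct dirent *de;
	char		path[4096];
	int			removed = 0;

	if (dir == NULL)
		return -1;

	while ((de = readdir(dir)) != NULL)
	{
		/* skip the self and parent entries */
		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;

		snprintf(path, sizeof(path), "%s/%s", dirname, de->d_name);

		if (strncmp(de->d_name, prefix, strlen(prefix)) == 0)
		{
			/* as in the server code, a failed unlink is not fatal */
			if (unlink(path) == 0)
				removed++;
		}
		else
			fprintf(stderr, "unexpected file found: \"%s\"\n", path);
	}

	closedir(dir);
	return removed;
}
```

As in the server code, names without the prefix are only reported and never deleted. A stray file left in the directory therefore survives the sweep.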

/* Process one tablespace directory, looking for per-DB subdirectories */
static void
RemovePgTempRelationFiles(const char *tsdirname)
{
	DIR		   *ts_dir;
	struct dirent *de;
	char		dbspace_path[MAXPGPATH];

	ts_dir = AllocateDir(tsdirname);
	if (ts_dir == NULL)
	{
		/* anything except ENOENT is fishy */
		if (errno != ENOENT)
			elog(LOG,
				 "could not open tablespace directory \"%s\": %m",
				 tsdirname);
		return;
	}

	while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
	{
		int			i = 0;

		/*
		 * We're only interested in the per-database directories, which have
		 * numeric names. Note that this code will also (properly) ignore "."
		 * and "..".
		 */
		while (isdigit((unsigned char) de->d_name[i]))
			++i;
		if (de->d_name[i] != '\0' || i == 0)
			continue;

		snprintf(dbspace_path, sizeof(dbspace_path), "%s/%s",
				 tsdirname, de->d_name);
		RemovePgTempRelationFilesInDbspace(dbspace_path);
	}

	FreeDir(ts_dir);
}
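The per-database filter above accepts only names made entirely of digits. The standalone sketch below pulls that check out into its own function so the edge cases are easy to see. The name `is_db_dir_name` is hypothetical. The empty string, ".", and ".." are all rejected, because either no digit is seen or a non-digit stops the scan before the terminator.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the check in RemovePgTempRelationFiles:
 * true only for a non-empty name consisting solely of decimal digits.
 */
static bool
is_db_dir_name(const char *name)
{
	int			i = 0;

	while (isdigit((unsigned char) name[i]))
		++i;
	/* reject empty names and names with any trailing non-digit */
	return name[i] == '\0' && i > 0;
}
```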

/* Process one per-dbspace directory for RemovePgTempRelationFiles */
static void
RemovePgTempRelationFilesInDbspace(const char *dbspacedirname)
{
	DIR		   *dbspace_dir;
	struct dirent *de;
	char		rm_path[MAXPGPATH];

	dbspace_dir = AllocateDir(dbspacedirname);
	if (dbspace_dir == NULL)
	{
		/* we just saw this directory, so it really ought to be there */
		elog(LOG,
			 "could not open dbspace directory \"%s\": %m",
			 dbspacedirname);
		return;
	}

	while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
	{
		if (!looks_like_temp_rel_name(de->d_name))
			continue;

		snprintf(rm_path, sizeof(rm_path), "%s/%s",
				 dbspacedirname, de->d_name);

		unlink(rm_path);		/* note we ignore any error */
	}

	FreeDir(dbspace_dir);
}

/* t<digits>_<digits>, optionally followed by _<forkname> and/or .<segment> */
static bool
looks_like_temp_rel_name(const char *name)
{
	int			pos;
	int			savepos;

	/* Must start with "t". */
	if (name[0] != 't')
		return false;

	/* Followed by a non-empty string of digits and then an underscore. */
	for (pos = 1; isdigit((unsigned char) name[pos]); ++pos)
		;
	if (pos == 1 || name[pos] != '_')
		return false;

	/* Followed by another non-empty string of digits. */
	for (savepos = ++pos; isdigit((unsigned char) name[pos]); ++pos)
		;
	if (savepos == pos)
		return false;

	/* We might have _forkname or .segment or both. */
	if (name[pos] == '_')
	{
		int			forkchar = forkname_chars(&name[pos + 1], NULL);

		if (forkchar <= 0)
			return false;
		pos += forkchar + 1;
	}
	if (name[pos] == '.')
	{
		int			segchar;

		for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
			;
		if (segchar <= 1)
			return false;
		pos += segchar;
	}

	/* Now we should be at the end. */
	if (name[pos] != '\0')
		return false;
	return true;
}
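The self-contained sketch below implements the same parser so the accepted name grammar can be tested directly. `forkname_chars` is PostgreSQL-internal, so it is replaced by a hypothetical stand-in, `fork_suffix_len`, which assumes the non-main fork names are "fsm", "vm" and "init". Treat that list as an assumption rather than a statement about any particular server version. `temp_rel_name_ok` is likewise a hypothetical name.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for forkname_chars(); the fork list is assumed. */
static int
fork_suffix_len(const char *str)
{
	static const char *const forks[] = {"fsm", "vm", "init"};

	for (size_t i = 0; i < sizeof(forks) / sizeof(forks[0]); i++)
	{
		size_t		len = strlen(forks[i]);

		/* prefix match: the caller checks what follows the fork name */
		if (strncmp(forks[i], str, len) == 0)
			return (int) len;
	}
	return 0;
}

/* Accepts t<digits>_<digits>, optionally followed by _<fork> and/or .<segment> */
static bool
temp_rel_name_ok(const char *name)
{
	int			pos;
	int			start;

	if (name[0] != 't')
		return false;

	/* first run of digits, then an underscore */
	for (pos = 1; isdigit((unsigned char) name[pos]); ++pos)
		;
	if (pos == 1 || name[pos] != '_')
		return false;

	/* second, non-empty run of digits */
	for (start = ++pos; isdigit((unsigned char) name[pos]); ++pos)
		;
	if (start == pos)
		return false;

	/* optional _<fork> */
	if (name[pos] == '_')
	{
		int			n = fork_suffix_len(&name[pos + 1]);

		if (n <= 0)
			return false;
		pos += n + 1;
	}

	/* optional .<segment>, which needs at least one digit */
	if (name[pos] == '.')
	{
		int			n;

		for (n = 1; isdigit((unsigned char) name[pos + n]); ++n)
			;
		if (n <= 1)
			return false;
		pos += n;
	}

	/* nothing may follow */
	return name[pos] == '\0';
}
```

The final end-of-string check rejects near-misses such as `t3_16384_fsmx`. The fork prefix matches there, but the name does not end after it.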
Fix fsync-at-startup code to not treat errors as fatal.
Commit 2ce439f3379aed857517c8ce207485655000fc8e introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.
After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/. The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.
This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.
I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().
The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.
Back-patch to all active branches, like the prior commit.
Abhijit Menon-Sen and Tom Lane
2015-05-28 23:33:03 +02:00

/*
 * Issue fsync recursively on PGDATA and all its contents.
 *
 * We fsync regular files and directories wherever they are, but we
 * follow symlinks only for pg_xlog and immediately under pg_tblspc.
 * Other symlinks are presumed to point at files we're not responsible
 * for fsyncing, and might not have privileges to write at all.
 *
 * Errors are logged but not considered fatal; that's because this is used
 * only during database startup, to deal with the possibility that there are
 * issued-but-unsynced writes pending against the data directory. We want to
 * ensure that such writes reach disk before anything that's done in the new
 * run. However, aborting on error would result in failure to start for
 * harmless cases such as read-only files in the data directory, and that's
 * not good either.
 *
 * Note we assume we're chdir'd into PGDATA to begin with.
 */
void
SyncDataDirectory(void)
{
	bool		xlog_is_symlink;

	/* We can skip this whole thing if fsync is disabled. */
	if (!enableFsync)
		return;

	/*
	 * If pg_xlog is a symlink, we'll need to recurse into it separately,
	 * because the first walkdir below will ignore it.
	 */
	xlog_is_symlink = false;

#ifndef WIN32
	{
		struct stat st;

		if (lstat("pg_xlog", &st) < 0)
			ereport(LOG,
					(errcode_for_file_access(),
					 errmsg("could not stat file \"%s\": %m",
							"pg_xlog")));
		else if (S_ISLNK(st.st_mode))
			xlog_is_symlink = true;
	}
#else
	if (pgwin32_is_junction("pg_xlog"))
		xlog_is_symlink = true;
#endif

	/*
	 * If possible, hint to the kernel that we're soon going to fsync the data
	 * directory and its contents. Errors in this step are even less
	 * interesting than normal, so log them only at DEBUG1.
	 */
#ifdef PG_FLUSH_DATA_WORKS
	walkdir(".", pre_sync_fname, false, DEBUG1);
	if (xlog_is_symlink)
		walkdir("pg_xlog", pre_sync_fname, false, DEBUG1);
	walkdir("pg_tblspc", pre_sync_fname, true, DEBUG1);
#endif

	/*
	 * Now we do the fsync()s in the same order.
	 *
	 * The main call ignores symlinks, so in addition to specially processing
	 * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
	 * process_symlinks = true. Note that if there are any plain directories
	 * in pg_tblspc, they'll get fsync'd twice. That's not an expected case
	 * so we don't worry about optimizing it.
	 */
	walkdir(".", fsync_fname_ext, false, LOG);
	if (xlog_is_symlink)
		walkdir("pg_xlog", fsync_fname_ext, false, LOG);
	walkdir("pg_tblspc", fsync_fname_ext, true, LOG);
}
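The pg_xlog probe above depends on `lstat()` reporting on the link itself rather than on its target. That is what allows `S_ISLNK()` to detect a symlink. Below is a minimal POSIX sketch of that test. `path_is_symlink` is a hypothetical name, and the WIN32 junction branch is not covered.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>				/* snprintf, used in the usage check */
#include <stdlib.h>				/* mkdtemp, used in the usage check */
#include <sys/stat.h>
#include <unistd.h>				/* symlink, used in the usage check */

/*
 * Hypothetical helper: returns 1 if path is a symlink, 0 if it exists but
 * is not one, and -1 if lstat() fails (errno is left set for the caller).
 */
static int
path_is_symlink(const char *path)
{
	struct stat st;

	/* lstat() does not follow a final symlink, unlike stat() */
	if (lstat(path, &st) < 0)
		return -1;
	return S_ISLNK(st.st_mode) ? 1 : 0;
}
```

Calling `stat()` here instead would follow the link and report the target directory. `xlog_is_symlink` would then never be set.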
|
|
|
|
|
|
|
|
/*
|
|
|
|
* walkdir: recursively walk a directory, applying the action to each
|
Fix fsync-at-startup code to not treat errors as fatal.
Commit 2ce439f3379aed857517c8ce207485655000fc8e introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.
After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/. The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.
This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.
I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().
The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.
Back-patch to all active branches, like the prior commit.
Abhijit Menon-Sen and Tom Lane
2015-05-28 23:33:03 +02:00
|
|
|
* regular file and directory (including the named directory itself).
|
|
|
|
*
|
|
|
|
* If process_symlinks is true, the action and recursion are also applied
|
|
|
|
* to regular files and directories that are pointed to by symlinks in the
|
|
|
|
* given directory; otherwise symlinks are ignored. Symlinks are always
|
|
|
|
* ignored in subdirectories, ie we intentionally don't pass down the
|
|
|
|
* process_symlinks flag to recursive calls.
|
|
|
|
*
|
|
|
|
* Errors are reported at level elevel, which might be ERROR or less.
|
2015-05-04 20:13:53 +02:00
|
|
|
*
|
Fix fsync-at-startup code to not treat errors as fatal.
Commit 2ce439f3379aed857517c8ce207485655000fc8e introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.
After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/. The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.
This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.
I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().
The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.
Back-patch to all active branches, like the prior commit.
Abhijit Menon-Sen and Tom Lane
2015-05-28 23:33:03 +02:00
|
|
|
* See also walkdir in initdb.c, which is a frontend version of this logic.
|
2015-05-04 20:13:53 +02:00
|
|
|
*/
|
Fix fsync-at-startup code to not treat errors as fatal.
Commit 2ce439f3379aed857517c8ce207485655000fc8e introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.
After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/. The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.
This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.
I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().
The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.
Back-patch to all active branches, like the prior commit.
Abhijit Menon-Sen and Tom Lane
2015-05-28 23:33:03 +02:00
|
|
|
static void
|
|
|
|
walkdir(const char *path,
|
|
|
|
void (*action) (const char *fname, bool isdir, int elevel),
|
|
|
|
bool process_symlinks,
|
|
|
|
int elevel)
|
2015-05-04 20:13:53 +02:00
|
|
|
{
|
|
|
|
DIR *dir;
|
|
|
|
struct dirent *de;
|
|
|
|
|
|
|
|
dir = AllocateDir(path);
|
	if (dir == NULL)
	{
		ereport(elevel,
				(errcode_for_file_access(),
				 errmsg("could not open directory \"%s\": %m", path)));
		return;
	}

	while ((de = ReadDirExtended(dir, path, elevel)) != NULL)
	{
		char		subpath[MAXPGPATH];
		struct stat fst;
		int			sret;

		CHECK_FOR_INTERRUPTS();

		if (strcmp(de->d_name, ".") == 0 ||
			strcmp(de->d_name, "..") == 0)
			continue;

		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);

		if (process_symlinks)
			sret = stat(subpath, &fst);
		else
			sret = lstat(subpath, &fst);

		if (sret < 0)
		{
			ereport(elevel,
					(errcode_for_file_access(),
					 errmsg("could not stat file \"%s\": %m", subpath)));
			continue;
		}

		if (S_ISREG(fst.st_mode))
			(*action) (subpath, false, elevel);
		else if (S_ISDIR(fst.st_mode))
			walkdir(subpath, action, false, elevel);
	}

	FreeDir(dir);				/* we ignore any error here */

	/*
	 * It's important to fsync the destination directory itself as individual
	 * file fsyncs don't guarantee that the directory entry for the file is
	 * synced.
	 */
	(*action) (path, true, elevel);
}
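The control flow of walkdir() can be sketched with plain POSIX calls. This is an assumption-laden stand-in rather than PostgreSQL code: opendir()/readdir()/closedir() replace AllocateDir()/ReadDirExtended()/FreeDir(), messages on stderr replace ereport(), and walkdir_sketch(), walk_action, struct walk_stats, count_action() and PATH_MAX_SKETCH are hypothetical names. It keeps the two properties the code above relies on: symlinks are resolved only for entries directly under the starting directory, and each directory is passed to the action after its entries.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>				/* mkdtemp() for setting up example trees */
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>				/* symlink() for setting up example trees */

#define PATH_MAX_SKETCH 1024	/* hypothetical stand-in for MAXPGPATH */

typedef void (*walk_action) (const char *fname, bool isdir, void *arg);

/*
 * Call ACTION on every regular file under PATH and on every directory after
 * its entries, ending with PATH itself.  Entries directly under PATH use
 * stat() if follow_links is true (so symlinks there are resolved), lstat()
 * otherwise; recursion always passes false.  Errors are reported, not fatal.
 */
void
walkdir_sketch(const char *path, walk_action action, bool follow_links,
			   void *arg)
{
	DIR		   *dir;
	struct dirent *de;

	assert(path != NULL && action != NULL);

	dir = opendir(path);
	if (dir == NULL)
	{
		fprintf(stderr, "could not open directory \"%s\": %s\n",
				path, strerror(errno));
		return;
	}

	while ((de = readdir(dir)) != NULL)
	{
		char		subpath[PATH_MAX_SKETCH];
		struct stat st;
		int			rc;

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;

		snprintf(subpath, sizeof(subpath), "%s/%s", path, de->d_name);

		rc = follow_links ? stat(subpath, &st) : lstat(subpath, &st);
		if (rc < 0)
		{
			fprintf(stderr, "could not stat file \"%s\": %s\n",
					subpath, strerror(errno));
			continue;
		}

		if (S_ISREG(st.st_mode))
			action(subpath, false, arg);
		else if (S_ISDIR(st.st_mode))
			walkdir_sketch(subpath, action, false, arg);
		/* anything else, including an unfollowed symlink, is skipped */
	}

	(void) closedir(dir);		/* errors ignored, as in walkdir() */

	/* the directory itself is handled last, after all of its entries */
	action(path, true, arg);
}

/* Example action: count files and directories, remember the last path. */
struct walk_stats
{
	int			nfiles;
	int			ndirs;
	char		last[PATH_MAX_SKETCH];
};

void
count_action(const char *fname, bool isdir, void *arg)
{
	struct walk_stats *stats = arg;

	if (isdir)
		stats->ndirs++;
	else
		stats->nfiles++;
	snprintf(stats->last, sizeof(stats->last), "%s", fname);
}
```

On a tree holding root/a, root/sub/b and a symlink root/sub/link, calling walkdir_sketch(root, count_action, true, &stats) counts two files and two directories: the link sits below the top level, so it is examined with lstat() and skipped, and the last path recorded is root itself.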

/*
 * Hint to the OS that it should get ready to fsync() this file.
 *
 * Ignores errors trying to open unreadable files, and logs other errors at a
 * caller-specified level.
 */
#ifdef PG_FLUSH_DATA_WORKS

static void
pre_sync_fname(const char *fname, bool isdir, int elevel)
{
	int			fd;

	fd = OpenTransientFile((char *) fname, O_RDONLY | PG_BINARY, 0);

	if (fd < 0)
	{
		if (errno == EACCES || (isdir && errno == EISDIR))
			return;
		ereport(elevel,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", fname)));
		return;
	}
2015-05-29 21:11:36 +02:00

	/*
	 * We ignore errors from pg_flush_data() because this is only a hint.
	 */
	(void) pg_flush_data(fd, 0, 0);

	(void) CloseTransientFile(fd);
}

#endif   /* PG_FLUSH_DATA_WORKS */
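pre_sync_fname() only needs to nudge the kernel into starting writeback early, and pg_flush_data() hides the platform-specific mechanism. The sketch below assumes posix_fadvise() with POSIX_FADV_DONTNEED as that mechanism; on Linux this advice also starts writeback of dirty pages in the range. Treat that choice as an assumption for illustration, not a description of what pg_flush_data() does on every platform or release. The name pre_sync_hint() is hypothetical, and it reports errors on stderr instead of through ereport().

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hint that FNAME will soon be fsync'd.  Returns 0 if the hint was issued
 * or the file was skipped (EACCES, or EISDIR for a directory), and -1 if
 * opening failed for any other reason, after reporting it on stderr.
 */
int
pre_sync_hint(const char *fname, bool isdir)
{
	int			fd;

	assert(fname != NULL);

	fd = open(fname, O_RDONLY);
	if (fd < 0)
	{
		if (errno == EACCES || (isdir && errno == EISDIR))
			return 0;
		fprintf(stderr, "could not open file \"%s\": %s\n",
				fname, strerror(errno));
		return -1;
	}

#ifdef POSIX_FADV_DONTNEED
	/*
	 * Offset 0 with length 0 covers the whole file.  The result is ignored
	 * on purpose: this is only a hint.
	 */
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#endif

	(void) close(fd);
	return 0;
}
```

As in pre_sync_fname(), a failed hint is harmless: at worst the later fsync has more work to do, so the advice call's return value is deliberately discarded.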

/*
 * fsync_fname_ext -- Try to fsync a file or directory
 *
 * Ignores errors trying to open unreadable files, or trying to fsync
 * directories on systems where that isn't allowed/required, and logs other
 * errors at a caller-specified level.
 */
static void
fsync_fname_ext(const char *fname, bool isdir, int elevel)
{
	int			fd;
	int			flags;
	int			returncode;

	/*
	 * Some OSs require directories to be opened read-only whereas other
	 * systems don't allow us to fsync files opened read-only; so we need both
	 * cases here. Using O_RDWR will cause us to fail to fsync files that are
	 * not writable by our userid, but we assume that's OK.
	 */
	flags = PG_BINARY;
	if (!isdir)
		flags |= O_RDWR;
	else
		flags |= O_RDONLY;

	/*
	 * Open the file, silently ignoring errors about unreadable files (or
	 * unsupported operations, e.g. opening a directory under Windows), and
	 * logging others.
	 */
	fd = OpenTransientFile((char *) fname, flags, 0);
	if (fd < 0)
	{
		if (errno == EACCES || (isdir && errno == EISDIR))
			return;
		ereport(elevel,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", fname)));
		return;
	}

	returncode = pg_fsync(fd);

	/*
	 * Some OSes don't allow us to fsync directories at all, so we can ignore
	 * those errors. Anything else needs to be logged.
	 */
	if (returncode != 0 && !(isdir && errno == EBADF))
		ereport(elevel,
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\": %m", fname)));

	(void) CloseTransientFile(fd);
}
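The same open-then-fsync policy can be expressed without PostgreSQL's wrappers. In this hypothetical fsync_path(), open() and close() stand in for OpenTransientFile() and CloseTransientFile(), fsync() for pg_fsync(), and stderr for ereport(); the flag choice and the tolerated errno values (EACCES and EISDIR on open, EBADF when fsyncing a directory) follow fsync_fname_ext() above. It is a sketch under those assumptions, not a drop-in replacement.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Files are opened O_RDWR (some systems refuse fsync on read-only fds),
 * directories O_RDONLY (some systems refuse to open them for writing).
 * Returns 0 on success or on a tolerated failure, -1 on a reported error.
 */
int
fsync_path(const char *fname, bool isdir)
{
	int			fd;
	int			rc;
	int			save_errno;

	assert(fname != NULL);

	fd = open(fname, isdir ? O_RDONLY : O_RDWR);
	if (fd < 0)
	{
		/* unreadable files, or directories that can't be opened: skip */
		if (errno == EACCES || (isdir && errno == EISDIR))
			return 0;
		fprintf(stderr, "could not open file \"%s\": %s\n",
				fname, strerror(errno));
		return -1;
	}

	rc = fsync(fd);
	save_errno = errno;			/* close() below may overwrite errno */
	(void) close(fd);

	/* some systems can't fsync a directory at all; tolerate that */
	if (rc != 0 && !(isdir && save_errno == EBADF))
	{
		fprintf(stderr, "could not fsync file \"%s\": %s\n",
				fname, strerror(save_errno));
		return -1;
	}
	return 0;
}
```

One design difference: the sketch saves errno before calling close(), because close() could otherwise overwrite the value the report needs. fsync_fname_ext() sidesteps the issue by reporting before it calls CloseTransientFile().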