postgresql/doc/TODO.detail/mmap

485 lines
22 KiB
Plaintext

From pgsql-hackers-owner+M5149@postgresql.org Mon Feb 26 03:32:49 2001
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA04497
for <pgman@candle.pha.pa.us>; Mon, 26 Feb 2001 03:32:48 -0500 (EST)
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f1Q8TSx48319;
Mon, 26 Feb 2001 03:29:28 -0500 (EST)
(envelope-from pgsql-hackers-owner+M5149@postgresql.org)
Received: from store.d.zembu.com (nat.zembu.com [209.128.96.253])
by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f1Q8LPx47243
for <pgsql-hackers@postgreSQL.org>; Mon, 26 Feb 2001 03:21:25 -0500 (EST)
(envelope-from ncm@zembu.com)
Received: by store.d.zembu.com (Postfix, from userid 509)
id 58E39A782; Mon, 26 Feb 2001 00:21:25 -0800 (PST)
Date: Mon, 26 Feb 2001 00:21:25 -0800
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Re: [PATCHES] A patch for xlog.c
Message-ID: <20010226002125.A2430@store.zembu.com>
Reply-To: pgsql-hackers@postgresql.org
References: <200102260200.VAA17397@candle.pha.pa.us> <22318.983161726@sss.pgh.pa.us>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <22318.983161726@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Sun, Feb 25, 2001 at 11:28:46PM -0500
From: ncm@zembu.com (Nathan Myers)
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: ORr
On Sun, Feb 25, 2001 at 11:28:46PM -0500, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > It allows no backing store on disk.
I.e. it allows you to map memory without an associated inode; the memory
may still be swapped. Of course, there is no problem with mapping an
inode too, so that unrelated processes can join in. Solarix has a flag
to pin the shared pages in RAM so they can't be swapped out.
> > It is the BSD solution to SysV
> > share memory. Here are all the BSDi flags:
>
> > MAP_ANON Map anonymous memory not associated with any specific
> > file. The file descriptor used for creating MAP_ANON
> > must be -1. The offset parameter is ignored.
>
> Hmm. Now that I read down to the "nonstandard extensions" part of the
> HPUX man page for mmap(), I find
>
> If MAP_ANONYMOUS is set in flags:
>
> o A new memory region is created and initialized to all zeros.
> This memory region can be shared only with descendants of
> the current process.
This is supported on Linux and BSD, but not on Solarix 7. It's not
necessary; you can just map /dev/zero on SysV systems that don't
have MAP_ANON.
> While I've said before that I don't think it's really necessary for
> processes that aren't children of the postmaster to access the shared
> memory, I'm not sure that I want to go over to a mechanism that makes it
> *impossible* for that to be done. Especially not if the only motivation
> is to avoid having to configure the kernel's shared memory settings.
There are enormous advantages to avoiding the need to configure kernel
settings. It makes PG a better citizen. PG is much easier to drop in
and use if you don't need attention from the IT department.
But I don't know of any reason to avoid mapping an actual inode,
so using mmap doesn't necessarily mean giving up sharing among
unrelated processes.
> Besides, what makes you think there's not a limit on the size of shmem
> allocatable via mmap()?
I've never seen any mmap limit documented. Since mmap() is how
everybody implements shared libraries, such a limit would be equivalent
to a limit on how much/many shared libraries are used. mmap() with
MAP_ANONYMOUS (or its SysV /dev/zero equivalent) is a common, modern
way to get raw storage for malloc(), so such a limit would be a limit
on malloc() too.
The mmap architecture comes to us from the Mach microkernel memory
manager, backported into BSD and then copied widely. Since it was
the fundamental mechanism for all memory operations in Mach, arbitrary
limits would make no sense. That it worked so well is the reason it
was copied everywhere else, so adding arbitrary limits while copying
it would be silly. I don't think we'll see any systems like that.
Nathan Myers
ncm@zembu.com
From pgsql-hackers-owner+M6138@postgresql.org Mon Mar 19 07:57:59 2001
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id HAA26926
for <pgman@candle.pha.pa.us>; Mon, 19 Mar 2001 07:57:59 -0500 (EST)
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f2JCug641835;
Mon, 19 Mar 2001 07:56:42 -0500 (EST)
(envelope-from pgsql-hackers-owner+M6138@postgresql.org)
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f2JCt7641684
for <pgsql-hackers@postgresql.org>; Mon, 19 Mar 2001 07:55:07 -0500 (EST)
(envelope-from bright@fw.wintelcom.net)
Received: (from bright@localhost)
by fw.wintelcom.net (8.10.0/8.10.0) id f2JCt2325289;
Mon, 19 Mar 2001 04:55:02 -0800 (PST)
Date: Mon, 19 Mar 2001 04:55:01 -0800
From: Alfred Perlstein <bright@wintelcom.net>
To: Rod Taylor <rod.taylor@inquent.com>
Cc: Hackers List <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Fw: [vorbis-dev] ogg123: shared memory by mmap()
Message-ID: <20010319045500.T29888@fw.wintelcom.net>
References: <018301c0b070$16049a40$2205010a@jester>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <018301c0b070$16049a40$2205010a@jester>; from rod.taylor@inquent.com on Mon, Mar 19, 2001 at 07:28:21AM -0500
X-all-your-base: are belong to us.
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: ORr
WOOT WOOT! DANGER WILL ROBINSON!
> ----- Original Message -----
> From: "Christian Weisgerber" <naddy@mips.inka.de>
> Newsgroups: list.vorbis.dev
> To: <vorbis-dev@xiph.org>
> Sent: Saturday, March 17, 2001 12:01 PM
> Subject: [vorbis-dev] ogg123: shared memory by mmap()
>
>
> > The patch below adds:
> >
> > - acinclude.m4: A new macro A_FUNC_SMMAP to check that sharing
> pages
> > through mmap() works. This is taken from Joerg Schilling's star.
> > - configure.in: A_FUNC_SMMAP
> > - ogg123/buffer.c: If we have a working mmap(), use it to create
> > a region of shared memory instead of using System V IPC.
> >
> > Works on BSD. Should also work on SVR4 and offspring (Solaris),
> > and Linux.
This is a really bad idea performance wise. Solaris has a special
code path for SYSV shared memory that doesn't require tons of swap
tracking structures per-page/per-process. FreeBSD also has this
optimization (it's off by default, but should work since FreeBSD
4.2 via the sysctl kern.ipc.shm_use_phys=1)
Both OS's use a trick of making the pages non-pageable, this allows
signifigant savings in kernel space required for each attached
process, as well as the use of large pages which reduce the amount
of TLB faults your processes will incurr.
Anyhow, if you could make this a runtime option it wouldn't be so
evil, but as a compile time option, it's a really bad idea for
Solaris and FreeBSD.
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
From pgsql-hackers-owner+M6255@postgresql.org Tue Mar 20 18:46:33 2001
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA02887
for <pgman@candle.pha.pa.us>; Tue, 20 Mar 2001 18:46:33 -0500 (EST)
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
by mail.postgresql.org (8.11.3/8.11.1) with SMTP id f2KNjtH22390;
Tue, 20 Mar 2001 18:45:55 -0500 (EST)
(envelope-from pgsql-hackers-owner+M6255@postgresql.org)
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
by mail.postgresql.org (8.11.3/8.11.1) with ESMTP id f2KNiFH22033
for <pgsql-hackers@postgresql.org>; Tue, 20 Mar 2001 18:44:15 -0500 (EST)
(envelope-from bright@fw.wintelcom.net)
Received: (from bright@localhost)
by fw.wintelcom.net (8.10.0/8.10.0) id f2KNiAW02417;
Tue, 20 Mar 2001 15:44:10 -0800 (PST)
Date: Tue, 20 Mar 2001 15:44:10 -0800
From: Alfred Perlstein <bright@wintelcom.net>
To: Bruce Momjian <pgman@candle.pha.pa.us>
Cc: Rod Taylor <rod.taylor@inquent.com>,
Hackers List <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Fw: [vorbis-dev] ogg123: shared memory by mmap()
Message-ID: <20010320154410.H29888@fw.wintelcom.net>
References: <20010319045500.T29888@fw.wintelcom.net> <200103202210.RAA23981@candle.pha.pa.us>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <200103202210.RAA23981@candle.pha.pa.us>; from pgman@candle.pha.pa.us on Tue, Mar 20, 2001 at 05:10:33PM -0500
X-all-your-base: are belong to us.
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: OR
* Bruce Momjian <pgman@candle.pha.pa.us> [010320 14:10] wrote:
> > > > The patch below adds:
> > > >
> > > > - acinclude.m4: A new macro A_FUNC_SMMAP to check that sharing
> > > pages
> > > > through mmap() works. This is taken from Joerg Schilling's star.
> > > > - configure.in: A_FUNC_SMMAP
> > > > - ogg123/buffer.c: If we have a working mmap(), use it to create
> > > > a region of shared memory instead of using System V IPC.
> > > >
> > > > Works on BSD. Should also work on SVR4 and offspring (Solaris),
> > > > and Linux.
> >
> > This is a really bad idea performance wise. Solaris has a special
> > code path for SYSV shared memory that doesn't require tons of swap
> > tracking structures per-page/per-process. FreeBSD also has this
> > optimization (it's off by default, but should work since FreeBSD
> > 4.2 via the sysctl kern.ipc.shm_use_phys=1)
>
> >
> > Both OS's use a trick of making the pages non-pageable, this allows
> > signifigant savings in kernel space required for each attached
> > process, as well as the use of large pages which reduce the amount
> > of TLB faults your processes will incurr.
>
> That is interesting. BSDi has SysV shared memory as non-pagable, and I
> always thought of that as a bug. Seems you are saying that having it
> pagable has a significant performance penalty. Interesting.
Yes, having it pageable is actually sort of bad.
It doesn't allow you to do several important optimizations.
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
From pgsql-general-owner+M14300@postgresql.org Mon Aug 27 13:07:32 2001
Return-path: <pgsql-general-owner+M14300@postgresql.org>
Received: from server1.pgsql.org (server1.pgsql.org [64.39.15.238])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id f7RH7VF04800
for <pgman@candle.pha.pa.us>; Mon, 27 Aug 2001 13:07:31 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
by server1.pgsql.org (8.11.6/8.11.6) with ESMTP id f7RH7Tq17721;
Mon, 27 Aug 2001 12:07:29 -0500 (CDT)
(envelope-from pgsql-general-owner+M14300@postgresql.org)
Received: from svana.org (svana.org [210.9.66.30])
by postgresql.org (8.11.3/8.11.4) with ESMTP id f7RFE1f13269
for <pgsql-general@postgresql.org>; Mon, 27 Aug 2001 11:14:01 -0400 (EDT)
(envelope-from kleptog@svana.org)
Received: from kleptog by svana.org with local (Exim 3.12 #1 (Debian))
id 15bO5x-0000Fd-00; Tue, 28 Aug 2001 01:14:33 +1000
Date: Tue, 28 Aug 2001 01:14:33 +1000
From: Martijn van Oosterhout <kleptog@svana.org>
To: Andrew Snow <andrew@modulus.org>
cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] raw partition
Message-ID: <20010828011433.E32309@svana.org>
Reply-To: Martijn van Oosterhout <kleptog@svana.org>
References: <20010827233815.B32309@svana.org> <000101c12f00$dc5814b0$fa01b5ca@avon>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <000101c12f00$dc5814b0$fa01b5ca@avon>; from andrew@modulus.org on Tue, Aug 28, 2001 at 12:02:08AM +1000
Precedence: bulk
Sender: pgsql-general-owner@postgresql.org
Status: OR
On Tue, Aug 28, 2001 at 12:02:08AM +1000, Andrew Snow wrote:
>
> What I think would be better would be moving postgresql to a system of
> using memory-mapped I/O. instead of the shared buffer cache, files
> would be directly memory-mapped and the OS would do the caching. I
> can't see this happening though because of platform dependancy, but I
> think its worth another look soon because many unix platforms support
> mmap(). I think it would improve the performance of disk-intensive
> tasks noticeably.
Well, this has other problems. Consider tables that are larger than your
system memory. You'd have to continuously map and unmap different sections.
That can have odd side effects (witness mozilla on linux having 15,000
mapped areas or so...)
You would still however get the advantage that you wouldn't have to copy the
data from the disk buffers to user space, you simply get the disk buffer
mapped into your address space.
I think that for commonly used tables that are under 100K in size (most of
the system tables), this is quite a workable idea. If you don't mind keeping
them mapped the whole time.
--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/
> It would be nice if someone came up with a certification system that
> actually separated those who can barely regurgitate what they crammed over
> the last few weeks from those who command secret ninja networking powers.
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly
From pgsql-general-owner+M14319@postgresql.org Mon Aug 27 16:57:10 2001
Return-path: <pgsql-general-owner+M14319@postgresql.org>
Received: from server1.pgsql.org (server1.pgsql.org [64.39.15.238])
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id f7RKv9F16849
for <pgman@candle.pha.pa.us>; Mon, 27 Aug 2001 16:57:09 -0400 (EDT)
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
by server1.pgsql.org (8.11.6/8.11.6) with ESMTP id f7RKv9q31456;
Mon, 27 Aug 2001 15:57:09 -0500 (CDT)
(envelope-from pgsql-general-owner+M14319@postgresql.org)
Received: from sss.pgh.pa.us ([192.204.191.242])
by postgresql.org (8.11.3/8.11.4) with ESMTP id f7RJrsf55472
for <pgsql-general@postgresql.org>; Mon, 27 Aug 2001 15:53:54 -0400 (EDT)
(envelope-from tgl@sss.pgh.pa.us)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id f7RJrGK19431;
Mon, 27 Aug 2001 15:53:16 -0400 (EDT)
To: Martijn van Oosterhout <kleptog@svana.org>
cc: Andrew Snow <andrew@modulus.org>, pgsql-general@postgresql.org
Subject: Re: [GENERAL] raw partition
In-Reply-To: <20010828011433.E32309@svana.org>
References: <20010827233815.B32309@svana.org> <000101c12f00$dc5814b0$fa01b5ca@avon> <20010828011433.E32309@svana.org>
Comments: In-reply-to Martijn van Oosterhout <kleptog@svana.org>
message dated "Tue, 28 Aug 2001 01:14:33 +1000"
Date: Mon, 27 Aug 2001 15:53:15 -0400
Message-ID: <19428.998941995@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-general-owner@postgresql.org
Status: OR
Martijn van Oosterhout <kleptog@svana.org> writes:
> You would still however get the advantage that you wouldn't have to copy the
> data from the disk buffers to user space, you simply get the disk buffer
> mapped into your address space.
AFAICS this would be the *only* advantage. While it's not negligible,
it's quite unclear that it's worth the bookkeeping and portability
headaches of managing lots of mmap'd areas, either.
Before I take this idea seriously at all, I'd want to see a design that
addresses a couple of critical issues:
1. Postgres' shared buffers are *shared*, potentially across many
processes. How will you deal with buffers for files that have been
mmap'd by only some of the processes? (Maybe this means that the
whole concept of shared buffers goes away, and each process does its
own buffer management based on its own mmaps. Not sure. That would be
a pretty radical restructuring though, and would completely invalidate
our present approach to page-level locking.)
2. How do you deal with extending a file? My system's mmap man page
says
If the size of the mapped file changes after the call to mmap(), the
effect of references to portions of the mapped region that correspond
to added or removed portions of the file is unspecified.
This suggests that the only portable way to cope is to issue a separate
mmap for every disk page. Will typical Unix systems perform well with
umpteen thousand small mmap requests?
3. How do you persuade the other backends to drop their mmaps of a table
you are deleting?
There are probably other gotchas, but without an understanding of how
to address these, I doubt it's worth looking further ...
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/users-lounge/docs/faq.html
From pgsql-hackers-owner+M13750=candle.pha.pa.us=pgman@postgresql.org Mon Oct 1 05:59:15 2001
Return-path: <pgsql-hackers-owner+M13750=candle.pha.pa.us=pgman@postgresql.org>
Received: from server1.pgsql.org (server1.pgsql.org [64.39.15.238] (may be forged))
by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id f919xF512590
for <pgman@candle.pha.pa.us>; Mon, 1 Oct 2001 05:59:15 -0400 (EDT)
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
by server1.pgsql.org (8.11.6/8.11.6) with ESMTP id f919xA207817
for <pgman@candle.pha.pa.us>; Mon, 1 Oct 2001 04:59:10 -0500 (CDT)
(envelope-from pgsql-hackers-owner+M13750=candle.pha.pa.us=pgman@postgresql.org)
Received: from mrsgntmail01.mediaring.com.sg (mserver.mediaring.com.sg [203.208.141.175])
by postgresql.org (8.11.3/8.11.4) with ESMTP id f919rE320926
for <pgsql-hackers@postgreSQL.org>; Mon, 1 Oct 2001 05:53:15 -0400 (EDT)
(envelope-from jana-reddy@mediaring.com.sg)
Received: by MRSGNTMAIL01 with Internet Mail Service (5.5.2650.21)
id <PMTCM7SJ>; Mon, 1 Oct 2001 18:03:34 +0800
Received: from mediaring.com.sg (10.1.0.131 [10.1.0.131]) by mrsgntmail01.mediaring.com.sg with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21)
id PMTCM7SH; Mon, 1 Oct 2001 18:03:25 +0800
From: Janardhana Reddy <jana-reddy@mediaring.com.sg>
To: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>,
janareddy
<jana-reddy@mediaring.com.sg>
Message-ID: <3BB83DF0.8946973@mediaring.com.sg>
Date: Mon, 01 Oct 2001 17:57:04 +0800
X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.4.0 i686)
X-Accept-Language: en
MIME-Version: 1.0
Subject: Re: [HACKERS] PERFORMANCE IMPROVEMENT by mapping WAL FILES
References: <200109282137.f8SLbpm01890@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
Status: ORr
I have just completed the functional testing the WAL using mmap , it is
working fine, I have tested by commenting out the "CreateCheckPoint "
functionality so that
when i kill the postgres and restart it will redo all the records from the
WAL log file which
is updated using mmap.
Just i need to clean code and to do some stress testing.
By the end of this week i should able to complete the stress test and
generate the patch file .
As Tom Lane mentioned i see the problem in portability to all platforms,
what i propose is to use mmap for only WAL for some platforms like
linux,freebsd etc . For other platforms we can use the existing method by
slightly modifying the
write() routine to write only the modified part of the page.
Regards
jana
>
>
> OK, I have talked to Tom Lane about this on the phone and we have a few
> ideas.
>
> Historically, we have avoided mmap() because of portability problems,
> and because using mmap() to write to large tables could consume lots of
> address space with little benefit. However, I perhaps can see WAL as
> being a good use of mmap.
>
> First, there is the issue of using mmap(). For OS's that have the
> mmap() MAP_SHARED flag, different backends could mmap the same file and
> each see the changes. However, keep in mind we still have to fsync()
> WAL, so we need to use msync().
>
> So, looking at the benefits of using mmap(), we have overhead of
> different backends having to mmap something that now sits quite easily
> in shared memory. Now, I can see mmap reducing the copy from user to
> kernel, but there are other ways to fix that. We could modify the
> write() routines to write() 8k on first WAL page write and later write
> only the modified part of the page to the kernel buffers. The old
> kernel buffer is probably still around so it is unlikely to require a
> read from the file system to read in the rest of the page. This reduces
> the write from 8k to something probably less than 4k which is better
> than we can do with mmap.
>
> I will add a TODO item to this effect.
>
> As far as reducing the write to disk from 8k to 4k, if we have to
> fsync/msync, we have to wait for the disk to spin to the proper location
> and at that point writing 4k or 8k doesn't seem like much of a win.
>
> In summary, I think it would be nice to reduce the 8k transfer from user
> to kernel on secondary page writes to only the modified part of the
> page. I am uncertain if mmap() or anything else will help the physical
> write to the disk.
>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman@candle.pha.pa.us | (610) 853-3000
> + If your life is a hard drive, | 830 Blythe Avenue
> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
http://archives.postgresql.org