From 8da308036d341b2a98b529e4b8e43dc850b1a5d7 Mon Sep 17 00:00:00 2001 From: Bruce Momjian Date: Thu, 2 Mar 2006 19:20:44 +0000 Subject: [PATCH] Update TODO.detail/qsort. --- doc/TODO.detail/qsort | 406 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 406 insertions(+) diff --git a/doc/TODO.detail/qsort b/doc/TODO.detail/qsort index 4284c73941..a03fa90aa0 100644 --- a/doc/TODO.detail/qsort +++ b/doc/TODO.detail/qsort @@ -582,3 +582,409 @@ broadcast)--------------------------- ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster +From kleptog@svana.org Mon Dec 19 06:37:51 2005 +Return-path: +Received: from svana.org (mail@svana.org [203.20.62.76]) + by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id jBJBboe20936 + for ; Mon, 19 Dec 2005 06:37:51 -0500 (EST) +Received: from kleptog by svana.org with local (Exim 3.35 #1 (Debian)) + id 1EoJKc-00045V-00; Mon, 19 Dec 2005 22:37:30 +1100 +Date: Mon, 19 Dec 2005 12:37:30 +0100 +From: Martijn van Oosterhout +To: Dann Corbit +cc: Tom Lane , Qingqing Zhou , + Bruce Momjian , + Luke Lonergan , Neil Conway , + pgsql-hackers@postgresql.org +Subject: Re: [HACKERS] Re: Which qsort is used +Message-ID: <20051219113724.GD12251@svana.org> +Reply-To: Martijn van Oosterhout +References: +MIME-Version: 1.0 +Content-Type: multipart/signed; micalg=pgp-sha1; + protocol="application/pgp-signature"; boundary="5gxpn/Q6ypwruk0T" +Content-Disposition: inline +In-Reply-To: +User-Agent: Mutt/1.3.28i +X-PGP-Key-ID: Length=1024; ID=0x0DC67BE6 +X-PGP-Key-Fingerprint: 295F A899 A81A 156D B522 48A7 6394 F08A 0DC6 7BE6 +X-PGP-Key-URL: +Status: OR + + +--5gxpn/Q6ypwruk0T +Content-Type: text/plain; charset=us-ascii +Content-Disposition: inline +Content-Transfer-Encoding: quoted-printable + +On Fri, Dec 16, 2005 at 10:43:58PM -0800, Dann Corbit wrote: +> I am actually quite impressed with the excellence of Bentley's sort out +> of the box. 
It's definitely the best library implementation of a sort I +> have seen. + +I'm not sure whether we have a conclusion here, but I do have one +question: is there a significant difference in the number of times the +comparison routines are called? Comparisons in PostgreSQL are fairly +expensive given the fmgr overhead and when comparing tuples it's even +worse. + +We don't want to accidentally pick a routine that saves data shuffling by +adding extra comparisons. The stats at [1] don't say. They try to +factor in CPU cost but they seem to use unrealistically small values. I +would think a number around 50 (or higher) would be more +representative. + +[1] http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html + +Have a nice day, +--=20 +Martijn van Oosterhout http://svana.org/kleptog/ +> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a +> tool for doing 5% of the work and then sitting around waiting for someone +> else to do the other 95% so you can sue them. + +--5gxpn/Q6ypwruk0T +Content-Type: application/pgp-signature +Content-Disposition: inline + +-----BEGIN PGP SIGNATURE----- +Version: GnuPG v1.0.6 (GNU/Linux) +Comment: For info see http://www.gnupg.org + +iD8DBQFDpptzIB7bNG8LQkwRAmC6AJ4qYrIm3SYnBV3BybSmm+Gl4vpEywCfRDxg +bnIK4INRqOVFNBAKR/gDPcM= +=92qA +-----END PGP SIGNATURE----- + +--5gxpn/Q6ypwruk0T-- + +From mkoi-pg@aon.at Wed Dec 21 19:44:03 2005 +Return-path: +Received: from email.aon.at (warsl404pip5.highway.telekom.at [195.3.96.77]) + by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id jBM0i2e05649 + for ; Wed, 21 Dec 2005 19:44:02 -0500 (EST) +Received: (qmail 12703 invoked from network); 22 Dec 2005 00:43:51 -0000 +Received: from m148p015.dipool.highway.telekom.at (HELO Sokrates) ([62.46.8.111]) + (envelope-sender ) + by smarthub78.highway.telekom.at (qmail-ldap-1.03) with SMTP + for ; 22 Dec 2005 00:43:51 -0000 +From: Manfred Koizar +To: Tom Lane +cc: "Dann Corbit" , "Qingqing Zhou" , + "Bruce Momjian" , + "Luke Lonergan" , +
"Neil Conway" , pgsql-hackers@postgresql.org +Subject: Re: [HACKERS] Re: Which qsort is used +Date: Thu, 22 Dec 2005 01:43:34 +0100 +Message-ID: +References: <3148.1134795805@sss.pgh.pa.us> +In-Reply-To: <3148.1134795805@sss.pgh.pa.us> +X-Mailer: Forte Agent 3.1/32.783 +MIME-Version: 1.0 +Content-Type: text/plain; charset=us-ascii +Content-Transfer-Encoding: 7bit +Status: OR + +On Sat, 17 Dec 2005 00:03:25 -0500, Tom Lane +wrote: +>I've still got a problem with these checks; I think they are a net +>waste of cycles on average. [...] +> and when they fail, those cycles are entirely wasted; +>you have not advanced the state of the sort at all. + +How can we make the initial check "advance the state of the sort"? +One answer might be to exclude the sorted sequence at the start of the +array from the qsort, and merge the two sorted lists as the final +stage of the sort. + +Qsorting N elements costs O(N*lnN), so excluding H elements from the +sort reduces the cost by at least O(H*lnN). The merge step costs O(N) +plus some (<=50%) more memory, unless someone knows a fast in-place +merge. So depending on the constant factors involved there might be a +usable solution. 
+ +I've been playing with some numbers and assuming the constant factors +to be equal for all the O()'s this method starts to pay off at + H for N + 20 100 + 130 1000 + 8000 100000 +Servus + Manfred + +From pgsql-hackers-owner+M77795=pgman=candle.pha.pa.us@postgresql.org Thu Dec 22 02:02:28 2005 +Return-path: +Received: from ams.hub.org (ams.hub.org [200.46.204.13]) + by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id jBM72Re16910 + for ; Thu, 22 Dec 2005 02:02:28 -0500 (EST) +Received: from postgresql.org (postgresql.org [200.46.204.71]) + by ams.hub.org (Postfix) with ESMTP id A31E067AAA0 + for ; Thu, 22 Dec 2005 03:02:22 -0400 (AST) +X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org +Received: from localhost (av.hub.org [200.46.204.144]) + by postgresql.org (Postfix) with ESMTP id 2C8EC9DCA92 + for ; Thu, 22 Dec 2005 03:01:56 -0400 (AST) +Received: from postgresql.org ([200.46.204.71]) + by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024) + with ESMTP id 26033-04 + for ; + Thu, 22 Dec 2005 03:01:55 -0400 (AST) +X-Greylist: from auto-whitelisted by SQLgrey- +Received: from svana.org (svana.org [203.20.62.76]) + by postgresql.org (Postfix) with ESMTP id 800859DC81D + for ; Thu, 22 Dec 2005 03:01:51 -0400 (AST) +Received: from kleptog by svana.org with local (Exim 3.35 #1 (Debian)) + id 1EpKRg-0005ox-00; Thu, 22 Dec 2005 18:01:00 +1100 +Date: Thu, 22 Dec 2005 08:01:00 +0100 +From: Martijn van Oosterhout +To: Manfred Koizar +cc: Tom Lane , Dann Corbit , + Qingqing Zhou , + Bruce Momjian , + Luke Lonergan , Neil Conway , + pgsql-hackers@postgresql.org +Subject: Re: [HACKERS] Re: Which qsort is used +Message-ID: <20051222070057.GA21783@svana.org> +Reply-To: Martijn van Oosterhout +References: <3148.1134795805@sss.pgh.pa.us> +MIME-Version: 1.0 +Content-Type: multipart/signed; micalg=pgp-sha1; + protocol="application/pgp-signature"; boundary="FL5UXtIhxfXey3p5" +Content-Disposition: inline +In-Reply-To: +User-Agent: Mutt/1.3.28i 
+X-PGP-Key-ID: Length=1024; ID=0x0DC67BE6 +X-PGP-Key-Fingerprint: 295F A899 A81A 156D B522 48A7 6394 F08A 0DC6 7BE6 +X-PGP-Key-URL: +X-Virus-Scanned: by amavisd-new at hub.org +X-Spam-Status: No, score=0.065 required=5 tests=[AWL=0.065] +X-Spam-Score: 0.065 +X-Mailing-List: pgsql-hackers +List-Archive: +List-Help: +List-Id: +List-Owner: +List-Post: +List-Subscribe: +List-Unsubscribe: +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +Status: OR + + +--FL5UXtIhxfXey3p5 +Content-Type: text/plain; charset=us-ascii +Content-Disposition: inline +Content-Transfer-Encoding: quoted-printable + +On Thu, Dec 22, 2005 at 01:43:34AM +0100, Manfred Koizar wrote: +> Qsorting N elements costs O(N*lnN), so excluding H elements from the +> sort reduces the cost by at least O(H*lnN). The merge step costs O(N) +> plus some (<=3D50%) more memory, unless someone knows a fast in-place +> merge. So depending on the constant factors involved there might be a +> usable solution. + +But where are you including the cost to check how many cells are +already sorted? That would be O(H), right? This is where we come back +to the issue that comparisons in PostgreSQL are expensive. The cpu_cost +in the tests I saw so far is unrealistically low. + +> I've been playing with some numbers and assuming the constant factors +> to be equal for all the O()'s this method starts to pay off at +> H for N +> 20 100 20% +> 130 1000 13% +> 8000 100000 8% + +Hmm, what are the chances you have 100000 unordered items to sort and +that the first 8% will already be in order. ISTM that that probability +will be close enough to zero to not matter... + +Have a nice day, +--=20 +Martijn van Oosterhout http://svana.org/kleptog/ +> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a +> tool for doing 5% of the work and then sitting around waiting for someone +> else to do the other 95% so you can sue them. 
+ +--FL5UXtIhxfXey3p5 +Content-Type: application/pgp-signature +Content-Disposition: inline + +-----BEGIN PGP SIGNATURE----- +Version: GnuPG v1.0.6 (GNU/Linux) +Comment: For info see http://www.gnupg.org + +iD8DBQFDqk8oIB7bNG8LQkwRAjJhAJ47eXRi1DJ02cfKcnN2iPkaBB0eaQCeIiF+ +HOAYIPQrU2gpUUiGT3aGUUw= +=R0hU +-----END PGP SIGNATURE----- + +--FL5UXtIhxfXey3p5-- + +From pgsql-hackers-owner+M77831=pgman=candle.pha.pa.us@postgresql.org Thu Dec 22 16:59:19 2005 +Return-path: +Received: from ams.hub.org (ams.hub.org [200.46.204.13]) + by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id jBMLxJe07480 + for ; Thu, 22 Dec 2005 16:59:19 -0500 (EST) +Received: from postgresql.org (postgresql.org [200.46.204.71]) + by ams.hub.org (Postfix) with ESMTP id D1DBE67AC1B + for ; Thu, 22 Dec 2005 17:59:16 -0400 (AST) +X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org +Received: from localhost (av.hub.org [200.46.204.144]) + by postgresql.org (Postfix) with ESMTP id BE8249DCBEB + for ; Thu, 22 Dec 2005 17:58:53 -0400 (AST) +Received: from postgresql.org ([200.46.204.71]) + by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024) + with ESMTP id 64765-01 + for ; + Thu, 22 Dec 2005 17:58:54 -0400 (AST) +X-Greylist: from auto-whitelisted by SQLgrey- +Received: from email.aon.at (warsl404pip7.highway.telekom.at [195.3.96.91]) + by postgresql.org (Postfix) with ESMTP id 3E08E9DCA5C + for ; Thu, 22 Dec 2005 17:58:49 -0400 (AST) +Received: (qmail 6986 invoked from network); 22 Dec 2005 21:58:49 -0000 +Received: from m150p015.dipool.highway.telekom.at (HELO Sokrates) ([62.46.8.175]) + (envelope-sender ) + by smarthub76.highway.telekom.at (qmail-ldap-1.03) with SMTP + for ; 22 Dec 2005 21:58:49 -0000 +From: Manfred Koizar +To: Martijn van Oosterhout +cc: Tom Lane , Dann Corbit , + Qingqing Zhou , + Bruce Momjian , + Luke Lonergan , Neil Conway , + pgsql-hackers@postgresql.org +Subject: Re: [HACKERS] Re: Which qsort is used +Date: Thu, 22 Dec 2005 22:58:31 +0100 
+Message-ID: <4r6mq19fe6937mu9130h45ip3oeg135qo3@4ax.com> +References: <3148.1134795805@sss.pgh.pa.us> <20051222070057.GA21783@svana.org> +In-Reply-To: <20051222070057.GA21783@svana.org> +X-Mailer: Forte Agent 3.1/32.783 +MIME-Version: 1.0 +Content-Type: text/plain; charset=us-ascii +Content-Transfer-Encoding: 7bit +X-Virus-Scanned: by amavisd-new at hub.org +X-Spam-Status: No, score=0.398 required=5 tests=[AWL=0.398] +X-Spam-Score: 0.398 +X-Mailing-List: pgsql-hackers +List-Archive: +List-Help: +List-Id: +List-Owner: +List-Post: +List-Subscribe: +List-Unsubscribe: +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +Status: OR + +On Thu, 22 Dec 2005 08:01:00 +0100, Martijn van Oosterhout + wrote: +>But where are you including the cost to check how many cells are +>already sorted? That would be O(H), right? + +Yes. I didn't mention it, because H < N. + +> This is where we come back +>to the issue that comparisons in PostgreSQL are expensive. + +So we agree that we should try to reduce the number of comparisons. +How many comparisons does it take to sort 100000 items? 1.5 million? + +>Hmm, what are the chances you have 100000 unordered items to sort and +>that the first 8% will already be in order. ISTM that that probability +>will be close enough to zero to not matter... + +If the items are totally unordered, the check is so cheap you won't +even notice. OTOH in Tom's example ... + +|What I think is much more probable in the Postgres environment +|is almost-but-not-quite-ordered inputs --- eg, a table that was +|perfectly ordered by key when filled, but some of the tuples have since +|been moved by UPDATEs. + +... I'd not be surprised if H is 90% of N. 
+Servus + Manfred + +---------------------------(end of broadcast)--------------------------- +TIP 2: Don't 'kill -9' the postmaster + +From DCorbit@connx.com Thu Dec 22 17:22:03 2005 +Return-path: +Received: from postal.corporate.connx.com (postal.corporate.connx.com [65.212.159.187]) + by candle.pha.pa.us (8.11.6/8.11.6) with SMTP id jBMMLve11671 + for ; Thu, 22 Dec 2005 17:22:03 -0500 (EST) +Content-class: urn:content-classes:message +MIME-Version: 1.0 +Content-Type: text/plain; + charset="us-ascii" +Subject: RE: [HACKERS] Re: Which qsort is used +X-MimeOLE: Produced By Microsoft Exchange V6.5 +Date: Thu, 22 Dec 2005 14:21:49 -0800 +Message-ID: +Thread-Topic: [HACKERS] Re: Which qsort is used +Thread-Index: AcYHQuXJdKs8JVgmSKywUqld6KYccQAAfWAA +From: "Dann Corbit" +To: "Manfred Koizar" , + "Martijn van Oosterhout" +cc: "Tom Lane" , "Qingqing Zhou" , + "Bruce Momjian" , + "Luke Lonergan" , + "Neil Conway" , +Content-Transfer-Encoding: 8bit +X-MIME-Autoconverted: from quoted-printable to 8bit by candle.pha.pa.us id jBMMLve11671 +Status: OR + +An interesting article on sorting and comparison count: +http://www.acm.org/jea/ARTICLES/Vol7Nbr5.pdf + +Here is the article, the code, and an implementation that I have been +toying with: +http://cap.connx.com/chess-engines/new-approach/algos.zip + +Algorithm quickheap is especially interesting because it does not +require much additional space (just an array of integers up to size +log(element_count)) and, in addition, it has very few data movements. 
+ +> -----Original Message----- +> From: Manfred Koizar [mailto:mkoi-pg@aon.at] +> Sent: Thursday, December 22, 2005 1:59 PM +> To: Martijn van Oosterhout +> Cc: Tom Lane; Dann Corbit; Qingqing Zhou; Bruce Momjian; Luke +Lonergan; +> Neil Conway; pgsql-hackers@postgresql.org +> Subject: Re: [HACKERS] Re: Which qsort is used +> +> On Thu, 22 Dec 2005 08:01:00 +0100, Martijn van Oosterhout +> wrote: +> >But where are you including the cost to check how many cells are +> >already sorted? That would be O(H), right? +> +> Yes. I didn't mention it, because H < N. +> +> > This is where we come back +> >to the issue that comparisons in PostgreSQL are expensive. +> +> So we agree that we should try to reduce the number of comparisons. +> How many comparisons does it take to sort 100000 items? 1.5 million? +> +> >Hmm, what are the chances you have 100000 unordered items to sort and +> >that the first 8% will already be in order. ISTM that that +probability +> >will be close enough to zero to not matter... +> +> If the items are totally unordered, the check is so cheap you won't +> even notice. OTOH in Tom's example ... +> +> |What I think is much more probable in the Postgres environment +> |is almost-but-not-quite-ordered inputs --- eg, a table that was +> |perfectly ordered by key when filled, but some of the tuples have +since +> |been moved by UPDATEs. +> +> ... I'd not be surprised if H is 90% of N. +> Servus +> Manfred +