<sect1 id="seg">
<title>seg</title>

<indexterm zone="seg">
<primary>seg</primary>
</indexterm>

<para>
The <literal>seg</literal> module contains the code for the user-defined
type, <literal>SEG</literal>, representing laboratory measurements as
floating point intervals.
</para>

<sect2>
<title>Rationale</title>
<para>
The geometry of measurements is usually more complex than that of a
point in a numeric continuum. A measurement is usually a segment of
that continuum with somewhat fuzzy limits. The measurements come out
as intervals because of uncertainty and randomness, as well as because
the value being measured may naturally be an interval indicating some
condition, such as the temperature range of stability of a protein.
</para>
<para>
Using just common sense, it appears more convenient to store such data
as intervals, rather than pairs of numbers. In practice, it even turns
out to be more efficient in most applications.
</para>
<para>
Further along the line of common sense, the fuzziness of the limits
suggests that the use of traditional numeric data types leads to a
certain loss of information. Consider this: your instrument reads
6.50, and you input this reading into the database. What do you get
when you fetch it? Watch:
</para>
<programlisting>
test=> select 6.50 as "pH";
pH
---
6.5
(1 row)
</programlisting>
<para>
In the world of measurements, 6.50 is not the same as 6.5. It may
sometimes be critically different. The experimenters usually write
down (and publish) the digits they trust. 6.50 is actually a fuzzy
interval contained within a bigger and even fuzzier interval, 6.5,
with their center points being (probably) the only common feature they
share. We definitely do not want such different data items to appear the
same.
</para>
<para>
Conclusion? It is nice to have a special data type that can record the
limits of an interval with arbitrarily variable precision. Variable in
the sense that each data element records its own precision.
</para>
<para>
Check this out:
</para>
<programlisting>
test=> select '6.25 .. 6.50'::seg as "pH";
pH
------------
6.25 .. 6.50
(1 row)
</programlisting>
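<para>
The same loss can be reproduced outside the database. The following is a
minimal Python sketch, offered only as an illustration: a binary float keeps
the value but not the digits as written, whereas a type that records its own
precision returns the reading as entered. Python's
<literal>decimal.Decimal</literal> is used here as a stand-in; it is an
assumption of this example and says nothing about how
<literal>seg</literal> is implemented.
</para>

```python
from decimal import Decimal

reading = "6.50"  # the instrument reading, as typed

as_float = float(reading)      # binary float: the trailing zero is gone
as_decimal = Decimal(reading)  # decimal type: digits are kept as written

print(as_float)    # 6.5
print(as_decimal)  # 6.50

# To a float, the two readings are indistinguishable.
print(float("6.50") == float("6.5"))  # True
```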
</sect2>

<sect2>
<title>Syntax</title>
<para>
The external representation of an interval is formed using one or two
floating point numbers joined by the range operator ('..' or '...').
Optional certainty indicators (<, > and ~) are ignored by the internal
logic, but are retained in the data.
</para>

<table>
<title>Rules</title>
<tgroup cols="2">
<tbody>
<row>
<entry>rule 1</entry>
<entry>seg -> boundary PLUMIN deviation</entry>
</row>
<row>
<entry>rule 2</entry>
<entry>seg -> boundary RANGE boundary</entry>
</row>
<row>
<entry>rule 3</entry>
<entry>seg -> boundary RANGE</entry>
</row>
<row>
<entry>rule 4</entry>
<entry>seg -> RANGE boundary</entry>
</row>
<row>
<entry>rule 5</entry>
<entry>seg -> boundary</entry>
</row>
<row>
<entry>rule 6</entry>
<entry>boundary -> FLOAT</entry>
</row>
<row>
<entry>rule 7</entry>
<entry>boundary -> EXTENSION FLOAT</entry>
</row>
<row>
<entry>rule 8</entry>
<entry>deviation -> FLOAT</entry>
</row>
</tbody>
</tgroup>
</table>

<table>
<title>Tokens</title>
<tgroup cols="2">
<tbody>
<row>
<entry>RANGE</entry>
<entry>(\.\.)(\.)?</entry>
</row>
<row>
<entry>PLUMIN</entry>
<entry>\'\+\-\'</entry>
</row>
<row>
<entry>integer</entry>
<entry>[+-]?[0-9]+</entry>
</row>
<row>
<entry>real</entry>
<entry>[+-]?[0-9]+\.[0-9]+</entry>
</row>
<row>
<entry>FLOAT</entry>
<entry>({integer}|{real})([eE]{integer})?</entry>
</row>
<row>
<entry>EXTENSION</entry>
<entry>[<>~]</entry>
</row>
</tbody>
</tgroup>
</table>
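<para>
The token patterns above can be tried out with ordinary regular expressions.
The sketch below transcribes RANGE, integer, real, FLOAT and EXTENSION into
Python's <literal>re</literal> syntax and checks a few boundary strings
against rules 6 and 7. The helper name <literal>is_boundary</literal> is
hypothetical, and the code is an illustration of the patterns, not the
module's actual lexer.
</para>

```python
import re

# Token patterns transcribed from the Tokens table.
INTEGER = r"[+-]?[0-9]+"
REAL = r"[+-]?[0-9]+\.[0-9]+"
FLOAT = rf"(?:{REAL}|{INTEGER})(?:[eE]{INTEGER})?"
RANGE = r"\.\.\.?"        # '..' or '...'
EXTENSION = r"[<>~]"      # certainty indicators

# Rules 6 and 7: boundary -> FLOAT | EXTENSION FLOAT
BOUNDARY = re.compile(rf"(?:{EXTENSION})?{FLOAT}")


def is_boundary(text: str) -> bool:
    """Return True if the whole string is a valid boundary (hypothetical helper)."""
    return BOUNDARY.fullmatch(text) is not None


for candidate in ["5", "~5.0", "1.5e-2", ".1e7", "2.4 E4"]:
    print(repr(candidate), is_boundary(candidate))
```

<para>
Under these patterns a number must have a digit before the decimal point,
which is why <literal>.1e7</literal> is rejected, and no white space is
allowed inside a number, which is why <literal>2.4 E4</literal> is rejected.
</para>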

<table>
<title>Examples of valid <literal>SEG</literal> representations</title>
<tgroup cols="2">
<tbody>
<row>
<entry>Any number</entry>
<entry>
(rules 5,6) -- creates a zero-length segment (a point,
if you will)
</entry>
</row>
<row>
<entry>~5.0</entry>
<entry>
(rules 5,7) -- creates a zero-length segment AND records
'~' in the data. This notation reads 'approximately 5.0',
but its meaning is not recognized by the code. It is ignored
until you get the value back. View it as a shorthand comment.
</entry>
</row>
<row>
<entry><5.0</entry>
<entry>
(rules 5,7) -- creates a point at 5.0; '<' is ignored but
is preserved as a comment
</entry>
</row>
<row>
<entry>>5.0</entry>
<entry>
(rules 5,7) -- creates a point at 5.0; '>' is ignored but
is preserved as a comment
</entry>
</row>
<row>
<entry><para>5(+-)0.3</para><para>5'+-'0.3</para></entry>
<entry>
<para>
(rules 1,8) -- creates an interval '4.7..5.3'. As of this
writing (02/09/2000), this mechanism isn't completely accurate
in determining the number of significant digits for the
boundaries. For example, it adds an extra digit to the lower
boundary if the resulting interval includes a power of ten:
</para>
<programlisting>
postgres=> select '10(+-)1'::seg as seg;
seg
---------
9.0 .. 11 -- should be: 9 .. 11
</programlisting>
<para>
Also, the (+-) notation is not preserved: 'a(+-)b' will
always be returned as '(a-b) .. (a+b)'. The purpose of this
notation is to allow input from certain data sources without
conversion.
</para>
</entry>
</row>
<row>
<entry>50 .. </entry>
<entry>(rule 3) -- everything that is greater than or equal to 50</entry>
</row>
<row>
<entry>.. 0</entry>
<entry>(rule 4) -- everything that is less than or equal to 0</entry>
</row>
<row>
<entry>1.5e-2 .. 2E-2 </entry>
<entry>(rule 2) -- creates an interval (0.015 .. 0.02)</entry>
</row>
<row>
<entry>1 ... 2</entry>
<entry>
The same as 1...2, or 1 .. 2, or 1..2 (space is ignored).
Because of the widespread use of '...' in the data sources,
I decided to stick to it as a range operator. This, and
also the fact that the white space around the range operator
is ignored, creates a parsing conflict with numeric constants
starting with a decimal point.
</entry>
</row>
</tbody>
</tgroup>
</table>

<table>
<title>Examples of invalid <literal>SEG</literal> representations</title>
<tgroup cols="2">
<tbody>
<row>
<entry>.1e7</entry>
<entry>should be: 0.1e7</entry>
</row>
<row>
<entry>.1 .. .2</entry>
<entry>should be: 0.1 .. 0.2</entry>
</row>
<row>
<entry>2.4 E4</entry>
<entry>should be: 2.4E4</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
The following, although it is not a syntax error, is disallowed to improve
the sanity of the data:
</para>
<table>
<title></title>
<tgroup cols="2">
<tbody>
<row>
<entry>5 .. 2</entry>
<entry>should be: 2 .. 5</entry>
</row>
</tbody>
</tgroup>
</table>
</sect2>

<sect2>
<title>Precision</title>
<para>
The segments are stored internally as pairs of 32-bit floating point
numbers. This means that numbers with more than 7 significant digits
will be truncated.
</para>
<para>
Numbers with 7 or fewer significant digits retain their
original precision. That is, if your query returns 0.00, you can be
sure that the trailing zeroes are not artifacts of formatting: they
reflect the precision of the original data. The number of leading
zeroes does not affect precision: the value 0.0067 is considered to
have just 2 significant digits.
</para>
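<para>
Both points can be checked with a short Python sketch. It rounds a value
through a 32-bit float using the standard <literal>struct</literal> module
and counts significant digits with <literal>decimal.Decimal</literal>. The
helper names <literal>to_float32</literal> and
<literal>significant_digits</literal> are hypothetical; this illustrates the
arithmetic and is not the module's own code.
</para>

```python
import struct
from decimal import Decimal


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def significant_digits(text: str) -> int:
    """Count significant digits of a nonzero decimal literal.

    Leading zeroes are not counted; trailing zeroes are.
    """
    return len(Decimal(text).as_tuple().digits)


# A 7-digit value survives the trip through 32 bits ...
print(format(to_float32(1.234567), ".7g"))

# ... but a 9-digit value does not come back unchanged.
print(to_float32(1.23456789) == 1.23456789)

print(significant_digits("0.0067"))  # leading zeroes are ignored
print(significant_digits("6.50"))    # the trailing zero counts
```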
</sect2>

<sect2>
<title>Usage</title>
<para>
The access method for SEG is a GiST index (gist_seg_ops), which is a
generalization of R-tree. GiSTs allow the postgres implementation of
R-tree, originally encoded to support 2-D geometric types such as
boxes and polygons, to be used with any data type whose data domain
can be partitioned using the concepts of containment, intersection and
equality. In other words, everything that can intersect or contain
its own kind can be indexed with a GiST. That includes, among other
things, all geometric data types, regardless of their dimensionality
(see also contrib/cube).
</para>
<para>
The operators supported by the GiST access method include:
</para>
<itemizedlist>
<listitem>
<programlisting>
[a, b] << [c, d] Is left of
</programlisting>
<para>
The left operand, [a, b], occurs entirely to the left of the
right operand, [c, d], on the axis (-inf, inf). That is,
[a, b] << [c, d] is true if b < c and false otherwise.
</para>
</listitem>
<listitem>
<programlisting>
[a, b] >> [c, d] Is right of
</programlisting>
<para>
[a, b] occurs entirely to the right of [c, d].
[a, b] >> [c, d] is true if a > d and false otherwise.
</para>
</listitem>
<listitem>
<programlisting>
[a, b] &< [c, d] Overlaps or is left of
</programlisting>
<para>
This might be better read as "does not extend to right of".
It is true when b <= d.
</para>
</listitem>
<listitem>
<programlisting>
[a, b] &> [c, d] Overlaps or is right of
</programlisting>
<para>
This might be better read as "does not extend to left of".
It is true when a >= c.
</para>
</listitem>
<listitem>
<programlisting>
[a, b] = [c, d] Same as
</programlisting>
<para>
The segments [a, b] and [c, d] are identical, that is, a == c
and b == d.
</para>
</listitem>
<listitem>
<programlisting>
[a, b] && [c, d] Overlaps
</programlisting>
<para>
The segments [a, b] and [c, d] overlap.
</para>
</listitem>
<listitem>
<programlisting>
[a, b] @> [c, d] Contains
</programlisting>
<para>
The segment [a, b] contains the segment [c, d], that is,
a <= c and b >= d.
</para>
</listitem>
<listitem>
<programlisting>
[a, b] <@ [c, d] Contained in
</programlisting>
<para>
The segment [a, b] is contained in [c, d], that is,
a >= c and b <= d.
</para>
</listitem>
</itemizedlist>
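<para>
The conditions listed for these operators can be restated as plain
predicates on a pair of bounds. The Python sketch below does so; the
<literal>Seg</literal> tuple and the function names are hypothetical. The
text does not spell out the condition for <literal>&&</literal>, so the
version here assumes closed intervals (segments that merely touch at an
endpoint count as overlapping). It is a reading aid, not the module's C code.
</para>

```python
from typing import NamedTuple


class Seg(NamedTuple):
    """A segment [lo, hi]; hypothetical stand-in for the SEG type."""
    lo: float
    hi: float


def left_of(x: Seg, y: Seg) -> bool:        # x << y
    return x.hi < y.lo


def right_of(x: Seg, y: Seg) -> bool:       # x >> y
    return x.lo > y.hi


def not_right_of(x: Seg, y: Seg) -> bool:   # x &< y
    return x.hi <= y.hi


def not_left_of(x: Seg, y: Seg) -> bool:    # x &> y
    return x.lo >= y.lo


def same(x: Seg, y: Seg) -> bool:           # x = y
    return x.lo == y.lo and x.hi == y.hi


def overlaps(x: Seg, y: Seg) -> bool:       # x && y (assumes closed intervals)
    return x.lo <= y.hi and y.lo <= x.hi


def contains(x: Seg, y: Seg) -> bool:       # x @> y
    return x.lo <= y.lo and x.hi >= y.hi


def contained_in(x: Seg, y: Seg) -> bool:   # x <@ y
    return contains(y, x)


a, b = Seg(1.0, 2.0), Seg(3.0, 4.0)
print(left_of(a, b), overlaps(a, b), contains(Seg(0.0, 5.0), a))
```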
<para>
(Before PostgreSQL 8.2, the containment operators @> and <@ were
respectively called @ and ~. These names are still available, but are
deprecated and will eventually be retired. Notice that the old names
are reversed from the convention formerly followed by the core geometric
datatypes!)
</para>
<para>
Although the mnemonics of the operators above are questionable, I
preserved them to maintain visual consistency with other geometric
data types defined in Postgres.
</para>
<para>
Other operators:
</para>

<programlisting>
[a, b] < [c, d] Less than
[a, b] > [c, d] Greater than
</programlisting>
<para>
These operators do not make a lot of sense for any practical
purpose but sorting. They first compare (a) to (c),
and if these are equal, compare (b) to (d). That accounts for
reasonably good sorting in most cases, which is useful if
you want to use ORDER BY with this type.
</para>
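<para>
The comparison just described is the usual lexicographic ordering on the
pair (lower bound, upper bound). As a rough illustration, Python tuples
compare the same way, so sorting a list of hypothetical (a, b) pairs shows
the order one would expect from ORDER BY on a column of finite segments:
</para>

```python
# Each pair is (lower bound, upper bound); the values are made up.
segments = [(1.0, 5.0), (0.0, 2.0), (1.0, 3.0)]

# Tuples compare element by element: first the lower bounds,
# then, on a tie, the upper bounds.
ordered = sorted(segments)

print(ordered)  # [(0.0, 2.0), (1.0, 3.0), (1.0, 5.0)]
```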

<para>
There are a few other potentially useful functions defined in seg.c
that vanished from the schema because I stopped using them. Some of
these were meant to support type casting. Let me know if I was wrong:
I will then add them back to the schema. I would also appreciate
other ideas that would enhance the type and make it more useful.
</para>
<para>
For examples of usage, see sql/seg.sql.
</para>
<para>
NOTE: The performance of an R-tree index can largely depend on the
order of input values. It may be very helpful to sort the input table
on the SEG column (see the script sort-segments.pl for an example).
</para>
</sect2>

<sect2>
<title>Credits</title>
<para>
My thanks are primarily to Prof. Joe Hellerstein
(<ulink url="http://db.cs.berkeley.edu/~jmh/"></ulink>) for elucidating the
gist of the GiST (<ulink url="http://gist.cs.berkeley.edu/"></ulink>). I am
also grateful to all postgres developers, present and past, for enabling
me to create my own world and live undisturbed in it. And I would like
to acknowledge my gratitude to Argonne Lab and to the U.S. Department of
Energy for the years of faithful support of my database research.
</para>
<programlisting>
Gene Selkov, Jr.
Computational Scientist
Mathematics and Computer Science Division
Argonne National Laboratory
9700 S Cass Ave.
Building 221
Argonne, IL 60439-4844
</programlisting>
<para>
<email>selkovjr@mcs.anl.gov</email>
</para>
</sect2>

</sect1>