Commit Graph

20 Commits

Author SHA1 Message Date
Michael Paquier 59f47fb98d unaccent: Add support for quoted translated characters
As reported in bug #18057, the extension unaccent removes in its rule
file whitespace characters that are intentionally specified when
building unaccent.rules from UnicodeData.txt, causing an incorrect
translation for some characters like numeric symbols.  This is caused by
the fact that all whitespaces before and after the origin and target
characters are all discarded (this limitation is documented).

This commit makes possible the use of quotes around target characters,
so as whitespaces can be considered part of target characters.  Some
target characters use a double quote, these require an extra double
quote.

The documentation is updated to show how to use quoted areas,
generate_unaccent_rules.py is updated to generate unaccent.rules and a
couple of tests are added for numeric symbols.  While working on this
patch, I have implemented a fake rule file to test the parsing logic
implemented, which is not included here as it would just consume extra
cycles in the tests, and it requires the manipulation of an installation
tree to be able to work correctly.

As this requires a change of format in unaccent.rules, this cannot be
backpatched, unfortunately.  The idea to use double quotes as escaped
characters comes from Tom Lane.

Reported-by: Martin Schlossarek
Author: Michael Paquier
Discussion: https://postgr.es/m/18057-62712cad01bd202c@postgresql.org
2023-09-20 12:29:36 +09:00
Alvaro Herrera e86c8b728f
Describe each contrib module in its SGML section title
The original titles only had the module name, which is not very useful
when scanning the list.  By adding a very brief description to each
title, the table of contents becomes friendlier.

Also amend the introduction in the "additional modules" appendix, using
the word "Extension" more extensively.  Nowadays, almost all contrib
modules are extensions, so this is also helpful.

Author: Karl O. Pinc <kop@karlpinc.com>
Reviewed-by: Brar Piening <brar@gmx.de>
Discussion: https://postgr.es/m/20230102180015.372995a9@slate.karlpinc.com
2023-01-20 20:01:59 +01:00
Tom Lane 78ee60ed84 Doc: add XML ID attributes to <sectN> and <varlistentry> tags.
This doesn't have any external effect at the moment, but it
will allow adding useful link-discoverability features later.

Brar Piening, reviewed by Karl Pinc.

Discussion: https://postgr.es/m/CAB8KJ=jpuQU9QJe4+RgWENrK5g9jhoysMw2nvTN_esoOU0=a_w@mail.gmail.com
2023-01-09 15:08:24 -05:00
Tom Lane eb67623c96 Mark some contrib modules as "trusted".
This allows these modules to be installed into a database without
superuser privileges (assuming that the DBA or sysadmin has installed
the module's files in the expected place).  You only need CREATE
privilege on the current database, which by default would be
available to the database owner.

The following modules are marked trusted:

btree_gin
btree_gist
citext
cube
dict_int
earthdistance
fuzzystrmatch
hstore
hstore_plperl
intarray
isn
jsonb_plperl
lo
ltree
pg_trgm
pgcrypto
seg
tablefunc
tcn
tsm_system_rows
tsm_system_time
unaccent
uuid-ossp

In the future we might mark some more modules trusted, but there
seems to be no debate about these, and on the whole it seems wise
to be conservative with use of this feature to start out with.

Discussion: https://postgr.es/m/32315.1580326876@sss.pgh.pa.us
2020-02-13 15:02:35 -05:00
Tom Lane a5322ca10f Make contrib/unaccent's unaccent() function work when not in search path.
Since the fixes for CVE-2018-1058, we've advised people to schema-qualify
function references in order to fix failures in code that executes under
a minimal search_path setting.  However, that's insufficient to make the
single-argument form of unaccent() work, because it looks up the "unaccent"
text search dictionary using the search path.

The most expedient answer seems to be to remove the search_path dependency
by making it look in the same schema that the unaccent() function itself
is declared in.  This will definitely work for the normal usage of this
function with the unaccent dictionary provided by the extension.
It's barely possible that there are people who were relying on the
search-path-dependent behavior to select other dictionaries with the same
name; but if there are any such people at all, they can still get that
behavior by writing unaccent('unaccent', ...), or possibly
unaccent('unaccent'::text::regdictionary, ...) if the lookup has to be
postponed to runtime.

Per complaint from Gunnlaugur Thor Briem.  Back-patch to all supported
branches.

Discussion: https://postgr.es/m/CAPs+M8LCex6d=DeneofdsoJVijaG59m9V0ggbb3pOH7hZO4+cQ@mail.gmail.com
2018-09-06 10:49:45 -04:00
Peter Eisentraut c29c578908 Don't use SGML empty tags
For DocBook XML compatibility, don't use SGML empty tags (</>) anymore,
replace by the full tag name.  Add a warning option to catch future
occurrences.

Alexander Lakhin, Jürgen Purtz
2017-10-17 15:10:33 -04:00
Peter Eisentraut 44b3230e82 Use lower-case SGML attribute values
for DocBook XML compatibility
2017-10-10 10:15:57 -04:00
Tom Lane 6376a16ba2 Update contrib/unaccent documentation about its unaccent.rules file.
Commit 1bbd52cb9a didn't bother with such niceties.
2016-04-30 15:06:26 -04:00
Tom Lane 1b2488731c Allow multi-character source strings in contrib/unaccent.
This could be useful in languages where diacritic signs are represented as
separate characters; more generally it supports using unaccent dictionaries
for substring substitutions beyond narrowly conceived "diacritic removal".
In any case, since the rule-file parser doesn't complain about
multi-character source strings, it behooves us to do something unsurprising
with them.
2014-06-30 21:46:29 -04:00
Tom Lane 97c40ce614 Allow empty replacement strings in contrib/unaccent.
This is useful in languages where diacritic signs are represented as
separate characters; it's also one step towards letting unaccent be used
for arbitrary substring substitutions.

In passing, improve the user documentation for unaccent, which was sadly
vague about some important details.

Mohammad Alhashash, reviewed by Abhijit Menon-Sen
2014-06-30 20:51:30 -04:00
Bruce Momjian e567c9ff34 Add xreflabels to /contrib manuals so links appear correct. Also update
README.links to explain xref properly.
2011-05-07 22:29:20 -04:00
Tom Lane f1fb4b0e63 Fix obsolete references to old-style contrib installation methods. 2011-02-14 01:10:44 -05:00
Magnus Hagander 9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Tom Lane 9389ac8928 Document filtering dictionaries in textsearch.sgml.
While at it, copy-edit the description of prefix-match marker support in
synonym dictionaries, and clarify the description of the default unaccent
dictionary a bit more.
2010-08-25 21:42:55 +00:00
Tom Lane 7fc614c698 Docs review for unaccent: fix grammar, markup, etc. 2010-08-25 02:12:00 +00:00
Peter Eisentraut 5194b9d049 Spell and markup checking 2010-08-17 04:37:21 +00:00
Peter Eisentraut 66424a2848 Fix indentation of verbatim block elements
Block elements with verbatim formatting (literallayout, programlisting,
screen, synopsis) should be aligned at column 0 independent of the surrounding
SGML, because whitespace is significant, and indenting them creates erratic
whitespace in the output.  The CSS stylesheets already take care of indenting
the output.

Assorted markup improvements to go along with it.
2010-07-29 19:34:41 +00:00
Bruce Momjian a70d039104 Hot Standby documentation updates
Greg Smith
2010-02-19 00:15:25 +00:00
Bruce Momjian 1710800813 Remove tabs from SGML. 2009-08-20 12:12:37 +00:00
Teodor Sigaev 92e05bc6a5 Unaccent dictionary. 2009-08-18 10:34:39 +00:00