Author: | Daniele Varrazzo |
---|---|
Contact: | piro (at) develer.com |
Organization: | Develer S.r.l. |
Date: | 2011-04-23 |
Version: | 1.2 |
Copyright: | 2001, 2002 Gianluca Turconi |
Copyright: | 2002, 2003, 2004 Gianluca Turconi and Davide Prina |
Copyright: | 2004, 2005, 2006 Davide Prina |
Copyright: | 2007-2011 Daniele Varrazzo |
Abstract
This package provides a dictionary and the other files required to perform full text search in Italian documents using the PostgreSQL database.
Using the provided dictionary, search operations in Italian documents can take into account morphological variations of Italian words, such as verb conjugations.
This vocabulary has been generated from the MySpell OpenOffice.org vocabulary, provided by the progetto linguistico.
The dictionary had to undergo a huge number of transformations, and is now quite unrecognizable from the original. Above all, all the verbal forms, including those of irregular verbs, are now reduced to the infinitive. Furthermore, for each verb, the constructions with pronominal and reflexive particles are recognized on the gerund, present and past participle, imperative and infinitive.
Great care has also been taken in reducing the different forms of adjectives (masculine and feminine, singular and plural, superlatives) to a single normal form, and in unifying the masculine and feminine forms of nouns (e.g. ricercatore and ricercatrice: masculine and feminine forms of "researcher").
Furthermore, in the original dictionary, many unrelated masculine and feminine nouns were joined together as if they were forms of a single adjective (e.g. caso/casi + casa/case, with the unrelated meanings of "case(s)" and "house(s)"). Such false friends have mostly been split apart to avoid false positives in search results, but some of them may still be lying around in the dictionary (this is a kind of error that no Python script can help fix...).
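The effect of these normalizations on search can be illustrated with a toy sketch. It is not the package's implementation: the `LEMMAS` table and the `normalize` and `matches` helpers are hypothetical names invented for this example, and the handful of entries stands in for the real dictionary. It only shows why reducing forms to one normal form lets a query match inflected words, and why keeping caso and casa apart avoids a false positive.

```python
# Toy lookup table standing in for the real dictionary: each inflected form
# maps to a single normal form. The entries are illustrative assumptions.
LEMMAS = {
    "mangio": "mangiare",
    "mangiato": "mangiare",
    "mangiando": "mangiare",
    "ricercatore": "ricercatore",
    "ricercatrice": "ricercatore",
    "ricercatori": "ricercatore",
    # Unrelated nouns kept apart: "caso" (case) and "casa" (house).
    "caso": "caso",
    "casi": "caso",
    "casa": "casa",
    "case": "casa",
}


def normalize(word):
    """Return the normal form of a word, or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)


def matches(query, document):
    """Return True if any word in the document shares the query's normal form."""
    target = normalize(query)
    return any(normalize(token) == target for token in document.split())
```

With this table, `matches("mangiare", "Ho mangiato una mela")` succeeds because both words reduce to the infinitive, while `matches("caso", "le case nuove")` fails because the two nouns have distinct normal forms. If caso and casa were merged into one entry, as in the original dictionary, the second query would wrongly match.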
Some statistics about the current dictionary edition:
The dictionary was presented at PGDay 2007, the first Italian PostgreSQL conference. The slideshow is available for download.
This package doesn't include a stemming dictionary, which is already included in the PostgreSQL installation. The package can be used with database clusters in any encoding.
Please refer to the README file for installation details.
Version 1.1 of the package is compatible with PostgreSQL 8.2 and older versions, using the tsearch2 contrib module. The package also includes the Italian Snowball stemmer.
The package is available in two encodings:
You should install only the version matching your cluster locale (use psql -tc "SHOW LC_CTYPE" postgres to find out which one it is).
Please refer to the README.italian_fts_utf8 or README.italian_fts_latin1 file for installation details.
The Italian Dictionary for Full-Text Search is distributed under the GPL license.
I wish to thank Davide Prina and Gianluca Turconi, because without their progetto linguistico I wouldn't have had anything to work on.
I also heartily thank Oleg Bartunov and Teodor Sigaev, the Tsearch2 authors.
And many thanks to Develer, one of the finest assemblies of hackers in Italy!