mirror of
				https://github.com/postgres/postgres.git
				synced 2025-11-04 00:02:52 -05:00 
			
		
		
		
	
		
			
				
	
	
		
			211 lines
		
	
	
		
			10 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			211 lines
		
	
	
		
			10 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
Tsearch2 - full text search extension for PostgreSQL
 | 
						|
 | 
						|
   [10][Online version] of this document is available
 | 
						|
   
 | 
						|
   This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
 | 
						|
   
 | 
						|
   Notice: This version is fully incompatible with old tsearch (V1),
 | 
						|
   which is considered as deprecated in upcoming 7.4 release and
 | 
						|
   obsoleted in 8.0.
 | 
						|
   
 | 
						|
   The Tsearch2 contrib module contains an implementation of a new data
 | 
						|
   type tsvector - a searchable data type with indexed access. In a
 | 
						|
   nutshell, tsvector is a set of unique words along with their
 | 
						|
   positional information in the document, organized in a special
 | 
						|
   structure optimized for fast access and lookup. Actually, each word
 | 
						|
   entry, besides its position in the document, could have a weight
 | 
						|
   attribute, describing importance of this word (at a specific) position
 | 
						|
   in document. A set of bit-signatures of a fixed length, representing
 | 
						|
   tsvectors, are stored in a search tree (developed using PostgreSQL
 | 
						|
   GiST), which provides online update of full text index and fast query
 | 
						|
   lookup. The module provides indexed access methods, queries,
 | 
						|
   operations and supporting routines for the tsvector data type and easy
 | 
						|
   conversion of text data to tsvector. Table driven configuration allows
 | 
						|
   creation of custom configuration optimized for specific searches using
 | 
						|
   standard SQL commands.
 | 
						|
   
 | 
						|
   Configuration allows you to:
 | 
						|
     * specify the type of lexemes to be indexed and the way they are
 | 
						|
       processed.
 | 
						|
     * specify dictionaries to be used along with stop words recognition.
 | 
						|
     * specify the parser used to process a document.
 | 
						|
       
 | 
						|
   See [11]Documentation Roadmap for links to documentation.
 | 
						|
 | 
						|
OpenFTS vs Tsearch2
 | 
						|
 | 
						|
    OpenFTS is a middleware between application and database, so it uses 
 | 
						|
    tsearch2 as a storage, while database engine is used as a query executor 
 | 
						|
    (searching). Everything else (parsing of documents, query processing, 
 | 
						|
    linguistics) carry outs on client side. That's why OpenFTS has its own 
 | 
						|
    configuration table (fts_conf) and works with its own set of dictionaries. 
 | 
						|
    OpenFTS is more flexible, because it could be used in multi-server 
 | 
						|
    architecture with separated machines for repository of documents 
 | 
						|
    (documents could be stored in file system), database and query engine.   
 | 
						|
 | 
						|
Authors
 | 
						|
 | 
						|
     * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
 | 
						|
     * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia
 | 
						|
       
 | 
						|
Contributors
 | 
						|
 | 
						|
     * Robert John Shepherd and Andrew J. Kopciuch submitted
 | 
						|
       "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
 | 
						|
       v2)
 | 
						|
     * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
 | 
						|
       Reference" and proposed new naming convention for tsearch V2
 | 
						|
       
 | 
						|
New features
 | 
						|
 | 
						|
     * Relevance ranking of search results
 | 
						|
     * Table driven configuration
 | 
						|
     * Morphology support (ispell dictionaries, snowball stemmers)
 | 
						|
     * Headline support (text fragments with highlighted search terms)
 | 
						|
     * Ability to plug-in custom dictionaries and parsers
 | 
						|
     * Synonym dictionary
 | 
						|
     * Generator of templates for dictionaries (built-in snowball stemmer
 | 
						|
       support)
 | 
						|
     * Statistics of indexed words is available
 | 
						|
       
 | 
						|
Limitations
 | 
						|
 | 
						|
     * Lexeme should be not longer than 2048 bytes
 | 
						|
     * The number of lexemes is limited by 2^32. Note, that actual
 | 
						|
       capacity of tsvector is depends on whether positional information
 | 
						|
       is stored or not.
 | 
						|
     * tsvector - the size is limited by approximately 2^20 bytes.
 | 
						|
     * tsquery - the number of entries (lexemes and operations) < 32768
 | 
						|
     * Positional information
 | 
						|
          + maximal position of lexeme < 2^14 (16384)
 | 
						|
          + lexeme could have maximum 256 positions
 | 
						|
       
 | 
						|
References
 | 
						|
 | 
						|
     * GiST development site -
 | 
						|
       [12]http://www.sai.msu.su/~megera/postgres/gist
 | 
						|
     * OpenFTS home page - [13]http://openfts.sourceforge.net/
 | 
						|
     * Mailing list -
 | 
						|
       [14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gen
 | 
						|
       eral
 | 
						|
       
 | 
						|
   [15]Documentation Roadmap
 | 
						|
   
 | 
						|
Documentation Roadmap
 | 
						|
 | 
						|
     * Several docs are available from docs/ subdirectory
 | 
						|
          + "Tsearch V2 Introduction" by Andrew Kopciuch
 | 
						|
          + "Tsearch2 Guide" by Brandon Rhodes
 | 
						|
          + "Tsearch2 Reference" by Brandon Rhodes
 | 
						|
     * Readme.gendict in gendict/ subdirectory
 | 
						|
          + [16][Gendict tutorial]
 | 
						|
       
 | 
						|
   Online version of documentation is always available from Tsearch V2
 | 
						|
   home page -
 | 
						|
   [17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
 | 
						|
   
 | 
						|
Support
 | 
						|
 | 
						|
   Authors urgently recommend people to use [18][openfts-general] or
 | 
						|
   [19][pgsql-general] mailing lists for questions and discussions.
 | 
						|
   
 | 
						|
Caution
 | 
						|
 | 
						|
   In spite of apparent easy full text searching with our tsearch module
 | 
						|
   (authors hope it's so), any serious search engine require profound
 | 
						|
   study of various aspects, such as stop words, dictionaries, special
 | 
						|
   parsers. Tsearch module was designed to facilitate both those cases.
 | 
						|
   
 | 
						|
Development History
 | 
						|
 | 
						|
   Pre-tsearch era
 | 
						|
          Development of OpenFTS began in 2000 after realizing that we
 | 
						|
          needed a search engine optimized for online updates and able to
 | 
						|
          access metadata from the database. This is essential for online
 | 
						|
          news agencies, web portals, digital libraries, etc. Most search
 | 
						|
          engines available utilize an inverted index which is very fast
 | 
						|
          for searching but very slow for online updates. Incremental
 | 
						|
          updates of an inverted index is a complex engineering task
 | 
						|
          while we needed something light, free and with the ability to
 | 
						|
          access metadata from the database. The last requirement is very
 | 
						|
          important because in a real life application a search engine
 | 
						|
          should always consult metadata ( topic, permissions, date
 | 
						|
          range, version, etc.). We extensively use PostgreSQL as a
 | 
						|
          database backend and have no intention to move from it, so the
 | 
						|
          problem was to find a data structure and a fast way to access
 | 
						|
          it. PostgreSQL has rather unique data type for storing sets
 | 
						|
          (think about words) - arrays, but lacks index access to them. A
 | 
						|
          document is parsed into lexemes, which are identified in
 | 
						|
          various ways (e.g. stemming, morphology, dictionary), and as a
 | 
						|
          result is reduced to an array of integer numbers. During our
 | 
						|
          research we found a paper of Joseph Hellerstein which
 | 
						|
          introduced an interesting data structure suitable for sets -
 | 
						|
          RD-tree (Russian Doll tree). It looked very attractive, but
 | 
						|
          implementing it in PostgreSQL seemed difficult because of our
 | 
						|
          ignorance of database internals. Further research lead us to
 | 
						|
          the idea to use GiST for implementing RD-tree, but at that time
 | 
						|
          the GiST code had for a long while remained untouched and
 | 
						|
          contained several bugs. After work on improving GiST for
 | 
						|
          version 7.0.3 of PostgreSQL was done, we were able to implement
 | 
						|
          RD-Tree and use it for index access to arrays of integers. This
 | 
						|
          implementation was ideally suited for small arrays and
 | 
						|
          eliminated complex joins, but was practically useless for
 | 
						|
          indexing large arrays. The next improvement came from an idea
 | 
						|
          to represent a document by a single bit-signature, a so-called
 | 
						|
          superimposed signature (see "Index Structures for Databases
 | 
						|
          Containing Data Items with Set-valued Attributes", 1997, Sven
 | 
						|
          Helmer for details). We developeded the contrib/intarray module
 | 
						|
          and used it for full text indexing.
 | 
						|
          
 | 
						|
   tsearch v1
 | 
						|
          It was inconvenient to use integer id's instead of words, so we
 | 
						|
          introduced a new data type called 'txtidx' - a searchable data
 | 
						|
          type (textual) with indexed access. This was a first step of
 | 
						|
          our work on an implementation of a built-in PostgreSQL full
 | 
						|
          text search engine. Even though tsearch v1 had many features of
 | 
						|
          a search engine it lacked configuration support and relevance
 | 
						|
          ranking. People were encouraged to use OpenFTS, which provided
 | 
						|
          relevance ranking based on coordinate information and flexible
 | 
						|
          configuration. OpenFTS v.0.34 is the last version based on
 | 
						|
          tsearch v1.
 | 
						|
          
 | 
						|
   tsearch V2
 | 
						|
          People recognized tsearch as a powerful tool for full text
 | 
						|
          searching and insisted on adding ranking support, better
 | 
						|
          configurability, etc. We already thought about moving most of
 | 
						|
          the features of OpenFTS to tsearch, and in the early 2003 we
 | 
						|
          decided to work on a new version of tsearch - tsearch v2. We've
 | 
						|
          abandoned auxiliary index tables which were used by OpenFTS to
 | 
						|
          store coordinate information and modified the txtidx type to
 | 
						|
          store them internally. Also, we've added table-driven
 | 
						|
          configuration, support of ispell dictionaries, snowball
 | 
						|
          stemmers and the ability to specify which types of lexemes to
 | 
						|
          index. Also, it's now possible to generate headlines of
 | 
						|
          documents with highlighted search terms. These changes make
 | 
						|
          tsearch more user friendly and turn it into a really powerful
 | 
						|
          full text search engine. After announcing the alpha version, we
 | 
						|
          received a proposal from Brandon Rhodes to rename tsearch
 | 
						|
          functions to be more consistent. So, we have renamed txtidx
 | 
						|
          type to tsvector and other things as well.
 | 
						|
          
 | 
						|
   To allow users of tsearch v1 smooth upgrade, we named the module as
 | 
						|
   tsearch2.
 | 
						|
   
 | 
						|
   Future release of OpenFTS (v.0.35) will be based on tsearch2. Brave
 | 
						|
   people could download it from OpenFTS CVS (see link from [20][OpenFTS
 | 
						|
   page]
 | 
						|
 | 
						|
References
 | 
						|
 | 
						|
  10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
 | 
						|
  11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap
 | 
						|
  12. http://www.sai.msu.su/~megera/postgres/gist
 | 
						|
  13. http://openfts.sourceforge.net/
 | 
						|
  14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
 | 
						|
  15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap
 | 
						|
  16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict
 | 
						|
  17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
 | 
						|
  18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
 | 
						|
  19. http://archives.postgresql.org/pgsql-general/
 | 
						|
  20. http://openfts.sourceforge.net/
 |