mirror of
				https://github.com/postgres/postgres.git
				synced 2025-11-04 00:02:52 -05:00 
			
		
		
		
	
		
			
				
	
	
		
			211 lines
		
	
	
		
			9.8 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			211 lines
		
	
	
		
			9.8 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
Tsearch2 - full text search extension for PostgreSQL
 | 
						|
 | 
						|
   [1]Online version of this document is available
 | 
						|
 | 
						|
   Tsearch2  -  is the full text engine, fully integrated into PostgreSQL
 | 
						|
   RDBMS.
 | 
						|
 | 
						|
Main features
 | 
						|
 | 
						|
     * Full online update
 | 
						|
     * Supports multiple table driven configurations
 | 
						|
     * flexible  and  rich linguistic support (dictionaries, stop words),
 | 
						|
       thesaurus
 | 
						|
     * full multibyte (UTF-8) support
 | 
						|
     * Sophisticated  ranking  functions  with  support  of proximity and
 | 
						|
       structure information (rank, rank_cd)
 | 
						|
     * Index support (GiST and Gin) with concurrency and recovery support
 | 
						|
     * Rich query language with query rewriting support
 | 
						|
     * Headline support (text fragments with highlighted search terms)
 | 
						|
     * Ability to plug-in custom dictionaries and parsers
 | 
						|
     * Template  generator  for  tsearch2  dictionaries  with [2]snowball
 | 
						|
       stemmer support
 | 
						|
     * It is mature (5 years of development)
 | 
						|
 | 
						|
   Tsearch2,  in a nutshell, provides FTS operator (contains) for the new
 | 
						|
   data  types,  representing  document  (tsvector)  and query (tsquery).
 | 
						|
   Table  driven  configuration  allows creation of custom searches using
 | 
						|
   standard SQL commands.
 | 
						|
 | 
						|
   tsvector is a searchable data type, representing document. It is a set
 | 
						|
   of  unique  words  along  with  their  positional  information  in the
 | 
						|
   document,  organized  in a special structure optimized for fast access
 | 
						|
   and  lookup. Each entry could be labelled to reflect its importance in
 | 
						|
   document.
 | 
						|
 | 
						|
   tsquery  is  a  data  type for textual queries with support of boolean
 | 
						|
   operators.  It  consists of lexemes (optionally labelled) with boolean
 | 
						|
   operators between.
 | 
						|
 | 
						|
   Table driven configuration allows to specify:
 | 
						|
     * parser, which used to break document onto lexemes
 | 
						|
     * what lexemes to index and the way they are processed
 | 
						|
     * dictionaries to be used along with stop words recognition.
 | 
						|
 | 
						|
OpenFTS vs Tsearch2
 | 
						|
 | 
						|
   [3]OpenFTS  is  a middleware between application and database. OpenFTS
 | 
						|
   uses  tsearch2  as  a  storage and database engine as a query executor
 | 
						|
   (searching).   Everything  else,  i.e.  parsing  of  documents,  query
 | 
						|
   processing, linguistics, carry outs on client side. That's why OpenFTS
 | 
						|
   has  its own configuration table (fts_conf) and works with its own set
 | 
						|
   of dictionaries. OpenFTS is more flexible, because it could be used in
 | 
						|
   multi-server  architecture  with  separate  machines for repository of
 | 
						|
   documents  (documents  could  be  stored  in filesystem), database and
 | 
						|
   query engine.
 | 
						|
 | 
						|
   See [4]Documentation Roadmap for links to documentation.
 | 
						|
 | 
						|
Authors
 | 
						|
 | 
						|
     * Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
 | 
						|
     * Teodor Sigaev <teodor@sigaev.ru>, Moscow,Moscow University,Russia
 | 
						|
 | 
						|
Contributors
 | 
						|
 | 
						|
     * Robert   John   Shepherd   and   Andrew   J.   Kopciuch  submitted
 | 
						|
       "Introduction  to  tsearch" (Robert - tsearch v1, Andrew - tsearch
 | 
						|
       v2)
 | 
						|
     * Brandon   Craig   Rhodes  wrote  "Tsearch2  Guide"  and  "Tsearch2
 | 
						|
       Reference" and proposed new naming convention for tsearch V2
 | 
						|
 | 
						|
Sponsors
 | 
						|
 | 
						|
     * ABC Startsiden - compound words support
 | 
						|
     * University of Mannheim for UTF-8 support (in 8.2)
 | 
						|
     * jfg:networks ([5]http:www.jfg-networks.com/) for Gin - Generalized
 | 
						|
       Inverted index (in 8.2)
 | 
						|
     * Georgia  Public  Library  Service  and LibLime, Inc. for Thesaurus
 | 
						|
       dictionary
 | 
						|
     * PostGIS community - GiST Concurrency and Recovery
 | 
						|
 | 
						|
   The  authors are grateful to the Russian Foundation for Basic Research
 | 
						|
   and Delta-Soft Ltd., Moscow, Russia for support.
 | 
						|
 | 
						|
Limitations
 | 
						|
 | 
						|
     * Length of lexeme < 2K
 | 
						|
     * Length of tsvector (lexemes + positions) < 1Mb
 | 
						|
     * The number of lexemes < 4^32
 | 
						|
     * 0< Positional information < 16383
 | 
						|
     * No more than 256 positions per lexeme
 | 
						|
     * The number of nodes ( lexemes + operations) in tsquery < 32768
 | 
						|
 | 
						|
References
 | 
						|
 | 
						|
     * GiST development site -
 | 
						|
       [6]http://www.sai.msu.su/~megera/postgres/gist
 | 
						|
     * GiN development - [7]http://www.sigaev.ru/gin/
 | 
						|
     * OpenFTS home page - [8]http://openfts.sourceforge.net/
 | 
						|
     * Mailing list -
 | 
						|
       [9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gene
 | 
						|
       ral
 | 
						|
 | 
						|
Documentation Roadmap
 | 
						|
 | 
						|
     * Several docs are available from docs/ subdirectory
 | 
						|
          + "Tsearch V2 Introduction" by Andrew Kopciuch
 | 
						|
          + "Tsearch2 Guide" by Brandon Rhodes
 | 
						|
          + "Tsearch2 Reference" by Brandon Rhodes
 | 
						|
     * Readme.gendict in gendict/ subdirectory
 | 
						|
          + Also, check [10]Gendict tutorial
 | 
						|
     * Check [11]tsearch2 Wiki pages for various documentation
 | 
						|
 | 
						|
Support
 | 
						|
 | 
						|
   Authors  urgently  recommend  people  to  use  [12]openfts-general  or
 | 
						|
   [13]pgsql-general mailing lists for questions and discussions.
 | 
						|
 | 
						|
Development History
 | 
						|
 | 
						|
   Latest news
 | 
						|
 | 
						|
   To the PostgreSQL 8.2 release we added:
 | 
						|
     * multibyte (UTF-8) support
 | 
						|
     * Thesaurus dictionary
 | 
						|
     * Query rewriting
 | 
						|
     * rank_cd  relevation  function  now  support  different  weights of
 | 
						|
       lexemes
 | 
						|
     * GiN support adds scalability of tsearch2
 | 
						|
 | 
						|
   Pre-tsearch era
 | 
						|
          Development  of  OpenFTS  began in 2000 after realizing that we
 | 
						|
          need  a  search engine optimized for online updates with access
 | 
						|
          to  metadata  from  the  database. This is essential for online
 | 
						|
          news agencies, web portals, digital libraries, etc. Most search
 | 
						|
          engines  available utilize an inverted index which is very fast
 | 
						|
          for  searching  but  very  slow for online updates. Incremental
 | 
						|
          updates  of  an  inverted  index  is a complex engineering task
 | 
						|
          while  we  needed something light, free and with the ability to
 | 
						|
          access  metadata  from  the  database. The last requirement was
 | 
						|
          very important because in a real life application search engine
 | 
						|
          should  always  consult  metadata  (  topic,  permissions, date
 | 
						|
          range,  version,  etc.).  We  extensively  use  PostgreSQL as a
 | 
						|
          database  backend and have no intention to move from it, so the
 | 
						|
          problem  was  to find a data structure and a fast way to access
 | 
						|
          it.  PostgreSQL  has  rather  unique data type for storing sets
 | 
						|
          (think  about  words) - arrays, but lacks index access to them.
 | 
						|
          During our research we found a paper of Joseph Hellerstein, who
 | 
						|
          introduced  an  interesting  data structure suitable for sets -
 | 
						|
          RD-tree  (Russian  Doll  tree). Further research lead us to the
 | 
						|
          idea to use GiST for implementing RD-tree, but at that time the
 | 
						|
          GiST  code  was untouched for a long time and contained several
 | 
						|
          bugs.  After  work  on  improving  GiST  for  version  7.0.3 of
 | 
						|
          PostgreSQL  was done, we were able to implement RD-Tree and use
 | 
						|
          it  for index access to arrays of integers. This implementation
 | 
						|
          was  ideally  suited  for  small  arrays and eliminated complex
 | 
						|
          joins,  but  was practically useless for indexing large arrays.
 | 
						|
          The  next improvement came from an idea to represent a document
 | 
						|
          by  a  single bit-signature, a so-called superimposed signature
 | 
						|
          (see "Index Structures for Databases Containing Data Items with
 | 
						|
          Set-valued  Attributes",  1997,  Sven  Helmer  for details). We
 | 
						|
          developed  the  contrib/intarray  module and used it for full
 | 
						|
          text indexing.
 | 
						|
 | 
						|
   tsearch v1
 | 
						|
          It was inconvenient to use integer id's instead of words, so we
 | 
						|
          introduced  a new data type called 'txtidx' - a searchable data
 | 
						|
          type  (textual)  with  indexed access. This was a first step of
 | 
						|
          our  work  on  an  implementation of a built-in PostgreSQL full
 | 
						|
          text search engine. Even though tsearch v1 had many features of
 | 
						|
          a  search  engine it lacked configuration support and relevance
 | 
						|
          ranking.  People were encouraged to use OpenFTS, which provided
 | 
						|
          relevance  ranking based on positional information and flexible
 | 
						|
          configuration.  OpenFTS  v.0.34  is  the  last version based on
 | 
						|
          tsearch v1.
 | 
						|
 | 
						|
   tsearch V2
 | 
						|
          People  recognized  tsearch  as  a  powerful tool for full text
 | 
						|
          searching  and  insisted  on  adding  ranking  support,  better
 | 
						|
          configurability,  etc.  We already thought about moving most of
 | 
						|
          the  features  of  OpenFTS to tsearch, and in the early 2003 we
 | 
						|
          decided  to  work  on  a  new  version of tsearch. We abandoned
 | 
						|
          auxiliary  index  tables  which  were  used by OpenFTS to store
 | 
						|
          positional  information  and  modified the txtidx type to store
 | 
						|
          them  internally.  We added table-driven configuration, support
 | 
						|
          of  ispell  dictionaries,  snowball stemmers and the ability to
 | 
						|
          specify  which types of lexemes to index. Now, it's possible to
 | 
						|
          generate  headlines of documents with highlighted search terms.
 | 
						|
          These  changes make tsearch more user friendly and turn it into
 | 
						|
          a  really  powerful  full  text  search  engine. Brandon Rhodes
 | 
						|
          proposed  to  rename  tsearch  functions for consistency and we
 | 
						|
          renamed  txtidx  type  to tsvector and other things as well. To
 | 
						|
          allow  users  of tsearch v1 smooth upgrade, we named the module
 | 
						|
          as tsearch2. Since version 0.35 OpenFTS uses tsearch2.
 | 
						|
 | 
						|
References
 | 
						|
 | 
						|
   1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
 | 
						|
   2. http://snowball.tartarus.org/
 | 
						|
   3. http://openfts.sourceforge.net/
 | 
						|
   4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm
 | 
						|
   5. http:www.jfg-networks.com/
 | 
						|
   6. http://www.sai.msu.su/~megera/postgres/gist
 | 
						|
   7. http://www.sigaev.ru/gin/
 | 
						|
   8. http://openfts.sourceforge.net/
 | 
						|
   9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
 | 
						|
  10. http://www.sai.msu.su/~megera/wiki/Gendict
 | 
						|
  11. http://www.sai.msu.su/~megera/wiki/Tsearch2
 | 
						|
  12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
 | 
						|
  13. http://archives.postgresql.org/pgsql-general/
 |