mirror of
				https://github.com/postgres/postgres.git
				synced 2025-11-04 00:02:52 -05:00 
			
		
		
		
	integration. It is much better to create a word list of unstemmed words than stemmed ones. Chris K-L
		
			
				
	
	
		
			143 lines
		
	
	
		
			4.1 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			143 lines
		
	
	
		
			4.1 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
trgm - Trigram matching for PostgreSQL
 | 
						|
--------------------------------------
 | 
						|
 | 
						|
Introduction
 | 
						|
 | 
						|
	This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
 | 
						|
 | 
						|
	The pg_trgm contrib module provides functions and index classes
 | 
						|
	for determining the similarity of text based on trigram
 | 
						|
	matching.
 | 
						|
 | 
						|
Definitions
 | 
						|
 | 
						|
	Trigram (or Trigraph)
 | 
						|
 | 
						|
	A trigram is a set of three consecutive characters taken
 | 
						|
	from a string.  A string is considered to have two spaces
 | 
						|
	prefixed and one space suffixed when determining the set
 | 
						|
	of trigrams that comprise the string.
 | 
						|
 | 
						|
	eg. The set of trigrams in the word "cat" is "  c", " ca", 
 | 
						|
	"at " and "cat".
 | 
						|
 | 
						|
Public Functions
 | 
						|
 | 
						|
	real similarity(text, text)
 | 
						|
 | 
						|
	Returns a number that indicates how closely matches the two
 | 
						|
	arguments are.  A zero result indicates that the two words
 | 
						|
	are completely dissimilar, and a result of one indicates that
 | 
						|
	the two words are identical.
 | 
						|
 | 
						|
	real show_limit()
 | 
						|
 | 
						|
	Returns the current similarity threshold used by the '%'
 | 
						|
	operator.  This in effect sets the minimum similarity between
 | 
						|
	two words in order that they be considered similar enough to
 | 
						|
	be misspellings of each other, for example.
 | 
						|
 | 
						|
	real set_limit(real)
 | 
						|
 | 
						|
	Sets the current similarity threshold that is used by the '%'
 | 
						|
	operator, and is returned by the show_limit() function.
 | 
						|
 | 
						|
	text[] show_trgm(text)
 | 
						|
 | 
						|
	Returns an array of all the trigrams of the supplied text
 | 
						|
	parameter.
 | 
						|
 | 
						|
Public Operators
 | 
						|
 | 
						|
	text % text (returns boolean)
 | 
						|
 | 
						|
	The '%' operator returns TRUE if its two arguments have a similarity
 | 
						|
	that is greater than the similarity threshold set by set_limit(). It
 | 
						|
	will return FALSE if the similarity is less than the current
 | 
						|
	threshold.
 | 
						|
 | 
						|
Public Index Operator Classes
 | 
						|
 | 
						|
	gist_trgm_ops
 | 
						|
 | 
						|
	The pg_trgm module comes with an index operator class that allows a
 | 
						|
	developer to create an index over a text column for the purpose
 | 
						|
	of very fast similarity searches.
 | 
						|
 | 
						|
	To use this index, the '%' operator must be used and an appropriate
 | 
						|
	similarity threshold for the application must be set.
 | 
						|
 | 
						|
	eg.
 | 
						|
 | 
						|
	CREATE TABLE test_trgm (t text);
 | 
						|
	CREATE INDEX trgm_idx ON test_trgm USING gist (t gist_trgm_ops);
 | 
						|
	
 | 
						|
	At this point, you will have an index on the t text column that you
 | 
						|
	can use for similarity searching.
 | 
						|
 | 
						|
	eg.
 | 
						|
 | 
						|
	SELECT
 | 
						|
		t,
 | 
						|
		similarity(t, 'word') AS sml
 | 
						|
	FROM
 | 
						|
		test_trgm
 | 
						|
	WHERE
 | 
						|
		t % 'word'
 | 
						|
	ORDER BY
 | 
						|
		sml DESC, t;
 | 
						|
 | 
						|
	This will return all values in the text column that are sufficiently
 | 
						|
	similar to 'word', sorted from best match to worst.  The index will
 | 
						|
	be used to make this a fast operation over very large data sets.
 | 
						|
 | 
						|
Tsearch2 Integration
 | 
						|
 | 
						|
	Trigram matching is a very useful tool when used in conjunction
 | 
						|
	with a text index created by the Tsearch2 contrib module. (See
 | 
						|
	contrib/tsearch2)
 | 
						|
 | 
						|
	The first step is to generate an auxiliary table containing all
 | 
						|
	the unique words in the Tsearch2 index:
 | 
						|
 | 
						|
	CREATE TABLE words AS SELECT word FROM
 | 
						|
		stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
 | 
						|
 | 
						|
	Where 'documents' is a table that has a text field 'bodytext'
 | 
						|
	that TSearch2 is used to search.  The use of the 'simple' dictionary
 | 
						|
	with the to_tsvector function, instead of just using the already
 | 
						|
	existing vector is to avoid creating a list of already stemmed
 | 
						|
	words.  This way, only the original, unstemmed words are added
 | 
						|
	to the word list.
 | 
						|
 | 
						|
	Next, create a trigram index on the word column:
 | 
						|
 | 
						|
	CREATE INDEX words_idx ON words USING gist(word gist_trgm_ops);
 | 
						|
 | 
						|
	Now, a SELECT query similar to the example above can be used to
 | 
						|
	suggest spellings for misspelled words in user search terms. A
 | 
						|
	useful extra clause is to ensure that the similar words are also
 | 
						|
	of similar length to the misspelled word.
 | 
						|
 | 
						|
	Note: Since the 'words' table has been generated as a separate,
 | 
						|
	static table, it will need to be periodically regenerated so that
 | 
						|
	it remains up to date with the word list in the Tsearch2 index.
 | 
						|
 | 
						|
Authors
 | 
						|
 | 
						|
	Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
 | 
						|
	Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia
 | 
						|
       
 | 
						|
Contributors
 | 
						|
 | 
						|
	Christopher Kings-Lynne wrote this README file
 | 
						|
 | 
						|
References
 | 
						|
 | 
						|
	Tsearch2 Development Site
 | 
						|
	http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
 | 
						|
	
 | 
						|
	GiST Development Site
 | 
						|
	http://www.sai.msu.su/~megera/postgres/gist/
 | 
						|
 |