mirror of
				https://github.com/postgres/postgres.git
				synced 2025-11-04 00:02:52 -05:00 
			
		
		
		
	
		
			
				
	
	
		
			201 lines
		
	
	
		
			8.4 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			201 lines
		
	
	
		
			8.4 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
An attempt at some sort of Full Text Indexing for PostgreSQL.
 | 
						|
 | 
						|
The included software is an attempt to add some sort of Full Text Indexing
 | 
						|
support to PostgreSQL. I mean by this that we can ask questions like:
 | 
						|
 | 
						|
	Give me all rows that have 'still' and 'nash' in the 'artist' or 'title'
 | 
						|
	fields.
 | 
						|
 | 
						|
Ofcourse we can write this as:
 | 
						|
 | 
						|
	select * from cds where (artist ~* 'stills' or title ~* 'stills') and 
 | 
						|
	(artist ~* 'nash' or title ~* 'nash');
 | 
						|
 | 
						|
But this does not use any indices, and therefore, if your database
 | 
						|
gets very large, it will not have very high performance (the above query
 | 
						|
requires a sequential scan of the table).
 | 
						|
 | 
						|
The approach used by this add-on is to define a trigger on the table and
 | 
						|
columns you want to do these queries on. On every insert to the table, it
 | 
						|
takes the value in the specified columns, breaks the text in these columns
 | 
						|
up into pieces, and stores all sub-strings into another table, together
 | 
						|
with a reference to the row in the original table that contained this
 | 
						|
sub-string (it uses the oid of that row).
 | 
						|
 | 
						|
By now creating an index over the 'fti-table', we can search for
 | 
						|
substrings that occur in the original table. By making a join between
 | 
						|
the fti-table and the orig-table, we can get the actual rows we want
 | 
						|
(this can also be done by using subselects - but subselects are currently
 | 
						|
inefficient in PostgreSQL, and maybe there're other ways too).
 | 
						|
 | 
						|
The trigger code also allows an array called StopWords, that prevents
 | 
						|
certain words from being indexed.
 | 
						|
 | 
						|
As an example we take the previous query, where we assume we have all
 | 
						|
sub-strings in the table 'cds-fti':
 | 
						|
 | 
						|
	select c.*
 | 
						|
	from cds c, cds-fti f1, cds-fti f2
 | 
						|
	where	f1.string ~ '^stills' and
 | 
						|
		f2.string ~ '^nash' and
 | 
						|
		f1.id = c.oid and
 | 
						|
		f2.id = c.oid ;
 | 
						|
 | 
						|
We can use the ~ (case-sensitive regular expression) here, because of
 | 
						|
the way sub-strings are built: from right to left, ie. house -> 'se' +
 | 
						|
'use' + 'ouse' + 'house'. If a ~ search starts with a ^ (match start of
 | 
						|
string), btree indices can be used by PostgreSQL.
 | 
						|
 | 
						|
Now, how do we create the trigger that maintains the fti-table? First: the
 | 
						|
fti-table should have the following schema:
 | 
						|
 | 
						|
	create cds-fti ( string varchar(N), id oid ) without oids;
 | 
						|
 | 
						|
Don't change the *names* of the columns, the varchar() can in fact also
 | 
						|
be of text-type. If you do use varchar, make sure the largest possible
 | 
						|
sub-string will fit.
 | 
						|
 | 
						|
The create the function that contains the trigger::
 | 
						|
 | 
						|
	create function fti() returns trigger as
 | 
						|
	    '/path/to/fti.so' language 'C';
 | 
						|
 | 
						|
And finally define the trigger on the 'cds' table:
 | 
						|
 | 
						|
	create trigger cds-fti-trigger after update or insert or delete on cds
 | 
						|
	    for each row execute procedure fti(cds-fti, artist, title);
 | 
						|
 | 
						|
Here, the trigger will be defined on table 'cds', it will create
 | 
						|
sub-strings from the fields 'artist' and 'title', and it will place 
 | 
						|
those sub-strings in the table 'cds-fti'.
 | 
						|
 | 
						|
Now populate the table 'cds'. This will also populate the table 'cds-fti'.
 | 
						|
It's fastest to populate the table *before* you create the indices.  Use the
 | 
						|
supplied 'fti.pl' to assist you with this.
 | 
						|
 | 
						|
Before you start using the system, you should at least have the following
 | 
						|
indices:
 | 
						|
 | 
						|
	create index cds-fti-idx on cds-fti (string); -- String matching
 | 
						|
	create index cds-fti-idx on cds-fti (id);     -- For deleting a cds row
 | 
						|
	create index cds-oid-idx on cds (oid);			 -- For joining cds to cds-fti
 | 
						|
 | 
						|
To get the most performance out of this, you should have 'cds-fti'
 | 
						|
clustered on disk, ie. all rows with the same sub-strings should be
 | 
						|
close to each other. There are 3 ways of doing this:
 | 
						|
 | 
						|
1. After you have created the indices, execute 'cluster cds-fti-idx on cds-fti'.
 | 
						|
2. Do a 'select * into tmp-table from cds-fti order by string' *before*
 | 
						|
   you create the indices, then 'drop table cds-fti' and
 | 
						|
   'alter table tmp-table rename to cds-fti'
 | 
						|
3. *Before* creating indices, dump the contents of the cds-fti table using
 | 
						|
   'pg_dump -a -t cds-fti dbase-name', remove the \connect
 | 
						|
   from the beginning and the \. from the end, and sort it using the
 | 
						|
   UNIX 'sort' program, and reload the data.
 | 
						|
 | 
						|
Method 1 is very slow, 2 a lot faster, and for very large tables, 3 is
 | 
						|
preferred.
 | 
						|
 | 
						|
 | 
						|
BENCH:
 | 
						|
~~~~~
 | 
						|
 | 
						|
Maarten Boekhold <maartenb@dutepp0.et.tudelft.nl>
 | 
						|
The following data was generated by the 'timings.sh' script included
 | 
						|
in this directory. It uses a very large table with music-related
 | 
						|
articles as a source for the fti-table. The tables used are:
 | 
						|
 | 
						|
product    : contains product information  :   540.429 rows
 | 
						|
artist_fti : fti table for product         : 4.501.321 rows
 | 
						|
clustered  : same as above, only clustered : 4.501.321 rows 
 | 
						|
 | 
						|
A sequential scan of the artist_fti table (and thus also the clustered table)
 | 
						|
takes around 6:16 minutes....
 | 
						|
 | 
						|
Unfortunately I cannot provide anybody else with this test-data, since I
 | 
						|
am not allowed to redistribute the data (it's a database being sold by
 | 
						|
a couple of wholesale companies). Anyways, it's megabytes, so you probably
 | 
						|
wouldn't want it in this distribution anyways.
 | 
						|
 | 
						|
I haven't tested this with less data.
 | 
						|
 | 
						|
The test-machine is a Pentium 133, 64 MB, Linux 2.0.32 with the database
 | 
						|
on a 'QUANTUM BIGFOOT_CY4320A, 4134MB w/67kB Cache, CHS=8960/15/63'. This
 | 
						|
is a very slow disk.
 | 
						|
 | 
						|
The postmaster was running with:
 | 
						|
 | 
						|
        postmaster -i -b /usr/local/pgsql/bin/postgres -S 1024 -B 256 \
 | 
						|
                -o -o /usr/local/pgsql/debug-output -F -d 1 
 | 
						|
 | 
						|
('trashing' means a 'select count(*) from artist_fti' to completely trash
 | 
						|
any disk-caches and buffers....)
 | 
						|
 | 
						|
TESTING ON UNCLUSTERED FTI
 | 
						|
trashing
 | 
						|
1: ^lapton and ^ric           : 0.050u 0.000s 5m37.484s 0.01%
 | 
						|
2: ^lapton and ^ric           : 0.050u 0.030s 5m32.447s 0.02%
 | 
						|
3: ^lapton and ^ric           : 0.030u 0.020s 5m28.822s 0.01%
 | 
						|
trashing
 | 
						|
1: ^lling and ^tones          : 0.020u 0.030s 0m54.313s 0.09%
 | 
						|
2: ^lling and ^tones          : 0.040u 0.030s 0m5.057s 1.38%
 | 
						|
3: ^lling and ^tones          : 0.010u 0.050s 0m2.072s 2.89%
 | 
						|
trashing
 | 
						|
1: ^aughan and ^evie          : 0.020u 0.030s 0m26.241s 0.19%
 | 
						|
2: ^aughan and ^evie          : 0.050u 0.010s 0m1.316s 4.55%
 | 
						|
3: ^aughan and ^evie          : 0.030u 0.020s 0m1.029s 4.85%
 | 
						|
trashing
 | 
						|
1: ^lling                     : 0.040u 0.010s 0m55.104s 0.09%
 | 
						|
2: ^lling                     : 0.030u 0.030s 0m4.716s 1.27%
 | 
						|
3: ^lling                     : 0.040u 0.010s 0m2.157s 2.31%
 | 
						|
trashing
 | 
						|
1: ^stev and ^ray and ^vaugh  : 0.040u 0.000s 1m5.630s 0.06%
 | 
						|
2: ^stev and ^ray and ^vaugh  : 0.050u 0.020s 1m3.561s 0.11%
 | 
						|
3: ^stev and ^ray and ^vaugh  : 0.050u 0.010s 1m5.923s 0.09%
 | 
						|
trashing
 | 
						|
1: ^lling (no join)           : 0.050u 0.020s 0m24.139s 0.28%
 | 
						|
2: ^lling (no join)           : 0.040u 0.040s 0m1.087s 7.35%
 | 
						|
3: ^lling (no join)           : 0.020u 0.030s 0m0.772s 6.48%
 | 
						|
trashing
 | 
						|
1: ^vaughan (no join)         : 0.040u 0.030s 0m9.075s 0.77%
 | 
						|
2: ^vaughan (no join)         : 0.030u 0.010s 0m0.609s 6.56%
 | 
						|
3: ^vaughan (no join)         : 0.040u 0.010s 0m0.503s 9.94%
 | 
						|
trashing
 | 
						|
1: ^rol (no join)             : 0.020u 0.030s 0m49.898s 0.10%
 | 
						|
2: ^rol (no join)             : 0.030u 0.020s 0m3.136s 1.59%
 | 
						|
3: ^rol (no join)             : 0.030u 0.020s 0m1.231s 4.06%
 | 
						|
 | 
						|
TESTING ON CLUSTERED FTI
 | 
						|
trashing
 | 
						|
1: ^lapton and ^ric           : 0.020u 0.020s 2m17.120s 0.02%
 | 
						|
2: ^lapton and ^ric           : 0.030u 0.020s 2m11.767s 0.03%
 | 
						|
3: ^lapton and ^ric           : 0.040u 0.010s 2m8.128s 0.03%
 | 
						|
trashing
 | 
						|
1: ^lling and ^tones          : 0.020u 0.030s 0m18.179s 0.27%
 | 
						|
2: ^lling and ^tones          : 0.030u 0.010s 0m1.897s 2.10%
 | 
						|
3: ^lling and ^tones          : 0.040u 0.010s 0m1.619s 3.08%
 | 
						|
trashing
 | 
						|
1: ^aughan and ^evie          : 0.070u 0.010s 0m11.765s 0.67%
 | 
						|
2: ^aughan and ^evie          : 0.040u 0.010s 0m1.198s 4.17%
 | 
						|
3: ^aughan and ^evie          : 0.030u 0.020s 0m0.872s 5.73%
 | 
						|
trashing
 | 
						|
1: ^lling                     : 0.040u 0.000s 0m28.623s 0.13%
 | 
						|
2: ^lling                     : 0.030u 0.010s 0m2.339s 1.70%
 | 
						|
3: ^lling                     : 0.030u 0.010s 0m1.975s 2.02%
 | 
						|
trashing
 | 
						|
1: ^stev and ^ray and ^vaugh  : 0.020u 0.010s 0m17.667s 0.16%
 | 
						|
2: ^stev and ^ray and ^vaugh  : 0.030u 0.010s 0m3.745s 1.06%
 | 
						|
3: ^stev and ^ray and ^vaugh  : 0.030u 0.020s 0m3.439s 1.45%
 | 
						|
trashing
 | 
						|
1: ^lling (no join)           : 0.020u 0.040s 0m2.218s 2.70%
 | 
						|
2: ^lling (no join)           : 0.020u 0.020s 0m0.506s 7.90%
 | 
						|
3: ^lling (no join)           : 0.030u 0.030s 0m0.510s 11.76%
 | 
						|
trashing
 | 
						|
1: ^vaughan (no join)         : 0.040u 0.050s 0m2.048s 4.39%
 | 
						|
2: ^vaughan (no join)         : 0.030u 0.020s 0m0.332s 15.04%
 | 
						|
3: ^vaughan (no join)         : 0.040u 0.010s 0m0.318s 15.72%
 | 
						|
trashing
 | 
						|
1: ^rol (no join)             : 0.020u 0.030s 0m2.384s 2.09%
 | 
						|
2: ^rol (no join)             : 0.020u 0.030s 0m0.676s 7.39%
 | 
						|
3: ^rol (no join)             : 0.020u 0.030s 0m0.697s 7.17%
 |