Playing around with full text search in PostgreSQL

I was finally able to play around with the full text search (former tsearch2 that's now a part of core in PostgreSQL 8.3).

The docs shows three ways of using the full text search:

  1. Ad hoc
  2. Indexed
  3. Pre-stored tsvector

I will use a small Python application to populate my database (modify as needed):

import urllib2
import re
import psycopg2
url = 'http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/2/1/0/2100/2100-8.txt'
raw = ''.join(urllib2.urlopen(url).readlines())
raw = raw.decode('iso-8859-1')
conn = psycopg2.connect("dbname=test")
curs = conn.cursor()
curs.execute("drop table if exists bible; create table bible(id serial primary  key, location text, quote text)")
conn.commit()
for quote in re.findall(r'[0-9]{3}\:[0-9]{3} .*', raw):
    location, text = quote.split(' ',1)
    text = text.strip()
    curs.execute("insert into bible(location, quote) values(%s,%s)", (location, text))
conn.commit()

With this test data loaded into our database we'll it's time to test a ad hoc full text search:

test=# select location, quote from bible where to_tsvector('swedish', quote) @@ to_tsquery('swedish', 'herre & gud & son');
 location |                              quote
----------+-----------------------------------------------------------------
 029:021  | Så säger HERREN Sebaot, Israels Gud, om Ahab, Kolajas son, och
 001:032  | Han skall bliva stor och kallas den Högstes Son, och Herren Gud
(2 rows)

Time: 587.860 ms
test=#


Let's create a GIN index on 'quote' and run our query again.

test$# create index quote_idx ON bible using gin(to_tsvector('swedish', quote));
test$# select location, quote from bible where to_tsvector('swedish', quote) @@ to_tsquery('swedish', 'herre & gud & son');
 location |                              quote
----------+-----------------------------------------------------------------
 029:021  | Så säger HERREN Sebaot, Israels Gud, om Ahab, Kolajas son, och
 001:032  | Han skall bliva stor och kallas den Högstes Son, och Herren Gud
(2 rows)

Time: 2.940 ms
test$#


There's a substantial difference between using index and not using index (of
course).

To try the pre-stored tsvector we alter the table to add a column and populate
it with the tsvector.

test=# alter table bible add column quotesearchable_index tsvector;
ALTER TABLE
Time: 11.422 ms
test=# update bible set quotesearchable_index = to_tsvector('swedish', quote);
UPDATE 31139
Time: 2367.304 ms
test=# create index quotesearchable_idx ON bible using gin(quotesearchable_index);
CREATE INDEX
Time: 799.223 ms
test=# select location, quote from bible where quotesearchable_index  @@ to_tsquery('swedish', 'lamm & offra & sabbatsdag');
 location |                              quote
----------+------------------------------------------------------------------
 028:009  | Men på sabbatsdagen skall du offra två årsgamla felfria lamm, så
(1 row)

Time: 0.829 ms
test=# select location, quote from bible where to_tsvector('swedish', quote) @@ to_tsquery('swedish', 'lamm & offra & sabbatsdag');
 location |                              quote             
----------+------------------------------------------------------------------
 028:009  | Men på sabbatsdagen skall du offra två årsgamla felfria lamm, så
(1 row)

Time: 0.935 ms
test=#

 
As can see with the comparable runs there is pretty much no difference here. I would guess the difference starts to show with big datasets.

Oh, and if you go with the pre-stored solution don't forget to create a trigger on insert and update that updates the tsvector.

I also tested a forth approach by storing the tsvector in it's own table and then join the original table when searching. That approach also needs the triggers for insert and update but also for delete. I'm not sure if this approach has some significant to offer but it was still fun to test.

References
http://www.postgresql.org/docs/8.3/interactive/textsearch.html
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

Kommentarer

Kommentera inlägget här:

Namn:
Kom ihåg mig?

E-postadress: (publiceras ej)

URL/Bloggadress:

Kommentar:

Trackback
RSS 2.0