Playing around with full text search in PostgreSQL

I was finally able to play around with full text search (formerly tsearch2, which is now part of core in PostgreSQL 8.3).

The docs show three ways of using full text search:

  1. Ad hoc
  2. Indexed
  3. Pre-stored tsvector

I will use a small Python application to populate my database (modify as needed):

import re
import urllib2

import psycopg2

url = 'http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/2/1/0/2100/2100-8.txt'
raw = urllib2.urlopen(url).read().decode('iso-8859-1')

conn = psycopg2.connect("dbname=test")
curs = conn.cursor()
curs.execute("drop table if exists bible; "
             "create table bible(id serial primary key, location text, quote text)")
conn.commit()

# Each verse line looks like "NNN:NNN <text>".
for quote in re.findall(r'[0-9]{3}:[0-9]{3} .*', raw):
    location, text = quote.split(' ', 1)
    curs.execute("insert into bible(location, quote) values (%s, %s)",
                 (location, text.strip()))
conn.commit()

With the test data loaded into our database, it's time to test an ad hoc full text search:

test=# select location, quote from bible where to_tsvector('swedish', quote) @@ to_tsquery('swedish', 'herre & gud & son');
 location |                              quote
----------+-----------------------------------------------------------------
 029:021  | Så säger HERREN Sebaot, Israels Gud, om Ahab, Kolajas son, och
 001:032  | Han skall bliva stor och kallas den Högstes Son, och Herren Gud
(2 rows)

Time: 587.860 ms
test=#


Let's create a GIN index on 'quote' and run our query again.

test=# create index quote_idx ON bible using gin(to_tsvector('swedish', quote));
test=# select location, quote from bible where to_tsvector('swedish', quote) @@ to_tsquery('swedish', 'herre & gud & son');
 location |                              quote
----------+-----------------------------------------------------------------
 029:021  | Så säger HERREN Sebaot, Israels Gud, om Ahab, Kolajas son, och
 001:032  | Han skall bliva stor och kallas den Högstes Son, och Herren Gud
(2 rows)

Time: 2.940 ms
test=#


There's a substantial difference between using an index and not using one (of course).
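To verify that the planner actually uses the new index, EXPLAIN the query; with the GIN index in place the plan should show an index scan on quote_idx rather than a sequential scan over bible:

test=# explain select location, quote from bible where to_tsvector('swedish', quote) @@ to_tsquery('swedish', 'herre & gud & son');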

To try the pre-stored tsvector we alter the table to add a column and populate
it with the tsvector.

test=# alter table bible add column quotesearchable_index tsvector;
ALTER TABLE
Time: 11.422 ms
test=# update bible set quotesearchable_index = to_tsvector('swedish', quote);
UPDATE 31139
Time: 2367.304 ms
test=# create index quotesearchable_idx ON bible using gin(quotesearchable_index);
CREATE INDEX
Time: 799.223 ms
test=# select location, quote from bible where quotesearchable_index  @@ to_tsquery('swedish', 'lamm & offra & sabbatsdag');
 location |                              quote
----------+------------------------------------------------------------------
 028:009  | Men på sabbatsdagen skall du offra två årsgamla felfria lamm, så
(1 row)

Time: 0.829 ms
test=# select location, quote from bible where to_tsvector('swedish', quote) @@ to_tsquery('swedish', 'lamm & offra & sabbatsdag');
 location |                              quote             
----------+------------------------------------------------------------------
 028:009  | Men på sabbatsdagen skall du offra två årsgamla felfria lamm, så
(1 row)

Time: 0.935 ms
test=#

 
As you can see from the comparable runs, there is pretty much no difference here. I would guess the difference starts to show with bigger datasets.

Oh, and if you go with the pre-stored solution, don't forget to create a trigger on insert and update that keeps the tsvector up to date.
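PostgreSQL 8.3 actually ships with a built-in trigger function for exactly this, so a sketch can be as simple as (the trigger name is my own):

test=# create trigger quotesearchable_update before insert or update on bible for each row execute procedure tsvector_update_trigger(quotesearchable_index, 'pg_catalog.swedish', quote);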

I also tested a fourth approach: storing the tsvector in its own table and joining with the original table when searching. That approach also needs the triggers for insert and update, but additionally for delete. I'm not sure this approach has anything significant to offer, but it was still fun to test.
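For reference, a sketch of what that fourth approach could look like (the table and index names are made up); note that with an on delete cascade foreign key the delete case takes care of itself, leaving only the insert and update triggers:

test=# create table bible_search(id integer primary key references bible(id) on delete cascade, quotesearchable tsvector);
test=# create index bible_search_idx on bible_search using gin(quotesearchable);
test=# select b.location, b.quote from bible b join bible_search s on s.id = b.id where s.quotesearchable @@ to_tsquery('swedish', 'herre & gud & son');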

References
http://www.postgresql.org/docs/8.3/interactive/textsearch.html
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

plpgsql and replacing pgmemcache with pgmemcached

Today I ran into a bit of a problem while replacing pgmemcache with pgmemcached (gotta love the naming on these two).
pgmemcache returns True or False depending on success or failure, for example:

society=# select memcache_replace('boguskey', 'nonsense data');
 memcache_replace
------------------
 f
(1 row)

Time: 2.504 ms
society=#

With pgmemcached we get the following behavior:

society=# select pgmemcached_replace('boguskey', 'nonsense data');
ERROR:  memcached_result: NOT STORED
society=#

Since we're calling memcache from within triggers, it is no good when the triggers start to break on account of exceptions being raised. To deal with this we could, in the case of *_replace, catch others (to catch all exceptions) or internal_error (to catch the specific exception thrown by pgmemcached), or change the trigger to use the *_set function instead, since it behaves the same in both cases.
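A sketch of the exception handling variant, as a plpgsql trigger function (the function name and the location/quote columns are made up for the example; pgmemcached_replace as above):

create or replace function bible_memcache_update() returns trigger as $$
begin
    begin
        perform pgmemcached_replace(new.location, new.quote);
    exception
        when internal_error then
            null;  -- key not stored in memcached yet; ignore, or fall back to pgmemcached_set here
    end;
    return new;
end;
$$ language plpgsql;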

Is it wise to use *_set in this case? Every block using exception handling creates a subtransaction, so it could have an impact on performance, especially when used inside triggers.

I have yet to run some performance testing to see what impact *_set vs *_replace with exception handling will have on our database.

The lessons learned today were: BEGIN ... EXCEPTION, and using \set VERBOSITY 'verbose' to get the exact error code returned by a third party library (in our case XX000, which is internal_error). I also picked up others, since I was unaware of it.

Just a comment on the (obsolete) LAMP stack

Yeah, we all know the LAMP expression is kinda outdated by now, but still I ran into this piece.
It says LAMP is more or less symbolic these days and that the "A" is the only thing left. I beg to differ, though.

The choice of web servers for deploying web applications has never been greater. I see Apache being replaced by Lighttpd and nginx all the time, by large and small players alike.

Yes, LAMP is merely symbolic these days, if even that. The components in a deployment stack are pretty diverse today, and depending on the size of your deployment (traffic, data, etc.) you might be using a very different stack. (Think in terms of things like Varnish, memcached and other stuff that has found its place in today's extended stack.)

Today's web stack is more like a stack of LEGO bricks...

I am also pleased to see that MySQL is no longer the given choice for developing web applications. The knowledge of the ordinary web developer finally seems to include the idea that there are other database systems, and that there is no golden database to suit/solve all your needs. (Of course I would choose PostgreSQL over MySQL any day if the choice was between them. I don't have any strong argument for using MySQL in any situation anymore.)
I'm very glad this enlightenment has finally started to become common knowledge. It's simple, it's just like with anything else: choose the right tool for the job. (Otherwise everything will start to look like nails...)
