Strip HTML tags but keep the href attribute value
To match hostnames in links while still getting the benefits of tsearch, I created a small function that strips off HTML tags but keeps each link's href value intact for tsearch to tokenize.
The second regexp_replace is not strictly necessary, since tsearch ignores HTML tags anyway, or rather sees them as XML tag tokens.
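To see how tsearch treats raw markup, a quick look with ts_debug can help. This is only a sketch: the 'simple' configuration and the example.com link are my own picks for illustration, not something from the setup described here.

```sql
-- List the tokens the default parser produces for a raw HTML fragment.
-- The 'simple' configuration and the sample link are illustrative assumptions.
SELECT alias, token
FROM ts_debug('simple', '<a href="http://www.example.com/page">this page</a>')
WHERE alias <> 'blank';
```

The opening tag, href included, should come back as a single token of type tag. The stock configurations have no dictionary mapped to that type, so the hostname never reaches the index unless it is pulled out of the tag first.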
I'm sure there are cleverer ways to accomplish the same thing, but this seemed like a fine compromise for the moment. Thoughts and comments are of course welcome. :)
begin;

-- strip tags function
-- we use this to strip all html tags while still preserving the href
-- attribute value so tsearch can later match the host.
-- Does two runs:
-- 1) strip all tags containing the attribute href but preserve the
--    attribute value and put it in parentheses.
-- 2) strip off any remaining tags
-- Backslashes are doubled because E'' strings consume one level of escaping.
CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
    SELECT regexp_replace(
        regexp_replace($1,
            E'<[^>]*?(\\s* href \\s* = \\s* ([\'"]) ([^>]*?) ([\'"]) ) [^>]*?>',
            E' (\\3) ',
            'gx'),
        E'(< [^>]*? >)',
        E'',
        'gx')
$$ LANGUAGE SQL;

commit;
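As a rough check of the first pass on its own, something like the following can be run without installing the function. It is a sketch under assumptions: the sample markup, the CTE names (sample, pass1) and the 'simple' configuration are made up for illustration. Note the doubled backslash in \\s, which an E'' string needs so that the regex engine sees \s.

```sql
-- Apply only the href-preserving pass to a sample string, then check
-- whether tsearch can match the hostname afterwards.
-- The sample body and the 'simple' configuration are illustrative assumptions.
WITH sample(body) AS (
    VALUES ('<p>See <a class="ext" href = "http://www.example.com/page">this page</a>.</p>')
),
pass1 AS (
    SELECT regexp_replace(
               body,
               E'<[^>]*?(\\s* href \\s* = \\s* ([\'"]) ([^>]*?) ([\'"]) ) [^>]*?>',
               E' (\\3) ',
               'gx') AS body
    FROM sample
)
SELECT body AS after_first_pass,
       to_tsvector('simple', body) @@ to_tsquery('simple', 'www.example.com') AS host_matches
FROM pass1;
```

The link tag should be replaced by its URL in parentheses, while the remaining tags are left for tsearch to skip. That is why the second pass is mostly cosmetic when the result only feeds to_tsvector.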