Strip HTML tags but keep the href attribute value

To match hostnames in links while still getting the benefits of tsearch, I created a small function that strips HTML tags but keeps each link's href value intact for tsearch to tokenize.

The second regexp_replace is not strictly necessary, since tsearch does not index HTML tags: its parser classifies each one as an XML tag token, which the default text search configurations ignore.
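
One way to check this, as a sketch: ts_debug shows how the parser classifies each piece of a string. The sample markup and the www.example.com hostname are made up for illustration, and the 'simple' configuration is an assumption; substitute whichever configuration you index with.

```sql
-- List the token type the parser assigns to each piece of raw HTML.
-- A whole tag, href included, should come back as a single "tag" token,
-- so the hostname inside it is never tokenized on its own.
SELECT alias, token
FROM ts_debug('simple', '<a href="http://www.example.com/page">a link</a>');
```

That is the reason the href value has to be pulled out of the tag before the text reaches tsearch.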

I'm sure there are cleverer ways to accomplish the same thing, but this seemed like a fine compromise for the moment. Thoughts and comments are of course welcome. :)

begin;
    -- strip_tags function
    -- Strips all HTML tags but preserves the href attribute value so that
    -- tsearch can later match the host.
    -- Does two passes:
    -- 1) replace every tag containing an href attribute with the attribute
    --    value wrapped in parentheses.
    -- 2) strip any remaining tags.
    -- Note: inside an E'' string a regex escape such as \s must be written
    -- as \\s, otherwise the backslash is dropped and it matches a literal "s".
    -- \\s+ before href requires whitespace, so data-href and similar
    -- attributes are not mistaken for href.
    CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
        SELECT regexp_replace(
            regexp_replace($1,
                E'<[^>]*? (\\s+ href \\s* = \\s* ([\'"]) ([^>]*?) ([\'"]) ) [^>]*?>',
                E' (\\3) ',
                'gx'),
            E'(< [^>]*? >)',
            E'',
            'gx')
    $$ LANGUAGE SQL IMMUTABLE STRICT;
commit;
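
A usage sketch, assuming the function above has been created. The sample markup and the www.example.com hostname are invented for illustration, and 'simple' is just one possible text search configuration.

```sql
-- Strip the markup; the href value survives, wrapped in parentheses.
SELECT strip_tags('<p>See <a href="http://www.example.com/page">this page</a>.</p>');

-- The parser can now see the link; filter ts_debug for the host token.
SELECT alias, token
FROM ts_debug('simple',
              strip_tags('<p>See <a href="http://www.example.com/page">this page</a>.</p>'))
WHERE alias = 'host';

-- Match a document by hostname.
SELECT to_tsvector('simple',
                   strip_tags('<p>See <a href="http://www.example.com/page">this page</a>.</p>'))
       @@ plainto_tsquery('simple', 'www.example.com');
```

If the parser recognizes the host as expected, the last query should return true.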

Comments
Posted by: olga

Thank you

2010-03-03 @ 15:12:55
URL: http://nosite.ru
