Loose String Matching (Levenshtein for Redshift?)
Perhaps a bit tangential to text mining, but has anyone found a way to efficiently do string comparison in Redshift? Ideally something like levenshtein() or levenshtein_less_equal() in Postgres, which return a string-similarity (edit-distance) score.
There must be a better way than
where left(frequently_misspelled_text_field, [number]) ilike left(good_text_field, [number]) || '%'
which is inefficient and not robust.
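For contrast, in Postgres with the fuzzystrmatch extension I could just write something along these lines (the table names and the distance threshold here are made up for illustration):

select bad.*, good.good_text_field
from staging_table bad
join reference_table good
  on levenshtein_less_equal(lower(bad.frequently_misspelled_text_field), lower(good.good_text_field), 2) <= 2;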
This came up previously when I was trending patient data by state and city and had to deal with human-entered messes such as 'calafornia', 'Yexas' (which wouldn't be caught by the above unless I also used right()), 'NewYork', 'Massachusets', etc.
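If there's no built-in, would a scalar Python UDF be the sane route? This is a rough, untested sketch of what I had in mind, assuming the cluster supports Python UDFs (the function name is mine):

create or replace function f_levenshtein(a varchar, b varchar)
returns int
immutable
as $$
    # plain dynamic-programming (Wagner-Fischer) edit distance
    if a is None or b is None:
        return None
    a, b = a.lower(), b.lower()
    prev = range(len(b) + 1)
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]
$$ language plpythonu;

-- then, roughly:
-- where f_levenshtein(frequently_misspelled_text_field, good_text_field) <= 2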