BigQuery fuzzy matching functions

I recently read this amazing series of blogs by Brian Suk about BigQuery string fuzzy matching:

I really liked the ideas around string matching and encoding. The implementation given in each of these uses a series of helper functions to break down the problem into simpler steps. This was really helpful for understanding how this algorithm works.

I wanted to see if it could be done in one step, logically following the instructions in sql, using some of the features of BigQuery to easy the difficulty of this task.

Here’s the first soundex function I created:

I got the same results for the examples given

The levenshtein distance however, I was unable to convert from Javascript to SQL — the algorithm requires procedural logic that would be almost impossible to implement in SQL in BigQuery. Perhaps now that scripting and procedures are available in BQ (in beta mode for now), this might be possible, but I can’t say for sure . Keep your eyes peeled here —

Given that this is implemented in javascript, I wanted to find an optimal version of the algorithm for performance reasons. I found a heated debate on github that concluded that this was the fastest implementation available:

Great, all I had to do was convert this into a BigQuery function:

While reading around this topic, I found another version of this kind of algorithm called Damerau-Levenshtein distance, which allows matching strings with the addition of swapping 2 characters in place, which can reduce the distance.

I found someone who had forked the original javascript repo and modified it to use the new algorithm. Hey presto, we’ve got another algorithm available:

Lastly, the NYSIIS algorithm mentioned in part 3 of Brain’s series. He notes while running through the implementation that he’d have preferred to use a SQL function instead of Javascript for performance reasons. I thought the same, and that the algorithm didn’t look so far away from the soundex algorithm in terms of implementation. I thought the analytical functions and unnesting/aggregating functions of bigquery would make this pretty feasible, so I made a version myself:

This was a really interesting challenge, and I like the way the innermost query appears to be executing steps 1 and 2 of the algorithm, and slowly progressing through the later steps as we get to the outermost query. The results yield the same results as shown in Brian’s output.

Hope you find these helpful!