I was just out exploring the Microsoft Downloads area this morning to see if there was anything new to check out. And what do you know, I came across a technology preview developed by Microsoft Research of a new Add-in for Excel 2010 – Fuzzy Lookup Add-In for Excel. The Add-in provides users to be able to compare two sets of data to do some cleansing and to get at a single representation text value. The reason that this is needed is to compare sets of data that are possibly coming from two different sources or maybe because of spelling mistakes or because the data is being manually entered in different ways on the front-end into your source system. Like say my name was entered as Dan English, Daniel English, Mr. Dan English, English Dan, and so on.
Here is the introduction information available in the PDF File ‘Fuzzy Lookup Add-in for Excel’ that is provided in this download.
A challenging problem in data management is that the same entity may be represented in multiple ways throughout the dataset. For instance, customer “Andy Hill” might also be present as “Mr. Andrew Hill” or “Hill, Andrew R.”. Variations can result from merging independent data sources, spelling mistakes, inconsistent naming conventions and abbreviations, or records with additional/missing information.
Fuzzy Lookup technology, developed by Microsoft Research, allows you to quickly identify data records which are textually similar. You can identify fuzzy duplicates within a single table or perform a fuzzy join between two different tables. The default configuration works well for a wide variety of data, but the matching may also be customized for specific domains.
Pretty cool. We have taken two sets of data that contained different names for the same items (in this case the company names) and the add-in has determined the correct matches so that we could come up with the total value of our stock portfolio.
Now there are additional advanced settings that are available in the Fuzzy Lookup pane which you can configure and the sample does provide an additional worksheet called Customization where you can go in and configure additional logic to translate items like say ‘Inc’ to ‘Incorporated’ and ‘USA’ to ‘United States of America’. This information can then be referenced in the Configure portion of the Fuzzy Lookup pane which is next to the ‘Go’ button.
This functionality is also available in the SQL Server Enterprise Edition (also in Datacenter and Developer Editions) of Integration Services (SSIS). If you are interested in checking out details on that here is a link to check out Fuzzy Lookup Transformation.