Data Matching: To match, or not match, that is the question*
(*with apologies to William Shakespeare and Hamlet)
The idea of a trusted, reliable single view of data about an entity (e.g. a person, product, organisation or location) is central to delivering on the promise of customer experience, omnichannel engagement, digital transformation and reduced operational costs, as well as to complying with a plethora of industry and data protection regulations. A trusted, reliable single view of data is created by matching separate pieces of data that represent the same entity. Henrik Liliendahl describes data matching in this blog as follows:
“Data matching is about establishing a link between data elements and entities that do not have the same value, but are referring to the same real-world construct. The most common example is establishing a link between two different data records probably describing the same person as for example:
- Bob Smith at 1 Main Str in Anytown
- Robert Smith at One Main Street in Any Town
Data matching can be applied to other master data entity types as companies, locations, products and more.”
The ability to do data matching accurately, effectively and at scale is important because the volume of data collected by enterprises keeps growing. In this Dataversity article, Harald Smith says:
“As we pump more data into our data lakes, or other downstream data stores, though, the problem of grouping like pieces of data about an entity together re-emerges, and that impacts our ability to get accurate, trusted information.”
To obtain the best results, data matching requires business, domain and technology expertise: you need to combat issues such as overmatching and undermatching, and to know which matching techniques to use. In addition, data matching/entity resolution is an ongoing process whose results can change as the inputs are updated or the parameters around the match are adjusted.
Here’s a brief user’s guide to some of the potential “gotchas” and how to avoid them, so you get the best possible matches in your data:
What is overmatching? This is where your matching criteria are not strict enough and you match records that should not be matched. The consequences can be severe: for example, merging the health or financial records of two separate people could affect treatment or cause monetary problems.
What is undermatching? This is where your matching criteria are too strict, leaving more duplicates in your data than there should be. The outcomes can include poor customer experience, such as duplicate mailings and customer service reps being unable to get a complete picture of a client. For products, multiple entries for the same item can lead to over- or under-stocking and related problems.
Depending on the use case, undermatching may be better than overmatching (which can be very common with rules-based approaches), but in an ideal world your matching should be as accurate as possible. The consequences will very much depend on the cost to your business of getting the decision (to match or not to match) wrong.
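To make the trade-off concrete, here is a minimal sketch (illustrative records and scores only, using just the Python standard library, not any particular MDM product) of how the choice of a single similarity threshold drives both kinds of error:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1] based on difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical record pairs: the first is the same person, the second is not.
same_person = ("Robert Smith, One Main Street", "Bob Smith, 1 Main Str")
different_people = ("Mary Jones, 14 Oak Avenue", "Martin Jones, 41 Oak Road")

true_score = similarity(*same_person)        # score for the genuine duplicate
false_score = similarity(*different_people)  # score for the false candidate
print(f"true pair score:  {true_score:.2f}")
print(f"false pair score: {false_score:.2f}")

# A match threshold set below the false-pair score merges two different
# people (overmatching); one set above the true-pair score leaves Bob and
# Robert as duplicates (undermatching).
```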
How is matching done and what are the potential challenges?
There are a few approaches, most notably rules-based matching and AI/ML matching. Most solutions available today take an either/or approach, using one or the other. Each approach has inherent strengths and weaknesses.
Rules-based matching tends to use techniques such as “exact match” and “fuzzy match”. Rules-based approaches have the highest level of explainability: if two records match based on rule x, you know exactly why it happened (because of rule x).
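As a simple illustration (this is not Reltio’s rule syntax; the fields, rules and thresholds are hypothetical), a rules-based matcher might combine an exact comparison on one attribute with a fuzzy comparison on another, and report exactly which rule fired:

```python
from difflib import SequenceMatcher

def fuzzy(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rules_based_match(rec_a: dict, rec_b: dict) -> tuple[bool, str]:
    """Return (matched, reason). Rules and thresholds are illustrative only."""
    # Rule 1: exact match on a strong identifier.
    if rec_a.get("email") and rec_a.get("email") == rec_b.get("email"):
        return True, "rule 1: exact email match"
    # Rule 2: fuzzy name plus exact postcode.
    if (fuzzy(rec_a["name"], rec_b["name"]) >= 0.85
            and rec_a.get("postcode") == rec_b.get("postcode")):
        return True, "rule 2: fuzzy name + exact postcode"
    return False, "no rule fired"

a = {"name": "Robert Smith", "postcode": "AB1 2CD", "email": ""}
b = {"name": "Rob Smith", "postcode": "AB1 2CD", "email": ""}
print(rules_based_match(a, b))  # explains exactly why the pair matched (or not)
```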
In an AI/ML approach to matching, users can leverage more cutting-edge techniques such as active learning and/or reinforcement learning. However, if those same two records matched based on an ML algorithm, it may not be so clear why they matched, because machine learning techniques can be notoriously opaque. This has long been an issue in the world of machine learning. Early data mining workbenches, such as ISL’s Clementine (now IBM SPSS Modeler™), emphasised the ease with which users could combine the output of rules-based approaches such as decision trees or CHAID with the output of neural networks. This allowed maximum accuracy to be combined with the ability to explain clearly why predictions had been made, which is essential in areas such as credit risk in financial services.
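For comparison, here is a minimal sketch of the general ML idea (this is not Reltio Match IQ; the fields, the tiny training set and the use of scikit-learn are assumptions for illustration): compute similarity features for labelled record pairs and train a classifier on them.

```python
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def features(a: dict, b: dict) -> list[float]:
    """Similarity features for a candidate pair; the fields are illustrative."""
    sim = lambda x, y: SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return [sim(a["name"], b["name"]),
            sim(a["address"], b["address"]),
            1.0 if a["postcode"] == b["postcode"] else 0.0]

# Tiny hand-labelled training set (1 = same entity, 0 = different entities).
pairs = [
    ({"name": "Bob Smith", "address": "1 Main Str", "postcode": "AB1"},
     {"name": "Robert Smith", "address": "One Main Street", "postcode": "AB1"}, 1),
    ({"name": "Mary Jones", "address": "14 Oak Ave", "postcode": "ZZ9"},
     {"name": "Martin Jones", "address": "41 Oak Rd", "postcode": "XY2"}, 0),
]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

candidate = features(
    {"name": "Rob Smith", "address": "1 Main St", "postcode": "AB1"},
    {"name": "Robert Smith", "address": "One Main Street", "postcode": "AB1"})
# The model returns a match probability, but unlike a rule it does not
# explain *why* the pair scored the way it did.
print(model.predict_proba([candidate])[0][1])
```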
The power of this hybrid approach (rules-based plus ML) is that you can use rules for cases where you absolutely need 100% clarity on the reason for a match, and ML for cases where you perhaps don’t. In addition, a hybrid approach can use both techniques in combination on the same set of records as a double-verification system. In Reltio it is possible to get recommended matches with rules and with AI/ML (within Reltio Match IQ). When a data steward sees their rules-based matches also validated by the machine learning model, they can have a higher level of confidence in the prediction, thereby improving their productivity. Solutions that offer the ability to use both techniques can therefore lead to more accurate, workable predictions.
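To make the double-verification idea concrete, a hybrid verdict might look something like the hypothetical sketch below, which simply combines a rule result with an ML score (the field names and the 0.9 threshold are illustrative, not Reltio’s implementation):

```python
def hybrid_verdict(rule_matched: bool, rule_reason: str, ml_score: float,
                   ml_threshold: float = 0.9) -> dict:
    """Combine a rule verdict with an ML match score (threshold is illustrative)."""
    return {
        "rule_reason": rule_reason,                        # full explainability
        "ml_score": ml_score,
        "double_verified": rule_matched and ml_score >= ml_threshold,
        # Disagreement between rules and the model is a signal for steward review.
        "needs_review": rule_matched != (ml_score >= ml_threshold),
    }

print(hybrid_verdict(True, "rule 2: fuzzy name + exact postcode", 0.97))
```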
Here are two other potential issues to think about when considering how to go about matching and merging your data:
- Matching is not a “once and done” activity: you may need to adjust your matches based on new information, and potentially “roll back” past decisions or un-match two records. Users may need to change a decision when new data arrives and keep a record of those changes over time (a sketch of this follows the list below). In the Reltio Enterprise 360 platform it is easy to roll back a match/merge if new information later changes the decision. That is not the case with all MDM solutions.
- What are the real-time requirements? Your team may need to load data and match it at the same time to meet today’s “always on” demands, and not all systems can cope with this. With legacy MDM solutions it is not uncommon to hear about users having to delay the start of their day because the overnight batch run of match/merge across many dependent objects is still going, or worse, has failed and must be restarted, which often causes huge delays and the associated business impacts.
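As a rough illustration of the first point above (a hypothetical decision log, not Reltio’s API), a match decision can be recorded with enough context to roll it back later while keeping an audit trail:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MergeDecision:
    record_ids: tuple[str, str]
    matched: bool
    reason: str
    history: list[str] = field(default_factory=list)

    def roll_back(self, why: str) -> None:
        """Un-match the pair but keep an audit trail of the change."""
        stamp = datetime.now(timezone.utc).isoformat()
        self.history.append(f"{stamp}: was matched={self.matched}; rolled back because {why}")
        self.matched = False

decision = MergeDecision(("rec-001", "rec-002"), True, "rule 1: exact email match")
decision.roll_back("new identifier shows these are different people")
print(decision.matched, decision.history)
```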
Data changes can also impact records that were marked as not a match (this is a different scenario from unmerge). A user may have made a valid “not a match” decision about two records in the past, but if more data arrives later (e.g. a new address, phone or email), that decision could change. Via a configuration option, Reltio can revert the “not a match” decision and place those two records back into the queue for evaluation when critical attributes change (or arrive).
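A minimal sketch of that re-evaluation trigger might look like this (the attribute names and logic are illustrative, not actual Reltio configuration):

```python
# Hypothetical set of attributes whose arrival or change should re-open a pair.
CRITICAL_ATTRIBUTES = {"email", "phone", "address"}

def should_reevaluate(old_record: dict, new_record: dict) -> bool:
    """True if any critical attribute was added or changed by the update."""
    return any(new_record.get(attr) and new_record.get(attr) != old_record.get(attr)
               for attr in CRITICAL_ATTRIBUTES)

old = {"name": "Bob Smith", "email": None}
new = {"name": "Bob Smith", "email": "bob.smith@example.com"}
if should_reevaluate(old, new):
    print("re-queue pair for match evaluation")  # the earlier 'not a match' call is revisited
```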
The ability to monitor and review the backlog of match/merge decisions can also be useful. In Reltio we like to say that you can measure the time to “goldenisation” – that moment when you have achieved the golden record and all the match and data quality issues have been resolved. Within the Reltio Hub – the application where data stewarding operations are performed – match and DQ issues are automatically presented for resolution, and once they are resolved the record is deemed “golden”.
As far as time savings and efficiencies go, many data management professionals simply accept the status quo in a number of areas, but this doesn’t have to be the case. Within Reltio the norm is inline loading with automated, asynchronous execution of matching. This means that even while a load is occurring in Reltio, match results are produced for data stewards to begin their investigations. This avoids the traditional, legacy issues of trying to run batch match/merge outside normal business hours, and can offer significant operational savings in data steward and IT time.
The ability to use matching optimally to create a trusted, reliable single view of data that will help drive better business outcomes is at the heart of master data management. Doing this at scale and in real-time via rules and machine learning is a speciality of Reltio – please do come and chat to us about your data matching and mastering issues. You may decide “to match, or not to match” but whatever you decide, ensure that your enterprise master data management platform is helping you to continue moving forward with accurate, timely data and information. In other words:
“Master, go on, and I will follow thee
To the last gasp with truth and loyalty.”
William Shakespeare, As You Like It
Further Reading:
For more information on matching and related topics please take a look at these other articles:
- https://www.dataversity.net/getting-ahead-of-data-matching-building-the-right-strategies-for-today-and-tomorrow/
- https://www.dataversity.net/connecting-the-dots-strategies-for-matching-big-data/
- https://liliendahl.com/2018/11/28/data-matching-machine-learning-and-artificial-intelligence/
- https://liliendahl.com/2019/05/26/six-mdm-ai-and-ml-use-cases/
- https://towardsdatascience.com/an-overview-of-model-explainability-in-modern-machine-learning-fc0f22c8c29a
- https://algorithmia.com/blog/active-learning-machine-learning
- https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292