What is Data Quality?
Data Quality is a central pillar of any Data Management framework. It assumes data is inherently “dirty,” challenged by deficiencies in completeness, accuracy, uniqueness, consistency, validity, integrity, reasonability, and currency. Cleaning data and restoring its integrity by eliminating these deficiencies increases its usefulness, and ultimately its value to the organization and to end data consumers. Given the complexity of the domain, it is useful to define data quality from multiple perspectives.
- Data quality refers to the challenges of collecting and integrating disparate and potentially dirty data sets from multiple systems.
- Data quality refers to the knowledge domain of data quality concepts and thinking around improving the business value of data generated by organizations.
- Data quality refers to the measurable characteristics of data sets that define it as high quality, fit for purpose, and useful.
- Data quality refers to the quality management frameworks and processes applied to data to improve and ensure its quality.
Understanding data quality
Data Quality as a Challenge
Initially, data must typically undergo some form of transformation to clean and refine it, “cleansing it of impurities.” Overcoming these deficiencies is necessary to produce high quality data that leads to useful insights and business value.
The initial challenge in managing the quality of data is understanding quality expectations. What quality does the business expect from its data? Clearly, high quality data is what is expected, but high quality must be defined in terms of business requirements. If the business requires accurate client and sales records, then the definition of high quality data should include clearly defined expectations, in this case that client records must meet 100% completeness and validity. A complete and valid customer data set would then contain contact information useful to both marketing and sales departments.
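As an illustration, an expectation like “client records must be 100% complete and valid” can be written down as a small, testable check. The sketch below is a minimal example in Python with pandas; the email and postal_code columns, the validity patterns, and the sample values are all assumptions for demonstration, not a prescribed standard.

```python
import re
import pandas as pd

# Hypothetical client records; column names and values are illustrative only.
clients = pd.DataFrame({
    "email": ["a@example.com", None, "c@example.com"],
    "postal_code": ["90210", "ABCDE", "10001"],
})

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simplified validity rule

# Completeness: share of non-null values in each required field.
completeness = clients[["email", "postal_code"]].notna().mean()

# Validity: share of populated values that satisfy the business rule.
email_validity = clients["email"].dropna().map(lambda v: bool(EMAIL_RE.match(v))).mean()
postal_validity = clients["postal_code"].dropna().str.fullmatch(r"\d{5}").mean()

print("Completeness by field:\n", completeness)
print("Email validity:", email_validity)
print("Postal code validity:", postal_validity)
```

Scores like these can then be compared against the completeness and validity targets the business has agreed on.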
In fairness, while it is desirable to have 100% complete, clean data, this may not be practical for every expectation. When the velocity of incoming data is high, maintaining clean data may require a cost-benefit analysis to determine acceptable expectations.
Data Quality as Measurable Data Characteristics
The properties of data naturally lend themselves to measurement, and measurement makes it possible to identify data errors. These findings are pivotal in making data improvements through the discovery and remediation of root causes. Often, expectations of a dataset’s quality are more optimistic than reality. Measuring data sets through activities like data profiling, a form of data analysis used to inspect data and its quality, can determine their objective quality along structural, content, and relational dimensions.
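As a rough sketch of what data profiling can look like in practice, the snippet below computes a few basic structural and content measures with pandas. The inline sample merely stands in for the data set under inspection, and dedicated profiling tools go much further.

```python
import pandas as pd

# In practice this would be loaded from a source system; a tiny inline sample
# stands in for the data set under inspection.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "b@example.com"],
    "signup_date": ["2023-01-05", "2023-02-11", None, "not a date"],
})

profile = {
    "row_count": len(df),
    "null_share_per_column": df.isna().mean().round(2).to_dict(),  # completeness signal
    "duplicate_rows": int(df.duplicated().sum()),                  # exact duplicates
    "distinct_counts": df.nunique().to_dict(),                     # cardinality hints
    "inferred_types": df.dtypes.astype(str).to_dict(),             # structural types
}

for metric, value in profile.items():
    print(metric, "->", value)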
Data Quality as Process Controls
Data quality management applies quality management techniques to data. It is the planning, implementation, and control of activities that ensure data is of high quality and fit for use by downstream data consumers.
Why is data quality important?
A data-centric organization believes that quality data forms the foundation of its processes and value creation. Because data drives business decisions, the quality of that data directly affects the ability of decision makers to reach correct and effective conclusions. The business drivers that support establishing a Data Quality Management program therefore include:
- Increasing the value and opportunities of organizational data
- Mitigating the risks and costs of relying on poor quality data through data quality improvements
- Improving organizational efficiency and productivity through processes and roles that support the creation of high quality data
- Branding the organization as a reliable data source and partner while protecting its reputation
Poor data quality can also negatively impact the organization and increase risk. The chief risks are missed business opportunities and operational drawbacks. Poor data quality generally leads organizations to make poor decisions because the underlying data may be incorrect or missing. Tangible drawbacks affect the business in many ways; some significant impacts include:
- Reduced capabilities, such as inability to invoice correctly
- Increased hindrances, such as growing volumes of customer support calls, and inability to resolve complaints
- Lost revenues from missed business opportunities, such as upselling
- Inability to react quickly to changing circumstances
- Strategic opportunity blindness
- Delays in data integrations especially in cases of mergers and acquisitions
- Increased risk exposure to fraudulent activity
- Poor decision making resting on poor quality data
- Diminished or damaged business reputation because data is faulty
Data-centric organizations also understand that high quality data is not permanent. Measuring data quality is just one stage in the ongoing data quality management lifecycle. As new data is generated or brought in from external sources, it must be integrated into the organization’s systems, which means data quality must be built over time atop those systems. As the data lifecycle revolves, opportunities arise to learn more about how the business collects and uses its data, and to hone the data rules that ensure quality.
Measuring data quality
Simply speaking, high quality data is correct and fit for purpose. On its own, this definition yields little insight. Many frameworks have been devised to expand it and to clarify for data handlers which dimensions of data to observe in order to improve quality. A question to ask of each dimension is how it contributes to the fitness of the data set for the purposes of the business.
The dimensions of data quality
A data quality dimension is a measurable characteristic of data. Five general dimensions help to define data quality more objectively (though several frameworks map 15 or more dimensions). They are called general dimensions because several frameworks have been developed over the years in an attempt to pin down the often elusive mission of improving data quality and the dimensions most relevant to it. Those frameworks, introduced below, highlight that data quality attracts many perspectives and deserves continuous examination to produce the most effective models.
Generally, to be of useful quality, data must be accurate, complete, fresh, standard, and representative of real life.
Accuracy — Data must be accurate to lead to proper conclusions. Inaccuracies can seep in during data entry, during data modifications, and at the beginning stages of database design, where improper specifications can be introduced.
Completeness — Null values in data sets are frowned upon. Companies should strive for complete data sets to maximize data impact. For instance, customer contact information is critical business information; incompleteness potentially leaves money on the table.
Freshness — Freshness describes the currentness of data, sometimes called data currency. Is data up to date and reliably usable, or has it gone stale and worthless?
Standardization — Data that conforms to standards is more insightful and portable. Standardization also increases data context and system interoperability.
Representative — Data must faithfully represent its real life counterpart. While accurate data helps to correctly identify entities, representation is different: data is more representative when its structures characterize the entity more closely. A company name may be recorded accurately, but alone it is not representative; representation increases by adding data points for address, TIN, contact information, and so on.
Because each of these dimensions is measurable, they can be improved over time. For example, upon discovering that a data set has a column with a significant share of null records, it may be prudent to question whether that column is needed at all. If it is found to be irrelevant for the end purpose, removing it before integrating the data set may be an option.
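Freshness and standardization lend themselves to the same treatment. The sketch below scores both for a hypothetical table with an updated_at timestamp and a country_code field; the 30-day freshness window and the two-letter code format are assumed thresholds, not universal rules.

```python
from datetime import datetime, timedelta, timezone
import pandas as pd

# Hypothetical records with an assumed last-modified timestamp and country code.
records = pd.DataFrame({
    "updated_at": pd.to_datetime(["2024-05-01", "2023-01-15", "2024-04-20"], utc=True),
    "country_code": ["US", "usa", "DE"],
})

# Freshness: share of rows touched within an assumed 30-day window.
cutoff = datetime.now(timezone.utc) - timedelta(days=30)
freshness = (records["updated_at"] >= cutoff).mean()

# Standardization: share of values matching the expected two-letter code format.
standardized = records["country_code"].str.fullmatch(r"[A-Z]{2}").mean()

print(f"Freshness: {freshness:.0%}, Standardization: {standardized:.0%}")
```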
Data quality frameworks
Many dimensional frameworks describing data quality have been proposed. Three of the most influential are described below; together they illustrate the advantage of seeing data from many different perspectives. The Strong-Wang framework views data dimensions from the standpoint of the end data consumer. Thomas Redman’s framework approaches data dimensions structurally. And the Larry English framework considers inherent and pragmatic data concerns.
Strong-Wang framework
The Strong-Wang framework uses 15 dimensions to measure data quality through the lens of the end data consumer. Its dimensions are grouped into intrinsic, contextual, representational, and accessibility qualities.
Intrinsic Data Quality
- Accuracy
- Objectivity
- Believability
- Reputation
Contextual Data Quality
- Value-added
- Relevancy
- Timeliness
- Completeness
- Appropriate amount of data
Representational Data Quality
- Interpretability
- Ease of understanding
- Representational consistency
- Concise representation
Accessibility Data Quality
- Accessibility
- Access security
Thomas Redman framework
The Thomas Redman framework describes data quality through the structure of data, defining data items as “representable triples”: a value drawn from the domain of an attribute within an entity. In this model, dimensions are associated with those three components.
Redman categorized 24 dimensions under three categories: the Data Model, which encompasses entities and attributes; Data Values; and a set of dimensions for the Representational rules used to record data items.
Data Model
- Relevance of data
- Ability to obtain values
- Clarity of definition
Level of detail
- Attribute granularity
- Precision of attribute domain
Composition
- Naturalness or realness
- Identifiability
- Homogeneity
- Minimum redundancy
Consistency
- Semantic consistency of components in model
- Structural consistency of attributes across entity types
Reaction to change
- Robustness
- Flexibility
Data Values
- Accuracy
- Completeness
- Currency
- Consistency
Representation
- Appropriateness
- Interpretability
- Portability
- Format precision
- Format flexibility
- Ability to represent null values
- Efficient use of storage
- Physical instance in accord with format
Larry English framework
The Larry English framework divides dimensions into two groups: inherent and pragmatic. Inherent dimensions are independent of data use and apply to data sets generally. Pragmatic dimensions tend to be use dependent and dynamic.
Inherent Dimensions
- Definitional conformance
- Completeness of values
- Validity of business rule conformance
- Accuracy to a surrogate source
- Accuracy to reality
- Precision
- Non-duplication
- Equivalence of redundant or distributed data
- Concurrency of redundant or distributed data
Pragmatic Dimensions
- Accessibility
- Timeliness
- Contextual clarity
- Usability
- Derivation integrity
- Rightness or fact completeness
Metadata and Data Quality
Metadata is another critical element in managing data quality. Metadata describes what data an organization has:
- What the data represents
- How data is classified
- Data sources
- How data flows into and throughout an organization
- How data evolves through use
- Access privileges
- Metadata about the quality of the organization’s data
Because metadata defines what data represents, it is the primary means by which expectations can be clarified. Through a process of organizational learning and feedback, standards and requirements can be formalized and documented, and therefore made measurable. A metadata repository can then house these documents to be shared with the whole organization to build consensus and drive continuous improvements to data quality. This of course brings up the realization that metadata itself is data and must also be managed, which is the main driver for metadata management tools.
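A metadata repository entry can be as simple as a structured record that pairs a data element with its documented quality expectations. The sketch below uses a plain Python dataclass; the fields, the customer.email element, and the example rules are illustrative assumptions rather than a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataEntry:
    """One documented data element and the quality expectations attached to it."""
    name: str
    description: str
    source_system: str
    classification: str
    quality_rules: list[str] = field(default_factory=list)

# Hypothetical entry for a customer email attribute.
entry = MetadataEntry(
    name="customer.email",
    description="Primary contact email captured at registration",
    source_system="CRM",
    classification="PII",
    quality_rules=["not null", "matches email format", "unique per customer"],
)
print(entry)
```

Entries like this, stored and shared through a metadata repository, are what make the documented expectations measurable and reviewable across the organization.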
Improving data quality
Improving data quality follows a pragmatic approach. It begins by defining data as it pertains to the organization. Once defined, data becomes measurable, so data expectations and business rules can be created and their improvement tracked. This process also highlights where potential improvements can be prioritized and turned into goals.
The following sections summarize an organizational process for discovering critical data, setting goals, and establishing a plan that deploys a data quality operational framework specific to the organization.
Define High Quality Data
First, a collaborative process must be undertaken to define what high quality data means for the organization’s success. The dimensional frameworks introduced so far form a foundation for understanding both the semantic and technical aspects of data. While defining high quality data inside the company, teams should take into account multiple dimensional perspectives, but also ask themselves a set of questions to determine their state of readiness.
- What is meant by high quality data to stakeholders?
- What would be the impacts of low quality data on the business?
- How would high quality data enable business strategy?
- What are the priorities driving improvements in data quality?
- What governance is in place that supports data quality improvements?
Define a Data Quality Strategy
Achieving improvements in data quality requires a data quality strategy (DQS) that lays out what is to be done and how. The DQS must be aligned with the business strategy it supports. Consider the following topics when developing a strategic framework.
- How are business needs prioritized?
- What data is critical for meeting those business needs?
- What business rules and data quality standards can be defined from the business requirements?
- What are the discrepancies between data and expectations?
- What feedback do stakeholders have on findings?
- How will issues that arise be prioritized and managed?
- What opportunities for improvement can be identified and prioritized?
- How will data quality be measured, monitored, and reported on?
- How will metadata be captured and documented?
- How will quality controls be integrated into business and technical processes?
Identify Critical Data and Business Rules
Because not all data is equally important, identifying the critical data and business rules is essential. Focus first on defining the most important data: the data that would deliver significantly greater value if its quality were improved. Good starting points are domains such as regulatory compliance, financial improvements, and direct customer impact.
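One lightweight way to capture critical data and business rules is as declarative entries that name the data element, the rule, and the business driver behind it, so they can later be evaluated and reported on automatically. The entries below are hypothetical examples, not recommended rules.

```python
# Hypothetical business rules for critical data elements; each entry names the
# element, a human-readable condition, and the business driver behind it.
business_rules = [
    {
        "element": "invoice.total_amount",
        "rule": "must be non-negative and non-null",
        "driver": "financial reporting accuracy",
    },
    {
        "element": "customer.tax_id",
        "rule": "must match the national TIN format",
        "driver": "regulatory compliance",
    },
    {
        "element": "customer.email",
        "rule": "must be unique and non-null",
        "driver": "direct customer communication",
    },
]

for r in business_rules:
    print(f"{r['element']}: {r['rule']} (driver: {r['driver']})")
```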
Perform an Initial Data Quality Assessment
After critical data and business rules have been defined, data analysts can begin querying and digging into the data to understand its content and relationships. Comparing the actual data to the rules and expectations for the first time will likely reveal a very different picture of the data than expected: undocumented dependencies and relationships, implied rules, redundant data, contradictions, and overlap.
Identify and Prioritize Potential Improvements
At this point, with the initial data quality assessment complete and the improvement process validated, it is time to dig deeper and perform an in-depth, full-scale data profile of the larger data sets in order to understand the extent of the issues to be addressed. Discussions with data stakeholders about the impact of those issues, combined with an analysis of business impact, can then be used to prioritize the improvements.
Define Goals for Data Quality Improvements
The preliminary assessment forms the foundation for a more specific data quality program. The prioritization of improvements can then be considered through the lens of short-term remedial goals versus longer term strategic goals. Ideally, goals should be aligned with the strategic goal of discovering root causes and establishing the controls and mechanisms to prevent further issues.
Develop and Deploy Data Quality Operations
A data quality program should implement a plan and technology that allow data analysts and data stewards to achieve the following operations:
- Manage data quality rules
- Measure and monitor data quality
- Develop operational procedures for managing data issues
- Establish data quality Service Level Agreements
- Develop data quality reporting
Some or all of the operations above can be handled by data quality tools or Master Data Management platforms.
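As a rough sketch of what measuring, monitoring, and reporting can look like underneath such tools, the snippet below evaluates two hypothetical rules against a small pandas DataFrame and flags any result that falls below an assumed 98% service level; the rules, data, and threshold are all illustrative.

```python
import pandas as pd

SLA_THRESHOLD = 0.98  # assumed service level for every rule

# Hypothetical rules: name -> function returning the share of rows that pass.
rules = {
    "email is present": lambda df: df["email"].notna().mean(),
    "amount is non-negative": lambda df: (df["amount"] >= 0).mean(),
}

# Tiny illustrative data set standing in for production data.
data = pd.DataFrame({
    "email": ["a@example.com", None, "c@example.com"],
    "amount": [100.0, -5.0, 42.0],
})

report = []
for name, check in rules.items():
    pass_rate = float(check(data))
    report.append({"rule": name, "pass_rate": pass_rate, "meets_sla": pass_rate >= SLA_THRESHOLD})

for row in report:
    status = "OK" if row["meets_sla"] else "BREACH"
    print(f"[{status}] {row['rule']}: {row['pass_rate']:.1%}")
```

Runs like this, scheduled regularly and fed into reporting, are one simple way a team can keep rule pass rates and SLA breaches visible to data stewards.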