January 20, 2023
First off, what is data integrity? Data integrity is the overall process of ensuring the accuracy and consistency of data over its entire lifecycle, from input and storage through retrieval and ongoing maintenance. Simply put, it’s the measure of how well data is preserved.
To skip ahead, we’ll cover:
Why is data integrity important?
What are data accuracy and data reliability?
What is data quality?
Data integrity is a part of larger data governance efforts
Why logical layers matter, and how to use them
Data integrity is foundational for building a data-driven culture and strengthens confidence in self-serve analytics.
Maintaining data integrity can help avoid the common obstacles in the analytics process, and keep analysts in the flow of limitless insights. While data integrity is best maintained with a checklist to run against your systems of data input, storage, and retrieval, there can be shortcuts.
For example, Mode recently unveiled Datasets, a feature that allows data teams to create verified, canonical datasets for use across the organization (you can sign up for the beta here). Datasets help ensure common data is greenlit for frequent, recurring use.
Data integrity is an umbrella term, and it covers a few pillars that represent different ways big data remains good data. These include data accuracy, data quality, and data governance—all important processes for data leaders to understand, and explain to their stakeholders.
Let’s break down what these components mean, and why they’re important.
Data accuracy and data reliability are closely related, but they answer different questions. Data accuracy asks: is this data true? Data reliability asks: is this data still true? Data can be inaccurate at collection, but it can also become untrustworthy over time due to issues like human error or formatting problems in the aggregation process.
Here are some common ways data can become untrustworthy:
Human error, like accidentally moving or deleting data. It might sound trivial, but storage and query limits in many applications force business teams to purge data frequently.
Inconsistent formatting, common when data passes through application-specific conventions (e.g., Excel).
Missing data. Different required fields across collection points can create incomplete pictures of some subjects but not others, which can make quality assurance misleading when testing campaigns.
Data freshness, or the lack of it. Is your data coming from a relevant, recent source? As data collection proliferates across every department, ensuring data freshness is becoming a time sink for data teams (see more on that below). Perfectly accurate data can still become unreliable, like a record for a customer who stopped going by their maiden name. It’s important to regularly audit data to ensure accuracy and reliability; a minimal staleness check is sketched after this list.
Skewed data. Outliers and other factors can result in asymmetrical data and misleading insights. Learn how to demystify skewed data for analysis.
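To make the audit idea concrete, here’s a minimal staleness check. It’s a sketch in Python that assumes a SQLite warehouse with illustrative table, column, and SLA values; a real warehouse would swap in its own connection and thresholds.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical freshness audit: flag any table whose newest record is older
# than an agreed SLA. Database, table, and column names are illustrative.
FRESHNESS_SLA = timedelta(hours=24)

def is_stale(conn: sqlite3.Connection, table: str, ts_column: str) -> bool:
    """Return True if the table's most recent timestamp breaches the SLA."""
    # Identifiers are assumed to come from a trusted, hardcoded list below.
    (newest_str,) = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    if newest_str is None:
        return True  # an empty table counts as stale
    newest = datetime.fromisoformat(newest_str)
    if newest.tzinfo is None:
        newest = newest.replace(tzinfo=timezone.utc)  # assume UTC if naive
    return datetime.now(timezone.utc) - newest > FRESHNESS_SLA

conn = sqlite3.connect("warehouse.db")
for table, ts_col in [("orders", "updated_at"), ("customers", "updated_at")]:
    if is_stale(conn, table, ts_col):
        print(f"{table} is stale; check the upstream load")
```

A scheduled job running a check like this can alert the team before stale data reaches a dashboard.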
Mode note: To address data reliability, we recently launched a data freshness integration with dbt that surfaces timestamp and origin information from dbt’s Metadata API in Mode reports. This saves data teams the time they’d otherwise spend hunting down whether data is stale or irrelevant.
An academic might define data quality as data that’s “fit for its intended uses in operations, decision making and planning.” For example, in medicine, pathologists collect vast amounts of data to diagnose an unknown medical issue, yet only a subset of that data may be needed to treat a known issue, especially a common one.
Data is the new raw material of business, and its “intended use” can shift rapidly. The goalposts of quality move as the question does, especially in the flow of ad hoc analytics, where one question leads to another. Common underlying issues that affect data quality include outdated calculations, duplicate data sources, inconsistent formatting, or simply not enough good data.
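To make two of those issues concrete, here’s a minimal sketch in Python showing how inconsistent formatting can hide duplicate records, and how normalizing values first exposes them. The field names and formats are illustrative, not from any particular system.

```python
import re

# Two copies of the same customer, obscured by casing, whitespace, and
# a different date format (hypothetical records for illustration).
customers = [
    {"email": "ada@example.com",   "signup": "2023-01-05"},
    {"email": "Ada@Example.com ",  "signup": "01/05/2023"},
    {"email": "grace@example.com", "signup": "2023-01-12"},
]

def normalize_email(raw: str) -> str:
    return raw.strip().lower()

def normalize_date(raw: str) -> str:
    """Coerce MM/DD/YYYY strings into ISO 8601 (YYYY-MM-DD)."""
    m = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", raw.strip())
    return f"{m.group(3)}-{m.group(1)}-{m.group(2)}" if m else raw.strip()

seen, duplicates = set(), []
for record in customers:
    key = (normalize_email(record["email"]), normalize_date(record["signup"]))
    if key in seen:
        duplicates.append(record)  # would slip through without normalizing
    seen.add(key)

print(duplicates)  # the second Ada record is flagged as a duplicate
```

Without the normalization step, a naive equality check would treat all three records as distinct.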
Some high-level tactics to ensure data quality might include:
Establish standardized procedures and protocols for creating and updating data, and make sure all employees understand these processes
Monitor and update backup systems so data can be restored in the event of a system crash, etc.
Set controls in place to detect anomalies in your datasets (see the sketch below)
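As one concrete form such a control might take, here’s a minimal sketch in Python: a z-score check that flags metric values far from the mean. The metric and threshold are assumptions for the example; production systems often use more robust methods, like rolling windows or median absolute deviation.

```python
import statistics

def flag_anomalies(values: list[float], z_threshold: float = 3.0) -> list[tuple[int, float]]:
    """Flag points more than z_threshold standard deviations from the mean.

    A control like this can run against daily metrics (row counts,
    revenue totals, etc.) to catch sudden shifts before they spread.
    """
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # no variation, nothing to flag
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mean) / stdev > z_threshold]

# Hypothetical daily row counts from a load job; the final day looks suspicious.
daily_rows = [10_120, 9_980, 10_340, 10_055, 9_870, 10_210, 2_114]
print(flag_anomalies(daily_rows, z_threshold=2.0))  # flags the 2,114-row day
```

The value of a check this simple is that it runs automatically; the team investigates the flagged day instead of discovering the shortfall in a quarterly report.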
Data governance is the precursor to privacy, security, and regulatory compliance. Data governance, the practice of defining who can access what data, is an essential component of data integrity: it allows data teams to trace issues like human error, incomplete data, or misuse of Personally Identifiable Information (PII) back to the teams responsible.
Data security, keeping data safe from internal and external threats, deserves a very long post of its own. But the evolving regulatory environment around data privacy points to a new kind of data oversight that is less about external liability and more about liability arising from your internal processes.
This is why data governance dovetails with data integrity: oversight of who can access what data is an essential component of any data privacy regulatory framework, like GDPR, CPRA, and others.
Data privacy and security are issues any organization needs to take seriously. At minimum, here are some high-level strategies data leaders can employ:
Develop and enforce strong password policies to prevent unauthorized access to databases
Establish protocols and systems like file integrity monitoring (FIM) for tracking changes made to data, so that any modification is traceable back to its source (a minimal FIM sketch follows this list)
Utilize secure data storage solutions, such as encryption, that protect confidential information from unauthorized access
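To illustrate the FIM idea from the list above, here’s a minimal sketch in Python using only the standard library: it records SHA-256 digests as a baseline, then reports files that have changed or disappeared. The directory and baseline paths are placeholders; dedicated FIM tools add alerting, permissions tracking, and tamper-resistant storage of the baseline itself.

```python
import hashlib
import json
from pathlib import Path

def digest(path: Path) -> str:
    """SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def build_baseline(root: Path, baseline_file: Path) -> None:
    """Record a digest for every file under root (paths are illustrative)."""
    hashes = {str(p): digest(p) for p in root.rglob("*") if p.is_file()}
    baseline_file.write_text(json.dumps(hashes, indent=2))

def check_against_baseline(baseline_file: Path) -> list[str]:
    """Return paths whose contents changed, or that vanished, since baseline."""
    baseline = json.loads(baseline_file.read_text())
    changed = []
    for path_str, expected in baseline.items():
        p = Path(path_str)
        if not p.exists() or digest(p) != expected:
            changed.append(path_str)  # modified or deleted since baseline
    return changed

# Example usage with hypothetical paths:
# build_baseline(Path("exports/"), Path("baseline.json"))
# print(check_against_baseline(Path("baseline.json")))
```

Even a sketch this small captures the core FIM contract: any change to monitored data is detectable and attributable to a point in time.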
Why is data integrity so important? Each component of data integrity protects against different issues. Ensuring teams act on accurate and reliable data saves time and reduces the cost of campaigns and outreach. Fixing data quality issues can avert delays or downtime. And solid data governance enables quick action when data access or privacy issues arise.
But data integrity is more than just the sum of its parts. Ensuring the preservation of data throughout its lifecycle builds trust in your data team across the organization. And it allows you to replicate and scale analysis with confidence.
Here at Mode, we're empowering data teams to be fast, unified, and trustworthy. As data leaders move further and further into the front office, they are responsible for creating a data hub that everyone relies on. Data integrity is key to ensuring that all of the processes, critical dashboards, and applications built on that data hub run as smoothly as possible, today and for years to come.
Want to keep up with where data is headed? Join us for a conversational webinar between Anna Filippova, Director, Community & Data at dbt Labs, and Benn Stancil, co-founder and Chief Analytics Officer at Mode, on the return of semantic layers, and what they mean for the data industry in 2023 and beyond.