2 min read

The Difference Between Data Cleansing and Data Enrichment

Nikko Torres February 11, 2020 12:58:00 PM EST

Data Science DataOps Business Insights Data Marketplace Data Enrichment

Tucked away in portals across the internet, in different formats, with bad headers, strange encoding, and more – the world of data can be treacherous, and despite ‘data scientist’ being dubbed the sexiest job of the century, every data professional knows that it's not all glitz and glam. There are plenty of steps between getting your hands on the data and when it’s ready for modeling and analysis. Today, let’s look at the difference between data cleansing and data enrichment.

Think of data cleansing like the work of a prep cook: separate the good from the bad; wash off the dirt; measure; cut to size; then put the pieces in labeled containers. Data enrichment, on the other hand, is the chef’s touch: combining different ingredients to create a dish that’s greater than the sum of its individual components.

Raw ingredients vs. a prepared meal – just like raw and refined data

What is data cleansing?

Broadly, data cleansing consists of inspecting and formatting the data to make it suitable for analysis. Problems in the data need to be rectified in order to make the data useful for data scientists – they can be simple or complex, quick & painless or time-consuming & tedious. Common data problems include: incorrect data types (such as text cells in a currency field, or integers in a name column); data that does not conform to requirements or patterns (such as dates and times, postal code formats, e-mails, phone numbers); inconsistencies within the data (such as conflicting addresses for the same business, duplicates of rows); and many more.

What is data enrichment?

After the data is cleansed, it can be enriched. Data enrichment is the process of enhancing data by appending additional information to an existing dataset. The result is a richer dataset that makes it easier to derive more and better insights. For example, imagine you have a dataset containing a business name, their phone number, and address. Now we can enrich that data by pulling records from the stock market indices to get the stock price and valuation for each publicly-traded business. Additionally, we can pull from an online Maps API to get the precise location or add additional geospatial information. The possibilities are almost endless, and it’s critical to fully understand the problem you’re trying to solve when sourcing and combining data.

The impact of data variety

Data coming from a multitude of sources, both inside and outside an organization, introduces data variety. Depending on who you are, data variety can sound like a blessing or a curse. While variety is vital to getting unbiased, corroborated data, it also means jumping over massive hurdles to achieve data parity before any data can be engineered efficiently, or used together in a meaningful way. Like anything, an ounce of prevention is worth a pound of cure – having a data variety solution from the start of a project is the easiest way to ensure consistent progress throughout.

Why do these things matter?

Why does data need to be cleansed? Why does it need to be enriched? And most importantly, why is data cleansing and prep so important to your organization?

The one-word answer is: “Money.” Insights-driven organizations are growing at a rate of 30% annually and are on track to earn $1.2 trillion by 2020 – looking at these numbers, it would be downright bad business to ignore the value in a data-driven strategy.

Data variety is an obstacle for everyone, no matter how large the team. Although, the majority of companies are worried about a talent shortage when it comes to data, the root of the problem is a strategy shortage. Finding and preparing data is the most time-consuming and least enjoyable part of a Data Scientist's job. The below graph highlights the reallocation of investments when teams leveraging data prep automation tools.

Chart displaying the potential to reallocate time and increase productivity

As more and more organizations begin to adopt a data-driven approach to business, relying solely on internal data is no longer enough to be competitive. To stay ahead, organizations need to introduce external data sources into every piece of their data pipeline – and that means implementing adoption strategies and methodologies (like DataOps) to ensure high performance and consistent successful outcomes.

ThinkData offers a lot more data than that – over 250,000 thousand datasets from more than 75 countries around the world. Request a consultation with one of our data experts to talk about external data.

A futuristic loom weaving a digital fabric with the text:

4 min read

Data in 2024 — How Should You Prepare?

October 12, 2023 3:22:56 PM EDT

January is closer than it seems, and it's time to start planning (that's not just a note to myself about buying gifts before December 23rd)....

Featured image with a warning sign over a cargo ship and the text: How to better leverage data in risk management

4 min read

How to better leverage data for risk management and crisis response

September 8, 2023 1:06:16 PM EDT

It’s becoming increasingly difficult to manage risk in a global climate that’s growing in complexity. On the other side of this coin, however, is an...

Business Insights Supply Chain Data Use Case

3 min read

Next-Gen Data Catalogs: Eckerson TechVent Summary

July 18, 2023 9:50:46 AM EDT

Wondering what's up with data catalogs lately? We were part of a group of data and governance experts at a virtual event hosted by Eckerson Group to...

Business Insights Data Catalog

Data Catalog Platform

Featured Resource

Calculate your ROI

By Use Case

Featured Solution

See data's value

Data Products

Data Services

Featured Solution

Deliver insights

Company

Resources