Skip to the main content.
Data Catalog Platform
Featured Resource

Calculate your ROI

Need to know what benefit a catalog solution could offer you? Find out with our calculator.

Featured Solution

Deliver insights

Discover new revenue opportunities with secure data delivery

Featured Resource

Data Governance Essentials Kit

Level up your data governance program with interactive guides and webinars from industry data leaders.

4 min read

Refined Data is the Key to Train Machine Learning

All machine intelligence is powered by data. This isn't ground-breaking, or even news – we've known about data's value for decades now. However, not all data are created equal, and we'll be looking at executing machine learning products from a standpoint that prioritizes the quality of the data streaming into them.

Machine Learning (ML) is a specific subset of Artificial Intelligence that plays a key role in a wide range of critical applications such as image recognition, natural language processing, and expert systems to name a few.

A toy robot reading a book – not how machine learning works

ML solves problems that can’t be solved by numerical means alone, mainly through supervised or unsupervised learning:

  • In Supervised Learning, an algorithm is trained using a labeled set of input data and known responses to the data (output). It learns by comparing its actual output with correct output to find errors.
  • In Unsupervised Learning, the algorithm uses a set of unlabeled input data to explore the data and find some structure within.
Overview of machine learning workflow: 

An example of a machine learning workflow, from data, to modelling, to prediction

Example machine learning workflows, from raw data to predictions

The first step involves data collection, but this is just one part of the equation in building predictive solutions – the next, and often overlooked, step is the data prep.

Stages of Machine Learning development include:
  1. Data collection
  2. Data preparation
  3. Developing the hypothesis
  4. Model construction
  5. Model deployment
  6. Model evaluation

The current state of Machine Learning

IDC published their AI in 2020 and beyond predictions, in which they posit that the effective use of intelligent automation will require significant effort in data cleansing, integration, and management that IT will need to support.

Data cleansing is a manual and laborious task that continues to handcuff data professionals. Without effective methods to cleanse data efficiently using automation, organizations will not be able to achieve their digital transformation goals. The IDC FutureScape report states that resolving past data issues in legacy systems can be a substantial barrier to entry, particularly for larger enterprises, showcasing the difficulties endemic in the adoption of digital strategies.

Last year alone, organizations were expected to invest $1.3 trillion (USD) in digital transformation projects, according to Morningstar. McKinsey later announced that 70% of these projects fail. For those of you keeping score at home, those failures are costing businesses over $900B. No matter the size of lost investment, organizations cannot afford to continuously fail when it comes to digital transformation. Clean, standardized data is critical in unlocking the value of digital transformation efforts, but getting the data you want in the state you need is tedious, time-consuming, and expensive.

Typical workflows fail to get raw data in a usable format

A common misconception, highlighted through the above diagram, is that once all of the data is in a common place, we can plug it into whatever we want. But the process isn’t as simple as that. The data coming from different sources lack a common standard, and the applications sitting on top of the data can’t be fully implemented because there’s a massive barrier sitting between the data (lake or no lake) and the enterprise apps. Refined data needs to come in a reliable, repeatable, predictable format.

At the ground level, the operational hurdles that need to be overcome in order to unlock this value have slowed innovation to a crawl. In many organizations, data professionals are spending 80% of their time cleansing and prepping data instead of spending that time on analysis and modelling. Data professionals should not be burdened with the custodial tasks of data prep. Without a layer of automation in place, some tasks are hard, others are harder, and the rest are practically impossible.

“Data professionals should not be burdened with the custodial tasks of data prep.”

Predictive models are only as good as the data on which they’re built, so they need to be fuelled by refined, standardized data in order to produce results that are both practical and representative of reality.

The typical data refining process is complicated

When we see the process of refining data found in the wild, it's not as simple as plugging data into a BI layer or visualization tool. Often, there are many steps, something like this image:

Data refining is a time-consuming and multi-step process

Once the data is finally clean and in useable format, the next step is to pick an algorithm. There are trade-offs between several characteristics of algorithms, such as speed of training, predictive accuracy, and memory usage. With AutoML on the rise, algorithm selection is trending towards automation as well. If you are not familiar with Automated Machine Learning Systems you can read about it here, but generally speaking, with a refined dataset on hand, the AutoML system can design and optimize a machine learning pipeline for you.

Now that we've covered the ML workflow, let’s try to apply some of the concepts discussed by building a simple Product Recommender System powered by refined data and AutoML.

Strategy must catch up with talent

Many organizations are worried about a talent shortage when it comes to data, but the root of the problem is most often a strategy shortage. Data variety is an obstacle for everyone, no matter how large the team. There are a lot of platforms where you can find sources for building machine learning systems, and plenty more places to find some of the most frequently used datasets. A data management platform allows for faster model-ready deployment (among other advantages) by automating most of the custodial work, including the dataset refinement.

Data prep & processing, simplified

Feeding data through Namara handles the custodial tasks associated with turning raw data into usable assets. The workflow is simplified again, like in the first image:

Namara can save you time in the prep and processing stages

In this example, we'll examine an Amazon Customer Reviews Dataset sample that was refined using Namara. For collaborative filtering methods, the refined historical data is sufficient enough to make a prediction. The steps highlighted in the earlier image, ETL, Enrichment, Search, Role Management, and API, are handled by the platform.

Now that we have the ready-to-use data, let’s pick an algorithm. There are any number of ways we could go about this; however, let’s keep it simple for now, and use Turi Create to simplify the development of machine learning models in Python.

AutoML - The Code

Code example for AutoML
Once this has been built, it’s ready for model evaluation and deployment. We could also plot charts and visualizations to describe the data – as you can see, we're looking at just the tip of the iceberg when it comes to the analysis.

To sum it up

As mentioned earlier, 70% of digital transformation efforts fail, demonstrating the serious misalignment issue when it comes to predictive solutions and data strategies. Predictive models are only as good as the data fuelling them, so they need refined, standardized data in order to achieve meaningful results. Organizations must adopt and leverage data management strategies (think DataOps) as well as AutoML systems to unlock the value of data. This will lead to being able to deliver insights more efficiently, accurately, and at scale.



Want to learn more about aligning your business and data strategy? 

Request a consultation with one of our data experts or browse the largest catalog of solution ready data to determine how ThinkData’s tech can advance your projects.

A futuristic loom weaving a digital fabric with the text:

4 min read

Data in 2024 — How Should You Prepare?

January is closer than it seems, and it's time to start planning (that's not just a note to myself about buying gifts before December 23rd)....
Read More →
Featured image with a warning sign over a cargo ship and the text: How to better leverage data in risk management

4 min read

How to better leverage data for risk management and crisis response

It’s becoming increasingly difficult to manage risk in a global climate that’s growing in complexity. On the other side of this coin, however, is an...
Read More →
Blog title

3 min read

Next-Gen Data Catalogs: Eckerson TechVent Summary

Wondering what's up with data catalogs lately? We were part of a group of data and governance experts at a virtual event hosted by Eckerson Group to...
Read More →