A few years ago I took a call from an analyst at a hedge fund who was looking for external data that would, in his words, provide “alpha.” I explained that our company was connected to thousands of data sources and hundreds of thousands of public datasets; I told him that we were continuously pulling in open data from 70 countries, standardizing it through an ingestion pipeline trained against the largest catalogue of public data in the world, and serving it up via a suite of APIs that plug directly into any ecosystem.
There was a brief pause and the analyst said, “I have all that. Do you have anything I don’t?”
This surprised me. “Are you telling me that you’re already pulling down and processing data from every public agency and municipality in the USA?”
“No, but I could. It’s easy to get that stuff. I want something no one else has.”
This wouldn’t be the last time I’d hear someone make this claim. The misguided belief that public data is easy to access has defined the first ten years of the open data movement, much to the detriment of people who are actually trying to use it to make intelligent business decisions. In practice, public data is not easy to access and it’s not easy to use. We refer to this as the difference between available and accessible, and it’s a significant barrier to adoption.
For modern businesses, accessing and using data effectively is undeniably important. Whether it’s the data generated within the four walls of your organization, government open data, or even the data being released by other businesses, there’s lots of value that can be unlocked by piping these streams of information into analytical models or applications. With so much out there, executives are champing at the bit to gain access to new data, knowing that whoever innovates fastest can steal market share from whoever doesn’t adapt quickly enough.
This focus on innovation has led organizations to hire a lot of data scientists and analysts who are being tasked with creating new products, services, and applications that can be used to fuel business intelligence.
So what’s the problem?
The problem is the disconnect between a business division’s outcome-based thinking and a data scientist’s reality. When you’ve hired a fleet of data scientists, “innovation” is expected, but at ground level only a fraction of the work they’re doing can be considered groundbreaking. The majority of the time, data scientists are struggling with all of the things that my quant friend at the hedge fund thinks are simple: sourcing data, refining it, and plugging it into applications.
Because working with data is so difficult, we’re seeing growing dissatisfaction with data projects. The hype is there, the talent is there, but the results aren’t. Months or even years of work are going into ideas that never make it into production because of the operational hurdles involved in using data effectively. It’s demoralizing for data scientists, and it’s a pressing business concern for executives.
Things are looking grim. In 2016, Gartner estimated that 60% of big data projects failed, and a year later Gartner analyst Nick Heudecker said the number was likely closer to 85%. No matter how deep your pockets are, no sane business is going to keep throwing money at something that may fail more than four out of every five times.
The reality is that in order to innovate, all businesses have to optimize their data strategy.
From a management perspective, optimization starts with communication. There needs to be a better line of dialogue between the C-suite and data science divisions. Operational objectives should be clear, and data scientists have to be given the resources they need to do their jobs effectively. Having a data strategy is a step in the right direction, but has it been implemented? It’s not enough to want to be data-driven; you also need to understand what that entails and provide your team with the tools and support that enable them to put ideas into production.
The second way to optimize your data strategy is to speed up the time-consuming data management tasks that are plaguing data divisions everywhere.
In an ideal scenario, data scientists are empowered to experiment and try out ideas, but operationally this is often impossible. Sourcing, scraping, standardizing, refining, and integrating data simply takes too long. The result of this operational pitfall is one of two scenarios:
- ideas come from the top, where executives decide on key objectives and throw their data division at them; or
- data scientists work in controlled environments, using small, synthetic datasets to test out models.
By adopting DataOps frameworks and finding ways to automate the preparation and processing phases of gathering data, data scientists will be able to test and evaluate ideas faster and ensure their models are production-ready. This will lead to increased output, which will lead to better business outcomes. Optimization will lead to innovation.
Recently, I was in a meeting with someone who wanted to start using public data. I walked him through the platform and explained the steps involved in sourcing and integrating data.
He shrugged. “My data team can do all this.”
This time I was ready. “So why haven’t they?”