Those of you who follow advancements in artificial intelligence (AI) have probably heard data called the new currency or the crown jewels. At the same time, a contingent of people believe data is a commodity. But it can’t be both a highly prized proprietary possession and an interchangeable good.
So which is it?
With college economics far behind many of us, here’s a quick refresher on what makes a commodity a commodity.
It’s fungible. The quality of a given commodity may vary slightly, but it’s essentially consistent across producers. Consider gold: The purity may differ across samples—9 karats vs. 24 karats, for instance—but gold is gold.
The market dictates the price, and it’s traded openly. The intersection of supply and demand serves as the pricing determinant of commodities. You can trade everything from pork futures to oil and precious metals on open markets like the Chicago Mercantile Exchange.
A commodity usually generates low margins. With no major differentiation from one product to the next, there’s minimal wiggle room and strong competition.
It has common standards, which makes it easy to exchange for goods of the same type.
A commodity is (usually) an input into a finished good. Oil on its own is essentially worthless; its value lies in its use as an energy source to fuel other products.
Is Data a Commodity?
Spoiler alert: It’s not. Data (that is, accurate, precise training data that teaches models to discover predictive relationships) offers the keys to the kingdom. It is a far cry from a commodity, and here’s why.
First, training data is not fungible. Consider the datasets we need to build autonomous navigation systems. A stoplight is a stoplight, so for a car to recognize one, you may think all you need is a series of positive and negative images to train a classifier. It might as well be the Not Hotdog app from HBO’s Silicon Valley. Except it’s not that simple. Stoplights don’t look the same in every country. Not to mention, how was the data captured? What type of camera did the car use? Where was it mounted? What’s the angle of capture, and is the stoplight (partially) obstructed? Was it a sunny day or a rainy night? Something as seemingly straightforward as labeling a stoplight is actually quite complex.
Second, is the market the sole determinant of price, and is it openly traded? There is no open market for training data. I suspect there never will be because many organizations closely guard data as some of their most valuable intellectual property. Let’s stick with our autonomous driving example. Companies in this industry are in a race to get to Level 4 autonomy first. It’s unlikely that they will share their proprietary data in the midst of competition this fierce. Nor will banks, insurance providers, e-commerce merchants, advertisers, or, given the choice, many of us.
Third: low margins. Is the differentiation between the methodologies of generating training data minimal? That’s debatable, and it’s hard to say if anyone really knows. We haven’t yet seen controlled experiments to determine if one method is truly better than another. As someone who works with dozens of leading AI practitioners around the world, I can tell you that AI teams will pay more for accurate and consistent training data to help scale their models. There are plenty of traditional crowdsourcing and outsourcing tools that offer more commoditized human knowledge as an API, but in that case, you’re getting what you pay for: lots of messy, imprecise data.
Fourth, there are no common standards. Let’s take autonomous vehicles again. A few regulations exist from regional and national governments, but they’re weak and inconsistent. In the U.S., the National Highway Traffic Safety Administration released guidance for vehicle performance for manufacturers and suppliers building autonomous vehicles, but it’s not a set of enforceable laws. The International Organization for Standardization issued ISO 26262 in 2011, but it provides dated guidance with no enforcement teeth. And states have tremendous authority in creating laws and regulations that apply to traffic locally. I think we’re a long way off from seeing any common standards in the U.S., let alone globally, for autonomous driving data.
And finally, is training data an input into a finished product? We can affirmatively check this box. It’s probably the most important input into an AI model, which is why so many people, myself included, liken it to the crown jewels.
You, Me, Machine Learning and the Data
If you think training data is a commodity, you’re in for a surprise. Companies are at vastly different stages of maturity in building AI applications. And it’s not enough to have petabytes of raw data; you need it labeled consistently to provide reliable precision and recall. Only then does it become the catalyst that differentiates and accelerates your AI. This makes applying AI harder in the short term but more interesting and valuable in the long term.
Oh, and it also means that we pesky carbon life forms are likely to provide crucial crown jewels for silicon to simulate and enhance our cognition for a long, long time to come.
Thanks to the good folks @venturebeat who ran this post on June 11 at https://venturebeat.com/2017/06/11/massive-data-sets-are-not-a-commodity-for-ai/