"Without data it's very hard to be intelligent." - Duncan Anderson, CTO of IBM Watson Europe.
We are living in the land-grab era of "Big Data." Whether you're running an e-commerce business, online directory, healthcare provider, or financial institution, you know that your ability to ingest, process, synthesize, and act on an ever-growing supply of metadata may very well determine your success or failure. The problem is, unlike land, data is effectively infinite. The average enterprise is seeing the data relevant to it grow geometrically. Ironically, a disproportionate amount of the value theoretically stored in that data is opaque to large-scale computing systems - it's locked in photos, video, and audio, and it's often highly personal, social, and context-specific.
So who are the "white hats" in this overwhelming struggle? Cognitive computing, artificial / augmented intelligence, and machine learning are riding to the rescue. The good news is that AI technologies, approaches, and services are improving significantly. The bad news is that the resulting models' ability to reason, emulate, and predict human responses in ways that actually help your business is limited by - you guessed it - the quality of the human-derived training data.
In order to train, test, and tune any AI, companies need human insights that are specific to their domain and of good quality. For example, a computer doesn't know on its own that an insurance claim is valid (or even what an insurance claim is). "Quality" is a dangerously ambiguous term. In this field, quality is really a ratio of confidence in ground truth to cost (a function of dollars, effort, and time). A data scientist needs to know how many people with specific traits agree, thereby establishing "truth." The more complex, subjective, and unstructured the puzzle, the harder truth is to determine. In theory, given infinite money, resources, and time, we can ask dozens of qualified medical accountants whether they agree that a given medical claim is valid and reimbursable at 95% for a given patient. But in practice, most engineers need to solve millions, or even billions, of such questions daily - economically and reliably.
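The agreement-based notion of ground truth described above can be sketched in a few lines of code: gather several annotators' judgments on the same item, take the majority label, and treat the agreement ratio as a confidence score to weigh against the cost of collecting more judgments. (A minimal sketch; the function name and confidence threshold below are illustrative, not from any particular library.)

```python
from collections import Counter

def ground_truth(judgments, min_confidence=0.8):
    """Aggregate annotator judgments via majority vote.

    Returns (label, confidence), where confidence is the share of
    annotators who agree with the majority label. If agreement falls
    below the threshold, returns (None, confidence) - the item is too
    ambiguous and should be escalated or given more judgments.
    """
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    confidence = votes / len(judgments)
    if confidence < min_confidence:
        return None, confidence
    return label, confidence

# Four of five reviewers agree the claim is valid:
print(ground_truth(["valid", "valid", "valid", "valid", "invalid"]))
# → ('valid', 0.8)

# A 3-2 split falls below the confidence bar:
print(ground_truth(["valid", "invalid", "valid", "invalid", "valid"]))
# → (None, 0.6)
```

The trade-off the paragraph describes shows up directly in the parameters: raising `min_confidence` or adding annotators buys more certainty about "truth," but each extra judgment costs dollars, effort, and time.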
Here's the thing: More than a billion people are regularly providing exactly that kind of domain-specific training data for Google and Facebook every day. We all teach them, with our searches, posts, comments, tags, emails, and reactions. This gives Google and Facebook an almost insurmountable advantage in a winner-take-all flywheel that is further accelerated by their embarrassment of engineering riches. In short, if you're not Google or Facebook, you're screwed. (Okay, there are other major players who are similarly advantaged. Amazon has reams of data about our buying, viewing, and even listening habits. And let's not count out Microsoft, nor forget WeChat or Baidu. Still, the amount of quality data that Google and Facebook are receiving daily is simply unparalleled.)
As we look to 2016, it's clear that companies need to stay competitive with their data. Those that do will win; those that don't find a solution will be buried. There is a shortage of domain-specific insights at quality and scale. This is where big data management comes into the picture - but contrary to popular belief, machine learning is not a silver bullet. The limiting factor in making sense of your data is the domain-specific training data you need to build useful machine-based models.
So how can companies other than Google and Facebook make sense of their data in 2016?
Break it down
It's easy to find yourself overwhelmed by big data initiatives, not knowing where to start or what to look for. A wise approach is to break it all into smaller, more manageable chunks. At Spare5, we break complex problems down into digestible tasks in order to provide quality insights at scale. You need to crawl before you can walk, so start small to understand how to analyze results and implement new strategies. An economy is only as efficient as its currency is small and fluid: Spare5 is revolutionizing the creation of high-quality, domain-specific training data by reducing the currency of human insight to the spare moments of targeted members of our curated community. Speaking of community...
Utilize a community
You're not going to build a community to rival that of Google and Facebook overnight. However, a targeted community willing to provide insights can help solve complex business challenges. "Targeted community" is the key here, though - a "crowd" isn't so helpful. There are a number of community-based resources out there to consider, so spend time understanding what benefits each provides and what makes the most sense for your business.
Train your machines. Well.
Machine learning is only as smart as the training provided by humans. There needs to be a marriage of machines and humans to produce the best results; machines are (at best) limited by the quality of the human insights training them. Seek out sources of high-quality, domain-specific training data - it's critical to effective, truly beneficial machine learning.
Don't let your big data become a big problem in 2016. Get in front of your data tidal wave, and you'll be reaping the benefits by this time next year.
(Thanks to vmblog, who originally published this entry with a bit more, um, color, back on December 18)