A new industry of crowdsourced data labeling

Crowdsourcing Data Labeling for Artificial Intelligence in China

By now, most of us know that machine learning requires huge pools of data. We also know that the operating scale of companies like Facebook, Google, and Amazon gives them a powerful advantage in getting that data. What many of us don’t realize is that a growing industry specializes in the labor-intensive work of labeling all that data.

Data labeling tells a machine learning system that a particular pattern is a “cat” or a “dog.” Machines excel at pattern recognition, but they still rely on humans to make meaning from those patterns. One of the main ways to connect human meaning-making to the pattern-recognition capabilities of machines is for humans to label the data used to train machine learning algorithms.

A recent article in The Economist highlights the role of China’s data labeling infrastructure in the success of its growing AI sector. In particular, it highlights the work of one supplier of labeled data, called MBH:

Mr Liu claims that MBH’s trick is not just numbers, but the methods the firm uses to distribute labeling work efficiently to its workers. This is done using the same kind of machine-learning systems that Amazon, an American e-commerce giant, uses to recommend products to its customers. Instead of suggesting stuff to shoppers, MBH assigns labeling tasks to workers. First, it gathers data from its workers as they carry out labeling jobs. Mr Liu says the company records its workers’ gaze, mouse movements and keyboard strokes. It also takes note of what sort of data-labeling task the worker is performing, from medical-imagery labeling to text translation. By measuring performance according to the type of task, he says, he is able to find workers who are better at some tasks than others, and steer those tasks to those workers.
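To make the routing idea in that passage concrete, here is a minimal sketch in Python. It is not MBH's system: the article describes a recommender-style machine-learning model fed by gaze, mouse, and keyboard data, while this toy version only tracks how often each worker's labels pass review for each task type and sends new tasks to the worker with the best record. The class name `TaskRouter`, its methods, and the worker and task names are all hypothetical.

```python
from collections import defaultdict


class TaskRouter:
    """Toy router: send each task type to the worker with the best track record.

    Assumption: a simple per-task-type accuracy average stands in for the
    recommender-style model described in the article.
    """

    def __init__(self):
        # (worker, task_type) -> [correct_count, total_count]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, worker, task_type, correct):
        """Log one completed labeling job and whether it passed review."""
        entry = self._stats[(worker, task_type)]
        entry[0] += int(correct)
        entry[1] += 1

    def accuracy(self, worker, task_type):
        """Observed accuracy, or None if the worker has no history for this type."""
        key = (worker, task_type)
        if key not in self._stats:
            return None
        correct, total = self._stats[key]
        return correct / total

    def best_worker(self, task_type):
        """Worker with the highest observed accuracy on this task type, or None.

        Ties go to whichever worker was recorded first.
        """
        scored = [
            (correct / total, worker)
            for (worker, kind), (correct, total) in self._stats.items()
            if kind == task_type
        ]
        if not scored:
            return None
        return max(scored, key=lambda pair: pair[0])[1]


# Hypothetical history: two workers, two task types.
router = TaskRouter()
router.record("worker_a", "medical_imagery", True)
router.record("worker_a", "medical_imagery", True)
router.record("worker_b", "medical_imagery", False)
router.record("worker_a", "text_translation", False)
router.record("worker_b", "text_translation", True)

print(router.best_worker("medical_imagery"))
print(router.best_worker("text_translation"))
```

A real system would also need to handle workers with very little history, since one lucky result gives a perfect score here, and would have to keep giving newer workers enough tasks to measure them at all.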

China’s success at AI has relied on good data

What is particularly interesting about the way that MBH handles its data-labeling is that it relies on the same kind of outsourcing management techniques used by a company like Uber. In short, MBH crowdsources work to huge pools of contract contributors, most of whom live in China’s poorer rural areas.

MBH, in turn, acts as a supplier to many of China’s largest machine learning companies in areas like facial recognition. So, what we are seeing here is a whole data supply chain, as companies increasingly open themselves to external contributions of work and learning. In some cases, as with Facebook, Google, and other tech giants, the supply of data comes from end users. In other cases, it now comes from professional suppliers like MBH. But even these suppliers aren’t producing their labeled data the traditional way, by staffing up. Instead, they rely on the gig economy and draw from huge pools of contracted workers.
