As humanity passes knowledge into machines, our attention needs to shift to the way we teach our algorithms. Machine learning gets a lot of attention these days, but from a practical perspective what we need to be thinking much more about is machine training.
The way we train machines is going to have very important short-term and long-term implications. In the short term, there are real questions about the cost of training processes as more and more companies invest in machine learning. Over the longer term, there are important questions about exactly what our machines are learning and how we harness the best of what humanity has to offer in governing these systems.
It is with these points in mind that a recent collaboration between DeepMind and OpenAI seemed worth writing about. These experiments demonstrate a new approach that incorporates human preferences as a kind of governance feedback loop in the machine learning development process. In short, this research demonstrates a smart new approach to broadening human oversight over certain types of machine learning.
Careful What You Wish For

In 2003, Nick Bostrom pointed out an oddly terrifying possibility related to goal setting in Artificial Intelligence. He envisioned an AI with human-level intelligence that was given the simple directive of maximizing the number of paperclips in its collection. The unintended consequences of these simple instructions were that the system proceeded to take over more and more resources, eventually converting every atom in the solar system into a paperclip.
This little parable illustrates the difficulty of maintaining governing control over artificially intelligent systems that learn and change themselves over time. The story vividly highlights how dangerous it is to provide a system — any system — with a set of initial operating guidelines and then simply walk away. Even operating guidelines as sound as the United States Constitution still require judges to interpret them in the ever-changing context of our political system.
It’s not enough to be thoughtful about the initial training of our intelligent systems; we also need well-designed processes for maintaining human governance over the course of an algorithm’s ongoing operations. In short, AI needs good governance processes, and that means developing governance systems that integrate into our intelligent systems.
The DeepMind/OpenAI Experiments
What interests me about the DeepMind/OpenAI experiments is that they represent early steps in precisely this kind of governance system. In this respect, they remind me of the excitement I felt over how Riot Games built governance functionality into the machine learning systems it uses to handle player toxicity in its popular game, League of Legends. The difference with this newest research is that it could be generic enough to be used in lots of other machine training applications.
In very simple terms, what the researchers did was couple a governance system with an already existing approach to Reinforcement Learning. In some of the experiments, these algorithms were set loose on classic Atari games, much like the famous experiments done by DeepMind in 2013. In those earlier experiments, the system was able to rely on already existing, explicitly defined ‘reward functions’ (i.e., the game scoring). As they learned and improved game strategies, the algorithms could use the game scores to measure their success. By comparing the score achieved through one strategy to that achieved by another, the Reinforcement Learning algorithms eventually rocketed into superhuman gamer terrain.
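To make the idea of an explicit reward function concrete, here is a toy sketch in Python. It is not DeepMind's Atari system, which used deep neural networks on raw game pixels. It is a much simpler epsilon-greedy learner, and the game, its three strategies, and their average scores are all made up for illustration. The only point is that the learner improves purely by comparing the scores its strategies earn.

```python
import random

# Hypothetical game: each strategy earns a noisy score around a fixed average.
# These names and numbers are invented for illustration only.
AVERAGE_SCORES = {"hide": 10.0, "dodge": 25.0, "attack": 40.0}


def play(strategy, rng):
    """Return the game score for one episode played with the given strategy."""
    return rng.gauss(AVERAGE_SCORES[strategy], 5.0)


def learn_best_strategy(strategies, episodes, epsilon, rng):
    """Epsilon-greedy learning: mostly replay the best-scoring strategy so far,
    but occasionally try a random one to keep exploring."""
    totals = {s: 0.0 for s in strategies}
    counts = {s: 0 for s in strategies}

    def average(s):
        return totals[s] / counts[s]

    for _ in range(episodes):
        untried = [s for s in strategies if counts[s] == 0]
        if untried:
            choice = untried[0]               # try everything at least once
        elif rng.random() < epsilon:
            choice = rng.choice(strategies)   # explore
        else:
            choice = max(strategies, key=average)  # exploit the best average
        totals[choice] += play(choice, rng)   # the game score is the reward
        counts[choice] += 1

    return max(strategies, key=average)


rng = random.Random(0)
best = learn_best_strategy(list(AVERAGE_SCORES), episodes=500, epsilon=0.1, rng=rng)
print("Strategy the learner settled on:", best)
```

Notice that no human is involved anywhere in this loop: the score does all the teaching.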
What’s different about the latest research is that it didn’t rely on these kinds of explicit reward functions. Instead, non-expert human observers were shown short video clips that visualized the algorithms’ strategies, and then asked to select which ones best mapped to some particular objective. By comparing and rating some nine hundred pairs of such short clips, an amateur machine trainer was able to teach a software robot to do a virtual backflip.
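The comparison step described above can be sketched in a few lines of Python. The paper trains a neural network reward predictor from the human choices and then uses it to drive Reinforcement Learning. The sketch below swaps in a much simpler linear model fitted with a Bradley-Terry style logistic loss, and it covers only the reward-learning half. The clip features, the hidden "taste" weights, and the simulated trainer are all assumptions for illustration, and the 900 pairs simply echo the figure mentioned above rather than reproduce the experiment.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predicted_reward(weights, clip):
    """Linear reward estimate for a clip described by a feature list."""
    return sum(w * f for w, f in zip(weights, clip))


def simulate_comparisons(n_pairs, hidden_weights, rng):
    """Stand-in for a human trainer: for each random pair of clips, put the
    clip with the higher hidden reward first (preferred, rejected)."""
    comparisons = []
    for _ in range(n_pairs):
        a = [rng.uniform(-1, 1) for _ in hidden_weights]
        b = [rng.uniform(-1, 1) for _ in hidden_weights]
        if predicted_reward(hidden_weights, a) >= predicted_reward(hidden_weights, b):
            comparisons.append((a, b))
        else:
            comparisons.append((b, a))
    return comparisons


def train_reward_model(comparisons, n_features, lr=0.1, epochs=50):
    """Fit weights so preferred clips get higher predicted reward, by gradient
    ascent on the log-likelihood log sigmoid(r(preferred) - r(rejected))."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            diff = [p - r for p, r in zip(preferred, rejected)]
            p_correct = sigmoid(predicted_reward(weights, diff))
            for i in range(n_features):
                weights[i] += lr * (1.0 - p_correct) * diff[i]
    return weights


def agreement(weights, comparisons):
    """Fraction of pairs where the model ranks the preferred clip higher."""
    hits = sum(
        predicted_reward(weights, p) > predicted_reward(weights, r)
        for p, r in comparisons
    )
    return hits / len(comparisons)


rng = random.Random(0)
hidden = [2.0, -1.0, 0.5]  # the trainer's taste; never shown to the model
training_pairs = simulate_comparisons(900, hidden, rng)
held_out_pairs = simulate_comparisons(200, hidden, rng)

learned = train_reward_model(training_pairs, n_features=len(hidden))
print("Agreement with the trainer on unseen pairs:", agreement(learned, held_out_pairs))
```

The model never sees a score, only which clip of each pair the trainer liked better, yet it ends up with a reward function it can apply to clips nobody has rated.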
To be clear, the new system wasn’t perfect. With some games, like Beamrider, it performed better than the above-mentioned Reinforcement Learning using explicit reward functions. But for games like Space Invaders and Q*bert, it wasn’t able to match the power of simply using the reinforcement that the game scores offered. In the charts below, purple is the new approach, while orange represents using game scores.
Machine Training Efficiency
There are a few things that are striking about this approach. The first is the efficiency of the training.
The researchers claim that by breaking out their reward function in the way that they did, they reduce the ‘interaction complexity’ of the supervised learning that is done by humans by three orders of magnitude, or roughly 1,000 times less. That greater simplicity translates into less time and lower cost (the Atari experiments, for example, cost $25 in computing expenses and $36 in human training expenses). The researchers conclude: “these techniques can be economically scaled up to state-of-the-art reinforcement learning systems. This represents a step towards practical applications of deep RL (Reinforcement Learning) to complex real-world tasks.”
As noted at the outset, machine training efficiency is likely to be a critical source of economic productivity in the years ahead. So, from that perspective alone, this research is important.
Preference-Based Machine Training
The second thing worth noting about this research relates to the fact that the human trainers were not working within the constraints of pre-existing reward functions. Instead, they were using their own judgement to nudge the Reinforcement Learning algorithms this way or that based on their own preferences. That is why the title of the research paper announcing this work is “Deep Reinforcement Learning from Human Preferences” (italics mine). And while it is true that the trainers were given prompts like “make this robot do a backflip,” how they went about interpreting those goals was completely up to them.
In the case of the Atari racing game (Enduro) experiment, rather than training the system to follow the implicit reward function of the game, which would be to outrace competitors on the track, the trainers taught the system to simply ride alongside other vehicles for long stretches of time. That’s not the point of the game, but you could imagine some gamers doing something like that just because they could.
In short, by making room for human preferences in training the system, we also open the door to greater human creativity and expression.
Governance Systems
The final point worth noting is that, while the researchers don’t actually call it this, what they have built is a nicely functioning governance system that could conceivably plug into lots of different types of Reinforcement Learning applications.
One could easily imagine software developers hiring amateur trainers to assist in the initial machine training, but what’s even more intriguing is the thought of this training feedback continuing through end-user interactions well after the system is launched into the marketplace. I may do a follow-up to this post to explore some of these possibilities. For now, the important thing to focus on is that this research represents a really clear example of a human preference-based governance system.
That’s cool.
Pokemon Trainers!
On a lighter note, ten years ago or so, our boys were really into Pokemon. I mean, we watched a lot of Pokemon movies and TV shows. Reading through this research, I couldn’t help thinking back to all these crazy little creatures and the trainers who raised and competed with them. Like machine learning algorithms, Pokemon come in all types and with all different kinds of powers. They need human trainers and they evolve over time based on their experiences.
I’ve got this funny feeling that we’re all about to become Pokemon trainers of a sort, as machine training research like this makes its way into more and more products and services.


