From experiments in robotics to research on old-school video game play, OpenAI’s work in artificial intelligence is meant to be shared.
With a mission to ensure powerful AI systems are safe, OpenAI cares deeply about open source—both benefiting from it and contributing safety technology to it. "The research that we do, we want to spread it as widely as possible so everyone can benefit," says OpenAI’s Head of Infrastructure Christopher Berner. That philosophy, along with the lab’s particular needs, led it to embrace an open source, cloud native strategy for its deep learning infrastructure.
OpenAI started running Kubernetes on top of AWS in 2016, and a year later migrated the Kubernetes clusters to Azure. "We probably use Kubernetes differently from a lot of people," says Berner. "We use it for batch scheduling and as a workload manager for the cluster. It’s a way of coordinating a large number of containers that are all connected together. We rely on our autoscaler to dynamically scale up and down our cluster. This lets us significantly reduce costs for idle nodes, while still providing low latency and rapid iteration."
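For readers less familiar with the batch-scheduling pattern Berner describes, here is a minimal sketch of submitting a run-to-completion Job through the official Kubernetes Python client. The image name, entrypoint, GPU count, and namespace are illustrative placeholders, not details of OpenAI's setup:

```python
# A minimal sketch of batch-style scheduling on Kubernetes, using the
# official Python client (https://github.com/kubernetes-client/python).
# All names below are hypothetical placeholders.
from kubernetes import client, config


def submit_training_job(name: str, image: str, gpus: int = 1) -> None:
    config.load_kube_config()  # read cluster credentials from ~/.kube/config

    container = client.V1Container(
        name=name,
        image=image,
        command=["python", "train.py"],  # hypothetical entrypoint
        resources=client.V1ResourceRequirements(
            # Requesting GPUs steers the pod onto GPU nodes; the autoscaler
            # can add such nodes when jobs queue up and remove them when idle.
            limits={"nvidia.com/gpu": str(gpus)},
        ),
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    containers=[container],
                    restart_policy="Never",  # batch jobs run to completion
                ),
            ),
            backoff_limit=2,  # retry a failed pod at most twice
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)


if __name__ == "__main__":
    submit_training_job("example-experiment", "example.registry/train:latest")
```

Because each Job runs to completion and then releases its node, an autoscaler can shrink the cluster as the queue drains, which is where the idle-node savings Berner mentions come from.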
In the past year, Berner has overseen the launch of several Kubernetes clusters in OpenAI’s own data centers. "We run them in a hybrid model where the control planes—the Kubernetes API servers, etcd and everything—are all in Azure, and then all of the Kubernetes nodes are in our own data center," says Berner. "The cloud is really convenient for managing etcd and all of the masters, and having backups and spinning up new nodes if anything breaks. This model allows us to take advantage of lower costs and have the availability of more specialized hardware in our own data center."
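To make the hybrid split concrete: clients and the on-prem kubelets all talk to the same Azure-hosted API servers, so the entire fleet is visible from a single endpoint. A small sketch of that view, again using the Python client; the kubeconfig context name and the node label are hypothetical, not OpenAI's actual conventions:

```python
# Sketch: the control plane endpoint lives in Azure, while the worker
# nodes it manages are on-prem machines. Context and label names are
# hypothetical placeholders.
from kubernetes import client, config

# The kubeconfig points at the Azure-hosted API servers; on-prem kubelets
# registered with that same endpoint when they joined the cluster.
config.load_kube_config(context="azure-control-plane")  # hypothetical context

for node in client.CoreV1Api().list_node().items:
    labels = node.metadata.labels or {}
    site = labels.get("example.com/datacenter", "unknown")  # hypothetical label
    ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"),
        "Unknown",
    )
    print(f"{node.metadata.name}: site={site} ready={ready}")
```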