From prototype to product with hybrid neural networks
Apache MXNet and the middle path between declarative and imperative programming.
After several decades as an interest of academic computer scientists and specialized research labs, deep learning is appearing widely in real products. That transition has led to several exciting new deep-learning frameworks that tend to emphasize either rapid prototyping or efficient deployment at scale. For product developers looking to experiment with an idea and then refine and deploy it, a single framework that supports both ends of the process is helpful.
Apache MXNet, an open-source deep learning framework first published in 2015, aims to achieve exactly that. I recently talked with Mu Li, principal scientist at Amazon and one of the original authors of MXNet; he was also lead author of the “Parameter Server” paper, whose approach allows MXNet to scale almost linearly with additional processing power. He walked me through the origins of the framework and his vision of “hybridized” neural network models that can carry a single project from early prototyping through deployment at scale.
MXNet emerged from two goals that are classically opposed in computer science: ease of use and high performance. Declarative structures enable high performance in deep learning: the programmer specifies a network structure at a high level, and the framework implements low-level native routines to build the network in the most efficient manner possible.
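To make the contrast concrete, here is a minimal sketch of the declarative style using MXNet's classic symbolic API (assuming MXNet 1.x; the layer sizes and names are arbitrary). The entire graph is declared before any data flows through it:

```python
import mxnet as mx

# Declare the whole computation graph up front; nothing executes yet.
data = mx.sym.Variable('data')
fc1  = mx.sym.FullyConnected(data=data, num_hidden=128, name='fc1')
act1 = mx.sym.Activation(data=fc1, act_type='relu', name='relu1')
fc2  = mx.sym.FullyConnected(data=act1, num_hidden=10, name='fc2')
net  = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

# Because the framework sees the full graph before execution, it can plan
# memory reuse and scheduling, then bind the graph to a device for training.
mod = mx.mod.Module(symbol=net, data_names=('data',),
                    label_names=('softmax_label',))
```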
The drawback is that these declarative frameworks require the programmer to know and specify the network structure—its computation graph—at the outset. That makes iteration slow and experimentation difficult; discovering the best network structure for a particular problem is arguably the principal task for a deep-learning engineer. In some cases, like long short-term memory (LSTM) networks, the structure of the neural network can depend on control statements (loops and ‘if’ statements) that can’t be evaluated until data is fed in, so programmers need to use equivalent statements provided by the frameworks, which means a lot of mucking about with low-level code.
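For example, unrolling a recurrent cell with MXNet's symbolic mx.rnn API (a sketch, assuming MXNet 1.x; the sequence length and hidden size are illustrative) bakes the sequence length into the graph itself:

```python
import mxnet as mx

seq_len = 35                     # must be fixed when the graph is built
data = mx.sym.Variable('data')   # expected shape: (batch, seq_len, input_dim)

cell = mx.rnn.LSTMCell(num_hidden=256, prefix='lstm_')
# The recurrence is unrolled into a static graph of exactly seq_len steps;
# handling a different sequence length means constructing a new graph.
outputs, states = cell.unroll(length=seq_len, inputs=data,
                              layout='NTC', merge_outputs=True)
```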
Declarative frameworks can also be unintuitive to programmers who are accustomed to an interactive trial-and-error approach—an essential shortcoming as interest in deep learning explodes. Many newcomers to deep learning don’t have systematic training in symbolic linear algebra and want to teach themselves through active experimentation.
The deep learning community has responded by introducing a handful of imperative frameworks—notably PyTorch and Chainer—that execute programs line-by-line and allow programmers to use complex control statements to change network structure programmatically and on-the-fly. The drawback here is in performance: if the programmer doesn’t specify the full structure of the network before running it, the framework can’t pre-compile it to wring the best possible performance out of specialized hardware accelerators like GPUs.
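MXNet exposes the same imperative style through its Gluon interface. A minimal sketch (assuming MXNet 1.x with Gluon; the class name, layer sizes, and threshold are made up for illustration) reads like ordinary Python:

```python
from mxnet import nd
from mxnet.gluon import nn

class GatedNet(nn.Block):
    """Imperative model whose control flow depends on runtime values."""
    def __init__(self, **kwargs):
        super(GatedNet, self).__init__(**kwargs)
        self.dense1 = nn.Dense(128, activation='relu')
        self.dense2 = nn.Dense(10)

    def forward(self, x):
        x = self.dense1(x)
        # A plain Python 'if' on a value computed from the data: awkward to
        # express in a pre-declared graph, trivial in imperative code.
        if x.norm().asscalar() > 1.0:
            x = x * 0.5
        return self.dense2(x)

net = GatedNet()
net.initialize()
out = net(nd.random.uniform(shape=(4, 64)))   # executes eagerly, line by line
```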
“The problem [with imperative programming] is that it’s hard to optimize, because you never know what you’re going to write in the next sentence,” says Li. “You don’t know whether results in memory will be re-used, so it’s hard to get performance right. But it’s ideal for people who want to hack some code together for fast prototyping.” The result is that deep learning implementation sometimes gets split into a research stage using imperative frameworks, and a product stage using declarative frameworks. Li points to Facebook as an example: the company supports both PyTorch and Caffe2, a declarative framework, and uses the former for exploration and the latter for products.
Li and the MXNet developers have taken a hybrid approach that supports experimenting and prototyping with imperative programs, then seamlessly refactoring critical sections into declarative blocks as you move toward production at scale. “We want to have a single interface for users,” says Li. “You can start with imperative code, and then when you want to deploy, you hybridize.”
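In Gluon, that hybridization is roughly a one-line change (a sketch, assuming MXNet 1.x; the architecture and export file name are arbitrary):

```python
from mxnet import nd
from mxnet.gluon import nn

# Compose the model from hybrid-capable blocks.
net = nn.HybridSequential()
net.add(nn.Dense(128, activation='relu'),
        nn.Dense(10))
net.initialize()

x = nd.random.uniform(shape=(4, 64))
net(x)             # imperative execution: easy to step through and debug

net.hybridize()    # trace the network and compile it into a symbolic graph
net(x)             # later calls run the optimized, pre-compiled graph
net.export('mlp')  # writes mlp-symbol.json and mlp-0000.params for deployment
```

Calling hybridize() tells Gluon to trace the forward pass once and cache the resulting symbolic graph, and export() writes that graph and its parameters to disk so the model can be served without the Python code that defined it.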
Developers will have more opportunities to run fast neural networks on all sorts of devices in the next several years. Researchers today depend on computers or cloud services with expensive GPUs to accelerate neural network training. NVIDIA’s GPUs dominate this field not just because of raw speed, but because CUDA, its programming interface for GPUs, and cuDNN, its deep learning library, are broadly supported by deep learning frameworks. Those frameworks compile neural networks into code that runs efficiently on NVIDIA GPUs by calling functions provided by cuDNN.
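Frameworks hide that dispatch behind a device context. A small sketch in MXNet (assuming MXNet 1.3 or later for mx.context.num_gpus(); the matrix sizes are arbitrary):

```python
import mxnet as mx
from mxnet import nd

# Choose a GPU context when one is present; otherwise fall back to the CPU.
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()

a = nd.random.uniform(shape=(1024, 1024), ctx=ctx)
b = nd.random.uniform(shape=(1024, 1024), ctx=ctx)
c = nd.dot(a, b)    # on a GPU context this dispatches to NVIDIA's tuned
c.wait_to_read()    # libraries (cuBLAS here; cuDNN for convolutions and RNNs)
```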
Researchers envision a future in which neural networks will conduct both inference and training on “edge” devices like mobile phones and embedded systems. With this variety of devices in play, from high-end phones to stripped-down IoT devices (and let’s not forget legacy equipment—many of today’s devices will still be in use for much of the coming decade), these networks won’t be able to depend on high-end GPUs; instead, they may use purpose-designed field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs)—or even rely on currently available digital signal processors (DSPs).
That means a flowering of different devices that deep-learning frameworks need to support. The framework authors can’t write extensions for every conceivable accelerator; instead, they’re focusing on general-purpose compilers that can target any accelerator hardware and compile neural networks to run efficiently on it.
In the next month or so, MXNet will launch its general compiler, allowing developers to run neural networks on any accelerator, from high-end GPUs down to inexpensive DSPs and specialized processors in mobile phones. Li says the compiler will initially be open-sourced for early users.
This post is part of a collaboration between O’Reilly and Amazon. See our statement of editorial independence.