Description
A serving system for Deep Learning models is a tricky design problem. It’s a large-scale production system, so we want it to scale well, adapt to changing traffic patterns, and have low latency. It’s also part of the Data Scientist’s core loop, so it should be very flexible, and running an experiment on live traffic should be easy. In this talk, I’ll discuss key design considerations for such a system, covering both perspectives. I’ll also describe a system we built at Taboola for serving TensorFlow models. It serves billions of requests per day, spread over dozens of models, and still has pretty good ergonomics.