Uber Model Server Reading Notes

Preface

Today I read an article from Uber about their Model Server: Continuous Integration and Deployment for Machine Learning Online Serving and Models. I happen to be researching model servers at the moment, so here are my reading notes. Below is the full text of Uber's article along with my annotations.

Introduction

Figure 1: high-level view of CI/CD for models and service binary

At Uber, we have witnessed a significant increase in machine learning adoption across various organizations and use-cases over the last few years. Our machine learning models are empowering a better customer experience, helping prevent safety incidents, and ensuring market efficiency, all in real time. The figure above is a high level view of CI/CD for models and service binary.

One thing to note is we have continuous integration (CI)/continuous deployment (CD) for models and services, as shown above in Figure 1. We arrived at this solution after several iterations to address some of the MLOps challenges, as the number of models trained and deployed grew rapidly. The first challenge was to support a large volume of model deployments on a daily basis, while keeping the Real-time Prediction Service highly available. We will discuss our solution in the Model Deployment section.

Note: The key point here is the first challenge: how to support frequent, high-volume model deployments while keeping the Prediction Service highly available. The Model Deployment section explains that the solution mainly consists of a dynamic model loading mechanism, a model deployment workflow, and model deployment tracking (deployment progress queries and model health checks).

The memory footprint associated with a Real-time Prediction Service instance grows as newly retrained models get deployed, which presented our second challenge. A large number of models also increases the amount of time required for model downloading and loading during instance (re)start. We observed a great portion of older models received no traffic as newer models were deployed. We will discuss our solution in the Model Auto-Retirement section.

Note: The key point here is the second challenge: the memory footprint of the Real-time Prediction Service grows rapidly as new models are deployed. The Auto-Retirement section explains that the Real-time Prediction Service solves this with an auto-retirement mechanism, which mainly identifies models that have gone unused for a long time and then alerts the models' owners.

As we are managing a fleet of Real-time Prediction Services, manual service software deployment is not an option. The fourth challenge is to have a CI/CD story for Real-time Prediction Services software. During model deployment time, the Model Deployment Service performs validation by making prediction calls to the candidate model with sampled data. However, it does not check against existing models deployed to the Real-time Prediction Services. Even if a model passes validation, there is no guarantee that the model can be used or exhibits the same behavior (for feature transformation and model evaluation) when deployed to production Real-time Prediction Service instances. This is because there could be dependency changes, service build script changes, or interface changes between two Real-time Prediction Service releases. We will discuss our solution in the Continuous Integration and Deployment section.

Note: This raises the problems of guaranteeing the quality of models about to go live, as well as conflicts between Real-time Prediction Service releases and model versions during continuous delivery, hidden dependencies, and so on. The Continuous Integration and Deployment section proposes three pre-release steps to solve these problems.

Model Deployment

For managing the models running in the Real-time Prediction Service, machine learning engineers can deploy new and retire unused models through the model deployment API. They can track their models’ deployment progress and health status through the API. You can see the system’s internal architecture below in Figure 2:
Figure 2: model deployment workflow and health check workflow

Note: The important point is that "machine learning engineers can deploy new and retire unused models through the model deployment API."
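
To make those capabilities concrete, here is a minimal Java sketch of what such a deployment API surface could look like. The article only lists the capabilities (deploy, retire, track progress and health), so every type and method name below is my own hypothetical stand-in, not Uber's actual interface.

```java
// Hypothetical sketch of a model deployment API surface, based only on the
// capabilities described above. Not Uber's actual interface.
public interface ModelDeploymentApi {

    enum DeploymentStatus { VALIDATING, COMPILING, SERVING_VALIDATION, DEPLOYED, FAILED }

    record HealthStatus(String modelName, boolean healthy, long requestsLastDay) {}

    /** Kick off a deployment of a trained model artifact; returns a deployment id. */
    String deployModel(String modelArtifactUri);

    /** Retire a model that should no longer be served. */
    void retireModel(String modelName);

    /** Track deployment progress for a previously started deployment. */
    DeploymentStatus getDeploymentProgress(String deploymentId);

    /** Health and usage information collected by the periodic health check. */
    HealthStatus getModelHealth(String modelName);
}
```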

Dynamic Model Loading

Historically, we sealed model artifacts into Real-time Prediction Service docker images, and deployed models together with the service. With the rapid growth of model deployments, this heavy process became a bottleneck for model iteration, causing interruptions between model and service developers.

Note: The important point: "we sealed model artifacts into Real-time Prediction Service docker images, and deployed models together with the service." That is, models used to be baked into the Real-time Prediction Service docker image.

To solve this problem, we implemented dynamic model loading. The Model Artifact & Config store holds the target state of which models should be served in production. Realtime Prediction Service periodically checks that store, compares it with the local state, and triggers loading of new models and removal of retired models accordingly. Dynamic model loading decouples the model and server development cycles, enabling faster production model iteration.

Note: To solve the problem above, Uber built Dynamic Model Loading. Model configuration is kept in the Model Artifact & Config store; the Real-time Prediction Service periodically checks the store and compares the model state there against the state of the models already deployed locally. When they differ, it automatically loads new models and unloads old ones.
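
A minimal sketch of that reconciliation loop, assuming hypothetical ModelConfigStore and ModelCompiler interfaces (the article does not show Uber's actual code):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DynamicModelLoader {

    /** Hypothetical view of the Model Artifact & Config store's target state. */
    interface ModelConfigStore {
        Map<String, String> targetModels(); // model name -> artifact URI that should be serving
    }

    interface Model {
        void close();
    }

    interface ModelCompiler {
        Model load(String artifactUri);
    }

    private final ModelConfigStore store;
    private final ModelCompiler compiler;
    private final Map<String, Model> localModels = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    DynamicModelLoader(ModelConfigStore store, ModelCompiler compiler) {
        this.store = store;
        this.compiler = compiler;
    }

    void start() {
        // Periodically compare the store's target state with the local state.
        scheduler.scheduleAtFixedRate(this::reconcile, 0, 1, TimeUnit.MINUTES);
    }

    private void reconcile() {
        Map<String, String> target = store.targetModels();

        // Load models that are in the target state but not yet local.
        for (Map.Entry<String, String> e : target.entrySet()) {
            localModels.computeIfAbsent(e.getKey(), name -> compiler.load(e.getValue()));
        }

        // Unload local models that were retired from the target state.
        Set<String> retired = new HashSet<>(localModels.keySet());
        retired.removeAll(target.keySet());
        for (String name : retired) {
            Model m = localModels.remove(name);
            if (m != null) {
                m.close();
            }
        }
    }
}
```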

Model Deployment Workflow

Model deployment does not simply push the trained model into Model Artifact & Config store; it goes through the steps to create a self-contained and validated model package:

  • Artifacts validation: makes sure the trained model includes all necessary artifacts for serving and monitoring
  • Compile: packages all the model artifacts and metadata into a self-contained and loadable package into the Real-time Prediction Service
  • Serving validation: load the compiled model jar locally and perform a model prediction with example data from training dataset—this step ensures that the model can run and is compatible with Real-time Prediction Service

The primary reason for these steps is to ensure the stability of the Real-time Prediction Service. Since multiple models are loaded in the same container, a bad model may cause prediction request failure, and potentially interrupt models on the same container.

Note: The important point: "Since multiple models are loaded in the same container." Multiple models share one container: is that container Uber's prediction service itself? Does Uber load the models inside the prediction service? And is there only a single instance of the prediction service?
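
A rough sketch of the three deployment steps (artifact validation, compile, serving validation) as a sequential workflow. The types, file names, and validation details are my own assumptions for illustration:

```java
import java.nio.file.Path;
import java.util.List;

public class ModelDeploymentWorkflow {

    record TrainedModel(Path artifactDir, List<String> requiredFiles) {}
    record CompiledModel(Path jarPath) {}
    record Example(double[] features) {}

    /** Hypothetical local harness that can load a compiled model and run a prediction. */
    interface LocalServingHarness {
        double[] predict(CompiledModel model, Example example);
    }

    private final LocalServingHarness harness;

    ModelDeploymentWorkflow(LocalServingHarness harness) {
        this.harness = harness;
    }

    CompiledModel deploy(TrainedModel model, List<Example> samples) {
        validateArtifacts(model);                 // step 1: all serving/monitoring artifacts present
        CompiledModel compiled = compile(model);  // step 2: self-contained, loadable package
        validateServing(compiled, samples);       // step 3: local prediction with training samples
        return compiled;                          // only then is it pushed to the Artifact & Config store
    }

    private void validateArtifacts(TrainedModel model) {
        for (String file : model.requiredFiles()) {
            if (!model.artifactDir().resolve(file).toFile().exists()) {
                throw new IllegalStateException("Missing artifact: " + file);
            }
        }
    }

    private CompiledModel compile(TrainedModel model) {
        // Package artifacts + metadata into a loadable jar (packaging details omitted).
        return new CompiledModel(model.artifactDir().resolve("model.jar"));
    }

    private void validateServing(CompiledModel compiled, List<Example> samples) {
        for (Example sample : samples) {
            double[] scores = harness.predict(compiled, sample);
            if (scores == null || scores.length == 0) {
                throw new IllegalStateException("Serving validation failed");
            }
        }
    }
}
```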

Model Deployment Tracking

For helping machine learning engineers manage their production models, we provide tracking for deployed models, as shown above in Figure 2. It involves two parts:

  1. Deployment progress tracking: Deployment workflow will post deployment progress updates to the centralized metadata storage for tracking.
  2. Health check: After the model finishes its deployment workflow, it becomes a candidate for model health check. The check is performed periodically to track model health and usage information, and it sends updates to the metadata storage.
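
As a sketch of the second part, a periodic health check that posts updates to the centralized metadata storage might look roughly like this; the metrics and storage interfaces are hypothetical:

```java
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ModelHealthChecker {

    record HealthReport(String modelName, boolean healthy, long recentRequests, Instant checkedAt) {}

    interface MetadataStorage {
        void postUpdate(HealthReport report);
    }

    interface ServingMetrics {
        long requestCount(String modelName);   // traffic seen since the last check
        boolean canPredict(String modelName);  // e.g. a probe prediction succeeds
    }

    private final MetadataStorage storage;
    private final ServingMetrics metrics;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    ModelHealthChecker(MetadataStorage storage, ServingMetrics metrics) {
        this.storage = storage;
        this.metrics = metrics;
    }

    /** A model becomes a candidate for this check after it finishes the deployment workflow. */
    void track(String modelName) {
        scheduler.scheduleAtFixedRate(() -> {
            HealthReport report = new HealthReport(
                    modelName,
                    metrics.canPredict(modelName),
                    metrics.requestCount(modelName),
                    Instant.now());
            storage.postUpdate(report);  // send health and usage updates to the metadata storage
        }, 0, 10, TimeUnit.MINUTES);
    }
}
```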

The Model Deployment section as a whole says that Uber's Real-time Prediction Service automatically loads the latest model versions by comparing model state against the store, and that multiple models are deployed in the same container (presumably the Real-time Prediction Service itself). These characteristics match TensorFlow Serving: is Uber's Prediction Service the same kind of project as TensorFlow Serving?

Model Auto-Retirement

There is an API for retiring unused models. However, in many cases people forget to do that, or do not integrate model cleanup into their machine learning workflows. This results in unnecessary storage costs and an increased memory footprint. A large memory footprint can cause Java garbage collection pauses and out-of-memory errors, both of which can impact quality of service. To address this, we built a model auto-retirement process, wherein owners can set an expiration period for the models. If a model has not been used beyond the expiration period, the Auto-Retirement workflow, in Figure 1 above, will trigger a warning notification to the relevant users and retire the model. We saw a non-trivial reduction in our resource footprint after we enabled this feature.

Note: This paragraph says each model gets an expiration period; if a model processes no requests for longer than that period, a warning is sent to the model's owner. Interestingly, it also notes that keeping too many stale models online drives up memory usage and can trigger JVM garbage collection pauses, which implies that Uber's Real-time Prediction Service is written in Java.
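
A minimal sketch of the auto-retirement check, assuming an owner-configured expiration period and hypothetical usage-tracking, notification, and retirement interfaces:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

public class ModelAutoRetirement {

    record DeployedModel(String name, String owner, Duration expirationPeriod) {}

    interface UsageTracker {
        Instant lastPredictionAt(String modelName);
    }

    interface Notifier {
        void warn(String owner, String message);
    }

    interface DeploymentApi {
        void retire(String modelName);
    }

    private final UsageTracker usage;
    private final Notifier notifier;
    private final DeploymentApi api;

    ModelAutoRetirement(UsageTracker usage, Notifier notifier, DeploymentApi api) {
        this.usage = usage;
        this.notifier = notifier;
        this.api = api;
    }

    void check(Map<String, DeployedModel> models) {
        Instant now = Instant.now();
        for (DeployedModel model : models.values()) {
            Duration idle = Duration.between(usage.lastPredictionAt(model.name()), now);
            // A model unused beyond its expiration period triggers a warning and is retired.
            if (idle.compareTo(model.expirationPeriod()) > 0) {
                notifier.warn(model.owner(),
                        "Model " + model.name() + " has been idle for " + idle.toDays()
                        + " days and will be retired");
                api.retire(model.name());
            }
        }
    }
}
```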

Auto-Shadow

As machine learning engineers choose to roll out models with different strategies, they often need to devise ways to distribute real-time prediction traffic among a set of models. We have seen common patterns, such as gradual rollout and shadowing, in their traffic distribution strategies. In a gradual rollout, clients fork traffic and gradually shift the traffic distribution among a group of models. In shadowing, clients duplicate traffic on an initial (primary) model to apply on another (shadow) model. Figure 3 illustrates a typical traffic distribution among a set of models, wherein models A, B, and C participate in a gradual rollout, while model D shadows model B.
Figure 3: real-time prediction traffic distribution among a set of models

This part is mainly about routing production traffic to models under test in order to evaluate their performance. It focuses on shadow models, i.e., duplicating a production model's traffic and sending the copy to the model under test. It feels a lot like A/B testing.
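
To make the Figure 3 scenario concrete, here is a small runnable sketch of weighted primary selection plus shadow duplication. The weights, model names, and shadow mapping are illustrative only, not Uber's configuration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

public class TrafficDistributionExample {

    private static final Random RANDOM = new Random();

    /** Pick a primary model according to the current rollout weights. */
    static String pickPrimary(Map<String, Double> weights) {
        double r = RANDOM.nextDouble();
        double cumulative = 0.0;
        String last = null;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            cumulative += e.getValue();
            last = e.getKey();
            if (r < cumulative) {
                return e.getKey();
            }
        }
        return last; // guard against floating-point rounding
    }

    public static void main(String[] args) {
        // Gradual rollout: clients shift weight from A toward B and C over time.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("modelA", 0.6);
        weights.put("modelB", 0.3);
        weights.put("modelC", 0.1);

        // Shadowing: D duplicates whatever traffic B serves; its result is only
        // logged for analysis, never returned to the caller.
        Map<String, String> shadows = Map.of("modelB", "modelD");

        String primary = pickPrimary(weights);
        System.out.println("serve prediction with " + primary);
        if (shadows.containsKey(primary)) {
            System.out.println("duplicate request to shadow " + shadows.get(primary) + " for logging");
        }
    }
}
```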

To help reduce engineering hours on developing repetitive implementations for common patterns, Real-time Prediction Services provides built-in mechanisms for traffic distribution. Below we focus on the scenario of model auto-shadowing.

Different teams have different strategies for model shadowing, but they all share common characteristics:

  • Model prediction results from production data are not used in production—they are collected for analysis
  • A shadow model shares most features with its primary model, which is especially true in user workflows that regularly retrain and update models
  • Shadowing usually spans a time window of days or weeks before it is stopped
  • A primary model can be shadowed by multiple shadow models; a shadow model can shadow multiple primary models
  • Shadow traffic can be 100%, or picked based on some criteria of the primary model traffic
  • To compare the results, the same prediction is collected for both the primary and the shadow models.
  • A primary model may be serving millions of predictions, and prediction logs may be sampled

We add auto-shadow configuration as part of the model deployment configurations. Real-time Prediction Service can check on the auto-shadow configurations, and distribute traffic accordingly. Users only need to configure shadow relations and shadow criteria (what to shadow and how long to shadow) through API endpoints, and make sure to add features that are needed for the shadow model but not for the primary model.

We found a built-in auto-shadow feature provides extra benefits:

  • As the majority of primary and shadow models share a set of common features, Real-time Prediction Service only fetches features from the online feature store that are not used in the primary model for the shadow models
  • By combining built-in prediction logging logic and shadow sampling logic, Real-time Prediction Service can reduce the amount of shadow traffic to those that are destined to be logged
  • Shadow models can be treated as second class models when service is under pressure, and paused/resumed in order to relieve load pressure

Note: This part describes the requirements that different teams have in common for the shadow-model feature, and the benefits of building it in.
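
A sketch of what an auto-shadow configuration and the "fetch only the missing features" optimization could look like. The types and fields are assumptions based on the description above (shadow relations, sampling criteria, shadow window), not Uber's actual config schema:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AutoShadowSketch {

    /** Captures the shadow relation and shadow criteria (what to shadow and for how long). */
    record ShadowConfig(
            String primaryModel,
            String shadowModel,
            double samplingRate,      // fraction of primary traffic to shadow
            Duration shadowWindow) {} // how long to keep shadowing

    interface FeatureStore {
        Map<String, Double> fetch(Set<String> featureNames, String entityKey);
    }

    /**
     * Build the feature map for the shadow model by reusing the primary model's
     * features and fetching only the ones the shadow model additionally needs.
     */
    static Map<String, Double> shadowFeatures(
            FeatureStore store,
            String entityKey,
            Map<String, Double> primaryFeatures,
            Set<String> shadowFeatureNames) {

        Set<String> missing = new HashSet<>(shadowFeatureNames);
        missing.removeAll(primaryFeatures.keySet());

        Map<String, Double> features = new HashMap<>(primaryFeatures);
        if (!missing.isEmpty()) {
            features.putAll(store.fetch(missing, entityKey));
        }
        return features;
    }
}
```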

Continuous Integration and Deployment

We rely on CI/CD for service release deployment for a fleet of Real-time Prediction Services. Since we are supporting critical business use cases, in addition to validation during model deployment, we need to ensure high confidence in the automated continuous integration and deployment process.

Our solution tries to address the following issues with a new release:

  • Code change not compatible: there could be two symptoms for this issue: a model is unable to load or make predictions with new binaries, or its behavior changes with new releases. The latter is difficult to identify and fix, and is critical to a model’s correctness.
  • Dependencies not compatible: service unable to start due to underlying dependency changes.
  • Build script not compatible: release not able to build due to build script changes.

Note: The difficulties faced by automated continuous integration and deployment: incompatible code changes, wrong dependencies, and build scripts that are incompatible with the service code.

To address the above issues, we employed a three-stage strategy for validating and deploying the latest binary of the Real-time Prediction Service: staging integration test, canary integration test, and production rollout. The staging integration test and canary integration tests are run against non-production environments. Staging integration tests are used to verify the basic functionalities. Once the staging integration tests have been passed, we run canary integration tests to ensure the serving performance across all production models. After ensuring that the behavior for production models will be unchanged, the release is deployed onto all Real-time Prediction Service production instances, in a rolling deployment fashion.

Note: To address the issues above, Uber uses three steps before a release goes live: staging integration tests, canary integration tests, and the production rollout. Staging integration tests cover basic functionality, while canary integration tests cover serving performance across all existing production models. Only after both pass is the release deployed in a rolling fashion.
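
A sketch of that three-stage release gate as code; the stage interfaces and the rolling-deployment detail are assumptions for illustration, not Uber's pipeline:

```java
import java.util.List;

public class ReleasePipeline {

    interface Stage {
        boolean run(String releaseVersion);
    }

    private final Stage stagingIntegrationTest;   // verify basic functionality
    private final Stage canaryIntegrationTest;    // verify behavior across all production models
    private final List<String> productionInstances;

    ReleasePipeline(Stage staging, Stage canary, List<String> instances) {
        this.stagingIntegrationTest = staging;
        this.canaryIntegrationTest = canary;
        this.productionInstances = instances;
    }

    void release(String version) {
        if (!stagingIntegrationTest.run(version)) {
            throw new IllegalStateException("Staging integration tests failed for " + version);
        }
        if (!canaryIntegrationTest.run(version)) {
            throw new IllegalStateException("Canary integration tests failed for " + version);
        }
        // Rolling deployment: update production instances one at a time.
        for (String instance : productionInstances) {
            deployTo(instance, version);
        }
    }

    private void deployTo(String instance, String version) {
        System.out.println("deploying " + version + " to " + instance);
    }
}
```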

Final Thoughts

We have shared our solutions for some of MLOps challenges. As we evolve Uber’s machine learning infrastructure and platform and support new machine learning use cases, we see new MLOps challenges emerge. Some of the things that we are working on include: near real-time monitoring for inference accuracy, feature quality, and business metrics; deploy and serve multi-task learning and hybrid models; perform feature validation; better model fallback mechanism; model traceability and debuggability, etc. Stay tuned.

Note: Outlines some of the new challenges.

Acknowledgements

We could not have accomplished the technical work outlined in this article without the help of our team of engineers and data scientists at Uber. Special thanks to Smitha Shyam and the entire Michelangelo team for their support and technical discussion.

After reading the whole article, it is clear that Uber's Real-time Prediction Service is written in Java and supports deploying multiple models, dynamic model loading, A/B-style testing, rolling releases, and so on. Overall its positioning feels similar to TensorFlow Serving and TorchServe. What still puzzles me is which model formats Uber's Real-time Prediction Service actually supports: does it serve TensorFlow and PyTorch models? I plan to ask Uber's engineers.