:::info Authors:
(1) Yinwei Dai, Princeton University (Equal contributions);
(2) Rui Pan, Princeton University (Equal contributions);
(3) Anand Iyer, Georgia Institute of Technology;
(4) Ravi Netravali, Princeton University.
:::
Table of Links
2 Background and Motivation and 2.1 Model Serving Platforms
3.1 Preparing Models with Early Exits
3.2 Accuracy-Aware Threshold Tuning
3.3 Latency-Focused Ramp Adjustments
5 Evaluation and 5.1 Methodology
5.3 Comparison with Existing EE Strategies
7 Conclusion, References, Appendix
6 ADDITIONAL RELATED WORK
A number of model-serving systems have been proposed [4, 5, 7, 17, 22, 39, 44, 49], with a focus on serving large volumes of inference requests within a pre-defined SLO. Existing systems favor maximizing system throughput while adhering to latency constraints (§2) through intelligent placement [22, 49], batching [4], and routing [44]. To the best of our knowledge, no existing serving proposals focus on alleviating the latency-throughput tension.
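To make this latency-throughput tension concrete, below is a minimal sketch of an SLO-aware batching decision of the kind such systems make. The function name and the profiled latencies are hypothetical, purely for illustration and not from any of the cited systems: larger batches raise throughput but also raise per-request latency, so the server picks the largest batch that still fits the SLO.

```python
def max_batch_under_slo(slo_ms: float, latency_profile: dict[int, float]) -> int:
    """Pick the largest batch size whose profiled latency still meets the SLO.

    `latency_profile` maps batch size -> measured inference latency (ms);
    the values used below are made up for illustration.
    """
    feasible = [b for b, lat in latency_profile.items() if lat <= slo_ms]
    return max(feasible, default=1)

# Bigger batches improve throughput but inflate per-request latency,
# which is exactly the latency-throughput tension described above.
profile = {1: 8.0, 4: 14.0, 8: 24.0, 16: 45.0}
print(max_batch_under_slo(slo_ms=25.0, latency_profile=profile))  # -> 8
```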
The ML community has been actively working on early-exit networks, with several proposals focusing on the EE's ramp architecture and exit strategy [28, 36, 46, 53, 57, 58, 64]. The architecture of a ramp depends on the domain, but it typically consists of one or more layers that provide the information necessary to make an exit decision by emulating the original model. Replicating the last (few) layers of the original model is the common approach [57, 58]; Apparate builds on this approach but prefers shallow ramps in its workflow (§3). Once a ramp architecture is chosen, the exit strategy can be based on the confidence of the predicted labels [36] or the entropy of the prediction [57]. More sophisticated approaches exist; e.g., instead of treating ramps as fully independent, [64] uses counter-based exiting. Apparate's focus is on leveraging EEs to resolve the latency-throughput tension in serving systems, with a design that generalizes to a large class of EE architectures.
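The sketch below makes the two common exit strategies concrete. It is an illustration only: the threshold values and function name are our own (real systems tune thresholds per ramp, as in §3.2), but the confidence-based and entropy-based rules follow the standard formulations cited above.

```python
import torch
import torch.nn.functional as F

# Hypothetical thresholds; in practice these are tuned per ramp.
CONF_THRESHOLD = 0.9      # exit if the top-1 softmax probability is high enough
ENTROPY_THRESHOLD = 0.5   # exit if the prediction distribution is peaked enough

def should_exit(ramp_logits: torch.Tensor, strategy: str = "confidence") -> bool:
    """Decide whether an input can exit at this ramp.

    `ramp_logits` holds the ramp's raw class scores for a single input.
    """
    probs = F.softmax(ramp_logits, dim=-1)
    if strategy == "confidence":
        # Confidence-based exiting (as in [36]): exit when the top label
        # is predicted with sufficiently high probability.
        return probs.max().item() >= CONF_THRESHOLD
    elif strategy == "entropy":
        # Entropy-based exiting (as in [57]): exit when the distribution
        # is concentrated, i.e., its entropy is low.
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
        return entropy <= ENTROPY_THRESHOLD
    raise ValueError(f"unknown strategy: {strategy}")

# Example: a peaked distribution exits under either rule.
logits = torch.tensor([6.0, 1.0, 0.5, 0.2])
print(should_exit(logits, "confidence"), should_exit(logits, "entropy"))
```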
Optimizing model-serving objectives based on workload characteristics has been explored in recent work [15, 16, 21, 44, 49, 62, 63]. InferLine [16] optimizes cost in serving pipelines of models while adhering to strict latency constraints, using intelligent provisioning and management. Shepherd [63] maximizes goodput and resource utilization under highly unpredictable workloads. Despite their impressive results, these works still optimize their metric of choice at the expense of latency and do not resolve the latency-throughput tension, which is the focus of our (complementary) work.
A recent line of work has focused on creating variants of an ML model to optimize serving performance. Some of these works apply execution-graph-level optimizations such as quantization and fusion to reduce inference latency [1, 3], while others replace the model with an equivalent one that meets the provided constraints. Solutions like Mistify [23] and INFaaS [44] generate and choose model variants based on user intent and constraints (including performance). As shown in §5.2, Apparate's wins persist even on compressed models, so it is complementary to these works in that it can operate on their outputs. Finally, [18, 61] propose optimizations for executing dynamic neural networks that alter NN execution (e.g., EEs, mixture-of-experts). These are low-level optimizations (e.g., at the GPU level) that can benefit Apparate and further improve its performance.
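As one minimal example of such a graph-level optimization, the sketch below applies post-training dynamic quantization in PyTorch. The toy model is our own, not drawn from Mistify or INFaaS; `torch.quantization.quantize_dynamic` is a standard PyTorch API that swaps in int8 weights for the listed module types.

```python
import torch
import torch.nn as nn

# Toy model for illustration only.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Replace Linear weights with int8 versions; activations are quantized
# dynamically at runtime, shrinking the model and often reducing latency.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])
```

Because such a compressed variant is still a standard feed-forward model, early-exit ramps can be attached to it just as to the original, which is why Apparate can operate on the outputs of these systems.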
:::info This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.
:::