Large Language Models on Memory-Constrained Devices Using Flash Memory: Load From Flash

:::info
Authors:
(1) Keivan Alizadeh;
(2) Iman Mirzadeh, Major Contribution;
(3) Dmitry Belenko, Major Contribution;
(4) S. Karen Khatamifard;
(5) Minsik Cho;
(6) Carlo C Del Mundo;
(7) Mohammad Rastegari;
(8) Mehrdad Farajtabar.
:::
Table of Links


This content originally appeared on HackerNoon and was authored by Knapsack


Abstract and 1. Introduction

2. Flash Memory & LLM Inference and 2.1 Bandwidth and Energy Constraints

2.2 Read Throughput

3 Load From Flash

3.1 Reducing Data Transfer

3.2 Improving Transfer Throughput with Increased Chunk Sizes

3.3 Optimized Data Management in DRAM

4 Results

4.1 Results for OPT 6.7B Model

4.2 Results for Falcon 7B Model

5 Related Works

6 Conclusion and Discussion, Acknowledgements and References

3 Load From Flash

This section addresses the challenge of running inference on devices whose available DRAM is substantially smaller than the model, which necessitates storing the full model weights in flash memory. Our primary metric for evaluating flash loading strategies is latency, which we break into three components: the I/O cost of loading from flash, the overhead of managing memory with newly loaded data, and the compute cost of the inference operations.
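As a minimal sketch of this decomposition, the snippet below treats per-token latency as the sum of the three components. The additive form, the `LatencyBreakdown` name, and the millisecond figures are illustrative assumptions, not definitions or measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBreakdown:
    """Hypothetical per-token latency split into the three components above."""

    flash_io_ms: float     # I/O cost of loading weights from flash
    memory_mgmt_ms: float  # overhead of managing newly loaded data in DRAM
    compute_ms: float      # compute cost of the inference operations

    def total_ms(self) -> float:
        # Assumption: the components do not overlap in time, so they add up.
        return self.flash_io_ms + self.memory_mgmt_ms + self.compute_ms

    def shares(self) -> dict:
        # Fraction of the total latency attributable to each component.
        total = self.total_ms()
        return {
            "flash_io": self.flash_io_ms / total,
            "memory_mgmt": self.memory_mgmt_ms / total,
            "compute": self.compute_ms / total,
        }


# Placeholder numbers, chosen only to illustrate an I/O-dominated workload.
example = LatencyBreakdown(flash_io_ms=80.0, memory_mgmt_ms=15.0, compute_ms=5.0)
print(example.total_ms())
print(example.shares())
```

Reporting the components separately makes it clear which one a given loading strategy is expected to shrink.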

Our proposed solutions for reducing latency under memory constraints fall into three areas, each targeting a specific component of the latency:

• Reducing Data Load: decreasing the latency of flash I/O operations by loading less data[1].

• Optimizing Data Chunk Size: increasing flash throughput by loading data in larger chunks, which reduces I/O latency.

• Efficient Management of Loaded Data: streamlining the management of data once it is loaded into memory to minimize overhead.
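The first two strategies can be illustrated with a toy cost model. The sketch below assumes that every flash read pays a fixed per-request overhead plus a transfer time proportional to its size. This model, the function names, and all constants are hypothetical and chosen only to show the trend; they are not taken from the paper.

```python
import math


def flash_read_time_s(total_bytes, chunk_bytes, per_read_overhead_s, peak_bandwidth_bps):
    """Toy estimate of the time to read total_bytes in chunks of chunk_bytes."""
    num_reads = math.ceil(total_bytes / chunk_bytes)
    # Assumed model: fixed overhead per read request, plus raw transfer time.
    return num_reads * per_read_overhead_s + total_bytes / peak_bandwidth_bps


def effective_throughput_bps(total_bytes, chunk_bytes, per_read_overhead_s, peak_bandwidth_bps):
    """Bytes delivered per second once the per-read overhead is included."""
    elapsed = flash_read_time_s(
        total_bytes, chunk_bytes, per_read_overhead_s, peak_bandwidth_bps
    )
    return total_bytes / elapsed


# Placeholder constants, not measurements of any real device.
OVERHEAD_S = 100e-6       # assumed fixed cost per read request
PEAK_BPS = 1e9            # assumed peak sequential bandwidth in bytes/second
WEIGHTS = 64 * 1024**2    # assumed number of weight bytes to load

# Larger chunks amortize the per-read overhead over more bytes.
for chunk in (4 * 1024, 32 * 1024, 256 * 1024):
    mbps = effective_throughput_bps(WEIGHTS, chunk, OVERHEAD_S, PEAK_BPS) / 1e6
    print(f"chunk={chunk // 1024:>4} KiB  throughput={mbps:8.1f} MB/s")

# Loading less data shortens the read at any fixed chunk size.
full = flash_read_time_s(WEIGHTS, 32 * 1024, OVERHEAD_S, PEAK_BPS)
half = flash_read_time_s(WEIGHTS // 2, 32 * 1024, OVERHEAD_S, PEAK_BPS)
print(f"full load: {full:.4f} s, half load: {half:.4f} s")
```

Under these assumptions the two levers are independent: reducing the bytes read scales the whole cost down, while growing the chunk size removes only the per-request part of it.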

Note that our focus is not on the compute aspect of the process, as it is orthogonal to the core concerns of our work. This lets us concentrate on optimizing flash memory interactions and memory management to achieve efficient inference on memory-constrained devices.

Sections 3.1, 3.2, and 3.3 elaborate on the implementation of these three strategies in turn.


:::info This paper is available on arXiv under the CC BY-SA 4.0 DEED license.

:::


[1] Note that by data we mean the weights of the neural network. However, the techniques we developed can be readily generalized to other data types transferred and used for LLM inference, such as activations or the KV cache, as suggested by Sheng et al. (2023).

