This content originally appeared on HackerNoon and was authored by Knapsack
:::info Authors:
(1) Keivan Alizadeh;
(2) Iman Mirzadeh, Major Contribution;
(3) Dmitry Belenko, Major Contribution;
(4) S. Karen Khatamifard;
(5) Minsik Cho;
(6) Carlo C Del Mundo;
(7) Mohammad Rastegari;
(8) Mehrdad Farajtabar.
:::
Table of Links
2. Flash Memory & LLM Inference and 2.1 Bandwidth and Energy Constraints
3.2 Improving Transfer Throughput with Increased Chunk Sizes
3.3 Optimized Data Management in DRAM
4.1 Results for OPT 6.7B Model
4.2 Results for Falcon 7B Model
6 Conclusion and Discussion, Acknowledgements and References
5 Related Works
Efficient Inference for Large Language Models. As LLMs grow in size, reducing their computational and memory requirements for inference has become an active area of research. Approaches broadly fall into two categories: model compression techniques like pruning and quantization (Han et al., 2016b; Sun et al., 2023; Jaiswal et al., 2023; Xia et al., 2023; Zhang et al., 2022a; Xu et al., 2023; Shao et al., 2023; Lin et al., 2023; Hoang et al., 2023; Zhao et al., 2023; Ahmadian et al., 2023; Liu et al., 2023a; Li et al., 2023), and selective execution like sparse activations (Liu et al., 2023b; Mirzadeh et al., 2023) or conditional computation (Graves, 2016; Baykal et al., 2023). Our work is complementary, focusing on minimizing data transfer from flash memory during inference.
Selective Weight Loading. Most related to our approach is prior work on selective weight loading. SparseGPU (Narang et al., 2021) exploits activation sparsity to load a subset of weights for each layer, but it still requires loading from RAM. FlexGen (Sheng et al., 2023) offloads weights and the KV cache from GPU memory to DRAM and from DRAM to flash memory; in contrast, we consider only the cases where the full model cannot reside in the combined DRAM and GPU memory of an edge device, a scenario in which FlexGen is theoretically bound by the slow flash-to-DRAM throughput. Firefly (Narang et al., 2022) shares our goal of direct flash access but relies on a hand-designed loading schedule, whereas we propose a cost model to optimize weight loading. Similar techniques have been explored for CNNs (Parashar et al., 2017; Rhu et al., 2013). Concurrently, Adapt (Subramani et al., 2022) has proposed adaptive weight loading for vision transformers. We focus on transformer-based LLMs and introduce techniques like neuron bundling tailored to LLMs.
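To make the contrast concrete, the snippet below is a minimal, hypothetical sketch of activation-sparsity-driven selective loading: a predictor guesses which FFN neurons will be active, and only the corresponding rows of a memory-mapped weight file are read instead of the full matrix. The predictor, file layout, and dimensions are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

# Sketch: read only the FFN rows whose neurons are predicted to be active,
# rather than materializing the whole (D_FF, D_MODEL) matrix in DRAM.
D_MODEL, D_FF = 1024, 4096

def predict_active_neurons(hidden_state, threshold=0.5):
    """Hypothetical predictor: scores each FFN neuron and keeps only those
    expected to survive the activation (stand-in for a learned predictor)."""
    rng = np.random.default_rng(0)
    scores = hidden_state @ rng.standard_normal((D_MODEL, D_FF)) * 0.01
    return np.nonzero(scores > threshold)[0]

def load_rows(memmap_weights, row_indices):
    """Read only the selected rows from a memory-mapped weight file."""
    return memmap_weights[row_indices]

if __name__ == "__main__":
    # Stand-in for weights stored on flash as a memory-mapped array.
    weights = np.lib.format.open_memmap(
        "ffn_up.npy", mode="w+", dtype=np.float16, shape=(D_FF, D_MODEL))
    x = np.random.randn(D_MODEL).astype(np.float32)
    active = predict_active_neurons(x)
    w_active = load_rows(weights, active).astype(np.float32)
    partial_out = w_active @ x          # compute with only the loaded rows
    print(f"loaded {len(active)}/{D_FF} rows "
          f"({len(active)/D_FF:.1%} of the matrix)")
```

The point of the sketch is only the I/O pattern: the fraction of rows actually read scales with activation sparsity, which is what makes flash-resident weights viable.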
To hide flash latency, we build on speculative execution techniques like SpAtten (Dai et al., 2021; Bae et al., 2023), but we introduce lightweight speculation tailored to adaptive weight loading.
Hardware Optimizations. There is a rich body of work on hardware optimizations for efficient LLM inference, including efficient memory architectures (Agrawal et al., 2022; Gao et al., 2022), dataflow optimizations (Han et al., 2016a; Shao et al., 2022), hardware evaluation frameworks (Zhang et al., 2023), and flash optimizations (Ham et al., 2016; Meswani et al., 2015). We focus on algorithmic improvements, but these could provide additional speedups.
Speculative Execution. Speculative decoding (Leviathan et al., 2022; Zhang et al., 2023; He et al., 2023) uses a small draft model to generate candidate tokens and the larger model to verify them. This technique is orthogonal to ours and can be combined with it for further improvement; when speculative decoding is used, the sliding window in our method should be updated with multiple tokens at a time rather than one.
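For readers unfamiliar with the idea, the toy sketch below shows the greedy draft-then-verify loop that speculative decoding is built on; when several draft tokens are accepted in one step, any sliding-window state (such as an active-neuron window) must advance by that many tokens at once. Both "models" are stand-ins, and a real implementation verifies all draft positions in a single batched forward pass of the large model.

```python
# Toy sketch of greedy speculative decoding over an integer vocabulary.
# The draft and large "models" below are hypothetical stand-ins.

def large_model_greedy(context):
    # Hypothetical large model: mostly "+1", but every third position differs.
    step = 2 if len(context) % 3 == 0 else 1
    return (context[-1] + step) % 10

def draft_model(context, k):
    # Hypothetical cheap draft model: always guesses "+1".
    toks, last = [], context[-1]
    for _ in range(k):
        last = (last + 1) % 10
        toks.append(last)
    return toks

def speculative_step(context, k=4):
    proposal = draft_model(context, k)
    accepted = []
    for tok in proposal:
        # In practice all k positions are checked in one batched pass.
        target = large_model_greedy(context + accepted)
        if tok == target:
            accepted.append(tok)       # draft guess verified
        else:
            accepted.append(target)    # emit the correction and stop
            break
    return accepted                     # 1..k tokens per verification step

if __name__ == "__main__":
    ctx = [3]
    for _ in range(5):
        new = speculative_step(ctx)
        ctx.extend(new)                 # the window advances by len(new)
        print(f"accepted {len(new)} token(s): {new}")
```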
Mixture of Experts. Mixture-of-Experts models (Yi et al., 2023) have a sparse structure in their feed-forward layers and can leverage our method to enable larger models on device.
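The hypothetical sketch below illustrates why the two pair naturally: the router selects only the top-k experts per token, so only those experts' weights would ever need to be fetched from flash. The router, expert sizes, and top-k value are illustrative assumptions, not any particular MoE implementation.

```python
import numpy as np

# Sketch: in an MoE feed-forward layer, only the routed experts are touched,
# so only their weights need to be loaded from slow storage.
D_MODEL, D_FF, N_EXPERTS, TOP_K = 512, 2048, 8, 2
rng = np.random.default_rng(0)
router_w = rng.standard_normal((D_MODEL, N_EXPERTS))

def route(x):
    """Return the indices of the top-k experts for one token."""
    logits = x @ router_w
    return np.argsort(logits)[-TOP_K:]

def moe_forward(x, expert_store):
    """expert_store maps expert id -> (w_up, w_down); in a real system these
    reads would hit flash, and only the routed experts are ever read."""
    out = np.zeros(D_MODEL)
    for e in route(x):
        w_up, w_down = expert_store[e]
        out += np.maximum(x @ w_up, 0.0) @ w_down   # ReLU FFN expert
    return out

if __name__ == "__main__":
    experts = {e: (rng.standard_normal((D_MODEL, D_FF)) * 0.02,
                   rng.standard_normal((D_FF, D_MODEL)) * 0.02)
               for e in range(N_EXPERTS)}
    x = rng.standard_normal(D_MODEL)
    y = moe_forward(x, experts)
    print(f"touched {TOP_K}/{N_EXPERTS} experts, "
          f"output norm {np.linalg.norm(y):.2f}")
```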
In summary, we propose algorithmic techniques to minimize weight loading from flash memory during LLM inference. By combining cost modeling, sparsity prediction, and hardware awareness, we demonstrate speedups of 4-5x on CPU and 20-25x on GPU.
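As a rough illustration of the kind of reasoning such a loading cost model captures (the formula and all numbers below are assumptions for illustration, not the paper's measured values or its actual model), flash read time can be approximated as a fixed per-read latency plus bytes divided by bandwidth; this makes clear why both sparsity prediction (fewer bytes) and larger, bundled chunks (fewer reads) matter.

```python
# Back-of-the-envelope flash read cost: per-read latency + bytes / bandwidth.
# All constants are illustrative assumptions.
FLASH_BANDWIDTH_GB_PER_S = 2.0   # assumed sustained read bandwidth
READ_LATENCY_S = 100e-6          # assumed fixed cost per random read

def read_time(num_reads, bytes_per_read):
    per_read = READ_LATENCY_S + bytes_per_read / (FLASH_BANDWIDTH_GB_PER_S * 1e9)
    return num_reads * per_read

if __name__ == "__main__":
    d_model, d_ff, bytes_per_el = 4096, 11008, 2   # fp16, 7B-class FFN (assumed)
    full_layer_bytes = d_model * d_ff * bytes_per_el

    naive = read_time(1, full_layer_bytes)          # one big sequential read
    active_rows = int(0.05 * d_ff)                  # ~5% neurons predicted active
    row_bytes = d_model * bytes_per_el
    selective = read_time(active_rows, row_bytes)   # many tiny reads
    chunked = read_time(active_rows // 8, 8 * row_bytes)  # bundled reads

    print(f"naive full-layer load : {naive * 1e3:7.2f} ms")
    print(f"row-by-row selective  : {selective * 1e3:7.2f} ms")
    print(f"bundled selective     : {chunked * 1e3:7.2f} ms")
```

Under these assumed numbers, row-by-row selective reads can actually be slower than a naive full load because per-read latency dominates, while bundling the same sparse rows into larger chunks recovers a clear win, which is the intuition behind combining sparsity prediction with throughput-aware chunking.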
:::info This paper is available on arXiv under a CC BY-SA 4.0 DEED license.
:::