This content originally appeared on HackerNoon and was authored by Computational Technology for All
:::info Authors:
(1) Suzanna Sia, Johns Hopkins University;
(2) David Mueller;
(3) Kevin Duh.
:::
Table of Links
- Abstract and 1. Background
- 2. Data and Settings
- 3. Where does In-context MT happen?
- 4. Characterising Redundancy in Layers
- 5. Inference Efficiency
- 6. Further Analysis
- 7. Conclusion, Acknowledgments, and References
- A. Appendix
5. Inference Efficiency
Speeding up transformer inference is of great interest to the community (Fournier et al., 2023). We highlight the potential for faster inference as a direct consequence of identifying where task recognition occurs in the model and where self-attention processing becomes redundant. Our results indicate that we can achieve significant inference speedups by removing the processing of context tokens altogether after a certain point in the model, with little to no impact on downstream performance.
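To make the idea concrete, here is a minimal PyTorch sketch (not the authors' implementation) of a toy transformer stack in which the context positions are dropped after a chosen layer r, so that later layers only process the remaining test-source tokens. All names, dimensions, and the block architecture are illustrative assumptions; causal masking and other production details are omitted for brevity.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """A minimal pre-LayerNorm transformer block (illustrative only)."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        return x + self.ff(self.ln2(x))

def run_with_context_truncation(blocks, x, n_context, r):
    """Run all tokens through layers 0..r-1, then discard the context
    positions so layers r..n-1 only process the remaining tokens."""
    for i, block in enumerate(blocks):
        if i == r:
            x = x[:, n_context:, :]  # stop processing context tokens here
        x = block(x)
    return x

blocks = nn.ModuleList([ToyBlock() for _ in range(8)])
x = torch.randn(1, 30, 64)  # 24 context tokens followed by 6 query tokens
out = run_with_context_truncation(blocks, x, n_context=24, r=4)
print(out.shape)  # torch.Size([1, 6, 64]) -- only query tokens remain
```

From layer r onward, the attention and feed-forward work scales with the query length alone rather than the full prompt length, which is where the savings below come from.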
Consider the scenario where each prompt example has roughly the same token length as the test source sentence. Let r be the layer after which the context tokens (instructions and prompt examples) no longer need to be processed, and let k be the number of prompt examples. Then, for a model with nℓ layers, the amount of processing saved, in terms of both speed and memory, is approximately (nℓ − r)/nℓ × k/(k + 1).
Using the example of LLAMA7B (32 layers), we see from Figure 2 that the model is very close to its ceiling score once the examples have been processed up to layer 14 (ℓ = 14). If we no longer need to process the examples after ℓ = 14, then with a prompt size of k = 5 the savings are approximately 45%.
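As a back-of-the-envelope check, the savings formula can be computed directly. Note that the exact figure depends on whether layer ℓ = 14 is itself counted among the processed layers, which moves the result between roughly 44% and 47%, consistent with the ~45% above:

```python
def approx_savings(n_layers: int, r: int, k: int) -> float:
    """Approximate fraction of compute/memory saved when the k prompt
    examples (each about as long as the test sentence) are dropped
    after layer r: (n_layers - r) / n_layers * k / (k + 1)."""
    return (n_layers - r) / n_layers * k / (k + 1)

# LLAMA7B example from the text: 32 layers, stop processing context
# after layer 14, prompt size k = 5.
print(f"{approx_savings(n_layers=32, r=14, k=5):.1%}")  # 46.9%
print(f"{approx_savings(n_layers=32, r=15, k=5):.1%}")  # 44.3%
```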
For instruction-tuned models, which are the ones typically deployed in production, savings can be non-trivial even when no examples are provided: very long-form instructions are often prepended to the prompt in an attempt to control the model's behavior (prompt engineering), and these instruction tokens are context that can likewise be dropped after layer r.
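The same arithmetic extends naturally to instruction-only prompts if we replace k/(k + 1) with the actual fraction of the prompt occupied by context tokens. This generalization is an illustrative extrapolation of the formula above, not a result reported in the paper:

```python
def approx_savings_by_tokens(n_layers: int, r: int,
                             context_tokens: int, query_tokens: int) -> float:
    # Hypothetical generalization: k/(k + 1) is just the fraction of the
    # prompt that is context, so use measured token counts instead.
    frac_context = context_tokens / (context_tokens + query_tokens)
    return (n_layers - r) / n_layers * frac_context

# e.g. a 400-token system instruction ahead of a 40-token user query:
print(f"{approx_savings_by_tokens(32, 14, context_tokens=400, query_tokens=40):.1%}")
# ~51.1%
```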
:::info This paper is available on arXiv under a CC BY 4.0 DEED license.
:::