Training and Testing Data Formats for AnLLM Models

This content originally appeared on HackerNoon and was authored by Anchoring

:::info Authors:

(1) Jianhui Pang, University of Macau; work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab (nlp2ct.pangjh3@gmail.com);

(2) Fanghua Ye, University College London; work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab (fanghua.ye.19@ucl.ac.uk);

(3) Derek F. Wong, University of Macau;

(4) Longyue Wang, Tencent AI Lab (corresponding author).

:::

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Anchor-based Large Language Models

3.1 Background

3.2 Anchor-based Self-Attention Networks

3.3 Anchor-based Inference

4 Experiments and 4.1 Our Implementation

4.2 Data and Training Procedure

4.3 Evaluation

5 Results

6 Analysis

7 Conclusion, Limitations, Ethics Statement, and References

A More Experimental Results

B Data Settings

To clarify how we continually pre-train the model into AnLLM and how we carry out evaluations, this section presents examples of both the training and the testing data.

B.1 Training Data Examples

In this section, we provide examples to illustrate the specific data format used in training the AnLLM models. For the AnLLM-EP model, the endpoints act as anchor tokens, allowing us to use natural language texts directly. For the AnLLM-AC model, we first split the input texts into sentences using the NLTK toolkit [3] and then append a new token to the end of each resulting sequence. Some examples are presented in Table 6. All the training data are downloaded from HuggingFace [4], an open-source community.
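The two training formats described above can be illustrated with a minimal sketch. It is not the authors' preprocessing code: the anchor string `<AC>`, the space placed before it, and the function names are assumptions made for illustration, and a simple regular expression stands in for the NLTK Punkt sentence splitter so that the example runs without downloading any tokenizer data.

```python
import re
from typing import List

# Assumed placeholder for the appended anchor token; this section does not
# spell out the exact token string.
ANCHOR_TOKEN = "<AC>"


def split_sentences(text: str) -> List[str]:
    """Naive stand-in for NLTK's Punkt splitter.

    Splits after '.', '!' or '?' when followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def format_for_anllm_ac(text: str, anchor: str = ANCHOR_TOKEN) -> str:
    """AnLLM-AC style: append the anchor token to every split sentence."""
    return " ".join(f"{s} {anchor}" for s in split_sentences(text))


def format_for_anllm_ep(text: str) -> str:
    """AnLLM-EP style: the text is used as is, since endpoints act as anchors."""
    return text


if __name__ == "__main__":
    sample = "The cat sat. It purred!"
    print(format_for_anllm_ep(sample))
    print(format_for_anllm_ac(sample))
```

In practice the regex splitter would be replaced by the NLTK sentence tokenizer cited in footnote [3]; the sketch only shows where the anchor token ends up relative to each sentence.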

B.2 Testing Data Examples

For the testing outlined in the results section (Section 5), we employ the same evaluation method as in previous work (Gao et al., 2023), which treats each answer choice as a text generation and computes the probability of each choice given the question. Table 7 presents some evaluation examples.
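The choice-scoring idea can be sketched as follows. This is an illustrative assumption rather than the evaluation code used in the paper: a toy add-one-smoothed bigram model stands in for the LLM, the helper names are hypothetical, and no length normalization is applied. Each choice is scored by summing the log-probabilities of its tokens conditioned on the question and the preceding choice tokens, and the highest-scoring choice is selected.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence

# A scoring function maps (history tokens, next token) to a log-probability.
LogProbFn = Callable[[Sequence[str], str], float]


def choice_log_likelihood(
    context: Sequence[str], choice: Sequence[str], logprob: LogProbFn
) -> float:
    """Sum log P(token | context + earlier choice tokens) over the choice."""
    history = list(context)
    total = 0.0
    for token in choice:
        total += logprob(history, token)
        history.append(token)
    return total


def pick_choice(
    context: Sequence[str], choices: List[Sequence[str]], logprob: LogProbFn
) -> int:
    """Return the index of the choice with the highest log-likelihood."""
    scores = [choice_log_likelihood(context, c, logprob) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def make_toy_bigram_model(corpus: List[List[str]]) -> LogProbFn:
    """Toy add-one-smoothed bigram model; a hypothetical stand-in for an LLM."""
    vocab = {t for sent in corpus for t in sent}
    bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
    prev_counts = Counter(t for sent in corpus for t in sent[:-1])

    def logprob(history: Sequence[str], token: str) -> float:
        prev = history[-1] if history else ""
        # The extra +1 in the denominator reserves mass for unseen tokens.
        return math.log(
            (bigrams[(prev, token)] + 1) / (prev_counts[prev] + len(vocab) + 1)
        )

    return logprob


if __name__ == "__main__":
    lm = make_toy_bigram_model(
        [["the", "sky", "is", "blue"], ["the", "grass", "is", "green"]]
    )
    question = ["the", "sky", "is"]
    options = [["blue"], ["purple"]]
    print(pick_choice(question, options, lm))
```

With a real model, `logprob` would be replaced by the model's token-level log-probabilities; the scoring loop itself stays the same.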

Table 4: Accuracy of 13B LLMs on Question Answering Benchmarks. Compared to the 7B AnLLMs, the 13B AnLLMs perform better, improving accuracy by up to 2.0, which suggests that AnLLMs scale well to larger model architectures.

Table 5: Case Study of Real-time Inference. During the inference process, AnLLM-EP generates "endpoint" as the anchor token, whereas AnLLM-AC produces "" as the anchor token. Upon encountering an anchor token, we execute the REDUCTION as shown in Line 16 to reduce the keys/values caches.

Table 6: Training Data Examples for the AnLLM-EP and AnLLM-AC models. For the AnLLM-EP model, the endpoints are the natural anchor tokens. For the AnLLM-AC model, we manually append tokens to sequences as the anchor tokens.

Table 7: Testing Data Examples for the AnLLM-EP and AnLLM-AC models. The log-likelihood of the red italicized text is computed as the probability of each choice.

:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::

[3] https://www.nltk.org/api/nltk.tokenize.punkt.html

[4] https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample
