Improving Text Embeddings with Large Language Models: Model Fine-tuning and Evaluation


:::info Authors:

(1) Liang Wang, Microsoft Corporation, correspondence to (wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation, correspondence to (nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation, correspondence to (fuwei@microsoft.com).

:::

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

4.2 Model Fine-tuning and Evaluation

The pretrained Mistral-7B [19] checkpoint is fine-tuned for 1 epoch using the loss in Equation 2. We follow the training recipe from RankLLaMA [24] and use LoRA [17] with rank 16. To further reduce GPU memory requirements, we apply gradient checkpointing, mixed precision training, and DeepSpeed ZeRO-3.
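As a concrete illustration, this setup can be approximated with the Hugging Face `transformers`, `peft`, and `deepspeed` stack. The sketch below is not the authors' training code: the LoRA alpha/dropout values, the target modules, and the DeepSpeed config path `ds_zero3.json` are assumptions not specified in this section.

```python
import torch
from transformers import AutoModel, TrainingArguments
from peft import LoraConfig, get_peft_model

# Load the pretrained Mistral-7B backbone in bfloat16 (mixed precision).
model = AutoModel.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
)
model.gradient_checkpointing_enable()  # recompute activations to save memory

# LoRA with rank 16, following the RankLLaMA recipe cited above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                    # assumed; not stated in this section
    lora_dropout=0.1,                 # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)
model = get_peft_model(model, lora_config)

# One epoch of training; DeepSpeed ZeRO-3 shards optimizer state, gradients,
# and parameters across GPUs. These arguments would be passed to a Trainer
# together with the contrastive loss of Equation 2 (omitted here).
training_args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=1,
    bf16=True,
    gradient_checkpointing=True,
    deepspeed="ds_zero3.json",        # path to a ZeRO-3 config (assumed)
)
```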

For the training data, we use both the generated synthetic data and a collection of 13 public datasets, yielding approximately 1.8M examples after sampling. More details are available in Appendix A. For a fair comparison with prior work, we also report results when the only labeled supervision is the MS-MARCO passage ranking dataset [5].
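For illustration, one way to assemble such a mixture with the Hugging Face `datasets` library is sketched below. The file names and per-source sampling budgets are placeholders; the actual sources and sampling ratios are those given in Appendix A.

```python
from datasets import concatenate_datasets, load_dataset

def sample(ds, n, seed=42):
    """Randomly downsample a dataset to at most n examples."""
    return ds.shuffle(seed=seed).select(range(min(n, len(ds))))

# Placeholder file names; the real mixture combines the synthetic data
# with 13 public datasets (see Appendix A).
synthetic = load_dataset("json", data_files="synthetic.jsonl", split="train")
msmarco = load_dataset("json", data_files="msmarco.jsonl", split="train")

# Downsample each source to a budget, then concatenate and shuffle
# into a single training set (~1.8M examples in the paper).
mixture = concatenate_datasets([
    sample(synthetic, 500_000),   # per-source budget is an assumption
    sample(msmarco, 500_000),
]).shuffle(seed=42)
```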

We evaluate the trained model on the MTEB benchmark [28]. Note that the Retrieval category in MTEB corresponds to the 15 publicly available datasets in the BEIR benchmark [42]. Evaluating a single model takes about 3 days on 8 V100 GPUs due to the large number of documents that must be encoded. Although our model can handle sequence lengths beyond 512, we evaluate only the first 512 tokens for efficiency. Official metrics are reported for each category. For more details about the evaluation protocol, please refer to the original papers [28, 42].
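A minimal evaluation sketch with the `mteb` package is shown below. The checkpoint identifier is assumed to be the publicly released model associated with this work, and the output folder is arbitrary; real evaluation would also apply the task-specific instructions described in Appendix D.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Released checkpoint associated with this work (assumed identifier).
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
model.max_seq_length = 512  # evaluate on the first 512 tokens only

# The Retrieval category in MTEB corresponds to the 15 public BEIR datasets.
evaluation = MTEB(task_types=["Retrieval"], task_langs=["en"])
evaluation.run(model, output_folder="mteb_results")
```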


:::info This paper is available on arXiv under the CC0 1.0 DEED license.

:::
