Improving Text Embeddings with Large Language Models: Is Contrastive Pre-training Necessary?


:::info Authors:

(1) Liang Wang, Microsoft Corporation, and correspondence to (wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation, and correspondence to (nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation, and correspondence to (fuwei@microsoft.com).

:::

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

Figure 3: Effects of contrastive pre-training. Detailed numbers are in Appendix Table 6.

Weakly-supervised contrastive pre-training is one of the key factors behind the success of existing text embedding models. For instance, Contriever [18] treats randomly cropped spans of the same document as positive pairs for pre-training, while E5 [46] and BGE [48] collect and filter text pairs from various sources.
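To make this setup concrete, below is a minimal sketch (not the authors' implementation) of weakly-supervised contrastive pre-training in the spirit of Contriever: two random crops of the same document form a positive pair, other examples in the batch serve as negatives, and the model is trained with the standard InfoNCE objective. The `bert-base-uncased` checkpoint, the `random_crop` helper, and the temperature value are illustrative assumptions.

```python
# Sketch of weakly-supervised contrastive pre-training with random-crop
# positive pairs and in-batch negatives. Model choice, crop helper, and
# temperature are illustrative assumptions, not the paper's settings.
import random

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")


def random_crop(text: str, min_words: int = 16, max_words: int = 64) -> str:
    """Return a random contiguous span of words from the input text."""
    words = text.split()
    n = len(words)
    span = random.randint(min(min_words, n), min(max_words, n))
    start = random.randint(0, max(0, n - span))
    return " ".join(words[start:start + span])


def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool the encoder's last hidden states into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, H)


def info_nce_loss(docs: list[str], temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss: matching crops are positives, the rest of the batch negatives."""
    queries = F.normalize(embed([random_crop(d) for d in docs]), dim=-1)
    keys = F.normalize(embed([random_crop(d) for d in docs]), dim=-1)
    logits = queries @ keys.T / temperature              # (B, B) similarities
    labels = torch.arange(len(docs))                     # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```

Models such as E5 and BGE keep the same in-batch InfoNCE objective but replace the random crops with text pairs mined and filtered from web sources.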


:::info This paper is available on arXiv under the CC0 1.0 DEED license.

:::
