Improving Text Embeddings with Large Language Models: Synthetic Data Generation


:::info Authors:

(1) Liang Wang, Microsoft Corporation, and correspondence to (wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation, and correspondence to (nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation and correspondence to (fuwei@microsoft.com).

:::

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

3 Method

3.1 Synthetic Data Generation

Utilizing synthetic data generated by advanced LLMs such as GPT-4 presents a compelling opportunity, especially in terms of enhancing diversity across a multitude of tasks and languages. Such diversity is essential for developing robust text embeddings that can perform well across different tasks, be it semantic retrieval, textual similarity, or clustering.

To generate diverse synthetic data, we propose a simple taxonomy that categorizes embedding tasks into several groups, and then apply different prompt templates to each group.

Asymmetric Tasks This category comprises tasks where the query and document are semantically related but are not paraphrases of each other. Depending on the lengths of the query and document, we further divide asymmetric tasks into four subgroups: short-long match, long-short match, short-short match, and long-long match. For instance, short-long match tasks involve a short query and a long document, a typical scenario for commercial search engines. For each subgroup, we design a two-step prompt template that first prompts the LLM to brainstorm a list of tasks, and then generate a concrete example conditioned on the task definition. Figure 1 shows an example prompt for the short-long match subgroup. The outputs from GPT-4 are mostly coherent and of high quality. In preliminary experiments, we also attempted to generate the task definition and query-document pairs with a single prompt, but the data diversity was not as satisfactory as with the proposed two-step approach.
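
Concretely, the two-step flow can be sketched as follows. This is a minimal illustration, not the authors' released pipeline: the `chat` helper, the model name, and the exact prompt wording (the real templates appear in Appendix C) are assumptions.

```python
import json

from openai import OpenAI  # assumption: any GPT-4-compatible chat client

client = OpenAI()

def chat(prompt: str) -> str:
    """One chat completion; model name and sampling settings are assumptions."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # a higher temperature favors output diversity
    )
    return response.choices[0].message.content

# Step 1: brainstorm task definitions for one subgroup (here: short-long).
tasks = json.loads(chat(
    "Brainstorm a list of potentially useful text retrieval tasks where "
    "the query is short and the document is long. "
    "Your output must be a JSON list of strings."
))

# Step 2: generate a concrete example conditioned on one task definition.
example = json.loads(chat(
    f"You have been assigned a retrieval task: {tasks[0]}\n"
    "Write one example for this task as a JSON object with keys "
    '"user_query", "positive_document", and "hard_negative_document".'
))
```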

Symmetric Tasks Symmetric tasks involve queries and documents that have similar semantic meanings but different surface forms. We examine two application scenarios: monolingual semantic textual similarity (STS) and bitext retrieval. We design a distinct prompt template for each scenario, tailored to its specific objective. Since the task definition is straightforward, we omit the brainstorming step for symmetric tasks.
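
One simple way to organize this taxonomy is a registry keyed by task group, where asymmetric subgroups carry a brainstorm-plus-generate template pair and symmetric scenarios carry only a generation template. The structure and placeholder strings below are assumptions; the actual templates are listed in Appendix C.

```python
# Hypothetical registry for the task taxonomy. Asymmetric subgroups use a
# two-step template pair; symmetric scenarios skip the brainstorming step.
PROMPT_TEMPLATES = {
    # Asymmetric: query and document are related but not paraphrases.
    "short-long":  {"brainstorm": "<template>", "generate": "<template>"},
    "long-short":  {"brainstorm": "<template>", "generate": "<template>"},
    "short-short": {"brainstorm": "<template>", "generate": "<template>"},
    "long-long":   {"brainstorm": "<template>", "generate": "<template>"},
    # Symmetric: same meaning, different surface form.
    "sts":    {"generate": "<template>"},
    "bitext": {"generate": "<template>"},
}
```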

To further boost the diversity of the prompts, and thus of the synthetic data, we incorporate several placeholders in each prompt template, whose values are randomly sampled at runtime. For example, in Figure 1, the value of “{query_length}” is sampled from the set “{less than 5 words, 5-10 words, at least 10 words}”.
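
A minimal sketch of this placeholder mechanism, assuming Python's `random` module and `str.format`-style slots; only `{query_length}` and its value set come from the text, while the other slots are illustrative:

```python
import random

# Candidate values per placeholder. "query_length" and its options match
# the example in the text; the other slots are illustrative assumptions.
PLACEHOLDER_VALUES = {
    "query_length": ["less than 5 words", "5-10 words", "at least 10 words"],
    "clarity": ["clear", "understandable with some effort", "ambiguous"],
    "difficulty": ["high school", "college", "PhD"],
}

def fill_placeholders(template: str) -> str:
    """Randomly instantiate every placeholder slot at generation time."""
    values = {name: random.choice(opts) for name, opts in PLACEHOLDER_VALUES.items()}
    return template.format(**values)

prompt = fill_placeholders(
    "Generate a {clarity} query of {query_length}; answering it should "
    "require {difficulty}-level knowledge."
)
```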

To generate multilingual data, we sample the value of “{language}” from the language list of XLM-R [7], giving more weight to high-resource languages. Any generated data that does not conform to the predefined JSON format is discarded during parsing. We also remove duplicates based on exact string matching.
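
The sampling-and-filtering step might look like the sketch below. The language weights are invented for illustration (XLM-R covers roughly 100 languages), and the required JSON keys are assumptions carried over from the earlier sketch.

```python
import json
import random

# Invented resource weights over a slice of XLM-R's language list;
# high-resource languages receive proportionally more samples.
LANGUAGE_WEIGHTS = {"English": 50.0, "Chinese": 20.0, "German": 10.0, "Swahili": 1.0}

def sample_language() -> str:
    languages, weights = zip(*LANGUAGE_WEIGHTS.items())
    return random.choices(languages, weights=weights, k=1)[0]

REQUIRED_KEYS = {"user_query", "positive_document", "hard_negative_document"}
seen_outputs: set = set()

def keep(raw_output: str):
    """Return a parsed record, or None if it is malformed or a duplicate."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # does not conform to the predefined JSON format
    if not REQUIRED_KEYS <= set(record):
        return None  # missing a required field
    if raw_output in seen_outputs:
        return None  # exact string match against an earlier generation
    seen_outputs.add(raw_output)
    return record
```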


:::info This paper is available on arxiv under CC0 1.0 DEED license.

:::
