Improving Text Embeddings with Large Language Models: Statistics of the Synthetic Data

Figure 2: Task type and language statistics of the generated synthetic data (see Section 3.1 for task type definitions). The “Others” category contains the remaining languages from the XLM-R language list.

Figure 2 presents the statistics of our generated synthetic data. We generated 500k examples with 150k unique instructions using Azure OpenAI Service [2]; 25% of them were generated by GPT-35-Turbo and the rest by GPT-4. Total token consumption was about 180M. The predominant language is English, and coverage extends to 93 languages in total. The 75 lowest-resource languages have about 1k examples each on average.
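To make the proportions concrete, the short sketch below derives a few secondary figures from the approximate totals quoted above. The constant and function names are illustrative, not from the paper, and the results inherit the rounding of the inputs. The tokens-per-example value is only a rough average, on the assumption that the reported consumption may include prompt tokens as well as generated text.

```python
# Back-of-the-envelope figures derived from the approximate totals in the text.
# All names here are illustrative; the inputs are rounded values from the paragraph.
TOTAL_EXAMPLES = 500_000
UNIQUE_INSTRUCTIONS = 150_000
TOTAL_TOKENS = 180_000_000
GPT35_SHARE = 0.25                      # fraction generated by GPT-35-Turbo
LOW_RESOURCE_LANGUAGES = 75
AVG_EXAMPLES_PER_LOW_RESOURCE_LANG = 1_000


def derived_statistics() -> dict:
    """Compute secondary figures implied by the reported totals."""
    gpt35_examples = round(TOTAL_EXAMPLES * GPT35_SHARE)
    gpt4_examples = TOTAL_EXAMPLES - gpt35_examples
    low_resource_examples = (
        LOW_RESOURCE_LANGUAGES * AVG_EXAMPLES_PER_LOW_RESOURCE_LANG
    )
    return {
        "gpt35_examples": gpt35_examples,
        "gpt4_examples": gpt4_examples,
        # Rough average; consumption may include prompt tokens (assumption).
        "tokens_per_example": TOTAL_TOKENS / TOTAL_EXAMPLES,
        "examples_per_instruction": TOTAL_EXAMPLES / UNIQUE_INSTRUCTIONS,
        "low_resource_examples": low_resource_examples,
        "low_resource_share": low_resource_examples / TOTAL_EXAMPLES,
    }


if __name__ == "__main__":
    for name, value in derived_statistics().items():
        print(f"{name}: {value:,.2f}")
```

Under these inputs, roughly 125k examples come from GPT-35-Turbo and 375k from GPT-4, and the 75 lowest-resource languages together account for about 15% of the data.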

In terms of data quality, we find that a portion of GPT-35-Turbo outputs do not strictly follow the guidelines specified in the prompt templates. Nevertheless, the overall quality remains acceptable, and preliminary experiments have demonstrated the benefits of incorporating this data subset.

:::info This paper is available on arXiv under the CC0 1.0 DEED license.

:::

[2] https://oai.azure.com/

This content originally appeared on HackerNoon and was authored by Auto Encoder: How to Ignore the Signal Noise
