Improving Text Embeddings with Large Language Models
AI Summary
Introduction Summary
- Text Embeddings: Represent natural language semantically as vectors.
- Applications: Used in information retrieval, question answering, etc.
- Techniques: Approximate nearest neighbors for document recall, retrieval-augmented generation, and source attribution.
- Challenges: Traditional methods fail to capture full context; advanced models (E5, BGE) rely on complex multi-stage training pipelines and still have limitations.
- Our Proposal: A new method using large language models to generate synthetic data for text embedding tasks in 93 languages.
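The approximate-nearest-neighbor retrieval mentioned above can be sketched as a brute-force cosine-similarity search (production systems use ANN indexes for scale; the document embeddings here are toy values, not model outputs):

```python
import numpy as np

def cosine_top_k(query: np.ndarray, docs: np.ndarray, k: int = 1) -> np.ndarray:
    """Return indices of the k documents most similar to the query.

    Brute-force stand-in for an approximate nearest-neighbor index:
    normalize both sides so the dot product equals cosine similarity.
    """
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k]

# Toy example: 3 hypothetical 2-D document embeddings.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.9, 0.1])
top = cosine_top_k(query, docs, k=2)  # indices of the 2 nearest docs
```

In practice the brute-force scan would be replaced by an ANN library once the corpus grows beyond a few hundred thousand documents.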
Related Work Summary
- Text Embeddings: Used for various NLP tasks; early methods include latent semantic indexing.
- Recent Methods: Use supervision from natural language inference and labeled data, but face diversity and coverage issues.
- Our Approach: Single-stage training with synthetic data generation, bypassing multi-stage training limitations.
Method Summary
- Synthetic Data Generation: Using language models like GPT-4 to increase task and language diversity.
- Task Categorization: Dividing tasks into groups and applying tailored prompts.
- Training: Using InfoNCE loss with in-batch and hard negatives, leveraging synthetic data and public datasets.
- Data Statistics: 500,000 synthetic examples covering 93 languages, predominantly English with a long tail of low-resource languages.
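The InfoNCE objective with in-batch negatives mentioned above can be sketched in NumPy as follows; the temperature value and batch shapes are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def info_nce(q: np.ndarray, p: np.ndarray, tau: float = 0.05) -> float:
    """InfoNCE loss with in-batch negatives (illustrative sketch).

    q, p: (B, D) L2-normalized query/passage embeddings where q[i]
    matches p[i]; every other passage in the batch acts as a negative.
    """
    sims = q @ p.T / tau                                  # (B, B) similarities
    # Row-wise log-softmax; the diagonal holds the positive pairs.
    logits = sims - sims.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Perfectly matched pairs give a near-zero loss; mismatched pairs do not.
aligned = info_nce(np.eye(4), np.eye(4))
shuffled = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

Mined hard negatives would simply be appended as extra columns of the similarity matrix; the softmax structure is unchanged.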
Main Results Summary
- Model Performance: E5-Mistral-7B + full data achieves top scores on benchmarks.
- Generative Language Modeling: Suggests language models can generate training data for text embeddings.
- Multilingual Capabilities: Better performance in high-resource languages, with room for improvement in low-resource ones.
- Contrastive Pre-Training: Minimal benefit for extensively pre-trained models like Mistral 7B.
- Personalized Passkey Retrieval: Evaluates long-context capability; further adjustments are needed for longer contexts.
- Training Configurations: Mistral 7B initialization is effective; instructions impact performance significantly.
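The passkey retrieval evaluation hides a random key inside long filler text and asks the model to recover it. A minimal sketch of constructing such an example follows; the filler sentence and template are illustrative, not the benchmark's exact wording:

```python
import random

def make_passkey_example(n_filler: int, seed: int = 0):
    """Build a (context, query, passkey) triple for a retrieval probe.

    A 5-digit passkey is buried between repeated filler sentences;
    increasing n_filler stretches the context to test longer windows.
    """
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. " * n_filler
    context = filler + f"The passkey is {passkey}. " + filler
    query = "What is the passkey?"
    return context, query, passkey

# Example: a short probe; larger n_filler yields longer contexts.
context, query, key = make_passkey_example(n_filler=50)
```

An embedding model is then scored on whether the chunk containing the passkey is ranked highest for the query, which is why context length directly stresses retrieval quality.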