Improving Text Embeddings with Large Language Models



AI Summary

Introduction Summary

  • Text Embeddings: Represent natural language semantically as vectors.
  • Applications: Used in information retrieval, question answering, etc.
  • Techniques: Approximate nearest neighbor search for document retrieval, retrieval-augmented generation, and source attribution.
  • Challenges: Traditional methods fail to capture full context; advanced models (E5, BGE) use complex training but have limitations.
  • Our Proposal: A new method using large language models to generate synthetic data for text embedding tasks in 93 languages.
  • Text Embeddings: Used for various NLP tasks; early methods include latent semantic indexing.
  • Recent Methods: Use supervision from natural language inference and labeled data, but face diversity and coverage issues.
  • Our Approach: Single-stage training with synthetic data generation, bypassing multi-stage training limitations.
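The retrieval use case above boils down to ranking documents by similarity between embedding vectors. A minimal sketch (function names are illustrative, not from the paper), using cosine similarity via normalized dot products:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k documents most similar to the query.

    Cosine similarity equals the dot product of L2-normalized vectors,
    which is what approximate nearest neighbor indexes exploit at scale.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # one similarity score per document
    return np.argsort(-scores)[:k]      # indices of highest-scoring docs
```

In production, exact scoring over all documents is replaced by an approximate nearest neighbor index (e.g. HNSW or IVF), trading a little recall for large speedups.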

Method Summary

  • Synthetic Data Generation: Using language models like GPT-4 to increase task and language diversity.
  • Task Categorization: Dividing tasks into groups and applying tailored prompts.
  • Training: Using InfoNCE loss with in-batch and hard negatives, leveraging synthetic data and public datasets.
  • Data Statistics: 500,000 examples, 93 languages, with a focus on English and low-resource languages.

Main Results Summary

  • Model Performance: E5-Mistral-7B + full data achieves top scores on benchmarks.
  • Generative Language Modeling: Suggests that language models can generate effective training data for text embeddings.
  • Multilingual Capabilities: Better performance in high-resource languages, with room for improvement in low-resource ones.
  • Contrastive Pre-Training: Minimal benefit for extensively pre-trained models like Mistral 7B.
  • Personalized Passkey Retrieval: Evaluates long-context capability; adjustments are needed to handle longer contexts.
  • Training Configurations: Mistral 7B initialization is effective; instructions impact performance significantly.
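Since instructions significantly impact performance, queries are prefixed with a natural-language task description before embedding. A minimal sketch of this templating, assuming the "Instruct: ... Query: ..." convention described in the paper (the function name is illustrative):

```python
def format_query(task_description: str, query: str) -> str:
    """Prefix a query with its task instruction before embedding.

    Documents are embedded without any instruction prefix, so the
    same corpus index can serve many instruction-defined tasks.
    """
    return f"Instruct: {task_description}\nQuery: {query}"
```

Example: `format_query("Given a web search query, retrieve relevant passages", "how do text embeddings work")` produces the string actually fed to the embedding model.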