Convert Any Webpage Into LLM Dataset - Local and Free - LLM Scraper



AI Summary

Summary: LLM Scraper Tool Overview

  • Tool Introduction:
    • Name: LLM Scraper
    • Purpose: Scrapes websites and converts data to structured JSON format.
    • Platforms: Windows, Linux
    • Use Cases: AI applications, API integration, vector databases.
  • JSON Format:
    • Easily converted to JSONL, a common dataset format.
    • Widely compatible with APIs.
  • Features:
    • Local model support.
    • Supports local GGUF models and OpenAI models (OpenAI requires a paid account).
    • Schemas defined with Zod, a schema validation library.
    • Uses Playwright, a browser automation framework.
    • TypeScript for type safety.
    • Multiple input modes: HTML, Markdown, text, image.
    • Streaming support for crawling multiple pages.
  • Installation Guide:
    • Prerequisites: Node.js and npm.
    • Installation via PowerShell.
    • Download local models from Hugging Face.
  • Usage:
    • Code setup in Visual Studio Code.
    • Import libraries, launch Chromium, instantiate the LLM with a model.
    • Define schema with Zod, set URL, and run scraper.
    • Output: JSON format data from web pages.
  • Alternative Usage with OpenAI:
    • Requires OpenAI API key from platform.openai.com.
    • Similar setup with OpenAI’s GPT-4 Turbo instead of a local model.
  • Conclusion:
    • LLM Scraper is an effective tool for converting web pages into structured data.
    • The tutorial includes installation, usage, and integration with OpenAI.
    • The creator offers assistance for any issues and encourages subscribing and sharing the video.
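
The JSON Format point above notes that the scraper's JSON output converts easily to JSONL. A minimal sketch of that conversion follows, in plain TypeScript with no dependencies. The ScrapedRecord shape and the sample data are hypothetical, chosen only for illustration; the real fields depend on the schema you define for the scraper.

```typescript
// Hypothetical record shape (an assumption for this sketch, not the tool's output format).
interface ScrapedRecord {
  title: string;
  url: string;
}

// Serialize each record as one compact JSON object per line (JSONL).
function toJsonl(records: ScrapedRecord[]): string {
  return records.map((record) => JSON.stringify(record)).join("\n");
}

// Parse JSONL text back into records, skipping blank lines.
function fromJsonl(text: string): ScrapedRecord[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as ScrapedRecord);
}

// Illustrative sample data standing in for scraped output.
const records: ScrapedRecord[] = [
  { title: "First story", url: "https://example.com/1" },
  { title: "Second story", url: "https://example.com/2" },
];

const jsonl = toJsonl(records);
console.log(jsonl);
```

Because JSON.stringify without an indent argument emits no raw newlines, each record stays on exactly one line, which is what dataset loaders that read JSONL generally expect.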