How to Run 70B and 120B LLMs Locally - 2-bit LLMs



AI Summary

  • Nut Jurg has successfully quantized large language models (LLMs) to 2-bit precision, allowing them to run efficiently on local systems, particularly in CPU-only or mixed CPU/GPU environments.
  • This process uses llama.cpp to convert models into the GGUF format.
  • Some models are also available in extra-small (XS) and extra-extra-small (XXS) 2-bit quantization variants.
  • For larger models, context length adjustments may be necessary.
  • The quantized models may not be visible in tools like LM Studio, but there is a workaround for local installation and use.
  • The process for downloading and installing these models involves:
    • Downloading the GGUF file from Hugging Face.
    • Creating a directory structure in LM Studio’s cache folder.
    • Loading the model in LM Studio and adjusting settings as needed.
  • The video demonstrates using the Kafka model as an example, including adjusting GPU settings and handling model prompts.
  • The model’s performance is tested with various questions, showing that despite the 2-bit quantization it can still produce coherent responses.
  • The video concludes with a reminder that responses from 2-bit quantized models may not be as reliable as those from full-precision versions.
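The llama.cpp conversion step described above is a two-stage process: convert the Hugging Face checkpoint to a full-precision GGUF, then quantize it down to 2 bits. A minimal sketch of the two commands, built here as Python lists so the paths are easy to swap out (the model and file names are hypothetical; `IQ2_XS` and `IQ2_XXS` are the "extra small" and "extra extra small" 2-bit quant types in llama.cpp):

```python
# Sketch of the llama.cpp 2-bit quantization workflow.
# All paths/model names below are hypothetical placeholders.
import shlex

hf_model_dir = "./my-70b-model"            # local Hugging Face checkout
f16_gguf = "my-70b-model-f16.gguf"         # intermediate full-precision GGUF
q2_gguf = "my-70b-model-IQ2_XXS.gguf"      # final 2-bit GGUF

# Step 1: convert the HF checkpoint to GGUF with llama.cpp's converter script.
convert_cmd = ["python", "convert_hf_to_gguf.py", hf_model_dir,
               "--outfile", f16_gguf]

# Step 2: quantize down to the "extra extra small" 2-bit type.
quantize_cmd = ["./llama-quantize", f16_gguf, q2_gguf, "IQ2_XXS"]

# Print the commands so they can be inspected or run in a shell.
print(shlex.join(convert_cmd))
print(shlex.join(quantize_cmd))
```

Swapping `IQ2_XXS` for `IQ2_XS` trades a slightly larger file for somewhat better quality.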
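The manual-install workaround in the steps above can be sketched as a small helper: LM Studio only lists models that sit under its models cache in a `publisher/repo/file.gguf` layout, so a downloaded GGUF just needs to be copied into that structure. The default cache location used here (`~/.cache/lm-studio/models`) is an assumption and may differ by LM Studio version and operating system:

```python
# Sketch: place a downloaded GGUF where LM Studio can discover it.
# The cache_root default is an assumption; check your LM Studio settings.
import shutil
from pathlib import Path

def install_gguf(gguf_file: str, publisher: str, repo: str,
                 cache_root: str = "~/.cache/lm-studio/models") -> Path:
    """Copy a GGUF into LM Studio's publisher/repo directory layout."""
    src = Path(gguf_file)
    dest_dir = Path(cache_root).expanduser() / publisher / repo
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the directory structure
    dest = dest_dir / src.name
    shutil.copy2(src, dest)                      # keep the original download intact
    return dest
```

After copying, restarting LM Studio (or rescanning the models folder) should make the model appear so it can be loaded and its settings adjusted.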

For more information on the process and the models, the video suggests checking out additional content on the channel.