How to Run 70B and 120B LLMs Locally - 2-bit LLMs
AI Summary
- Nut Jurg has successfully quantized large language models (LLMs) down to 2-bit precision, allowing them to run efficiently on local systems, particularly in CPU-only or mixed CPU/GPU environments.
- This process uses llama.cpp to convert models into the GGUF format (a rough conversion sketch appears after this list).
- Some models are available in extra-small (XS) and extra-extra-small (XXS) quantization variants.
- For larger models, context length adjustments may be necessary.
- The quantized models may not be visible in tools like LM Studio, but there is a workaround for local installation and use.
- The process for downloading and installing these models (sketched in code after this list) involves:
- Downloading the GGUF file from Hugging Face.
- Creating a directory structure in LM Studio’s cache folder.
- Loading the model in LM Studio and adjusting settings as needed.
- The video demonstrates using the Kafka model as an example, including adjusting GPU offload settings and handling model prompts (the same context-length and GPU-offload settings are shown programmatically after this list).
- The model’s performance is tested with various questions, showing that despite the 2-bit quantization, it can still provide coherent responses.
- The video concludes with a reminder that responses from 2-bit quantized models may not be as reliable as those from full-precision versions.
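The conversion mentioned above follows the standard llama.cpp pipeline: convert the Hugging Face weights to GGUF, then quantize down to a 2-bit type such as IQ2_XS or IQ2_XXS. The sketch below assumes a built llama.cpp checkout; the script and binary names (convert_hf_to_gguf.py, llama-quantize), the paths, and the model directory are assumptions based on recent llama.cpp releases and may differ in your copy.

```python
import subprocess
from pathlib import Path

# Assumed locations -- adjust to your setup.
LLAMA_CPP = Path("~/llama.cpp").expanduser()              # built llama.cpp checkout (assumed path)
HF_MODEL = Path("~/models/some-70b-model").expanduser()   # hypothetical Hugging Face model directory
F16_GGUF = Path("model-f16.gguf")
Q2_GGUF = Path("model-IQ2_XXS.gguf")

# Step 1: convert the Hugging Face weights to a full-precision GGUF file.
subprocess.run(
    ["python", str(LLAMA_CPP / "convert_hf_to_gguf.py"), str(HF_MODEL),
     "--outfile", str(F16_GGUF), "--outtype", "f16"],
    check=True,
)

# Step 2: quantize to 2-bit. IQ2_XXS is the "extra extra small" variant; IQ2_XS
# is slightly larger. Note that IQ2-class quants are normally built with an
# importance matrix (llama-imatrix) for acceptable quality; some llama.cpp
# builds warn or refuse without one.
subprocess.run(
    [str(LLAMA_CPP / "llama-quantize"), str(F16_GGUF), str(Q2_GGUF), "IQ2_XXS"],
    check=True,
)
```

Running this on a 70B model still needs disk space for the intermediate f16 GGUF (roughly 140 GB), which is why most people simply download a ready-made 2-bit GGUF instead, as in the next sketch.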
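For the download-and-install workaround, here is a minimal sketch using the huggingface_hub library. The repository id, file name, and LM Studio cache path are placeholders: the models directory varies by LM Studio version and OS (for example ~/.cache/lm-studio/models on older builds), and LM Studio expects a publisher/model two-level folder layout inside it.

```python
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

REPO_ID = "someuser/Some-70B-GGUF"    # hypothetical repo hosting the 2-bit quant
FILENAME = "some-70b.IQ2_XXS.gguf"    # hypothetical GGUF file name

# LM Studio scans <models dir>/<publisher>/<model>/, so mirror that layout.
models_dir = Path.home() / ".cache" / "lm-studio" / "models"   # adjust for your LM Studio version/OS
target_dir = models_dir / "someuser" / "Some-70B-GGUF"
target_dir.mkdir(parents=True, exist_ok=True)

# Download the GGUF from Hugging Face and copy it where LM Studio will find it.
local_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
shutil.copy(local_path, target_dir / FILENAME)
print(f"Placed {FILENAME} in {target_dir}; restart LM Studio so it rescans the folder.")
```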
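The video adjusts context length and GPU offload inside the LM Studio UI; the sketch below is not that workflow, but it shows the same two settings through the llama-cpp-python bindings for anyone who prefers to script the test. The model path and values are illustrative.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="some-70b.IQ2_XXS.gguf",  # the 2-bit GGUF installed above (illustrative name)
    n_ctx=4096,        # context length; reduce this if a large model exhausts memory
    n_gpu_layers=20,   # layers offloaded to the GPU; 0 = CPU only, -1 = try to offload all layers
)

out = llm("In one sentence, what does 2-bit quantization trade away?", max_tokens=128)
print(out["choices"][0]["text"])
```

Splitting layers between GPU and CPU this way is what makes the mixed CPU/GPU setup mentioned at the top of the summary workable for 70B-class models on a single consumer GPU.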
For more information on the process and the models, the video suggests checking out additional content on the channel.