How to Run 70B and 120B LLMs Locally - 2-bit LLMs



AI Summary

  • Nut Jurg has successfully quantized large language models (LLMs) to 2-bit precision, allowing them to run efficiently on local systems, particularly in CPU-only or mixed CPU/GPU environments.
  • This process uses llama.cpp to convert models into the GGUF format.
  • Some models are also available in extra-small (XS) and extra-extra-small (XXS) 2-bit quantization variants.
  • For larger models, context length adjustments may be necessary.
  • The quantized models may not be visible in tools like LM Studio, but there is a workaround for local installation and use.
  • The process for downloading and installing these models involves:
    • Downloading the GGUF file from Hugging Face.
    • Creating a directory structure in LM Studio’s cache folder.
    • Loading the model in LM Studio and adjusting settings as needed.
  • The video demonstrates using the Kafka model as an example, including adjusting GPU settings and handling model prompts.
  • The model’s performance is tested with various questions, showing that despite the 2-bit quantization it can still produce coherent responses.
  • The video concludes with a reminder that responses from 2-bit quantized models may not be as reliable as those from full-precision versions.
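The llama.cpp conversion step described above is a two-stage process: convert the Hugging Face checkpoint to a full-precision GGUF, then quantize it down to 2 bits. A minimal sketch of the two commands, built here as Python lists so the paths are easy to swap out (the model and file names are hypothetical; `IQ2_XS` and `IQ2_XXS` are the "extra small" and "extra extra small" 2-bit quant types in llama.cpp):

```python
# Sketch of the llama.cpp 2-bit quantization workflow.
# All paths/model names below are hypothetical placeholders.
import shlex

hf_model_dir = "./my-70b-model"            # local Hugging Face checkout
f16_gguf = "my-70b-model-f16.gguf"         # intermediate full-precision GGUF
q2_gguf = "my-70b-model-IQ2_XXS.gguf"      # final 2-bit GGUF

# Step 1: convert the HF checkpoint to GGUF with llama.cpp's converter script.
convert_cmd = ["python", "convert_hf_to_gguf.py", hf_model_dir,
               "--outfile", f16_gguf]

# Step 2: quantize down to the "extra extra small" 2-bit type.
quantize_cmd = ["./llama-quantize", f16_gguf, q2_gguf, "IQ2_XXS"]

# Print the commands so they can be inspected or run in a shell.
print(shlex.join(convert_cmd))
print(shlex.join(quantize_cmd))
```

Swapping `IQ2_XXS` for `IQ2_XS` trades a slightly larger file for somewhat better quality.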
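The manual-install workaround in the steps above can be sketched as a small helper: LM Studio only lists models that sit under its models cache in a `publisher/repo/file.gguf` layout, so a downloaded GGUF just needs to be copied into that structure. The default cache location used here (`~/.cache/lm-studio/models`) is an assumption and may differ by LM Studio version and operating system:

```python
# Sketch: place a downloaded GGUF where LM Studio can discover it.
# The cache_root default is an assumption; check your LM Studio settings.
import shutil
from pathlib import Path

def install_gguf(gguf_file: str, publisher: str, repo: str,
                 cache_root: str = "~/.cache/lm-studio/models") -> Path:
    """Copy a GGUF into LM Studio's publisher/repo directory layout."""
    src = Path(gguf_file)
    dest_dir = Path(cache_root).expanduser() / publisher / repo
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the directory structure
    dest = dest_dir / src.name
    shutil.copy2(src, dest)                      # keep the original download intact
    return dest
```

After copying, restarting LM Studio (or rescanning the models folder) should make the model appear so it can be loaded and its settings adjusted.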

For more information on the process and the models, the video suggests checking out additional content on the channel.