A practical guide to running Llama 2 locally
Content
- Overview: The problem statement
- Quantize the llama2-7b-chat model
- Load and Run the model
- Challenges and Considerations
- Endnotes
- References
Overview: The problem statement
LLMs (Large Language Models), as the word ‘Large’ in their name suggests, are very large in terms of both parameter count and the space they occupy on disk and in memory. The Llama 2 series of models by Meta, for example, comes in three variants:
- Llama2-7B-Chat: 7 billion parameters, ~14 GB on disk
- Llama2-13B-Chat: 13 billion parameters, ~26 GB on disk
- Llama2-70B-Chat: 70 billion parameters, ~140 GB on disk
The parameters are stored as float16 numbers, which take 2 bytes each. Loading such a model into memory and then making predictions from it is therefore a complex and expensive task in terms of:
- Memory requirements
- Prediction latency
To address these problems there is a process called model quantization. Quantization is a technique that reduces the precision of the weights and activations in a neural network. This can lead to a significant reduction in model size, as well as an improvement in inference speed.
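To get an intuition for the savings, here is a rough back-of-the-envelope calculation in Python (a sketch only; real quantized files come out slightly larger, since metadata is stored and some tensors are kept at higher precision):
# Approximate model size: float16 uses 2 bytes per parameter,
# 4-bit quantization uses roughly 0.5 bytes per parameter.
variants = {"Llama2-7B-Chat": 7e9, "Llama2-13B-Chat": 13e9, "Llama2-70B-Chat": 70e9}
for name, params in variants.items():
    fp16_gb = params * 2 / 1e9
    q4_gb = params * 0.5 / 1e9
    print(f"{name}: ~{fp16_gb:.0f} GB at float16, ~{q4_gb:.1f} GB at 4-bit")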
In this blog post we will see how to quantize the Llama2-7B-Chat model (the same technique applies to the other Llama 2 variants as well) so that it can be loaded on a local CPU-only system and produce predictions with fairly good performance. This gives us our own LLM to talk to, with no fear of our prompts being sent to external systems. Let’s dive in…
Quantize the llama2-7b-chat model
Below are the steps to create the quantized version of the model and run it locally on a CPU-based system. The process has been successfully tested on Linux (Ubuntu) with 12 GB of RAM and an Intel Core i5 processor.
Step-1: Request access to download the Llama 2 model from the Meta AI website.
Step-2: Clone the following two repositories:
git clone https://github.com/facebookresearch/llama.git
git clone https://github.com/ggerganov/llama.cpp.git
Step-3: Navigate into the llama repository and run the command below:
/bin/bash ./download.sh
Follow the on-screen instructions to download the model. As a result, a llama-2-7b-chat directory will be created with the model files in it.
Step-4: In order to quantize the model, it first needs to be converted to f16 (float16) format. Navigate to the llama.cpp directory and run:
make
python3 -m pip install -r requirements.txt
python3 convert.py --outfile models/7B/ggml-model-f16.bin --outtype f16 ../llama/llama-2-7b-chat --vocab-dir ../llama
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
The convert.py command creates an equivalent f16 file from the original model; this is the required input for quantization. With the command above, the model is converted into the f16 file ggml-model-f16.bin, which is roughly the same size as the original model. The quantize command then uses this f16 file to create the 4-bit quantized version of the model, which is roughly a quarter of the original size (~3.8 GB).
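As a quick sanity check, the sizes of the two files can be compared with a few lines of Python (paths assume the commands above were run from inside the llama.cpp directory):
import os
# Compare the f16 file with the 4-bit quantized file produced above.
for path in ("./models/7B/ggml-model-f16.bin", "./models/7B/ggml-model-q4_0.bin"):
    print(f"{path}: {os.path.getsize(path) / 1e9:.1f} GB")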
Load and Run the model
We are all set to run the model and chat with it. Run the command below to load the quantized model into memory and chat with our own LLM:
./main -m ./models/7B/ggml-model-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt
The model created using the above guide has been uploaded to Hugging Face, so it can be downloaded and used directly. Get the model here.
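Alternatively, if you would rather call the model from Python than through the interactive ./main binary, the llama-cpp-python bindings (a separate pip-installable package, not part of the steps above) can load the same quantized file. Below is a minimal sketch, assuming pip install llama-cpp-python; note that recent versions of the bindings expect the newer GGUF format, so a version contemporary with ggml .bin files may be required:
from llama_cpp import Llama
# Load the 4-bit quantized model produced above (path assumes the llama.cpp directory).
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=512)
# Ask a single question; max_tokens caps the length of the reply.
output = llm("Q: What is model quantization? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])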
Challenges and Considerations
While quantization offers numerous advantages, it also presents some challenges:
- Accuracy loss: Quantization can lead to a slight decrease in accuracy compared to full-precision models. However, careful implementation and fine-tuning can minimize this loss.
- Hardware compatibility: Not all hardware supports quantized models, which may limit their deployment options.
- Technical complexity: Quantization requires specific expertise and tools, adding an additional layer of complexity to the model development process.
Endnotes
Despite the challenges, the field of LLM quantization is rapidly evolving. Researchers are developing new methods to achieve high accuracy with minimal loss, and hardware manufacturers are adding support for quantized models.
The future of LLMs lies in making them smaller, faster, and more accessible. Quantization plays a crucial role in achieving this goal, paving the way for a wider range of applications and a more democratized AI landscape.
Hope you enjoyed the post…