Chatting With LLaMA, Offline, Using a MacBook M1 Pro (32GB RAM)

How to run Meta's LLaMA 2 language model locally on a MacBook Pro M1.

Supplies

MacBook M1 Pro, 32 GB RAM
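To confirm that a machine matches these supplies, the chip architecture and installed memory can be read from the terminal. The following is only a minimal sketch: the helper name bytes_to_gib is made up for this example, and the sysctl hw.memsize query is macOS-specific, so it is guarded and skipped on other systems.

```shell
#!/bin/sh
# Convert a byte count to whole GiB (integer division).
# bytes_to_gib is a hypothetical helper name used only in this sketch.
bytes_to_gib() {
    echo $(( $1 / 1024 / 1024 / 1024 ))
}

# hw.memsize only exists on macOS, so query it only on Darwin.
if [ "$(uname -s)" = "Darwin" ]; then
    # uname -m reports arm64 on Apple Silicon machines.
    echo "Chip architecture: $(uname -m)"
    echo "Installed RAM: $(bytes_to_gib "$(sysctl -n hw.memsize)") GiB"
fi
```

On the machine described here, the architecture line should read arm64.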

Fill out Meta's request form here:

and wait for an email from Meta with the download link and instructions.
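While waiting for the email, it may be worth checking that the command-line tools used in the following steps (git, wget, python3 and pip3) are available. This is a small sketch, with missing_tools being a hypothetical helper name. Note that macOS does not ship with wget, so it usually has to be installed separately, for example with a package manager such as Homebrew.

```shell
#!/bin/sh
# Print each named command that is not found on PATH, one per line.
# missing_tools is a hypothetical helper name used only in this sketch.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The tools invoked by the commands in this guide.
missing=$(missing_tools git wget python3 pip3)
if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All required tools found."
fi
```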

Get the downloader:

git clone https://github.com/facebookresearch/llama

Enter the downloader folder (llama) and run:

sh ./download.sh

The downloader will ask which model to download (e.g. "7B") and for the download link from Meta's email.

Rename the model folder so that its name is just the model size. For example, if you downloaded llama-2-7b, rename it to "7B":

mv llama-2-7b 7B
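Before converting, a quick look at the folder layout can catch an incomplete download. The sketch below rests on assumptions not stated in this guide: that the downloader leaves tokenizer.model in the downloader folder, and params.json plus one or more consolidated.*.pth shards inside the model folder. The helper name check_download is hypothetical.

```shell
#!/bin/sh
# Print the files that appear to be missing from a download.
# Assumed layout (not confirmed by this guide):
#   <base>/tokenizer.model
#   <base>/<size>/params.json
#   <base>/<size>/consolidated.*.pth
check_download() {
    base=$1
    size=$2
    [ -f "$base/tokenizer.model" ] || echo "missing: $base/tokenizer.model"
    [ -f "$base/$size/params.json" ] || echo "missing: $base/$size/params.json"
    ls "$base/$size"/consolidated.*.pth >/dev/null 2>&1 \
        || echo "missing: $base/$size/consolidated.*.pth"
}

# Run from the downloader folder, after the rename.
problems=$(check_download . 7B)
if [ -z "$problems" ]; then
    echo "Download looks complete."
else
    echo "$problems"
fi
```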

Install FastChat (the fschat package), which serves as the inference engine for the language model:

pip3 install fschat

Download the script that converts the LLaMA weights to the Hugging Face (HF) format:

wget https://github.com/huggingface/transformers/raw/main/src/transformers/models/llama/convert_llama_weights_to_hf.py

Convert the model:

export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
python3 convert_llama_weights_to_hf.py --input_dir ./ --model_size 7B --output_dir ./hf_7B
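Once the conversion finishes, one way to sanity-check the result is to read the model type from the config.json that Hugging Face models carry. This is a rough sketch that assumes the converter wrote a pretty-printed config.json into ./hf_7B; the helper name hf_model_type is hypothetical.

```shell
#!/bin/sh
# Extract the "model_type" value from a Hugging Face config.json.
# hf_model_type is a hypothetical helper name used only in this sketch.
hf_model_type() {
    sed -n 's/.*"model_type": *"\([^"]*\)".*/\1/p' "$1/config.json"
}

if [ -f ./hf_7B/config.json ]; then
    echo "Converted model type: $(hf_model_type ./hf_7B)"
else
    echo "No config.json found in ./hf_7B; the conversion may not have finished."
fi
```

For a LLaMA model the reported type should be llama.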

Now it is time to chat. Just run:

python3 -m fastchat.serve.cli --model-path ./hf_7B --device mps --load-8bit --style rich
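The --device mps option targets the Apple Silicon GPU through Metal. As a small sketch, a wrapper could pick the device from the output of uname and fall back to cpu elsewhere. The helper name pick_device and the cpu fallback are assumptions added for illustration, and this guide does not cover how the other flags behave on CPU. The sketch only prints the resulting command rather than running it.

```shell
#!/bin/sh
# Choose a FastChat --device value from the OS name and CPU architecture.
# pick_device is a hypothetical helper; the cpu fallback is an assumption.
pick_device() {
    os=$1
    arch=$2
    if [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]; then
        echo mps
    else
        echo cpu
    fi
}

device=$(pick_device "$(uname -s)" "$(uname -m)")
# Print the command so it can be reviewed before being run by hand.
echo "python3 -m fastchat.serve.cli --model-path ./hf_7B --device $device --load-8bit --style rich"
```

On the MacBook M1 Pro used here, this prints the same command as above.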