If you’re interested in natural language processing or machine learning, you’ve likely come across ChatGPT, OpenAI’s groundbreaking language model that’s transforming human-machine interaction. Lately, I’ve been curious to explore how GPT performs locally compared to cloud-based options, and that’s where GPT4All comes into play. This open-source alternative to GPT-3.5 enables you to run CPU quantized models on your own machine. As the GPT4All GitHub repository puts it, it’s “a chatbot trained on a massive collection of clean assistant data including code, stories, and dialogue.”
So how does it work? How can a GPT model fit within the scope of a single machine? By leveraging CPU quantization to build a more compact model representation, paired with high-performance vectorized operations. This approach trades away some precision but opens the door to producing usable, compact AI solutions.
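To make the idea concrete, here is a minimal, illustrative sketch of quantization in plain Python: float weights are mapped to small signed integers plus a shared scale factor, so each weight can be stored in far fewer bits at the cost of some precision. This is a simplification for illustration, not GPT4All’s exact quantization scheme.

```python
# Illustrative sketch: quantize float weights to 4-bit signed integers
# with a shared scale factor, then recover approximate values.

def quantize(weights, bits=4):
    """Map floats to signed integers in [-levels, levels] with one scale."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / levels or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized values."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.97, -0.08]
q, scale = quantize(weights)
approx = dequantize(q, scale)
# approx is close to weights, but each quantized value fits in 4 bits
```

Each value now fits in 4 bits instead of 32, which is the basic trade behind fitting a large model into commodity RAM; the reduced precision is also the source of the quality gaps discussed below.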
- Python 3.6 or higher
- At least 8 GB of RAM
- At least 30 GB of free disk space
Clone the GPT4All repository from GitHub via terminal command:
git clone git@github.com:nomic-ai/gpt4all.git
Download the CPU quantized model checkpoint file, gpt4all-lora-quantized.bin, into the chat folder of the gpt4all repository.
Running the model
To run GPT4All, run one of the following commands from the root of the GPT4All repository. Choose the option matching the host operating system:
# M1 Mac/OSX:
cd chat; ./gpt4all-lora-quantized-OSX-m1

# Linux:
cd chat; ./gpt4all-lora-quantized-linux-x86

# Windows (PowerShell):
cd chat; ./gpt4all-lora-quantized-win64.exe

# Intel Mac/OSX:
cd chat; ./gpt4all-lora-quantized-OSX-intel
After the command completes, you’re presented with a chat prompt where you can interact with the GPT model.
On a strictly conversational level, the results are impressive. In some regards it feels like talking to another person. But there are also notable gaps related to the reduced precision: answers tend to be less relevant, and the model seems to make more false assumptions. The GPT4All model is based on Facebook’s LLaMA model and can answer basic instructional questions, but it lacks the data to answer highly contextual ones, which is not surprising given the model’s compressed footprint.
Model responses are noticeably slower. I’m running an Intel i9 processor, and there’s typically a 2-5 second delay between question and answer: not terrible, but not great either. The model doesn’t seem to use the machine’s full computing power, so there appears to be room to improve performance.
The model is able to recommend external resources like YouTube videos. For example, when I asked for instructions on how to tie my shoes, it responded with a link to a YouTube video showing how to tie shoes. I haven’t seen ChatGPT refer to external links, so this seemed like a noteworthy difference. The model was also able to provide written instructions when asked to do so specifically.
The model also seems to have trouble with time. When it first starts up, it thinks the date is August 31, 2019. Sometimes it won’t tell me the date and time, and I have to ask several times before getting an answer. If I try to tell it the current date and time, it gets confused and gives back inconsistent dates each time I ask.
GPT4All goes a long way toward making GPT models accessible to smaller-scale applications, providing a command line interface for manual experimentation as well as a Python client for programmatic interaction. I’m excited to see the new and innovative solutions this will unlock. These smaller models could be combined into a distributed system of specialists: a cohesive “village” of specialist models, each routing requests within the community so that responses come from the relevant expert.
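As a thought experiment, the “village” idea could be sketched as a simple router that forwards each prompt to the most relevant specialist. Everything here is hypothetical: the specialists are stub functions standing in for real quantized models, and the keyword matching stands in for whatever routing policy a real system would use.

```python
# Hypothetical sketch of a "village" of specialist models: a router
# inspects each prompt and forwards it to the most relevant expert.
# The experts are stubs; in practice each would wrap its own small model.

def code_expert(prompt):
    return f"[code specialist] {prompt}"

def story_expert(prompt):
    return f"[story specialist] {prompt}"

def general_expert(prompt):
    return f"[generalist] {prompt}"

# Keyword lists stand in for a real routing policy (e.g. a classifier).
SPECIALISTS = [
    (("python", "code", "function"), code_expert),
    (("story", "poem", "character"), story_expert),
]

def route(prompt):
    """Send the prompt to the first specialist whose keywords match."""
    lowered = prompt.lower()
    for keywords, expert in SPECIALISTS:
        if any(word in lowered for word in keywords):
            return expert(prompt)
    return general_expert(prompt)

print(route("Write a Python function to sort a list"))
```

The appeal of this design is that each specialist can stay small and quantized, while the community as a whole covers a broad range of questions.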
Let me know what you think in the comment section below!