After ChatGPT's outages, I installed an offline AI chatbot that will never go down
If you’re a frequent ChatGPT user, you may have noticed that the AI chatbot occasionally goes down or stops working at the most inconvenient of times. These outages usually don’t last very long, but after the last one left me stranded, I began yearning for a more reliable alternative. Luckily, it turns out that a simple solution exists in the form of local language models like LLaMA 3. The best part? They can run even on relatively pedestrian hardware like a MacBook Air! Here’s everything I learned from using LLaMA 3 and how it compares with ChatGPT.
Why you should care about local AI chatbots
Most of us have only ever used ChatGPT and well-known alternatives like Microsoft’s Copilot and Google’s Gemini. However, all of these chatbots run on powerful servers in far-away data centers, and using the cloud just means relying on someone else’s computer, which can go down or stop working for hours on end.
It’s also unclear how well cloud-based AI chatbots respect your data and privacy. We know that ChatGPT saves conversations to train future models, and the same likely applies to every other Big Tech company out there too. It’s no surprise that companies around the world, ranging from Samsung to Wells Fargo, have restricted their employees from using ChatGPT internally.
Online AI chatbots are neither reliable nor private.
This is where locally run AI chatbots come in. Take LLaMA 3, for example, which is an open-source language model developed by Meta’s AI division (yes, the same company that owns Facebook and WhatsApp). The key distinction here is LLaMA’s open-source status — it means anyone can download and run it for themselves. And since no data ever leaves your computer, you don’t have to worry about leaking secrets.
The only requirement for running LLaMA 3 is a relatively modern computer, which unfortunately disqualifies smartphones and tablets. However, I’ve found that you can run the smaller version of LLaMA 3 on shockingly low-end hardware, including many laptops released within the past few years.
LLaMA 3 vs ChatGPT: How does offline AI fare?
I’ll go over how to install LLaMA 3 on your computer in the next section, but you may want to know how it holds up against ChatGPT first. The answer isn’t straightforward, because ChatGPT and LLaMA 3 both come in different variations.
Until last month, the free version of ChatGPT was restricted to the older GPT-3.5 model, and you needed to pay $20 per month to use GPT-4. With the release of GPT-4o, however, OpenAI now lets free users access its latest model, with some restrictions on how many messages you can send per hour.
LLaMA 3 comes in two model sizes too: 8 billion and 70 billion parameters. The 8B version is the only choice for those with limited computational resources, which essentially means everyone except the most diehard PC gamers. You see, the larger 70B model requires at least 24GB of video memory (VRAM), which is currently only available on exotic $1,600 GPUs like Nvidia’s RTX 4090. Even then, you’ll have to settle for a quantized (compressed) version, as the full 70B model requires 48GB of VRAM.
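To see why the smaller model is so much friendlier to everyday hardware, a rough rule of thumb helps: a model’s weights occupy roughly its parameter count multiplied by the bytes per weight, and quantization shrinks each weight from 16 bits down to 8 or even 4. Here’s a quick back-of-the-envelope sketch of that math (weights only; real-world usage is somewhat higher once you add the context cache and runtime overhead):

```python
# Back-of-the-envelope estimate for how much memory a model's weights need.
# Real-world figures are higher and vary with the quantization format.

def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in gigabytes."""
    return params_billions * (bits_per_weight / 8)

for bits in (16, 8, 4):
    print(f"LLaMA 3 8B at {bits}-bit: ~{weights_gb(8, bits):.0f}GB")
```

At 4-bit quantization, which is what most local downloads use, the 8B model’s weights fit in roughly 4GB. That’s why it runs comfortably on an ordinary laptop.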
Given all of that, LLaMA 3 8B is naturally our model of choice. The good news is that it holds up very well against GPT-3.5, ChatGPT’s baseline model. Here are a few comparisons between the two:
- Prompt 1: Write a cover letter for the position of DevOps Engineer at YouTube. I have been working at Oracle Cloud since graduating as a software engineer in 2019.
Result: Virtually a tie, even if I do favor LLaMA’s bullet point approach a bit more.
- Prompt 2: What’s 8888×3+10?
Result: Both chatbots delivered the correct answer of 26,674.
- Prompt 3: Write a short Python program that simulates a simple dice rolling game. The program should allow the user to specify the number of dice, the number of sides on each die, and how many times they want to roll. The program should then output the results of each roll.
Result: Both chatbots produced working code; a minimal example of what that program can look like follows this list.
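For reference, here’s my own minimal take on that dice-rolling prompt (not either chatbot’s verbatim output, just a sketch of what a correct answer looks like):

```python
import random

def roll_dice(num_dice: int, num_sides: int) -> list[int]:
    """Roll num_dice dice, each with num_sides sides."""
    return [random.randint(1, num_sides) for _ in range(num_dice)]

def main() -> None:
    # Let the user pick the dice count, sides per die, and number of rolls.
    num_dice = int(input("How many dice? "))
    num_sides = int(input("How many sides per die? "))
    num_rolls = int(input("How many rolls? "))

    for i in range(1, num_rolls + 1):
        results = roll_dice(num_dice, num_sides)
        print(f"Roll {i}: {results} (total: {sum(results)})")

if __name__ == "__main__":
    main()
```

Anything along these lines satisfies the prompt, and both models got there without hand-holding.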
One caveat worth noting is that neither GPT-3.5 nor LLaMA 3 can access the internet to fetch recent information. Asking both models about the Pixel 8’s SoC, for example, yielded confident-sounding but completely inaccurate answers. If you ever ask factual questions, I’d take the local model’s responses with a big pinch of salt. But for creative and even programming tasks, LLaMA 3 performs quite admirably.
How to download and run LLaMA 3 locally
As I alluded to above, LLaMA 3 comes in two sizes. The LLaMA 3 8B model doesn’t require anything more than a semi-recent computer. In fact, running it on my desktop yielded faster responses than ChatGPT or any online chatbot available today. While my computer does have a mid-range gaming GPU, LLaMA 3 will also happily run on a laptop with modest hardware. Case in point: I still got reasonably quick responses while running it on an M1 MacBook Air with 16GB of RAM. That’s four-year-old hardware, older than ChatGPT itself!
With that background out of the way, you’ll need some software to actually interface with LLaMA 3. This is because even though you can download the model for free, Meta doesn’t offer it as a program or app that you can simply double-click to run. Thanks to the open-source community, however, we have several different LLM frontends available today.
After trying a handful of them, I’d recommend GPT4All as it makes the process of downloading and running LLaMA 3 as painless as possible. Here’s a quick guide:
- Download GPT4All for your Windows or macOS computer and install it.
- Open the GPT4All app and click Download models.
- Search for the “LLaMA 3 Instruct” model and click Download. This is the 8B model fine-tuned for conversations. The download may take some time depending on your internet connection.
- Once the download completes, close the pop-up and select LLaMA 3 Instruct from the model dropdown menu.
- That’s it — you’re ready to start chatting. Simply type in a prompt, press enter, and wait for the model to generate its response.
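If you’d rather talk to the model from code instead of GPT4All’s chat window, the project also offers a Python SDK that runs the same local models. Here’s a minimal sketch; note that the exact model filename is an assumption on my part, so match it to whatever name appears in GPT4All’s download list:

```python
# pip install gpt4all
from gpt4all import GPT4All

# Filename is illustrative; use the one shown in GPT4All's model list.
# The library will download the model automatically if it isn't present.
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

# chat_session() keeps conversation history, so follow-up prompts have context.
with model.chat_session():
    print(model.generate("Give me three tips for writing a cover letter.", max_tokens=300))
```

Everything here runs entirely on your machine, so the privacy benefits of the desktop app carry over unchanged.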
My admittedly powerful desktop can generate 50 tokens per second, which easily beats ChatGPT’s response speed. Apple Silicon-based computers offer the best price-to-performance ratio, thanks to their unified memory, and will generate tokens faster than a human can read.
If you’re on Windows without a dedicated GPU, LLaMA 3’s text generation will be quite a bit slower. Running it on my desktop CPU yielded just 5 tokens per second and required at least 16GB of system memory. On the flip side, cloud-based chatbots also slow to a crawl during periods of heavy demand. Besides, I can at least rest easy knowing that my chats won’t ever be read by anyone else.
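If you’d like to measure your own machine instead of taking my word for it, the same Python SDK can stream tokens as they’re generated, which makes a rough speed test easy. A quick sketch (again, the model filename is illustrative):

```python
import time

from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # filename is illustrative

prompt = "Write a short paragraph about the history of the printing press."

start = time.perf_counter()
count = 0
# With streaming=True, generate() yields tokens one at a time as they arrive.
for _ in model.generate(prompt, max_tokens=200, streaming=True):
    count += 1
elapsed = time.perf_counter() - start

print(f"{count} tokens in {elapsed:.1f}s ({count / elapsed:.1f} tokens/sec)")
```

As a rough yardstick, anything in the double digits feels conversational, since that’s already about as fast as most people read.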