There was a recent dramatic improvement in the speed of LLMs on CPUs thanks to llamafile’s author. She goes into it extensively on her blog, but the short version is: expect a 7-billion-parameter model to be usable on a consumer-grade CPU, even in Q8. It is now entirely possible to self-host a coding assistant on a VPS with llamafile, continue.dev and Docker. Let’s see how to achieve that.
I’ll use Docker + Traefik, but you can easily adapt this to another setup (native + nginx, for example; see the sketch below).
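For reference, if you go the native + nginx route, a minimal reverse-proxy block could look like this sketch (it assumes llamafile listens on 127.0.0.1:8080 and the certificate paths are placeholders for your own setup):
server {
    listen 443 ssl;
    server_name codingassistant.example.com;

    # placeholder paths: point these at your own certificates
    ssl_certificate     /etc/ssl/certs/codingassistant.pem;
    ssl_certificate_key /etc/ssl/private/codingassistant.key;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        # llamafile streams tokens, so don't buffer the upstream response
        proxy_buffering off;
    }
}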
First, let’s build the image. Since llamafile is a single binary with no dependencies, the Dockerfile is straightforward. Save it as Dockerfile in the current directory.
FROM alpine:latest

ARG llamafile_version

# Fetch the llamafile release binary straight from GitHub
ADD https://github.com/Mozilla-Ocho/llamafile/releases/download/$llamafile_version/llamafile-$llamafile_version /usr/local/bin/llamafile

# Apply security updates, make the binary executable and create an unprivileged user
RUN apk update && apk upgrade && rm -rf /var/lib/apk && \
    chmod 755 /usr/local/bin/llamafile && \
    adduser -D llamafile

USER llamafile

# llamafile is an "actually portable executable", so it is started through sh
ENTRYPOINT ["/bin/sh", "/usr/local/bin/llamafile"]
CMD ["--server", "--host", "0.0.0.0", "--nobrowser", "--log-disable", "-m", "/model"]
Let’s build it with the following command. As of today, the latest version of llamafile is 0.7.
version=0.7
docker build -t local/llamafile:$version --build-arg llamafile_version=$version .
Download a suitable model from Hugging Face, such as Mistral-7B-Instruct-v0.2 from TheBloke. Let’s assume the file is saved as /data/models/mistral.gguf.
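For example, with curl (a sketch: the exact Hugging Face repository and filename may differ, and any GGUF quantization that fits in your RAM will do):
curl -L -o /data/models/mistral.gguf \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q8_0.gguf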
Let’s run it briefly to make sure it works:
docker run --rm -v /data/models/mistral.gguf:/model:ro -p 8080:8080 local/llamafile:0.7
You should now be able to ask it something with curl:
curl -v -d '{"prompt": "write a python function to print integers from x to y", "stream": true}' -H "content-type: application/json" http://127.0.0.1:8080/completion
The "stream": true argument is only there so that you can quickly see that it works, since tokens appear as soon as they are generated.
Then we need a compose file for convenience.
services:
  web:
    image: local/llamafile:0.7
    volumes:
      - /data/models/mistral.gguf:/model:ro
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.codingassistant.rule=Host(`codingassistant.example.com`)"
      - "traefik.http.routers.codingassistant.tls=true"
      - "traefik.http.routers.codingassistant.entrypoints=websecure"
      - "traefik.http.routers.codingassistant.tls.certresolver=le"
      - "traefik.http.routers.codingassistant.service=codingassistant"
      - "traefik.http.services.codingassistant.loadbalancer.server.port=8080"
      - "traefik.http.services.codingassistant.loadbalancer.server.scheme=http"
I will let you add your own layer(s) of security, such as an IPAllowList or basic auth middleware; a sketch follows.
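For example, basic auth only takes two extra labels in the same labels: section (a sketch: codingassistant-auth is an arbitrary middleware name, the hash comes from htpasswd, and $ characters must be doubled inside a compose file):
      - "traefik.http.routers.codingassistant.middlewares=codingassistant-auth"
      - "traefik.http.middlewares.codingassistant-auth.basicauth.users=me:$$apr1$$replace$$with-your-hash"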
You can now configure the continue.dev extension in Visual Studio Code with the following entry in its config.json:
"models": [
  {
    "title": "llamafile",
    "model": "mistral-7b",
    "completionOptions": {},
    "provider": "openai",
    "apiKey": "EMPTY",
    "apiBase": "https://codingassistant.example.com/v1"
  }
],
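Before using it from the editor, you can check that the OpenAI-compatible endpoint answers through Traefik (a quick sanity check; llamafile’s built-in server exposes the llama.cpp /v1/chat/completions route):
curl -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "Say hello"}]}' \
  -H "content-type: application/json" \
  https://codingassistant.example.com/v1/chat/completions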
You can now query your model from VS Code with ease. You have your own coding assistant: self-hosted, secured with TLS (and more if you want), and relatively fast on many VPS or dedicated servers. Enjoy! 😁