There was a recent dramatic improvement in the speed of LLMs on CPUs thanks to llamafile’s author. She writes about it extensively on her blog, but the short version is: expect 7-billion-parameter models to be usable on a consumer-grade CPU, even at Q8. It is now entirely possible to self-host a coding assistant with llamafile, continue.dev and Docker on a VPS. Let’s see how to achieve that.

I’ll use Docker + Traefik but you can easily convert it to anything else (native + nginx for example).

First, let’s build the image. Since llamafile is a single binary with no dependencies, the Dockerfile is straightforward. Save it in the current directory.

FROM alpine:latest

ARG llamafile_version

ADD https://github.com/Mozilla-Ocho/llamafile/releases/download/$llamafile_version/llamafile-$llamafile_version /usr/local/bin/llamafile

RUN apk update && apk upgrade && rm -rf /var/cache/apk/* && \
    chmod 755 /usr/local/bin/llamafile && \
    adduser -D llamafile

USER llamafile

# llamafile ships as an Actually Portable Executable; launching it through sh avoids binfmt_misc issues on the host
ENTRYPOINT ["/bin/sh", "/usr/local/bin/llamafile"]
CMD ["--server", "--host", "0.0.0.0", "--nobrowser", "--log-disable", "-m", "/model"]

Let’s build it with the following command. As of today, the latest version of llamafile is 0.7.

version=0.7
docker build -t local/llamafile:$version --build-arg llamafile_version=$version .
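
A quick sanity check that the image is built and tagged as expected:

docker image ls local/llamafile:$version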

Download a suitable model from Hugging Face, such as Mistral-7B-Instruct-v0.2 from TheBloke. Let’s assume the file ends up as /data/models/mistral.gguf.
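
For example, with curl (the exact repository path and quantization below are assumptions; check the file list on the model page and pick the .gguf that fits your RAM):

mkdir -p /data/models
curl -L -o /data/models/mistral.gguf \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q8_0.gguf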

Let’s run it briefly to make sure it works:

docker run --rm -v /data/models/mistral.gguf:/model:ro -p 8080:8080 local/llamafile:0.7

You should now be able to ask it something with curl:

curl -v -d '{"prompt": "write a python function to print integers from x to y", "stream": true}' -H "content-type: application/json" http://127.0.0.1:8080/completion

The "stream": true argument is only there so that tokens show up right away and you can quickly see that it works.
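
llamafile also exposes an OpenAI-compatible API under /v1, which is what continue.dev will talk to later. You can exercise it the same way (with a single model loaded, the model field is essentially informational):

curl -s -H "content-type: application/json" \
  -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "write a python function to print integers from x to y"}]}' \
  http://127.0.0.1:8080/v1/chat/completions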

Then we need a compose file for convenience.

services:
  web:
    image: local/llamafile:0.7
    volumes:
      - /data/models/mistral.gguf:/model:ro
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.codingassistant.rule=Host(`codingassistant.example.com`)"
      - "traefik.http.routers.codingassistant.tls=true"
      - "traefik.http.routers.codingassistant.entrypoints=websecure"
      - "traefik.http.routers.codingassistant.tls.certresolver=le"
      - "traefik.http.routers.codingassistant.service=codingassistant"
      - "traefik.http.services.codingassistant.loadbalancer.server.port=8080"
      - "traefik.http.services.codingassistant.loadbalancer.server.scheme=http"

I will let you add your own layer(s) of security such as IPAllowlist or basic auth.
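
As a sketch, an IP allow-list can be added with two extra labels on the same service (this assumes Traefik v3, where the middleware is called ipallowlist; on v2 it is ipwhitelist; the middleware name ca-allow is arbitrary and the CIDR is a placeholder):

      - "traefik.http.middlewares.ca-allow.ipallowlist.sourcerange=203.0.113.0/24"
      - "traefik.http.routers.codingassistant.middlewares=ca-allow"

Re-run docker compose up -d for the new labels to take effect.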

You can now configure the continue.dev extension in Visual Studio Code, in its config.json, with the following:

 "models": [

    {
      "title": "llamafile",
      "model": "mistral-7b",
      "completionOptions": {},
      "provider": "openai",
      "apiKey": "EMPTY",
      "apiBase": "https://codingassistant.example.com/v1"
    }
  ],

And now you can easily query your model from VS Code.
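
If the extension can’t reach the model, replay the earlier chat-completion request against the public URL to tell apart a Traefik/TLS issue from an extension configuration issue:

curl -s -H "content-type: application/json" \
  -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "hello"}]}' \
  https://codingassistant.example.com/v1/chat/completions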

You now have your own coding assistant: secured with TLS (and more if you want), self-hosted, and relatively fast on many VPS/dedicated servers. Enjoy! 😁