Self-hosted coding assistant with llamafile, and docker

There was a recent dramatic improvement on the speed of LLM’s on CPU thanks to llamafile‘s author. She goes on extensively about it on her blog but the short version is: expect 7-billion parameters to be usable on consumer-grade CPU even in Q8. Now it’s certainly possible to self-host a coding assistant with llamafile, and Docker on a VPS. Let’s see how to achieve that.

I’ll use Docker + Traefik but you can easily convert it to anything else (native + nginx for example).

First let’s build the image. llamafile being a single binary with no dependency, the Dockerfile is straightforward. Save it in the current directory.

FROM alpine:latest

ARG llamafile_version

ADD$llamafile_version/llamafile-$llamafile_version /usr/local/bin/llamafile

RUN apk update && apk upgrade && rm -rf /var/lib/apk && \
chmod 755 /usr/local/bin/llamafile && \
adduser -D llamafile

USER llamafile

ENTRYPOINT ["/bin/sh", "/usr/local/bin/llamafile"]
CMD ["--server", "--host", "", "--nobrowser", "--log-disable", "-m", "/model"]

Let’s build it with the following command. As of today, the latest version of llamafile is 0.7.

docker build -t local/llamafile:$version --build-arg llamafile_version=$version .

Download a suitable model from Huggingface, such as Mistral-7B-Instruct-v0.2 from TheBloke. Let’s assume the file is /data/models/mistral.gguf.

Let’s run it briefly to make sure it works:

docker run --rm -v /data/models/mistral.gguf:/model:ro -p 8080:8080 local:llamafile:0.7

You should now be able to ask it something with curl:

curl -v -d '{"prompt": "write a python function to print integers from x to y", "stream": true}' -H "content-type: application/json"

The argument "stream": true is only there so that you can quickly see if it works.

Then we need a compose file for convenience.

    image: local/llamafile:0.7
      - /data/models/mistral.gguf:/model:ro
      - "traefik.enable=true"
      - "traefik.http.routers.codingassistant.rule=Host(``)"
      - "traefik.http.routers.codingassistant.tls=true"
      - "traefik.http.routers.codingassistant.entrypoints=websecure"
      - "traefik.http.routers.codingassistant.tls.certresolver=le"
      - "traefik.http.routers.codingassistant.service=codingassistant"
      - ""
      - ""

I will let you add your own layer(s) of security such as IPAllowlist or basic auth.

You can now configure the extension in Visual Studio Code with the following:

  "models": [

      "title": "llamafile",
      "model": "mistral-7b",
      "completionOptions": {},
      "provider": "openai",
      "apiKey": "EMPTY",
      "apiBase": ""

And now you can query your model from vscode easily.

You now have your own coding assistant secured with TLS (and more if you want to), self-hosted and relatively fast on many VPS/dedicated servers. Enjoy! 😁

This entry was posted in artificial intelligence, Computer, Docker, Generative AI, Large Language Models, Linux. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.