apfel: The LLM Already on Your Mac

Introduction

Local AI has a memory problem. Every guide for running an LLM on your own machine starts the same way — install LM Studio or Ollama, download a 4 to 14 gigabyte model, watch your fans spin up, and hope your Mac has enough RAM to hold the model alongside everything else you were already doing. On an 8 or 16 gigabyte M1 or M2, that math gets uncomfortable fast. Browser tabs swap to disk, the editor stutters, and the local-first pitch starts to feel less like ownership and more like punishment.

There is a model already loaded on your Mac that you are paying nothing extra for. If you are running a recent macOS on Apple Silicon and you have Apple Intelligence turned on, the system has already pulled the on-device foundation model into memory. It is sitting there, ready to answer prompts, whether you ever ask it to or not. The only thing missing has been a way to talk to it from the terminal — to script with it, pipe into it, or point another tool at it the way you would point at OpenAI.

That is what apfel does. It is a single Homebrew-installed CLI that wraps Apple’s FoundationModels framework and gives you three things: one-shot prompts, an interactive chat REPL, and an OpenAI-compatible HTTP server. No API key. No download. No second model resident in memory. By the end of this guide, you will be running prompts against the model your Mac is already holding open.

Prerequisites:

macOS 26 Tahoe or later
Apple Silicon (M1 or newer)
Apple Intelligence enabled in System Settings
Homebrew installed

If your Mac does not meet those requirements, this is not the right tool — apfel is a thin shim over a framework that simply is not present on older systems. In that case, LM Studio with a small quantized model is still the better path.
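If you would rather check the first two requirements and the install from a script than by eye, a rough preflight sketch follows. It is not part of apfel; the function names are made up for illustration. It cannot tell whether Apple Intelligence is enabled, and a Python running under Rosetta will report x86_64 even on Apple Silicon.

```python
import platform
import shutil
from typing import List, Optional


def parse_major(version: str) -> int:
    """Return the major component of a version string like '26.1' (0 if unparsable)."""
    head = version.split(".", 1)[0]
    return int(head) if head.isdigit() else 0


def check_prereqs(machine: str, macos_version: str, apfel_path: Optional[str]) -> List[str]:
    """Return human-readable problems; an empty list means the basics look fine."""
    problems = []
    if machine != "arm64":
        problems.append(f"CPU is {machine!r}, not Apple Silicon (arm64)")
    if parse_major(macos_version) < 26:
        problems.append(f"macOS {macos_version or 'unknown'} is older than 26")
    if apfel_path is None:
        problems.append("apfel is not on PATH (try: brew install apfel)")
    return problems


if __name__ == "__main__":
    # Gather the real values from this machine and report anything missing.
    issues = check_prereqs(platform.machine(), platform.mac_ver()[0], shutil.which("apfel"))
    print("\n".join(issues) if issues else "Basic prerequisites look satisfied.")
```

The check takes its inputs as arguments so it can be exercised on any machine; only the last three lines touch the real system.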

What apfel Actually Is

Apple ships an on-device LLM as part of Apple Intelligence. The model is small by frontier-lab standards, runs entirely on the Neural Engine and GPU, and is exposed to developers through a Swift framework called FoundationModels. Apps like Mail, Notes, and Shortcuts already call into it. What Apple has not shipped is a command-line interface — if you want to use the model from a script or another non-Apple tool, you have to write Swift.

apfel is that missing CLI. It is a small Swift binary that links the FoundationModels framework, accepts prompts on stdin or argv, and either prints the response or serves it over HTTP using the OpenAI chat-completions schema. The model does not move. The weights do not get re-downloaded. apfel is just the doorway between your terminal and the model that macOS is already running.

The practical consequence: starting apfel costs you almost nothing in memory or boot time. There is no model to load — that work has already been done by the system.

Step 1: Install apfel

brew install apfel

What this does: Homebrew pulls the apfel binary into /opt/homebrew/bin/apfel (or /usr/local/bin/apfel on Intel Macs, though Intel Macs cannot run the foundation model in the first place). There is nothing else to configure. No API key, no config file, no first-run wizard. The first time you run it, it will ask the framework whether the on-device model is available, and the framework will either say yes or tell you why not.

If apfel reports that the model is unavailable, the usual fix is to open System Settings, go to Apple Intelligence & Siri, and confirm that Apple Intelligence is turned on and has finished its initial setup. The first download of the model itself is handled by macOS, not apfel: the system pulls it in the background after you opt in.

Step 2: One-Shot Prompts

The simplest mode. Pass a string, get a response, exit.

┌[ north@macOS ] ~
└➤ apfel "What is two plus two?"
 
Two plus two equals four.

This is the form that makes apfel useful in shell scripts. Anywhere you would have shelled out to curl https://api.openai.com/..., you can now shell out to apfel instead — no key in your environment, no network round trip, no per-token billing. A few patterns that come up often:

# Pipe a file in as context
cat error.log | apfel "Summarize the failures in this log."
 
# Quiet mode strips banners and metadata for clean script output
result=$(apfel -q "Capital of France?")
 
# System prompt to set behavior for the call
apfel --system "You are a terse Unix admin." "Explain inode exhaustion."

What this does: Each invocation is a fresh conversation — there is no history carried between calls. The -q flag suppresses anything that is not the model’s response, which matters when you want to capture output into a variable. --system (or --system-file path/to/persona.txt) sets the system prompt for that call only.
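The same one-shot pattern works from Python if your script lives there instead of in the shell. This is a minimal sketch, not an official binding: the helper names are hypothetical, it uses only the -q and --system flags shown above, and combining the two in one call is an assumption.

```python
import subprocess
from typing import List, Optional


def build_apfel_argv(prompt: str, system: Optional[str] = None, quiet: bool = True) -> List[str]:
    """Assemble the command line for a one-shot apfel call."""
    argv = ["apfel"]
    if quiet:
        argv.append("-q")  # suppress everything except the model's response
    if system is not None:
        argv += ["--system", system]  # system prompt for this call only
    argv.append(prompt)
    return argv


def ask(prompt: str, system: Optional[str] = None,
        stdin_text: Optional[str] = None, timeout: float = 60.0) -> str:
    """Run apfel once and return its stdout; raises CalledProcessError on a non-zero exit."""
    result = subprocess.run(
        build_apfel_argv(prompt, system),
        input=stdin_text,      # plays the role of `cat file | apfel ...`
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout.strip()
```

A call such as ask("Summarize the failures in this log.", stdin_text=open("error.log").read()) mirrors the piped example above. Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or dollar signs.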

The on-device model has a 4,096-token context window. That includes your system prompt, the user prompt, and the response together. It is enough room for short summaries, classification, structured-output tasks, and quick lookups. It is not enough room to feed it a whole book.
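A quick way to avoid blowing the window is to estimate the size of a prompt before sending it. The sketch below assumes roughly four characters per token, a common rule of thumb for English text and not Apple's actual tokenizer, so treat the result as a guard rail rather than a measurement. The 512-token reply reserve is likewise an arbitrary illustrative default.

```python
CONTEXT_WINDOW = 4096  # total budget stated above: system + prompt + response
CHARS_PER_TOKEN = 4    # assumption: rough heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count by ceiling-dividing the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits(system: str, prompt: str, reserve_for_reply: int = 512) -> bool:
    """True if the prompts plus a reserved reply budget fit in the window."""
    used = estimate_tokens(system) + estimate_tokens(prompt)
    return used + reserve_for_reply <= CONTEXT_WINDOW


def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Cut text down to at most max_tokens estimated tokens, keeping the start."""
    return text[: max_tokens * CHARS_PER_TOKEN]
```

For a log file, keeping the end rather than the start is often more useful; slicing with text[-max_tokens * CHARS_PER_TOKEN:] gives that variant.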

Step 3: Interactive Chat

For exploratory work — testing prompts, sketching out an idea, asking follow-up questions that depend on previous answers — there is a chat REPL.

┌[ north@macOS ] ~
└➤ apfel --chat
Apple Intelligence · on-device LLM · apfel v1.3.3
──────────────────────────────────────────────────
Type 'quit' to exit.
 
you› This is a test to see if you can respond to me in the terminal.
 ai› Sure, I can help with that! What would you like to test?
 
you›

What this does: Chat mode keeps the conversation history in memory for the duration of the session, within the 4,096-token limit. When the conversation grows past that limit, older turns drop off — there is no magic; the context window is what it is.
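To make "older turns drop off" concrete, here is one way a sliding window over chat history can work. It illustrates the general technique and is not a description of how apfel trims its own history. The token estimate is the same rough four-characters-per-token assumption, and the message dictionaries borrow the role/content shape of the OpenAI schema.

```python
from typing import Dict, List

Message = Dict[str, str]


def rough_tokens(text: str) -> int:
    """Crude size estimate: about four characters per token, minimum one."""
    return max(1, len(text) // 4)


def trim_history(messages: List[Message], budget: int) -> List[Message]:
    """Keep all system messages plus the newest turns that fit within budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    remaining = budget - sum(rough_tokens(m["content"]) for m in system)

    kept: List[Message] = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = rough_tokens(message["content"])
        if cost > remaining:
            break                    # everything older is dropped too
        kept.append(message)
        remaining -= cost
    return system + kept[::-1]       # restore chronological order
```

The practical upshot for a long session is that the model stops seeing whatever fell off the front, so restate anything important rather than assuming it is still in view.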

The chat REPL also accepts a system prompt at launch and can attach Model Context Protocol (MCP) tool servers if you want the model to call out to external tools during a conversation:

apfel --chat -s "You are a helpful coding assistant."
apfel --chat --mcp ./mcp/calculator/server.py

For most uses, the plain apfel --chat is the right call. MCP tooling is there if you need it, but it is not required to get value out of the REPL.

Step 4: Run It as a Local API Server

This is the mode that matters most for integration. apfel exposes the OpenAI chat-completions API on localhost:11434, which means anything that can talk to OpenAI can talk to your Mac’s foundation model with one URL change and no key. That is also Ollama’s default port, so if Ollama is already running, stop it first or the two will collide.

┌[ north@macOS ] ~
└➤ apfel --serve
apfel server v1.3.3
├ endpoint: http://127.0.0.1:11434
├ model:    apple-foundationmodel
├ cors:     disabled
├ origin:   localhost only (http://127.0.0.1, http://localhost, http://[::1])
├ token:    none
├ health:   public
├ max concurrent: 5
├ debug:    off
└ ready
 
Endpoints:
  POST http://127.0.0.1:11434/v1/chat/completions
  GET  http://127.0.0.1:11434/v1/models
  GET  http://127.0.0.1:11434/health

What this does: apfel starts a small HTTP server bound to loopback only, with origin checks that reject requests claiming to come from anywhere other than localhost. There is no auth token by default, because the only callers that can reach it are processes already running on your machine. CORS is disabled for the same reason — this is not meant to be hit from a webpage on the open internet.

A first request looks like any other OpenAI call:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "apple-foundationmodel",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
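The same request from Python needs nothing beyond the standard library. This sketch assumes the response follows the usual OpenAI chat-completions shape (choices[0].message.content); the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"


def build_chat_request(prompt: str, model: str = "apple-foundationmodel",
                       base_url: str = BASE_URL, **options) -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    payload.update(options)  # e.g. temperature=0.2, max_tokens=200
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response body."""
    return body["choices"][0]["message"]["content"]


def chat(prompt: str, **options) -> str:
    """Send one prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **options), timeout=60) as resp:
        return extract_reply(json.load(resp))
```

With the server running, print(chat("Hello")) is the whole program. Building the request separately from sending it makes the payload easy to inspect when a client and server disagree about a field.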

If you want it running in the background the way Ollama does, Homebrew handles that:

brew services start apfel

What this does: Homebrew installs a launchd service definition that starts apfel on login and restarts it if it crashes. brew services stop apfel shuts it down. Logs go to the standard Homebrew services location — brew services info apfel will tell you where.
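Scripts that start right after login can race the service. A small readiness poll against the /health endpoint listed above avoids that. This is a sketch that assumes /health answers with HTTP 200 once the server is up; the source lists the endpoint but not its response, so adjust the check if yours differs.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://127.0.0.1:11434/health"


def probe(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, HTTP error, and so on


def wait_until_ready(check: Callable[[], bool] = probe, attempts: int = 10,
                     delay: float = 0.5,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_ready() at the top of a script gives the server about five seconds to come up before the script gives up.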

Streaming responses work. Non-streaming works. The temperature, max_tokens, and seed parameters are honored and mapped onto the framework’s GenerationOptions. GET /v1/models returns the single available model so OpenAI clients that probe for a model list do not fall over.
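If you consume the stream yourself rather than through an SDK, the parsing is short. The sketch below assumes apfel's stream uses the standard OpenAI wire format: server-sent event lines beginning with data:, each carrying a JSON chunk with choices[].delta.content, ending with a data: [DONE] sentinel. That wire format is inferred from the OpenAI-compatibility claim, not confirmed against apfel itself.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break     # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:  # role-only or empty deltas carry no text
                yield text
```

To use it against the live server, send the request with "stream": true in the body and pass (line.decode("utf-8") for line in response) as the lines argument, printing each fragment as it arrives.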

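The practical payoff: any tool that lets you set a custom OpenAI base URL (Continue.dev in VS Code, the OpenAI Python SDK with base_url=, an in-house script) can be pointed at http://localhost:11434/v1 and run entirely on-device. No code change beyond the URL. Some clients, including the OpenAI Python SDK, refuse to start without an API key; since the server runs with no token by default, any placeholder string should do.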

Where the Foundation Model Falls Short

This is where the trade-off conversation has to be honest. The Apple Foundation Model is a small, fast, on-device model designed for system features like summarizing notifications and rewriting text in Mail. It is not GPT-5. It is not Claude. It will not solve problems that need a frontier-class model.

Concrete limits to plan around:

4,096-token context window. Enough for short prompts and modest summaries. Not enough for long documents, large codebases, or extended back-and-forth.
Modest reasoning. Multi-step logic, math beyond arithmetic, and code generation for non-trivial problems are where the gap with a hosted frontier model shows up most.
Single language model, no embeddings. apfel exposes chat completions only. If you need vector embeddings for retrieval, you still need another tool.
No fine-tuning. The model is what Apple shipped. You can change behavior with system prompts; you cannot change the weights.

Get it if:

You are on a low-RAM Apple Silicon Mac and adding a second model in LM Studio causes memory pressure
You want a quick local fallback for scripts that would otherwise call a hosted API
You are teaching or demoing local AI and want zero-friction setup with no model downloads
You want an OpenAI-compatible endpoint for integration testing without spending tokens

Skip it if:

You need a large context window or strong reasoning — use a real local model in LM Studio, or call a hosted API
You are on Intel, on macOS older than Tahoe, or have Apple Intelligence disabled
Your workload depends on embeddings, fine-tuning, or features apfel does not currently expose

Closing Thoughts

LM Studio is still the right answer most of the time. A well-chosen quantized model in the 4 to 8 billion parameter range will outperform the Apple Foundation Model on almost any reasoning task, and the local-first story is the same. apfel does not replace that workflow.

What apfel replaces is the case where you have already decided to use local AI and your hardware is pushing back. The model is loaded whether you use it or not. You may as well have a doorway to it.

Add the rock to the pile. The next person debugging memory pressure on an 8 gigabyte Mac should not have to start from zero.

North Engineer
