How to setup a local coding agent on macOS

412 points · 102 comments on HN · read original →

Points and comments are a snapshot, not live.

Guide to running Gemma 4 as a local coding agent on macOS using llama.cpp with Multi-Token Prediction for 24% speed gains.

The author set up a local coding agent on an M1 Max Mac using Gemma 4 26B-A4B (Q4 quantization) with llama.cpp Metal acceleration. The baseline achieved 58.2 tokens per second for generation. Adding a Q8 MTP draft model with speculative decoding improved generation speed to 72.2 tokens per second (1.24x speedup) by predicting multiple tokens at once. Testing `--spec-draft-n-max` values from 1 to 6 showed peak performance at 3. The setup includes a multimodal projector for screenshot support and runs as an OpenAI-compatible API server on port 8080. A 17 GB model folder is required. The author also tested Qwen 3.6 35B as a higher-quality alternative but found it slower at 55 tokens per second. llama.cpp with Metal outperformed MLX implementations tested on the same hardware.

What commenters are saying

Users confirmed MTP yields modest but real speedups on Apple Silicon, though gains vary by hardware. Several commenters noted llama.cpp supports direct HuggingFace model downloads via `-hf` flags, eliminating the need for separate CLI tools. Practical concerns emerged: 64 GB RAM is a hard requirement for full models; smaller configurations (48 GB or less) may need smaller models or significant quantization. One user reported MTP occasionally breaks markup in agentic workflows. A thread segment debated local versus hosted models on economic and usability grounds, with consensus that local models are slower but valuable for privacy, offline use, and learning. Some users report success with smaller quantizations on 16-24 GB Macs, though performance expectations differ widely by workload.