RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

267 points · 93 comments on HN

Points and comments are a snapshot, not live.

Running Qwen 3.6 27B Q8 at 80+ tokens/second on an RTX 5080 paired with an RTX 3090 via llama.cpp.

A user paired an RTX 5080 (16 GB) with a refurbished RTX 3090 (24 GB) for local LLM inference on an Asus Prime X570-Pro motherboard. The key BIOS steps were disabling CSM, enabling Above 4G Decoding and ReSize BAR Support, and setting both PCIe slots to Gen 4. For the NVIDIA driver, the write-up says blacklisting the nova driver was required for same-GPU setups, while setups that mix GPU generations use the standard nvidia-open driver. The llama.cpp build set CMAKE_CUDA_ARCHITECTURES="86;120" to target both Ampere (compute capability 8.6) and Blackwell (12.0), with -DGGML_CUDA_NCCL=OFF flagged as counter-productive. Running Qwen 3.6 27B Q8 with a 230k-token context, MTP (multi-token prediction) speculative decoding, and a tensor split of -ts 2,3 across the two cards reached 80-91 tokens/second.
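To make the -ts 2,3 setting concrete, here is a minimal Python sketch of the arithmetic behind it. llama.cpp treats the tensor-split values as proportions, so 2,3 means 2/5 and 3/5 of the offloaded model. The device order (RTX 5080 first, RTX 3090 second) and the helper name split_fractions are assumptions for illustration; the summary does not state which card is device 0.

```python
def split_fractions(weights):
    """Normalize a list of proportions so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


# Values taken from the summary above.
TENSOR_SPLIT = [2, 3]      # from "-ts 2,3"
VRAM_GB = [16, 24]         # RTX 5080, RTX 3090 (device order is an assumption)
CARDS = ["RTX 5080", "RTX 3090"]

model_share = split_fractions(TENSOR_SPLIT)
vram_share = split_fractions(VRAM_GB)

for card, m, v in zip(CARDS, model_share, vram_share):
    print(f"{card}: {m:.0%} of the split, {v:.0%} of total VRAM")
```

Under that assumed ordering, the 2:3 split mirrors the 16:24 VRAM ratio, so each card holds a share of the model proportional to its memory. This is only the ratio arithmetic; it does not account for the KV cache or compute buffers that also consume VRAM.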

What commenters are saying

Commenters questioned the financial logic of the setup: one compared $3/1M-token pricing on OpenRouter with a $2k+ hardware cost plus electricity. Responses cited privacy, autonomy, and hedging against provider rate changes or regulation as motivations beyond a pure cost comparison. A second thread focused on technical gaps: commenters said the article reads as a recipe that does not explain optimal weight distribution, bandwidth utilization, or driver issues. One user noted that even with MTP, achieved throughput was about 720 GB/s against the RTX 3090's 936 GB/s peak memory bandwidth, suggesting room for improvement. A few shared alternative setups: one user runs Qwen 3.6 35B A3B on a 6600 XT (15-20 tok/s), and another mentioned a dual Minisforum DEG1 setup with OCuLink cards.
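The two numeric arguments in the comments can be checked with a few lines of Python. This is a rough sketch using only the figures quoted above; it assumes a flat $3 per million tokens, treats the "2k+" hardware figure as exactly $2,000 (a lower bound), assumes continuous generation at 80 tokens/second, and ignores electricity entirely. The variable names are illustrative.

```python
# Bandwidth utilization on the RTX 3090, from the commenter's figures.
PEAK_GBPS = 936.0        # quoted peak memory bandwidth
ACHIEVED_GBPS = 720.0    # quoted achieved throughput
utilization = ACHIEVED_GBPS / PEAK_GBPS

# Break-even against hosted pricing (all inputs are simplifying assumptions).
PRICE_PER_MILLION_USD = 3.0   # quoted OpenRouter price
HARDWARE_USD = 2000.0         # "2k+" treated as a lower bound
TOKENS_PER_SECOND = 80.0      # low end of the reported 80-91 range

breakeven_tokens = HARDWARE_USD / PRICE_PER_MILLION_USD * 1_000_000
breakeven_days = breakeven_tokens / TOKENS_PER_SECOND / 86_400

print(f"Bandwidth utilization: {utilization:.1%}")
print(f"Break-even volume: {breakeven_tokens / 1e6:.0f}M tokens")
print(f"Continuous generation needed: {breakeven_days:.0f} days")
```

On these assumptions the card runs at roughly 77% of peak bandwidth, and the hardware pays for itself only after several hundred million generated tokens, which is why the replies lean on privacy and autonomy rather than cost alone.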