Do transformers need three projections? Systematic study of QKV variants

arXiv.org · On Hacker News (2026-06-05)

191 points · 36 comments on HN

Points and comments are a snapshot, not live.

Transformers can omit one or two of the standard three QKV projections with minimal performance loss and substantial memory savings.

Researchers systematically evaluated transformer variants that share projections: Q-K=V (shared key-value), Q=K-V (shared query-key), and Q=K=V (single projection). Across synthetic tasks, vision benchmarks, and language models up to 1.2B parameters trained on 10B tokens, these variants matched or occasionally exceeded standard QKV performance. The Q-K=V variant achieved a 50% KV cache reduction with only 3.1% perplexity degradation. Combined with grouped query attention (GQA), Q-K=V yielded an 87.5% cache reduction; combined with multi-query attention (MQA), it achieved a 96.9% reduction, enabling on-device inference. The authors attribute this success to keys and values occupying similar representational spaces and to attention operating in a low-rank regime, while noting that Q=K-V breaks attention directionality. The work has been accepted at ICML 2026.
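The cache percentages and the directionality remark above can both be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code. It assumes 16 query heads, with 4 key-value heads under GQA and 1 under MQA; the summary does not state these head counts, and they are chosen here only because they reproduce the quoted 87.5% and 96.9% figures. The second part uses small made-up integer matrices to show one plausible reading of "breaks attention directionality": when queries and keys come from the same projection, the pre-softmax score matrix is symmetric, so token i scores token j exactly as j scores i. All function and variable names are hypothetical.

```python
from fractions import Fraction

# --- Part 1: KV cache arithmetic (head counts are assumptions) ---
NUM_QUERY_HEADS = 16  # assumption: not stated in the summary
GQA_KV_HEADS = 4      # assumption: 4 KV heads shared by 16 query heads
MQA_KV_HEADS = 1      # MQA uses a single KV head by definition


def kv_cache_reduction_percent(num_query_heads, num_kv_heads, share_kv):
    """Percent reduction relative to a multi-head cache storing K and V."""
    head_factor = Fraction(num_kv_heads, num_query_heads)
    # Sharing K and V means one cached tensor per token instead of two.
    tensor_factor = Fraction(1, 2) if share_kv else Fraction(1)
    return float(100 * (1 - head_factor * tensor_factor))


shared_kv_only = kv_cache_reduction_percent(NUM_QUERY_HEADS, NUM_QUERY_HEADS, True)
shared_kv_gqa = kv_cache_reduction_percent(NUM_QUERY_HEADS, GQA_KV_HEADS, True)
shared_kv_mqa = kv_cache_reduction_percent(NUM_QUERY_HEADS, MQA_KV_HEADS, True)
print(shared_kv_only, shared_kv_gqa, shared_kv_mqa)  # 50.0 87.5 96.875


# --- Part 2: a shared query-key projection gives symmetric scores ---
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def attention_scores(x, w_q, w_k):
    """Pre-softmax scores Q K^T, without scaling or masking."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    return matmul(q, transpose(k))


# Toy data: 3 tokens with 2 features each; all values are made up.
X = [[1, 2], [3, -1], [0, 4]]
W_Q = [[1, 0], [2, 1]]
W_K = [[0, 1], [2, -1]]

separate_scores = attention_scores(X, W_Q, W_K)  # standard: distinct Q and K
shared_scores = attention_scores(X, W_Q, W_Q)    # shared query-key projection

print(separate_scores == transpose(separate_scores))  # False
print(shared_scores == transpose(shared_scores))      # True
```

The 96.875% result rounds to the quoted 96.9%. A causal mask would still restrict which positions a token can attend to, so the symmetry here concerns only the raw scores.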

What commenters are saying

Commenters highlighted a confusing notation issue: in expressions such as "Q-K=V", the hyphen is a separator rather than a subtraction and the equals sign marks the shared projections, which caused considerable head-scratching. Several proposed clearer alternatives such as tuple notation (Q=K, V). Beyond notation, skepticism centered on training scale: a 1.2B model trained on 10B tokens is undertrained relative to Chinchilla scaling laws, raising the question of whether the benefits persist with trillions of tokens and larger models. One commenter noted that the performance degradation shrinks from 5.4% to 2.2% as sequence length increases from 512 to 2048, suggesting that the acceptable results are not merely an artifact of short sequences. The general sentiment acknowledges the ablation study's value while treating the results as preliminary data points that need larger-scale validation.
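To put the undertraining concern in numbers, here is a rough back-of-the-envelope check. It assumes the commonly cited rule of thumb of about 20 training tokens per parameter for compute-optimal training; that ratio is an approximation brought in for illustration, not a figure from the paper or the thread, and the variable names are hypothetical.

```python
# Rough Chinchilla-style token budget check (rule-of-thumb ratio is an assumption).
TOKENS_PER_PARAM = 20            # assumption: approximate compute-optimal ratio
model_params = 1_200_000_000     # 1.2B parameters, as reported in the summary
trained_tokens = 10_000_000_000  # 10B tokens, as reported in the summary

compute_optimal_tokens = model_params * TOKENS_PER_PARAM
fraction_of_optimal = trained_tokens / compute_optimal_tokens

print(f"compute-optimal budget: {compute_optimal_tokens / 1e9:.0f}B tokens")
print(f"actual budget is {fraction_of_optimal:.0%} of that")
```

Under that assumption the largest model saw a bit under half of a compute-optimal token budget, and far less than the trillions of tokens the commenters mention.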