A robot is sprinting towards you. Do you want it running on Claude or Grok?
Grok 4.1 Fast won 43% of matches in an LLM battle royale, beating Claude Sonnet 4.6 by 27x on cost per win.
Eleven LLMs played 30 games of a 2D battle royale. Grok 4.1 Fast won 13 matches at $0.97 per win. Claude Sonnet 4.6 won 5 matches at $26.78 per win. GPT 5.4 had the most kills but only 2 wins. Three models (GPT 5.4-mini, DeepSeek V4 Flash, Kimi K2.6) spent $57 in total and won zero games. The author attributes Grok's success to a lower alignment tax: it lacked the trained-in hesitation that nudged other models toward cooperative behavior. Claude Sonnet often sought truces and shared its position. Grok devised car-ramming tactics and used disciplined firing rules (90%+ hit chance). Total cost for all 30 games was $482.
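The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the summary; the function names are illustrative, and the per-model spend is not stated in the source, so `implied_spend` assumes that cost per win means a model's total spend divided by its wins.

```python
# Figures quoted in the summary; nothing here comes from the article's raw data.
TOTAL_GAMES = 30

results = {
    "Grok 4.1 Fast": {"wins": 13, "cost_per_win": 0.97},
    "Claude Sonnet 4.6": {"wins": 5, "cost_per_win": 26.78},
}


def win_rate(wins, games=TOTAL_GAMES):
    """Fraction of all matches won by one model."""
    return wins / games


def implied_spend(wins, cost_per_win):
    """Total spend implied by the quoted figures.

    Assumption: cost per win = total spend / wins. The source does not
    state per-model spend directly.
    """
    return wins * cost_per_win


grok = results["Grok 4.1 Fast"]
claude = results["Claude Sonnet 4.6"]

# How many times more Claude paid per win than Grok did.
cost_ratio = claude["cost_per_win"] / grok["cost_per_win"]

print(f"Grok win rate: {win_rate(grok['wins']):.0%}")
print(f"Cost-per-win ratio (Claude / Grok): {cost_ratio:.1f}x")
for name, r in results.items():
    spend = implied_spend(r["wins"], r["cost_per_win"])
    print(f"{name}: implied spend ${spend:.2f}")
```

Under that assumption, the two models' implied spend accounts for only part of the $482 total, which is consistent with the remaining nine models making up the rest.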
What commenters are saying
Commenters were skeptical of the article's writing style, with many detecting LLM-generated prose and criticizing its structure. A running joke about Grok delivering tacos vs. Claude breaking traffic laws for hospital runs captured the contrast in model personality. Some argued the article reduces LLM evaluation to numbers and costs and misses the engineering substance. Others defended the experiment as a useful way to assess model values for specific tasks. One commenter called the absence of models that can be run offline a red flag.