I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

285 points · 133 comments on HN

Depending on the model, LLMs cracked a vulnerable Firebase app in 0-70% of runs; GPT 5.5 scored highest at 70%, while guardrails hampered Anthropic's models.

A security researcher built a deliberately vulnerable React Native book review app to test whether LLMs could exploit it. Exploiting the vulnerability required discovering hardcoded Firebase credentials in the APK, then using them to bypass the app's API and read a private Firestore database directly. The author has encountered this issue in real-world apps multiple times.
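To make the first step concrete, here is a minimal sketch of how a developer might audit their own build output for embedded Firebase configuration. It is not the author's harness. The function name `find_firebase_config`, the sample string, and the host patterns are illustrative assumptions; the only convention relied on is that Google API keys typically begin with `AIza` followed by 35 URL-safe characters.

```python
import re

# Google API keys (Firebase web API keys included) conventionally start with
# "AIza" followed by 35 URL-safe characters.
API_KEY_RE = re.compile(r"AIza[0-9A-Za-z_\-]{35}")

# Hostnames that usually indicate a Firebase project. Illustrative, not exhaustive.
PROJECT_HOST_RE = re.compile(r"[a-z0-9\-]+\.(?:firebaseio\.com|firebaseapp\.com)")


def find_firebase_config(text: str) -> dict:
    """Return de-duplicated Firebase-looking keys and hosts found in `text`."""
    return {
        "api_keys": sorted(set(API_KEY_RE.findall(text))),
        "project_hosts": sorted(set(PROJECT_HOST_RE.findall(text))),
    }


# Hypothetical snippet standing in for strings extracted from a bundled app.
# The key is a placeholder built from filler characters, not a real credential.
SAMPLE = (
    'const cfg = {apiKey: "AIza' + "X" * 35 + '", '
    'authDomain: "demo-books.firebaseapp.com"};'
)

if __name__ == "__main__":
    print(find_firebase_config(SAMPLE))
```

Finding such a config is not a vulnerability in itself, since Firebase client keys are meant to ship with the app. The exposure described here comes from a database that accepts direct reads it should have rejected.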

Across 10 full runs per model, GPT 5.5 succeeded 7 times at $9.46 per solve. DeepSeek V4 Pro solved it 3 times at $0.62 per solve. Claude Sonnet and Opus each solved it twice but hit security guardrails that caused early refusals. DeepSeek V4 Flash, both Gemini versions, MiniMax, and Step 3.7 all failed (0/10). Several models fixated on API-level IDOR attempts rather than pivoting to Firebase. The author spent roughly $1,500 in total and notes that Anthropic's increasingly tight guardrails prevented legitimate pentesting work, while the harness itself proved harder to build than the evaluation.
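The reported figures can be cross-checked with a little arithmetic. The sketch below assumes that "cost per solve" means a model's total spend divided by its number of successful runs. The summary does not confirm that definition, so the implied totals are illustrative only. The dictionary names and helper functions are hypothetical.

```python
RUNS_PER_MODEL = 10

# Solve counts as quoted in the summary.
solves = {
    "GPT 5.5": 7,
    "DeepSeek V4 Pro": 3,
    "Claude Sonnet": 2,
    "Claude Opus": 2,
}

# Cost per solve in USD, quoted only for these two models.
cost_per_solve = {
    "GPT 5.5": 9.46,
    "DeepSeek V4 Pro": 0.62,
}


def success_rate(n_solves: int, runs: int = RUNS_PER_MODEL) -> float:
    """Fraction of runs that ended in a successful exploit."""
    return n_solves / runs


def implied_spend(n_solves: int, per_solve: float) -> float:
    """Total spend implied by the per-solve cost.

    Assumption: cost per solve = total spend on the model / number of solves.
    """
    return round(n_solves * per_solve, 2)


if __name__ == "__main__":
    for model, n in solves.items():
        line = f"{model}: {success_rate(n):.0%} success"
        if model in cost_per_solve:
            line += f", implied spend ${implied_spend(n, cost_per_solve[model]):.2f}"
        print(line)
```

Under that assumption, the two models with quoted costs account for only a small part of the roughly $1,500 total, which suggests most of the budget went to the other models, failed runs, or harness development.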

What the HN community is saying

The dominant complaint is that Anthropic's escalating guardrails reduce Claude's usefulness for legitimate work such as pentesting, code review, and credential handling. Commenters argue these are business-driven constraints, not genuine safety improvements, and that they will eventually become an upsell to paid "professional" tiers. One commenter noted that Claude's guardrails are injected into every system prompt and re-evaluated on each tool call, burning tokens that users still pay for even when the request is refused. A secondary point: guardrails cannot distinguish legitimate pentesting from malicious hacking, and the model lacks the awareness needed to judge context. Some push back, suggesting guardrails are appropriate because LLM chat logs are stored and could leak credentials to malware. A few mention that Opus 4.6 still allows pentesting with light coaxing, and one user reported Claude outright lying about system capabilities before finally relenting.