Llama 4 Indie Maker Complete Guide: Scout vs Maverick, API vs Self-Hosting — What's the Right Call?
Meta released Llama 4 on April 5, 2025, and things got complicated fast.
On one side: official claims that "Maverick benchmarks beat GPT-4o." On the other: a controversy where LeCun himself admitted "results were fudged a little bit." On HN, some called it a flop, while others reported saving 90% on API costs for batch workloads.
If you're an indie maker considering whether to shift some workloads from Claude / GPT-4o to Llama 4, you don't need another benchmark deep-dive. You need a decision framework built around cost calculation and scenario selection. That's what this guide is.
TL;DR
- Scout is the indie maker's choice (Groq API $0.11/$0.34); use Maverick via API ($0.20/$0.60) — don't self-host it
- The benchmark controversy is real (confirmed by LeCun), coding tasks genuinely lag, but cost advantages for batch/retrieval workloads are unaffected
- "17B active parameters" does not mean 17GB VRAM — MoE loads all 109B params, requiring at least 55GB with INT4
- Renting cloud H100s for self-hosting is almost always more expensive than Groq API; only consider self-hosting if you already own an RTX 4090 or Mac Studio
- 10M context is powerful for retrieval (98% accuracy), not for synthesis (quality drops after 2M+)
- Together.ai Scout pricing at $0.18/$0.59 is roughly 2x OpenRouter's $0.08/$0.30 — the premium is only worth it for compliance requirements
What Is Llama 4? Scout vs Maverick in 30 Seconds
Llama 4 uses a MoE (Mixture of Experts) architecture — not all parameters are activated on every forward pass. Instead, only a subset of experts is used per inference. This makes the model "look large but run efficiently."
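To make the routing idea concrete, here's a toy sketch of top-k expert gating — illustrative only, not Llama 4's actual router. A gate scores every expert for each token, but only the top-k highest-scoring experts actually execute:

```python
import math
import random

def top_k_experts(gate_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    exp_scores = [math.exp(gate_logits[i]) for i in top]
    total = sum(exp_scores)
    return [(i, e / total) for i, e in zip(top, exp_scores)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(16)]  # 16 experts, like Scout
active = top_k_experts(logits, k=2)
print(active)  # only 2 of 16 experts run for this token
```

All 16 experts' weights still sit in memory; the savings are compute, not VRAM — which is exactly the "17B active ≠ 17GB" trap discussed later.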
| | Scout | Maverick |
|---|---|---|
| Active params | 17B | 17B |
| Total params | 109B (16 experts) | 400B (128 experts) |
| Context | 10M tokens | 1M tokens |
| Minimum self-host hardware | 1x H100 (INT4) / RTX 4090 (Q4) | 4x H100 (INT4) |
| Groq API pricing | $0.11/$0.34 | $0.20/$0.60 |
| Positioning | GPT-4o mini tier + ultra-long context | GPT-4o tier (disputed) |
For most indie makers, the answer is Scout. Self-hosting Maverick requires 4x H100 GPUs, which indie scale simply doesn't justify; if you want Maverick, use the API. Even then, its quality improvement rarely justifies a 2x price premium for batch or retrieval tasks.
The Benchmark Controversy: Should I Trust Llama 4?
Bottom line first: LMArena rankings are invalid, coding scenarios genuinely underperform, but batch workloads still deliver cost advantages.
Here's the full timeline:
- Meta submitted a chat-tuned experimental version called "Llama-4-Maverick-03-26-Experimental" (not the publicly downloadable release) to LMArena
- Researcher Nathan Lambert and others flagged inconsistencies between the submitted version and the public release
- Meta VP Ahmad Al-Dahle initially denied it; LMArena subsequently updated its policy to prohibit fine-tuned submissions
- In January 2026, then-departing Meta AI chief Yann LeCun confirmed: "results were fudged a little bit"
- Rootly's independent coding benchmark: Llama 4 came in last place with 69.5% accuracy — 18% behind the top performer
The HN community consensus: "It feels like a flop because the expectations are real."
How should you interpret this?
- Don't use LMArena rankings as a reference — they were achieved with a specially tuned version
- The gap in coding tasks may be a structural weakness of the MoE architecture: multi-step coding requires tracking state across steps, which MoE's per-token expert routing handles poorly
- But batch classification, document summarization, retrieval QA, and other "relatively independent per-call" tasks are completely unaffected — these workloads care about cost efficiency, not leaderboard rankings
Confidence rating: Benchmark controversy facts (HIGH confidence, confirmed by multiple independent sources). Coding underperformance conclusion (MEDIUM confidence — Rootly is a single independent test, but structural MoE weakness has theoretical backing).
Full API Pricing Comparison
Not all Llama 4 API providers charge the same — the gaps are larger than you'd expect.
Pricing data as of April 2026, based on each provider's official pricing page.
| Provider | Scout Input $/1M | Scout Output $/1M | Maverick Input $/1M | Maverick Output $/1M | Notes |
|---|---|---|---|---|---|
| OpenRouter | $0.08 | $0.30 | $0.15 | $0.60 | Cheapest, auto-routing |
| Groq | $0.11 | $0.34 | $0.20 | $0.60 | Fastest (LPU ~408 tok/s) |
| Together.ai | $0.18 | $0.59 | $0.55 | $2.19 | SOC 2 Type II + HIPAA |
Three selection logics:
- Cost-first → OpenRouter (Scout output $0.30, cheapest available)
- Speed-first → Groq (LPU architecture, p50 latency < 500ms)
- Compliance requirements (HIPAA / SOC 2) → Together.ai (roughly 2x premium, but with clear compliance certifications)
Together.ai is Meta's official partner, but "official partner" doesn't mean "best value." If you have no clear compliance requirements, choose OpenRouter or Groq.
For comparison: Claude Sonnet 4.6 output pricing is $15.00/1M tokens; Groq Scout is $0.34 — 44x cheaper. But price isn't the only decision factor — more on that below.
Llama 4 vs Claude / GPT-4o Cost Calculation
Let's use real tasks rather than abstract pricing comparisons.
Assumptions: 1:3 input:output token ratio (200 input + 600 output tokens per call), 30,000 calls per month (~1,000 per day).
| Plan | Monthly Cost Calculation | Monthly Cost |
|---|---|---|
| Groq Scout | (200×$0.11 + 600×$0.34) / 1M × 30,000 | $6.78 |
| OpenRouter Scout | (200×$0.08 + 600×$0.30) / 1M × 30,000 | $5.88 |
| Claude Haiku 4.5 | (200×$1.00 + 600×$5.00) / 1M × 30,000 | $96.00 |
| Claude Sonnet 4.6 | (200×$3.00 + 600×$15.00) / 1M × 30,000 | $288.00 |
| GPT-4o mini | (200×$0.15 + 600×$0.60) / 1M × 30,000 | $11.70 |
Groq Scout is 93% cheaper than Haiku 4.5 and 97% cheaper than Sonnet 4.6.
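The table's arithmetic is easy to reproduce. A minimal calculator using the same assumptions (prices are the April 2026 snapshot above and will drift):

```python
def monthly_cost(in_price, out_price, in_tokens=200, out_tokens=600, calls=30_000):
    """Monthly USD cost: per-call token spend at $/1M-token rates, times calls."""
    per_call = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return per_call * calls

print(f"Groq Scout: ${monthly_cost(0.11, 0.34):.2f}")   # $6.78
print(f"Haiku 4.5:  ${monthly_cost(1.00, 5.00):.2f}")   # $96.00
print(f"Sonnet 4.6: ${monthly_cost(3.00, 15.00):.2f}")  # $288.00
```

Swap in your own token counts and current prices to re-run the comparison at any time.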
But saving 90%+ doesn't mean you should switch everything over. Here's the scenario breakdown:
Tasks well-suited for switching to Llama 4:
- Batch document summarization (each document is independent — no cross-document reasoning required)
- Data classification / tagging (keyword extraction, sentiment analysis)
- Codebase navigation / retrieval (finding specific functions, tracing call paths)
- Image-and-text extraction (Scout is natively multimodal, available to non-EU users)
Tasks not suited for switching:
- Complex multi-step coding (Rootly test shows 18% gap)
- Multi-turn tool calling agents (Maverick still marked "under development" as of April 2026)
- Real-time chat with extremely long context (TTFT > 60 seconds at 10M tokens)
- Safety-critical outputs (hallucination rates at long context lack reliable data)
How to estimate your own token distribution? Enable usage logging in your API calls and record one week of prompt_tokens and completion_tokens to calculate your actual input:output ratio. Different application types vary significantly — chatbots are typically 1:3, while summarization tasks may be 10:1. Plug your real numbers into the formula above rather than relying on my assumptions.
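A minimal sketch of that ratio calculation, assuming you've logged each call's `prompt_tokens` and `completion_tokens` (field names follow the common OpenAI-style usage object; adapt to whatever your provider returns):

```python
def io_ratio(usage_log):
    """usage_log: list of (prompt_tokens, completion_tokens) pairs from logged calls.
    Returns output/input; e.g. 3.0 means a 1:3 input:output ratio."""
    total_in = sum(p for p, _ in usage_log)
    total_out = sum(c for _, c in usage_log)
    return total_out / total_in

week = [(210, 590), (180, 640), (220, 570)]  # sample logged calls
print(f"output/input ratio: {io_ratio(week):.2f}")
```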
What Can You Actually Do with a 10M Token Context?
Scout's 10M token context window is a real feature, not a marketing gimmick — but you need to understand what it can and can't do.
Meta's official NIAH (Needle In A Haystack) benchmark shows: 98% retrieval accuracy at 10M context.
But there's a critical distinction: context-as-database (retrieval) vs context-as-working-memory (synthesis).
Retrieval (Effective, 10M Usable)
Finding specific information in an extremely long context — like Ctrl+F, but smarter:
- Full codebase analysis (500K–2M tokens): finding specific API calls, tracing dependency chains, generating onboarding documentation
- Legal/contract batch processing: comparing clause conflicts across 50+ contracts in a single batch (10M tokens ≈ 7,000 pages of documents)
- Long-term research assistant: 6–12 months of notes and papers in persistent context, queryable at any time
Synthesis (Limited, Quality Drops After 2M+)
Requiring the model to synthesize new perspectives or restructure content across a large body of material — like asking it to "read all 50 files and then refactor the architecture":
Community testing and analysis indicate that synthesis task quality drops significantly beyond 2M tokens. "Throw the entire codebase in and ask Llama 4 to refactor it" is an unrealistic expectation.
Conclusion: 10M context is a context-as-database tool — use it to search, locate, and compare. It's not a context-as-working-memory system — don't expect deep synthesis across 10M tokens.
Self-Hosting Llama 4 Hardware Requirements: Don't Be Fooled by "17B"
This is the most common technical misconception: "Scout has 17B active parameters, so VRAM requirements are similar to a 17B dense model."
Wrong. In MoE (Mixture of Experts), all expert parameters must be loaded into memory — not just the subset activated during each forward pass.
The math:
- 109B total params × 2 bytes (BF16) = ~218GB VRAM (infeasible for consumers)
- 109B × 0.5 bytes (INT4) = ~55GB VRAM (1x H100 80GB)
- For comparison: a 17B dense model in INT4 only needs ~9GB
| Model | Precision | VRAM Required | Recommended Hardware | Performance |
|---|---|---|---|---|
| Scout | BF16 | ~218GB | Infeasible (consumer) | — |
| Scout | INT4 | ~55GB | 1x H100 80GB | Standard production |
| Scout | Q4 (Ollama) | ~24GB | RTX 4090 / Mac M4 Pro 48GB | 25–40 tok/s |
| Scout | 1.78-bit (Unsloth) | ~14GB | RTX 3080 16GB | ~20 tok/s (significant quality loss) |
| Maverick | INT4 | ~200GB | 4x H100 | Not indie-scale |
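The VRAM arithmetic above reduces to total parameters times bytes per parameter — a rough floor, since KV cache and activations add more on top:

```python
def vram_gb(total_params_billions, bytes_per_param):
    """Rough weight-memory floor in GB: total params (billions) x bytes per param."""
    return total_params_billions * bytes_per_param

print(vram_gb(109, 2.0))  # Scout BF16: 218 GB
print(vram_gb(109, 0.5))  # Scout INT4: ~55 GB
print(vram_gb(17, 0.5))   # hypothetical 17B dense model, INT4: ~9 GB
```

The key input is total parameters (109B), never active parameters (17B) — MoE gives you dense-17B compute cost at dense-109B memory cost.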
Ollama Quick Install
```shell
# Install Ollama (macOS)
brew install ollama

# Download Llama 4 Scout (Q4, requires 24GB+ VRAM)
ollama pull llama4

# Run
ollama run llama4
```
Performance expectations (community-reported, MEDIUM confidence):
- M4 Pro Mac 48GB: ~30–40 tok/s
- RTX 4090 24GB: ~25–35 tok/s
- M3 Max 36GB: ~20–28 tok/s
Note: Maverick does not support Ollama consumer deployment (requires 200GB+ VRAM).
API vs Self-Hosting Cost Analysis: When Does Self-Hosting Actually Pay Off?
Let's look at the numbers first.
| Self-Host Option | Monthly Cost | vs Groq Scout | Break-even Monthly Tokens |
|---|---|---|---|
| Rent H100 (Vast.ai) | ~$1,075 | Groq is almost always cheaper | ~3.8B tokens (not practical) |
| Rent H100 (Lambda Labs) | ~$2,153 | Groq is always cheaper | ~7.6B tokens (impossible) |
| Own RTX 4090 (electricity only) | ~$20–30 | Break-even at 50–100M tokens/month | 50–100M tokens |
| Own Mac Studio M4 Ultra (electricity only) | ~$15–25 | Faster break-even | 40–80M tokens |
Break-even calculations based on Groq Scout pricing of $0.11/$0.34 (as of 2026-04-18), assuming a 1:3 token ratio.
The conclusion is clear: unless you already own the hardware, cloud-rented self-hosting will always cost more than Groq API.
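The break-even column comes from dividing monthly hardware cost by Groq's blended per-token price at a 1:3 ratio (25% input, 75% output):

```python
def breakeven_tokens_m(monthly_hw_cost, in_price=0.11, out_price=0.34):
    """Monthly tokens (in millions) where self-hosting cost equals Groq Scout
    API cost, assuming a 1:3 input:output token mix."""
    blended_per_m = 0.25 * in_price + 0.75 * out_price  # $/1M tokens blended
    return monthly_hw_cost / blended_per_m

print(breakeven_tokens_m(1075))  # rented Vast.ai H100: ~3,805M = ~3.8B tokens
print(breakeven_tokens_m(25))    # owned RTX 4090, electricity only: ~88M tokens
```

The two orders of magnitude between those results are the whole argument: rented hardware never breaks even at indie scale, while already-owned hardware can.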
But there's one frequently overlooked hidden cost: DevOps maintenance time. A solo side project spending 3–5 hours per week maintaining Ollama/vLLM (model updates, scaling, debugging) costs $600–1,000/month at $50/hr. Factor that in, and even with existing hardware, the break-even point shifts significantly higher.
Honestly, most indie makers spend $10–100 per month on API fees. By the time self-hosting becomes a serious consideration, your product should already have enough revenue to justify the infrastructure investment.
Indie Maker Scenario Selection Matrix
| Task Type | Llama 4 Scout | Claude Haiku 4.5 | Depends on Scale |
|---|---|---|---|
| Batch document summarization | ✅ First choice (save 90%+) | Higher quality but 14x more expensive | — |
| Data classification / tagging | ✅ First choice | — | — |
| Keyword extraction | ✅ First choice | — | — |
| Codebase retrieval | ✅ 10M context advantage | — | — |
| Image/text extraction | ✅ (non-EU users) | ❌ Not supported | Claude vision more stable |
| Complex coding copilot | ❌ 18% behind | — | ✅ Claude Sonnet |
| Multi-turn agent | ❌ Tool calling unstable | ✅ | — |
| Real-time chat > 10 concurrent | ⚠️ Groq rate limits | ✅ | — |
| Article writing (English) | ⚠️ Quality varies by task | ✅ More consistent quality | — |
A hybrid architecture is the most pragmatic approach:
- Batch / classification / retrieval tasks → Groq Scout (save 90%+)
- Quality-critical user-facing tasks → Claude Haiku 4.5 fallback
- Assuming 70% goes through Scout and 30% through Haiku, hybrid costs are ~60% cheaper than pure Haiku
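A sketch of that routing logic — the task labels and model identifiers below are illustrative placeholders, not fixed API names:

```python
# Task types cheap enough in quality demands to route to Scout (illustrative set)
CHEAP_TASKS = {"summarize_batch", "classify", "extract_keywords", "retrieve"}

def pick_provider(task_type):
    """Route bulk tasks to Groq Scout; quality-critical tasks fall back to Haiku."""
    if task_type in CHEAP_TASKS:
        return ("groq", "llama-4-scout")        # illustrative model id
    return ("anthropic", "claude-haiku-4.5")    # illustrative model id

print(pick_provider("classify"))    # ('groq', 'llama-4-scout')
print(pick_provider("chat_reply"))  # ('anthropic', 'claude-haiku-4.5')
```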
License Risks and Long-Term Strategic Assessment
The Llama 4 Community License is not what most people think of as "open source" — it's source-available and does not conform to the Open Source Definition (OSI standard).
Three Key License Restrictions
- MAU cap: Monthly active users exceeding 700 million require additional authorization from Meta (indie makers won't come close to this in practice)
- EU multimodal restriction: EU users cannot use Llama 4's vision features (Scout/Maverick's multimodal capabilities). Text features remain available in the EU
- Non-OSI open source: This is not true open source — Meta retains greater control
SaaS developers with EU users take note: If your product serves EU users and uses Llama 4's vision features (e.g., letting users upload screenshots for analysis), you're technically in violation of the license terms. Text features are not affected.
Meta's Long-Term Strategic Risk
Several concerning signals have emerged between 2025 and 2026:
- LeCun's departure + VP Joelle Pineau's resignation — significant leadership reshuffling in Meta AI
- Digitimes reported in December 2025 that Meta delayed Llama's successor, with internal teams moving toward closed-source
- Zuckerberg has reportedly sidelined the GenAI org in internal reorganizations
Recommendation: Don't assume Llama 5 will be open. Before depending heavily on Llama 4, design a provider-agnostic fallback mechanism. The simplest approach: use an abstraction layer to isolate API calls (switching from Groq to Claude requires only changing the endpoint + model name), keeping the switch cost under 20 lines of code.
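One way to keep the switch cost near zero is a config-driven abstraction layer: provider details live in one dict, and failing over means changing a single entry. Model names below are illustrative placeholders (check each provider's docs for exact identifiers); the Groq base URL is its OpenAI-compatible endpoint:

```python
PROVIDERS = {
    "groq":   {"base_url": "https://api.groq.com/openai/v1", "model": "llama-4-scout"},
    "claude": {"base_url": "https://api.anthropic.com",      "model": "claude-haiku-4.5"},
}
ACTIVE = "groq"  # flip this one line to fail over

def endpoint_and_model(provider=None):
    """Resolve the active provider's endpoint and model from config."""
    cfg = PROVIDERS[provider or ACTIVE]
    return cfg["base_url"], cfg["model"]

print(endpoint_and_model())  # current provider's endpoint + model
```

All call sites go through `endpoint_and_model()`, so no application code mentions a provider by name.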
License information is based on the Llama 4 Community License as of 2026-04-18. Meta may modify terms at any time.
Decision Matrix: Determine in 3 Minutes Whether Llama 4 Is Right for You
There's a lot of information here. Let's compress it into three steps:
Step 1: Task Type Filter
- Is your primary workload a coding copilot or multi-turn agent? → Not recommended to switch — Claude/GPT-4o are still better
- Is your primary workload batch processing, classification, or retrieval? → Continue to Step 2
Step 2: Estimate Monthly Token Volume + Choose API Provider
Monthly cost = (input_tokens × input_price + output_tokens × output_price) / 1,000,000 × monthly_calls
| Monthly Token Volume | Recommendation |
|---|---|
| < 100M tokens | Groq or OpenRouter API (monthly cost < $50 — don't think about self-hosting) |
| 100M–1B tokens | Groq API + Haiku fallback hybrid architecture |
| > 1B tokens and you own a GPU | Evaluate self-hosting (RTX 4090 / Mac Studio) |
| > 1B tokens and no GPU | Still use API (renting cloud H100 is not cost-effective) |
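Step 2's table as a small helper, with thresholds taken straight from the rows above (adjust to taste):

```python
def recommend(monthly_tokens_m, owns_gpu=False):
    """Map monthly token volume (in millions) to the table's recommendation."""
    if monthly_tokens_m < 100:
        return "Groq or OpenRouter API"
    if monthly_tokens_m < 1000:
        return "Groq API + Haiku fallback hybrid"
    return "Evaluate self-hosting" if owns_gpu else "Still use API"

print(recommend(30))                    # Groq or OpenRouter API
print(recommend(500))                   # Groq API + Haiku fallback hybrid
print(recommend(2000, owns_gpu=True))   # Evaluate self-hosting
```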
Step 3: Compliance and Regional Filter
- Have HIPAA / SOC 2 requirements? → Together.ai (~2x premium, with clear certifications)
- Have EU users + using vision features? → Exclude Llama 4 multimodal, switch to Claude vision
- Neither of the above? → OpenRouter (cheapest) or Groq (fastest)
Risk Disclosure
Pricing changes constantly: The API market is highly competitive. Prices cited in this article are a snapshot from April 2026. For live data, check each provider's pricing page.
Benchmark limitations: The Rootly coding benchmark cited in this article is a single independent test with limited sample size. The conclusion about coding underperformance has theoretical backing from MoE structural weaknesses, but does not mean Llama 4 will necessarily underperform in every coding scenario.
Cost estimates are based on assumptions: Cost calculations assume a 1:3 input:output token ratio and 30,000 calls per month. Your actual token distribution may vary significantly — measuring your real numbers should be the first thing you do after going live.
License risk: The terms of the Llama 4 Community License may be modified at any time. The license analysis in this article is based on conditions as of 2026-04-18.
Conclusion
Llama 4 is neither "a cheap Claude replacement" nor "a failure to ignore just because of benchmark manipulation."
It's a tool with clearly defined use cases: batch classification, document summarization, codebase retrieval — for these tasks, Groq Scout is 93% cheaper than Claude Haiku with sufficient quality to get the job done. But coding copilots and multi-turn agents are a different story — this is a structural limitation of the MoE architecture, not something you can fix with prompt engineering.
The most pragmatic approach: a hybrid architecture. Route batch tasks through Groq Scout ($0.11/$0.34), and send quality-sensitive user-facing features to Claude Haiku 4.5 ($1/$5) — with a try/except switch in 20 lines of code. You'll save 60%+ on API costs without compromising the tasks that matter most.
Start now: use the formula above to estimate your monthly spend, run it through the decision matrix, and pick the first scenario to test. Remember — you don't need to switch everything at once. Start by running one batch task on Groq Scout for a week, quantify the savings, then decide whether to expand.
FAQ
Should I choose Llama 4 Scout or Maverick?
Most indie makers should choose Scout. Scout is a 17B active / 109B total MoE model that runs on a single H100 (INT4) or RTX 4090 (Q4), with Groq API pricing at just $0.11/$0.34 per 1M tokens. Maverick is a larger model with 128 experts — self-hosting requires 4x H100 GPUs, so at indie scale you'll almost always use it via API (Groq $0.20/$0.60). Unless you need higher-quality reasoning or vision capabilities, Scout is sufficient.
Does the Llama 4 benchmark controversy mean it's not usable?
No. Meta submitted a specially tuned experimental version (not the public release) to LMArena, and LeCun confirmed in January 2026 that 'results were fudged a little bit.' This invalidates the LMArena rankings but doesn't affect the public version's real-world performance. Independent tests show coding tasks genuinely lag behind (Rootly tested 69.5% accuracy), but the cost advantages for batch classification, summarization, and retrieval tasks remain real. Conclusion: don't use Llama 4 as a coding copilot, but it's still the top cost-saving choice for high-throughput batch workloads.
How much VRAM does self-hosting Llama 4 require?
Llama 4 Scout advertises '17B active parameters,' but the MoE architecture requires all expert parameters (109B total) to be loaded into memory. BF16 requires ~218GB VRAM (infeasible for consumers), INT4 quantization requires ~55GB (one H100 80GB), and Q4 quantization requires ~24GB (RTX 4090 or Mac M4 Pro 48GB). A '17B model' does not mean '17GB VRAM.'
When does self-hosting beat using the API?
For most indie makers: basically never. Renting an H100 in the cloud costs $1,075–2,153/month, and you'd need to process ~3.8 billion tokens per month to undercut Groq API pricing — virtually impossible. The only exception: if you already own an RTX 4090 or Mac Studio (paying only ~$20–30/month in electricity) and exceed 50–100M tokens per month, self-hosting starts to make sense.



