Diana: Optimizing LLM-Powered Development Agents Through Token Economics
Version 1.3.2 | March 2026
58% cost reduction achieved: $3.81 → $1.59 per task
Abstract
Diana is an autonomous .NET development agent that integrates with multiple LLM providers (Claude, Kimi, OpenAI) to read, write, build, and test code through a tool-calling loop. Over 9 iterations, we systematically reduced the cost-per-task from $3.81 to $1.59 — a 58% reduction — through token-level optimizations, Roslyn-based code analysis, smart history compression, and escalating model selection. This paper documents the architecture, the optimization journey, what worked, what failed, and the lessons learned.
1. Problem Statement
LLM-powered coding agents are expensive. A typical "create a calculator with API + Blazor UI + SQLite" task costs ~$3.81 in API tokens. The agent spends most tokens on:
- Reading files it just created — the LLM writes a file, then reads it back to verify
- Repeating context — long conversation histories re-sent every turn
- Boilerplate generation — CRUD patterns that are identical across projects
- System prompt overhead — 2,500+ tokens sent on every turn
Our goal: reduce cost without reducing capability. The agent must still produce working, compilable code with attractive UI.
2. Architecture
2.1 Layer Overview
┌──────────────────────────────────────────────────────┐
│ Layer 1: LLM Providers │
│ Claude Sonnet/Opus · Kimi K2 · GPT-4o/o3 │
├──────────────────────────────────────────────────────┤
│ Layer 2: Escalating LLM Client │
│ Doer (cheap) ←→ Analyst (expensive) │
├──────────────────────────────────────────────────────┤
│ Layer 3: Agent Loop │
│ Interactive Mode · Planning Mode (4 phases) │
├──────────────────────────────────────────────────────┤
│ Layer 4: Tool Registry (22 tools) │
│ Code Analysis · File Editing · Build/Test · Git │
│ Shell · Web · Search · Indexing │
├──────────────────────────────────────────────────────┤
│ Layer 5: Security & Storage │
│ PathValidator · SQLite VectorDB · ONNX Embeddings │
└──────────────────────────────────────────────────────┘
2.2 Agent Loop
The core execution model is an agentic tool-calling loop:
User Input
↓
VectorDB Context Search (topK=3, score ≥ 0.3)
↓
┌─── Loop (max 30 iterations) ───┐
│ 1. Build ChatRequest │
│ 2. Send to LLM (with tools) │
│ 3. Parse response │
│ ├─ Has tool calls? │
│ │ ├─ Parallel reads │
│ │ ├─ Sequential writes │
│ │ └─ Continue loop │
│ └─ No tool calls? │
│ └─ Stream final response│
└─────────────────────────────────┘
Key design decisions:
- Parallel execution for read-only tools (analyze_file, read_file, search_code)
- Sequential execution for write tools (write_file, edit_file) — order matters
- Confirmation gates on destructive operations (bypass with --AtomicBlonde)
- System prompt sent once — LLM APIs cache it after the first request
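The parallel-read / sequential-write split can be sketched as follows. This is an illustrative dispatcher, not Diana's actual API; the `ToolCall` record and its members are assumptions made for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical shape of a tool invocation: a name, a read-only flag,
// and a delegate that performs the actual work.
public sealed record ToolCall(string Name, bool IsReadOnly, Func<Task<string>> Run);

public static class ToolDispatcher
{
    public static async Task<List<string>> DispatchAsync(IReadOnlyList<ToolCall> calls)
    {
        var results = new List<string>();

        // Read-only tools (analyze_file, read_file, search_code) are
        // independent, so fan them out in parallel.
        var reads = calls.Where(c => c.IsReadOnly).Select(c => c.Run());
        results.AddRange(await Task.WhenAll(reads));

        // Write tools (write_file, edit_file) must run in request order,
        // because later edits may depend on earlier ones.
        foreach (var call in calls.Where(c => !c.IsReadOnly))
            results.Add(await call.Run());

        return results;
    }
}
```

The key design choice is simply `Task.WhenAll` for reads versus an ordered `foreach` for writes.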
2.3 Escalating LLM Client
A decorator pattern that wraps two models from the same provider:
┌─────────────┐
│ Escalating │
│ LLM Client │
└──────┬──────┘
│
┌───────────┴───────────┐
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ Doer │ │ Analyst │
│ (Sonnet) │ │ (Opus) │
│ $3/M │ │ $15/M │
└─────────────┘ └─────────────┘
Escalation triggers:
- Keyword-based: System prompt contains "architect" → use analyst
- Error-based: 3+ consecutive errors → escalate to analyst
- Auto-deescalation: Success with analyst → return to doer
Result: ~30-40% cost reduction vs using the expensive model for everything.
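The three triggers above can be sketched as a small state machine wrapped around two clients. Method and property names here are assumptions for illustration, not Diana's actual implementation:

```csharp
using System;
using System.Threading.Tasks;

public interface ILLMClient { Task<string> SendAsync(string prompt); }

// Decorator holding a cheap "doer" and an expensive "analyst".
public sealed class EscalatingLLMClient
{
    private readonly ILLMClient _doer;     // e.g. Sonnet, $3/M
    private readonly ILLMClient _analyst;  // e.g. Opus, $15/M
    private readonly int _errorThreshold;
    private int _consecutiveErrors;
    private bool _escalated;

    public EscalatingLLMClient(ILLMClient doer, ILLMClient analyst, int errorThreshold = 3)
        => (_doer, _analyst, _errorThreshold) = (doer, analyst, errorThreshold);

    public async Task<string> SendAsync(string prompt)
    {
        // Keyword trigger: architectural prompts go straight to the analyst.
        var client = _escalated || prompt.Contains("architect") ? _analyst : _doer;
        try
        {
            var reply = await client.SendAsync(prompt);
            _consecutiveErrors = 0;
            _escalated = false;            // auto-deescalation on success
            return reply;
        }
        catch
        {
            if (++_consecutiveErrors >= _errorThreshold)
                _escalated = true;         // error trigger: 3+ consecutive failures
            throw;
        }
    }
}
```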
2.4 Tool Inventory
| Category | Tools | Phase |
|---|---|---|
| Analysis | analyze_file, analyze_project, list_directory | Read-only |
| File Read | read_file, search_code, search_knowledge | Read-only |
| File Write | write_file, edit_file, insert_at_line, find_and_replace, delete_lines, append_to_file | Execution |
| Build/Test | dotnet_build, dotnet_test | Either |
| Code Gen | generate_code | Execution |
| Git | git_status, git_diff, git_commit | Conditional |
| Shell | run_command | Execution |
| Web | web_search, web_fetch | Read-only |
| Index | reindex_project, get_index_stats | Read-only |
Phase-based filtering prevents the LLM from calling write tools during exploration, or exploration tools during execution.
3. The Optimization Journey
3.1 Benchmark Methodology
Task: "Create a calculator using a .NET API to store operations and a Blazor Server app for the UI. Use SQLite. Make the design attractive."
Measurement: Total API cost (input + output tokens × provider pricing) for one complete execution with --AtomicBlonde (no confirmation pauses).
Control: Same task, same provider, same model, same machine. Only the Diana version changes.
3.2 Cost Evolution
| Version | Cost | Δ vs Baseline | Cumulative Spend | Key Change |
|---|---|---|---|---|
| v1.0.7 | $3.81 | baseline | $3.81 | Initial release |
| v1.2.0 | $3.18 | -16.5% | $7.16 | Early optimizations |
| v1.2.1 | $3.44 | -9.7% | $10.60 | Regression (over-compressed) |
| v1.2.2 | $2.50 | -34.4% | $13.10 | Recovered + improved |
| v1.2.3 | $2.42 | -36.5% | $15.52 | Incremental gains |
| v1.3.0 | $3.92 | +2.9% | $19.44 | REGRESSION: aggressive compression |
| v1.3.1 | $2.51 | -34.1% | $21.95 | Fix: type-aware compression |
| v1.3.2 | $1.59 | -58.3% | $23.54 | Roslyn analyze_file (best) |
| v1.4.0 | $3.19 | -16.3% | $26.73 | FAILED: scaffold_crud experiment |
3.3 Visual Cost Curve
$4.00 ┤
│ ●v1.0.7 ●v1.3.0
$3.50 ┤ ●v1.2.1 ╱ ●v1.4.0
│ ╱ ╱
$3.00 ┤ ●v1.2.0 ╱
│ ╲ ╱
$2.50 ┤ ╲ ●v1.2.2 ●v1.2.3 ╱ ●v1.3.1
│ ╲─────────╱ ╱
$2.00 ┤ ╱
│ ╱
$1.50 ┤ ●v1.3.2 ← BEST ($1.59)
│
$1.00 ┤
└──────────────────────────────────────────
1.0 1.2 1.2.1 1.2.2 1.2.3 1.3 1.3.1 1.3.2 1.4
4. What Worked
4.1 Roslyn analyze_file (v1.3.2) — 37% reduction
The single biggest optimization. Instead of reading entire C# files (~800 tokens), we parse them with Roslyn and return a structural summary (~50 tokens):
Before (read_file): ~800 tokens
using System;
using System.Collections.Generic;
using Microsoft.AspNetCore.Mvc;
using Microsoft.EntityFrameworkCore;
// ... 60 more lines of actual code
After (analyze_file): ~50 tokens
// Controllers/UsersController.cs (65 lines)
using: System, Microsoft.AspNetCore.Mvc, Microsoft.EntityFrameworkCore
namespace MyApp.Controllers
[ApiController, Route("api/[controller]")]
public class UsersController : ControllerBase
UsersController(AppDbContext db)
async Task<ActionResult<List<User>>> GetAll()
async Task<ActionResult<User>> GetById(int id)
async Task<ActionResult<User>> Create(UserRequest request)
async Task<IActionResult> Update(int id, UserRequest request)
async Task<IActionResult> Delete(int id)
The LLM gets the same structural understanding at 16x fewer tokens. It only calls read_file when it actually needs to edit a specific file.
System prompt directive:
"Use analyze_file for exploration instead of read_file — much cheaper. Use read_file only when you need exact content for editing."
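A minimal sketch of the Roslyn approach, assuming the Microsoft.CodeAnalysis.CSharp NuGet package; Diana's real tool emits richer output (attributes, property lists), but the core idea fits in one method:

```csharp
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class FileSummarizer
{
    // Parse a C# file and return a compact structural summary (usings,
    // class declarations, method signatures) instead of the full source.
    public static string Summarize(string source, string path)
    {
        var root = CSharpSyntaxTree.ParseText(source).GetCompilationUnitRoot();
        var sb = new System.Text.StringBuilder();
        sb.AppendLine($"// {path}");
        sb.AppendLine("using: " + string.Join(", ",
            root.Usings.Select(u => u.Name!.ToString())));

        foreach (var cls in root.DescendantNodes().OfType<ClassDeclarationSyntax>())
        {
            sb.AppendLine($"class {cls.Identifier}" +
                (cls.BaseList is null ? "" : $" : {cls.BaseList.Types}"));
            // Signatures only — bodies are what make read_file expensive.
            foreach (var m in cls.Members.OfType<MethodDeclarationSyntax>())
                sb.AppendLine($"  {m.ReturnType} {m.Identifier}{m.ParameterList}");
        }
        return sb.ToString();
    }
}
```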
4.2 Smart History Compression (v1.3.1) — Fixed regression
The conversation history grows with every turn. By turn 20, you're re-sending 15,000+ tokens of old tool results. Compression keeps only what the LLM needs:
Type-aware rules (sliding window of 16 recent messages kept intact):
| Tool Type | Compression Strategy | Rationale |
|---|---|---|
| read_file, search_code | Keep first 500 chars | LLM needs to remember structure |
| write_file, edit_file | Replace with [write_file: OK] | LLM already knows what it wrote |
| run_command, dotnet_build | Keep errors only; success → [OK] | Only errors matter |
| assistant messages | Trim to 300 chars | Older reasoning less relevant |
Critical lesson from v1.3.0: we initially compressed read_file results to [OK] as well. The LLM lost track of what it had read and re-read the same files — costing more than not compressing at all ($3.92 actual vs a $1.80 target). Type-awareness fixed this.
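The table above translates to a short dispatch on tool type. This is a sketch of the rules, applied only to messages older than the 16-message sliding window; the method name and message shape are illustrative:

```csharp
public static class HistoryCompressor
{
    // Compress an old tool result according to type-aware rules.
    // Called only for messages outside the 16-message recent window.
    public static string Compress(string toolName, string result, bool isError)
    {
        if (isError) return result;       // errors always survive intact
        return toolName switch
        {
            // Read results are context the LLM must remember — keep a prefix.
            "read_file" or "search_code" =>
                result.Length <= 500 ? result : result[..500] + " …[truncated]",
            // Write confirmations are redundant — the LLM wrote the content.
            "write_file" or "edit_file" => $"[{toolName}: OK]",
            // Successful command output carries no information worth tokens.
            "run_command" or "dotnet_build" => "[OK]",
            _ => result
        };
    }
}
```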
4.3 System Prompt Caching (v1.3.0) — 5-8% reduction
SystemPrompt = iteration == 1 ? systemPrompt : null,
LLM APIs (Claude, OpenAI) cache the system prompt after the first request. Sending it again on every turn wastes ~2,500 tokens per iteration. With 20 turns, that's ~50,000 wasted tokens.
4.4 VectorDB Context Filtering (v1.3.0) — 3-5% reduction
var relevant = results.Where(r => r.Score >= 0.3f).ToList();
Before filtering, the agent would inject 5 code snippets as context — many irrelevant. Score filtering (≥ 0.3) and topK=3 ensures only truly relevant code enters the context.
4.5 Tool Result Truncation (v1.3.0) — 5-10% reduction
const int MaxToolResultLength = 6000;
Build output, test results, and directory listings can be enormous. Truncating to 6,000 chars preserves error information while discarding verbose success output.
4.6 Output Stripping (v1.3.0) — 5-8% reduction
Shell commands produce noise: NuGet restore logs, X.509 certificate warnings, MSBuild telemetry. Stripping these before they enter the conversation history prevents token bloat.
5. What Failed
5.1 scaffold_crud (v1.4.0) — Cost increased from $1.59 to $3.19
Hypothesis: The LLM spends ~8 turns writing boilerplate CRUD (Model, DTOs, DbContext, Controller, Service). A deterministic tool that generates all 6 files in one call should replace those 8 turns with 1.
Implementation: scaffold_crud tool generating:
- Models/{Entity}.cs — Entity with Id + properties
- DTOs/{Entity}Request.cs — DTO without Id
- DTOs/{Entity}Response.cs — DTO with Id
- Data/AppDbContext.cs — DbContext with SQLite
- Controllers/{Entities}Controller.cs — 5 CRUD endpoints
- Services/{Entity}ApiService.cs — HttpClient wrapper
What actually happened (53 turns):
Turn 5: scaffold_crud ← Generated 6 generic files (7ms)
Turn 6-8: read_file ×4 ← LLM reads what scaffold generated
Turn 9-14: write_file ×6 ← LLM REWRITES everything with real logic
Root cause: The scaffold generates generic boilerplate. A calculator needs custom logic: expression parsing, operator handling, result computation. The LLM couldn't use generic CRUD — it had to read everything, understand it, then rewrite it. That's 3x the turns (scaffold + read + rewrite) instead of just writing directly (~8 turns).
Lesson learned: Deterministic code generation only helps when the output is usable as-is. If the LLM has to customize it, you're paying for generation + comprehension + rewriting — worse than just writing from scratch.
5.2 Aggressive History Compression (v1.3.0) — $3.92 regression
Compressing read_file results to [OK] caused the LLM to:
- Forget what it had read
- Re-read the same files
- Create a "read → forget → re-read" loop
Lesson learned: Compression must be type-aware. Read results are context; write results are confirmation. Treat them differently.
6. Architecture Details
6.1 Security: PathValidator
Defense-in-depth with 7 validation layers:
- Null/empty check
- Null byte injection detection
- Path traversal pattern blocking (.., ../, ..\\)
- Obfuscated traversal detection (...., ...)
- Absolute path resolution + base directory verification
- Blocked extensions (.exe, .dll, .msi, .vbs, etc.)
- Symlink/reparse point validation
All file-writing tools call PathValidator.IsPathSafe() before any I/O operation.
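A condensed sketch of the layered checks, under the assumption that the validator resolves paths against a base directory; the symlink/reparse-point layer is omitted here, and the real PathValidator is stricter:

```csharp
using System;
using System.IO;
using System.Linq;

public static class PathValidatorSketch
{
    private static readonly string[] BlockedExtensions =
        { ".exe", ".dll", ".msi", ".vbs" };

    public static bool IsPathSafe(string? path, string baseDir)
    {
        if (string.IsNullOrWhiteSpace(path)) return false;   // null/empty
        if (path.Contains('\0')) return false;               // null byte injection
        if (path.Contains("..")) return false;               // traversal (covers obfuscated forms)

        // Resolve to an absolute path and verify it stays under the base dir.
        var full = Path.GetFullPath(path, Path.GetFullPath(baseDir));
        if (!full.StartsWith(Path.GetFullPath(baseDir),
                StringComparison.OrdinalIgnoreCase)) return false;

        // Reject dangerous file extensions.
        return !BlockedExtensions.Contains(
            Path.GetExtension(full).ToLowerInvariant());
    }
}
```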
6.2 VectorDB: Semantic Code Search
┌─────────────────────────────────┐
│ SQLite VectorDB │
├─────────────────────────────────┤
│ CodeChunk table │
│ FilePath · Content · Vector │
├─────────────────────────────────┤
│ FileHash table (incremental) │
│ FilePath · SHA256 · LastIdx │
├─────────────────────────────────┤
│ Embedding: ONNX MiniLM-L6-v2 │
│ Fallback: Simple hash-based │
└─────────────────────────────────┘
- Incremental indexing: Only re-indexes files whose content hash changed
- Semantic embeddings: ONNX all-MiniLM-L6-v2 (~30MB, downloaded on first use)
- Fallback: Hash-based embedding when ONNX unavailable
- Integration: Search results injected into first user message as context
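Incremental indexing can be sketched as a hash gate in front of the embedder. The store and embedder interfaces below are assumptions drawn from the diagram (CodeChunk and FileHash tables), not Diana's actual data-access API:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

public interface IVectorStore
{
    Task<string?> GetStoredHashAsync(string path);
    Task UpsertChunkAsync(string path, string content, float[] vector);
    Task SaveHashAsync(string path, string hash);
}

public interface IEmbedder { float[] Embed(string text); }  // ONNX MiniLM or hash fallback

public sealed class IncrementalIndexer
{
    private readonly IVectorStore _db;
    private readonly IEmbedder _embedder;

    public IncrementalIndexer(IVectorStore db, IEmbedder embedder)
        => (_db, _embedder) = (db, embedder);

    public async Task IndexFileAsync(string filePath)
    {
        var content = await File.ReadAllTextAsync(filePath);
        var hash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(content)));

        // Skip unchanged files: stored SHA256 matches current content.
        if (await _db.GetStoredHashAsync(filePath) == hash) return;

        await _db.UpsertChunkAsync(filePath, content, _embedder.Embed(content));
        await _db.SaveHashAsync(filePath, hash);
    }
}
```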
6.3 Multi-Provider LLM Support
public class LLMConfig
{
public string Provider { get; set; } // "kimi" | "claude" | "openai"
public string Model { get; set; } // Doer model
public string? AnalysisModel { get; set; } // Analyst model (optional)
public string ApiKey { get; set; }
public int EscalationErrorThreshold { get; set; } // Default: 3
}
| Provider | Doer | Analyst | API Format |
|---|---|---|---|
| Kimi | kimi-k2-turbo-preview | kimi-k2-thinking | OpenAI-compatible |
| Claude | claude-sonnet-4-6 | claude-opus-4-6 | Anthropic native |
| OpenAI | gpt-4o | o3 | OpenAI native |
When AnalysisModel ≠ Model, the LLMFactory wraps both in EscalatingLLMClient.
7. Quantitative Analysis
7.1 Token Distribution (v1.3.2 benchmark task)
| Category | Tokens | % of Total |
|---|---|---|
| System prompt (1x) | ~2,500 | 4.7% |
| User context (VectorDB) | ~300 | 0.6% |
| LLM reasoning (output) | ~8,000 | 15.1% |
| Tool call arguments | ~5,000 | 9.4% |
| Tool results (compressed) | ~12,000 | 22.6% |
| Conversation history (re-sent) | ~25,000 | 47.1% |
| Total | ~53,000 | 100% |
Key insight: 47% of tokens are conversation history re-sent on each turn. History compression targets this largest category.
7.2 Cost per Optimization Technique
| Technique | Token Savings | Cost Impact | Effort |
|---|---|---|---|
| Roslyn analyze_file | ~16x per file read | -37% | Medium (Roslyn integration) |
| History compression | ~60% of old messages | -25% | Low (sliding window) |
| System prompt caching | ~2,500/turn after first | -8% | Trivial (null check) |
| Tool result truncation | ~30% of large outputs | -7% | Low (string truncation) |
| Output stripping | ~500/turn average | -6% | Low (regex filtering) |
| VectorDB filtering | ~2,000 on first turn | -4% | Trivial (score threshold) |
| scaffold_crud | Negative | +100% | High (wasted) |
7.3 Turns per Version
| Version | Turns | Cost | Cost/Turn |
|---|---|---|---|
| v1.0.7 | ~25 | $3.81 | $0.152 |
| v1.3.0 | ~28 | $3.92 | $0.140 |
| v1.3.2 | ~22 | $1.59 | $0.072 |
| v1.4.0 | 53 | $3.19 | $0.060 |
v1.4.0 paradox: Lowest cost-per-turn ($0.060) but highest turn count (53). The scaffold_crud made each turn cheaper but doubled the number of turns.
8. Lessons Learned
8.1 Compression Must Be Type-Aware
Not all tool results are equal. Read results are context the LLM needs to remember. Write confirmations are redundant. Treating them uniformly causes either context loss (over-compression) or token waste (under-compression).
8.2 Deterministic Generation Fails When Customization Is Required
Code scaffolding only saves tokens if the output is usable without modification. Generic CRUD templates require the LLM to read, understand, and rewrite — costing 3x what direct writing costs.
8.3 The Biggest Wins Are Structural, Not Textual
Shortening system prompts (textual) saves 2-3%. Replacing file reads with Roslyn summaries (structural) saves 37%. The highest-leverage optimizations change what information the LLM receives, not how it's formatted.
8.4 Regressions Are Expensive to Detect
Each benchmark run costs $1.50-$4.00. Testing 9 versions cost $26.73 in total API spend. An automated, cheaper benchmark (shorter task, smaller project) would enable faster iteration.
8.5 The Re-Read Loop Is the #1 Cost Killer
When the LLM loses context on what it previously read, it enters a read → forget → re-read cycle that can double or triple costs. Preserving read context in compression is the single most important rule.
9. Future Directions
9.1 Auto-Split Files
When the LLM writes multiple classes/DTOs in a single file, subsequent edits require loading the entire file. A post-write tool that automatically splits multi-class files into individual files would reduce read_file token costs for later edits.
9.2 Smarter Scaffold with Business Logic Injection
Instead of generic CRUD, a scaffold that accepts business logic hints:
scaffold_crud entity=Calculation properties=...
business_logic="Calculate result from Operand1, Operator, Operand2"
This could generate customized code the LLM doesn't need to rewrite.
9.3 Diff-Based File Editing
Instead of read_file → full content → edit_file, a tool that accepts line ranges would reduce the tokens needed for surgical edits.
9.4 Cheaper Benchmark Task
A simpler benchmark (e.g., "add a property to an existing model") would cost ~$0.20 per run, enabling 10x more optimization iterations per dollar.
10. Conclusion
Through 9 iterations and $26.73 in benchmark spend, we reduced Diana's cost-per-task from $3.81 to $1.59 — a 58% reduction. The key insight is that LLM token costs are dominated by what the model reads, not what it writes. The three highest-impact optimizations all target input tokens:
- Roslyn analyze_file — 16x reduction in code exploration tokens
- Type-aware history compression — 60% reduction in re-sent conversation history
- System prompt caching — Eliminate 2,500 tokens per turn after the first
The failed scaffold_crud experiment (v1.4.0, $3.19) demonstrated that reducing output tokens (code generation) is counterproductive if it increases input tokens (reading and understanding generated code).
The cost of an LLM agent is not the code it writes — it's the context it needs to write it.
11. Roadmap: From Assistant to Autonomous Software Factory
11.1 The Vision
The natural evolution of Diana is not a better assistant — it's a factory. Instead of one developer interacting with one agent on one task, the system receives a Statement of Work (SOW) and autonomously produces a complete project with minimal human intervention.
SOW Document (50 pages)
↓
┌───────────────────────────────────────┐
│ OPUS PLANNER (1 expensive call) │
│ Reads SOW → Generates DAG of tasks │
│ Cost: ~$2-5 │
└───────────────┬───────────────────────┘
↓
┌───────────────────────────────────────┐
│ TASK BOARD (Persistent Queue) │
│ │
│ Phase 1: Foundation │
│ Task 1: Solution + shared models │
│ Task 2: Auth system │
│ ──── GATE: build ✓ ──── │
│ │
│ Phase 2: API Core (parallel) │
│ Task 3: CRUD Users ─┐ │
│ Task 4: CRUD Products ├─ parallel│
│ Task 5: CRUD Orders ─┘ │
│ ──── GATE: build + test ✓ ──── │
│ │
│ Phase 3: Business Logic │
│ Task 6: Order processing │
│ Task 7: Inventory rules │
│ ──── GATE: build + test ✓ ──── │
│ │
│ Phase 4: UI (parallel) │
│ Task 8: Login page ─┐ │
│ Task 9: Dashboard ├─ parallel│
│ Task 10: Product catalog ─┘ │
│ ──── GATE: build ✓ ──── │
│ │
│ Phase 5: Integration │
│ Task 11: Connect UI ↔ API │
│ Task 12: Error handling │
│ ──── GATE: build + test + HUMAN ── │
└───────────────┬───────────────────────┘
↓
┌───────────────────────────────────────┐
│ EXECUTOR (Diana CLI, headless) │
│ Processes tasks respecting DAG order │
│ Sonnet for execution, ~$1.59/task │
└───────────────┬───────────────────────┘
↓
Complete Project
~$32 total cost
~2 hours autonomous execution
11.2 Why Not a Flat Queue
A SOW cannot be decomposed into 20 independent tasks. Real projects have dependencies:
- Task 5 (CRUD Orders) needs the User and Product models from Tasks 3-4
- Task 9 (Dashboard UI) needs the API endpoints from Tasks 3-5
- Task 11 (Integration) needs everything above to exist
A flat queue would cause error propagation — if Task 3 generates different model names than expected, Tasks 5-12 build on assumptions that don't match reality. The system needs a DAG (Directed Acyclic Graph) with gates between phases.
11.3 The Critical Component: Context Injection
The hardest unsolved problem is not execution — it's coherence across tasks.
Task 1 creates User.cs with specific property names. Task 5 must know those exact names, not guess them from the SOW. This requires a Context Injector that:
- After each task completes, extracts what was actually generated (file paths, class names, endpoints, models)
- Before each dependent task starts, injects this real context into the prompt
- Maintains a living project manifest that evolves as tasks complete
Task 3 completes → Context Injector extracts:
- Models: User.cs (Id, Email, Name, PasswordHash)
- DTOs: UserRequest.cs, UserResponse.cs
- Endpoints: GET/POST/PUT/DELETE /api/users
- DbContext: AppDbContext with DbSet<User>
Task 5 receives injected context:
"The project already has User (Id, Email, Name, PasswordHash)
and Product (Id, Name, Price, Stock) models.
AppDbContext has DbSet<User> and DbSet<Product>.
Create Order entity with foreign keys to both."
Without this, each task works from the SOW description (what was planned) instead of the codebase reality (what was built). The gap between plan and reality grows with every task.
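One possible shape for the living project manifest, rendered into the injected preamble shown above. None of these types exist in Diana today; this is purely a design sketch:

```csharp
using System.Collections.Generic;
using System.Linq;

// What a completed task actually produced: model names and their
// real property names, plus the endpoints it exposed.
public sealed record ModelInfo(string Name, string[] Properties);

public sealed class ProjectManifest
{
    public List<ModelInfo> Models { get; } = new();
    public List<string> Endpoints { get; } = new();

    // Render "codebase reality" as a preamble for the next dependent task.
    public string ToPromptContext() =>
        "The project already has: " +
        string.Join("; ", Models.Select(m =>
            $"{m.Name} ({string.Join(", ", m.Properties)})")) +
        ". Endpoints: " + string.Join(", ", Endpoints) + ".";
}
```

After each task, the Context Injector would append to this manifest from the actual generated files; before each dependent task, `ToPromptContext()` becomes part of the prompt.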
11.4 Gate System
Gates are checkpoints between phases that prevent error propagation:
| Gate Type | Trigger | Action on Failure |
|---|---|---|
| Build Gate | dotnet build fails | Retry task with error context (max 3) |
| Test Gate | dotnet test has failures | Retry with test output as context |
| Human Gate | End of major phase | Notify human, wait for approval |
| Auto-Repair Gate | Build fails after retry | Escalate to Opus for diagnosis |
The human gate is strategically placed — not after every task (too slow) but after each phase (meaningful checkpoint). A developer reviews the Phase 2 output before Phase 3 begins.
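The build-gate retry policy (max 3 attempts, then escalate) reduces to a small loop. The delegates stand in for Diana's executor, dotnet_build, and Opus escalation; all names here are illustrative:

```csharp
using System;
using System.Threading.Tasks;

public sealed record BuildResult(bool Succeeded, string Output);

public static class BuildGate
{
    public static async Task RunWithBuildGateAsync(
        Func<string?, Task> runTask,        // executes the task, given prior error context
        Func<Task<BuildResult>> build,      // dotnet build
        Func<Task> escalate)                // Auto-Repair Gate: Opus diagnosis
    {
        string? errorContext = null;
        for (var attempt = 1; attempt <= 3; attempt++)
        {
            await runTask(errorContext);
            var result = await build();
            if (result.Succeeded) return;
            errorContext = result.Output;   // retry with build errors as context
        }
        await escalate();
    }
}
```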
11.5 Economics
| Metric | Value |
|---|---|
| Task decomposition (Opus, 1 call) | ~$2-5 |
| Execution (Sonnet, ~20 tasks × $1.59) | ~$32 |
| Gate retries (estimated 3-4 failures) | ~$6 |
| Total project cost | ~$40-43 |
| Execution time | ~2 hours autonomous |
| Human intervention | ~15 min review at gates |
On the Max plan ($200/month), this means ~5 complete projects per month within budget. On API rates, $40/project is still dramatically cheaper than developer time.
11.6 What Already Exists in Diana
| Component | Status | Notes |
|---|---|---|
| Headless mode (diana --auto) | Exists | Can execute single tasks |
| Tool registry (22 tools) | Exists | Build, test, file ops, git |
| Build verification | Exists | dotnet_build after changes |
| EscalatingLLMClient | Exists | Opus for planning, Sonnet for execution |
| History compression | Exists | Keeps context manageable |
| Roslyn analysis | Exists | Cheap code understanding |
11.7 What Needs to Be Built
| Component | Effort | Description |
|---|---|---|
| SOW Parser | Medium | Opus reads SOW, generates phased DAG in JSON |
| Task Board | Low | Persistent queue with states, deps, phase grouping |
| Context Injector | High | Extracts actuals from completed tasks, injects into next |
| Gate System | Medium | Build/test/human-review between phases |
| DAG Orchestrator | High | Executes tasks respecting dependency order, parallelizes within phases |
| Project Manifest | Medium | Living document of what exists (models, endpoints, files) |
The Context Injector and DAG Orchestrator are the two hard problems. Everything else is plumbing.
11.8 The Real Question
The question is not "can the AI work more hours?" — it's "can the AI maintain coherence across 20 chained tasks?"
A single task (the calculator) works because everything fits in one context window. Twenty chained tasks require the system to:
- Remember what it built (not what it planned)
- Adapt to deviations (Task 3 used UserEntity instead of User)
- Recover from failures (Task 7 failed, Tasks 8-20 need replanning)
- Maintain architectural consistency across 2 hours of autonomous execution
This is the frontier — not token optimization, not model selection, but multi-task coherence. Solving it transforms Diana from a coding assistant into an autonomous software factory.
12. The Self-Hosted Alternative: When You Own the Inference
12.1 The Hypothesis
Every optimization in this paper — prompt compression, history sliding windows, Roslyn token reduction — exists because tokens cost money. But what if they didn't? What if the cost of inference was a fixed monthly bill, like electricity?
Self-hosting an LLM inverts the entire optimization equation. Instead of minimizing tokens per task, you maximize context utilization per task. The constraint shifts from cost to latency.
12.2 The Economic Inversion
API pricing model (current):
Cost = Σ (input_tokens × $3/M + output_tokens × $15/M) per request
↓
Every token matters → compress, truncate, minimize
Self-hosted model:
Cost = GPU rental per month (fixed)
↓
Tokens are "free" → maximize context, never compress
| Metric | API (Claude Sonnet) | Self-Hosted (70B model) |
|---|---|---|
| Cost model | Per-token | Fixed monthly |
| Context penalty | $0.003/1K input tokens | ~0 (already paid) |
| Compression needed? | Critical | Unnecessary |
| History window | 16 messages (optimized) | Full conversation |
| System prompt optimization | Essential ($0.15/turn saved) | Irrelevant |
| Roslyn analyze_file | 16x savings ($0.04 vs $0.66) | Same speed, no savings |
12.3 What Changes for Diana
Eliminated complexity:
- Smart history compression → Keep full history (no sliding window)
- Prompt caching strategy → No cache needed (no per-token cost)
- System prompt token counting → Use verbose, detailed prompts
- Tool output truncation → Return full file contents always
New optimization target:
- Latency per turn (GPU inference speed, not token cost)
- Throughput (how many concurrent users/tasks)
- Context window size (limited by model architecture, not budget)
What stays the same:
- Turn count still matters (each turn = inference latency)
- Roslyn analysis still useful (faster parsing than reading full files)
- EscalatingLLMClient pattern (small model for simple tasks, large for planning)
12.4 Hardware Cost Analysis
For a company with ~100 developers, not millions of users:
| Setup | Monthly Cost | Context Window | Tokens/sec | Concurrent Users |
|---|---|---|---|---|
| 4× A100 80GB (cloud) | ~$15,000 | 128K | ~40 t/s | 8-12 |
| 4× H100 80GB (cloud) | ~$25,000 | 128K | ~80 t/s | 15-20 |
| 8× A100 on-prem (amortized) | ~$8,000 | 128K | ~80 t/s | 15-20 |
| Max plan × 100 users | $20,000 | 200K | ~100 t/s | 100 (rate limited) |
The crossover point: Self-hosting becomes cheaper than 100 Max subscriptions when:
- You need >20 heavy concurrent users (Max plan rate limits bite hard)
- You need guaranteed latency (no queuing behind other Max users)
- You need data sovereignty (SOW documents, proprietary code never leave your infrastructure)
12.5 Model Candidates (March 2026)
| Model | Parameters | Context | Code Quality | Self-Hostable |
|---|---|---|---|---|
| DeepSeek Coder V3 | 236B (MoE) | 128K | Excellent | Yes, 4× A100 |
| Llama 4 Maverick | 400B (MoE) | 128K | Very Good | Yes, 4× H100 |
| Qwen 2.5 Coder 72B | 72B | 128K | Very Good | Yes, 2× A100 |
| Mistral Large 2 | 123B | 128K | Good | Yes, 2× H100 |
| CodeLlama 70B | 70B | 100K | Good | Yes, 2× A100 |
The MoE (Mixture of Experts) models are the sweet spot — DeepSeek V3 activates only ~37B parameters per token despite having 236B total, giving near-frontier quality at manageable hardware costs.
12.6 The Fine-Tuning Advantage
Self-hosting unlocks something API providers can't offer: fine-tuning on your codebase.
Base model: "Create a CRUD controller" → generic boilerplate
Fine-tuned model: "Create a CRUD controller" → YOUR patterns, YOUR naming,
YOUR DbContext setup, YOUR error handling
This directly addresses the scaffold_crud failure from Section 5. Our deterministic tool generated generic code that the LLM had to rewrite. A fine-tuned model would generate your team's code patterns natively, eliminating the scaffold → read → rewrite cycle entirely.
Fine-tuning data sources:
- Git history (thousands of real commits with diffs)
- Code review comments (what gets approved vs rejected)
- Diana session logs (successful tool-calling patterns)
- Internal coding standards documents
12.7 How Self-Hosting Solves the Factory Problem
Section 11's autonomous factory faces one core challenge: multi-task coherence — maintaining context across 20 chained tasks. Self-hosting simplifies this dramatically:
| Challenge | API Approach | Self-Hosted Approach |
|---|---|---|
| Context between tasks | Compress, summarize, inject key artifacts | Keep full history in extended context |
| Architecture consistency | Hope the summary captures naming decisions | Model remembers everything (no compression loss) |
| Error recovery | Re-inject partial context from failed task | Full history available, just retry |
| Cross-task references | Context Injector extracts + re-injects actuals | All actuals already in context |
| Cost of replanning | ~$2-5 per Opus call | Fixed cost, replan freely |
The Context Injector — identified as the hardest component to build — becomes nearly trivial. Instead of extracting, compressing, and re-injecting artifacts between tasks, you simply... keep the context open.
12.8 Data Sovereignty
For companies handling client SOWs, contracts, and proprietary business logic:
- API model: Your code, your SOW, your client's business rules all transit through Anthropic's servers
- Self-hosted: Everything stays on your infrastructure, your VPC, your compliance boundary
This isn't hypothetical — it's a hard requirement for many enterprise clients. A self-hosted Diana can process SOWs containing confidential business logic without any data leaving the building.
12.9 The Hybrid Architecture
The optimal setup isn't purely self-hosted or purely API — it's hybrid:
┌──────────────────────────────────────────────────┐
│ Tier 1: Self-Hosted 70B (primary) │
│ • All routine coding tasks │
│ • Full context, no compression │
│ • Fine-tuned on company codebase │
│ • Cost: $0/token (fixed infrastructure) │
├──────────────────────────────────────────────────┤
│ Tier 2: Claude Opus (escalation) │
│ • Architectural planning only │
│ • Complex debugging (>3 consecutive failures) │
│ • SOW decomposition into task DAGs │
│ • Cost: ~$2-5 per escalation │
├──────────────────────────────────────────────────┤
│ Tier 3: Fine-tuned Small Model (8-14B) │
│ • Code completion / autocomplete │
│ • Simple refactors, renames │
│ • Cost: negligible (runs on single GPU) │
└──────────────────────────────────────────────────┘
Diana's EscalatingLLMClient already implements this pattern — swap the doer from Claude Sonnet to a self-hosted 70B, keep Opus as the analyst for the hardest 5% of tasks, and add a small model tier for trivial operations.
12.10 What This Means for Our Optimization Work
| Optimization | API Value | Self-Hosted Value |
|---|---|---|
| Roslyn analyze_file | High (16x token savings) | Medium (still faster than raw reads) |
| History compression | Critical ($0.15/turn saved) | Zero (keep full history) |
| System prompt minimization | High (2,500 → 1,800 tokens) | Zero (use verbose prompts) |
| Turn reduction | High (cost + latency) | Medium (latency only) |
| scaffold_crud | Failed at API rates | May work (no cost to read generated files) |
The irony: half of our optimizations become irrelevant with self-hosting. But the methodology — measuring, benchmarking, iterating — transfers directly. Instead of optimizing for $/task, you optimize for seconds/task and tasks/hour.
12.11 The Bottom Line
Self-hosting doesn't make Diana simpler — it makes it differently complex. You trade token economics for infrastructure operations. But for a company running 100+ developers through an autonomous coding agent, the math is compelling:
| Scenario | Monthly Cost | Constraints |
|---|---|---|
| 100 × Max plan | $20,000 | Rate limits, no fine-tuning, data leaves infra |
| 4× H100 + Opus escalation | ~$26,000 | No rate limits, fine-tunable, full sovereignty |
| 8× A100 on-prem (amortized) | ~$9,000 | Same benefits, lower cost after Year 1 |
The premium for self-hosting is ~30% more in Year 1, but you get: unlimited context, zero compression, fine-tuning, data sovereignty, and no rate limits. By Year 2, on-prem hardware pays for itself.
The real question isn't "API or self-hosted?" — it's "at what scale does owning the inference become cheaper than renting it?" For Diana's target use case (autonomous software factory processing SOWs), that scale is approximately 50-100 concurrent developers.
Appendix A: Full Version History
| Version | Commit | Date | Changes |
|---|---|---|---|
| v1.0.7 | c3fc688 | Feb 27, 2026 | Initial release |
| v1.2.x | — | Feb 2026 | Early optimizations (not in current git) |
| v1.3.0 | d3a30f1 | Mar 3, 2026 | Token optimization: prompt caching, compression, filtering |
| v1.3.1 | 5ac38d6 | Mar 3, 2026 | Fix regression: type-aware compression |
| v1.3.2 | 0cdafe1 | Mar 3, 2026 | Roslyn analyze_file tool |
| v1.4.0 | — | Mar 3, 2026 | scaffold_crud experiment (reverted) |
Appendix B: Platform Distribution
Diana v1.3.2 is distributed for 10 platform targets:
| Platform | RID | Architecture |
|---|---|---|
| Windows | win-x64, win-x86, win-arm64 | x64, x86, ARM64 |
| Linux | linux-x64, linux-arm, linux-arm64 | x64, ARM, ARM64 |
| Linux (Alpine) | linux-musl-x64, linux-musl-arm64 | x64, ARM64 |
| macOS | osx-x64, osx-arm64 | Intel, Apple Silicon |
Framework-dependent deployment (requires .NET 9.0 runtime).
Appendix C: Project Metrics
- Source files: ~78 C# files across 9 projects
- Solution structure: CLI, Core, Interactive, LLM, Tools, VectorDB, Server, Plugins
- Tool count: 22 tools in 8 categories
- Supported LLM providers: 3 (Claude, Kimi, OpenAI)
- Supported languages: English, Spanish
- Total benchmark spend: $26.73 across 9 versions
Diana — .NET Development Agent "The cost of an LLM agent is not the code it writes — it's the context it needs to write it."