Diana: Optimizing LLM-Powered Development Agents Through Token Economics
Version 1.3.2 | March 2026
58% cost reduction achieved: $3.81 → $1.59 per task
Abstract
Diana is an autonomous .NET development agent that integrates with multiple LLM providers (Claude, Kimi, OpenAI) to read, write, build, and test code through a tool-calling loop. Over 9 iterations, we systematically reduced the cost-per-task from $3.81 to $1.59 — a 58% reduction — through token-level optimizations, Roslyn-based code analysis, smart history compression, and escalating model selection. This paper documents the architecture, the optimization journey, what worked, what failed, and the lessons learned.
1. Problem Statement
LLM-powered coding agents are expensive. A typical "create a calculator with API + Blazor UI + SQLite" task costs ~$3.81 in API tokens. The agent spends most tokens on:
- Reading files it just created — the LLM writes a file, then reads it back to verify
- Repeating context — long conversation histories re-sent every turn
- Boilerplate generation — CRUD patterns that are identical across projects
- System prompt overhead — 2,500+ tokens sent on every turn
Our goal: reduce cost without reducing capability. The agent must still produce working, compilable code with attractive UI.
2. Architecture
2.1 Layer Overview
┌──────────────────────────────────────────────────────┐
│ Layer 1: LLM Providers │
│ Claude Sonnet/Opus · Kimi K2 · GPT-4o/o3 │
├──────────────────────────────────────────────────────┤
│ Layer 2: Escalating LLM Client │
│ Doer (cheap) ←→ Analyst (expensive) │
├──────────────────────────────────────────────────────┤
│ Layer 3: Agent Loop │
│ Interactive Mode · Planning Mode (4 phases) │
├──────────────────────────────────────────────────────┤
│ Layer 4: Tool Registry (22 tools) │
│ Code Analysis · File Editing · Build/Test · Git │
│ Shell · Web · Search · Indexing │
├──────────────────────────────────────────────────────┤
│ Layer 5: Security & Storage │
│ PathValidator · SQLite VectorDB · ONNX Embeddings │
└──────────────────────────────────────────────────────┘
2.2 Agent Loop
The core execution model is an agentic tool-calling loop:
User Input
↓
VectorDB Context Search (topK=3, score ≥ 0.3)
↓
┌─── Loop (max 30 iterations) ───┐
│ 1. Build ChatRequest │
│ 2. Send to LLM (with tools) │
│ 3. Parse response │
│ ├─ Has tool calls? │
│ │ ├─ Parallel reads │
│ │ ├─ Sequential writes │
│ │ └─ Continue loop │
│ └─ No tool calls? │
│ └─ Stream final response│
└─────────────────────────────────┘
Key design decisions:
- Parallel execution for read-only tools (analyze_file, read_file, search_code)
- Sequential execution for write tools (write_file, edit_file) — order matters
- Confirmation gates on destructive operations (bypass with --AtomicBlonde)
- System prompt sent once — LLM APIs cache it after the first request
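The parallel-read / sequential-write split can be sketched as follows. This is an illustrative dispatcher, not Diana's actual API; the `ToolCall` record and its members are assumptions made for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical shape of a tool invocation: a name, a read-only flag,
// and a delegate that performs the actual work.
public sealed record ToolCall(string Name, bool IsReadOnly, Func<Task<string>> Run);

public static class ToolDispatcher
{
    public static async Task<List<string>> DispatchAsync(IReadOnlyList<ToolCall> calls)
    {
        var results = new List<string>();

        // Read-only tools (analyze_file, read_file, search_code) are
        // independent, so fan them out in parallel.
        var reads = calls.Where(c => c.IsReadOnly).Select(c => c.Run());
        results.AddRange(await Task.WhenAll(reads));

        // Write tools (write_file, edit_file) must run in request order,
        // because later edits may depend on earlier ones.
        foreach (var call in calls.Where(c => !c.IsReadOnly))
            results.Add(await call.Run());

        return results;
    }
}
```

The key design choice is simply `Task.WhenAll` for reads versus an ordered `foreach` for writes.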
2.3 Escalating LLM Client
A decorator pattern that wraps two models from the same provider:
┌─────────────┐
│ Escalating │
│ LLM Client │
└──────┬──────┘
│
┌───────────┴───────────┐
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ Doer │ │ Analyst │
│ (Sonnet) │ │ (Opus) │
│ $3/M │ │ $15/M │
└─────────────┘ └─────────────┘
Escalation triggers:
- Keyword-based: System prompt contains "architect" → use analyst
- Error-based: 3+ consecutive errors → escalate to analyst
- Auto-deescalation: Success with analyst → return to doer
Result: ~30-40% cost reduction vs using the expensive model for everything.
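The three triggers above can be sketched as a small state machine wrapped around two clients. Method and property names here are assumptions for illustration, not Diana's actual implementation:

```csharp
using System;
using System.Threading.Tasks;

public interface ILLMClient { Task<string> SendAsync(string prompt); }

// Decorator holding a cheap "doer" and an expensive "analyst".
public sealed class EscalatingLLMClient
{
    private readonly ILLMClient _doer;     // e.g. Sonnet, $3/M
    private readonly ILLMClient _analyst;  // e.g. Opus, $15/M
    private readonly int _errorThreshold;
    private int _consecutiveErrors;
    private bool _escalated;

    public EscalatingLLMClient(ILLMClient doer, ILLMClient analyst, int errorThreshold = 3)
        => (_doer, _analyst, _errorThreshold) = (doer, analyst, errorThreshold);

    public async Task<string> SendAsync(string prompt)
    {
        // Keyword trigger: architectural prompts go straight to the analyst.
        var client = _escalated || prompt.Contains("architect") ? _analyst : _doer;
        try
        {
            var reply = await client.SendAsync(prompt);
            _consecutiveErrors = 0;
            _escalated = false;            // auto-deescalation on success
            return reply;
        }
        catch
        {
            if (++_consecutiveErrors >= _errorThreshold)
                _escalated = true;         // error trigger: 3+ consecutive failures
            throw;
        }
    }
}
```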
2.4 Tool Inventory
| Category | Tools | Phase |
|---|---|---|
| Analysis | analyze_file, analyze_project, list_directory | Read-only |
| File Read | read_file, search_code, search_knowledge | Read-only |
| File Write | write_file, edit_file, insert_at_line, find_and_replace, delete_lines, append_to_file | Execution |
| Build/Test | dotnet_build, dotnet_test | Either |
| Code Gen | generate_code | Execution |
| Git | git_status, git_diff, git_commit | Conditional |
| Shell | run_command | Execution |
| Web | web_search, web_fetch | Read-only |
| Index | reindex_project, get_index_stats | Read-only |
Phase-based filtering prevents the LLM from calling write tools during exploration, or exploration tools during execution.
3. The Optimization Journey
3.1 Benchmark Methodology
Task: "Create a calculator using a .NET API to store operations and a Blazor Server app for the UI. Use SQLite. Make the design attractive."
Measurement: Total API cost (input + output tokens × provider pricing) for one complete execution with --AtomicBlonde (no confirmation pauses).
Control: Same task, same provider, same model, same machine. Only the Diana version changes.
3.2 Cost Evolution
| Version | Cost | Δ vs Baseline | Cumulative Spend | Key Change |
|---|---|---|---|---|
| v1.0.7 | $3.81 | baseline | $3.81 | Initial release |
| v1.2.0 | $3.18 | -16.5% | $7.16 | Early optimizations |
| v1.2.1 | $3.44 | -9.7% | $10.60 | Regression (over-compressed) |
| v1.2.2 | $2.50 | -34.4% | $13.10 | Recovered + improved |
| v1.2.3 | $2.42 | -36.5% | $15.52 | Incremental gains |
| v1.3.0 | $3.92 | +2.9% | $19.44 | REGRESSION: aggressive compression |
| v1.3.1 | $2.51 | -34.1% | $21.95 | Fix: type-aware compression |
| v1.3.2 | $1.59 | -58.3% | $23.54 | Roslyn analyze_file (best) |
| v1.4.0 | $3.19 | -16.3% | $26.73 | FAILED: scaffold_crud experiment |
3.3 Visual Cost Curve
$4.00 ┤
│ ●v1.0.7 ●v1.3.0
$3.50 ┤ ●v1.2.1 ╱ ●v1.4.0
│ ╱ ╱
$3.00 ┤ ●v1.2.0 ╱
│ ╲ ╱
$2.50 ┤ ╲ ●v1.2.2 ●v1.2.3 ╱ ●v1.3.1
│ ╲─────────╱ ╱
$2.00 ┤ ╱
│ ╱
$1.50 ┤ ●v1.3.2 ← BEST ($1.59)
│
$1.00 ┤
└──────────────────────────────────────────
1.0 1.2 1.2.1 1.2.2 1.2.3 1.3 1.3.1 1.3.2 1.4
4. What Worked
4.1 Roslyn analyze_file (v1.3.2) — 37% reduction
The single biggest optimization. Instead of reading entire C# files (~800 tokens), we parse them with Roslyn and return a structural summary (~50 tokens):
Before (read_file): ~800 tokens
using System;
using System.Collections.Generic;
using Microsoft.AspNetCore.Mvc;
using Microsoft.EntityFrameworkCore;
// ... 60 more lines of actual code
After (analyze_file): ~50 tokens
// Controllers/UsersController.cs (65 lines)
using: System, Microsoft.AspNetCore.Mvc, Microsoft.EntityFrameworkCore
namespace MyApp.Controllers
[ApiController, Route("api/[controller]")]
public class UsersController : ControllerBase
UsersController(AppDbContext db)
async Task<ActionResult<List<User>>> GetAll()
async Task<ActionResult<User>> GetById(int id)
async Task<ActionResult<User>> Create(UserRequest request)
async Task<IActionResult> Update(int id, UserRequest request)
async Task<IActionResult> Delete(int id)
The LLM gets the same structural understanding at 16x fewer tokens. It only calls read_file when it actually needs to edit a specific file.
System prompt directive:
"Use analyze_file for exploration instead of read_file — much cheaper. Use read_file only when you need exact content for editing."
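A minimal sketch of the Roslyn approach, assuming the Microsoft.CodeAnalysis.CSharp NuGet package; Diana's real tool emits richer output (attributes, property lists), but the core idea fits in one method:

```csharp
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class FileSummarizer
{
    // Parse a C# file and return a compact structural summary (usings,
    // class declarations, method signatures) instead of the full source.
    public static string Summarize(string source, string path)
    {
        var root = CSharpSyntaxTree.ParseText(source).GetCompilationUnitRoot();
        var sb = new System.Text.StringBuilder();
        sb.AppendLine($"// {path}");
        sb.AppendLine("using: " + string.Join(", ",
            root.Usings.Select(u => u.Name!.ToString())));

        foreach (var cls in root.DescendantNodes().OfType<ClassDeclarationSyntax>())
        {
            sb.AppendLine($"class {cls.Identifier}" +
                (cls.BaseList is null ? "" : $" : {cls.BaseList.Types}"));
            // Signatures only — bodies are what make read_file expensive.
            foreach (var m in cls.Members.OfType<MethodDeclarationSyntax>())
                sb.AppendLine($"  {m.ReturnType} {m.Identifier}{m.ParameterList}");
        }
        return sb.ToString();
    }
}
```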
4.2 Smart History Compression (v1.3.1) — Fixed regression
The conversation history grows with every turn. By turn 20, you're re-sending 15,000+ tokens of old tool results. Compression keeps only what the LLM needs:
Type-aware rules (sliding window of 16 recent messages kept intact):
| Tool Type | Compression Strategy | Rationale |
|---|---|---|
| read_file, search_code | Keep first 500 chars | LLM needs to remember structure |
| write_file, edit_file | Replace with [write_file: OK] | LLM already knows what it wrote |
| run_command, dotnet_build | Keep errors only; success → [OK] | Only errors matter |
| assistant messages | Trim to 300 chars | Older reasoning less relevant |
Critical lesson from v1.3.0: we initially compressed read_file results to [OK] as well. The LLM lost track of what it had read and re-read the same files — costing more than not compressing at all ($3.92 actual vs a $1.80 target). Type-awareness fixed this.
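The table above translates to a short dispatch on tool type. This is a sketch of the rules, applied only to messages older than the 16-message sliding window; the method name and message shape are illustrative:

```csharp
public static class HistoryCompressor
{
    // Compress an old tool result according to type-aware rules.
    // Called only for messages outside the 16-message recent window.
    public static string Compress(string toolName, string result, bool isError)
    {
        if (isError) return result;       // errors always survive intact
        return toolName switch
        {
            // Read results are context the LLM must remember — keep a prefix.
            "read_file" or "search_code" =>
                result.Length <= 500 ? result : result[..500] + " …[truncated]",
            // Write confirmations are redundant — the LLM wrote the content.
            "write_file" or "edit_file" => $"[{toolName}: OK]",
            // Successful command output carries no information worth tokens.
            "run_command" or "dotnet_build" => "[OK]",
            _ => result
        };
    }
}
```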
4.3 System Prompt Caching (v1.3.0) — 5-8% reduction
SystemPrompt = iteration == 1 ? systemPrompt : null,
LLM APIs (Claude, OpenAI) cache the system prompt after the first request. Sending it again on every turn wastes ~2,500 tokens per iteration. With 20 turns, that's ~50,000 wasted tokens.
4.4 VectorDB Context Filtering (v1.3.0) — 3-5% reduction
var relevant = results.Where(r => r.Score >= 0.3f).ToList();
Before filtering, the agent would inject 5 code snippets as context — many irrelevant. Score filtering (≥ 0.3) and topK=3 ensures only truly relevant code enters the context.
4.5 Tool Result Truncation (v1.3.0) — 5-10% reduction
const int MaxToolResultLength = 6000;
Build output, test results, and directory listings can be enormous. Truncating to 6,000 chars preserves error information while discarding verbose success output.
4.6 Output Stripping (v1.3.0) — 5-8% reduction
Shell commands produce noise: NuGet restore logs, X.509 certificate warnings, MSBuild telemetry. Stripping these before they enter the conversation history prevents token bloat.
5. What Failed
5.1 scaffold_crud (v1.4.0) — Cost increased from $1.59 to $3.19
Hypothesis: The LLM spends ~8 turns writing boilerplate CRUD (Model, DTOs, DbContext, Controller, Service). A deterministic tool that generates all 6 files in one call should replace those 8 turns with 1.
Implementation: scaffold_crud tool generating:
- Models/{Entity}.cs — Entity with Id + properties
- DTOs/{Entity}Request.cs — DTO without Id
- DTOs/{Entity}Response.cs — DTO with Id
- Data/AppDbContext.cs — DbContext with SQLite
- Controllers/{Entities}Controller.cs — 5 CRUD endpoints
- Services/{Entity}ApiService.cs — HttpClient wrapper
What actually happened (53 turns):
Turn 5: scaffold_crud ← Generated 6 generic files (7ms)
Turn 6-8: read_file ×4 ← LLM reads what scaffold generated
Turn 9-14: write_file ×6 ← LLM REWRITES everything with real logic
Root cause: The scaffold generates generic boilerplate. A calculator needs custom logic: expression parsing, operator handling, result computation. The LLM couldn't use generic CRUD — it had to read everything, understand it, then rewrite it. That's 3x the turns (scaffold + read + rewrite) instead of just writing directly (~8 turns).
Lesson learned: Deterministic code generation only helps when the output is usable as-is. If the LLM has to customize it, you're paying for generation + comprehension + rewriting — worse than just writing from scratch.
5.2 Aggressive History Compression (v1.3.0) — $3.92 regression
Compressing read_file results to [OK] caused the LLM to:
- Forget what it had read
- Re-read the same files
- Create a "read → forget → re-read" loop
Lesson learned: Compression must be type-aware. Read results are context; write results are confirmation. Treat them differently.
6. Architecture Details
6.1 Security: PathValidator
Defense-in-depth with 7 validation layers:
- Null/empty check
- Null byte injection detection
- Path traversal pattern blocking (.., ../, ..\\)
- Obfuscated traversal detection (...., ...)
- Absolute path resolution + base directory verification
- Blocked extensions (.exe, .dll, .msi, .vbs, etc.)
- Symlink/reparse point validation
All file-writing tools call PathValidator.IsPathSafe() before any I/O operation.
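A condensed sketch of the layered checks, under the assumption that the validator resolves paths against a base directory; the symlink/reparse-point layer is omitted here, and the real PathValidator is stricter:

```csharp
using System;
using System.IO;
using System.Linq;

public static class PathValidatorSketch
{
    private static readonly string[] BlockedExtensions =
        { ".exe", ".dll", ".msi", ".vbs" };

    public static bool IsPathSafe(string? path, string baseDir)
    {
        if (string.IsNullOrWhiteSpace(path)) return false;   // null/empty
        if (path.Contains('\0')) return false;               // null byte injection
        if (path.Contains("..")) return false;               // traversal (covers obfuscated forms)

        // Resolve to an absolute path and verify it stays under the base dir.
        var full = Path.GetFullPath(path, Path.GetFullPath(baseDir));
        if (!full.StartsWith(Path.GetFullPath(baseDir),
                StringComparison.OrdinalIgnoreCase)) return false;

        // Reject dangerous file extensions.
        return !BlockedExtensions.Contains(
            Path.GetExtension(full).ToLowerInvariant());
    }
}
```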
6.2 VectorDB: Semantic Code Search
┌─────────────────────────────────┐
│ SQLite VectorDB │
├─────────────────────────────────┤
│ CodeChunk table │
│ FilePath · Content · Vector │
├─────────────────────────────────┤
│ FileHash table (incremental) │
│ FilePath · SHA256 · LastIdx │
├─────────────────────────────────┤
│ Embedding: ONNX MiniLM-L6-v2 │
│ Fallback: Simple hash-based │
└─────────────────────────────────┘
- Incremental indexing: Only re-indexes files whose content hash changed
- Semantic embeddings: ONNX all-MiniLM-L6-v2 (~30MB, downloaded on first use)
- Fallback: Hash-based embedding when ONNX unavailable
- Integration: Search results injected into first user message as context
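Incremental indexing can be sketched as a hash gate in front of the embedder. The store and embedder interfaces below are assumptions drawn from the diagram (CodeChunk and FileHash tables), not Diana's actual data-access API:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

public interface IVectorStore
{
    Task<string?> GetStoredHashAsync(string path);
    Task UpsertChunkAsync(string path, string content, float[] vector);
    Task SaveHashAsync(string path, string hash);
}

public interface IEmbedder { float[] Embed(string text); }  // ONNX MiniLM or hash fallback

public sealed class IncrementalIndexer
{
    private readonly IVectorStore _db;
    private readonly IEmbedder _embedder;

    public IncrementalIndexer(IVectorStore db, IEmbedder embedder)
        => (_db, _embedder) = (db, embedder);

    public async Task IndexFileAsync(string filePath)
    {
        var content = await File.ReadAllTextAsync(filePath);
        var hash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(content)));

        // Skip unchanged files: stored SHA256 matches current content.
        if (await _db.GetStoredHashAsync(filePath) == hash) return;

        await _db.UpsertChunkAsync(filePath, content, _embedder.Embed(content));
        await _db.SaveHashAsync(filePath, hash);
    }
}
```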
6.3 Multi-Provider LLM Support
public class LLMConfig
{
public string Provider { get; set; } // "kimi" | "claude" | "openai"
public string Model { get; set; } // Doer model
public string? AnalysisModel { get; set; } // Analyst model (optional)
public string ApiKey { get; set; }
public int EscalationErrorThreshold { get; set; } // Default: 3
}
| Provider | Doer | Analyst | API Format |
|---|---|---|---|
| Kimi | kimi-k2-turbo-preview | kimi-k2-thinking | OpenAI-compatible |
| Claude | claude-sonnet-4-6 | claude-opus-4-6 | Anthropic native |
| OpenAI | gpt-4o | o3 | OpenAI native |
When AnalysisModel ≠ Model, the LLMFactory wraps both in EscalatingLLMClient.
7. Quantitative Analysis
7.1 Token Distribution (v1.3.2 benchmark task)
| Category | Tokens | % of Total |
|---|---|---|
| System prompt (1x) | ~2,500 | 4.7% |
| User context (VectorDB) | ~300 | 0.6% |
| LLM reasoning (output) | ~8,000 | 15.1% |
| Tool call arguments | ~5,000 | 9.4% |
| Tool results (compressed) | ~12,000 | 22.6% |
| Conversation history (re-sent) | ~25,000 | 47.1% |
| Total | ~53,000 | 100% |
Key insight: 47% of tokens are conversation history re-sent on each turn. History compression targets this largest category.
7.2 Cost per Optimization Technique
| Technique | Token Savings | Cost Impact | Effort |
|---|---|---|---|
| Roslyn analyze_file | ~16x per file read | -37% | Medium (Roslyn integration) |
| History compression | ~60% of old messages | -25% | Low (sliding window) |
| System prompt caching | ~2,500/turn after first | -8% | Trivial (null check) |
| Tool result truncation | ~30% of large outputs | -7% | Low (string truncation) |
| Output stripping | ~500/turn average | -6% | Low (regex filtering) |
| VectorDB filtering | ~2,000 on first turn | -4% | Trivial (score threshold) |
| scaffold_crud | Negative | +100% | High (wasted) |
7.3 Turns per Version
| Version | Turns | Cost | Cost/Turn |
|---|---|---|---|
| v1.0.7 | ~25 | $3.81 | $0.152 |
| v1.3.0 | ~28 | $3.92 | $0.140 |
| v1.3.2 | ~22 | $1.59 | $0.072 |
| v1.4.0 | 53 | $3.19 | $0.060 |
v1.4.0 paradox: Lowest cost-per-turn ($0.060) but highest turn count (53). The scaffold_crud made each turn cheaper but doubled the number of turns.
8. Lessons Learned
8.1 Compression Must Be Type-Aware
Not all tool results are equal. Read results are context the LLM needs to remember. Write confirmations are redundant. Treating them uniformly causes either context loss (over-compression) or token waste (under-compression).
8.2 Deterministic Generation Fails When Customization Is Required
Code scaffolding only saves tokens if the output is usable without modification. Generic CRUD templates require the LLM to read, understand, and rewrite — costing 3x what direct writing costs.
8.3 The Biggest Wins Are Structural, Not Textual
Shortening system prompts (textual) saves 2-3%. Replacing file reads with Roslyn summaries (structural) saves 37%. The highest-leverage optimizations change what information the LLM receives, not how it's formatted.
8.4 Regressions Are Expensive to Detect
Each benchmark run costs $1.50-$4.00. Testing 9 versions cost $26.73 in total API spend. An automated, cheaper benchmark (shorter task, smaller project) would enable faster iteration.
8.5 The Re-Read Loop Is the #1 Cost Killer
When the LLM loses context on what it previously read, it enters a read → forget → re-read cycle that can double or triple costs. Preserving read context in compression is the single most important rule.
9. Future Directions
9.1 Auto-Split Files
When the LLM writes multiple classes/DTOs in a single file, subsequent edits require loading the entire file. A post-write tool that automatically splits multi-class files into individual files would reduce read_file token costs for later edits.
9.2 Smarter Scaffold with Business Logic Injection
Instead of generic CRUD, a scaffold that accepts business logic hints:
scaffold_crud entity=Calculation properties=...
business_logic="Calculate result from Operand1, Operator, Operand2"
This could generate customized code the LLM doesn't need to rewrite.
9.3 Diff-Based File Editing
Instead of read_file → full content → edit_file, a tool that accepts line ranges would reduce the tokens needed for surgical edits.
9.4 Cheaper Benchmark Task
A simpler benchmark (e.g., "add a property to an existing model") would cost ~$0.20 per run, enabling 10x more optimization iterations per dollar.
10. Conclusion
Through 9 iterations and $26.73 in benchmark spend, we reduced Diana's cost-per-task from $3.81 to $1.59 — a 58% reduction. The key insight is that LLM token costs are dominated by what the model reads, not what it writes. The three highest-impact optimizations all target input tokens:
- Roslyn analyze_file — 16x reduction in code exploration tokens
- Type-aware history compression — 60% reduction in re-sent conversation history
- System prompt caching — Eliminate 2,500 tokens per turn after the first
The failed scaffold_crud experiment (v1.4.0, $3.19) demonstrated that reducing output tokens (code generation) is counterproductive if it increases input tokens (reading and understanding generated code).
The cost of an LLM agent is not the code it writes — it's the context it needs to write it.
11. Roadmap: From Assistant to Autonomous Software Factory
11.1 The Vision
The natural evolution of Diana is not a better assistant — it's a factory. Instead of one developer interacting with one agent on one task, the system receives a Statement of Work (SOW) and autonomously produces a complete project with minimal human intervention.
SOW Document (50 pages)
↓
┌───────────────────────────────────────┐
│ OPUS PLANNER (1 expensive call) │
│ Reads SOW → Generates DAG of tasks │
│ Cost: ~$2-5 │
└───────────────┬───────────────────────┘
↓
┌───────────────────────────────────────┐
│ TASK BOARD (Persistent Queue) │
│ │
│ Phase 1: Foundation │
│ Task 1: Solution + shared models │
│ Task 2: Auth system │
│ ──── GATE: build ✓ ──── │
│ │
│ Phase 2: API Core (parallel) │
│ Task 3: CRUD Users ─┐ │
│ Task 4: CRUD Products ├─ parallel│
│ Task 5: CRUD Orders ─┘ │
│ ──── GATE: build + test ✓ ──── │
│ │
│ Phase 3: Business Logic │
│ Task 6: Order processing │
│ Task 7: Inventory rules │
│ ──── GATE: build + test ✓ ──── │
│ │
│ Phase 4: UI (parallel) │
│ Task 8: Login page ─┐ │
│ Task 9: Dashboard ├─ parallel│
│ Task 10: Product catalog ─┘ │
│ ──── GATE: build ✓ ──── │
│ │
│ Phase 5: Integration │
│ Task 11: Connect UI ↔ API │
│ Task 12: Error handling │
│ ──── GATE: build + test + HUMAN ── │
└───────────────┬───────────────────────┘
↓
┌───────────────────────────────────────┐
│ EXECUTOR (Diana CLI, headless) │
│ Processes tasks respecting DAG order │
│ Sonnet for execution, ~$1.59/task │
└───────────────┬───────────────────────┘
↓
Complete Project
~$32 total cost
~2 hours autonomous execution
11.2 Why Not a Flat Queue
A SOW cannot be decomposed into 20 independent tasks. Real projects have dependencies:
- Task 5 (CRUD Orders) needs the User and Product models from Tasks 3-4
- Task 9 (Dashboard UI) needs the API endpoints from Tasks 3-5
- Task 11 (Integration) needs everything above to exist
A flat queue would cause error propagation — if Task 3 generates different model names than expected, Tasks 5-12 build on assumptions that don't match reality. The system needs a DAG (Directed Acyclic Graph) with gates between phases.
11.3 The Critical Component: Context Injection
The hardest unsolved problem is not execution — it's coherence across tasks.
Task 1 creates User.cs with specific property names. Task 5 must know those exact names, not guess them from the SOW. This requires a Context Injector that:
- After each task completes, extracts what was actually generated (file paths, class names, endpoints, models)
- Before each dependent task starts, injects this real context into the prompt
- Maintains a living project manifest that evolves as tasks complete
Task 3 completes → Context Injector extracts:
- Models: User.cs (Id, Email, Name, PasswordHash)
- DTOs: UserRequest.cs, UserResponse.cs
- Endpoints: GET/POST/PUT/DELETE /api/users
- DbContext: AppDbContext with DbSet<User>
Task 5 receives injected context:
"The project already has User (Id, Email, Name, PasswordHash)
and Product (Id, Name, Price, Stock) models.
AppDbContext has DbSet<User> and DbSet<Product>.
Create Order entity with foreign keys to both."
Without this, each task works from the SOW description (what was planned) instead of the codebase reality (what was built). The gap between plan and reality grows with every task.
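One possible shape for the living project manifest, rendered into the injected preamble shown above. None of these types exist in Diana today; this is purely a design sketch:

```csharp
using System.Collections.Generic;
using System.Linq;

// What a completed task actually produced: model names and their
// real property names, plus the endpoints it exposed.
public sealed record ModelInfo(string Name, string[] Properties);

public sealed class ProjectManifest
{
    public List<ModelInfo> Models { get; } = new();
    public List<string> Endpoints { get; } = new();

    // Render "codebase reality" as a preamble for the next dependent task.
    public string ToPromptContext() =>
        "The project already has: " +
        string.Join("; ", Models.Select(m =>
            $"{m.Name} ({string.Join(", ", m.Properties)})")) +
        ". Endpoints: " + string.Join(", ", Endpoints) + ".";
}
```

After each task, the Context Injector would append to this manifest from the actual generated files; before each dependent task, `ToPromptContext()` becomes part of the prompt.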
11.4 Gate System
Gates are checkpoints between phases that prevent error propagation:
| Gate Type | Trigger | Action on Failure |
|---|---|---|
| Build Gate | dotnet build fails | Retry task with error context (max 3) |
| Test Gate | dotnet test has failures | Retry with test output as context |
| Human Gate | End of major phase | Notify human, wait for approval |
| Auto-Repair Gate | Build fails after retry | Escalate to Opus for diagnosis |
The human gate is strategically placed — not after every task (too slow) but after each phase (meaningful checkpoint). A developer reviews the Phase 2 output before Phase 3 begins.
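The build-gate retry policy (max 3 attempts, then escalate) reduces to a small loop. The delegates stand in for Diana's executor, dotnet_build, and Opus escalation; all names here are illustrative:

```csharp
using System;
using System.Threading.Tasks;

public sealed record BuildResult(bool Succeeded, string Output);

public static class BuildGate
{
    public static async Task RunWithBuildGateAsync(
        Func<string?, Task> runTask,        // executes the task, given prior error context
        Func<Task<BuildResult>> build,      // dotnet build
        Func<Task> escalate)                // Auto-Repair Gate: Opus diagnosis
    {
        string? errorContext = null;
        for (var attempt = 1; attempt <= 3; attempt++)
        {
            await runTask(errorContext);
            var result = await build();
            if (result.Succeeded) return;
            errorContext = result.Output;   // retry with build errors as context
        }
        await escalate();
    }
}
```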
11.5 Economics
| Metric | Value |
|---|---|
| Task decomposition (Opus, 1 call) | ~$2-5 |
| Execution (Sonnet, ~20 tasks × $1.59) | ~$32 |
| Gate retries (estimated 3-4 failures) | ~$6 |
| Total project cost | ~$40-43 |
| Execution time | ~2 hours autonomous |
| Human intervention | ~15 min review at gates |
On the Max plan ($200/month), this means ~5 complete projects per month within budget. On API rates, $40/project is still dramatically cheaper than developer time.
11.6 What Already Exists in Diana
| Component | Status | Notes |
|---|---|---|
| Headless mode (diana --auto) | Exists | Can execute single tasks |
| Tool registry (22 tools) | Exists | Build, test, file ops, git |
| Build verification | Exists | dotnet_build after changes |
| EscalatingLLMClient | Exists | Opus for planning, Sonnet for execution |
| History compression | Exists | Keeps context manageable |
| Roslyn analysis | Exists | Cheap code understanding |
11.7 What Needs to Be Built
| Component | Effort | Description |
|---|---|---|
| SOW Parser | Medium | Opus reads SOW, generates phased DAG in JSON |
| Task Board | Low | Persistent queue with states, deps, phase grouping |
| Context Injector | High | Extracts actuals from completed tasks, injects into next |
| Gate System | Medium | Build/test/human-review between phases |
| DAG Orchestrator | High | Executes tasks respecting dependency order, parallelizes within phases |
| Project Manifest | Medium | Living document of what exists (models, endpoints, files) |
The Context Injector and DAG Orchestrator are the two hard problems. Everything else is plumbing.
11.8 The Real Question
The question is not "can the AI work more hours?" — it's "can the AI maintain coherence across 20 chained tasks?"
A single task (the calculator) works because everything fits in one context window. Twenty chained tasks require the system to:
- Remember what it built (not what it planned)
- Adapt to deviations (Task 3 used UserEntity instead of User)
- Recover from failures (Task 7 failed, Tasks 8-20 need replanning)
- Maintain architectural consistency across 2 hours of autonomous execution
This is the frontier — not token optimization, not model selection, but multi-task coherence. Solving it transforms Diana from a coding assistant into an autonomous software factory.
12. The Self-Hosted Alternative: When You Own the Inference
12.1 The Hypothesis
Every optimization in this paper — prompt compression, history sliding windows, Roslyn token reduction — exists because tokens cost money. But what if they didn't? What if the cost of inference was a fixed monthly bill, like electricity?
Self-hosting an LLM inverts the entire optimization equation. Instead of minimizing tokens per task, you maximize context utilization per task. The constraint shifts from cost to latency.
12.2 The Economic Inversion
API pricing model (current):
Cost = Σ (input_tokens × $3/M + output_tokens × $15/M) per request
↓
Every token matters → compress, truncate, minimize
Self-hosted model:
Cost = GPU rental per month (fixed)
↓
Tokens are "free" → maximize context, never compress
| Metric | API (Claude Sonnet) | Self-Hosted (70B model) |
|---|---|---|
| Cost model | Per-token | Fixed monthly |
| Context penalty | $0.003/1K input tokens | ~0 (already paid) |
| Compression needed? | Critical | Unnecessary |
| History window | 16 messages (optimized) | Full conversation |
| System prompt optimization | Essential ($0.15/turn saved) | Irrelevant |
| Roslyn analyze_file | 16x savings ($0.04 vs $0.66) | Same speed, no savings |
12.3 What Changes for Diana
Eliminated complexity:
- Smart history compression → Keep full history (no sliding window)
- Prompt caching strategy → No cache needed (no per-token cost)
- System prompt token counting → Use verbose, detailed prompts
- Tool output truncation → Return full file contents always
New optimization target:
- Latency per turn (GPU inference speed, not token cost)
- Throughput (how many concurrent users/tasks)
- Context window size (limited by model architecture, not budget)
What stays the same:
- Turn count still matters (each turn = inference latency)
- Roslyn analysis still useful (faster parsing than reading full files)
- EscalatingLLMClient pattern (small model for simple tasks, large for planning)
12.4 Hardware Cost Analysis
For a company with ~100 developers, not millions of users:
| Setup | Monthly Cost | Context Window | Tokens/sec | Concurrent Users |
|---|---|---|---|---|
| 4× A100 80GB (cloud) | ~$15,000 | 128K | ~40 t/s | 8-12 |
| 4× H100 80GB (cloud) | ~$25,000 | 128K | ~80 t/s | 15-20 |
| 8× A100 on-prem (amortized) | ~$8,000 | 128K | ~80 t/s | 15-20 |
| Max plan × 100 users | $20,000 | 200K | ~100 t/s | 100 (rate limited) |
The crossover point: Self-hosting becomes cheaper than 100 Max subscriptions when:
- You need >20 heavy concurrent users (Max plan rate limits bite hard)
- You need guaranteed latency (no queuing behind other Max users)
- You need data sovereignty (SOW documents, proprietary code never leave your infrastructure)
12.5 Model Candidates (March 2026)
| Model | Parameters | Context | Code Quality | Self-Hostable |
|---|---|---|---|---|
| DeepSeek Coder V3 | 236B (MoE) | 128K | Excellent | Yes, 4× A100 |
| Llama 4 Maverick | 400B (MoE) | 128K | Very Good | Yes, 4× H100 |
| Qwen 2.5 Coder 72B | 72B | 128K | Very Good | Yes, 2× A100 |
| Mistral Large 2 | 123B | 128K | Good | Yes, 2× H100 |
| CodeLlama 70B | 70B | 100K | Good | Yes, 2× A100 |
The MoE (Mixture of Experts) models are the sweet spot — DeepSeek V3 activates only ~37B parameters per token despite having 236B total, giving near-frontier quality at manageable hardware costs.
12.6 The Fine-Tuning Advantage
Self-hosting unlocks something API providers can't offer: fine-tuning on your codebase.
Base model: "Create a CRUD controller" → generic boilerplate
Fine-tuned model: "Create a CRUD controller" → YOUR patterns, YOUR naming,
YOUR DbContext setup, YOUR error handling
This directly addresses the scaffold_crud failure from Section 5. Our deterministic tool generated generic code that the LLM had to rewrite. A fine-tuned model would generate your team's code patterns natively, eliminating the scaffold → read → rewrite cycle entirely.
Fine-tuning data sources:
- Git history (thousands of real commits with diffs)
- Code review comments (what gets approved vs rejected)
- Diana session logs (successful tool-calling patterns)
- Internal coding standards documents
12.7 How Self-Hosting Solves the Factory Problem
Section 11's autonomous factory faces one core challenge: multi-task coherence — maintaining context across 20 chained tasks. Self-hosting simplifies this dramatically:
| Challenge | API Approach | Self-Hosted Approach |
|---|---|---|
| Context between tasks | Compress, summarize, inject key artifacts | Keep full history in extended context |
| Architecture consistency | Hope the summary captures naming decisions | Model remembers everything (no compression loss) |
| Error recovery | Re-inject partial context from failed task | Full history available, just retry |
| Cross-task references | Context Injector extracts + re-injects actuals | All actuals already in context |
| Cost of replanning | ~$2-5 per Opus call | Fixed cost, replan freely |
The Context Injector — identified as the hardest component to build — becomes nearly trivial. Instead of extracting, compressing, and re-injecting artifacts between tasks, you simply... keep the context open.
12.8 Data Sovereignty
For companies handling client SOWs, contracts, and proprietary business logic:
- API model: Your code, your SOW, your client's business rules all transit through Anthropic's servers
- Self-hosted: Everything stays on your infrastructure, your VPC, your compliance boundary
This isn't hypothetical — it's a hard requirement for many enterprise clients. A self-hosted Diana can process SOWs containing confidential business logic without any data leaving the building.
12.9 The Hybrid Architecture
The optimal setup isn't purely self-hosted or purely API — it's hybrid:
┌──────────────────────────────────────────────────┐
│ Tier 1: Self-Hosted 70B (primary) │
│ • All routine coding tasks │
│ • Full context, no compression │
│ • Fine-tuned on company codebase │
│ • Cost: $0/token (fixed infrastructure) │
├──────────────────────────────────────────────────┤
│ Tier 2: Claude Opus (escalation) │
│ • Architectural planning only │
│ • Complex debugging (>3 consecutive failures) │
│ • SOW decomposition into task DAGs │
│ • Cost: ~$2-5 per escalation │
├──────────────────────────────────────────────────┤
│ Tier 3: Fine-tuned Small Model (8-14B) │
│ • Code completion / autocomplete │
│ • Simple refactors, renames │
│ • Cost: negligible (runs on single GPU) │
└──────────────────────────────────────────────────┘
Diana's EscalatingLLMClient already implements this pattern — swap the doer from Claude Sonnet to a self-hosted 70B, keep Opus as the analyst for the hardest 5% of tasks, and add a small model tier for trivial operations.
12.10 What This Means for Our Optimization Work
| Optimization | API Value | Self-Hosted Value |
|---|---|---|
| Roslyn analyze_file | High (16x token savings) | Medium (still faster than raw reads) |
| History compression | Critical ($0.15/turn saved) | Zero (keep full history) |
| System prompt minimization | High (2,500 → 1,800 tokens) | Zero (use verbose prompts) |
| Turn reduction | High (cost + latency) | Medium (latency only) |
| scaffold_crud | Failed at API rates | May work (no cost to read generated files) |
The irony: half of our optimizations become irrelevant with self-hosting. But the methodology — measuring, benchmarking, iterating — transfers directly. Instead of optimizing for $/task, you optimize for seconds/task and tasks/hour.
12.11 The Bottom Line
Self-hosting doesn't make Diana simpler — it makes it differently complex. You trade token economics for infrastructure operations. But for a company running 100+ developers through an autonomous coding agent, the math is compelling:
| Scenario | Monthly Cost | Constraints |
|---|---|---|
| 100 × Max plan | $20,000 | Rate limits, no fine-tuning, data leaves infra |
| 4× H100 + Opus escalation | ~$26,000 | No rate limits, fine-tunable, full sovereignty |
| 8× A100 on-prem (amortized) | ~$9,000 | Same benefits, lower cost after Year 1 |
The premium for self-hosting is ~30% more in Year 1, but you get: unlimited context, zero compression, fine-tuning, data sovereignty, and no rate limits. By Year 2, on-prem hardware pays for itself.
The real question isn't "API or self-hosted?" — it's "at what scale does owning the inference become cheaper than renting it?" For Diana's target use case (autonomous software factory processing SOWs), that scale is approximately 50-100 concurrent developers.
Appendix A: Full Version History
| Version | Commit | Date | Changes |
|---|---|---|---|
| v1.0.7 | c3fc688 | Feb 27, 2026 | Initial release |
| v1.2.x | — | Feb 2026 | Early optimizations (not in current git) |
| v1.3.0 | d3a30f1 | Mar 3, 2026 | Token optimization: prompt caching, compression, filtering |
| v1.3.1 | 5ac38d6 | Mar 3, 2026 | Fix regression: type-aware compression |
| v1.3.2 | 0cdafe1 | Mar 3, 2026 | Roslyn analyze_file tool |
| v1.4.0 | — | Mar 3, 2026 | scaffold_crud experiment (reverted) |
Appendix B: Platform Distribution
Diana v1.3.2 is distributed for 10 platform targets:
| Platform | RID | Architecture |
|---|---|---|
| Windows | win-x64, win-x86, win-arm64 | x64, x86, ARM64 |
| Linux | linux-x64, linux-arm, linux-arm64 | x64, ARM, ARM64 |
| Linux (Alpine) | linux-musl-x64, linux-musl-arm64 | x64, ARM64 |
| macOS | osx-x64, osx-arm64 | Intel, Apple Silicon |
Framework-dependent deployment (requires .NET 9.0 runtime).
Appendix C: Project Metrics
- Source files: ~78 C# files across 9 projects
- Solution structure: CLI, Core, Interactive, LLM, Tools, VectorDB, Server, Plugins
- Tool count: 22 tools in 8 categories
- Supported LLM providers: 3 (Claude, Kimi, OpenAI)
- Supported languages: English, Spanish
- Total benchmark spend: $26.73 across 9 versions
Diana — .NET Development Agent "The cost of an LLM agent is not the code it writes — it's the context it needs to write it."