Diana: Optimizing LLM-Powered Development Agents Through Token Economics
Tags: whitepaper · roslyn · token-optimization · architecture · llm · cost-reduction


Version 1.3.2 | March 2026 | 58% cost reduction achieved: $3.81 → $1.59 per task


Abstract

Diana is an autonomous .NET development agent that integrates with multiple LLM providers (Claude, Kimi, OpenAI) to read, write, build, and test code through a tool-calling loop. Over 9 iterations, we systematically reduced the cost-per-task from $3.81 to $1.59 — a 58% reduction — through token-level optimizations, Roslyn-based code analysis, smart history compression, and escalating model selection. This paper documents the architecture, the optimization journey, what worked, what failed, and the lessons learned.


1. Problem Statement

LLM-powered coding agents are expensive. A typical "create a calculator with API + Blazor UI + SQLite" task costs ~$3.81 in API tokens. The agent spends most tokens on:

  1. Reading files it just created — the LLM writes a file, then reads it back to verify
  2. Repeating context — long conversation histories re-sent every turn
  3. Boilerplate generation — CRUD patterns that are identical across projects
  4. System prompt overhead — 2,500+ tokens sent on every turn

Our goal: reduce cost without reducing capability. The agent must still produce working, compilable code with attractive UI.


2. Architecture

2.1 Layer Overview

┌──────────────────────────────────────────────────────┐
│  Layer 1: LLM Providers                              │
│  Claude Sonnet/Opus · Kimi K2 · GPT-4o/o3           │
├──────────────────────────────────────────────────────┤
│  Layer 2: Escalating LLM Client                      │
│  Doer (cheap) ←→ Analyst (expensive)                 │
├──────────────────────────────────────────────────────┤
│  Layer 3: Agent Loop                                 │
│  Interactive Mode · Planning Mode (4 phases)         │
├──────────────────────────────────────────────────────┤
│  Layer 4: Tool Registry (22 tools)                   │
│  Code Analysis · File Editing · Build/Test · Git     │
│  Shell · Web · Search · Indexing                     │
├──────────────────────────────────────────────────────┤
│  Layer 5: Security & Storage                         │
│  PathValidator · SQLite VectorDB · ONNX Embeddings   │
└──────────────────────────────────────────────────────┘

2.2 Agent Loop

The core execution model is an agentic tool-calling loop:

User Input
    ↓
VectorDB Context Search (topK=3, score ≥ 0.3)
    ↓
┌─── Loop (max 30 iterations) ───┐
│  1. Build ChatRequest           │
│  2. Send to LLM (with tools)   │
│  3. Parse response              │
│     ├─ Has tool calls?          │
│     │   ├─ Parallel reads       │
│     │   ├─ Sequential writes    │
│     │   └─ Continue loop        │
│     └─ No tool calls?           │
│         └─ Stream final response│
└─────────────────────────────────┘

Key design decisions:

  1. Bounded loop: a hard cap of 30 iterations prevents runaway sessions
  2. Parallel reads, sequential writes: read-only tool calls run concurrently, writes run in order
  3. Context pre-search: the VectorDB lookup (topK=3, score ≥ 0.3) runs before the first LLM call
  4. Streaming finish: the final, tool-free response is streamed to the user

2.3 Escalating LLM Client

A decorator pattern that wraps two models from the same provider:

                ┌─────────────┐
                │ Escalating  │
                │ LLM Client  │
                └──────┬──────┘
                       │
           ┌───────────┴───────────┐
           │                       │
    ┌──────┴──────┐        ┌──────┴──────┐
    │    Doer     │        │   Analyst   │
    │  (Sonnet)   │        │   (Opus)    │
    │   $3/M      │        │   $15/M     │
    └─────────────┘        └─────────────┘

Escalation triggers:

  1. Keyword-based: System prompt contains "architect" → use analyst
  2. Error-based: 3+ consecutive errors → escalate to analyst
  3. Auto-deescalation: Success with analyst → return to doer

Result: ~30-40% cost reduction vs using the expensive model for everything.
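The three triggers can be sketched as a small state machine. This is an illustrative sketch, not Diana's actual code: `EscalationPolicy` and its members are assumed names.

```csharp
using System;

// Hypothetical sketch of the escalation rules above; names are illustrative.
public class EscalationPolicy
{
    private readonly int _threshold;
    private int _consecutiveErrors;

    public EscalationPolicy(int errorThreshold = 3) => _threshold = errorThreshold;

    public bool UsingAnalyst { get; private set; }

    // Trigger 1, keyword-based: "architect" prompts go straight to the analyst.
    public bool SelectAnalyst(string systemPrompt) =>
        UsingAnalyst ||
        systemPrompt.Contains("architect", StringComparison.OrdinalIgnoreCase);

    // Trigger 2, error-based: escalate after N consecutive failures.
    public void RecordError()
    {
        if (++_consecutiveErrors >= _threshold) UsingAnalyst = true;
    }

    // Trigger 3, auto-deescalation: a success returns control to the doer.
    public void RecordSuccess()
    {
        _consecutiveErrors = 0;
        UsingAnalyst = false;
    }
}
```

The doer stays in control until the policy flips; the analyst handles exactly one stretch of trouble and then hands back.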

2.4 Tool Inventory

Category     Tools                                                          Phase
──────────   ────────────────────────────────────────────────────────────   ───────────
Analysis     analyze_file, analyze_project, list_directory                  Read-only
File Read    read_file, search_code, search_knowledge                       Read-only
File Write   write_file, edit_file, insert_at_line, find_and_replace,       Execution
             delete_lines, append_to_file
Build/Test   dotnet_build, dotnet_test                                      Either
Code Gen     generate_code                                                  Execution
Git          git_status, git_diff, git_commit                               Conditional
Shell        run_command                                                    Execution
Web          web_search, web_fetch                                          Read-only
Index        reindex_project, get_index_stats                               Read-only

Phase-based filtering prevents the LLM from calling write tools during exploration, or exploration tools during execution.
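The filtering described above reduces to an allow-list check per phase. The groupings below are illustrative; the real registry presumably tags each tool with its phase from the table.

```csharp
using System.Collections.Generic;

// Sketch of phase-based tool filtering; tool groupings mirror the table above
// but the types and method names are assumptions, not Diana's actual API.
public enum Phase { Exploration, Execution }

public static class ToolFilter
{
    private static readonly HashSet<string> WriteTools = new()
        { "write_file", "edit_file", "insert_at_line", "find_and_replace",
          "delete_lines", "append_to_file", "generate_code", "run_command" };

    private static readonly HashSet<string> ExplorationOnly = new()
        { "analyze_project", "list_directory", "web_search", "web_fetch" };

    public static bool IsAllowed(string tool, Phase phase) => phase switch
    {
        Phase.Exploration => !WriteTools.Contains(tool),   // no writes while exploring
        Phase.Execution   => !ExplorationOnly.Contains(tool), // no browsing while building
        _ => true
    };
}
```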


3. The Optimization Journey

3.1 Benchmark Methodology

Task: "Create a calculator using a .NET API to store operations and a Blazor Server app for the UI. Use SQLite. Make the design attractive."

Measurement: Total API cost (input + output tokens × provider pricing) for one complete execution with --AtomicBlonde (no confirmation pauses).

Control: Same task, same provider, same model, same machine. Only the Diana version changes.
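The cost figure itself is simple arithmetic over token counts. A minimal sketch, assuming Sonnet-class pricing ($3/M input, $15/M output); the helper name and signature are illustrative.

```csharp
// Benchmark cost = input tokens + output tokens, each at provider pricing.
public static class BenchmarkCost
{
    public static decimal CostUsd(long inputTokens, long outputTokens,
                                  decimal inputPerMillion = 3m,
                                  decimal outputPerMillion = 15m) =>
        inputTokens * inputPerMillion / 1_000_000m +
        outputTokens * outputPerMillion / 1_000_000m;
}
```

For example, CostUsd(45_000, 8_000) is $0.255: roughly what a few turns of the benchmark task cost at these rates.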

3.2 Cost Evolution

Version   Cost     Δ vs Baseline   Cumulative Spend   Key Change
───────   ──────   ─────────────   ────────────────   ──────────────────────────
v1.0.7    $3.81    baseline        $3.81              Initial release
v1.2.0    $3.18    -16.5%          $7.16              Early optimizations
v1.2.1    $3.44    -9.7%           $10.60             Regression (over-compressed)
v1.2.2    $2.50    -34.4%          $13.10             Recovered + improved
v1.2.3    $2.42    -36.5%          $15.52             Incremental gains
v1.3.0    $3.92    +2.9%           $19.44             REGRESSION: aggressive compression
v1.3.1    $2.51    -34.1%          $21.95             Fix: type-aware compression
v1.3.2    $1.59    -58.3%          $23.54             Roslyn analyze_file (best)
v1.4.0    $3.19    -16.3%          $26.73             FAILED: scaffold_crud experiment

3.3 Visual Cost Curve

$4.00 ┤
      │  ●v1.0.7                        ●v1.3.0
$3.50 ┤       ●v1.2.1                  ╱        ●v1.4.0
      │      ╱                         ╱
$3.00 ┤  ●v1.2.0                      ╱
      │     ╲                         ╱
$2.50 ┤      ╲  ●v1.2.2  ●v1.2.3    ╱   ●v1.3.1
      │       ╲─────────╱          ╱
$2.00 ┤                           ╱
      │                          ╱
$1.50 ┤                    ●v1.3.2 ← BEST ($1.59)
      │
$1.00 ┤
      └──────────────────────────────────────────
        1.0  1.2  1.2.1 1.2.2 1.2.3 1.3  1.3.1 1.3.2 1.4

4. What Worked

4.1 Roslyn analyze_file (v1.3.2) — 37% reduction

The single biggest optimization. Instead of reading entire C# files (~800 tokens), we parse them with Roslyn and return a structural summary (~50 tokens):

Before (read_file): ~800 tokens

using System;
using System.Collections.Generic;
using Microsoft.AspNetCore.Mvc;
using Microsoft.EntityFrameworkCore;
// ... 60 more lines of actual code

After (analyze_file): ~50 tokens

// Controllers/UsersController.cs (65 lines)
using: System, Microsoft.AspNetCore.Mvc, Microsoft.EntityFrameworkCore
namespace MyApp.Controllers
  [ApiController, Route("api/[controller]")]
public class UsersController : ControllerBase
  UsersController(AppDbContext db)
  async Task<ActionResult<List<User>>> GetAll()
  async Task<ActionResult<User>> GetById(int id)
  async Task<ActionResult<User>> Create(UserRequest request)
  async Task<IActionResult> Update(int id, UserRequest request)
  async Task<IActionResult> Delete(int id)

The LLM gets the same structural understanding at 16x fewer tokens. It only calls read_file when it actually needs to edit a specific file.

System prompt directive:

"Use analyze_file for exploration instead of read_file — much cheaper. Use read_file only when you need exact content for editing."

4.2 Smart History Compression (v1.3.1) — Fixed regression

The conversation history grows with every turn. By turn 20, you're re-sending 15,000+ tokens of old tool results. Compression keeps only what the LLM needs:

Type-aware rules (sliding window of 16 recent messages kept intact):

Tool Type                   Compression Strategy          Rationale
─────────────────────────   ───────────────────────────   ───────────────────────────────
read_file, search_code      Keep first 500 chars          LLM needs to remember structure
write_file, edit_file       [write_file: OK]              LLM already knows what it wrote
run_command, dotnet_build   Errors only, success → [OK]   Only errors matter
assistant messages          Trim to 300 chars             Older reasoning less relevant

Critical lesson from v1.3.0: We initially compressed read_file to [OK] too. The LLM lost context on what it had read and re-read the same files, costing more than not compressing at all: $3.92 actual against a $1.80 target. Type-aware compression fixed this.
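The type-aware rules in the table reduce to a per-tool dispatch. An illustrative implementation, with assumed names and a naive "error" substring heuristic (a real version would inspect exit codes):

```csharp
using System;

// Sketch of type-aware history compression; thresholds match the text,
// names and the error heuristic are assumptions.
public static class HistoryCompressor
{
    public static string Compress(string tool, string result) => tool switch
    {
        // Read results are context: keep a prefix so the LLM remembers
        // structure without re-paying for the full payload.
        "read_file" or "search_code" =>
            result.Length <= 500 ? result : result[..500] + " …[truncated]",

        // Write results are confirmations: the LLM already knows what it wrote.
        "write_file" or "edit_file" => $"[{tool}: OK]",

        // Command output only matters when it failed.
        "run_command" or "dotnet_build" =>
            result.Contains("error", StringComparison.OrdinalIgnoreCase)
                ? result : "[OK]",

        _ => result
    };
}
```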

4.3 System Prompt Caching (v1.3.0) — 5-8% reduction

SystemPrompt = iteration == 1 ? systemPrompt : null,

LLM APIs (Claude, OpenAI) cache the system prompt after the first request. Sending it again on every turn wastes ~2,500 tokens per iteration. With 20 turns, that's ~50,000 wasted tokens.

4.4 VectorDB Context Filtering (v1.3.0) — 3-5% reduction

var relevant = results.Where(r => r.Score >= 0.3f).ToList();

Before filtering, the agent would inject 5 code snippets as context — many irrelevant. Score filtering (≥ 0.3) and topK=3 ensures only truly relevant code enters the context.

4.5 Tool Result Truncation (v1.3.0) — 5-10% reduction

const int MaxToolResultLength = 6000;

Build output, test results, and directory listings can be enormous. Truncating to 6,000 chars preserves error information while discarding verbose success output.

4.6 Output Stripping (v1.3.0) — 5-8% reduction

Shell commands produce noise: NuGet restore logs, X.509 certificate warnings, MSBuild telemetry. Stripping these before they enter the conversation history prevents token bloat.


5. What Failed

5.1 scaffold_crud (v1.4.0) — Cost increased from $1.59 to $3.19

Hypothesis: The LLM spends ~8 turns writing boilerplate CRUD (Model, DTOs, DbContext, Controller, Service). A deterministic tool that generates all 6 files in one call should replace those 8 turns with 1.

Implementation: scaffold_crud tool generating:

  1. Models/{Entity}.cs — Entity with Id + properties
  2. DTOs/{Entity}Request.cs — DTO without Id
  3. DTOs/{Entity}Response.cs — DTO with Id
  4. Data/AppDbContext.cs — DbContext with SQLite
  5. Controllers/{Entities}Controller.cs — 5 CRUD endpoints
  6. Services/{Entity}ApiService.cs — HttpClient wrapper

What actually happened (53 turns):

Turn 5:   scaffold_crud             ← Generated 6 generic files (7ms)
Turn 6-8: read_file ×4             ← LLM reads what scaffold generated
Turn 9-14: write_file ×6           ← LLM REWRITES everything with real logic

Root cause: The scaffold generates generic boilerplate. A calculator needs custom logic: expression parsing, operator handling, result computation. The LLM couldn't use generic CRUD — it had to read everything, understand it, then rewrite it. That's 3x the turns (scaffold + read + rewrite) instead of just writing directly (~8 turns).

Lesson learned: Deterministic code generation only helps when the output is usable as-is. If the LLM has to customize it, you're paying for generation + comprehension + rewriting — worse than just writing from scratch.

5.2 Aggressive History Compression (v1.3.0) — $3.92 regression

Compressing read_file results to [OK] caused the LLM to:

  1. Forget what it had read
  2. Re-read the same files
  3. Create a "read → forget → re-read" loop

Lesson learned: Compression must be type-aware. Read results are context; write results are confirmation. Treat them differently.


6. Architecture Details

6.1 Security: PathValidator

Defense-in-depth with 7 validation layers:

  1. Null/empty check
  2. Null byte injection detection
  3. Path traversal pattern blocking (.., ../, ..\\)
  4. Obfuscated traversal detection (...., ...)
  5. Absolute path resolution + base directory verification
  6. Blocked extensions (.exe, .dll, .msi, .vbs, etc.)
  7. Symlink/reparse point validation

All file-writing tools call PathValidator.IsPathSafe() before any I/O operation.
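A condensed sketch of a few of these layers follows; signatures and the extension list are taken from the text, but the class is illustrative and covers only layers 1-6 (the real PathValidator also handles obfuscated traversal and symlinks).

```csharp
using System;
using System.IO;

// Partial, illustrative sketch of the layered path checks.
public static class PathValidatorSketch
{
    private static readonly string[] BlockedExtensions =
        { ".exe", ".dll", ".msi", ".vbs" };

    public static bool IsPathSafe(string path, string baseDir)
    {
        if (string.IsNullOrWhiteSpace(path)) return false;   // layer 1: null/empty
        if (path.Contains('\0')) return false;               // layer 2: null byte
        if (path.Contains("..")) return false;               // layers 3-4: traversal

        // Layer 5: resolve to an absolute path and verify it stays under baseDir.
        var root = Path.GetFullPath(baseDir)
                       .TrimEnd(Path.DirectorySeparatorChar) + Path.DirectorySeparatorChar;
        var full = Path.GetFullPath(Path.Combine(baseDir, path));
        if (!full.StartsWith(root, StringComparison.Ordinal)) return false;

        // Layer 6: blocked extensions.
        var ext = Path.GetExtension(full).ToLowerInvariant();
        return Array.IndexOf(BlockedExtensions, ext) < 0;
    }
}
```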

6.2 Storage: SQLite VectorDB

┌─────────────────────────────────┐
│         SQLite VectorDB         │
├─────────────────────────────────┤
│  CodeChunk table                │
│    FilePath · Content · Vector  │
├─────────────────────────────────┤
│  FileHash table (incremental)   │
│    FilePath · SHA256 · LastIdx  │
├─────────────────────────────────┤
│  Embedding: ONNX MiniLM-L6-v2  │
│  Fallback:  Simple hash-based   │
└─────────────────────────────────┘

6.3 Multi-Provider LLM Support

public class LLMConfig
{
    public string Provider { get; set; }      // "kimi" | "claude" | "openai"
    public string Model { get; set; }          // Doer model
    public string? AnalysisModel { get; set; } // Analyst model (optional)
    public string ApiKey { get; set; }
    public int EscalationErrorThreshold { get; set; } // Default: 3
}

Provider   Doer                    Analyst            API Format
────────   ─────────────────────   ────────────────   ─────────────────
Kimi       kimi-k2-turbo-preview   kimi-k2-thinking   OpenAI-compatible
Claude     claude-sonnet-4-6       claude-opus-4-6    Anthropic native
OpenAI     gpt-4o                  o3                 OpenAI native

When AnalysisModel ≠ Model, the LLMFactory wraps both in EscalatingLLMClient.


7. Quantitative Analysis

7.1 Token Distribution (v1.3.2 benchmark task)

Category                         Tokens    % of Total
──────────────────────────────   ───────   ──────────
System prompt (1x)               ~2,500    4.7%
User context (VectorDB)          ~300      0.6%
LLM reasoning (output)           ~8,000    15.1%
Tool call arguments              ~5,000    9.4%
Tool results (compressed)        ~12,000   22.6%
Conversation history (re-sent)   ~25,000   47.1%
Total                            ~53,000   100%

Key insight: 47% of tokens are conversation history re-sent on each turn. History compression targets this largest category.

7.2 Cost per Optimization Technique

Technique                Token Savings             Cost Impact   Effort
──────────────────────   ───────────────────────   ───────────   ───────────────────────────
Roslyn analyze_file      ~16x per file read        -37%          Medium (Roslyn integration)
History compression      ~60% of old messages      -25%          Low (sliding window)
System prompt caching    ~2,500/turn after first   -8%           Trivial (null check)
Tool result truncation   ~30% of large outputs     -7%           Low (string truncation)
Output stripping         ~500/turn average         -6%           Low (regex filtering)
VectorDB filtering       ~2,000 on first turn      -4%           Trivial (score threshold)
scaffold_crud            Negative                  +100%         High (wasted)

7.3 Turns per Version

Version   Turns   Cost    Cost/Turn
───────   ─────   ─────   ─────────
v1.0.7    ~25     $3.81   $0.152
v1.3.0    ~28     $3.92   $0.140
v1.3.2    ~22     $1.59   $0.072
v1.4.0    53      $3.19   $0.060

v1.4.0 paradox: Lowest cost-per-turn ($0.060) but highest turn count (53). scaffold_crud made each turn cheaper but more than doubled the number of turns.


8. Lessons Learned

8.1 Compression Must Be Type-Aware

Not all tool results are equal. Read results are context the LLM needs to remember. Write confirmations are redundant. Treating them uniformly causes either context loss (over-compression) or token waste (under-compression).

8.2 Deterministic Generation Fails When Customization Is Required

Code scaffolding only saves tokens if the output is usable without modification. Generic CRUD templates require the LLM to read, understand, and rewrite — costing 3x what direct writing costs.

8.3 The Biggest Wins Are Structural, Not Textual

Shortening system prompts (textual) saves 2-3%. Replacing file reads with Roslyn summaries (structural) saves 37%. The highest-leverage optimizations change what information the LLM receives, not how it's formatted.

8.4 Regressions Are Expensive to Detect

Each benchmark run costs $1.50-$4.00. Testing 9 versions cost $26.73 in total API spend. An automated, cheaper benchmark (shorter task, smaller project) would enable faster iteration.

8.5 The Re-Read Loop Is the #1 Cost Killer

When the LLM loses context on what it previously read, it enters a read → forget → re-read cycle that can double or triple costs. Preserving read context in compression is the single most important rule.


9. Future Directions

9.1 Auto-Split Files

When the LLM writes multiple classes/DTOs in a single file, subsequent edits require loading the entire file. A post-write tool that automatically splits multi-class files into individual files would reduce read_file token costs for later edits.

9.2 Smarter Scaffold with Business Logic Injection

Instead of generic CRUD, a scaffold that accepts business logic hints:

scaffold_crud entity=Calculation properties=...
  business_logic="Calculate result from Operand1, Operator, Operand2"

This could generate customized code the LLM doesn't need to rewrite.

9.3 Diff-Based File Editing

Instead of read_file → full content → edit_file, a tool that accepts line ranges would reduce the tokens needed for surgical edits.

9.4 Cheaper Benchmark Task

A simpler benchmark (e.g., "add a property to an existing model") would cost ~$0.20 per run, enabling 10x more optimization iterations per dollar.


10. Conclusion

Through 9 iterations and $26.73 in benchmark spend, we reduced Diana's cost-per-task from $3.81 to $1.59 — a 58% reduction. The key insight is that LLM token costs are dominated by what the model reads, not what it writes. The three highest-impact optimizations all target input tokens:

  1. Roslyn analyze_file — 16x reduction in code exploration tokens
  2. Type-aware history compression — 60% reduction in re-sent conversation history
  3. System prompt caching — Eliminate 2,500 tokens per turn after the first

The failed scaffold_crud experiment (v1.4.0, $3.19) demonstrated that reducing output tokens (code generation) is counterproductive if it increases input tokens (reading and understanding generated code).

The cost of an LLM agent is not the code it writes — it's the context it needs to write it.


11. Roadmap: From Assistant to Autonomous Software Factory

11.1 The Vision

The natural evolution of Diana is not a better assistant — it's a factory. Instead of one developer interacting with one agent on one task, the system receives a Statement of Work (SOW) and autonomously produces a complete project with minimal human intervention.

SOW Document (50 pages)
        ↓
┌───────────────────────────────────────┐
│  OPUS PLANNER (1 expensive call)      │
│  Reads SOW → Generates DAG of tasks   │
│  Cost: ~$2-5                          │
└───────────────┬───────────────────────┘
                ↓
┌───────────────────────────────────────┐
│  TASK BOARD (Persistent Queue)        │
│                                       │
│  Phase 1: Foundation                  │
│    Task 1: Solution + shared models   │
│    Task 2: Auth system                │
│    ──── GATE: build ✓ ────            │
│                                       │
│  Phase 2: API Core (parallel)         │
│    Task 3: CRUD Users      ─┐        │
│    Task 4: CRUD Products    ├─ parallel│
│    Task 5: CRUD Orders     ─┘        │
│    ──── GATE: build + test ✓ ────     │
│                                       │
│  Phase 3: Business Logic              │
│    Task 6: Order processing           │
│    Task 7: Inventory rules            │
│    ──── GATE: build + test ✓ ────     │
│                                       │
│  Phase 4: UI (parallel)               │
│    Task 8: Login page      ─┐        │
│    Task 9: Dashboard        ├─ parallel│
│    Task 10: Product catalog ─┘        │
│    ──── GATE: build ✓ ────            │
│                                       │
│  Phase 5: Integration                 │
│    Task 11: Connect UI ↔ API          │
│    Task 12: Error handling            │
│    ──── GATE: build + test + HUMAN ── │
└───────────────┬───────────────────────┘
                ↓
┌───────────────────────────────────────┐
│  EXECUTOR (Diana CLI, headless)       │
│  Processes tasks respecting DAG order │
│  Sonnet for execution, ~$1.59/task    │
└───────────────┬───────────────────────┘
                ↓
        Complete Project
        ~$32 total cost
        ~2 hours autonomous execution

11.2 Why Not a Flat Queue

A SOW cannot be decomposed into 20 independent tasks. Real projects have dependencies: shared models must exist before the CRUD endpoints that use them, the API must exist before the UI that calls it, and integration work comes last.

A flat queue would cause error propagation — if Task 3 generates different model names than expected, Tasks 5-12 build on assumptions that don't match reality. The system needs a DAG (Directed Acyclic Graph) with gates between phases.

11.3 The Critical Component: Context Injection

The hardest unsolved problem is not execution — it's coherence across tasks.

Task 1 creates User.cs with specific property names. Task 5 must know those exact names, not guess them from the SOW. This requires a Context Injector that:

  1. After each task completes, extracts what was actually generated (file paths, class names, endpoints, models)
  2. Before each dependent task starts, injects this real context into the prompt
  3. Maintains a living project manifest that evolves as tasks complete

Task 3 completes → Context Injector extracts:
  - Models: User.cs (Id, Email, Name, PasswordHash)
  - DTOs: UserRequest.cs, UserResponse.cs
  - Endpoints: GET/POST/PUT/DELETE /api/users
  - DbContext: AppDbContext with DbSet<User>

Task 5 receives injected context:
  "The project already has User (Id, Email, Name, PasswordHash)
   and Product (Id, Name, Price, Stock) models.
   AppDbContext has DbSet<User> and DbSet<Product>.
   Create Order entity with foreign keys to both."

Without this, each task works from the SOW description (what was planned) instead of the codebase reality (what was built). The gap between plan and reality grows with every task.
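The manifest half of this component is straightforward; the hard part is extraction. A minimal sketch of the living manifest, with hypothetical names throughout:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the "living project manifest": the orchestrator records what each
// task actually produced, then renders the facts into the next task's prompt.
public class ProjectManifest
{
    private readonly List<string> _facts = new();

    // Called by the (hypothetical) extractor after each task completes.
    public void Record(string fact) => _facts.Add(fact);

    // Prepended to the prompt of every dependent task.
    public string RenderForPrompt() =>
        _facts.Count == 0
            ? "The project is empty."
            : "The project already has:\n- " + string.Join("\n- ", _facts);
}
```

Usage mirrors the example above: after Task 3, `Record("User model (Id, Email, Name, PasswordHash)")`; before Task 5, `RenderForPrompt()` goes into the prompt so the task builds on what exists, not on what was planned.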

11.4 Gate System

Gates are checkpoints between phases that prevent error propagation:

Gate Type          Trigger                    Action on Failure
────────────────   ────────────────────────   ─────────────────────────────────────
Build Gate         dotnet build fails         Retry task with error context (max 3)
Test Gate          dotnet test has failures   Retry with test output as context
Human Gate         End of major phase         Notify human, wait for approval
Auto-Repair Gate   Build fails after retry    Escalate to Opus for diagnosis

The human gate is strategically placed — not after every task (too slow) but after each phase (meaningful checkpoint). A developer reviews the Phase 2 output before Phase 3 begins.
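The Build Gate's retry-then-escalate behavior can be sketched with delegates standing in for Diana's task runner and analyst escalation; all names here are assumptions.

```csharp
using System;

// Sketch of the Build Gate: rerun a failed task with the build errors
// appended as context, up to maxRetries attempts, then escalate.
public static class BuildGate
{
    public static bool Run(
        Func<string?, (bool Ok, string Errors)> attemptTask,
        Action<string> escalateToAnalyst,
        int maxRetries = 3)
    {
        string? errorContext = null;
        for (var attempt = 1; attempt <= maxRetries; attempt++)
        {
            var (ok, errors) = attemptTask(errorContext);
            if (ok) return true;      // gate passes; the next phase may start
            errorContext = errors;    // retry with the error as added context
        }
        // Auto-Repair Gate: hand the accumulated error context to Opus.
        escalateToAnalyst(errorContext ?? "no build output captured");
        return false;
    }
}
```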

11.5 Economics

Metric                                  Value
─────────────────────────────────────   ───────────────────────
Task decomposition (Opus, 1 call)       ~$2-5
Execution (Sonnet, ~20 tasks × $1.59)   ~$32
Gate retries (estimated 3-4 failures)   ~$6
Total project cost                      ~$40-43
Execution time                          ~2 hours autonomous
Human intervention                      ~15 min review at gates

On the Max plan ($200/month), this means ~5 complete projects per month within budget. On API rates, $40/project is still dramatically cheaper than developer time.

11.6 What Already Exists in Diana

Component                      Status   Notes
────────────────────────────   ──────   ─────────────────────────────────────────
Headless mode (diana --auto)   Exists   Can execute single tasks
Tool registry (22 tools)       Exists   Build, test, file ops, git
Build verification             Exists   dotnet_build after changes
EscalatingLLMClient            Exists   Opus for planning, Sonnet for execution
History compression            Exists   Keeps context manageable
Roslyn analysis                Exists   Cheap code understanding

11.7 What Needs to Be Built

Component          Effort   Description
────────────────   ──────   ───────────────────────────────────────────────────────────────────
SOW Parser         Medium   Opus reads SOW, generates phased DAG in JSON
Task Board         Low      Persistent queue with states, deps, phase grouping
Context Injector   High     Extracts actuals from completed tasks, injects into next
Gate System        Medium   Build/test/human-review between phases
DAG Orchestrator   High     Executes tasks respecting dependency order, parallelizes in phases
Project Manifest   Medium   Living document of what exists (models, endpoints, files)

The Context Injector and DAG Orchestrator are the two hard problems. Everything else is plumbing.

11.8 The Real Question

The question is not "can the AI work more hours?" — it's "can the AI maintain coherence across 20 chained tasks?"

A single task (the calculator) works because everything fits in one context window. Twenty chained tasks require the system to:

  1. Remember what it built (not what it planned)
  2. Adapt to deviations (Task 3 used UserEntity instead of User)
  3. Recover from failures (Task 7 failed, Task 8-20 need replanning)
  4. Maintain architectural consistency across 2 hours of autonomous execution

This is the frontier — not token optimization, not model selection, but multi-task coherence. Solving it transforms Diana from a coding assistant into an autonomous software factory.


12. The Self-Hosted Alternative: When You Own the Inference

12.1 The Hypothesis

Every optimization in this paper — prompt compression, history sliding windows, Roslyn token reduction — exists because tokens cost money. But what if they didn't? What if the cost of inference was a fixed monthly bill, like electricity?

Self-hosting an LLM inverts the entire optimization equation. Instead of minimizing tokens per task, you maximize context utilization per task. The constraint shifts from cost to latency.

12.2 The Economic Inversion

API pricing model (current):

Cost = Σ (input_tokens × $3/M + output_tokens × $15/M) per request
      ↓
Every token matters → compress, truncate, minimize

Self-hosted model:

Cost = GPU rental per month (fixed)
      ↓
Tokens are "free" → maximize context, never compress

Metric                       API (Claude Sonnet)            Self-Hosted (70B model)
──────────────────────────   ────────────────────────────   ───────────────────────
Cost model                   Per-token                      Fixed monthly
Context penalty              $0.003/1K input tokens         ~0 (already paid)
Compression needed?          Critical                       Unnecessary
History window               16 messages (optimized)        Full conversation
System prompt optimization   Essential ($0.15/turn saved)   Irrelevant
Roslyn analyze_file          16x savings ($0.04 vs $0.66)   Same speed, no savings

12.3 What Changes for Diana

Eliminated complexity:

  1. History compression and sliding windows: keep the full conversation
  2. System prompt caching and minimization: verbose prompts cost nothing extra
  3. Tool result truncation: large outputs no longer carry a per-token price

New optimization target:

  1. Latency: seconds per task and tasks per hour replace dollars per task

What stays the same:

  1. Roslyn analyze_file (still faster than reading raw files)
  2. The agent loop, tool registry, and PathValidator security layers
  3. Turn reduction (fewer turns still means less wall-clock time)

12.4 Hardware Cost Analysis

For a company with ~100 developers, not millions of users:

Setup                         Monthly Cost   Context Window   Tokens/sec   Concurrent Users
───────────────────────────   ────────────   ──────────────   ──────────   ──────────────────
4× A100 80GB (cloud)          ~$15,000       128K             ~40 t/s      8-12
4× H100 80GB (cloud)          ~$25,000       128K             ~80 t/s      15-20
8× A100 on-prem (amortized)   ~$8,000        128K             ~80 t/s      15-20
Max plan × 100 users          $20,000        200K             ~100 t/s     100 (rate limited)

The crossover point: self-hosting becomes cheaper than 100 Max subscriptions once the fixed cluster bill drops below headcount × $200/month, which at the prices above happens somewhere between 50 and 100 concurrent developers.
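As a back-of-envelope check, the break-even headcount is just the cluster bill divided by the per-seat API spend. The helper name and the $200/developer default are illustrative.

```csharp
using System;

// Toy crossover calculation: how many developers must share the cluster
// before it beats per-seat subscriptions.
public static class Crossover
{
    public static int BreakEvenDevelopers(decimal monthlyClusterCost,
                                          decimal perDeveloperApiCost = 200m) =>
        (int)Math.Ceiling(monthlyClusterCost / perDeveloperApiCost);
}
```

BreakEvenDevelopers(15_000m) is 75, so the 4× A100 cloud cluster wins from 75 developers up, consistent with the 50-100 range cited later.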

12.5 Model Candidates (March 2026)

Model                Parameters   Context   Code Quality   Self-Hostable
──────────────────   ──────────   ───────   ────────────   ─────────────
DeepSeek Coder V3    236B (MoE)   128K      Excellent      Yes, 4× A100
Llama 4 Maverick     400B (MoE)   128K      Very Good      Yes, 4× H100
Qwen 2.5 Coder 72B   72B          128K      Very Good      Yes, 2× A100
Mistral Large 2      123B         128K      Good           Yes, 2× H100
CodeLlama 70B        70B          100K      Good           Yes, 2× A100

The MoE (Mixture of Experts) models are the sweet spot — DeepSeek V3 activates only ~37B parameters per token despite having 236B total, giving near-frontier quality at manageable hardware costs.

12.6 The Fine-Tuning Advantage

Self-hosting unlocks something API providers can't offer: fine-tuning on your codebase.

Base model:        "Create a CRUD controller" → generic boilerplate
Fine-tuned model:  "Create a CRUD controller" → YOUR patterns, YOUR naming,
                                                 YOUR DbContext setup, YOUR error handling

This directly addresses the scaffold_crud failure from Section 5. Our deterministic tool generated generic code that the LLM had to rewrite. A fine-tuned model would generate your team's code patterns natively, eliminating the scaffold→read→rewrite cycle entirely.

Fine-tuning data sources: the company's own repositories, which encode the naming conventions, DbContext setups, and error-handling patterns the model should reproduce.

12.7 How Self-Hosting Solves the Factory Problem

Section 11's autonomous factory faces one core challenge: multi-task coherence — maintaining context across 20 chained tasks. Self-hosting simplifies this dramatically:

Challenge                  API Approach                                 Self-Hosted Approach
────────────────────────   ──────────────────────────────────────────   ────────────────────────────────────────────────
Context between tasks      Compress, summarize, inject key artifacts    Keep full history in extended context
Architecture consistency   Hope the summary captures naming decisions   Model remembers everything (no compression loss)
Error recovery             Re-inject partial context from failed task   Full history available, just retry
Cross-task references      Context Injector extracts + re-injects       All actuals already in context
Cost of replanning         ~$2-5 per Opus call                          Fixed cost, replan freely

The Context Injector — identified as the hardest component to build — becomes nearly trivial. Instead of extracting, compressing, and re-injecting artifacts between tasks, you simply... keep the context open.

12.8 Data Sovereignty

For companies handling client SOWs, contracts, and proprietary business logic, self-hosting means no prompt, file, or document ever leaves the company's own infrastructure.

This isn't hypothetical — it's a hard requirement for many enterprise clients. A self-hosted Diana can process SOWs containing confidential business logic without any data leaving the building.

12.9 The Hybrid Architecture

The optimal setup isn't purely self-hosted or purely API — it's hybrid:

┌──────────────────────────────────────────────────┐
│  Tier 1: Self-Hosted 70B (primary)               │
│  • All routine coding tasks                       │
│  • Full context, no compression                   │
│  • Fine-tuned on company codebase                 │
│  • Cost: $0/token (fixed infrastructure)          │
├──────────────────────────────────────────────────┤
│  Tier 2: Claude Opus (escalation)                 │
│  • Architectural planning only                    │
│  • Complex debugging (>3 consecutive failures)    │
│  • SOW decomposition into task DAGs               │
│  • Cost: ~$2-5 per escalation                     │
├──────────────────────────────────────────────────┤
│  Tier 3: Fine-tuned Small Model (8-14B)           │
│  • Code completion / autocomplete                 │
│  • Simple refactors, renames                      │
│  • Cost: negligible (runs on single GPU)          │
└──────────────────────────────────────────────────┘

Diana's EscalatingLLMClient already implements this pattern — swap the doer from Claude Sonnet to a self-hosted 70B, keep Opus as the analyst for the hardest 5% of tasks, and add a small model tier for trivial operations.
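The tier selection implied by the diagram reduces to a short routing rule. A sketch under assumed names (`Tier`, `HybridRouter`, and the trivial-edit flag are not Diana's actual types):

```csharp
// Sketch of three-tier routing: planning and repeated failures escalate,
// trivial edits drop to the small local model, everything else hits the 70B.
public enum Tier { SmallLocal, SelfHosted70B, OpusEscalation }

public static class HybridRouter
{
    public static Tier Route(bool isPlanning, int consecutiveFailures,
                             bool isTrivialEdit) =>
        isPlanning || consecutiveFailures > 3 ? Tier.OpusEscalation  // Tier 2
        : isTrivialEdit ? Tier.SmallLocal                            // Tier 3
        : Tier.SelfHosted70B;                                        // Tier 1
}
```

The ">3 consecutive failures" threshold matches the escalation trigger in the diagram; everything else defaults to the fixed-cost 70B.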

12.10 What This Means for Our Optimization Work

Optimization                 API Value                      Self-Hosted Value
──────────────────────────   ────────────────────────────   ─────────────────────────────────────────
Roslyn analyze_file          High (16x token savings)       Medium (still faster than raw reads)
History compression          Critical ($0.15/turn saved)    Zero (keep full history)
System prompt minimization   High (2,500 → 1,800 tokens)    Zero (use verbose prompts)
Turn reduction               High (cost + latency)          Medium (latency only)
scaffold_crud                Failed at API rates            May work (no cost to read generated files)

The irony: half of our optimizations become irrelevant with self-hosting. But the methodology — measuring, benchmarking, iterating — transfers directly. Instead of optimizing for $/task, you optimize for seconds/task and tasks/hour.

12.11 The Bottom Line

Self-hosting doesn't make Diana simpler — it makes it differently complex. You trade token economics for infrastructure operations. But for a company running 100+ developers through an autonomous coding agent, the math is compelling:

Scenario                      Monthly Cost   Constraints
───────────────────────────   ────────────   ──────────────────────────────────────────────
100 × Max plan                $20,000        Rate limits, no fine-tuning, data leaves infra
4× H100 + Opus escalation     ~$26,000       No rate limits, fine-tunable, full sovereignty
8× A100 on-prem (amortized)   ~$9,000        Same benefits, lower cost after Year 1

The premium for self-hosting is ~30% more in Year 1, but you get: unlimited context, zero compression, fine-tuning, data sovereignty, and no rate limits. By Year 2, on-prem hardware pays for itself.

The real question isn't "API or self-hosted?" — it's "at what scale does owning the inference become cheaper than renting it?" For Diana's target use case (autonomous software factory processing SOWs), that scale is approximately 50-100 concurrent developers.


Appendix A: Full Version History

Version   Commit    Date           Changes
───────   ───────   ────────────   ──────────────────────────────────────────────────────────
v1.0.7    c3fc688   Feb 27, 2026   Initial release
v1.2.x    —         Feb 2026       Early optimizations (not in current git)
v1.3.0    d3a30f1   Mar 3, 2026    Token optimization: prompt caching, compression, filtering
v1.3.1    5ac38d6   Mar 3, 2026    Fix regression: type-aware compression
v1.3.2    0cdafe1   Mar 3, 2026    Roslyn analyze_file tool
v1.4.0    —         Mar 3, 2026    scaffold_crud experiment (reverted)

Appendix B: Platform Distribution

Diana v1.3.2 is distributed for 10 platform targets:

Platform         RID                                 Architecture
──────────────   ─────────────────────────────────   ────────────────────
Windows          win-x64, win-x86, win-arm64         x64, x86, ARM64
Linux            linux-x64, linux-arm, linux-arm64   x64, ARM, ARM64
Linux (Alpine)   linux-musl-x64, linux-musl-arm64    x64, ARM64
macOS            osx-x64, osx-arm64                  Intel, Apple Silicon

Framework-dependent deployment (requires .NET 9.0 runtime).

Appendix C: Project Metrics


Diana — .NET Development Agent "The cost of an LLM agent is not the code it writes — it's the context it needs to write it."