Intelligence Design

#philosophy #speculative #research

A Note on This Article

This article represents speculative thinking about agentic system architecture—a framework for reasoning about AI agents that emerged from building them. The claims here are heuristics and mental models, not proven laws. They're useful for thinking about agent design, but remain work-in-progress ideas requiring empirical validation. Treat this as one lens among many, not gospel.

The Wrong Mental Model

The dominant frame for AI agents treats them as deterministic executors. You craft a prompt, the agent follows instructions, you get output. When it fails, you "fix the prompt." This model is inherited from traditional programming: input → function → output.

But LLMs aren't functions. They're probability distributions over outputs. Each call samples from that distribution. The same prompt produces different outputs across runs. The model hallucinates, drifts, misunderstands—not as bugs, but as intrinsic properties of the substrate.

This reframe changes everything about how you design agent systems. Agent design isn't about making agents "do the right thing." It's about engineering systems where correct outcomes emerge despite unreliable components.

Deterministic Frame | Probabilistic Frame
Agent as executor | Agent as noisy channel
Prompt → correct output | P(correct output) given prompt
Failure = bad prompt | Failure = signal below threshold
Fix the instruction | Engineer the distribution
One perfect call | N calls + filtering + feedback

The deterministic frame leads to endless prompt tweaking. The probabilistic frame leads to system architecture. One is reactive. The other is causal.

The Universal Pattern

What is intelligence? Strip away the mysticism and you find a simple pattern: reliably producing good outcomes from uncertainty.

How do you construct it? Generate possibilities, then filter to good ones.

That's it. That's the universal algorithm. Every system that exhibits intelligence runs some version of this loop.

Domain | Generate | Filter
Evolution | Random mutation | Selection pressure
Brain | Neuronal noise / candidate actions | Prediction error / reward signal
Science | Hypotheses | Experiments
Markets | Ventures | Profit/loss
LLM training | Token sampling | Training signal (RLHF)
Agent systems | N outputs | Auto-evaluation

Evolution doesn't craft perfect organisms. It generates variants and lets selection filter. Science doesn't produce correct theories. It generates hypotheses and lets experiments filter. Markets don't allocate capital optimally on first try. They fund ventures and let profit/loss filter.

Intelligence isn't magic. It's generate + filter running until good outcomes emerge. The substrate varies—biological, economic, computational—but the pattern is invariant.

Agency in System Design

Agency is the ability to be causal—to generate effects rather than respond to them. In agent system design, this translates to two modes:

Low agency design: Hope each prompt produces correct output. Be responsive to LLM variance. When it fails, tweak and retry. You're at the mercy of the distribution.

High agency design: Engineer the system so correct outcomes emerge. Be causal about architecture. Design pipelines where even unreliable components produce reliable results. You're reshaping the distribution.

The high-agency move: don't optimize individual LLM calls. Engineer the distribution so that correct outputs become high-probability across the system.

This is probability space bending applied to AI. You're not predicting which outputs will be correct. You're building systems where the probability mass shifts toward correctness.

The Signal Metaphor

From information theory: a signal passes through a channel, noise is added, the receiver must reconstruct the original signal.

An LLM call works the same way. Your intent is the signal. The LLM is the channel. Noise sources corrupt the transmission:

  • Hallucination — model generates false content
  • Drift — model loses thread over long context
  • Misinterpretation — model reads intent differently than you meant
  • Format errors — output doesn't match expected structure
  • Knowledge gaps — model lacks required information

The output is signal + noise. Your job as system designer is to maximize signal-to-noise ratio at the system level, not at the individual call level.

Why Single Attempts Fail

Every system has a detection threshold—the minimum signal strength required to distinguish your signal from background noise. A single LLM call, no matter how well-crafted, is one sample from a noisy distribution.

The prompting paradigm assumes reliability: "If I phrase it right, the agent will work." This is the intensity strategy—maximize quality of single attempt.

Intensity strategy fails when:

  1. You cannot reliably produce outlier-quality prompts
  2. The system has high noise floor (LLM variance)
  3. Single attempts fall below detection threshold (non-zero hallucination rate)

The math is clarifying. If P(correct) = 0.7 per agent call and calls are independent:

  • 1 call: 70% success
  • 3 calls + majority vote: ~78% success
  • 3 calls + a reliable filter that recognizes a correct output: ~97% success
  • 5 calls + such a filter: ~99.8% success

Prompting optimizes the 0.7. System design optimizes the pipeline that produces 97%+ from 0.7 components. The leverage is in the architecture, not the prompt.
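A quick sanity check of these figures, assuming independent calls (pure Python, no external libraries):

```python
from math import comb

def p_majority(p: float, n: int) -> float:
    """P(a majority of n independent calls is correct), n odd."""
    k = n // 2 + 1
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def p_filtered(p: float, n: int) -> float:
    """P(at least one of n independent calls is correct),
    i.e. the ceiling when a reliable filter can spot a correct output."""
    return 1 - (1 - p) ** n

for n in (1, 3, 5):
    print(n, round(p_majority(0.7, n), 2), round(p_filtered(0.7, n), 3))
# 1 0.7 0.7
# 3 0.78 0.973
# 5 0.84 0.998
```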

The Core Primitive: Amplification

The solution isn't to craft a stronger single signal. It's to amplify weak signals until they cross the detection threshold.

If P(success) per attempt is 0.02:

  • 1 attempt: P(at least one success) = 0.02
  • 50 attempts: P(at least one success) = 0.64
  • 100 attempts: P(at least one success) = 0.87
  • 200 attempts: P(at least one success) = 0.98

Volume compounds: P(at least one success) = 1 - (1 - p)^N, so each additional attempt multiplies the failure probability by (1 - p). This is the fundamental algorithm for changing probability distributions in noisy systems.
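The same arithmetic for the weak-signal case; two lines reproduce the numbers above:

```python
p = 0.02
for n in (1, 50, 100, 200):
    print(n, round(1 - (1 - p) ** n, 2))   # 0.02, 0.64, 0.87, 0.98
```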

All sophisticated techniques reduce to one primitive:

Generate N → Auto-evaluate → Select best

Everything else is implementation detail:

Technique | It's Just...
Self-consistency | N reasoning paths → vote on answer
Best-of-N | N outputs → score → pick top
AlphaCode | N million programs → test filter → pick passing
Tree of Thoughts | N branches → evaluate → expand best
Rejection sampling | N outputs → filter until one passes
Beam search | N candidates → score → keep top-k → repeat

The papers have different names. The mechanism is identical.
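The primitive fits in a few lines. A minimal sketch, assuming the caller supplies `generate` (one LLM call, one search rollout, one draft) and `score` (auto-evaluation, higher is better); neither is tied to any particular API:

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def best_of_n(
    generate: Callable[[], T],          # produce one candidate (e.g. one LLM call)
    score: Callable[[T], float],        # auto-evaluation: higher is better
    n: int = 10,
    threshold: Optional[float] = None,  # optional detection threshold
) -> Optional[T]:
    """Generate N candidates, auto-evaluate each, select the best."""
    candidates = [generate() for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if threshold is not None and best_score < threshold:
        return None                     # nothing crossed the threshold; raise n or revise the prompt
    return best
```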

The Structural Requirement

For generate + filter to work, evaluation must be cheaper and more reliable than generation.

If... | Then...
Eval is deterministic (tests, schema) | Works perfectly
Eval is LLM but easier than generation | Works with some noise
Eval is as hard as generation | You've just doubled compute for nothing

Filtering works when verification is asymmetrically easy:

Domain | Why Filtering Works
Code | Tests are deterministic; ground truth exists
Math | Computation is checkable; verify step by step
Factual | Sources exist; check against documents
Format | Schema is defined; validate structure
Extraction | Source document exists; verify against input

Filtering struggles when verification is as hard as generation:

Domain | Why Filtering Is Hard
Creative writing | No ground truth, judgment is subjective
Open-ended reasoning | Validating reasoning is as hard as reasoning
Novel problems | No known correct answer to check against
Taste/quality | Scoring function is as uncertain as generation

This asymmetry determines where the generate + filter pattern provides leverage. Code generation? Massive leverage—tests exist. Creative writing? Limited leverage—no ground truth to filter against.
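Where verification is cheap it is usually deterministic code, not another LLM call. Two illustrative filters (the field requirements and test-case format are assumptions for the sketch, not from the article):

```python
import json

def valid_record(text: str, required: set) -> bool:
    """Format filter: the output must parse as JSON and contain the required keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required <= data.keys()

def passes_tests(candidate_fn, test_cases) -> bool:
    """Code filter: a candidate function must reproduce every known input/output pair."""
    try:
        return all(candidate_fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False
```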

Distribution Control Variables

"Engineering the distribution" sounds abstract. In practice, these are the concrete control variables:

Variable | What It Controls | How to Adjust
Temperature | Variance (spread) | Lower = tighter distribution, higher = more diverse
Prompt structure | Distribution center | Clearer prompt = center closer to desired output
Examples (few-shot) | Distribution shape | Examples pull distribution toward their pattern
Output constraints | Distribution truncation | JSON mode, function calling = cut off invalid regions
Model selection | Base distribution | Different models = different priors
Context window | Conditional distribution | What's in context shifts P(output)
Decomposition | Task distribution | Smaller task = tighter distribution per step
N samples | Sampling coverage | More samples = better coverage of distribution
Scoring function | Selection pressure | Filter + select shifts effective distribution

Each variable gives you a lever. Prompt engineering adjusts distribution center. Temperature adjusts spread. Constraints truncate invalid regions. Volume increases coverage. Scoring applies selection pressure.

System design is combining these levers to produce distributions where correct outputs are high-probability.
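As one way to make the levers concrete, they can be collected into a single sampling plan. The `llm` callable and its parameter names are hypothetical placeholders, not a specific SDK:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class SamplingPlan:
    prompt: str                                   # distribution center
    examples: list = field(default_factory=list)  # few-shot examples: distribution shape
    temperature: float = 0.2                      # spread
    schema: Optional[dict] = None                 # output constraints: truncate invalid regions
    model: str = "placeholder-model"              # base distribution (prior)
    n_samples: int = 5                            # sampling coverage

def run(plan: SamplingPlan, llm: Callable[..., str], score: Callable[[str], float]) -> str:
    """Condition and constrain each call, sample N times, then apply selection pressure."""
    outputs = [
        llm(prompt=plan.prompt, examples=plan.examples, temperature=plan.temperature,
            schema=plan.schema, model=plan.model)
        for _ in range(plan.n_samples)
    ]
    return max(outputs, key=score)
```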

Amplification Strategies

Different situations call for different amplification approaches:

Strategy | When to Use | Mechanism
Temporal (retries) | Single channel, need reliability | Same call N times, majority vote
Spatial (parallel) | Multiple approaches possible | Different prompts/models in parallel, combine
Filter (validation) | Ground truth exists | Generate → validate → select passing
Diversity | Correlated errors likely | Vary prompt/temperature/model to sample different regions

Simple amplification math assumes independence. The figures above (0.7 per call becoming 97%+ with filtering) only hold if calls are independent. But they're not: same prompt, same model, same systematic biases.

The technique: sample from different regions of the distribution, not just sample more. Self-consistency works because it varies the reasoning path (chain of thought), not just re-samples. AlphaCode works because it generates structurally diverse programs.

Diversity beats repetition when errors are correlated.
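A small simulation of why correlation matters. The error model is invented for illustration: every call sharing a prompt also shares that prompt's chance of a "bad framing" (30%), while per-call noise (5%) stays independent:

```python
import random

def vote_accuracy(n_calls: int, n_prompts: int, trials: int = 20_000) -> float:
    """Majority-vote accuracy when errors share a per-prompt systematic bias."""
    correct = 0
    for _ in range(trials):
        bad_framing = [random.random() < 0.30 for _ in range(n_prompts)]  # correlated component
        votes = 0
        for i in range(n_calls):
            ok = (not bad_framing[i % n_prompts]) and random.random() > 0.05  # independent noise
            votes += 1 if ok else -1
        correct += votes > 0
    return correct / trials

print(vote_accuracy(5, n_prompts=1))  # ~0.70: repeating one prompt, correlated errors cap the gain
print(vote_accuracy(5, n_prompts=5))  # ~0.79: five distinct prompts decorrelate the errors
```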

Signal Function Taxonomy

Each LLM call in an agent system serves a signal processing function. The function determines reliability requirements and amplification strategy.

Source Functions (Generate Signal)

Generator — Produce raw content, options, drafts. Reliability need is low (quantity over quality). Strategy: high volume, filter downstream. Example: brainstorm 20 approaches, filter to 3.

Planner — Decompose intent into executable steps. Reliability need is medium-high (structure matters). Strategy: validate plan before execution, allow revision.

Routing Functions (Direct Signal)

Router / Classifier — Determine which path signal takes. Reliability need is very high (wrong path = cascading error). Strategy: constrained outputs, explicit categories, fallback paths.

Orchestrator — Coordinate multi-agent execution. Reliability need is very high (controls all flow). Strategy: simple logic, deterministic where possible, minimal LLM reliance.

Transformation Functions (Modify Signal)

Specialist — Execute one defined transformation. Reliability need is medium (can retry). Strategy: clear scope + volume + filtering.

Translator — Convert between representations. Natural language → SQL, text → structured data.

Compressor — Reduce dimensionality, preserve essence. Summarization, distillation.

Extractor — Isolate specific signal from noisy input. Entity extraction, key information retrieval.

Synthesizer — Combine multiple signals into coherent output. Research synthesis, multi-source integration.

Filtering Functions (Reduce Noise)

Validator / Evaluator — Check output against criteria, provide gradient. Reliability need is high (feedback accuracy determines learning). Strategy: multiple validators, explicit rubrics, cross-check.

Critic — Second-pass noise filter. Review generated content for errors before use.

Recovery — Handle failures, adjust parameters. Fallback hierarchies, error classification.

Memory Functions (Persist Signal)

Memory — Store and retrieve across time. Reliability need is high (corruption propagates). Strategy: structured storage, validation on write.
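One way to make the taxonomy operational is to attach each function's reliability requirement to it, so a pipeline knows where to spend its amplification budget. A sketch that simply restates the taxonomy above as data:

```python
from enum import Enum

class Reliability(Enum):
    LOW = "accept noise, filter downstream"
    MEDIUM = "clear scope, retry, filter"
    HIGH = "multiple verification, explicit rubrics"
    VERY_HIGH = "constrain outputs, deterministic fallbacks"

SIGNAL_FUNCTIONS = {
    "generator":    Reliability.LOW,        # quantity over quality
    "planner":      Reliability.HIGH,       # medium-high in the taxonomy; structure matters
    "router":       Reliability.VERY_HIGH,  # wrong path = cascading error
    "orchestrator": Reliability.VERY_HIGH,  # controls all flow
    "specialist":   Reliability.MEDIUM,     # can retry
    "validator":    Reliability.HIGH,       # feedback accuracy determines learning
    "memory":       Reliability.HIGH,       # corruption propagates
}
```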

Composition Patterns

Signal functions compose into systems. These patterns appear repeatedly:

Pattern 1: Reliable Output from Unreliable Source

Generator(n=10) → Evaluator → Filter(threshold) → Output

Use when: Single generation unreliable, verification cheap. The generator produces quantity, the evaluator scores, the filter selects. Individual generators can be noisy because the system tolerates it.

Pattern 2: Domain-Appropriate Processing

Router → Specialist[domain] → Validator → Output

Use when: Different inputs need different processing. The router directs traffic, specialists handle their domain, validators check output. Router reliability is critical—wrong routing corrupts everything downstream.
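A sketch of Pattern 2, with `classify`, `specialists`, and `validate` as hypothetical callables; the fallback path guards against the router inventing a category:

```python
def route_and_process(item, classify, specialists: dict, validate, fallback: str = "general"):
    """Pattern 2: Router → Specialist[domain] → Validator → Output."""
    domain = classify(item)        # should be a constrained output (explicit categories)
    if domain not in specialists:
        domain = fallback          # fallback path instead of a silent misroute
    output = specialists[domain](item)
    if not validate(output):
        raise ValueError(f"specialist '{domain}' produced invalid output")
    return output
```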

Pattern 3: Iterative Refinement

Generator → Critic → Refiner → Critic → ... → Output

Use when: Quality improves with iteration, critic is reliable. Each cycle adds signal, removes noise. Works when the critic can provide useful gradient.

Pattern 4: Parallel Decomposition

Planner → [Specialist × N in parallel] → Synthesizer → Output

Use when: Task decomposes into independent subtasks. Planner breaks down, specialists work in parallel, synthesizer recombines. Massive parallelism opportunity.
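A sketch of Pattern 4. LLM calls are I/O-bound, so a thread pool captures the parallelism; `plan`, `specialist`, and `synthesize` are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_decompose(task, plan, specialist, synthesize):
    """Pattern 4: Planner → [Specialist × N in parallel] → Synthesizer → Output."""
    subtasks = plan(task)                                  # decompose into independent pieces
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        partials = list(pool.map(specialist, subtasks))    # run specialists concurrently
    return synthesize(partials)
```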

Pattern 5: Generate-Test-Refine Loop

Generator(n) → Tester → [passing] → Select best
                     → [failing] → Analyzer → Generator(n, with feedback)

Use when: Tests exist, feedback improves generation. Failing tests become learning signal for next generation round. This is how AlphaCode works.
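A sketch of Pattern 5, with `generate`, `run_tests`, and `analyze` as placeholders; the loop feeds failure analysis back into the next generation round:

```python
def generate_test_refine(generate, run_tests, analyze, n: int = 20, max_rounds: int = 3):
    """Pattern 5: failing tests become the learning signal for the next round."""
    feedback = None
    for _ in range(max_rounds):
        candidates = [generate(feedback) for _ in range(n)]
        results = [(c, run_tests(c)) for c in candidates]        # deterministic filter
        passing = [c for c, ok in results if ok]
        if passing:
            return passing[0]                                    # or rank the passing set further
        feedback = analyze([c for c, ok in results if not ok])   # turn failures into signal
    return None
```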

Foundational Observations

Prompting Is Necessary But Not Sufficient

Prompting IS how you shift the distribution center. You can't escape it—signal engineering still requires good prompts at each node.

But prompting alone (optimizing single-call quality) hits a ceiling. System design (volume, filtering, decomposition) breaks through that ceiling.

Prompting = signal clarity at source. System design = amplification + filtering through pipeline.

Both matter. Prompting is component design. Signal engineering is system design.

Reliability Requirements Vary by Function

Not all agent calls need the same reliability. A router that misclassifies corrupts everything downstream. A generator that produces one bad option among ten is fine—you filter later.

High reliability required → Minimize LLM dependence, constrain outputs, multiple verification, deterministic fallbacks.

Low reliability acceptable → Maximize LLM freedom, high volume generation, filter downstream, accept noise and extract signal.

Design accordingly.

Noise Budget Is Finite

Every stage adds noise. Design question: where can you afford noise, and where must you eliminate it?

Error at Router: 3 downstream agents do wrong task → total waste. Error at Generator: 19/20 outputs filtered → system still works.

Allocate noise budget to stages where filtering can recover. Minimize noise at routing and orchestration where errors cascade.

Evals Measure Distributions

Evaluating a single output tells you almost nothing about the distribution. This is why:

  • "It worked in testing" doesn't mean it works in production
  • "It failed once" doesn't mean it always fails
  • Prompt tweaks have inconsistent effects

Evals measure distributions, not outputs. 100 runs gives you an estimate of P(correct). Then you engineer the system until P(correct) crosses your threshold.
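In code, an eval is just a pass-rate estimate over repeated runs (a sketch; `run_agent` and `check` are placeholders):

```python
def estimate_p_correct(run_agent, check, n_runs: int = 100) -> float:
    """An eval is a distribution measurement: repeat the task and report the pass rate."""
    return sum(bool(check(run_agent())) for _ in range(n_runs)) / n_runs
```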

The Framework Applies Where Asymmetry Is Largest

Generate + filter wins big where verification is cheap:

  • Code (tests exist)
  • Structured extraction (source exists)
  • Factual tasks (documents exist)
  • Format compliance (schema exists)

This is also where most production agent use cases live. The asymmetry is your leverage.

Evidence

Published results support the generate + filter pattern:

Method | Mechanism | Improvement
Self-consistency (Wang et al.) | Sample N reasoning chains, majority vote | +10-20% on reasoning benchmarks
AlphaCode (DeepMind) | Generate millions of programs, filter with tests | Competitive with humans (top 54%)
Best-of-N sampling | Generate N, score, select top | Consistent gains across tasks
Constitutional AI (Anthropic) | Generate → critique → revise loop | Reduced harmful outputs
Tree of Thoughts (Yao et al.) | Branch generation + evaluation + selection | +20-30% on planning tasks
Verifier models (Cobbe et al.) | Separate model scores solutions | +15% on math word problems

All of these are volume + filtering strategies. None are "better single-shot prompting."

What remains unproven: the specific signal function taxonomy (descriptive, needs validation), generalization to all agent tasks, specific reliability claims per function type, optimal compositions for various domains. These require empirical measurement.

The Meta-Principle

The pattern connects to agency and forcing functions beyond AI systems.

Stop optimizing single instances. Start engineering distributions.

Applied to behavior: Don't force each gym visit through willpower. Reshape P(gym) through architecture.

Applied to agents: Don't perfect each prompt. Reshape P(correct) through system design.

The substrate is different. The engineering is identical. You're not being causal about individual outcomes. You're being causal about the generator that produces outcomes.

This is Level 4 agency: engineering probability distributions rather than predicting or responding to them. Weather forecasters describe distributions. Climate engineers modify them. Prompt engineers describe what agents might do. Intelligence designers reshape what agents probably do.

ℹ️ Key Principle

Intelligence design is system architecture for unreliable components. LLMs are probability distributions, not deterministic functions—the same prompt produces different outputs across runs. The universal pattern is generate + filter: produce possibilities, then select good ones. This pattern appears in evolution, science, markets, and AI systems. The core primitive is "Generate N → Auto-evaluate → Select best." For this to work, evaluation must be cheaper than generation—which is why the pattern provides most leverage in code (tests exist), extraction (sources exist), and structured output (schemas exist). Design agent systems by composing signal functions: Generators produce volume, Routers direct flow (high reliability required), Specialists transform, Validators filter, Synthesizers combine. Allocate noise budget strategically—errors at routing cascade, errors at generation filter out. Prompting shifts distribution center; system design amplifies and filters. Both matter, but system design breaks through the ceiling that prompting alone cannot. The meta-principle: stop optimizing single instances, start engineering distributions. This is probability space bending applied to AI.


Intelligence isn't magic. It's generate + filter running until good outcomes emerge. The substrate varies—biological, economic, computational—but the pattern is invariant. Stop optimizing single instances. Start engineering distributions. The leverage is in the architecture.