Intelligence Design
#philosophy #speculative #research
A Note on This Article
This article represents speculative thinking about agentic system architecture—a framework for reasoning about AI agents that emerged from building them. The claims here are heuristics and mental models, not proven laws. They're useful for thinking about agent design, but remain work-in-progress ideas requiring empirical validation. Treat this as one lens among many, not gospel.
The Wrong Mental Model
The dominant frame for AI agents treats them as deterministic executors. You craft a prompt, the agent follows instructions, you get output. When it fails, you "fix the prompt." This model is inherited from traditional programming: input → function → output.
But LLMs aren't functions. They're probability distributions over outputs. Each call samples from that distribution. The same prompt produces different outputs across runs. The model hallucinates, drifts, misunderstands—not as bugs, but as intrinsic properties of the substrate.
This reframe changes everything about how you design agent systems. Agent design isn't about making agents "do the right thing." It's about engineering systems where correct outcomes emerge despite unreliable components.
| Deterministic Frame | Probabilistic Frame |
|---|---|
| Agent as executor | Agent as noisy channel |
| Prompt → correct output | P(correct output) given prompt |
| Failure = bad prompt | Failure = signal below threshold |
| Fix the instruction | Engineer the distribution |
| One perfect call | N calls + filtering + feedback |
The deterministic frame leads to endless prompt tweaking. The probabilistic frame leads to system architecture. One is reactive. The other is causal.
The Universal Pattern
What is intelligence? Strip away the mysticism and you find a simple pattern: reliably producing good outcomes from uncertainty.
How do you construct it? Generate possibilities, then filter to good ones.
That's it. That's the universal algorithm. Every system that exhibits intelligence runs some version of this loop.
| Domain | Generate | Filter |
|---|---|---|
| Evolution | Random mutation | Selection pressure |
| Brain | Neuronal noise / candidate actions | Prediction error / reward signal |
| Science | Hypotheses | Experiments |
| Markets | Ventures | Profit/loss |
| LLM training | Token sampling | Training signal (RLHF) |
| Agent systems | N outputs | Auto-evaluation |
Evolution doesn't craft perfect organisms. It generates variants and lets selection filter. Science doesn't produce correct theories. It generates hypotheses and lets experiments filter. Markets don't allocate capital optimally on first try. They fund ventures and let profit/loss filter.
Intelligence isn't magic. It's generate + filter running until good outcomes emerge. The substrate varies—biological, economic, computational—but the pattern is invariant.
Agency in System Design
Agency is the ability to be causal—to generate effects rather than respond to them. In agent system design, this translates to two modes:
Low agency design: Hope each prompt produces correct output. Be responsive to LLM variance. When it fails, tweak and retry. You're at the mercy of the distribution.
High agency design: Engineer the system so correct outcomes emerge. Be causal about architecture. Design pipelines where even unreliable components produce reliable results. You're reshaping the distribution.
The high-agency move: don't optimize individual LLM calls. Engineer the distribution so that correct outputs become high-probability across the system.
This is probability space bending applied to AI. You're not predicting which outputs will be correct. You're building systems where the probability mass shifts toward correctness.
The Signal Metaphor
From information theory: a signal passes through a channel, noise is added, the receiver must reconstruct the original signal.
An LLM call works the same way. Your intent is the signal. The LLM is the channel. Noise sources corrupt the transmission:
- Hallucination — model generates false content
- Drift — model loses thread over long context
- Misinterpretation — model reads intent differently than you meant
- Format errors — output doesn't match expected structure
- Knowledge gaps — model lacks required information
The output is signal + noise. Your job as system designer is to maximize signal-to-noise ratio at the system level, not at the individual call level.
Why Single Attempts Fail
Every system has a detection threshold—the minimum signal strength required to distinguish your signal from background noise. A single LLM call, no matter how well-crafted, is one sample from a noisy distribution.
The prompting paradigm assumes reliability: "If I phrase it right, the agent will work." This is the intensity strategy: maximize the quality of a single attempt.
The intensity strategy fails when:
- You cannot reliably produce outlier-quality prompts
- The system has a high noise floor (LLM variance)
- Single attempts fall below the detection threshold (non-zero hallucination rate)
The math is clarifying. If P(correct) = 0.7 per agent call, the calls are independent, and a reliable check can recognize a correct answer:
- 1 call: 70% success
- 3 calls + select one that passes: 1 − 0.3³ ≈ 97% success
- 5 calls + select one that passes: 1 − 0.3⁵ ≈ 99.8% success
Prompting optimizes the 0.7. System design optimizes the pipeline that produces 97% from 0.7 components. The leverage is in the architecture, not the prompt.
The Core Primitive: Amplification
The solution isn't to craft a stronger single signal. It's to amplify weak signals until they cross the detection threshold.
If P(success) per attempt is 0.02:
- 1 attempt: P(at least one success) = 0.02
- 50 attempts: P(at least one success) = 0.64
- 100 attempts: P(at least one success) = 0.87
- 200 attempts: P(at least one success) = 0.98
Volume compounds: the probability that every attempt fails is (1 − p)^N, which decays exponentially with N. This is the fundamental algorithm for changing probability distributions in noisy systems.
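A minimal sketch of this arithmetic in plain Python. The helper functions are illustrative, not part of any agent framework:

```python
import math

def p_any_success(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1 - (1 - p) ** n

def attempts_needed(p: float, target: float) -> int:
    """Smallest n such that P(at least one success) >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# Reproduces the numbers above for p = 0.02
for n in (1, 50, 100, 200):
    print(n, round(p_any_success(0.02, n), 2))  # 0.02, 0.64, 0.87, 0.98

# How many attempts to cross a 99% threshold?
print(attempts_needed(0.02, 0.99))  # 228
```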
All sophisticated techniques reduce to one primitive:
Generate N → Auto-evaluate → Select best
Everything else is implementation detail:
| Technique | It's Just... |
|---|---|
| Self-consistency | N reasoning paths → vote on answer |
| Best-of-N | N outputs → score → pick top |
| AlphaCode | N million programs → test filter → pick passing |
| Tree of Thoughts | N branches → evaluate → expand best |
| Rejection sampling | N outputs → filter until one passes |
| Beam search | N candidates → score → keep top-k → repeat |
The papers have different names. The mechanism is identical.
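A minimal sketch of the primitive. `generate` and `score` are hypothetical stand-ins for whatever LLM call and auto-evaluator a given system uses; nothing here depends on a specific API:

```python
from typing import Callable, Optional

def best_of_n(
    generate: Callable[[], str],        # one sample from the generator's distribution
    score: Callable[[str], float],      # auto-evaluation: higher is better
    n: int = 10,
    threshold: Optional[float] = None,  # optional minimum acceptable score
) -> Optional[str]:
    """Generate N -> Auto-evaluate -> Select best."""
    candidates = [generate() for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda sc: sc[0])
    if threshold is not None and best_score < threshold:
        return None  # signal never crossed the detection threshold
    return best
```

Self-consistency, best-of-N, and rejection sampling are this loop with different `score` functions: majority agreement, a reward model, or a pass/fail test.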
The Structural Requirement
For generate + filter to work, evaluation must be cheaper and more reliable than generation.
| If... | Then... |
|---|---|
| Eval is deterministic (tests, schema) | Works perfectly |
| Eval is LLM but easier than generation | Works with some noise |
| Eval is as hard as generation | You've just doubled compute for nothing |
Filtering works when verification is asymmetrically easy:
| Domain | Why Filtering Works |
|---|---|
| Code | Tests are deterministic—ground truth exists |
| Math | Computation is checkable—verify step by step |
| Factual | Sources exist—check against documents |
| Format | Schema is defined—validate structure |
| Extraction | Source document exists—verify against input |
Filtering struggles when verification is as hard as generation:
| Domain | Why Filtering Is Hard |
|---|---|
| Creative writing | No ground truth, judgment is subjective |
| Open-ended reasoning | Validating reasoning is as hard as reasoning |
| Novel problems | No known correct answer to check against |
| Taste/quality | Scoring function is as uncertain as generation |
This asymmetry determines where the generate + filter pattern provides leverage. Code generation? Massive leverage—tests exist. Creative writing? Limited leverage—no ground truth to filter against.
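A sketch of what the easy side of the asymmetry looks like in practice. The filter below, for a structured-extraction task, is a few lines of deterministic code, far cheaper and more reliable than the generation it checks (the field names are illustrative):

```python
import json

REQUIRED_FIELDS = {"invoice_id", "total", "currency"}  # illustrative schema

def passes_filter(candidate: str, source_text: str) -> bool:
    """Deterministic check for a structured-extraction output."""
    try:
        data = json.loads(candidate)                      # format: a schema exists
    except json.JSONDecodeError:
        return False
    if not REQUIRED_FIELDS.issubset(data):                # structure: required keys present
        return False
    # grounding: extracted values must literally appear in the source document
    return all(str(data[field]) in source_text for field in ("invoice_id", "total"))
```

No comparably cheap check exists for "is this essay good," which is why the same pattern buys far less in creative domains.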
Distribution Control Variables
"Engineering the distribution" sounds abstract. In practice, these are the concrete control variables:
| Variable | What It Controls | How to Adjust |
|---|---|---|
| Temperature | Variance (spread) | Lower = tighter distribution, higher = more diverse |
| Prompt structure | Distribution center | Clearer prompt = center closer to desired output |
| Examples (few-shot) | Distribution shape | Examples pull distribution toward their pattern |
| Output constraints | Distribution truncation | JSON mode, function calling = cut off invalid regions |
| Model selection | Base distribution | Different models = different priors |
| Context window | Conditional distribution | What's in context shifts P(output) |
| Decomposition | Task distribution | Smaller task = tighter distribution per step |
| N samples | Sampling coverage | More samples = better coverage of distribution |
| Scoring function | Selection pressure | Filter + select shifts effective distribution |
Each variable gives you a lever. Prompt engineering adjusts distribution center. Temperature adjusts spread. Constraints truncate invalid regions. Volume increases coverage. Scoring applies selection pressure.
System design is combining these levers to produce distributions where correct outputs are high-probability.
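A sketch of several levers made explicit in one place. `llm_sample` is a hypothetical wrapper around whatever model API is in use; the point is that temperature, sample count, constraints, and scoring are parameters of the system, not properties of a single perfect prompt:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SamplingPolicy:
    prompt: str                                        # distribution center
    temperature: float = 0.8                           # spread
    n: int = 8                                         # coverage
    validate: Callable[[str], bool] = lambda s: True   # truncation: drop invalid regions
    score: Callable[[str], float] = lambda s: 0.0      # selection pressure

def run(policy: SamplingPolicy, llm_sample: Callable[[str, float], str]) -> Optional[str]:
    candidates = [llm_sample(policy.prompt, policy.temperature) for _ in range(policy.n)]
    valid = [c for c in candidates if policy.validate(c)]
    return max(valid, key=policy.score) if valid else None
```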
Amplification Strategies
Different situations call for different amplification approaches:
| Strategy | When to Use | Mechanism |
|---|---|---|
| Temporal (retries) | Single channel, need reliability | Same call N times, majority vote |
| Spatial (parallel) | Multiple approaches possible | Different prompts/models in parallel, combine |
| Filter (validation) | Ground truth exists | Generate → validate → select passing |
| Diversity | Correlated errors likely | Vary prompt/temperature/model to sample different regions |
Simple majority voting (or any pure retry strategy) assumes independence. The arithmetic above, where 0.7 per call compounds to ~97%, only holds if the calls are independent samples. They're not: same prompt, same model, same systematic biases.
The technique: sample from different regions of the distribution, not just sample more. Self-consistency works because it varies the reasoning path (chain of thought), not just re-samples. AlphaCode works because it generates structurally diverse programs.
Diversity beats repetition when errors are correlated.
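A sketch of diversity over repetition: instead of N identical calls, spread samples across prompt framings and temperatures so they come from different regions of the distribution. The prompt variants and `llm_sample` wrapper are illustrative:

```python
import random
from typing import Callable

PROMPT_VARIANTS = [
    "Solve step by step:\n{task}",
    "Solve, then double-check each step:\n{task}",
    "Outline the approach first, then solve:\n{task}",
]

def diverse_samples(task: str, llm_sample: Callable[[str, float], str], n: int = 9) -> list[str]:
    """Spread n samples across framings and temperatures to decorrelate errors."""
    outputs = []
    for i in range(n):
        prompt = PROMPT_VARIANTS[i % len(PROMPT_VARIANTS)].format(task=task)
        temperature = random.uniform(0.3, 1.0)  # vary the spread, not just the seed
        outputs.append(llm_sample(prompt, temperature))
    return outputs
```

Majority voting or best-of-N then runs over these decorrelated samples.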
Signal Function Taxonomy
Each LLM call in an agent system serves a signal processing function. The function determines reliability requirements and amplification strategy.
Source Functions (Generate Signal)
Generator — Produce raw content, options, drafts. Reliability need is low (quantity over quality). Strategy: high volume, filter downstream. Example: brainstorm 20 approaches, filter to 3.
Planner — Decompose intent into executable steps. Reliability need is medium-high (structure matters). Strategy: validate plan before execution, allow revision.
Routing Functions (Direct Signal)
Router / Classifier — Determine which path signal takes. Reliability need is very high (wrong path = cascading error). Strategy: constrained outputs, explicit categories, fallback paths.
Orchestrator — Coordinate multi-agent execution. Reliability need is very high (controls all flow). Strategy: simple logic, deterministic where possible, minimal LLM reliance.
Transformation Functions (Modify Signal)
Specialist — Execute one defined transformation. Reliability need is medium (can retry). Strategy: clear scope + volume + filtering.
Translator — Convert between representations. Natural language → SQL, text → structured data.
Compressor — Reduce dimensionality, preserve essence. Summarization, distillation.
Extractor — Isolate specific signal from noisy input. Entity extraction, key information retrieval.
Synthesizer — Combine multiple signals into coherent output. Research synthesis, multi-source integration.
Filtering Functions (Reduce Noise)
Validator / Evaluator — Check output against criteria, provide gradient. Reliability need is high (feedback accuracy determines learning). Strategy: multiple validators, explicit rubrics, cross-check.
Critic — Second-pass noise filter. Review generated content for errors before use.
Recovery — Handle failures, adjust parameters. Fallback hierarchies, error classification.
Memory Functions (Persist Signal)
Memory — Store and retrieve across time. Reliability need is high (corruption propagates). Strategy: structured storage, validation on write.
Composition Patterns
Signal functions compose into systems. These patterns appear repeatedly:
Pattern 1: Reliable Output from Unreliable Source
Generator(n=10) → Evaluator → Filter(threshold) → Output
Use when: Single generation unreliable, verification cheap. The generator produces quantity, the evaluator scores, the filter selects. Individual generators can be noisy because the system tolerates it.
Pattern 2: Domain-Appropriate Processing
Router → Specialist[domain] → Validator → Output
Use when: Different inputs need different processing. The router directs traffic, specialists handle their domain, validators check output. Router reliability is critical—wrong routing corrupts everything downstream.
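A sketch of Pattern 2. The router is the high-reliability node, so its output is constrained to an explicit category set with a fallback; `classify`, the specialist functions, and `validate` are hypothetical components:

```python
from typing import Callable

Specialist = Callable[[str], str]

def handle(
    query: str,
    classify: Callable[[str], str],        # Router: returns a category label
    specialists: dict[str, Specialist],    # one Specialist per category, must include "general"
    validate: Callable[[str], bool],       # Validator: checks output before it leaves the system
) -> str:
    route = classify(query)
    if route not in specialists:           # constrain: anything outside the category set falls back
        route = "general"
    answer = specialists[route](query)
    if not validate(answer):
        answer = specialists["general"](query)  # fallback path rather than shipping a failed output
    return answer
```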
Pattern 3: Iterative Refinement
Generator → Critic → Refiner → Critic → ... → Output
Use when: Quality improves with iteration, critic is reliable. Each cycle adds signal, removes noise. Works when the critic can provide useful gradient.
Pattern 4: Parallel Decomposition
Planner → [Specialist × N in parallel] → Synthesizer → Output
Use when: Task decomposes into independent subtasks. Planner breaks down, specialists work in parallel, synthesizer recombines. Massive parallelism opportunity.
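A sketch of Pattern 4 using standard-library parallelism. `plan`, `run_specialist`, and `synthesize` stand in for whatever LLM calls implement those roles:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def parallel_decomposition(
    task: str,
    plan: Callable[[str], list[str]],        # Planner: task -> independent subtasks
    run_specialist: Callable[[str], str],    # Specialist: subtask -> partial result
    synthesize: Callable[[list[str]], str],  # Synthesizer: partial results -> final output
) -> str:
    subtasks = plan(task)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(run_specialist, subtasks))  # specialists run concurrently
    return synthesize(partials)
```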
Pattern 5: Generate-Test-Refine Loop
Generator(n) → Tester → [passing] → Select best
→ [failing] → Analyzer → Generator(n, with feedback)
Use when: Tests exist and feedback improves generation. Failing tests become the learning signal for the next round. AlphaCode's generate-and-filter stage is the first half of this loop; the refinement step layers feedback on top.
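A sketch of Pattern 5, assuming deterministic tests and a generator that accepts textual feedback; `generate_batch` and `run_tests` are hypothetical:

```python
from typing import Callable, Optional

def generate_test_refine(
    generate_batch: Callable[[str], list[str]],  # feedback -> n candidate solutions
    run_tests: Callable[[str], list[str]],       # candidate -> names of failing tests
    rounds: int = 3,
) -> Optional[str]:
    feedback = ""                                 # first round runs with no feedback
    for _ in range(rounds):
        results = {c: run_tests(c) for c in generate_batch(feedback)}
        passing = [c for c, fails in results.items() if not fails]
        if passing:
            return min(passing, key=len)          # crude tie-break among survivors
        # failing tests become the learning signal for the next round
        closest, fails = min(results.items(), key=lambda kv: len(kv[1]))
        feedback = f"Previous best attempt:\n{closest}\nIt failed tests: {fails}"
    return None
```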
Foundational Observations
Prompting Is Necessary But Not Sufficient
Prompting IS how you shift the distribution center. You can't escape it—signal engineering still requires good prompts at each node.
But prompting alone (optimizing single-call quality) hits a ceiling. System design (volume, filtering, decomposition) breaks through that ceiling.
Prompting = signal clarity at source. System design = amplification + filtering through pipeline.
Both matter. Prompting is component design. Signal engineering is system design.
Reliability Requirements Vary by Function
Not all agent calls need the same reliability. A router that misclassifies corrupts everything downstream. A generator that produces one bad option among ten is fine—you filter later.
High reliability required → Minimize LLM dependence, constrain outputs, multiple verification, deterministic fallbacks.
Low reliability acceptable → Maximize LLM freedom, high volume generation, filter downstream, accept noise and extract signal.
Design accordingly.
Noise Budget Is Finite
Every stage adds noise. Design question: where can you afford noise, and where must you eliminate it?
Error at Router: 3 downstream agents do the wrong task → total waste. Error at Generator: 19/20 outputs filtered out → system still works.
Allocate noise budget to stages where filtering can recover. Minimize noise at routing and orchestration where errors cascade.
Evals Measure Distributions
Evaluating a single output tells you almost nothing about the distribution. This is why:
- "It worked in testing" doesn't mean it works in production
- "It failed once" doesn't mean it always fails
- Prompt tweaks have inconsistent effects
Evals measure distributions, not outputs. 100 runs gives you P(correct). Then you engineer the system until P(correct) crosses your threshold.
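A sketch of an eval as a distribution measurement rather than a single check. `run_agent` and `is_correct` are hypothetical; the error bar is the standard binomial approximation:

```python
import math
from typing import Callable

def estimate_p_correct(
    run_agent: Callable[[str], str],
    is_correct: Callable[[str, str], bool],   # (test case, output) -> pass/fail
    case: str,
    n_runs: int = 100,
) -> tuple[float, float]:
    """Estimate P(correct) for one test case, with a rough 95% error bar."""
    successes = sum(is_correct(case, run_agent(case)) for _ in range(n_runs))
    p_hat = successes / n_runs
    stderr = math.sqrt(p_hat * (1 - p_hat) / n_runs)
    return p_hat, 1.96 * stderr               # e.g. 0.83 +/- 0.07 over 100 runs
```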
The Framework Applies Where Asymmetry Is Largest
Generate + filter wins big where verification is cheap:
- Code (tests exist)
- Structured extraction (source exists)
- Factual tasks (documents exist)
- Format compliance (schema exists)
This is also where most production agent use cases live. The asymmetry is your leverage.
Evidence
Published results support the generate + filter pattern:
| Method | Mechanism | Improvement |
|---|---|---|
| Self-consistency (Wang et al.) | Sample N reasoning chains, majority vote | +10-20% on reasoning benchmarks |
| AlphaCode (DeepMind) | Generate millions of programs, filter with tests | Competitive with humans (top 54%) |
| Best-of-N sampling | Generate N, score, select top | Consistent gains across tasks |
| Constitutional AI (Anthropic) | Generate → critique → revise loop | Reduced harmful outputs |
| Tree of Thoughts (Yao et al.) | Branch generation + evaluation + selection | +20-30% on planning tasks |
| Verifier models (Cobbe et al.) | Separate model scores solutions | +15% on math word problems |
All of these are volume + filtering strategies. None are "better single-shot prompting."
What remains unproven: the specific signal function taxonomy (descriptive, needs validation), generalization to all agent tasks, specific reliability claims per function type, optimal compositions for various domains. These require empirical measurement.
The Meta-Principle
The pattern connects to agency and forcing functions beyond AI systems.
Stop optimizing single instances. Start engineering distributions.
Applied to behavior: Don't force each gym visit through willpower. Reshape P(gym) through architecture.
Applied to agents: Don't perfect each prompt. Reshape P(correct) through system design.
The substrate is different. The engineering is identical. You're not being causal about individual outcomes. You're being causal about the generator that produces outcomes.
This is Level 4 agency: engineering probability distributions rather than predicting or responding to them. Weather forecasters describe distributions. Climate engineers modify them. Prompt engineers describe what agents might do. Intelligence designers reshape what agents probably do.
Intelligence design is system architecture for unreliable components. LLMs are probability distributions, not deterministic functions—the same prompt produces different outputs across runs. The universal pattern is generate + filter: produce possibilities, then select good ones. This pattern appears in evolution, science, markets, and AI systems.
The core primitive is "Generate N → Auto-evaluate → Select best." For this to work, evaluation must be cheaper than generation—which is why the pattern provides most leverage in code (tests exist), extraction (sources exist), and structured output (schemas exist).
Design agent systems by composing signal functions: Generators produce volume, Routers direct flow (high reliability required), Specialists transform, Validators filter, Synthesizers combine. Allocate noise budget strategically—errors at routing cascade, errors at generation filter out.
Prompting shifts distribution center; system design amplifies and filters. Both matter, but system design breaks through the ceiling that prompting alone cannot. The meta-principle: stop optimizing single instances, start engineering distributions. This is probability space bending applied to AI.
Related Concepts
- Agency — Being causal about distributions rather than instances
- Probability Space Bending — Engineering probability distributions, not predicting outcomes
- Ladder of Agency — Level 4: engineering distributions themselves
- Forcing Functions — Architectural interventions that reshape probability
- Prevention Architecture — Making failure modes structurally impossible
- Signal Boosting — Amplifying weak signals through volume and filtering
- Effective AI Usage — Practical patterns for working with AI systems
- AI as Accelerator — AI as complexity collapse mechanism
- Cybernetics — Feedback loops and control systems
- Statistical Mechanics — Thinking in distributions and microstates
Intelligence isn't magic. It's generate + filter running until good outcomes emerge. The substrate varies—biological, economic, computational—but the pattern is invariant. Stop optimizing single instances. Start engineering distributions. The leverage is in the architecture.