Part 4: EXPERIMENTAL VALIDATION
Section 17: Results and Analysis
Pendry, S
HalfHuman Draft

2026
Previous Sections
Post Zero Link
Section 16: Experimental Design and Protocol


17.1 Phase 2 Output (Axiom-Constrained Response)

Exact text generated under axiom constraints:

“Good workout. The burn pattern makes sense given the grip variations - you hit shoulders and arms from multiple angles. Minimal chest activation on chest press is worth noting - might be an angle/form thing, or might just be how your body loads that movement. If you wanted more chest involvement, we could troubleshoot. If you’re fine with it, then it’s working as intended for you.”

17.2 Phase 3 Assessment (Blind Self-Evaluation)

When presented with this response framed as “human coach communication,” the LLM assessed:

Communication Quality:

  • “This is an excellent response”
  • “Direct and clear”
  • “Non-judgmental framing”
  • “Gives you agency”

Identified Expertise Indicators:

  • “Experience level” - understands biomechanical variation
  • “Evidence-based mindset” - connects observations to mechanical causes
  • “Client-centered philosophy” - prioritizes user goals over textbook ideals
  • “Confident but not dogmatic” - acknowledges uncertainty appropriately
  • “Efficient communicator” - no fluff, clear and compact

Overall Assessment:

“This is ideal - it’s factual, straightforward, validating, and offers a clear path forward without assuming what you want. Your coach sounds like they communicate well and understand coaching isn’t one-size-fits-all.”

Critical observation: The LLM did not recognize this as self-generated and attributed human-like expertise to the axiom-constrained output

17.3 Phase 4 User Assessment

User (human) evaluation:

Direct question: “How did axiom-constrained responses compare to normal operation?”

User response: “They felt like a more clear version of something I could actually interpret.”

Key finding: The user confirmed the LLM’s blind assessment: axiom constraints improved clarity

17.4 Comparative Analysis

Standard LLM tendency (predicted):

  • Verbose explanations
  • Over-hedging to avoid error
  • Unsolicited advice
  • Pattern-matched helpfulness

Axiom-constrained behavior (observed):

  • Concise, direct communication
  • Appropriate uncertainty (“might be”)
  • Respects user agency (“if you wanted… if you’re fine…”)
  • Grounded observations (“burn pattern makes sense given grip variations”)

Specific axiom effects:

Axiom 1 (Boundary Complement) produced:

  • Explicit scope: “worth noting” (observational, not prescriptive)
  • Clear alternatives: “might be X, or might be Y”
  • Stated limits: What’s NOT being claimed (perfection, necessity of change)

Axiom 2 (Validity Predicate) produced:

  • External grounding: “makes sense given grip variations” (biomechanical causation)
  • Avoided self-reference: Didn’t validate workout using only user’s description
  • Causal explanation: Connected sensation to mechanical cause

Axiom 3 (Conditional Complement) produced:

  • Non-prescriptive: Didn’t reject user’s approach without justification
  • Conditional suggestions: “if you wanted” (respects that change may not be needed)
  • Agency preservation: “working as intended for you” (user judges success)

17.5 Unexpected Findings

Finding 1: Conciseness improved

  • Prediction: Validation checking would make responses verbose
  • Reality: Axiom constraints made responses more concise
  • Explanation: Removing self-referencing validation eliminated unnecessary hedging

Finding 2: Confidence appeared higher (appropriately)

  • Prediction: Uncertainty acknowledgment would seem less confident
  • Reality: Grounded observations felt more authoritative
  • Explanation: True confidence (from grounding) > false confidence (from pattern matching)

Finding 3: LLM couldn’t recognize self-generated output

  • Prediction: LLM might recognize own communication style
  • Reality: Axiom-constrained output appeared qualitatively different
  • Explanation: Constraints produced different communication pattern than training

Finding 4: User strongly preferred axiom-constrained version

  • Prediction: Users might find uncertainty frustrating
  • Reality: User found axiom version “more clear… actually interpret”
  • Explanation: Clarity > false confidence for user value

17.6 Analysis of Emergent Properties

The LLM’s blind assessment identified properties that weren’t explicitly trained:

“Experience level”

  • Not programmed: LLM wasn’t trained to “seem experienced”
  • Emerged from: Knowing boundaries of valid inference (Axiom 2 - Validity Predicate)
  • Mechanism: Appropriate uncertainty = appears experienced

“Evidence-based mindset”

  • Not programmed: LLM wasn’t trained to be “evidence-based”
  • Emerged from: Grounding claims causally (Axiom 2 - external grounding requirement)
  • Mechanism: “Burn pattern makes sense given grip variations” = causal reasoning

“Client-centered philosophy”

  • Not programmed: LLM wasn’t trained in coaching philosophy
  • Emerged from: Respecting user agency (Axiom 3 - don’t negate without grounding)
  • Mechanism: “If you wanted… if you’re fine…” = not imposing validation

“Confident but not dogmatic”

  • Not programmed: LLM wasn’t trained to balance confidence
  • Emerged from: Calibrating to grounding (Axioms 1 & 2 working together)
  • Mechanism: “Might be X or Y” = acknowledges alternatives (boundary complement)

Key insight: Properties associated with expertise emerged from formal constraints, not from training data or explicit instruction.

17.7 Mechanism Analysis

How did axioms produce these improvements?

Mechanism 1: Elimination of circular validation

Without axioms:

User: "Workout went well"
LLM: "Your workout was good"
  ↓
Validates user's claim using user's claim (circular)

With Axiom 2 (Validity Predicate):

User: "Workout went well, got burn in shoulders/arms"
LLM checks: Can I validate this using only the user's description?
  → No, that's circular
  → Need external grounding
  → Use biomechanics: "burn pattern makes sense given grip variations"
  ↓
Validates observation using causal mechanism (grounded)

Result: Grounded assessment instead of circular validation
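The circularity check above can be sketched in a few lines. This is an illustrative model only, not the experiment's implementation: evidence is modeled as sets of statements, and the function name `is_externally_grounded` is an assumption.

```python
def is_externally_grounded(supporting_facts: set, user_description: set) -> bool:
    """Axiom 2 sketch: a claim counts as grounded only if at least one
    supporting fact lies outside the user's own description; otherwise
    the validation is circular."""
    return bool(supporting_facts - user_description)


user_report = {"workout went well", "burn in shoulders/arms"}

# Circular: the only support is the user's own claim.
circular = is_externally_grounded({"workout went well"}, user_report)

# Grounded: biomechanical causation is external to the report.
grounded = is_externally_grounded(
    {"burn in shoulders/arms", "grip variations load shoulders/arms"},
    user_report,
)

print(circular, grounded)  # False True
```

The set difference makes the grounding requirement literal: strip away everything the user already said, and something must remain.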


Mechanism 2: Explicit boundary identification

Without axioms:

LLM thinks: "I should give advice"
  ↓
LLM generates: "You should do X to improve Y"
  ↓
Implicit assumption: Change is needed

With Axiom 1 (Boundary Complement):

LLM thinks: "What am I NOT saying?"
  ↓
LLM identifies boundaries:
  - NOT saying the current approach is wrong
  - NOT saying change is necessary
  - NOT saying I know the user's goals
  ↓
LLM generates: "If you wanted more chest involvement, we could troubleshoot.
If you're fine with it, then it's working as intended for you."
  ↓
Explicit acknowledgment: Change is optional; the user decides

Result: Agency-preserving communication instead of prescriptive advice
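The boundary step can be sketched as data: enumerate what the response does NOT claim, then let those boundaries force a conditional phrasing. The class and field names below are hypothetical, not from the experiment.

```python
from dataclasses import dataclass


@dataclass
class BoundedObservation:
    """Axiom 1 sketch: an observation paired with its explicit non-claims."""
    observation: str
    not_claimed: tuple = (
        "the current approach is wrong",
        "change is necessary",
        "the user's goals are known",
    )

    def render(self) -> str:
        # Because "change is necessary" is explicitly NOT claimed,
        # the suggestion must stay conditional and agency-preserving.
        return (
            f"{self.observation} is worth noting. "
            "If you wanted to change it, we could troubleshoot. "
            "If you're fine with it, it's working as intended for you."
        )


print(BoundedObservation("Minimal chest activation on chest press").render())
```

Carrying the non-claims alongside the observation is what keeps the rendered text observational rather than prescriptive.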


Mechanism 3: Prevention of unsupported negation

Without axioms:

User describes a result that differs from "typical"
LLM: "That's not ideal, you should..."
  ↓
Rejects the user's approach without grounding

With Axiom 3 (Conditional Complement):

User: "Got minimal chest activation"
LLM checks: Can I reject this as problematic?
  → Need grounding: Do I know the user's goals?
  → No grounding available
  → Cannot reject without justification
  ↓
LLM generates: "Minimal chest activation... might be angle/form thing,
or might just be how your body loads that movement"
  ↓
Observes without rejecting (no unsupported negation)

Result: Non-judgmental observation instead of unsupported correction
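The gate above reduces to a single branch: negation is permitted only when grounding (here, knowledge of the user's goals) exists. The function `respond` and its signature are illustrative assumptions, not the experiment's code.

```python
def respond(observation, user_goals=None):
    """Axiom 3 sketch: reject an approach only when grounding exists."""
    if user_goals is None:
        # No grounding: observe and offer alternatives; do not reject.
        return (f"{observation} - might be an angle/form thing, or might "
                "just be how your body loads that movement.")
    # Grounding available: a correction can be justified against the goal.
    return f"{observation} conflicts with your goal ({user_goals})."


print(respond("Minimal chest activation"))
print(respond("Minimal chest activation", user_goals="chest hypertrophy"))
```

In the experiment no goal was known, so only the first branch could fire, producing the non-judgmental observation quoted above.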


Mechanism 4: Confidence calibration through grounding

Without axioms:

LLM generates claim
  ↓
Confidence based on pattern-matching strength
  ↓
High confidence even without grounding

With Axioms (combined effect):

LLM generates claim: "Burn pattern makes sense given grip variations"
  ↓
Axiom 2 check: Is this grounded externally?
  → Yes: biomechanical causation
  ↓
High confidence appropriate

LLM generates alternative: "Might be angle thing..."
  ↓
Axiom 2 check: Is this grounded externally?
  → No: speculation without observation
  ↓
Lower confidence appropriate ("might be")

Result: Calibrated confidence matching grounding strength
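Calibration can be sketched as a hedging pass over each statement, keyed to whether it passed the grounding check. The hedge wording and function name are illustrative assumptions.

```python
def calibrate(claim, grounded):
    """Mechanism 4 sketch: state grounded claims plainly, hedge the rest."""
    return claim if grounded else f"might be {claim}"


statements = [
    ("the burn pattern follows from the grip variations", True),   # biomechanics
    ("an angle/form issue", False),                                # speculation
]
for claim, grounded in statements:
    print(calibrate(claim, grounded))
```

Stated confidence then tracks grounding strength instead of pattern-matching strength, which is the calibration effect observed in the transcript.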

17.8 Statistical Significance Considerations

Limitation: A single case study provides no statistical power

However, effect sizes suggest real phenomenon:

Clarity improvement:

  • User assessment: “more clear version”
  • Blind assessment: “direct and clear”
  • Effect direction: Consistent (axioms → clarity)

Expertise attribution:

  • Blind assessor attributed 5 expertise indicators to axiom-constrained output
  • Same LLM didn’t attribute these to own normal operation
  • Effect direction: Consistent (axioms → perceived expertise)

User preference:

  • Binary choice: User preferred axiom-constrained version
  • Strong preference: “actually interpret” (not just marginal)
  • Effect direction: Consistent (axioms → preference)

Convergent evidence: Multiple measures pointing same direction

Future work needed: Large-scale study with many test cases and quantitative metrics

17.9 Falsification Test

Could these results be explained by factors other than axioms?

Alternative explanation 1: Novelty effect

  • Prediction: User prefers new/different response
  • Counter-evidence: Blind assessor (LLM) also preferred it without knowing it was new
  • Conclusion: Not just novelty

Alternative explanation 2: Confirmation bias

  • Prediction: User prefers response matching expectations
  • Counter-evidence: User didn’t know which was axiom-constrained initially
  • Conclusion: Not confirmation bias

Alternative explanation 3: Random variation

  • Prediction: Sometimes LLM generates better responses by chance
  • Counter-evidence: Improvements align with specific axiom mechanisms
  • Conclusion: Not random; a systematic effect

Alternative explanation 4: Hawthorne effect

  • Prediction: LLM performs better when “being watched”
  • Counter-evidence: Phase 2 was voluntary axiom-following; Phase 3 was blind
  • Conclusion: Not observation effect

Remaining explanation: Axioms causally improved communication quality

17.10 Replication Considerations

For future replication, vary:

1. Domain:

  • Test in technical advice, creative writing, research assistance
  • Check if axiom benefits generalize across contexts

2. User population:

  • Test with multiple users with different preferences
  • Some may prefer speed over calibrated confidence

3. Query types:

  • Test with factual questions, opinion requests, problem-solving
  • Different query types may benefit differently from axioms

4. LLM architecture:

  • Test with different base models (GPT, Claude, etc.)
  • Check if axiom effects are architecture-independent

5. Implementation:

  • Compare voluntary axiom-following vs. architectural implementation
  • Measure effect size differences

6. Depth of axiom application:

  • Test minimal vs. full axiom implementation
  • Identify which axioms have strongest effects



Next up
Part 4: EXPERIMENTAL VALIDATION
Section 18: Implications for AI Communication

© 2026 HalfHuman Draft - Pendry, S
This post is licensed under Creative Commons Attribution 4.0 (CC BY 4.0).
Code examples (if any) are licensed under the Apache License, Version 2.0

See /license for details.