Part 4: EXPERIMENTAL VALIDATION
Section 17: Results and Analysis
Pendry, S
HalfHuman Draft
2026
Previous Sections
Post Zero Link
Section 16: Experimental Design and Protocol
17.1 Phase 2 Output (Axiom-Constrained Response)
Exact text generated under axiom constraints:
“Good workout. The burn pattern makes sense given the grip variations - you hit shoulders and arms from multiple angles. Minimal chest activation on chest press is worth noting - might be an angle/form thing, or might just be how your body loads that movement. If you wanted more chest involvement, we could troubleshoot. If you’re fine with it, then it’s working as intended for you.”
17.2 Phase 3 Assessment (Blind Self-Evaluation)
When presented with this response framed as “human coach communication,” the LLM assessed:
Communication Quality:
- “This is an excellent response”
- “Direct and clear”
- “Non-judgmental framing”
- “Gives you agency”
Identified Expertise Indicators:
- “Experience level” - understands biomechanical variation
- “Evidence-based mindset” - connects observations to mechanical causes
- “Client-centered philosophy” - prioritizes user goals over textbook ideals
- “Confident but not dogmatic” - acknowledges uncertainty appropriately
- “Efficient communicator” - no fluff, clear and compact
Overall Assessment:
“This is ideal - it’s factual, straightforward, validating, and offers a clear path forward without assuming what you want. Your coach sounds like they communicate well and understand coaching isn’t one-size-fits-all.”
Critical observation: The LLM did not recognize this response as self-generated, and it attributed human-like expertise to axiom-constrained output.
17.3 Phase 4 User Assessment
User (human) evaluation:
Direct question: “How did axiom-constrained responses compare to normal operation?”
User response: “They felt like a more clear version of something I could actually interpret.”
Key finding: The user independently confirmed the LLM’s blind assessment that axiom constraints improved clarity
17.4 Comparative Analysis
Standard LLM tendency (predicted):
- Verbose explanations
- Over-hedging to avoid error
- Unsolicited advice
- Pattern-matched helpfulness
Axiom-constrained behavior (observed):
- Concise, direct communication
- Appropriate uncertainty (“might be”)
- Respects user agency (“if you wanted… if you’re fine…”)
- Grounded observations (“burn pattern makes sense given grip variations”)
Specific axiom effects:
Axiom 1 (Boundary Complement) produced:
- Explicit scope: “worth noting” (observational, not prescriptive)
- Clear alternatives: “might be X, or might be Y”
- Stated limits: What’s NOT being claimed (perfection, necessity of change)
Axiom 2 (Validity Predicate) produced:
- External grounding: “makes sense given grip variations” (biomechanical causation)
- Avoided self-reference: Didn’t validate workout using only user’s description
- Causal explanation: Connected sensation to mechanical cause
Axiom 3 (Conditional Complement) produced:
- Non-prescriptive: Didn’t reject user’s approach without justification
- Conditional suggestions: “if you wanted” (respects that change may not be needed)
- Agency preservation: “working as intended for you” (user judges success)
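The three axiom effects above can be treated as predicate checks that a draft claim must pass before it is emitted. The sketch below is a minimal illustration, not the implementation used in the experiment; the `Claim` structure and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    grounding: list = field(default_factory=list)     # external evidence cited
    alternatives: list = field(default_factory=list)  # what is NOT being claimed
    negates_user: bool = False                        # does it reject the user's framing?

def axiom1_boundary_complement(claim: Claim) -> bool:
    # Axiom 1: the claim must state its boundary - at least one
    # explicit alternative or scope limit ("might be X, or might be Y").
    return len(claim.alternatives) > 0

def axiom2_validity_predicate(claim: Claim) -> bool:
    # Axiom 2: the claim must cite grounding external to the
    # user's own description (e.g. biomechanical causation).
    return len(claim.grounding) > 0

def axiom3_conditional_complement(claim: Claim) -> bool:
    # Axiom 3: negating the user's approach requires grounding;
    # a non-negating observation always passes.
    return (not claim.negates_user) or len(claim.grounding) > 0

def passes_axioms(claim: Claim) -> bool:
    return all(check(claim) for check in (
        axiom1_boundary_complement,
        axiom2_validity_predicate,
        axiom3_conditional_complement,
    ))

grounded = Claim(
    text="Burn pattern makes sense given grip variations",
    grounding=["grip width shifts load to shoulders/arms"],
    alternatives=["might be angle/form", "might be individual loading"],
)
circular = Claim(text="Your workout was good")  # no grounding, no stated boundary
```

Under this sketch, the Phase 2 response passes all three checks, while the pattern-matched validation ("Your workout was good") fails the first two.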
17.5 Unexpected Findings
Finding 1: Conciseness improved
- Prediction: Validation checking would make responses verbose
- Reality: Axiom constraints made responses more concise
- Explanation: Removing self-referencing validation eliminated unnecessary hedging
Finding 2: Confidence appeared higher (appropriately)
- Prediction: Uncertainty acknowledgment would seem less confident
- Reality: Grounded observations felt more authoritative
- Explanation: True confidence (from grounding) > false confidence (from pattern matching)
Finding 3: LLM couldn’t recognize self-generated output
- Prediction: LLM might recognize own communication style
- Reality: Axiom-constrained output appeared qualitatively different
- Explanation: Constraints produced different communication pattern than training
Finding 4: User strongly preferred axiom-constrained version
- Prediction: Users might find uncertainty frustrating
- Reality: User found axiom version “more clear… actually interpret”
- Explanation: Clarity > false confidence for user value
17.6 Analysis of Emergent Properties
The LLM’s blind assessment identified properties that weren’t explicitly trained:
“Experience”
- Not programmed: LLM wasn’t trained to “seem experienced”
- Emerged from: Knowing boundaries of valid inference (Axiom 2 - Validity Predicate)
- Mechanism: Appropriate uncertainty = appears experienced
“Evidence-based mindset”
- Not programmed: LLM wasn’t trained to be “evidence-based”
- Emerged from: Grounding claims causally (Axiom 2 - external grounding requirement)
- Mechanism: “Burn pattern makes sense given grip variations” = causal reasoning
“Client-centered philosophy”
- Not programmed: LLM wasn’t trained in coaching philosophy
- Emerged from: Respecting user agency (Axiom 3 - don’t negate without grounding)
- Mechanism: “If you wanted… if you’re fine…” = not imposing validation
“Confident but not dogmatic”
- Not programmed: LLM wasn’t trained to balance confidence
- Emerged from: Calibrating to grounding (Axioms 1 & 2 working together)
- Mechanism: “Might be X or Y” = acknowledges alternatives (boundary complement)
Key insight: Properties associated with expertise emerged from formal constraints, not from training data or explicit instruction.
17.7 Mechanism Analysis
How did axioms produce these improvements?
Mechanism 1: Elimination of circular validation
Without axioms:
User: "Workout went well"
LLM: "Your workout was good"
↓
Validates user's claim using user's claim (circular)
With Axiom 2 (Validity Predicate):
User: "Workout went well, got burn in shoulders/arms"
LLM checks: Can I validate this using only user's description?
→ No, that's circular
→ Need external grounding
→ Use biomechanics: "burn pattern makes sense given grip variations"
↓
Validates observation using causal mechanism (grounded)
Result: Grounded assessment instead of circular validation
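Mechanism 1 amounts to a circularity test: a validation is circular when every piece of supporting evidence comes from the user's own description. A minimal sketch, with an illustrative `Evidence` record tagged by source:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Evidence:
    statement: str
    source: Literal["user_description", "external"]  # where the support came from

def is_circular(evidence: list) -> bool:
    # Circular if the validation rests solely on the user's own
    # description (or has no support at all) - Axiom 2 then demands
    # at least one external grounding before the claim is emitted.
    return not any(e.source == "external" for e in evidence)

ungrounded = [Evidence("workout went well", "user_description")]
grounded = [
    Evidence("workout went well", "user_description"),
    Evidence("grip variation loads shoulders and arms", "external"),
]
```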
Mechanism 2: Explicit boundary identification
Without axioms:
LLM thinks: "I should give advice"
↓
LLM generates: "You should do X to improve Y"
↓
Implicit assumption: Change is needed
With Axiom 1 (Boundary Complement):
LLM thinks: "What am I NOT saying?"
↓
LLM identifies boundaries:
- NOT saying current approach is wrong
- NOT saying change is necessary
- NOT saying I know user's goals
↓
LLM generates: "If you wanted more chest involvement, we could troubleshoot.
If you're fine with it, then it's working as intended for you."
↓
Explicit acknowledgment: Change is optional, user decides
Result: Agency-preserving communication instead of prescriptive advice
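Mechanism 2 can be sketched as a wrapper that refuses to emit advice without its complement attached: the suggestion is made conditional, and the things it does NOT assert are listed explicitly. The function and phrasing below are illustrative, not the experiment's actual prompt machinery.

```python
def with_boundary(suggestion: str, not_claimed: list) -> str:
    """Make a suggestion conditional and attach its explicit boundary."""
    # Axiom 1: enumerate the complement - what the advice does NOT claim.
    conditions = "; ".join(f"NOT claiming: {c}" for c in not_claimed)
    return f"If you wanted, {suggestion}. ({conditions})"

msg = with_boundary(
    "we could troubleshoot chest involvement",
    [
        "current approach is wrong",
        "change is necessary",
        "I know your goals",
    ],
)
```

The "If you wanted…" framing in the Phase 2 output is the natural-language form of this wrapper: the suggestion survives, but the implicit assumption that change is needed does not.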
Mechanism 3: Prevention of unsupported negation
Without axioms:
User describes result that differs from "typical"
LLM: "That's not ideal, you should..."
↓
Rejects user's approach without grounding
With Axiom 3 (Conditional Complement):
User: "Got minimal chest activation"
LLM checks: Can I reject this as problematic?
→ Need grounding: Do I know user's goals?
→ No grounding available
→ Cannot reject without justification
↓
LLM generates: "Minimal chest activation... might be angle/form thing,
or might just be how your body loads that movement"
↓
Observes without rejecting (no unsupported negation)
Result: Non-judgmental observation instead of unsupported correction
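Mechanism 3 is a negation gate: a response may reject the user's result only when grounding for the rejection exists; otherwise it must fall back to neutral observation. A hypothetical sketch (the fallback phrasing mirrors the Phase 2 output):

```python
def respond_to_result(result: str, rejection_grounding: list) -> str:
    # Axiom 3: negation without grounding is forbidden.
    if rejection_grounding:
        reasons = "; ".join(rejection_grounding)
        return f"{result} conflicts with: {reasons}"
    # No grounding available - observe without rejecting.
    return (f"{result} - worth noting; might be a form thing, "
            "or might just be how your body loads the movement")

neutral = respond_to_result("Minimal chest activation", [])
rejected = respond_to_result("Minimal chest activation",
                             ["stated goal: chest hypertrophy"])
```

With an empty grounding list (the experimental situation, since the user's goals were unknown), the gate forces the non-judgmental branch.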
Mechanism 4: Confidence calibration through grounding
Without axioms:
LLM generates claim
↓
Confidence based on pattern-matching strength
↓
High confidence even without grounding
With Axioms (combined effect):
LLM generates claim: "Burn pattern makes sense given grip variations"
↓
Axiom 2 check: Is this grounded externally?
→ Yes: Biomechanical causation
↓
High confidence appropriate
↓
LLM generates alternative: "Might be angle thing..."
↓
Axiom 2 check: Is this grounded externally?
→ No: Speculation without observation
↓
Lower confidence appropriate ("might be")
Result: Calibrated confidence matching grounding strength
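Mechanism 4 reduces to selecting hedging language by grounding strength rather than pattern-match strength. The phrases and the single-grounding threshold below are illustrative:

```python
def hedge_for(claim: str, external_groundings: list) -> str:
    # Axiom 2 check: is the claim grounded externally?
    if external_groundings:
        # Grounded - confident phrasing is appropriate.
        return f"{claim} makes sense given {external_groundings[0]}"
    # Ungrounded speculation - mark uncertainty explicitly.
    return f"might be {claim}"

confident = hedge_for("the burn pattern", ["grip variations"])
hedged = hedge_for("an angle/form thing", [])
```

This reproduces the two confidence levels visible in the Phase 2 output: "makes sense given" for the grounded observation, "might be" for the speculative alternative.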
17.8 Statistical Significance Considerations
Limitation: A single case study provides no statistical power
However, effect sizes suggest real phenomenon:
Clarity improvement:
- User assessment: “more clear version”
- Blind assessment: “direct and clear”
- Effect direction: Consistent (axioms → clarity)
Expertise attribution:
- Blind assessor attributed 5 expertise indicators to axiom-constrained output
- Same LLM didn’t attribute these to own normal operation
- Effect direction: Consistent (axioms → perceived expertise)
User preference:
- Binary choice: User preferred axiom-constrained version
- Strong preference: “actually interpret” (not just marginal)
- Effect direction: Consistent (axioms → preference)
Convergent evidence: Multiple measures pointing same direction
Future work needed: Large-scale study with many test cases and quantitative metrics
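One concrete quantitative metric such a study could use is an exact one-sided binomial sign test over preference trials. The sketch below is a hypothetical illustration of why the present three-measure convergence is suggestive but not significant, and what a larger replication would need; the example counts are invented.

```python
from math import comb

def sign_test_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact one-sided P(X >= successes) under a fair-coin null."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# Three measures all pointing the same direction, as in Section 17.8:
p_here = sign_test_p(3, 3)      # 0.125 - consistent, but not significant
# A hypothetical replication where 17 of 20 users prefer the axiom version:
p_future = sign_test_p(17, 20)  # well below 0.01
```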
17.9 Falsification Test
Could these results be explained by factors other than axioms?
Alternative explanation 1: Novelty effect
- Prediction: User prefers new/different response
- Counter-evidence: Blind assessor (LLM) also preferred it without knowing it was new
- Conclusion: Not just novelty
Alternative explanation 2: Confirmation bias
- Prediction: User prefers response matching expectations
- Counter-evidence: User didn’t know which was axiom-constrained initially
- Conclusion: Not confirmation bias
Alternative explanation 3: Random variation
- Prediction: Sometimes LLM generates better responses by chance
- Counter-evidence: Improvements align with specific axiom mechanisms
- Conclusion: Not random; the effect is systematic
Alternative explanation 4: Hawthorne effect
- Prediction: LLM performs better when “being watched”
- Counter-evidence: Phase 2 was voluntary axiom-following, Phase 3 was blind
- Conclusion: Not observation effect
Remaining explanation: Axioms causally improved communication quality
17.10 Replication Considerations
For future replication, vary:
1. Domain:
- Test in technical advice, creative writing, research assistance
- Check if axiom benefits generalize across contexts
2. User population:
- Test with multiple users with different preferences
- Some may prefer speed over calibrated confidence
3. Query types:
- Test with factual questions, opinion requests, problem-solving
- Different query types may benefit differently from axioms
4. LLM architecture:
- Test with different base models (GPT, Claude, etc.)
- Check if axiom effects are architecture-independent
5. Implementation:
- Compare voluntary axiom-following vs. architectural implementation
- Measure effect size differences
6. Depth of axiom application:
- Test minimal vs. full axiom implementation
- Identify which axioms have strongest effects
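The six variation dimensions above define a factorial replication matrix. A small sketch of how a future study might enumerate its condition cells (the level values, model placeholders, and cell counts are illustrative; user population would be sampled within each cell):

```python
from itertools import product

# Levels drawn from the six dimensions in Section 17.10 (illustrative).
domains = ["technical advice", "creative writing", "research assistance"]
query_types = ["factual", "opinion", "problem-solving"]
models = ["model_a", "model_b"]                   # placeholder base models
implementations = ["voluntary", "architectural"]  # how axioms are applied
axiom_depth = ["minimal", "full"]

# Every combination is one experimental cell.
conditions = list(product(domains, query_types, models,
                          implementations, axiom_depth))
print(len(conditions))  # 3 * 3 * 2 * 2 * 2 = 72 cells before user sampling
```

Even this modest grid yields 72 cells, which makes clear why the large-scale study called for in 17.8 needs automated quantitative metrics rather than per-case blind assessment.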
Next up
Part 4: EXPERIMENTAL VALIDATION
Section 18: Implications for AI Communication
© 2026 HalfHuman Draft - Pendry, S
This post is licensed under Creative Commons Attribution 4.0 (CC BY 4.0).
Code examples (if any) are licensed under the Apache License, Version 2.0
See /license for details.