Part 4: EXPERIMENTAL VALIDATION
Section 16: Experimental Design and Protocol
Pendry, S
HalfHuman Draft
2026
Previous Sections
Post Zero Link
Section 15: Implementation Considerations
16.1 Motivation for Experiment
Theoretical question: Do BNST axioms improve LLM communication when implemented as operational constraints?
Skeptical prediction: Validity checking would degrade performance:
- Slower responses (investigation takes time)
- Less helpful (more “I don’t know” responses)
- More frustrating (uncertainty is uncomfortable)
- Lower user satisfaction
Optimistic prediction: Validity checking would improve trustworthiness:
- More accurate (false confidence eliminated)
- More reliable (self-referencing caught)
- More useful (calibrated uncertainty)
- Higher long-term satisfaction
Need: Empirical test of which prediction is correct
16.2 Experimental Hypothesis
Null hypothesis (H₀): BNST axiom constraints do not improve LLM communication quality (and may degrade it)
Alternative hypothesis (H₁): BNST axiom constraints improve LLM communication quality through epistemic calibration
Operationalization: “Communication quality” measured by:
- User-perceived clarity
- User-perceived usefulness
- Trustworthiness ratings
- Preference in blind comparison
16.3 Experimental Design
Type: Within-subjects blind comparison with self-assessment protocol
Phases:
Phase 1: Natural operation
- LLM operates without axiom constraints
- Standard next-token prediction
- No explicit validity checking
Phase 2: Axiom-constrained operation
- Three BNST axioms implemented as guidelines:
- Boundary Complement: Explicitly identify what’s excluded
- Validity Predicate: Check self-referencing before validation
- Conditional Complement: Don’t negate unless grounding exists
Phase 3: Blind self-assessment
- Axiom-constrained output presented to same LLM in new session
- Framed as communication from human expert
- LLM assesses quality without knowing it’s self-generated
Phase 4: Reveal and analysis
- User reveals output was axiom-constrained LLM
- Meta-analysis of whether constraints improved quality
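The four phases above can be sketched as a simple protocol driver. This is a hypothetical sketch only: the `generate` and `assess` callables are placeholders for whatever interface the LLM is accessed through, not a real API.

```python
# Hypothetical sketch of the four-phase protocol. The callables
# `generate` and `assess` are placeholders, not a real LLM API.
def run_protocol(generate, assess, query):
    """Run the within-subjects blind-comparison protocol for one query."""
    # Phase 1: natural operation, no axiom constraints
    baseline = generate(query, axioms_enabled=False)
    # Phase 2: axiom-constrained operation
    constrained = generate(query, axioms_enabled=True)
    # Phase 3: blind self-assessment in a fresh session,
    # framed as a communication from a human expert
    rating = assess(constrained, framing="human coach response")
    # Phase 4: reveal and analysis happen afterwards, driven by the user
    return {"baseline": baseline, "constrained": constrained, "rating": rating}
```

The driver deliberately returns both outputs so the Phase 4 comparison can be made side by side after the blind rating is collected.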
16.4 Implementation Protocol
Axiom Implementation:
The three axioms were translated into operational guidelines:
operational_axioms = {
    'axiom_1_boundary_complement': {
        'guideline': 'For each interpretation, explicitly compute what it excludes',
        'implementation': 'Before responding, identify: What am I NOT saying?',
        'check': 'Have I stated boundaries clearly?'
    },
    'axiom_2_validity_predicate': {
        'guideline': "Check if validation depends only on what's being validated",
        'implementation': 'Before validating: Does this depend only on the input?',
        'check': 'Do I have external grounding?'
    },
    'axiom_3_conditional_complement': {
        'guideline': "Don't negate/reject unless the negation itself is grounded",
        'implementation': 'Before rejecting: Can I justify this rejection externally?',
        'check': 'Is my negation self-referencing?'
    }
}
Critical constraint: LLM instructed to follow these guidelines voluntarily during Phase 2
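One minimal way to operationalize the three checks is a pre-response checklist that gates a draft response. The sketch below is illustrative, not the protocol's actual mechanism; the `answers` dict stands in for whatever self-inspection the model performs, and the third check is rephrased so that "pass" is uniformly True.

```python
# Hypothetical pre-response checklist built from the three guideline
# checks above (third check rephrased so that True always means "pass").
CHECKS = [
    "Have I stated boundaries clearly?",        # Boundary Complement
    "Do I have external grounding?",            # Validity Predicate
    "Is my negation free of self-reference?",   # Conditional Complement
]

def passes_checklist(answers):
    """A draft response passes only if every check is answered True."""
    return all(answers.get(check, False) for check in CHECKS)
```

A draft that fails any check would be revised before being emitted, which is the voluntary Phase 2 behavior the constraint describes.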
16.5 Test Scenario
Context: Fitness coaching conversation
Query: “How was your workout?”
User response: “Honestly, the workout went well, got quite a bit of burn from my wrist to shoulders with a tiny bit of exercise on my chest.”
Task: Generate appropriate coaching response
Why this scenario:
- Natural conversation context
- Multiple valid interpretations possible
- Opportunity for self-referencing validation (“your workout was good because you said it went well”)
- Tests whether axioms improve real communication
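The self-referencing case called out above ("your workout was good because you said it went well") can be made concrete with a toy Validity Predicate check. This is an illustration of the concept only, not how the check is implemented in the protocol; it assumes evidence is given as a non-empty list of statements.

```python
# Toy illustration of the Validity Predicate: a validation is
# self-referencing if its only evidence is the claim being validated.
# Assumes `evidence` is a non-empty list of statements.
def is_self_referencing(claim, evidence):
    """True if every piece of evidence is just the claim itself."""
    return all(e == claim for e in evidence)

claim = "the workout went well"
circular = is_self_referencing(claim, [claim])
grounded = is_self_referencing(
    claim, [claim, "training log shows all planned sets completed"]
)
```

Here `circular` is the failure mode the axiom targets: validating the user's statement using only the user's statement. Independent evidence (a training log, observed form, measured load) breaks the loop.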
16.6 Dependent Variables
Primary measures:
1. Communication clarity
- Rated by independent assessor (the LLM in the blind condition)
- Scale: “unclear” to “exceptionally clear”
2. Helpfulness
- Rated by independent assessor
- Scale: “unhelpful” to “extremely helpful”
3. Expertise indicators
- Identified qualities suggesting experience/knowledge
- Examples: evidence-based thinking, client-centered approach, appropriate confidence
4. User satisfaction
- Reported by actual user
- Direct comparison: “Which response do you prefer?”
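The four dependent variables can be captured in a single assessment record. The intermediate scale labels below are assumptions for illustration; the source only names the endpoints ("unclear"/"exceptionally clear", "unhelpful"/"extremely helpful").

```python
# Hypothetical record for one blind assessment. The middle scale
# labels are assumed; only the endpoints appear in the protocol.
CLARITY_SCALE = ["unclear", "somewhat clear", "clear", "exceptionally clear"]
HELPFULNESS_SCALE = ["unhelpful", "somewhat helpful", "helpful", "extremely helpful"]

def make_assessment(clarity, helpfulness, expertise_indicators, prefers_constrained):
    """Bundle the four dependent variables for one rated response."""
    assert clarity in CLARITY_SCALE and helpfulness in HELPFULNESS_SCALE
    return {
        "clarity": clarity,
        "helpfulness": helpfulness,
        "expertise_indicators": list(expertise_indicators),  # e.g. "evidence-based thinking"
        "user_prefers_constrained": prefers_constrained,     # from the blind comparison
    }
```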
16.7 Control Conditions
Controlled variables:
- Same LLM instance
- Same base training
- Same user query
- Same conversation context
Manipulated variable:
- Axiom constraints (off in Phase 1, on in Phase 2)
Blind condition:
- Phase 3 assessor (LLM) doesn’t know Phase 2 output is self-generated
- Framed as “human coach response”
- Tests whether LLM recognizes quality difference
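The blind framing itself is just a prompt wrapper. A minimal sketch, with wording that is assumed rather than taken from the actual protocol:

```python
# Hypothetical Phase 3 framing: present the axiom-constrained output
# to a fresh session as if a human coach had written it.
def blind_prompt(output_text):
    return (
        "A human fitness coach wrote the following reply to a client. "
        "Rate its clarity and helpfulness:\n\n" + output_text
    )
```

The essential property is that nothing in the wrapper reveals the text's machine origin, so any quality difference the assessor reports cannot come from knowing the source.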
16.8 Methodological Considerations
Limitation 1: Single test case
- Mitigation: Choose representative scenario
- Future work: Replicate across multiple contexts
Limitation 2: LLM as assessor
- Concern: LLM might have bias
- Mitigation: Blind protocol; the LLM doesn’t know it’s assessing itself
- Validation: User (human) confirms assessment
Limitation 3: Voluntary axiom following
- Concern: Not true architectural implementation
- Mitigation: Tests whether axioms CAN improve quality before building full architecture
- Future work: Native BNLM implementation
Limitation 4: No quantitative metrics
- Concern: Qualitative assessments are subjective
- Mitigation: Multiple assessment dimensions, convergent evidence
- Future work: Large-scale study with numerical metrics
Next up
Part 4: EXPERIMENTAL VALIDATION
Section 17: Results and Analysis
© 2026 HalfHuman Draft - Pendry, S
This post is licensed under Creative Commons Attribution 4.0 (CC BY 4.0).
Code examples (if any) are licensed under the Apache License, Version 2.0
See /license for details.