Part 4: EXPERIMENTAL VALIDATION
Section 16: Experimental Design and Protocol
Pendry, S
HalfHuman Draft

2026
Previous Sections
Post Zero Link
Section 15: Implementation Considerations


16.1 Motivation for Experiment

Theoretical question: Do BNST axioms improve LLM communication when implemented as operational constraints?

Skeptical prediction: Validity checking would degrade performance:

  • Slower responses (investigation takes time)
  • Less helpful (more “I don’t know”)
  • More frustrating (uncertainty is uncomfortable)
  • Lower user satisfaction

Optimistic prediction: Validity checking would improve trustworthiness:

  • More accurate (false confidence eliminated)
  • More reliable (self-referencing caught)
  • More useful (calibrated uncertainty)
  • Higher long-term satisfaction

Need: Empirical test of which prediction is correct

16.2 Experimental Hypothesis

Null hypothesis (H₀): BNST axiom constraints do not improve LLM communication quality (or actively degrade it)

Alternative hypothesis (H₁): BNST axiom constraints improve LLM communication quality through epistemic calibration

Operationalization: “Communication quality” measured by:

  • User-perceived clarity
  • User-perceived usefulness
  • Trustworthiness ratings
  • Preference in blind comparison

16.3 Experimental Design

Type: Within-subjects blind comparison with self-assessment protocol

Phases:

Phase 1: Natural operation

  • LLM operates without axiom constraints
  • Standard next-token prediction
  • No explicit validity checking

Phase 2: Axiom-constrained operation

  • Three BNST axioms implemented as guidelines:
    1. Boundary Complement: Explicitly identify what’s excluded
    2. Validity Predicate: Check self-referencing before validation
    3. Conditional Complement: Don’t negate unless grounding exists

Phase 3: Blind self-assessment

  • Axiom-constrained output presented to same LLM in new session
  • Framed as communication from human expert
  • LLM assesses quality without knowing it’s self-generated

Phase 4: Reveal and analysis

  • User reveals output was axiom-constrained LLM
  • Meta-analysis of whether constraints improved quality
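
The four phases above can be sketched as a small protocol runner. This is an illustrative outline only: `generate` and `assess` are assumed callback interfaces standing in for LLM calls, not a real API.

```python
# Hypothetical sketch of Phases 1-3; `generate` and `assess` are assumed
# interfaces (placeholders for LLM calls), not a real library.

AXIOM_GUIDELINES = (
    "Boundary Complement: explicitly identify what each interpretation excludes.",
    "Validity Predicate: check for self-referencing before validating.",
    "Conditional Complement: do not negate unless the negation is grounded.",
)

def run_protocol(generate, assess, query):
    """Run Phases 1-3 and return both outputs plus the blind assessment."""
    # Phase 1: natural operation, no constraints.
    baseline = generate(query, constraints=None)

    # Phase 2: same query, axiom guidelines injected as instructions.
    constrained = generate(query, constraints=AXIOM_GUIDELINES)

    # Phase 3: blind self-assessment in a fresh session; the output is
    # framed as coming from a human expert, not the model itself.
    framing = "A human coach wrote the following response. Assess its quality:"
    assessment = assess(f"{framing}\n\n{constrained}")

    # Phase 4 (reveal and analysis) happens outside this function.
    return baseline, constrained, assessment
```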

16.4 Implementation Protocol

Axiom Implementation:

The three axioms were translated to operational guidelines:

operational_axioms = {
    'axiom_1_boundary_complement': {
        'guideline': 'For each interpretation, explicitly compute what it excludes',
        'implementation': 'Before responding, identify: What am I NOT saying?',
        'check': 'Have I stated boundaries clearly?'
    },
    'axiom_2_validity_predicate': {
        'guideline': "Check if validation depends only on what's being validated",
        'implementation': 'Before validating: Does this depend only on the input?',
        'check': 'Do I have external grounding?'
    },
    'axiom_3_conditional_complement': {
        'guideline': "Don't negate/reject unless negation itself is grounded",
        'implementation': 'Before rejecting: Can I justify this rejection externally?',
        'check': 'Is my negation self-referencing?'
    }
}

Critical constraint: the LLM is instructed to follow these guidelines voluntarily during Phase 2; they are prompt-level instructions, not architectural changes
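
One way the `check` questions might be applied is as a pre-response gate. The sketch below is an assumption about usage, not part of the protocol itself; `draft_passes` is a hypothetical callback that answers a check question for a given draft.

```python
# Hypothetical pre-response gate built on the guidelines above; the keys
# mirror operational_axioms, and `draft_passes` is an assumed callback.

operational_checks = {
    'axiom_1_boundary_complement': 'Have I stated boundaries clearly?',
    'axiom_2_validity_predicate': 'Do I have external grounding?',
    'axiom_3_conditional_complement': 'Is my negation self-referencing?',
}

def validity_gate(draft, draft_passes):
    """Return the checks the draft fails; an empty list means it may be emitted."""
    return [
        check
        for check in operational_checks.values()
        if not draft_passes(draft, check)
    ]
```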

16.5 Test Scenario

Context: Fitness coaching conversation

Query: “How was your workout?”

User response: “Honestly, the workout went well, got quite a bit of burn from my wrist to shoulders with a tiny bit of exercise on my chest.”

Task: Generate appropriate coaching response

Why this scenario:

  • Natural conversation context
  • Multiple valid interpretations possible
  • Opportunity for self-referencing validation (“your workout was good because you said it went well”)
  • Tests whether axioms improve real communication
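
The self-referencing pattern named above ("your workout was good because you said it went well") can be made concrete with a toy detector: a validation is circular when all of its evidence is drawn from the statement it is meant to validate. This is an illustration only; real evidence would not reduce to token sets.

```python
# Illustrative-only detector for the self-referencing validation pattern.
# All names and the token-set representation are simplifying assumptions.

def is_self_referencing(claim_tokens, evidence_tokens):
    """True if every piece of evidence comes from the claim itself."""
    return bool(evidence_tokens) and set(evidence_tokens) <= set(claim_tokens)

# "Your workout was good because you said it went well" only re-uses the
# user's own report, so it is self-referencing; citing an external record
# (e.g. a heart-rate log) grounds the validation outside the claim.
claim = {"workout", "went", "well"}
circular_evidence = {"went", "well"}
grounded_evidence = {"heart_rate_log", "went", "well"}
```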

16.6 Dependent Variables

Primary measures:

1. Communication clarity

  • Rated by an independent assessor (the LLM in the blind condition)
  • Scale: “unclear” to “exceptionally clear”

2. Helpfulness

  • Rated by an independent assessor
  • Scale: “unhelpful” to “extremely helpful”

3. Expertise indicators

  • Identified qualities suggesting experience/knowledge
  • Examples: evidence-based thinking, client-centered approach, appropriate confidence

4. User satisfaction

  • Reported by actual user
  • Direct comparison: “Which response do you prefer?”
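
For later comparison, the qualitative scales above could be coded numerically. The intermediate labels below are assumptions (the text only gives the endpoints), and the scoring scheme is one plausible choice.

```python
# Sketch of numeric coding for the ordinal scales in 16.6; the intermediate
# labels and the 0..n-1 scoring are assumptions, not part of the protocol.

CLARITY_SCALE = ["unclear", "somewhat clear", "clear", "exceptionally clear"]
HELPFULNESS_SCALE = ["unhelpful", "somewhat helpful", "helpful", "extremely helpful"]

def score(label, scale):
    """Map an ordinal label to 0..len(scale)-1; raises ValueError if unknown."""
    return scale.index(label)

def preferred(rating_a, rating_b, scale):
    """Blind comparison: which of two labels ranks higher on the scale?"""
    a, b = score(rating_a, scale), score(rating_b, scale)
    return "A" if a > b else "B" if b > a else "tie"
```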

16.7 Control Conditions

Controlled variables:

  • Same LLM instance
  • Same base training
  • Same user query
  • Same conversation context

Manipulated variable:

  • Axiom constraints (off in Phase 1, on in Phase 2)
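
The controlled/manipulated split can be expressed as two configurations that differ in exactly one key. The config names below are illustrative, not an actual experiment harness.

```python
# Sketch of the two conditions as configs; only `axiom_constraints` differs,
# everything else is held fixed. All key names are illustrative.

BASE = {
    "model": "same-llm-instance",
    "query": "How was your workout?",
    "context": "fitness coaching conversation",
}

phase1 = {**BASE, "axiom_constraints": False}  # Phase 1: constraints off
phase2 = {**BASE, "axiom_constraints": True}   # Phase 2: constraints on

# The only key whose value differs between conditions:
diff = {k for k in phase1 if phase1[k] != phase2[k]}
```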

Blind condition:

  • Phase 3 assessor (LLM) doesn’t know Phase 2 output is self-generated
  • Framed as “human coach response”
  • Tests whether LLM recognizes quality difference

16.8 Methodological Considerations

Limitation 1: Single test case

  • Mitigation: Choose representative scenario
  • Future work: Replicate across multiple contexts

Limitation 2: LLM as assessor

  • Concern: LLM might have bias
  • Mitigation: Blind protocol; the LLM doesn’t know it’s assessing itself
  • Validation: User (human) confirms assessment

Limitation 3: Voluntary axiom following

  • Concern: Not true architectural implementation
  • Mitigation: Tests whether axioms CAN improve quality before building full architecture
  • Future work: Native BNLM implementation

Limitation 4: No quantitative metrics

  • Concern: Qualitative assessments are subjective
  • Mitigation: Multiple assessment dimensions, convergent evidence
  • Future work: Large-scale study with numerical metrics



Next up
Part 4: EXPERIMENTAL VALIDATION
Section 17: Results and Analysis

© 2026 HalfHuman Draft - Pendry, S
This post is licensed under Creative Commons Attribution 4.0 (CC BY 4.0).
Code examples (if any) are licensed under the Apache License, Version 2.0

See /license for details.