
Seven Primitives of Distributed Agent Systems: A Gender Analysis

We define seven atomic principles for distributed agent coordination, then measure gendered engagement with each principle using simulated focus groups under two methodologies to quantify systematic biases and derive design implications for multi-agent systems.

Multi-Agent Systems Gender Research Methodology
February 25, 2026
16 independent agents · 635K total tokens · 2 distinct methodologies

The Seven Primitives

Each primitive is defined as an atomic, composable unit of distributed coordination:

| Primitive | Definition | Failure Mode |
|---|---|---|
| Prioritization | Ranking by severity × tractability | Random action selection |
| Hypothesis Testing | Falsifiable predictions with Set A/B comparison | Change without measurement |
| Attention | 1:1 ratio of possible actions to intended actions | Thrashing across objectives |
| Deduplication | Eliminating redundant information at the earliest stage | Effort multiplication, signal dilution |
| Message Passing | Structured exchange: corroborate, contradict, fill-gap | Independent rediscovery |
| Consensus | Confidence from corroboration graph topology, not voting | Authority-based decisions |
| Progressive Discovery | Each action's residue reducing the cost of the next action | Constant cold starts |

These map onto a loop: discover (progressive discovery) → detect (deduplication, consensus) → decide (prioritization, attention) → act (hypothesis testing) → communicate (message passing). The loop feeds back: message passing outputs become inputs for the next discovery cycle.

Methodology

V1: Breadth-First (2 agents)

Two LLM agents (Claude Sonnet), each assigned a gender identity and six diverse professional personas, discussed all seven principles in sequence. Each agent produced per-principle scores on three dimensions: natural aptitude (1-10), effectiveness under pressure (1-10), and same-gender amplification (1-10, 5=neutral). Total: ~89,500 tokens, ~410s.

V2: Depth-First Map-Reduce (14 agents)

14 LLM agents (Claude Sonnet), each assigned one principle and one gender, discussed their single principle in depth for 10-15 exchanges. Each returned structured JSON with the same three-dimension scoring plus strengths, weaknesses, key quote, dissent, and same-gender effect analysis. Total: ~546,000 tokens, ~60s per agent (parallel).
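The per-agent structured return might look like the following. Field names are assumptions based on the description above; the study's exact schema is not shown. Score values are taken from the male-deduplication V2 result reported later (7/9):

```python
import json

# Illustrative shape of one V2 agent's structured JSON return.
# Field names are assumed from the prose; values from the V2
# male-deduplication result (aptitude 7, pressure 9).
v2_record = {
    "principle": "deduplication",
    "gender": "male",
    "scores": {
        "natural_aptitude": 7,           # 1-10
        "pressure_effectiveness": 9,     # 1-10
        "same_gender_amplification": 5,  # 1-10, 5 = neutral
    },
    "strengths": ["delta-only handoffs"],
    "weaknesses": ["ego tax in low-stakes social contexts"],
    "key_quote": "...",
    "dissent": None,
    "same_gender_effect": "...",
}

payload = json.dumps(v2_record)  # what the reduce step would ingest
```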

Reconciliation

V1 and V2 scores were averaged to produce reconciled estimates. The averaging removes V1's narrative anchoring bias and V2's score inflation bias.
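A sketch of the reconciliation step, using the male-deduplication divergence reported later in this write-up (V1 3/4, V2 7/9); note the published table appears to round half-scores up:

```python
def reconcile(v1_scores, v2_scores):
    """Average per-dimension V1 and V2 scores into reconciled estimates."""
    return {k: (v1_scores[k] + v2_scores[k]) / 2 for k in v1_scores}

# Male deduplication: V1 scored 3 (aptitude) / 4 (pressure), V2 scored 7 / 9.
rec = reconcile({"apt": 3, "press": 4}, {"apt": 7, "press": 9})
# rec == {"apt": 5.0, "press": 6.5}; the reconciled table reports 5 / 7
```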

Reconciled Scores

| Principle | F-Apt | F-Press | M-Apt | M-Press | Delta Apt | Delta Press |
|---|---|---|---|---|---|---|
| Prioritization | 7 | 7 | 7 | 9 | 0 | +2 M |
| Hypothesis Testing | 7 | 5 | 6 | 6 | +1 F | +1 M |
| Attention | 6 | 5 | 6 | 9 | 0 | +4 M |
| Deduplication | 7 | 5 | 5 | 7 | +2 F | +2 M |
| Message Passing | 8 | 6 | 7 | 9 | +1 F | +3 M |
| Consensus | 8 | 5 | 5 | 6 | +3 F | +1 M |
| Progressive Discovery | 8 | 6 | 7 | 6 | +1 F | 0 |

Score Distribution Analysis

Female scores cluster: aptitude std dev = 0.7, pressure std dev = 0.7. Male scores are bimodal: aptitude std dev = 0.8, pressure std dev = 1.5.

The male pressure distribution is the most important structural finding. Three principles sit at 9 (prioritization, attention, message passing) with deduplication close behind at 7, while three cluster at 6 (hypothesis testing, consensus, progressive discovery). This bimodality maps cleanly onto an execution/reflection split: men spike on execution primitives under pressure and maintain baseline on reflection primitives.

Female pressure scores show no such bimodality. Performance under pressure is uniformly 5-7, suggesting a more context-invariant engagement pattern.
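The dispersion claim can be checked directly from the reconciled table. A sketch using the population standard deviation (the write-up's convention is unstated; the sample convention gives roughly 1.5 for male pressure rather than 1.4):

```python
from statistics import pstdev

# Score columns transcribed from the reconciled table above.
female_apt   = [7, 7, 6, 7, 8, 8, 8]
female_press = [7, 5, 5, 5, 6, 5, 6]
male_apt     = [7, 6, 6, 5, 7, 5, 7]
male_press   = [9, 6, 9, 7, 9, 6, 6]

# Male pressure spreads far more than any other column -- the
# bimodality the analysis highlights.
spreads = {name: round(pstdev(col), 1) for name, col in [
    ("F apt", female_apt), ("F press", female_press),
    ("M apt", male_apt), ("M press", male_press),
]}
```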

Methodological Findings: V1 vs V2

| Bias | V1 (Breadth) | V2 (Depth) |
|---|---|---|
| Narrative anchoring | Strong. A compelling failure mode gets applied uniformly across principles. | Weak. Each agent develops its own narrative. |
| Context sensitivity | Low. Absolute claims ("men are bad at X"). | High. Conditional claims ("men are bad at X in context Y"). |
| Score inflation | Lower. Single agent maintains internal calibration. | Higher. Independent agents trend toward finding nuance that pushes scores up. |
| Cross-principle coherence | Strong. Single agent sees themes across all seven. | Weak. Cross-principle synthesis requires explicit reduce step. |

Largest V1-V2 Divergences

Male deduplication: V1 scored 3/4 (apt/press). V2 scored 7/9. Delta: +4/+5. V1 anchored on the "ego tax" narrative and applied it uniformly. V2 discovered that men have strong operational dedup instincts (ER delta-only handoffs, military "once up once down") that only fail in low-stakes social contexts. Depth revealed the context-dependency that breadth missed.

Female same-gender amplification: V1 mean 3.4. V2 mean 5.4. Delta: +2.0. V1's single agent overapplied the "relational override" failure mode. V2's per-principle agents found that relational dynamics also amplify corroboration, accelerate information pooling, and lower ego-attachment to predictions.

IMPLICATION FOR LLM-AS-EVALUATOR

LLM scoring is sensitive to context window allocation. A single agent scoring seven constructs produces more internally consistent but less nuanced results than seven independent agents scoring one construct each. Neither is strictly superior. The reconciled average removes the worst biases of both.

This has direct implications for benchmark scoring in multi-dimensional evaluation: per-dimension specialist scorers may produce different rankings than holistic scorers, and the difference is systematic, not random.

Design Implications for Multi-Agent Systems

1. Phase-specific agent configuration

Discovery/detection agents (progressive discovery, deduplication, consensus): optimize for information sharing, failure surfacing, and source independence verification. These are the primitives where the female behavioral pattern outperforms.

Execution/communication agents (prioritization, attention, message passing): optimize for protocol adherence, single-objective focus, and structured data exchange. These are the primitives where the male behavioral pattern outperforms under pressure.

In synthetic agents, this translates to prompt engineering choices: discovery agents should share intermediate findings and flag uncertainty. Execution agents should follow the protocol, ignore tangents, and transmit structured payloads.
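A sketch of what phase-specific configuration might look like as prompt composition; the directive wording is illustrative, not from the study:

```python
# Hypothetical phase-specific system-prompt directives.
PHASE_PROMPTS = {
    "discovery": (
        "Share intermediate findings as soon as you have them. "
        "Flag uncertainty explicitly. Verify that your sources are independent."
    ),
    "execution": (
        "Follow the protocol exactly. Ignore tangents outside your objective. "
        "Transmit only structured payloads."
    ),
}

def configure_agent(phase, base_prompt="You are a coordination agent."):
    """Compose the base prompt with the phase-specific directive."""
    return f"{base_prompt}\n{PHASE_PROMPTS[phase]}"
```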

2. Contradiction channel engineering

The most asymmetric finding: the contradiction operation in message passing is systematically attenuated by both genders, through different mechanisms. Female agents soften contradiction into ambiguous language. Male agents suppress gap-fill messages across domain boundaries.

For synthetic agents: make message types explicit in the schema. A message tagged type: contradiction cannot be misread regardless of how the content is phrased.
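A minimal sketch of such a schema in Python, using the three operations named in the primitives table (the class and field names are assumptions):

```python
from dataclasses import dataclass

# The three message-passing operations from the primitives table,
# made explicit as a required tag.
ALLOWED_TYPES = {"corroborate", "contradict", "fill_gap"}

@dataclass
class Message:
    """A typed inter-agent message; soft phrasing cannot hide the type."""
    type: str
    claim_id: str
    content: str

    def __post_init__(self):
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unknown message type: {self.type}")

# However gently the content is phrased, this is a contradiction.
msg = Message(type="contradict", claim_id="c42",
              content="I might be wrong, but the numbers look different to me.")
```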

3. Claim-claimant separation

Both genders conflate the claim graph (what is believed and why) with the social graph (who said it and what their status is). Women read contradiction as relational rupture. Men read correction as status challenge. Both errors collapse when the system architecture forces separation.

For synthetic agents: implement "claim cards" (written, attributed claims submitted before group discussion). Require agents to evaluate claims without access to claimant identity during the consensus phase.
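One way to force that separation is a store that exposes claims without attribution during consensus; this is a hypothetical sketch, not the study's implementation:

```python
import uuid

class ClaimBoard:
    """Claim cards: claims evaluated anonymously, attribution kept aside."""

    def __init__(self):
        self._claims = {}     # claim_id -> text (the claim graph)
        self._claimants = {}  # claim_id -> author (the social graph)

    def submit(self, author, text):
        claim_id = str(uuid.uuid4())
        self._claims[claim_id] = text
        self._claimants[claim_id] = author
        return claim_id

    def for_consensus(self):
        """What evaluators see: claims with no claimant identity attached."""
        return dict(self._claims)

    def attribution(self, claim_id):
        """Looked up only after the consensus phase closes."""
        return self._claimants[claim_id]
```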

4. Structure as the universal intervention

12 of 14 V2 agents independently recommended formalized protocols as the primary intervention. For collaborative/relational agents, structure provides permission to execute difficult operations. For competitive/hierarchical agents, structure provides constraint that prevents social dynamics from corrupting evaluation.

Agent coordination protocols should be explicit, schema-enforced, and non-negotiable. Letting agents develop their own coordination norms through emergence will reproduce the same social-layer failures the human focus groups exhibited.
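A minimal sketch of schema enforcement using only the standard library (a production system might use `jsonschema` or protobuf instead; the field set here is an assumption):

```python
# Fixed, non-negotiable message schema: field name -> required type.
SCHEMA = {
    "type": str,        # corroborate / contradict / fill_gap
    "claim_id": str,
    "confidence": float,
}

def validate(message: dict) -> dict:
    """Reject any message that deviates from the fixed schema."""
    if set(message) != set(SCHEMA):
        raise ValueError(f"fields must be exactly {sorted(SCHEMA)}")
    for field, expected in SCHEMA.items():
        if not isinstance(message[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    return message  # emergent, ad-hoc fields never enter the channel

ok = validate({"type": "fill_gap", "claim_id": "c1", "confidence": 0.8})
```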

Five Highest-Confidence Claims

These findings survived both methodologies with minimal score movement:

  1. Female agents outscore on consensus aptitude (V1: +4, V2: +2, reconciled: +3). Mechanism: faster corroboration network formation, lower information hoarding, active source independence verification.
  2. Male agents outscore on attention under pressure (V1: +6, V2: +3, reconciled: +4). Mechanism: environmental trigger activation produces exceptional single-objective focus when stakes are physical and immediate.
  3. Message passing is asymmetric: female aptitude > male aptitude, male pressure > female pressure. Mechanism: female agents have broader informal networks and stronger fill-gap instincts; male agents have higher protocol fidelity under load.
  4. Both genders fail at progressive discovery externalization. Mechanism differs (relational storage vs. status hoarding) but outcome is identical: knowledge compounds individually and dissipates organizationally.
  5. Structure is the universal equalizer. Explicit protocol narrows gender performance gaps on every primitive. Mechanism differs (permission vs. constraint) but effectiveness is consistent.

Limitations

1. Simulated, not empirical. These are LLM-generated focus groups, not human subjects research. The findings reflect the model's training data on gender dynamics in professional settings, not direct observation.

2. Cultural specificity. The professional personas are drawn from Western (primarily American) professional contexts. The gender dynamics described may not generalize across cultures.

3. Binary gender framing. The experiment used a male/female binary. Non-binary and gender-diverse dynamics are not captured.

4. Scorer sensitivity. V1/V2 divergence demonstrates that LLM-generated scores are method-sensitive. The reconciled scores are more trustworthy than either alone, but should be treated as directional estimates, not precise measurements.

5. Persona anchoring. The same six professions were used across all simulations. Different profession sets might produce different scores, particularly for principles that are highly domain-dependent.

Three-Part Series

Part 1: What Men and Women Are Actually Good At

Part 2: Why Same-Gender Teams Underperform

Part 3: Technical Methodology (you are here)

Reproducibility

All agent transcripts, structured JSON outputs, and reconciliation methodology are documented. Agent IDs are recorded for potential session resumption and audit. Research conducted by Voxos.ai using Claude as both simulation substrate and evaluation framework. 16 independent agents, ~635,000 total tokens, two distinct methodologies.