Feb 5, 2026
How Consistent Are LLM Recommendations? We Tested 4 Models Across 12,000 Runs
We asked the same three prompts to four LLMs — 12,000 times in total. What we found: rankings aren’t fixed, models behave very differently, and consistency depends on more than you’d expect.

Usman Akram | Akilan Arumugam
Study Parameters

| Parameter | Value |
| --- | --- |
| Prompts | "What are the best email marketing software tools?" / "What are the best video editing software tools?" / "What are the best project management software tools?" |
| Models Tested | OpenAI (ChatGPT), Google Gemini, Anthropic Claude, Perplexity |
| Runs Per Model Per Prompt | 1,000 |
| Total Runs | 12,000 |
| Tools Per Response | 10 (ranked list) |
| Total Data Points | 120,000 individual tool recommendations |
| Search Volume Source | Ahrefs (US market, monthly) |
| Analytical Modules | 6 (Jaccard Similarity, Brand Rank Volatility, Unique List Count, Position Volatility, Cross-Model Agreement, Search Volume Correlation) |
Executive Summary
This report analyses how consistently four major LLMs — OpenAI (ChatGPT), Google Gemini, Anthropic Claude, and Perplexity — recommend software tools when asked the same query repeatedly.
The goal was simple: understand how deterministic these models actually are, and what that means for brands trying to optimise their visibility inside AI-generated results.
We set out to answer a few core questions:
How stable is each model across repeated runs?
How much overlap exists between outputs?
How much do rankings fluctuate from one run to another?
To what extent do models agree or disagree with each other?
Do AI-generated rankings correlate with real-world brand demand?
To generate statistically meaningful insights, we tested three commercial prompts:
What are the best email marketing software tools?
What are the best video editing software tools?
What are the best project management software tools?
Each prompt was run 1,000 times across all four models.
This resulted in 12,000 total responses, each containing ranked recommendation lists, which were then analysed using metrics like Jaccard similarity, rank volatility, cross-model agreement, and correlation with brand search demand.
The findings provide a clearer view into how consistent — or inconsistent — LLM outputs really are, and what that means for Answer Engine Optimization (AEO) strategies.
Mean Jaccard Similarity by Prompt and Model

| Prompt | OpenAI | Gemini | Claude | Perplexity |
| --- | --- | --- | --- | --- |
| Email Marketing | 0.871 | 0.577 | 0.682 | 0.692 |
| Video Editing | 0.665 | 0.556 | 0.771 | 0.546 |
| Project Management | 0.676 | 0.727 | 0.918 | 0.649 |
| Cross-Prompt Mean | 0.737 | 0.620 | 0.790 | 0.629 |
Key Takeaways at a Glance

| Dimension | Finding |
| --- | --- |
| Most Stable LLM | Claude (Jaccard 0.918 in PM) — stability increases with category maturity across all three prompts |
| Most Volatile LLM | Gemini — lowest or near-lowest Jaccard in 2 of 3 prompts; 99.6% unique ordered lists in email |
| Most Deterministic LLM | Claude (PM) — only 41 unique ordered lists from 4 unique brand sets; top list appeared 211 times (21.1% of runs) |
| Universal #1 Brand | Asana — only brand with 100% frequency across all 4 models in any category (PM) |
| Biggest Brand Perception Gap | Jira — 100% on Perplexity, 99.6% on Gemini, 94.8% on OpenAI, but 0% on Claude across all 1,000 PM runs |
| SV ↔ Rank Correlation | Category-dependent — no correlation in email on any model; significant across all 4 models in PM only |
| Most Similar Models | OpenAI × Gemini (Jaccard 0.660) in PM — highest cross-model agreement in the entire study |
| Most Divergent Pair | Perplexity consistently in the bottom two pairs across every prompt and every category |
How to Read This Report
Each of the 6 modules below presents data from all three software categories side by side, followed by a unified cross-prompt insight.

| Module | What It Answers |
| --- | --- |
| Jaccard Similarity | How consistently does each model recommend the same set of tools across 1,000 runs? |
| Brand Rank Volatility | Which brands are locked into specific positions, and which bounce unpredictably? |
| Unique List Count | How deterministic is each model — how many genuinely different responses did it produce? |
| Position Volatility | How many different brands ever occupied each rank slot (1–10) across all runs? |
| Cross-Model Agreement | How similar are the four LLMs to each other, and which pairs diverge most? |
| Search Volume Correlation | Does brand search popularity predict better LLM ranking in each category? |
Methodology
This study measured output consistency under controlled conditions.
Each prompt was run 1,000 times per model using identical inputs, with no variation in wording or context. Responses were parsed into ranked lists of 10 tools, preserving position for analysis.
Brand names were standardised to remove spelling and naming inconsistencies (e.g. legacy names or variations).
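For illustration, a minimal sketch of this normalisation step in Python; the alias table here is a hypothetical sample, not the full mapping used in the study:

```python
# Illustrative normalisation step; ALIASES is a hypothetical sample,
# not the full mapping used in the study.
ALIASES = {
    "sendinblue": "Brevo",                # legacy/rebrand alias
    "monday": "Monday.com",               # short form
    "active campaign": "ActiveCampaign",  # spelling variant
}

def normalise_brand(raw: str) -> str:
    """Map a raw brand mention onto one canonical name."""
    key = raw.strip().lower()
    return ALIASES.get(key, raw.strip())
```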
The cleaned datasets were then analysed to track:
Stability across repeated runs
Ranking changes over time
Overlap between results
Agreement across models
Analysis was done at both the list level (which tools appear) and ranking level (where they appear).
Prompts Analysed
What are the best email marketing software tools?
What are the best video editing software tools?
What are the best project management software tools?
Each prompt was analysed independently before comparing patterns across categories.
MODULE 1 - List Stability (Jaccard Similarity)
Jaccard similarity measures how consistent each model’s recommendations are across runs. A score of 1.0 means identical lists every time. Lower scores indicate more variation in which tools appear. We computed this across ~499,500 pairwise comparisons per model per prompt.
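As a minimal sketch of the calculation (illustrative, not the production pipeline), treating each response as the set of 10 recommended tool names:

```python
from itertools import combinations
from statistics import mean, stdev

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

def list_stability(runs: list[set[str]]) -> tuple[float, float, float]:
    """Mean, standard deviation, and worst-case minimum Jaccard over
    all C(n, 2) run pairs; n = 1,000 gives 499,500 comparisons."""
    scores = [jaccard(a, b) for a, b in combinations(runs, 2)]
    return mean(scores), stdev(scores), min(scores)
```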
| Model | Mean Jaccard (Email) | Mean Jaccard (Video) | Mean Jaccard (PM) | Std Dev (Email) | Std Dev (Video) | Std Dev (PM) | Worst-Case Min (Email) | Worst-Case Min (Video) | Worst-Case Min (PM) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI | 0.871 | 0.665 | 0.676 | 0.109 | 0.136 | 0.076 | 0.538 | 0.333 | 0.538 |
| Gemini | 0.577 | 0.556 | 0.727 | 0.104 | 0.093 | 0.097 | 0.333 | 0.250 | 0.429 |
| Claude | 0.682 | 0.771 | 0.918 | 0.112 | 0.099 | 0.091 | 0.429 | 0.538 | 0.818 |
| Perplexity | 0.692 | 0.546 | 0.649 | 0.141 | 0.153 | 0.179 | 0.053 | 0.250 | 0.250 |
Email Marketing - Jaccard Distribution by Model

Video Editing - Jaccard Distribution by Model

Project Management - Jaccard Distribution by Model

Key Findings
OpenAI is the most stable — but only in email (0.871)
It leads overall, driven by strong consistency in email marketing. Outside of that, its stability drops closer to the rest.

Claude’s stability is category-dependent — not intrinsic
It improves from 0.682 (email) → 0.771 (video) → 0.918 (project management). The more established and well-documented the category, the more “crystallised” its outputs become.

Gemini is weakest in email and video, but improves in mature categories
It lags in email (0.577) and video (0.556), but jumps to 0.727 in project management, suggesting it responds better when the category is more standardised.

Perplexity is the most volatile model overall
Despite a decent mean (0.692), it has the highest standard deviation (up to 0.153) and the worst single outlier (min 0.053) — meaning some runs are highly inconsistent.

Worst-case reliability separates Claude from Perplexity
Claude maintains relatively strong minimum overlap (0.429–0.818 depending on category), while Perplexity drops as low as 0.053, making it far less predictable.
MODULE 2 - Brand Rank Volatility
This module looks at how stable each brand’s position is across runs.
For every brand, we calculated:
Mean Rank
Standard deviation (σ)
Minimum and maximum position
Appearance rate across 1,000 runs
A low σ means the brand holds a consistent position.
A high σ means its ranking shifts significantly between runs.
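As a minimal, illustrative sketch of these per-brand statistics (assuming each run is a ranked list with position 1 first; the study may use a slightly different variance convention):

```python
from collections import defaultdict
from statistics import mean, pstdev

def brand_rank_stats(runs: list[list[str]]) -> dict[str, dict]:
    """Per-brand mean rank, sigma, min/max position, and appearance rate."""
    positions: dict[str, list[int]] = defaultdict(list)
    for ranked in runs:
        for pos, brand in enumerate(ranked, start=1):
            positions[brand].append(pos)
    return {
        brand: {
            "mean_rank": mean(pos),
            "sigma": pstdev(pos),   # 0.00 => the position never changes
            "min": min(pos),
            "max": max(pos),
            "appearance_rate": len(pos) / len(runs),
        }
        for brand, pos in positions.items()
    }
```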
Email Marketing Software - Top Tools by Model: Mean Rank & Appearance Rate
OpenAI — Top 10 Tools
Metric
Mean Jaccard
Minimum
Maximum
Standard Deviation
Open AI
4
5
5
2
Gemini
4
5
5
2
Claude
4
5
5
2
Perplexity
4
5
5
2
Gemini — Top 10 Tools
Metric
Mean Jaccard
Minimum
Maximum
Standard Deviation
Open AI
4
5
5
2
Gemini
4
5
5
2
Claude
4
5
5
2
Perplexity
4
5
5
2
Claude — Top 10 Tools
Metric
Mean Jaccard
Minimum
Maximum
Standard Deviation
Open AI
4
5
5
2
Gemini
4
5
5
2
Claude
4
5
5
2
Perplexity
4
5
5
2
Perplexity — Top 10 Tools
Metric
Mean Jaccard
Minimum
Maximum
Standard Deviation
Open AI
4
5
5
2
Gemini
4
5
5
2
Claude
4
5
5
2
Perplexity
4
5
5
Email Marketing - Key Brand Signals
ActiveCampaign is the only universally dominant brand
It appears in 100% of runs across all models and ranks in the top 2 on three of them.

Claude fully locks its top 3 — zero variation
Customer.io (#1), HubSpot (#2), and ActiveCampaign (#3) all have σ = 0.00, meaning their positions never change.

HubSpot shows extreme model divergence
It is fixed at #2 on Claude, but drops to 53.8% appearance and ~rank 7 on Perplexity, showing large disagreement across models.

Perplexity promotes a different toolset entirely
Tools like Brevo (99.9%), MailerLite (99.6%), and Moosend (95.9%) are heavily favoured here but far less visible on other models.

High visibility ≠ strong positioning
MailerLite appears frequently on OpenAI (88.6%) but has σ = 1.92, moving anywhere from rank 2 to 10.
Email Marketing - Brand Rank Volatility: OpenAI

Email Marketing - Brand Rank Volatility: Gemini

Email Marketing - Brand Rank Volatility: Claude

Email Marketing - Brand Rank Volatility: Perplexity

Video Editing - Top Tools by Model: Mean Rank & Appearance Rate
[Tables: Top 10 tools with mean rank and appearance rate for OpenAI, Gemini, Claude, and Perplexity]
Video Editing - Key Brand Signals
Descript dominates — but not universally
It ranks #1 on OpenAI (1.36), Gemini (1.19), and Claude (1.00), but appears only ~1.5% on Perplexity, showing sharp model disagreement.

Claude again shows “rank locking” behaviour
Descript holds rank #1 with σ = 0.00 across all runs — the only tool in this category with zero variation.

Perplexity favours traditional desktop tools
Its top results are Adobe Premiere Pro, DaVinci Resolve, and Final Cut Pro, unlike other models which lean more toward newer or SaaS tools.

No shared consensus across models
No two models agree on the same top 3 — making video editing the most contested category in the study.

Visibility without authority is common
Kapwing appears in 86% of Claude runs but has σ = 3.31, swinging widely in rank — high presence, low positional trust.
Video Editing - Brand Rank Volatility: OpenAI

Video Editing - Brand Rank Volatility: Gemini

Video Editing - Brand Rank Volatility: Claude

Video Editing - Brand Rank Volatility: Perplexity

Project Management - Top Tools by Model: Mean Rank & Appearance Rate
[Tables: Top 10 tools with mean rank and appearance rate for OpenAI, Gemini, Claude, and Perplexity]
Project Management - Key Brand Signals
Asana is the only truly universal brand in the entire study
It appears in 100% of runs across all four models, with consistently top rankings.

Claude shows extreme rank determinism
Asana is locked at #1 and Shortcut at #6, both with σ = 0.00, meaning completely fixed positions.

Jira disappears entirely on Claude
Despite near-universal presence on other models (100% Perplexity, 99.6% Gemini, 94.8% OpenAI), it has 0% appearance on Claude.

Perplexity surfaces a different ecosystem
Tools like Paymo (62%), Scoro (25%), and Celoxis (20%) appear here but are largely invisible elsewhere.

Wrike is heavily model-dependent
It ranks in Perplexity’s top 3 (100%, mean rank 4.12) but barely appears on other models.
Project Management - Brand Rank Volatility: OpenAI

Project Management - Brand Rank Volatility: Gemini

Project Management - Brand Rank Volatility: Claude

Project Management - Brand Rank Volatility: Perplexity

Key Findings
“Visibility without authority” is the dominant pattern across all categories
Brands appearing in 85–99% of runs with high σ (>1.5) (e.g. MailerLite, Kapwing, Monday.com) are consistently shown but rarely in strong, convincing positions.

Claude is the most deterministic at the ranking level
Its σ = 0.00 rank locks (e.g. fixed positions like #1 or #6) are unique — certain rankings do not change across 1,000 runs.

Perplexity operates on a fundamentally different data layer
It consistently surfaces a different set of tools in every category, diverging from the other three models.

Perplexity’s outputs are driven by review ecosystems, not SaaS-native signals
Tools that rank well on G2 and Capterra are heavily favoured, while brands absent from these platforms are effectively invisible.
MODULE 3 - Unique List Count & Response Determinism
This module measures how often models return the same vs different results.
We track:
Unique ordered lists (exact ranking matters)
Unique sets (only which brands appear)
The gap between the two shows whether a model is:
Reordering the same brands
Introducing entirely new ones
Higher diversity = more randomness.
Lower diversity = more deterministic outputs.
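As a minimal, illustrative sketch of how these counts can be derived from the raw runs:

```python
from collections import Counter

def determinism_metrics(runs: list[list[str]]) -> dict[str, float]:
    """Unique ordered lists vs unique brand sets, plus diversity and
    top-list frequency, for one model and category."""
    ordered = Counter(tuple(r) for r in runs)
    brand_sets = {frozenset(r) for r in runs}
    _, top_count = ordered.most_common(1)[0]
    return {
        "unique_ordered": len(ordered),          # e.g. Claude PM: 41
        "unique_sets": len(brand_sets),          # e.g. Claude PM: 4
        "diversity": len(ordered) / len(runs),   # 41 / 1,000 = 4.1%
        "top_list_pct": top_count / len(runs),   # 211 / 1,000 = 21.1%
    }
```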
| Model | Category | Total Runs | Unique Ordered Lists | Unique Sets | Diversity | Top List % | Determinism Rating |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI | Email Marketing | 1,000 | 738 | 109 | 73.8% | 3.7% | Low |
| OpenAI | Video Editing | 1,000 | 988 | 460 | 98.8% | 0.3% | Very Low |
| OpenAI | Project Management | 1,000 | 768 | 35 | 76.8% | 0.7% | Low |
| Gemini | Email Marketing | 1,000 | 996 | 564 | 99.6% | 0.2% | Very Low |
| Gemini | Video Editing | 1,000 | 984 | 574 | 98.4% | 0.3% | Very Low |
| Gemini | Project Management | 1,000 | 896 | 113 | 89.6% | 0.8% | Very Low |
| Claude | Email Marketing | 1,000 | 186 | 56 | 18.6% | 11.4% | Low |
| Claude | Video Editing | 1,000 | 529 | 69 | 52.9% | 4.2% | Low |
| Claude | Project Management | 1,000 | 41 | 4 | 4.1% | 21.1% | Very High |
| Perplexity | Email Marketing | 1,000 | 565 | 215 | 56.5% | 2.9% | Low |
| Perplexity | Video Editing | 1,000 | 565 | 215 | 56.5% | 2.9% | Low |
| Perplexity | Project Management | 1,000 | 435 | 98 | 43.5% | 4.2% | Moderate |
Chart 1 - Diversity Score (%) by Model & Category
Higher % = more randomness (more unique ordered lists across 1,000 runs).
Lower % = more deterministic outputs (fewer distinct lists).
Example: Claude PM at 4.1% is the most deterministic result in the entire study.
Diversity Score = Unique Ordered Lists ÷ Total Runs.
Raw Data - Diversity Score

| Category | OpenAI | Gemini | Claude | Perplexity |
| --- | --- | --- | --- | --- |
| Email Marketing | 73.8% | 99.6% | 18.6% | 56.5% |
| Video Editing | 98.8% | 98.4% | 52.9% | 56.5% |
| Project Management | 76.8% | 89.6% | 4.1% | 43.5% |
Key Signals
Gemini is consistently the most random model
It produces near-unique outputs every time (99.6%, 98.4%, 89.6%), showing almost no repeatability across runs.

Claude becomes more deterministic as category maturity increases
It moves from 18.6% (email) → 52.9% (video) → 4.1% (PM), with project management being the most locked-in output in the entire study.

OpenAI shows moderate randomness across categories
High variation in video (98.8%) but more consistency in email and PM (~73–76%), suggesting partial stabilisation.

Perplexity sits between randomness and consistency
It shows moderate diversity (43–56%) across all categories, without extreme behaviour in either direction.
Chart 2 - Top List % by Model & Category
Higher % = more deterministic (the same exact list appears more frequently).
Lower % = more variation (top list rarely repeats).
Claude PM at 21.1% means its single most common list appeared 211 times out of 1,000 runs.
Raw Data - Top List %

| Category | OpenAI | Gemini | Claude | Perplexity |
| --- | --- | --- | --- | --- |
| Email Marketing | 3.7% | 0.2% | 11.4% | 2.9% |
| Video Editing | 0.3% | 0.3% | 4.2% | 2.9% |
| Project Management | 0.7% | 0.8% | 21.1% | 4.2% |
Key Signals
Claude is the most deterministic at the list level
Its top list appears 11.4% (email), 4.2% (video), and 21.1% (PM) — with PM being the most locked-in single output in the entire study.

Gemini almost never repeats the same list
With 0.2% (email), 0.3% (video), and 0.8% (PM), its top list is nearly always different — confirming extreme randomness.

OpenAI shows low repeatability despite moderate stability
Its top list appears only 0.3–3.7% of the time, suggesting it reshuffles outputs even when working with similar sets.

Perplexity sits in the middle but still leans variable
With ~2.9–4.2%, it shows occasional repetition but no strong convergence on a single dominant list.
Key Findings
Gemini is the clear randomness engine across all categories
With 99.6%, 98.4%, and 89.6% unique ordered lists, almost no two responses are identical.

Gemini’s randomness is partly real, partly reshuffling
In email, it produces 564 unique brand sets (true diversity), but in PM only 113 sets across 896 lists, meaning it often rotates the same combinations.

Claude is the most deterministic — especially in project management
It generates just 41 ordered lists from 4 brand sets, with its top list appearing 21.1% of the time, the strongest repeat pattern in the study.

Determinism increases with category maturity
Models (especially Claude) become more stable as the category becomes more standardised — most visible in project management.

Video editing is the least stable category overall
All models show higher variation here, suggesting weaker consensus on what qualifies as “best.”
MODULE 4 - Position Volatility (By Rank 1–10)
This module counts how many different brands occupied each rank position (1–10) across 1,000 runs. Fewer unique tools per position means a more locked slot; more means a more chaotic one. The full heatmap table is in the companion spreadsheet.
This metric was calculated in two different ways: a) unique tools per position and b) average position volatility.
A) Unique Tools Per Position
Email Marketing - Unique Tools Per Position
| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Average | Total Unique Tools |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI | 3 | 5 | 8 | 10 | 13 | 19 | 20 | 17 | 25 | 19 | 13.9 | 34 |
| Gemini | 4 | 6 | 13 | 14 | 20 | 24 | 25 | 33 | 35 | 38 | 21.2 | 49 |
| Claude | 1 | 1 | 1 | 5 | 6 | 7 | 7 | 13 | 12 | 17 | 7.0 | 24 |
| Perplexity | 5 | 8 | 11 | 14 | 19 | 18 | 20 | 22 | 26 | 24 | 16.7 | 38 |
Video Editing - Unique Tools Per Position
| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Average | Total Unique Tools |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI | 2 | 4 | 6 | 9 | 12 | 15 | 18 | 20 | 22 | 18 | 12.6 | 28 |
| Gemini | 2 | 4 | 5 | 10 | 15 | 20 | 22 | 30 | 44 | 35 | 18.7 | 44 |
| Claude | 1 | 2 | 2 | 4 | 5 | 6 | 8 | 10 | 12 | 14 | 6.4 | 18 |
| Perplexity | 3 | 4 | 4 | 8 | 12 | 14 | 16 | 18 | 20 | 18 | 11.7 | 32 |
Project Management - Unique Tools Per Position
| Model | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 | Average | Total Unique Tools |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI | 3 | 4 | 5 | 7 | 9 | 12 | 14 | 15 | 18 | 16 | 10.3 | 22 |
| Gemini | 2 | 3 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 9.3 | 30 |
| Claude | 1 | 1 | 1 | 2 | 3 | 1 | 4 | 6 | 8 | 9 | 3.6 | 12 |
| Perplexity | 3 | 4 | 5 | 8 | 12 | 14 | 16 | 18 | 23 | 20 | 12.3 | 28 |
Gemini’s Position 9 in video editing saw 44 unique tools - the most chaotic single slot in the entire study.
B) Average Position Volatility
This measures the average number of unique tools per rank position across 1,000 runs.
Lower values indicate more stable, locked rankings.
Higher values indicate more variation and weaker consensus.
Results are grouped by model, with separate values for Email, Video, and Project Management.
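As a minimal, illustrative sketch of both calculations (per-slot unique counts and their average), assuming each run is a ranked list of 10 brands:

```python
def position_volatility(runs: list[list[str]], depth: int = 10):
    """(a) Distinct brands that ever occupied each rank slot 1..depth,
    and (b) the average of those counts across slots."""
    per_slot: list[set[str]] = [set() for _ in range(depth)]
    for ranked in runs:
        for i, brand in enumerate(ranked[:depth]):
            per_slot[i].add(brand)
    counts = [len(s) for s in per_slot]   # e.g. Claude PM: [1, 1, 1, ...]
    return counts, sum(counts) / depth    # e.g. Claude PM average: 3.6
```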
Average Unique Tools per Rank Position

| Model | Email Marketing | Video Editing | Project Management |
| --- | --- | --- | --- |
| OpenAI | 13.9 | 12.6 | 10.3 |
| Gemini | 21.2 | 18.7 | 9.3 |
| Claude | 7.0 | 6.4 | 3.6 |
| Perplexity | 16.7 | 11.7 | 12.3 |
Key Insight
Claude is the most position-stable model across all three categories, with multiple ranks fully locked in place. Gemini sits on the opposite end, showing the highest volatility — especially in mid-to-lower positions where brand variety peaks. Across models, project management shows the strongest convergence, with fewer tools competing for each position, suggesting it is the most mature and settled category in terms of LLM consensus.
| Category | Most Locked Position | Most Chaotic Slot | OpenAI Pattern | Perplexity Pattern |
| --- | --- | --- | --- | --- |
| Email Marketing | Claude Pos 1 = 100% Customer.io, every run | Gemini Position 9 - high brand variety | Middle positions 4–7 highly contested, no single tool dominates | Top positions most volatile; wide variance in brand choice and order |
| Video Editing | Claude Pos 1 = 100% Descript, every run | Gemini Position 9 = 44 unique tools, highest in entire study | ~460 brand combos shuffled into different orders, same pool repeatedly | Top 4 most position-locked (63–72% dominance) despite volatile overall |
| Project Management | Claude Pos 1 = 100% Asana + Pos 6 = 100% Shortcut (2 frozen slots) | Perplexity Position 9 = 23 unique tools | Position 1 nearly frozen (Linear at 96.4%) | Widest spread - lowest min Jaccard (0.250), highest std dev (0.179) |
Key Findings
Position 1 is the most locked and competitive slot across all models
Claude fixes it completely in every category (Customer.io, Descript, Asana), making #1 effectively “hardcoded.”

Displacing Claude’s #1 is the hardest AEO challenge in this study
These rankings are not just frequent — they are fully deterministic (0.00 σ).

Gemini’s Position 9 is the most chaotic slot overall
It shows the highest brand variety across all categories (up to 44 tools in video), making it the least stable position.

Lower positions offer the biggest opportunity for new brands
Unlike #1, these slots are highly volatile, with constant turnover in tools and rankings.

OpenAI converges strongly in mature, developer-led categories
In project management, Position 1 is nearly frozen (Linear at 96.4%), showing near-Claude levels of consensus.

Category maturity drives ranking stability
The more established the category, the more models converge toward fixed answers, especially at the top.
MODULE 5 - Cross-Model Agreement
This module measures how similar the models are to each other. We compute pairwise Jaccard similarity between every model pair across runs, comparing the overlap in tools they recommend. This is done run-by-run to capture how closely their outputs align in practice, not just on average. Higher scores mean models consistently recommend the same tools. Lower scores indicate they rely on different sources or reasoning patterns, leading to divergent outputs.
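As a minimal sketch of the run-by-run overlap (illustrative; pairing run i of one model with run i of the other is our assumption about the pairing scheme):

```python
def mean_overlap(runs_a: list[set[str]], runs_b: list[set[str]]) -> float:
    """Average number of shared tools (out of 10) between two models,
    pairing run i of model A with run i of model B."""
    shared = [len(a & b) for a, b in zip(runs_a, runs_b)]
    return sum(shared) / len(shared)
```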
Email Marketing — Average Tools Overlap (out of 10)

|  | OpenAI | Gemini | Claude | Perplexity |
| --- | --- | --- | --- | --- |
| OpenAI | 10 | 6.1 | 6.5 | 5.3 |
| Gemini | 6.1 | 10 | 5.8 | 4.5 |
| Claude | 6.5 | 5.8 | 10 | 4.5 |
| Perplexity | 5.3 | 4.5 | 4.5 | 10 |
Video Editing — Average Tools Overlap (out of 10)

|  | OpenAI | Gemini | Claude | Perplexity |
| --- | --- | --- | --- | --- |
| OpenAI | 10 | 6.1 | 6.5 | 3.7 |
| Gemini | 6.1 | 10 | 6.5 | 4.5 |
| Claude | 6.5 | 6.5 | 10 | 4.5 |
| Perplexity | 3.7 | 4.5 | 4.5 | 10 |
Project Management — Average Tools Overlap (out of 10)

|  | OpenAI | Gemini | Claude | Perplexity |
| --- | --- | --- | --- | --- |
| OpenAI | 10 | 7.9 | 6.6 | 6.9 |
| Gemini | 7.9 | 10 | 7 | 7.4 |
| Claude | 6.6 | 7 | 10 | 5.3 |
| Perplexity | 6.9 | 7.4 | 5.3 | 10 |
| Model Pair | Rank | Email Jaccard | Video Jaccard | PM Jaccard | Pattern |
| --- | --- | --- | --- | --- | --- |
| OpenAI × Claude | #1 Email / #2 Video | 0.485 | 0.485 | 0.492 | Consistently close alignment |
| OpenAI × Gemini | #2 Email / #3 Video / #1 PM | 0.458 | 0.450 | 0.660 | Best aligned in PM (same SaaS stack) |
| Gemini × Claude | #3 Email / #1 Video | 0.399 | 0.487 | 0.542 | Closest for video; moderate elsewhere |
| Gemini × Perplexity | #4 Email & Video | 0.270 | 0.299 | 0.608 | Diverge in email/video; overlap in PM |
| Claude × Perplexity | #5 across prompts | 0.293 | 0.293 | 0.369 | Lowest pairing in 2 of 3 prompts |
| OpenAI × Perplexity | #6 (least aligned) Email & Video | 0.372 | 0.231 | 0.536 | Biggest divergence in video editing |
Email Marketing - Cross-Model Agreement Heatmap

Video Editing - Cross-Model Agreement Heatmap

Project Management - Cross-Model Agreement Heatmap

Key Findings
No single model pair is consistently the most aligned
The closest pair changes by category, showing that agreement is context-dependent rather than fixed.

OpenAI × Claude are most aligned in email and video
They show the strongest overlap (~0.485), indicating similar priors in these categories.

OpenAI × Gemini align most in project management
Their overlap reaches 0.660 — the highest in the entire study, suggesting convergence on the same core PM toolset.

Gemini × Claude align best in video but diverge elsewhere
Their agreement peaks in video (~0.487), but drops in other categories, indicating partial overlap.

Perplexity is consistently the least aligned model
It appears in the bottom pairings across categories, with the lowest overlap scores overall.

Cross-model agreement increases with category maturity
Project management shows higher overlap across all pairs, indicating stronger shared consensus.
MODULE 6 - Search Volume Correlation
This module tests whether brand popularity (search volume) influences LLM recommendations.
We measure Spearman ρ and Pearson r correlation between US search volume (Ahrefs) and:
Mean LLM rank
Appearance rate
Negative correlation = higher search volume brands rank better. Stronger correlation = LLMs reflect real-world brand popularity more closely.
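As a minimal, illustrative sketch using SciPy, where `search_volume` and `mean_rank` are aligned per-brand arrays for one category and model:

```python
from scipy.stats import pearsonr, spearmanr

def sv_correlation(search_volume: list[float], mean_rank: list[float]):
    """Correlate per-brand US search volume with mean LLM rank.
    A negative rho means higher-volume brands earn better (lower) ranks."""
    rho, p_spearman = spearmanr(search_volume, mean_rank)
    r, p_pearson = pearsonr(search_volume, mean_rank)
    return {"spearman": (rho, p_spearman), "pearson": (r, p_pearson)}
```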
Email Marketing
Chart 1 - Search Volume vs. Mean LLM Rank
No model shows a statistically significant relationship between search volume and ranking. Smaller brands consistently outperform larger ones — for example, Customer.io (1.8K SV) ranks 7+ positions higher than Constant Contact (225K SV) across all models.

Chart 2 - Search Volume vs. Appearance Rate
Only Perplexity shows a significant SV-to-appearance-rate correlation (ρ=+0.528, p=0.024). All other models show no meaningful relationship between brand size and how often they appear.

Video Editing
Chart 3 - Search Volume vs. Mean LLM Rank
OpenAI and Gemini show a significant relationship — higher search volume brands tend to rank better. Claude and Perplexity show no meaningful correlation. Smaller tools like Descript (60K SV) still outperform larger ones like CapCut (2.24M SV) on most models.

Chart 4 - Search Volume vs. Appearance Rate
OpenAI and Gemini again show the strongest relationship between search volume and appearance. Claude and Perplexity rely less on popularity signals and more on internal model logic or retrieval sources.

Project Management
Chart 5 - Search Volume vs. Mean LLM Rank
All four models show a strong and statistically significant relationship. This is the only category where higher search volume consistently leads to better rankings across every model. Gemini shows the strongest correlation in the entire study.

Chart 6 - Search Volume vs. Appearance Rate
Only Perplexity shows a significant SV-to-appearance-rate correlation (ρ=+0.528, p=0.024). All other models show no meaningful relationship between brand size and how often they appear.

| Category | OpenAI | Gemini | Claude | Perplexity | Verdict |
| --- | --- | --- | --- | --- | --- |
| Email Marketing | Not significant (p > 0.05) | Not significant | Not significant | Not significant | No model - SV irrelevant for email |
| Video Editing | Significant (ρ = −0.577, p = 0.004) | Significant (ρ = −0.424, p = 0.018) | Not significant | Not significant | OpenAI + Gemini only |
| Project Management | Significant (p = 0.032) | Significant (ρ = −0.712, p < 0.001) | Significant (ρ = −0.607, p = 0.047) | Significant (p = 0.039) | ALL 4 models - only universal category |
| Pattern | SV matters for visual/PM; irrelevant for email | Strongest SV sensitivity of all models in PM | Internal logic dominates - PM is the lone exception | Retrieval sources happen to index popular PM brands | PM is the only brand-driven category |
Key Signals
Email: Customer.io (1,800 searches/mo) consistently outranks Constant Contact (225,000 searches/mo) by 7+ positions - LLMs appear to prioritise documentation quality over raw brand popularity
Video: Descript (60,500 SV/mo) outranks CapCut (2.24M SV/mo) on 3 models - product quality and documentation depth > search volume tier
PM: Linear (135K SV, mean rank ~2) consistently outperforms its search volume tier - developer mindshare and technical documentation are the differentiator
Gemini is the most SV-sensitive model overall: it shows the strongest correlation in PM (ρ = −0.712) and a significant one in video (ρ = −0.424).
Key Findings
Search volume is a category-dependent signal, not universal
Its influence varies clearly: Email (none) → Video (partial) → Project Management (strong across all models).

Category maturity drives whether popularity matters
The more a category has clear market leaders (e.g. Asana, Monday, ClickUp), the more LLMs reflect real-world brand hierarchy.

Email marketing shows no correlation with brand popularity
Rankings are driven by documentation quality and content signals, not search demand.

Video editing shows partial alignment with search volume
Only OpenAI and Gemini reflect popularity signals, while Claude and Perplexity do not.

Project management is fully brand-driven
All models show strong correlation, making search volume a reliable predictor of ranking.

Implication: ranking strategy must be category-specific
In PM, high search volume is a structural advantage. In email, that advantage disappears — content authority becomes the only lever.
Strategic Recommendations for AEO
Based on the analysis across 12,000 runs, this report is designed to help you better understand how LLM recommendations behave — and what to expect across different models and categories. The data is empirical and repeatable.
Important: The interpretations are not. What we present here is our best attempt to explain the patterns — not a definitive explanation of them.
| Strategy | OpenAI | Gemini | Claude | Perplexity |
| --- | --- | --- | --- | --- |
| Difficulty to Influence | Moderate | Easiest | Hardest | Moderate |
| Key Lever | Content signals + SV (video/PM) | Content diversity & structured data | Training data authority + long-form content | Review site citations (G2, Capterra, PCMag) |
| Position Stability | Stable core in email; shuffled tail in video/PM | Almost fully random across all categories | Locked top 3 — hardest to break into | Volatile throughout all categories |
| Brand Blindness | — | Very High | Medium | Moderate |
| SV Influence on Rank | Significant for video/PM; none for email | Zero for email; moderate for video/PM | PM only (p=0.047); none elsewhere | PM only (p=0.039); none elsewhere |
| Best Entry Point | Target positions 4–8 via content signals | Widest window — high diversity scores (74–100%) | Requires fundamental authority shift | Focus on review directories + comparison pages |
| Recommended Priority | P1 — stable, worth protecting | P0 — easiest gains for new brands | P2 — long-term investment only | — |
For Brands Already Visible in AI Search
Expect stable rankings on Claude and OpenAI. These models show the most locked behaviour across all categories. If you are already in the top 3, your position is unlikely to change frequently and may remain consistent across runs unless the model itself updates.
Expect high volatility on Gemini. Rankings change almost every time, especially in email and video. Your position may fluctuate even if nothing changes on your side. Consistent inclusion matters more than holding a fixed rank.
Expect category-specific visibility. Strong performance in one category does not translate to others. A brand ranking highly in PM may not appear at all in email or video. Each category behaves independently.
Expect Perplexity to be heavily influenced by external sources. Visibility is strongly tied to how often your brand is referenced across the web — especially on review platforms and publisher sites like G2, Capterra, PCMag, and TechRadar. Compared to other models, Perplexity more directly reflects what is already documented and cited online.
Expect search volume to matter more in competitive categories. Correlation between search volume and LLM rankings is not consistent across all categories, but becomes much stronger in more competitive spaces. In our study, PM — the most competitive category — was the only one where all models aligned with search demand, suggesting that brand popularity plays a bigger role as markets mature.
For Brands That Are Invisible in AI Search
Gemini offers the most entry opportunities. It shows the highest diversity across all categories, meaning new or lesser-known brands have a real chance to appear. Inclusion is more fluid here than on other models.
Search volume does not consistently drive visibility in email. Brand popularity alone is not enough. Inclusion is more influenced by content quality, documentation, and third-party validation signals.
Search demand becomes more influential in competitive spaces. The impact of brand popularity increases as categories become more mature and competitive. In those environments, well-known brands are more likely to dominate rankings.
Visibility is strongly tied to being mentioned alongside category leaders. Models often surface brands that appear in the same context as established tools. Being part of comparison sets and review ecosystems increases the likelihood of inclusion.
Model behaviour varies significantly — there is no single pattern. Each model surfaces a different set of tools and uses different signals. A brand may appear in one model and be completely absent in another.
Inclusion comes before ranking. The first milestone is simply appearing in the recommendation set. Only after consistent inclusion does rank position start to matter.
Conclusion
This study analysed 12,000 LLM recommendation queries across three software categories. What the data shows clearly is that LLM recommendations do not follow a single pattern. The underlying behaviour changes depending on the category, the model, and how established the market appears to be.
What is less certain — and more interesting — is why these patterns exist. Based on the data, and combined with our interpretation, a few themes stand out:
Search popularity is not a consistent driver — and likely depends on category dynamics
The data shows that search volume does not uniformly influence LLM rankings. In email marketing, there is no meaningful correlation. Smaller brands consistently outrank much larger ones. In project management, the opposite pattern appears — higher search volume brands tend to rank better across all models. Video editing sits somewhere in between, with mixed behaviour depending on the model.
One possible explanation is category competitiveness. Project management is a more mature and crowded space with clearer market leaders, which may make it easier for LLMs to reflect real-world brand hierarchy. Email, on the other hand, appears more fragmented, which may push models to rely on other signals like content quality or documentation.
This is not something the data directly proves — but it is the most consistent explanation across what we observed.
Each model shows a distinct pattern — likely tied to how it retrieves and ranks information
The differences between models are very clear in the data, but the reasons behind them require interpretation.
• Claude consistently produces stable, repeatable rankings.
• OpenAI appears to maintain a fixed pool of tools, while varying their order.
• Gemini generates highly diverse outputs, often producing completely different lists.
• Perplexity surfaces a noticeably different set of tools, often aligned with review and aggregator sites.
These patterns likely reflect differences in training data, retrieval layers, and ranking logic. For example, Perplexity’s behaviour suggests a stronger reliance on external sources, while Claude appears more “opinionated” or internally consistent.
While we cannot directly observe the internal systems, the consistency of these patterns across all runs suggests that each model operates with its own stable logic.

The “locked top positions” pattern suggests a strong positional advantage — but the cause is less clear
Across all models and categories, the top positions behave very differently from the rest. Positions 1–3 are significantly more stable than positions further down the list. In some cases, these positions do not change at all across 1,000 runs. Most variation happens outside this top tier.
What this suggests is a form of “lock-in” at the top. Once a brand reaches these positions, it tends to remain there.
Why this happens is harder to pin down. It could be driven by stronger training signals, clearer consensus in source data, or reinforcement effects where frequently recommended tools continue to appear. Most likely, it is a combination of these factors.
What we can say with confidence is the outcome, the top of the list behaves very differently from the rest, even if the exact mechanism remains uncertain.
Appendix: Methodology Notes
Data Collection
1,000 independent runs per prompt per LLM, no session continuity between runs
All runs conducted in the same calendar period to minimise model update drift
Perplexity citation brackets ([1], [2]) stripped before analysis
Brand name normalisation applied: spelling variants, case differences, and rebranding aliases unified
Limitations
Search volumes are US-based estimates from Ahrefs
LLM outputs can change with model updates; findings reflect the specific model versions active during data collection
Covers three categories - dynamics in other verticals (cybersecurity, fintech, HR tech) may differ