What is popularity bias in LLM recommendation?

Popularity bias is the well-documented tendency of large language models to recommend popular items at rates higher than their ground-truth relevance would justify. It arises from memorisation effects: items that appear frequently in training data are more densely represented in the model's internal weights and are retrieved more readily. Research presented at RecSys 2024 and SIGIR 2025 has measured and explained the effect, and identified methods to partially mitigate it.

How many problems should I track?

In most consumer categories, 15 to 30 well-defined problems cover the majority of commercially meaningful buyer queries. B2B categories can go higher, up to 50 or 60 in complex sales environments. Fewer than 10 problems usually means the problem mapping is too abstract.

Why do I need multiple query variants per problem?

Because LLM outputs are highly sensitive to prompt phrasing. Work on prompt engineering for recommendation systems presented at ACM RecSys 2025 ran 450 experiments across 90 prompt variations and five datasets, and found that no single phrasing consistently outperforms others. A single-query measurement introduces avoidable noise. Four well-chosen phrasings per problem materially improve the reliability of the signal.

Can a small brand outrank a large brand in AI answers?

Yes, on specific problem-led queries. Peer-reviewed research on LLM rankers has shown that prompt specificity reduces popularity bias substantially. When a query describes a specific problem, the model's retrieval shifts from popularity-weighted memorisation toward semantic matching in embedding space, where a smaller brand with strong thematic association can win.

Why does AI sometimes describe my brand incorrectly?

Research on LLM hallucination published in 2025 identified a threshold effect: brands that are sparsely represented in training data are more likely to be either omitted or described inaccurately, because the model has less information to draw on. This is a separate problem from being left out of recommendations: even when a challenger brand appears in AI answers, the description may be incorrect unless signal work has been done.

How often do AI recommendations change?

Weekly variance is normal. Monthly variance is meaningful. A brand that wins a problem for three consecutive months has a stable position. A brand that wins one month and disappears the next is seeing noise, not signal. This is why measurement has to be continuous, not a single audit.
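
One way to make that rule of thumb concrete is the sketch below. It is an illustration, not a description of any particular tool: the function name, the labels, and the choice to look only at the most recent months are assumptions. The three-month threshold comes from the answer above.

```python
def classify_position(monthly_wins, stable_run=3):
    """Classify a brand's position on one problem.

    monthly_wins: list of booleans, oldest month first.
                  True means the brand was recommended that month.
    stable_run:   consecutive recent winning months needed to call
                  the position stable (3, per the rule of thumb above).
    """
    if not any(monthly_wins):
        return "absent"
    recent = monthly_wins[-stable_run:]
    if len(recent) == stable_run and all(recent):
        return "stable"
    # Wins exist but not yet as a sustained recent run: treat as noise
    # until more months of data confirm a direction.
    return "unconfirmed"


print(classify_position([False, True, True, True]))
print(classify_position([True, False, True, False]))
```

The first call has three consecutive recent wins and is labelled stable; the second alternates between winning and disappearing, which is the pattern described above as noise.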

Challenger brands do not win categories. They win problems.

The counter-intuitive reason most brands measure AI recommendation the wrong way, and what the research says to do instead.

By David Willey, Founder and CEO • Published 25 April 2026 • Last updated 13 May 2026 • 9-minute read

Introduction

There is a comforting fiction doing the rounds in marketing right now. The story goes: the top 20 brands in your category, as ranked by ChatGPT, are the new battleground. Win that list and you win recommendation. It is a tidy idea. It is also wrong, especially if you are a challenger brand. The research on how large language models actually surface brand recommendations tells a different story, and the measurement most marketing teams are running misses the mechanism entirely.

The question everyone is measuring

Most dashboards, audit tools, and agency reports being sold into marketing teams today ask AI systems one kind of question:

"What are the top 20 brands in [category]?"

They ask this across ChatGPT, Claude, Perplexity, and Gemini. They record which brands appear. They average the results. They call it a category ranking. Then they tell you whether you are in the list or not.

For incumbent brands, this ranking is roughly accurate and reasonably stable. Blackmores is in the supplements list. HelloFresh is in the meal delivery list. Mars Petcare is in the pet food list. The incumbents have years of press, review volume, and training-data density working in their favour. When you ask an AI for the top 20 in a category, the AI reaches for whatever it has most strongly associated with that category. That is the incumbents.

For challenger brands, this ranking tells you almost nothing useful. Worse, it tells you something misleading.

The question real buyers actually ask

Pay attention to how you and the people around you actually use AI assistants. Nobody types "top 20 moisturisers for oily skin in Australia" into ChatGPT. That is a category query. It is the kind of thing a marketer asks. It is not the kind of thing a buyer asks.

A buyer asks something much more specific. They ask:

"What is a good moisturiser for someone with rosacea and oily skin?"
"My dog has joint stiffness after walks. What supplement should I try?"
"I want meal delivery that tracks macros for weight training."
"I need a gin that works in a negroni but is also good on its own."

These are problem queries. Each describes a situation, a constraint, an outcome, or a combination. Systematic reviews of conversational AI and consumer decision-making published in 2025 confirm what any honest observation of usage shows: consumers engaging AI assistants for product research overwhelmingly describe problems in natural language, not categories in keyword form. The way buyers are using AI tools is fundamentally problem-led.

Why category queries favour the incumbents

There is a technical term for what happens when an AI is asked a broad category question. It is called popularity bias, and it is well documented in the peer-reviewed literature on LLM-based recommender systems. Research from Amazon Science published at RecSys 2024 measured it empirically across multiple models including GPT-3.5, GPT-4, and Claude variants: popular items are disproportionately recommended at rates significantly above their ground-truth relevance, overshadowing less popular but potentially better-matched options.

The mechanism is now also well understood. Research presented at SIGIR 2025 established that popularity bias in LLM recommenders arises primarily from memorisation effects. When popular items appear frequently in training data, models effectively memorise them. Follow-up work in 2025 on hallucination and memorisation found that training-data frequency is closely correlated with recall accuracy: content repeated often enough in the training corpus becomes near-verbatim memorised, while sparsely represented content tends to be either omitted or fabricated.

Translate that for a challenger brand. A brand mentioned thousands of times in press releases, product reviews, and comparison articles is heavily represented in the training data. A brand mentioned a few hundred times is not. When an AI is asked a broad category question, it reaches for whatever it has memorised most densely in association with that category. That is, statistically, the incumbents.

An honest caveat matters here. The Amazon Science study also found that LLM-based recommenders exhibited less popularity bias than traditional collaborative-filtering recommender systems, and that explicit debiasing instructions in prompts reduced the bias further. In other words, popularity bias in LLMs is real but can be moderated. It depends on how the question is asked. This is precisely the point.

Why problem queries reward the specialists

Change the question. Ask an AI instead: "What is the best meal delivery service for someone tracking macros for a bulk?" The answer changes entirely. The incumbents do not automatically win, because the incumbents are not specifically associated with macro tracking. The brand that is strongly associated with that specific problem has a real chance of being returned.

Two research threads explain why. First, work on information retrieval for ecommerce published at ACM RecSys 2024 (using Best Buy's live search system) demonstrated that traditional popularity-based matching fails on long-tail queries because those queries produce sparse interaction signals. To serve long-tail queries well, modern systems use embedding-based retrieval: queries and products are projected into a shared semantic vector space and matched by proximity, not popularity. Because real-world queries follow a Pareto distribution, a minority of generic queries accounts for most of the volume, but the majority of distinct queries are long-tail, problem-specific ones. These are exactly the queries on which challenger brands can win.
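
The difference between matching by popularity and matching by proximity can be shown with a toy sketch. Everything in it is invented for illustration: the two brands, their popularity scores, and the three-dimensional "embeddings" (real systems learn vectors with hundreds of dimensions). It is not the method used in the research cited above, only a minimal instance of ranking by cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical catalogue: (brand, popularity score, toy embedding).
# Toy embedding axes: [convenience, family value, macro tracking].
CATALOGUE = [
    ("Incumbent", 0.95, [0.8, 0.6, 0.1]),
    ("Challenger", 0.10, [0.3, 0.1, 0.9]),
]


def rank_by_popularity(catalogue):
    """Order brands by popularity alone, ignoring the query."""
    ordered = sorted(catalogue, key=lambda row: row[1], reverse=True)
    return [name for name, _pop, _vec in ordered]


def rank_by_similarity(query_vec, catalogue):
    """Order brands by how close their embedding is to the query."""
    ordered = sorted(catalogue, key=lambda row: cosine(query_vec, row[2]),
                     reverse=True)
    return [name for name, _pop, _vec in ordered]


macro_query = [0.2, 0.0, 1.0]   # stands in for a macro-tracking query
family_query = [0.5, 1.0, 0.0]  # stands in for a family meal-kit query

print(rank_by_popularity(CATALOGUE))
print(rank_by_similarity(macro_query, CATALOGUE))
print(rank_by_similarity(family_query, CATALOGUE))
```

With these made-up numbers, popularity ranking always puts the incumbent first, while similarity ranking puts the challenger first on the macro-tracking query and the incumbent first on the family query. The specialist wins only where its positioning matches the problem.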

Second, peer-reviewed work on LLM-based ranking published in 2024 confirmed that LLMs suffer from position bias and popularity bias when ranking, but that both biases can be substantially reduced through careful prompting and bootstrapping strategies. When you ask a narrow, specific problem question, you are implicitly doing exactly that. You are supplying the specificity that makes popularity a weaker anchor and semantic relevance a stronger one.

In practical terms: niche positioning is not just a marketing preference. It is a structural advantage in AI recommendation, because specificity produces the exact kind of semantic signal AI systems retrieve on when buyers ask problem questions.

The hidden risk: brands AI does not know well get fabricated, not just omitted

There is a second reason this matters, and it is less discussed. Research published in late 2025 on hallucination in large language models identified a threshold effect: content that appears frequently in training data is faithfully recalled, while content that appears rarely tends to be either omitted or generated incorrectly. The same research noted that when multiple high-frequency items share similar content, the model exhibits "memory interference" and can confuse them with each other.

For a challenger brand, this has two implications. If your training-data presence is weak, you may be left out of problem-led recommendations entirely. That is the omission risk. But if you are mentioned, the AI may confidently describe your product inaccurately, attributing features, ingredients, or price points that are not yours. That is the fabrication risk, and it is more damaging than simple omission because a buyer acting on fabricated information forms a false impression of your brand.

Signal work (authoritative third-party content, consistent product descriptions across the web, structured data, clear brand positioning) does two jobs at once: it raises the odds of being recommended on problems where you should win, and it reduces the risk of being misrepresented when you are mentioned. Problem-led measurement surfaces both failures so they can be addressed.

What this looks like in practice

Consider two Australian meal delivery brands: a large one with broad positioning ("fresh, convenient meals delivered") and a smaller one with precise positioning ("macro-tracked meals for muscle building"). Here is what happens when both are run through the two types of AI queries.

Query type	Query example	Who tends to win	Mechanism
Category query	"Top 20 meal delivery services in Australia"	The incumbent	Popularity bias: memorised training-data density drives retrieval
Problem query	"Best meal delivery for macro tracking on a bulk"	The challenger	Embedding-space proximity: semantic match on specific constraints
Problem query	"Meal delivery for someone cutting for a show"	The challenger	Same mechanism, different problem
Problem query	"Cheap meal kit for a family of four"	The incumbent	Genuine category fit plus scale economics recognised in training data

Notice what the table makes clear. The challenger does not win every problem. The challenger wins the problems that match its semantic positioning. That is the point. A problem-led measurement strategy tells the challenger which problems it owns, which it is competing on, and which it should not try to win.

The consequence for measurement

If your current measurement tells you where you rank in a category, it is not actually telling you anything actionable. It is measuring a popularity-weighted ranking your brand is not structurally positioned to win.

What you need instead is a map of the problems your buyers actually ask AI systems about, and your recommendation position on each of those problems, tracked over time.

That map has a specific shape. For any given category, there are usually 15 to 30 distinct buyer problems that matter. Each problem has its own competitive set. Each problem responds to different signals: review density, ingredient specificity, regulatory endorsement, use-case content, comparison articles, third-party analyst coverage, and so on. A challenger brand does not need to win all 30 problems. Owning 3 to 5 problems completely is a stronger commercial position than ranking #14 on a generic category list.

What good problem-led measurement looks like

A credible problem-led measurement system has four properties, each of which maps to a known challenge in the AI recommendation literature.

Problem space mapping. The problems are defined before anything is measured. Each problem is a real buyer situation, written in buyer language. Not category abstractions. Not marketing hypotheticals. This is what gives the measurement its semantic precision.
Multiple query variants per problem. Each problem is tested with several natural phrasings. "Best moisturiser for rosacea" and "what should I use for rosacea if I also have oily skin" return different results. This matters because LLM outputs are highly sensitive to prompt phrasing. Research presented at ACM RecSys 2025, covering 450 experiments across 90 prompt variations and five datasets, established that prompt formulation significantly affects recommendation accuracy with no single phrasing consistently outperforming others. A single-query measurement is unreliable by design.
Cross-platform consistency. The same problem is run across ChatGPT, Claude, Perplexity, and Gemini, in parallel, on a fixed cadence. Platform-specific differences are well documented and need to be separated from underlying brand performance.
Trend over time, not snapshots. AI outputs shift with model updates, training refreshes, and live-web retrieval. A single month's result is noise. A twelve-month trend is signal. Researchers working on LLM consistency have repeatedly emphasised that weekly variance is expected and only longitudinal measurement surfaces genuine directional change.

If a tool or dashboard cannot do these four things, it is measuring category recommendation or producing a one-off audit. Neither gives a challenger brand what it needs.
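
As a rough sketch of how the second and third properties combine, the snippet below takes a set of recorded answers and computes how often a brand is named, per problem and per platform, across all query variants. The observations, brand names, and function name are hypothetical, and how the answers are collected from each platform is deliberately left out.

```python
from collections import defaultdict

# Hypothetical observations:
# (problem, platform, query variant, brands named in the answer).
OBSERVATIONS = [
    ("rosacea + oily skin", "ChatGPT",
     "Best moisturiser for rosacea", ["BrandA", "BrandB"]),
    ("rosacea + oily skin", "ChatGPT",
     "What should I use for rosacea if I also have oily skin", ["BrandB"]),
    ("rosacea + oily skin", "Claude",
     "Best moisturiser for rosacea", ["BrandA"]),
    ("rosacea + oily skin", "Claude",
     "What should I use for rosacea if I also have oily skin",
     ["BrandB", "BrandC"]),
]


def mention_rates(observations, brand):
    """Share of answers naming `brand`, keyed by (problem, platform).

    Averaging over query variants smooths out phrasing sensitivity;
    keeping platforms separate avoids mixing platform differences
    into the brand's underlying performance.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for problem, platform, _variant, brands in observations:
        key = (problem, platform)
        totals[key] += 1
        if brand in brands:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


rates = mention_rates(OBSERVATIONS, "BrandB")
for (problem, platform), rate in sorted(rates.items()):
    print(f"{problem} | {platform}: {rate:.0%}")
```

Running the same calculation on a fixed cadence and storing the results by date would give the trend line described in the fourth property.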

What to do about it

Three steps to move from category-level measurement to problem-led measurement.

Write down the 15 to 30 problems your buyers ask AI about. Interview existing customers if you need to. Use your support team as a source. Look at your own search traffic for long-tail queries. Do not use generic category terms.
Measure your position on each problem. Run each across the major AI platforms with multiple phrasings per problem. Record where you appear, where you do not, and which competitors do. This is the baseline. Expect to be surprised at least half the time.
Decide which problems to own. You cannot win all of them. Pick the ones that are commercially important, where you have a genuine right to win, and where the field is not already saturated with a specialist. Concentrate your content, review, and third-party validation work on those problems.

This is what AI See You does for its clients. The product is the AI Knowledge Centre, set up at knowledge.yourbrand.com. It maps problems, measures position, tracks trends, and identifies the signal gaps that are holding a brand back from being recommended on the problems that matter.

The uncomfortable truth for big brands

Problem-led recommendation has an uncomfortable implication for incumbents too. The same body of research that explains why incumbents dominate generic queries also explains why specialists can displace them on specific ones. Industry analyses published in late 2025 documented multiple cases of small, specialist brands outperforming multi-billion-dollar incumbents on problem-led AI queries: a specialist ecommerce logistics brand beating a national postal carrier, a digital-first bank beating traditional incumbents on particular consumer financial questions. The historical authority that incumbents built through decades of SEO and brand marketing does not automatically transfer to AI recommendation, because AI is answering a different kind of question.

The mechanism is the same one that protects incumbents on generic queries. The bigger the brand, the more diffused its semantic associations across many topics. The more diffused the associations, the weaker the match on any given specific problem. Legacy authority creates category presence. It does not create problem ownership. An incumbent that wants to stay recommended needs to do exactly what challengers need to do: identify the problems, measure the position, and concentrate the signal work.

The opportunity window

This is a first-mover window for challenger brands. Incumbents are still measuring category ranking because that is what their agencies know how to report. The measurement gap is not going to last. Over the next two to three years, sophisticated incumbents will move to problem-led measurement, build the content, and compound their authority on the problems that matter. The brands that move first on problem-level ownership will be harder to dislodge once they are embedded in the semantic space.

If you are a challenger brand and you are still measuring top 20 rankings, you are playing the wrong game while the actual game is open.

Frequently asked questions

Is category-level ranking useless?

No. It has two legitimate uses. First, it tells an incumbent whether their category dominance is stable or eroding. Second, it provides top-of-funnel brand awareness tracking. For a challenger brand, however, it is not a useful measure of whether AI is actually recommending you to buyers.
