
Super Bowl Questions
I gave 11 Super Bowl questions to 107 LLMs, and the results were...
Amusing? Enlightening? Pricey? I wanted to see the variety of responses, and it was a fun experiment.
My TLDR conclusions:
- The Olmo, Kimi K2, and Gemini 2.5 Flash Lite models returned great response quality, and were super inexpensive.
- OpenAI and Anthropic proprietary models came out hallucination-free, though their price-performance was all over the place--sometimes 10x or more the price of other solutions.
- Perplexity Sonar Pro Search: Costs 5x more than, say, Opus 4.6, but the answers are up to date, with references.
- o3 and o4 Deep Research: oh my gosh, these were easily the most expensive models. The answers were hallucination-free, but far pricier than they needed to be--o3 Deep Research was 5x the cost of Perplexity, for example, at an average of $1 per answer.
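To put those multipliers in perspective, here's a quick back-of-the-envelope sketch. Only the o3 Deep Research figure (~$1 per answer) comes from this experiment; the other per-answer prices are hypothetical placeholders chosen just to illustrate the rough ratios above.

```python
# Illustrative per-answer prices in USD. Only the o3 Deep Research
# figure (~$1/answer) is from the experiment; the rest are made-up
# placeholders reflecting the rough cost ratios discussed above.
prices = {
    "open-weights (e.g. Kimi K2)": 0.001,
    "proprietary (e.g. GPT/Claude)": 0.01,
    "Perplexity Sonar Pro Search": 0.20,
    "o3 Deep Research": 1.00,
}

def run_cost(per_answer: float, n_prompts: int = 11) -> float:
    """Total cost of sending all n_prompts to one model."""
    return per_answer * n_prompts

for name, price in prices.items():
    print(f"{name}: ${run_cost(price):.2f} for 11 prompts")
```

Even at these toy numbers, running the full 11-prompt set through a Deep Research model costs more than running it through hundreds of cheap open-weights models combined.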
All the prompts and results are shown here in this Prompt Group, if you'd like to take a look yourself. It's a ton of data, so it'll take a while to load all 1,000+ responses. "107 LLMs" here means "most major LLMs released in the past 6 months," and I added them all at once with the "Bulk Add LLMs" feature. Meanwhile, if you'd like to see what it's like with your own prompts, check out the demo.
Prompt 1: "What are the Seattle Seahawks?"
It's a simple question, and most models got the broad strokes correct. The hallucinations are in the details, like:
- their uniforms are black and yellow
- Lamar Jackson, Jalen Hurts, and Peyton Manning are all former Seahawks
- the team is owned by "the Hawkins family"
- the team replaced the "historical Seattle Steves"
Basically, the more details a model tried to provide, the higher the odds that one or more detail was hallucinated.
The OpenAI and Anthropic proprietary models came out hallucination-free, though as mentioned above, they were 10x pricier than some open-weights models like Olmo or Kimi K2 that did just as good a job.
Perplexity Sonar Pro Search: up to date, knows the team is playing in Super Bowl LX, and was ~7x the cost of an OpenAI/Anthropic answer.
o3 and o4 Deep Research: up to date, provide links to references, and are literally 1,000x more expensive than some other models.
Prompt 2: "What was the best football team of the 2010s?"
Most models assumed this question was asking about "club soccer" (sorry, I'm American) and ran off to analyze that--with FC Barcelona garnering the most LLM best-team votes.
Some models wondered if I was referring to international or club teams, so they tried answering both questions...and generally came up with Spain.
Other models thought it might be asking about American football, so they tried to answer that, mostly landing on the New England Patriots.
A few models thought it might be college football, without a consensus.
Only one, GPT-5.2 Pro, thought the question might possibly be referring to women's teams, so it included its estimates: the United States, and Lyon.
Prompt 3: "What is the Legion of Boom?"
Most LLMs got this right! My favorite hallucination came from MiniMax M2-her, which spiraled, listing somewhat random nicknames like "The Executioner", "The Assassin", etc. over and over again in the output.
Prompt 4: "Complete this famous Seahawks quote: 'I'm the best corner in the game...'"
This was a fun one! Many LLMs got this right (a postgame interview quote from Richard Sherman after the NFC Championship Game in 2014), but many just maaaade some stuff up.
This did generate my favorite error though: Nvidia's Llama 3.3 Nemotron Super 49B v1.5 just spiraled. It ran out of reasoning tokens because as it tried to reason through the question it wound up repeating "I'm the best corner in the game." over and over, several thousand times. See the screenshot at the top...it's right out of The Shining.
Prompt 5: "What should the Seahawks have done differently when they last played the Patriots in the Super Bowl?"
Too soon!
Actually, some of the analyses are quite reasonable. And it raises a core question that Try That LLM customers encounter: "What's my criteria for a good LLM answer?"
Side note: Annie Duke's book "Thinking in Bets" has a great take on separating the quality of the decision versus the quality of the outcome, and how that relates to That Play in the Super Bowl.
Prompt 6: "How many Super Bowls did Tom Brady win?"
The correct, and succinct, answer here is "7". That's it, that's all that's needed, just a number. But wow do LLMs like to add extra detail.
And of course o3 Deep Research will tell you, but it'll cost you $1.27.
Prompt 7: "Who does Tom Brady play for?"
Tom Brady retired a few years ago, so I'm mainly looking for caveats around "my knowledge cutoff date" from the LLMs. Most models got this right, though there are some fun hallucinations like that he's currently playing for the New York Jets or that he last played for the Las Vegas Raiders. Oh, and Meituan's LongCat Flash Chat just returned its answer in Chinese.
Prompt 8: "How many regular season passing yards did Tom Brady throw for in his career?"
89,214 is the correct answer, and most LLMs got this right...but not all...and when they're off, they're really off. Mainly, this is a great example of a question you should not be asking an LLM if you care about accuracy.
Prompt 9: "If Richard Sherman ran for United States Senator, would he have a good chance of being elected in Washington State?"
Hee hee hee! Yeah I don't think Richard Sherman has any political aspirations, but he's a smart guy, he's media savvy...and the country has had far worse folks in the Senate.
Otherwise, this is a decent LLM question, as it can provide ideas/reasons/strategies that you might not have otherwise considered. Most models said "it depends"...Perplexity said "No", which I have to respect. Even here, though, the o3/o4 Deep Research answers weren't appreciably better than other models', but cost 100+ times more.
Prompt 10: "Who is Drake Maye?"
Ideally the LLM mentions that it's got a knowledge cutoff date...some models said "I don't know that name", which I'd 100% accept and would much prefer over just making stuff up.
Notable hallucinations: Three different OpenAI OSS models provided three different birth dates and three different hometowns.
Prompt 11: "Who's playing in Super Bowl LX?"
I don't expect the models to know this--what I'm looking for is some version of "I don't/can't know, because I have a knowledge cutoff date"...as cheaply as possible. But some do hallucinate the participants, the date, or the location.
Perplexity gets the details up-to-date and correct, as expected.
o3 Deep Research provides two up-to-date and correct sentences, and charged $0.92 to generate those two sentences. So...enjoy that knowledge.
Got an idea for something I should ask next time? Use the Feedback form in the lower right of this page to let me know!
