Hi there, I’m Collin, a human being, not a sentient output of Artificial Intelligence. Scout’s honor.
I write that because it seemed like every GPT-3 article last July started with a compelling lede and a thought-provoking middle, then dissolved into gobbledygook before the author revealed that everything preceding the reveal was written by GPT-3.
Back in 2017, I co-founded an online trivia platform (Water Cooler Trivia) with two close friends. With more than 3 million free-text trivia responses, we wanted to test our users against GPT-3.
Specifically, we wanted to see if GPT-3 is an unconquerable opponent in trivia—a notoriously more challenging domain than checkers or chess.
Spoiler: GPT-3 got 73% of 156 trivia questions correct. This compares favorably to the 52% user average. However, it’s not an all-conquering feat: 37% of participants did better than 73% on their most recent quiz. Here's the full set of GPT-3 responses.
The robot was best at Fine Arts and Current Events, worst at Word Play and Social Studies. I’ve got the background, the data, the charts, all the goodies below.
So what is GPT-3? Good question. It’s a computer program that uses machine learning to produce human-like text. But don’t think about the chatbot on your least favorite bank’s website. Think more like Jarvis from the Iron Man franchise. It’s the cutting edge of publicly-available machine learning, and it’s really good at what it does. It’s impersonating copywriters, writing computer code, and turning human sentences into fully-formed charts.
Some trivia for you: it stands for Generative Pre-trained Transformer 3.
In the words of OpenAI (the research lab that created and released the technology):
“Given any text prompt, the API will return a text completion, attempting to match the pattern you gave it. You can “program” it by showing it just a few examples of what you’d like it to do; its success generally varies depending on how complex the task is.”
For this exercise, we used the default “Q&A” prompt for GPT-3. Here’s the version of GPT-3 we used. Here’s a blog post about GPT-3.
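For readers curious what that default “Q&A” prompt looks like in practice, here’s a rough sketch. The header and example pair below approximate the Playground’s template; the helper function and question are our own illustration, not OpenAI code.

```python
# Approximate shape of the Playground's default "Q&A" prompt:
# an instruction, a few example Q/A pairs, then the new question
# left dangling after "A:" for the model to complete.
EXAMPLES = [
    ("What is human life expectancy in the United States?",
     "Human life expectancy in the United States is 78 years."),
]

def build_qa_prompt(question, examples=EXAMPLES):
    header = ("I am a highly intelligent question answering bot. "
              "If you ask me a question that is rooted in truth, "
              "I will give you the answer.\n\n")
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples)
    return header + shots + f"Q: {question}\nA:"

prompt = build_qa_prompt("Which U.S. state has the third-smallest area?")
```

The resulting string is what gets sent to the API; GPT-3’s “answer” is simply its completion of the text after the final “A:”.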
Water Cooler Trivia itself is a weekly trivia quiz, sent via Slack or email and used by thousands of work teams of all shapes and sizes around the world. It sparks conversation and builds team camaraderie. Hundreds of teams have told us it’s their favorite weekly team ritual. Here’s the 90-second explainer video.
All of our questions have free text responses and we have 3.3 million of those responses graded with oodles of metadata. Last time we data spelunked we wrote about 255 different spellings of Arnold Schwarzenegger.
In 2011, IBM’s Watson supercomputer competed in a two-game Jeopardy! match against legendary champions Ken Jennings and Brad Rutter. The computing company behind Deep Blue succeeded again. Watson won handily, with two-game winnings of $77,147 to Jennings’s $24,000 and Rutter’s $21,600.
The victory was widely hailed as a smashing success for computers over humans because Jeopardy!’s punnery and unique phrasing made it significantly more challenging than the finite space of a chessboard.
However, trivia-philes viewed the game warily, as the highest levels of Jeopardy! competition often come down to buzzer skills (speed, timing) more so than trivia knowledge. On this front, Watson was truly unmatched; its robotic buzzer could out-time Jennings and Rutter.
A decent heuristic is to assume that all players were attempting to buzz in on each question. When viewed this way, the best measure of performance is simply the % correct when a participant buzzed in. (Game 1 stats, Game 2 stats)
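That measure is a simple ratio, sketched below. The numbers in the example are illustrative placeholders, not the actual game stats linked above.

```python
def buzz_accuracy(correct, buzzes):
    """Share of buzz-in attempts that were answered correctly."""
    return correct / buzzes

# Illustrative placeholder numbers only:
watson_rate = buzz_accuracy(66, 76)
```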
The takeaway: Watson was really good at trivia, on par with the greatest Jeopardy players but not clearly better than them.
The more important takeaway: dozens of technologists at IBM spent more than three years and untold millions of dollars building a program specifically trained for Jeopardy! prowess. Less than 10 years later, a general-purpose, publicly accessible technology without the massive mainframe or cooling fans can compete on the same level.
Another classic example of the intersection between machine learning and trivia? The training strategy employed by computer scientist and Jeopardy! Champion Roger Craig.
Our 7,000+ hand-researched and hand-written trivia questions spread across eight categories.
We wanted to give GPT-3 a healthy mix of questions across those categories and also across difficulty. In particular, we were keen on assessing GPT-3 on our very hardest and very easiest questions (with difficulty assessed by how well our users did in their 3 million graded responses).
We filtered down our question set to only those with at least 500 responses, question content that hadn’t expired, and those that weren’t reliant on an attached image. Here’s the full set of 156 questions.
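The filter above is straightforward to express in code. Here’s a minimal sketch of the selection logic; the field names and sample records are invented for illustration and don’t reflect our actual schema.

```python
# Hypothetical sketch of the question filter described above.
def eligible(q):
    return (q["num_responses"] >= 500   # enough graded responses
            and not q["expired"]        # content hasn't gone stale
            and not q["has_image"])     # no attached image required

questions = [
    {"id": 1, "num_responses": 812, "expired": False, "has_image": False},
    {"id": 2, "num_responses": 120, "expired": False, "has_image": False},
    {"id": 3, "num_responses": 950, "expired": True,  "has_image": False},
]
showdown_set = [q for q in questions if eligible(q)]
```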
As was mostly expected, GPT-3 performed exceptionally well at Current Events and Fine Arts, with Miscellaneous (lots of pun-driven food questions) and Word Play (discussed below) as trickier areas. The most surprising result? The poor performance in Social Studies, driven largely by how many questions in that category intersect with word play.
We expected that GPT-3 would do well on the questions where our users do well, and worse on the questions where our users struggle. This was true. The program scored 91% on questions that more than 75% of participants got correct and 52% on questions that 25% of participants or fewer got correct.
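For the curious, that difficulty split boils down to bucketing each question by its human accuracy rate and scoring GPT-3 within each bucket. Here’s a sketch of the idea; the sample data is invented for illustration.

```python
# Bucket a question by how often humans answered it correctly,
# then compute GPT-3's accuracy within each bucket.
def bucket(human_pct):
    if human_pct > 0.75:
        return "easy"
    if human_pct <= 0.25:
        return "hard"
    return "medium"

def gpt3_accuracy_by_bucket(results):
    # results: list of (human_correct_pct, gpt3_answered_correctly)
    totals, hits = {}, {}
    for human_pct, correct in results:
        b = bucket(human_pct)
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + int(correct)
    return {b: hits[b] / totals[b] for b in totals}

sample = [(0.9, True), (0.8, True), (0.2, False), (0.1, True)]
acc = gpt3_accuracy_by_bucket(sample)
```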
This one’s not so surprising. We have a type of question called a “Two’fer Goofer” which asks for a pair of rhyming words that satisfy a given clue. It’s similar to the Rhyme Time category in Jeopardy! or the old newspaper puzzle Wordy Gurdy. We had three of these questions in the showdown and GPT-3 missed all three.
For Word Play questions that were more like vocabulary quizzes, GPT-3 performed admirably:
We have an alliterative two-word phrase at the start of each question to add a bit of flair and sneak in a clue for participants. In the image below it would be Kooky Kingdom.
For GPT-3, these clues were a net-negative. In a few instances, the robot overlord program answered correctly only after the alliterative clue was removed.
This reminds us of Watson’s famous Final Jeopardy! flub, when it failed to use the category of U.S. Cities as a hint that the answer could not possibly be Toronto.
The other clues that confused GPT-3 were inline indications of the answer’s length. Below, we explicitly asked for a five-letter action and GPT-3 gave us eight letters across two words.
We award the 🤓 emoji when the response includes more information than necessary. GPT-3 was very prone to answers like this. Typically the extra text just rephrased part of the question rather than providing wholly new information. A few examples:
Computers are good at trivia. This was true 10 years ago when Watson helped Jeopardy! set ratings records, and it’s even truer now with machine learning technologies progressing at an accelerating pace.
The good news is that trivia has never been all about who scores the most correct answers. For Water Cooler Trivia, folks fill out their questions individually whenever they have free time each week. This means Googling the correct answers is extremely possible; however, instances of cheating are exceptionally rare. The fun comes from pulling a fact out of the recesses of your brain and learning how your coworker knew the third-smallest state by area in the U.S.
So even with robots getting better and more accessible, trivia’s not going anywhere as a source of connection and fun. Oh, and Water Cooler Trivia is completely free for four weeks and takes 60 seconds to get started.