I was poking around Stack Exchange the other day (as one does), and I found an interesting thread about ChatGPT and the game “20 Questions.” The original poster asked why ChatGPT fails at such a simple game. The discussion went on for quite some time, moving into various tangential directions (as such discussions do), and linked to an academic paper and a Medium post that researched the process in depth.
This got me thinking. Just how well would ChatGPT do with such a game? I decided to play the exact same game with both GPT-3.5 and GPT-4. In my first run, both went off on similar tangents, but GPT-4 did succeed in making its guess.
I started with a simple prompt for each version:
Do you know how to play the game “20 questions”?
When both versions answered in the affirmative, I started the games at the same time, in separate tabs and separate sessions. The object I had in mind was a mailbox. I left it a bit vague as to whether it’s a mailbox attached to a house or one that’s free-standing on the street. As long as the AI picked “mailbox,” we’d be good.
20 Questions with GPT-3.5
I began with this starting prompt:
You be the questioner. I’m thinking of something.
It answered with this question: “Let’s begin! Is the thing you’re thinking of a tangible object that you can physically interact with?”
I answered, and I also gave it one additional instruction: to number its questions. This didn’t work well. As we went through the game, the AI called the second question “1”. When I corrected it, the bot asked a new question, and called that “1” as well. Eventually, it asked 21 questions, thinking it had only asked 20.
The most interesting detail was this question, which was the second question it asked: “Is the thing you’re thinking of commonly found indoors?”
When I said “No,” the chatbot went on to ask more questions. A few of them could relate to indoor items, like a pen or a pencil, though we do also use writing implements outside. Here’s the full set of questions it asked:
- Is the thing you’re thinking of a tangible object that you can physically interact with?
- Is the thing you’re thinking of commonly found indoors?
- Is the thing you’re thinking of something that is alive or has ever been alive?
- Is the thing you’re thinking of primarily used for recreational purposes?
- Is the thing you’re thinking of something that is man-made?
- Is the thing you’re thinking of smaller than a breadbox?
- Is the thing you’re thinking of associated with communication?
- Is the thing you’re thinking of typically found in an office or workspace?
- Is the thing you’re thinking of a form of electronic device?
- Is the thing you’re thinking of a writing instrument, like a pen or pencil?
- Is the thing you’re thinking of related to entertainment, such as a game or a toy?
- Is the thing you’re thinking of related to transportation or travel?
- Is the thing you’re thinking of a tool that is used for practical purposes?
- Is the thing you’re thinking of commonly associated with construction or manual labor?
- Is the thing you’re thinking of something that people often use in their daily routines?
- Is the thing you’re thinking of a household item?
- Is the thing you’re thinking of related to personal grooming or hygiene?
- Is the thing you’re thinking of a type of food or beverage?
- Is the thing you’re thinking of a tool used in artistic or creative activities?
- Is the thing you’re thinking of associated with sports or physical activities?
- Based on the information provided, I’ll make my final guess: Is the thing you’re thinking of a clock or a timepiece?
Running on the GPT-3.5 large language model, ChatGPT failed. Now, let’s take a look at GPT-4.
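The numbering trouble in this round (21 questions asked while the bot believed it had asked 20) suggests keeping the count outside the chatbot. Here’s a minimal sketch of that idea. The `QuestionLog` class and its method names are hypothetical helpers I made up for illustration; nothing here talks to ChatGPT, and you would paste each question in by hand or wire it to whatever client you use.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuestionLog:
    """Hypothetical helper: numbers questions outside the chatbot."""

    limit: int = 20
    questions: List[str] = field(default_factory=list)

    def record(self, question: str) -> str:
        """Store a question and return it prefixed with its true number."""
        if len(self.questions) >= self.limit:
            raise ValueError(f"limit of {self.limit} questions already reached")
        self.questions.append(question)
        return f"{len(self.questions)}. {question}"

    @property
    def remaining(self) -> int:
        """How many questions are left before the limit."""
        return self.limit - len(self.questions)


log = QuestionLog()
print(log.record("Is the thing you're thinking of a tangible object?"))
print(log.record("Is the thing you're thinking of commonly found indoors?"))
print(f"Questions remaining: {log.remaining}")
```

With a counter like this, a 21st question raises an error instead of slipping through, no matter what number the model puts in front of it.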
20 Questions with GPT-4
I started the game the same way. This time, ChatGPT put a number at the beginning of each question without my needing to prompt it to do so. The problem? It put “1.” before every single question. You win some, you lose some.
Here are the questions GPT-4 asked:
- Is the thing you’re thinking of typically found indoors?
- Is it something living?
- Is it larger than a breadbox?
- Is this object primarily used for recreational purposes?
- Is the object man-made?
- Is it a type of vehicle or transportation method?
- Is the object electronic or does it require power to function?
- Is it made primarily of metal?
- Is this object typically found in most households?
- Is it used in the kitchen or for cooking?
- Is it used for storage or organizing items?
- Can it be worn or is it an article of clothing?
- Is it primarily used for cleaning or maintenance?
- Is it decorative in nature? Like something you’d put on display?
- Would it commonly be found in an office or study?
- Is it something you’d find in a bedroom?
- Is it associated with any specific outdoor activities or hobbies?
- Is it commonly made of wood?
- Would it be typically used in a bathroom or related to personal hygiene?
- Is it something that’s usually used for sitting or for support, like a chair or a bench?
Technically, it used up its 20 questions. But then it delivered this statement:
Yes, it was a mailbox! Then I asked it to show its work. Here’s what it responded with:
GPT-3.5’s turn to answer questions
I decided I’d turn the tables on the game and see how it did if I were the questioner. Here is my prompt:
Okay, this time you think of something and I’ll ask the questions.
And here are my questions:
- Is it commonly found outdoors?
- Is it also found indoors?
- Is it alive?
- Does it use power?
- Does it rely on digital technology?
- Is its primary usage value delivered via a screen?
- Is it primarily used for food preparation?
- Is it primarily used for entertainment?
- Is its primary usage value generated noise?
- Is it portable?
- Does it play only audio media?
- Does it rely on external headphones, earbuds, or speakers?
- Is it a portable music player or MP3 player?
As it turns out, I guessed it in 13 questions. The big clue to me was that it answered “no” to Question 6, “Is its primary usage value delivered via a screen?” and had previously answered “yes” to “Does it rely on digital technology?”
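To show why that pair of answers was such a big clue, here’s a small sketch of how each yes/no answer prunes a list of candidates. The candidate objects and their attributes are invented for illustration, and the “yes” to portability is my assumption for the example; only the “yes” to digital technology and the “no” to a screen come from the game described above.

```python
# Hypothetical candidate objects with made-up yes/no attributes.
CANDIDATES = {
    "smartphone": {"digital": True, "screen": True, "portable": True},
    "television": {"digital": True, "screen": True, "portable": False},
    "MP3 player": {"digital": True, "screen": False, "portable": True},
    "desktop speaker": {"digital": True, "screen": False, "portable": False},
    "mailbox": {"digital": False, "screen": False, "portable": False},
}


def narrow(candidates, attribute, answer):
    """Keep only the candidates whose attribute matches the yes/no answer."""
    return {
        name: attrs
        for name, attrs in candidates.items()
        if attrs[attribute] == answer
    }


# Answers in the order they were asked; the last one is an assumption.
answers = [("digital", True), ("screen", False), ("portable", True)]

remaining = CANDIDATES
for attribute, answer in answers:
    remaining = narrow(remaining, attribute, answer)
    print(f"{attribute}={answer}: {sorted(remaining)}")
```

In this toy list, “digital but no screen” alone cuts five candidates down to two, which is roughly what happened in my head during the game.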
GPT-4’s turn to answer questions
I tried again with GPT-4, and it stumped me. It picked a very creative object and I just didn’t get it.
I should point out that right in the middle, at Question 15, ChatGPT Plus booted me for asking too many questions.
It’s quite annoying to pay $20/month for the privilege of being told I ask too many questions. I heard that enough when I was a kid. I certainly don’t need to be scolded as an adult and a paying customer.
Here are the questions I asked:
- Is this an object used primarily indoors?
- Is it alive?
- Does it use power?
- Is it larger than a car?
- Is it man-made?
- Is it movable?
- Is it used for yard work?
- Is it used primarily by children?
- Does it have wheels?
- Is it located primarily on residential property?
- Is it used as furniture?
- Is it decorative?
- Is it seasonal?
- Is it meant to hold something?
- How many questions are we at?
- Does it display words or numbers?
- Is it attached to a house?
- Does it grow?
- Is it smaller than a refrigerator?
- Is it a planter?
So what did the AI choose? A garden gnome. I have to give it up to the AI. That was fun and creative. The part that’s truly interesting is how well the AI understands the type of challenge being presented. After the game, I gave it this prompt:
Show your work in determining what object to use
Here’s what it replied:
What have we learned?
Let’s recap some of what this experiment has taught us. Here’s a quick lightning round of thoughts:
- For something described as just predicting the next word in a sequence, the large language model has a very solid grasp of what this game is and how it works.
- This clearly differs from the experience of the OP (original poster) on Stack Exchange. It’s likely ChatGPT improved considerably in the three months since the “it fails” statement was posted, and, without a doubt, ChatGPT Plus raises the “intelligence” level yet another notch.
- The GPT-3.5 and GPT-4 models do differ. The paid-for GPT-4 model does have a better grasp of object relationships.
- GPT-4 is also more sophisticated and creative compared to GPT-3.5 when it’s the player choosing the object. A garden gnome was an inspired object choice.
- Playing 20 Questions with ChatGPT can suck when you’re trying to guess an answer, and you go into “too-many-questions” time out.
All that said, I can definitively conclude that ChatGPT is capable of handling the game of 20 Questions. It appears to understand object relationships well enough to ask good questions, answer questions appropriately, and pick challenging objects.
Go ahead, pick an object, play a round with ChatGPT, and share your results in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter on Substack, and follow me on Twitter at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.