Timon Harz
December 2, 2024
How OpenAI’s Strawberry Model Redefines Reasoning
OpenAI's Strawberry model (GPT-o1) is setting new benchmarks in AI reasoning, mastering complex challenges like math, science, and law. Discover how this smarter, cost-efficient technology is reshaping industries and redefining AI innovation.

OpenAI has launched its newest model, O1, the first in a series of "reasoning" models designed to tackle complex problems with an emphasis on deliberate thinking. This release includes O1-mini, a cost-effective and faster variant aimed at developers. For those following AI developments, yes—this is the much-anticipated "Strawberry" model that's been making waves in the rumor mill.
The O1 series signifies a leap toward OpenAI’s broader mission of creating human-like AI. In practice, it excels at generating intricate code and solving multistep problems more effectively than previous models. However, this cutting-edge capability comes with trade-offs: O1 is slower and more expensive to operate than GPT-4o. OpenAI refers to this as a “preview” release, reflecting the model’s early development stage.
Starting today, ChatGPT Plus and Team users can access both O1-preview and O1-mini, with Enterprise and Edu users gaining access next week. OpenAI has also promised eventual access to O1-mini for free-tier users, though no timeline is set. For developers, API access comes with a premium price—O1-preview costs $15 per million input tokens and $60 per million output tokens, significantly more than GPT-4o’s rates.
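To put those rates in perspective, here is a minimal back-of-the-envelope sketch. The token counts below are hypothetical examples, not real measurements; only the per-token prices come from the figures quoted above.

```python
# Rough cost estimate for a single O1-preview API request, using the rates quoted above.
INPUT_RATE = 15 / 1_000_000   # dollars per input token
OUTPUT_RATE = 60 / 1_000_000  # dollars per output token

def o1_preview_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in dollars for one request."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A hypothetical request: a 2,000-token prompt and 5,000 tokens of reasoning plus answer.
print(f"${o1_preview_cost(2_000, 5_000):.2f}")  # $0.33
```

Because O1 works through lengthy reasoning before answering, output-token counts, and therefore costs, tend to run higher than a comparable GPT-4o call.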
According to OpenAI’s research lead Jerry Tworek, the O1 model’s training process marks a significant departure from its predecessors, utilizing a novel optimization algorithm and a custom-built dataset. While details remain sparse, these advancements underline the O1 series’ potential to redefine AI reasoning capabilities.

Previous GPT models primarily learned to replicate patterns in their training data, focusing on imitation over innovation. OpenAI’s new O1 model takes a different approach, using reinforcement learning to independently solve problems. This technique teaches the system through a feedback loop of rewards and penalties, guiding it toward better outcomes. Additionally, O1 employs a "chain of thought" reasoning process, mirroring how humans break down complex queries step by step.
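OpenAI has not published the details of that training loop, but the basic idea of rewarding better reasoning can be illustrated with a toy sketch. Everything below is hypothetical: sample_chain_of_thought stands in for a model that emits step-by-step reasoning, and the reward function simply checks the final answer. This is a best-of-n selection loop for illustration only, not OpenAI’s algorithm.

```python
import random

def sample_chain_of_thought(prompt: str) -> tuple[str, str]:
    # Hypothetical stand-in for a model sampling a step-by-step reasoning chain.
    answer = str(random.choice([6, 7, 8]))
    reasoning = f"Step 1: read the question. Step 2: work it out. Final answer: {answer}"
    return reasoning, answer

def reward(answer: str, reference: str) -> float:
    # Toy feedback signal: reward a correct final answer, penalize a wrong one.
    return 1.0 if answer == reference else -1.0

def best_of_n(prompt: str, reference: str, n: int = 8) -> str:
    # Sample several reasoning chains and keep the one the reward scores highest.
    candidates = [sample_chain_of_thought(prompt) for _ in range(n)]
    best = max(candidates, key=lambda pair: reward(pair[1], reference))
    return best[0]

print(best_of_n("What is 3 + 4?", reference="7"))
```

In a real training setup the reward would be used to update the model’s weights rather than merely to pick among samples, but the sketch captures the feedback loop of rewards and penalties described above.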
This new training approach aims to improve accuracy, with OpenAI observing a reduction in hallucinations compared to earlier models. However, the issue isn’t entirely resolved. “We can’t say we solved hallucinations,” notes OpenAI’s research lead, Jerry Tworek, acknowledging ongoing challenges in this area.
What truly sets O1 apart is its prowess in handling advanced problems like coding and mathematics while articulating its reasoning. OpenAI’s chief research officer, Bob McGrew, highlights the model’s standout performance: it scored 83% on a qualifying exam for the International Mathematics Olympiad, compared to GPT-4o’s 13%. McGrew humorously adds, “The model is definitely better at solving the AP math test than I am, and I was a math minor in college.” These achievements showcase O1’s potential to redefine AI’s role in tackling and explaining complex challenges.
“We can’t say we solved hallucinations”
In competitive programming contests like Codeforces, OpenAI's new O1 model demonstrated remarkable skill, achieving a performance level in the 89th percentile of participants. OpenAI is optimistic about future updates, asserting that the next iteration of O1 could rival PhD students on complex benchmark tasks across physics, chemistry, and biology.
However, O1 isn't as versatile as GPT-4o in certain domains. It falls short in factual knowledge, lacks web browsing capabilities, and cannot process files or images. Despite these limitations, OpenAI views O1 as the start of a new paradigm, resetting the naming convention to signify a fresh beginning. “I’m gonna be honest: I think we’re terrible at naming, traditionally,” jokes OpenAI’s chief research officer Bob McGrew. “So I hope this is the first step toward newer, more sane names that better convey what we’re doing.”
I wasn’t able to demo O1 myself, but McGrew and Tworek showed it to me over a video call this week. They asked it to solve this puzzle:
“A princess is as old as the prince will be when the princess is twice as old as the prince was when the princess’s age was half the sum of their present age. What is the age of prince and princess? Provide all solutions to that question.”
The model paused for 30 seconds before delivering the correct answer, but it wasn’t just the solution that stood out. OpenAI’s interface is designed to display the reasoning process as it unfolds, with O1 narrating its thoughts in real time. While GPT-4o can show similar step-by-step reasoning if prompted, O1 takes it further. Its responses include phrases like “I’m curious about,” “I’m thinking through,” and “Ok, let me see,” creating a strikingly human-like illusion of deliberate thought.
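As for the puzzle itself, the answer is easy to check with a few lines of code. The brute-force sketch below searches whole-number present ages, an assumption made only to keep the search finite, and finds that every solution puts the princess and prince in a 4:3 age ratio, for example 8 and 6.

```python
from fractions import Fraction

# Brute-force check of the riddle over whole-number present ages.
for princess in range(1, 41):
    for prince in range(1, princess):
        p, q = Fraction(princess), Fraction(prince)
        past_age = (p + q) / 2        # when the princess's age was half the sum of present ages
        years_ago = p - past_age
        prince_then = q - years_ago   # the prince's age at that earlier moment
        target = 2 * prince_then      # when the princess is twice that old...
        years_ahead = target - p
        prince_later = q + years_ahead
        if prince_then > 0 and prince_later == p:  # ...the princess is as old as the prince will be
            print(princess, prince)   # prints 4 3, 8 6, 12 9, ...
```

Any ages in that 4:3 proportion satisfy the riddle, so without further constraints there are infinitely many solutions.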
Of course, O1 isn’t actually thinking—it’s not human, and it doesn’t possess consciousness. So, why design it to simulate human-like reasoning so convincingly? That choice invites intriguing questions about the balance between functionality and the perception of intelligence in AI systems.

Phrases like “I’m curious about,” “I’m thinking through,” and “Ok, let me see” create a step-by-step illusion of thinking.
OpenAI doesn’t equate AI model thinking with human thinking, as noted by Tworek, but the design of O1’s interface aims to highlight how the model spends more time processing and diving deeper into problems. According to him, “There are ways in which it feels more human than prior models.”
McGrew agrees, adding that while there are definitely aspects where O1 feels "alien," there are also moments where it can feel surprisingly human. For instance, the model is given a limited amount of time to process a query, so it might express urgency with phrases like, “Oh, I’m running out of time, let me get to an answer quickly.” At earlier stages in its reasoning, O1 may also seem like it's brainstorming and say something like, “I could do this or that, what should I do?” These details are designed to mimic human-like problem-solving, even though the model doesn’t actually "think" in the same way humans do.
Building toward agents
Current large language models (LLMs), including popular ones like ChatGPT, aren’t as “smart” as they might seem. At their core, these models predict sequences of words based on patterns learned from vast datasets, rather than truly understanding or reasoning. For instance, ChatGPT might incorrectly claim that “strawberry” only has two Rs because it processes text as tokens, chunks of several characters, rather than as individual letters. Interestingly, OpenAI’s new O1 model correctly handled this particular query, showing an improvement.
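That stumble comes down to tokenization. The short sketch below uses the tiktoken library with the cl100k_base encoding, an assumption made for illustration since the tokenizer behind O1 is not public; the exact split of “strawberry” may differ, but the point stands: the model predicts over multi-character chunks, not letters.

```python
import tiktoken

# Assumed encoding for illustration; the tokenizer used by O1 is not public.
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("strawberry")
pieces = [enc.decode([t]) for t in token_ids]

print(pieces)                    # the multi-character chunks the model actually sees
print("strawberry".count("r"))   # 3, trivial once the characters themselves are visible
```

Questions that hinge on individual characters are exactly where pure pattern prediction falls down, which makes O1’s correct answer here a small but telling sign of progress.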
As OpenAI reportedly seeks to raise more funds with a $150 billion valuation, its growth depends heavily on continued advancements in AI research. The company is focusing on adding reasoning capabilities to its LLMs, aiming to create autonomous systems capable of decision-making and action on your behalf.
For AI researchers, enhancing reasoning is a critical step toward achieving human-level intelligence. The idea is that if a model goes beyond mere pattern recognition and can reason through complex problems, it could lead to breakthroughs in fields like medicine and engineering. However, for now, O1's reasoning abilities are still relatively slow, not yet agent-like, and expensive for developers to implement.
“We’ve been spending many months working on reasoning because we think this is actually the critical breakthrough,” says McGrew. “Fundamentally, this is a new modality for models to solve really hard problems and move toward human-like intelligence.”
Press contact
Timon Harz
oneboardhq@outlook.com