I tested o1-preview on math, reasoning, coding, and creative writing. Here are my observations.

It's been four days since o1-preview dropped, and the initial hype is starting to settle. People are divided on whether this model is a paradigm shift or just GPT-4o fine-tuned on chain-of-thought data.

As an AI start-up that relies on LLMs' reasoning ability, we wanted to know whether this model is what OpenAI claims it to be and whether it can beat the incumbents at reasoning.

So, I spent some hours putting this model through its paces, testing it on a series of hand-picked challenging prompts and tasks that no other model has been able to crack in a single shot.

For a deeper dive into all the hand-picked prompts, detailed responses, and my complete analysis, check out the blog post here: OpenAI o1-preview: A detailed analysis.

What did I like about the model?

In my testing, the model does live up to the hype around complex reasoning, math, and science, just as OpenAI claims. It answered some questions that no other model has managed without human assistance.

What did I not like about o1-preview?

It's not quite at a Ph.D. level (yet)—neither in reasoning nor math—so don't go firing your engineers or researchers.

Considering the trade-off between inference speed and accuracy, I still prefer Sonnet 3.5 over o1-preview for coding. Creative writing is a clear no for o1-preview; in their defence, OpenAI never claimed otherwise.

I would like to know if anyone has used Sonnet 3.5 and o1-preview in tandem for planning and execution, like a real-world architect and developer pair (a rough sketch of what I have in mind follows below).

However, o1 might be able to overcome that. It certainly feels like a step change, but the size of the step remains to be seen.
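
For anyone curious what I mean by the architect/developer split, here is a minimal sketch. The model names, prompts, and the two-step hand-off are my own assumptions for illustration, not a setup I have benchmarked.

```python
# Hypothetical "architect + developer" pipeline:
# o1-preview drafts the plan, Claude 3.5 Sonnet writes the code.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()        # expects OPENAI_API_KEY in the environment
anthropic_client = Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def plan_then_build(task: str) -> str:
    # Step 1: ask o1-preview for an implementation plan (the "architect").
    # Note: o1-preview currently takes only user messages (no system prompt)
    # and does not accept sampling parameters like temperature.
    plan = openai_client.chat.completions.create(
        model="o1-preview",
        messages=[{
            "role": "user",
            "content": f"Write a step-by-step implementation plan for: {task}",
        }],
    ).choices[0].message.content

    # Step 2: hand the plan to Sonnet 3.5 to produce the code (the "developer").
    code = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": f"Implement this plan in Python:\n\n{plan}",
        }],
    ).content[0].text
    return code

print(plan_then_build("a token-bucket rate limiter"))
```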

What do you think about the CoT traces? I got many correct final answers, even though the visible traces were somewhat inconsistent.

Also, I would like to know if anyone has already tried structured output with the Instructor library or something similar on o1-preview.
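
To make the Instructor question concrete, this is the kind of setup I mean. It is just a sketch: the MD_JSON mode (JSON emitted in a markdown block, then parsed and validated against a Pydantic model) is my assumption for working around o1-preview's lack of tool-calling and response_format support, and whether it behaves reliably is exactly what I'm asking.

```python
# Sketch: structured output from o1-preview via the Instructor library.
# MD_JSON mode asks the model to emit JSON in markdown and parses it,
# since o1-preview has no tool-calling or response_format support yet.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Verdict(BaseModel):
    task: str
    passed: bool
    notes: str

client = instructor.from_openai(OpenAI(), mode=instructor.Mode.MD_JSON)

verdict = client.chat.completions.create(
    model="o1-preview",
    response_model=Verdict,  # Instructor validates (and retries) against this schema
    messages=[
        {"role": "user", "content": "Grade this answer to the river-crossing puzzle: ..."},
    ],
)
print(verdict.model_dump())
```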