R1 performs 11 points better than its non-Chain-of-Thought counterpart on LiveBench

https://preview.redd.it/z25hpwiko7ee1.png?width=1449&format=png&auto=webp&s=9546ff17ff7d84bb500e41a5aa912a6d304122bd

Interestingly, Gemini 2 Flash Thinking is only about 2 points better than its non-thinking counterpart. I wonder why that is. We also don't know what o1's base model is, so we can't compare it against its non-thinking counterpart.

Madison Howard


pigeon57434
