R1 performs 11 points better than its non-Chain-of-Thought counterpart on LiveBench.
Interestingly, Gemini 2 Flash Thinking is only ~2 points better than its non-thinking counterpart. I wonder why that is. And we don't know what o1's base model is, so we can't compare it against a non-thinking counterpart.