When OpenAI introduced its o3 ‘reasoning’ AI model in December, it collaborated with the creators of ARC-AGI, a benchmark for testing highly capable AI, to demonstrate o3’s abilities. However, recent updates from the Arc Prize Foundation, which oversees ARC-AGI, indicate that the model’s performance might not be as stellar as initially believed. 🚀
Originally, the Foundation estimated that the top-performing o3 configuration, dubbed o3 high, would cost approximately $3,000 to solve a single ARC-AGI problem. This figure has now been revised upwards, with the Foundation suggesting the actual cost could be around $30,000 per task. This adjustment sheds light on the potentially steep costs associated with running today’s most advanced AI models for specific applications, at least in their early stages. 💰
OpenAI has not yet announced pricing for o3 or made it available to the public. However, the Arc Prize Foundation considers the pricing of OpenAI’s o1-pro model as a plausible indicator. O1-pro currently stands as OpenAI’s most costly model. Mike Knoop, a co-founder of the Arc Prize Foundation, mentioned to TechCrunch, ‘We believe o1-pro is a closer comparison of true o3 cost… due to the amount of test-time compute used.’ He added, ‘But this is still a proxy, and we’ve kept o3 labeled as preview on our leaderboard to reflect the uncertainty until official pricing is announced.’ 📊
The potential high cost of o3 high isn’t surprising, considering the computational resources it reportedly consumes. The Foundation notes that o3 high required 172 times more computing power than o3 low, the least resource-intensive configuration, to address ARC-AGI challenges. Additionally, there have been ongoing rumors about OpenAI’s plans to introduce premium services for enterprise clients, with reports suggesting fees of up to $20,000 monthly for specialized AI agents. 🤖
While some may argue that even OpenAI’s most expensive models are cheaper than hiring human professionals, AI researcher Toby Ord highlighted on X that these models might not match human efficiency. For instance, o3 high needed 1,024 attempts at each ARC-AGI task to reach its peak performance. 🔄