October 5, 2024
With the recent release of o1-Preview, which followed roughly 18 months of only incremental progress after GPT-4, it's worth taking a look at where AI research labs are succeeding and where they are falling short.
A major open question is whether large language models (LLMs), specifically auto-regressive transformer architectures, can lead to artificial general intelligence (AGI). Opinions vary widely: some believe scale alone will eventually unlock AGI, while others argue the current algorithms are fundamentally limited.
My position falls somewhere in the middle: scaling current architectures likely won't be sufficient to achieve AGI (by a high standard), but bolting on additional structures may offer some potential breakthroughs. We'll learn more in the coming months as scaled-up models are released by various labs.
The optimistic viewpoint is that continued scaling, combined with additional structures like o1-style reasoning, will keep improving accuracy while driving down cost and latency.
By pairing these improvements with versatile agents, the hope is that we will see substantial commercial benefits that drive real improvements in economic output. I suspect we will quickly reach a point where this is true in 80% of cases, but at high cost and often very slowly; over time, hopefully, we will see improvement on all three dimensions (accuracy, cost, and speed).
As a side note, a very cool short-term project that many developers could build quickly would be to spin up several bots backed by the various leading frontier models, give each a carefully engineered prompt and a very specific role, have them join a Discord chat, and let them work together on complex problems, iterating until they find a solution. It's a fun and easy project, though admittedly fairly expensive to run. Adding tools for the agents would be a big boost to their problem-solving ability. A minimal sketch follows.
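Here is a rough sketch of the round-table loop at the heart of that project, leaving out the Discord transport and tool use. It assumes the official openai Python package; the model name, agent roles, and prompts are purely illustrative.

```python
from openai import OpenAI  # assumes the official openai package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative roles; in the Discord version, each would be a separate bot
# account, ideally backed by a different frontier model.
AGENTS = {
    "Planner": "You break the problem into concrete subtasks.",
    "Critic": "You find flaws in the latest proposal and suggest fixes.",
    "Synthesizer": "You merge the discussion so far into the single best answer.",
}

def round_table(problem: str, rounds: int = 3) -> str:
    """Have the agents take turns on a shared transcript for a few rounds."""
    transcript = [f"Problem: {problem}"]
    for _ in range(rounds):
        for name, role in AGENTS.items():
            reply = client.chat.completions.create(
                model="gpt-4o",  # placeholder; mix models across agents in practice
                messages=[
                    {"role": "system", "content": role},
                    {"role": "user", "content": "\n\n".join(transcript)},
                ],
            )
            transcript.append(f"{name}: {reply.choices[0].message.content}")
    return transcript[-1]  # the Synthesizer speaks last in each round

print(round_table("Design a caching layer for a rate-limited third-party API."))
```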
The progress AI research labs have made over the past few years is undeniably impressive. These groups are trailblazing toward technological breakthroughs, and their per-researcher output is remarkable. However, there have been surprisingly few major algorithmic breakthroughs since GPT-4, and o1-Preview faces challenges in terms of slowness, cost, and limited API features (e.g., JSON mode, streaming, caching, system prompts).
The consensus on o1-Preview is that it has made strides in certain areas, like zero-shot problem-solving and instruction following, but it still struggles with bugginess and cost. Many API developers still prefer other models, like Sonnet, over o1-Mini and o1-Preview for coding support. Google's Gemini 1.5 has also outperformed o1-Preview on math tasks, although its performance on legal use cases has been disappointing.
I also can't figure out how the unreleased "full o1" (whose benchmarks exceed o1-Preview's) differs from o1-Preview. Is it the scale of the pre-training data (I believe OpenAI said it is not)? Does it simply allow more compute time per answer? Is the reinforcement learning scaled further? Something else? OpenAI did say that they expect substantial and fast improvements to the o1 models, and that JSON mode and system prompts should arrive by the end of the year.
Below is a high-level overview of the strengths and weaknesses observed in the AI models currently available:
| Category | Current Quality Score (out of 10) | Notes | Workarounds |
|---|---|---|---|
| Frontier Intelligence | 9 | Sufficient for a large percentage of industry use cases, though there are areas for improvement. | Agentic flows and domain-specific data inclusion. |
| Context Windows | 8 | Context will soon be key: the models are competent but need human-level context. Most use cases fit easily within current context windows, but some powerful ones require more. | Chunking, retrieval-augmented generation (RAG), parallel processing, and summarization using agents (see the first sketch below). |
| Instruction Following | 7 | Models like o1-Preview continue to improve, though they still struggle with complex instructions and multiple simultaneous guidelines. | Careful prompt engineering, additional AIs using JSON mode for error handling, or wrappers providing equivalent error handling. |
| Bugginess | 7 | APIs can be buggy, often due to instruction failures or rate limits below expectations. | Retries with exponential back-off (see the second sketch below), additional JSON wrapper programming, and fallbacks across different lab APIs and cloud providers. |
| Cost | 7 | Costs go hand-in-hand with the other challenges. If zero-shot prompts consistently yielded accurate results, current costs would be acceptable, but we're not there yet. | Use smaller open-source models where possible to handle API calls. |
| Speed | 7 | Speed would be acceptable if answers were always accurate. Many loops are needed, so faster chain-of-thought (CoT) models would help. | Fine-tuning smaller models, designing UX to process in the background, and using different models for different API calls. |
| Confidentiality | 7 | Most labs now provide options where they do not look at submitted data. However, more built-in privacy features would be helpful. | Custom PII scrubbing at the application layer; private cloud tenants. |
| Comprehensiveness | 5 | A major obstacle to adoption, especially for tasks like extracting all fees from a lengthy contract where only a portion is detected. | Secondary AIs, prompt engineering, and iteration. |
| Consistent Accuracy | 5 | Hallucinations and inconsistency in retrieving correct information from training data are significant impediments. | External data and traditional programming for exact verification. |
| Working with Numbers | 5 | Substantial progress has been made, but traditional data handling and analytics still need improvement. | Have LLMs generate SQL, and rely on traditional AI/ML tools for analytics. |
| Tool Integration | 4 | Few native built-in tools to support complex problem-solving tasks. | External third-party tools and custom integrations. |
| Relevant Context Gathering | 3 | Largely falls to the application layer; labs provide limited support for effective context gathering. | Building company-specific data repositories and policies. |
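To make a couple of those workarounds concrete: the Context Windows row mentions chunking and agent-based summarization. Below is a minimal map-reduce sketch, again assuming the openai package; the model name and chunk size are illustrative.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # illustrative; any cheap, fast model works here

def ask(prompt: str) -> str:
    reply = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content

def summarize_long_document(text: str, chunk_chars: int = 12_000) -> str:
    """Map-reduce summarization: summarize each chunk, then merge the summaries."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partials = [ask(f"Summarize this excerpt:\n\n{chunk}") for chunk in chunks]
    return ask("Merge these partial summaries into one:\n\n" + "\n\n".join(partials))
```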
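And for the Bugginess and Instruction Following rows, a sketch of the retry-with-backoff and JSON-wrapper pattern; `ask_model` is a hypothetical stand-in for whatever API call you are hardening.

```python
import json
import random
import time

def call_with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # in practice, catch the client's rate-limit and server errors
            if attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter so parallel workers don't retry in lockstep
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))

def parse_json_or_raise(raw: str) -> dict:
    """Treat malformed JSON like any other transient failure so the loop retries."""
    return json.loads(raw)  # raises json.JSONDecodeError on bad model output

# Usage: ask_model is hypothetical; it should return the model's raw text reply.
# result = call_with_retries(lambda: parse_json_or_raise(ask_model(prompt)))
```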
o1-Preview is sufficient in some categories but deficient in others, and the same can be said of Sonnet and GPT-4o, which excel in mostly opposite areas. Many of these challenges can be solved by application-level software companies using additional algorithms to limit hallucinations, ensure comprehensiveness, and maintain consistency. However, these workarounds typically add engineering effort, cost, and latency.
On the other hand, this is how application-level software companies add value beyond the foundation models and create moats. Until true AGI arrives (most likely between 2028 and the late 2030s), the leading application-level companies will stand on top of the foundation models and provide substantially more accurate, comprehensive, and useful intelligence within their domains.
But if the foundation models can improve in the areas above, application companies will be able to deliver more value to industry, faster, creating a positive feedback loop: a healthy ecosystem in which AI software companies keep getting funded and keep driving huge amounts of API usage to the research labs, helping the labs raise their own valuations, scale compute, and build AGI.
My personal request of AI scientists and engineers is to focus less on pushing frontier intelligence and more on enhancing consistency, comprehensiveness, and accuracy. Current models are highly intelligent most of the time; pushing that reliability toward 99% is how we'll see real progress.