They promised us agents, but all we got were static chains

In the spring of 2023, the world was excited about the emergence of LLM-based AI agents. Powerful demonstrations like AutoGPT and BabyAGI showcased the potential of running an LLM in a loop: selecting the next action, observing its result, and selecting the next action, one step at a time (an approach also known as the ReAct framework). This new paradigm promised general-purpose agents that could perform autonomous, multi-step tasks: give one a goal and a set of tools, and it will take care of the rest. By the end of 2024, the landscape was full of AI agents and frameworks for building them. But how do they measure up against that promise?
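To make the loop concrete, here is a minimal sketch of a ReAct-style agent. The `call_llm` function and the `tools` dictionary are placeholders of my own, not any specific framework's API:

```python
# Minimal ReAct-style loop: the LLM picks the next action, we execute the
# chosen tool, feed the observation back, and repeat until the LLM decides
# to finish or the step budget runs out. `call_llm` is a placeholder.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to your LLM provider."""
    raise NotImplementedError

def react_agent(goal: str, tools: dict, max_steps: int = 10) -> str:
    history = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Previous steps: {history}\n"
            f"Available tools: {list(tools)}\n"
            "Reply with 'tool_name|tool_input' or 'FINISH|<final answer>'."
        )
        action, _, arg = call_llm(prompt).partition("|")
        action = action.strip()
        if action == "FINISH":
            return arg
        observation = tools[action](arg)          # execute the chosen tool
        history.append((action, arg, observation))
    return "Stopped: step budget exhausted."
```

Every step in this loop hinges on a free-form LLM decision, which is exactly where the compounding unpredictability described next comes from.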
It is safe to say that agents powered by the naive ReAct framework suffer from severe limitations. Give them a task that requires more than a few steps and the use of more than a few tools, and they fail miserably. Beyond the obvious latency problem, they lose track, fail to follow instructions, stop too early or too late, and produce wildly different results on each attempt. No wonder: the ReAct framework takes the unpredictability of an LLM and compounds it with the number of steps (if each step succeeds 95% of the time, a twenty-step task succeeds only about 36% of the time). Real-world use cases, however, and especially agent builders in the enterprise, cannot live with this level of performance. They need reliable, predictable, and explainable results for complex multi-step workflows. They need AI systems that mitigate, rather than amplify, the unpredictable nature of LLMs.
So how are agents for business use cases built today? For use cases that require more than a few tools and more than a few steps (e.g., conversational RAG), agent builders have largely abandoned the dynamic, autonomous ReAct approach in favor of methods that rely heavily on static chaining – crafting chains of predetermined steps designed to solve a specific use case. This approach resembles traditional software engineering and is a far cry from the promise of a ReAct agent. It achieves a higher level of control and reliability, but gives up autonomy and flexibility. The resulting solutions are development-intensive, narrow in application, and too rigid to handle high variance in the input space and environment.
To be sure, static chaining practices differ in just how "static" they are. Some chains use LLMs only to perform atomic steps (e.g., extract information, summarize text, or draft a message), while others also use LLMs to make some decisions dynamically at runtime (e.g., an LLM routes between alternative flows in the chain, or an LLM validates the result of a step to decide whether it should run again). Either way, to the extent that LLMs are responsible for any dynamic decisions in the solution, we inevitably get caught in a trade-off between reliability and autonomy. The more static the solution, the more reliable and predictable it is, but also the less autonomous – and therefore the narrower in application and the more development-intensive. The more dynamic and autonomous the solution, the more general-purpose and easier to build it is, but the less reliable and predictable it becomes.
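As an illustration of the "partially dynamic" end of that spectrum, here is a sketch of a static chain in which the overall flow is fixed in code and the LLM fills only two narrow dynamic roles, routing and validation. All helper functions are hypothetical stubs, not a real library:

```python
# A static chain for a support workflow: the control flow is fixed, and the
# LLM fills only two narrow dynamic roles, routing between predefined
# branches and validating the draft. All helpers are hypothetical stubs.

def call_llm(prompt: str) -> str: ...              # your LLM provider
def search_billing_docs(query: str) -> str: ...    # retrieval branch 1
def search_technical_docs(query: str) -> str: ...  # retrieval branch 2

def answer_question(query: str) -> str:
    # Dynamic decision 1: the LLM routes between two predefined flows.
    route = call_llm(f"Classify as 'billing' or 'technical': {query}").strip()
    context = (search_billing_docs(query) if route == "billing"
               else search_technical_docs(query))

    draft = call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")

    # Dynamic decision 2: the LLM validates the result and triggers one rerun.
    verdict = call_llm(f"Is the answer grounded in the context? yes/no\n"
                       f"Answer: {draft}\nContext: {context}")
    if not verdict.strip().lower().startswith("yes"):
        draft = call_llm(f"Rewrite the answer, citing only the context:\n"
                         f"{context}\n\nQuestion: {query}")
    return draft
```

Everything except those two LLM decisions is ordinary, deterministic code, which is why chains like this are reliable but narrow.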
This trade-off can be pictured as a two-by-two chart: reliability and predictability on one axis, autonomy and flexibility on the other.
This begs the question: why haven't we seen an agent framework that sits in the upper-right quadrant? Are we doomed to forever trade reliability against autonomy? Can't we have the simple interface of a ReAct agent (take a goal and a set of tools and figure it out) without sacrificing reliability?
The answer is – we can, and we will! But to get there, we need to recognize what we have been doing wrong. All current agent-building frameworks share a common flaw: they rely on the LLM as the dynamic, autonomous component. The key ingredient we are missing – the one needed to create an agent that is both autonomous and reliable – is planning technology. And LLMs are not great planners.
But first, what is "planning"? By planning we mean explicitly modeling the alternative courses of action that could lead to the desired outcome, and efficiently exploring and exploiting those alternatives. Planning should happen at both the macro and the micro level. Macro-level planning breaks the task down into the dependent and independent steps that must be executed to achieve the desired outcome. What is often overlooked is the need for micro-level planning to ensure the desired outcome at the level of a single step. There are many strategies that improve reliability and the likelihood of success at the single-step level by spending more inference-time compute: for example, paraphrasing a semantic search query multiple times and retrieving more context for each paraphrase, using a larger model, or sampling more completions from the LLM – all of which increase the chance of obtaining a satisfactory result. A good micro-planner uses inference-time compute effectively, achieving the best possible result within a given compute and latency budget and scaling the resource investment up or down according to the specific task at hand. In this way, a planning AI system can mitigate the probabilistic nature of LLMs and provide guarantees at the step level. Without such guarantees, we are back to the problem of compounding errors, which can break even the best macro-level plan.
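To make micro-planning concrete, here is a sketch of a single-step executor that escalates inference-time compute (more samples, then a larger model) until a step-level check is satisfied or the budget runs out. The rung definitions, model names, costs, and helpers are illustrative assumptions, not a real API:

```python
# Micro-planning sketch: spend more inference-time compute on a single step
# until the result passes a step-level check or the budget is exhausted.
# Models, costs, and helper functions are illustrative assumptions.

RUNGS = [
    {"model": "small-model", "samples": 1, "cost": 1},
    {"model": "small-model", "samples": 5, "cost": 5},   # self-consistency
    {"model": "large-model", "samples": 3, "cost": 15},
]

def generate(model: str, prompt: str) -> str: ...   # placeholder LLM call
def validate(result: str) -> bool: ...              # placeholder step-level check

def run_step(prompt: str, budget: int):
    spent = 0
    for rung in RUNGS:
        if spent + rung["cost"] > budget:
            break                                    # respect the budget
        spent += rung["cost"]
        for _ in range(rung["samples"]):
            result = generate(rung["model"], prompt)
            if validate(result):
                return result                        # step-level guarantee met
    return None                                      # report failure to the macro plan
```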
But why can't the LLM do the planning? After all, LLMs can turn high-level instructions into plausible chains of thought, or into plans expressed in natural language or code. The reason is that planning requires more than that. Planning requires modeling the alternative courses of action that could plausibly lead to the desired outcome, and reasoning about the expected utility and the expected cost (compute and/or latency) of each alternative. While LLMs can plausibly generate representations of the available courses of action, they cannot predict the corresponding expected utilities and costs. For example, what are the expected utility and cost of using model X versus model Y for a given step? What is the expected utility of looking up a specific piece of information in indexed documents versus calling the CRM API? Your LLM has no clue – and for good reason: historical traces of these probabilistic properties are rarely found in the wild, so they are not part of the LLM's training data. Unlike the general knowledge LLMs do acquire, these properties also tend to be specific to the particular tools and data environment in which the AI system will run. And even if an LLM could predict the expected utilities and costs, choosing the most effective course of action is a logical, decision-theoretic deduction – not something the LLM's next-token prediction can be assumed to perform reliably.
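The decision-theoretic deduction meant here is easy to state, even though it is not something next-token prediction performs reliably: given estimates of success probability, utility, and cost for each alternative, pick the one with the highest expected net value. A toy sketch with made-up numbers:

```python
# Toy decision-theoretic choice between alternative courses of action.
# The probabilities, utilities, and costs are made-up estimates; in practice
# they would have to be learned for the specific tool and data environment.

alternatives = [
    {"name": "lookup_indexed_docs", "p_success": 0.6, "utility": 10, "cost": 1},
    {"name": "call_crm_api",        "p_success": 0.9, "utility": 10, "cost": 2},
]

def expected_net_value(alt: dict) -> float:
    return alt["p_success"] * alt["utility"] - alt["cost"]

best = max(alternatives, key=expected_net_value)
print(best["name"], expected_net_value(best))   # -> call_crm_api 7.0
```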
So what are the missing components of AI planning technology? We need planner models that can learn, from experience and from simulation, to explicitly model the probabilities of accomplishing specific tasks, and the corresponding utilities and costs, in a particular tool and data environment. We need a plan definition language (PDL) that can represent and support reasoning over these courses of action and probabilities. And we need an execution engine that can deterministically and efficiently execute a given plan defined in the PDL.
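Nothing like this PDL exists off the shelf as described, but as a purely hypothetical illustration, a plan expressed in such a language might capture something like the following: macro-level steps with explicit dependencies, and per-step alternatives annotated with learned probability and cost estimates (every name and number here is invented):

```python
# Hypothetical sketch of what a plan in a plan definition language (PDL)
# might capture: steps with dependencies, plus alternative actions per step
# annotated with learned estimates. Every name and number is invented.

plan = {
    "goal": "answer_customer_question",
    "steps": [
        {
            "id": "gather_context",
            "depends_on": [],
            "alternatives": [
                {"action": "search_index",  "p_success": 0.7, "cost": 1},
                {"action": "query_crm_api", "p_success": 0.9, "cost": 4},
            ],
        },
        {
            "id": "draft_answer",
            "depends_on": ["gather_context"],
            "alternatives": [
                {"action": "small_model_draft", "p_success": 0.75, "cost": 1},
                {"action": "large_model_draft", "p_success": 0.95, "cost": 5},
            ],
        },
    ],
}
# An execution engine would walk the dependency graph deterministically,
# choosing among alternatives according to the planner's estimates and the budget.
```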
Some of us are already hard at work delivering on this promise. Until then, go ahead and keep building static chains – just please don't call them "agents".