Table of Contents
Agent Overview
Tools
Knowledge augmentation
Capability extension
Write actions
Planning
Planning overview
Foundation models as planners
Plan generation
Function calling
Planning granularity
Complex plans
Reflection and error correction
Tool selection
Agent Failure Modes and Evaluation
Planning failures
Tool failures
Efficiency
Conclusion
Agents
Intelligent agents are considered by many to be the ultimate goal of AI. The classic book by Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach (Prentice Hall, 1995), defines the field of AI research as “the study and design of rational agents.”
The unprecedented capabilities of foundation models have opened the door to agentic applications that were previously unimaginable. These new capabilities make it finally possible to develop autonomous, intelligent agents to act as our assistants, coworkers, and coaches. They can help us create a website, gather data, plan a trip, do market research, manage a customer account, automate data entry, prepare us for interviews, interview our candidates, negotiate a deal, etc. The possibilities seem endless, and the potential economic value of these agents is enormous.
This section will start with an overview of agents and then continue with two aspects that determine the capabilities of an agent: tools and planning. Agents, with their new modes of operations, have new modes of failure. This section will end with a discussion on how to evaluate agents to catch these failures.
Agent Overview
The term agent has been used in many different engineering contexts, including but not limited to a software agent, intelligent agent, user agent, conversational agent, and reinforcement learning agent. So, what exactly is an agent?
An agent is anything that can perceive its environment and act upon that environment. Artificial Intelligence: A Modern Approach (1995) defines an agent as anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.
This means that an agent is characterized by the environment it operates in and the set of actions it can perform.
The environment an agent can operate in is defined by its use case. If an agent is developed to play a game (e.g., Minecraft, Go, Dota), that game is its environment. If you want an agent to scrape documents from the internet, the environment is the internet. A self-driving car agent’s environment is the road system and its adjacent areas.
The set of actions an AI agent can perform is augmented by the tools it has access to. Many generative AI-powered applications you interact with daily are agents with access to tools, albeit simple ones. ChatGPT is an agent. It can search the web, execute Python code, and generate images. RAG systems are agents—text retrievers, image retrievers, and SQL executors are their tools.
There’s a strong dependency between an agent’s environment and its set of tools. The environment determines what tools an agent can potentially use. For example, if the environment is a chess game, the only possible actions for an agent are the valid chess moves. However, an agent’s tool inventory restricts the environment it can operate in. For example, if a robot’s only action is swimming, it’ll be confined to a water environment.
Figure 1. SWE-agent is a coding agent whose environment is the computer and whose actions include navigating the repo, searching files, viewing files, and editing lines
Figure 1 shows a visualization of SWE-agent (Yang et al., 2024), an agent built on top of GPT-4. Its environment is the computer with the terminal and the file system. Its set of actions includes navigate repo, search files, view files, and edit lines.
An AI agent is meant to accomplish tasks typically provided by the users. In an AI agent, AI is the brain that processes the task, plans a sequence of actions to achieve this task, and determines whether the task has been accomplished.
Let’s return to the RAG system with tabular data in the Kitty Vogue example above. This is a simple agent with three actions:
response generation,
SQL query generation, and
SQL query execution.
Given the query "Project the sales revenue for Fruity Fedora over the next three months", the agent might perform the following sequence of actions:
- Reason about how to accomplish this task. It might decide that to predict future sales, it first needs the sales numbers from the last five years. An agent’s reasoning can be shown as intermediate responses.
- Invoke SQL query generation to generate the query to get sales numbers from the last five years.
- Invoke SQL query execution to execute this query.
- Reason about the tool outputs (outputs from the SQL query execution) and how they help with sales prediction. It might decide that these numbers are insufficient to make a reliable projection, perhaps because of missing values. It then decides that it also needs information about past marketing campaigns.
- Invoke SQL query generation to generate the queries for past marketing campaigns.
- Invoke SQL query execution.
- Reason that this new information is sufficient to help predict future sales. It then generates a projection.
- Reason that the task has been successfully completed.
Compared to non-agent use cases, agents typically require more powerful models for two reasons:
Compound mistakes: an agent often needs to perform multiple steps to accomplish a task, and the overall accuracy decreases as the number of steps increases. If the model’s accuracy is 95% per step, over 10 steps, the accuracy will drop to 60%, and over 100 steps, the accuracy will be only 0.6% (see the quick calculation below).
Higher stakes: with access to tools, agents are capable of performing more impactful tasks, but any failure could have more severe consequences.
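To make the compound-mistakes arithmetic concrete, here’s a quick sketch; the 95% per-step accuracy is the illustrative figure from above:

def overall_accuracy(per_step_accuracy: float, num_steps: int) -> float:
    # Assumes each step succeeds independently with the same accuracy.
    return per_step_accuracy ** num_steps

print(overall_accuracy(0.95, 10))   # ~0.60
print(overall_accuracy(0.95, 100))  # ~0.006, i.e., 0.6%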
A task that requires many steps can take time and money to run. A common complaint is that agents are only good for burning through your API credits. However, if agents can be autonomous, they can save human time, making their costs worthwhile.
Given an environment, the success of an agent depends on the tools it has access to and the strength of its AI planner. Let’s start by looking into different kinds of tools a model can use. We’ll analyze AI’s capability for planning next.
Tools
A system doesn’t need access to external tools to be an agent. However, without external tools, the agent’s capabilities would be limited. By itself, a model can typically perform one action—an LLM can generate text and an image generator can generate images. External tools make an agent vastly more capable.
Tools help an agent to both perceive the environment and act upon it. Actions that allow an agent to perceive the environment are read-only actions, whereas actions that allow an agent to act upon the environment are write actions.
The set of tools an agent has access to is its tool inventory. Since an agent’s tool inventory determines what an agent can do, it’s important to think through what and how many tools to give an agent. More tools give an agent more capabilities. However, the more tools there are, the more challenging it is to understand and utilize them well. Experimentation is necessary to find the right set of tools, as discussed later in the “Tool selection” section.
Depending on the agent’s environment, there are many possible tools. Here are three categories of tools that you might want to consider: knowledge augmentation (i.e., context construction), capability extension, and tools that let your agent act upon its environment.
Knowledge augmentation
I hope that this book, so far, has convinced you of the importance of having the relevant context for a model’s response quality. An important category of tools includes those that help augment the knowledge of your agent. Some of them have already been discussed: text retriever, image retriever, and SQL executor. Other potential tools include internal people search, an inventory API that returns the status of different products, Slack retrieval, an email reader, etc.
Many such tools augment a model with your organization’s private processes and information. However, tools can also give models access to public information, especially from the internet.
Web browsing was among the earliest and most anticipated capabilities to be incorporated into ChatGPT. Web browsing prevents a model from going stale. A model goes stale when the data it was trained on becomes outdated. If the model’s training data was cut off last week, it won’t be able to answer questions that require information from this week unless this information is provided in the context. Without web browsing, a model won’t be able to tell you about the weather, news, upcoming events, stock prices, flight status, etc.
I use web browsing as an umbrella term to cover all tools that access the internet, including web browsers and APIs such as search APIs, news APIs, GitHub APIs, or social media APIs.
While web browsing allows your agent to reference up-to-date information to generate better responses and reduce hallucinations, it can also open up your agent to the cesspools of the internet. Select your Internet APIs with care.
Capability extension
You might also consider tools that address the inherent limitations of AI models. They are easy ways to give your model a performance boost. For example, AI models are notorious for being bad at math. If you ask a model what is 199,999 divided by 292, the model will likely fail. However, this calculation would be trivial if the model had access to a calculator. Instead of trying to train the model to be good at arithmetic, it’s a lot more resource-efficient to just give the model access to a tool.
Other simple tools that can significantly boost a model’s capability include a calendar, a timezone converter, a unit converter (e.g., from lbs to kg), and a translator that can translate to and from the languages that the model isn’t good at.
More complex but powerful tools are code interpreters. Instead of training a model to understand code, you can give it access to a code interpreter to execute a piece of code, return the results, or analyze the code’s failures. This capability lets your agents act as coding assistants, data analysts, and even research assistants that can write code to run experiments and report results. However, automated code execution comes with the risk of code injection attacks, as discussed in Chapter 5 in the section “Defensive Prompt Engineering”. Proper security measures are crucial to keep you and your users safe.
Tools can turn a text-only or image-only model into a multimodal model. For example, a model that can generate only texts can leverage a text-to-image model as a tool, allowing it to generate both texts and images. Given a text request, the agent’s AI planner decides whether to invoke text generation, image generation, or both. This is how ChatGPT can generate both text and images—it uses DALL-E as its image generator.
Agents can also use a code interpreter to generate charts and graphs, a LaTeX compiler to render math equations, or a browser to render web pages from HTML code.
Similarly, a model that can process only text inputs can use an image captioning tool to process images and a transcription tool to process audio. It can use an OCR (optical character recognition) tool to read PDFs.
Tool use can significantly boost a model’s performance compared to just prompting or even finetuning. Chameleon (Lu et al., 2023) shows that a GPT-4-powered agent, augmented with a set of 13 tools, can outperform GPT-4 alone on several benchmarks. Examples of tools this agent used are knowledge retrieval, a query generator, an image captioner, a text detector, and Bing search.
On ScienceQA, a science question answering benchmark, Chameleon improves the best published few-shot result by 11.37%. On TabMWP (Tabular Math Word Problems) (Lu et al., 2022), a benchmark involving tabular math questions, Chameleon improves the accuracy by 17%.
Write actions
So far, we’ve discussed read-only actions that allow a model to read from its data sources. But tools can also perform write actions, making changes to the data sources. An SQL executor can retrieve a data table (read) and change or delete the table (write). An email API can read an email but can also respond to it. A banking API can retrieve your current balance, but can also initiate a bank transfer.
Write actions enable a system to do more. They can enable you to automate the whole customer outreach workflow: researching potential customers, finding their contacts, drafting emails, sending first emails, reading responses, following up, extracting orders, updating your databases with new orders, etc.
However, the prospect of giving AI the ability to automatically alter our lives is frightening. Just as you shouldn’t give an intern the authority to delete your production database, you shouldn’t allow an unreliable AI to initiate bank transfers. Trust in the system’s capabilities and its security measures is crucial. You need to ensure that the system is protected from bad actors who might try to manipulate it into performing harmful actions.
Sidebar: Agents and security
Whenever I talk about autonomous AI agents to a group of people, there is often someone who brings up self-driving cars. “What if someone hacks into the car to kidnap you?” While the self-driving car example seems visceral because of its physicality, an AI system can cause harm without a presence in the physical world. It can manipulate the stock market, steal copyrights, violate privacy, reinforce biases, spread misinformation and propaganda, and more, as discussed in the section “Defensive Prompt Engineering” in Chapter 5.
These are all valid concerns, and any organization that wants to leverage AI needs to take safety and security seriously. However, this doesn’t mean that AI systems should never be given the ability to act in the real world. If we can trust a machine to take us into space, I hope that one day, security measures will be sufficient for us to trust autonomous AI systems. Besides, humans can fail, too. Personally, I would trust a self-driving car more than the average stranger to give me a lift.
Just as the right tools can help humans be vastly more productive—can you imagine doing business without Excel or building a skyscraper without cranes?—tools enable models to accomplish many more tasks. Many model providers already support tool use with their models, a feature often called function calling. Going forward, I would expect function calling with a wide set of tools to be common with most models.
Planning
At the heart of a foundation model agent is the model responsible for solving user-provided tasks. A task is defined by its goal and constraints. For example, one task is to schedule a two-week trip from San Francisco to India with a budget of $5,000. The goal is the two-week trip. The constraint is the budget.
Complex tasks require planning. The output of the planning process is a plan, which is a roadmap outlining the steps needed to accomplish a task. Effective planning typically requires the model to understand the task, consider different options to achieve this task, and choose the most promising one.
If you’ve ever been in any planning meeting, you know that planning is hard. As an important computational problem, planning is well studied and would require several volumes to cover. I’ll only be able to scratch the surface here.
Planning overview
Given a task, there are many possible ways to solve it, but not all of them will lead to a successful outcome. Among the correct solutions, some are more efficient than others. Consider the query, "How many companies without revenue have raised at least $1 billion?", and the two example solutions:
Find all companies without revenue, then filter them by the amount raised.
Find all companies that have raised at least $1 billion, then filter them by revenue.
The second option is more efficient. There are vastly more companies without revenue than companies that have raised $1 billion. Given only these two options, an intelligent agent should choose option 2.
You can couple planning with execution in the same prompt. For example, you give the model a prompt, ask it to think step by step (such as with a chain-of-thought prompt), and then execute those steps all in one prompt. But what if the model comes up with a 1,000-step plan that doesn’t even accomplish the goal? Without oversight, an agent can run those steps for hours, wasting time and money on API calls, before you realize that it’s not going anywhere.
To avoid fruitless execution, planning should be decoupled from execution. You ask the agent to first generate a plan, and only after this plan is validated is it executed. The plan can be validated using heuristics. For example, one simple heuristic is to eliminate plans with invalid actions. If the generated plan requires a Google search and the agent doesn’t have access to Google Search, this plan is invalid. Another simple heuristic might be eliminating all plans with more than X steps.
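For illustration, here’s a minimal sketch of these two heuristics in Python, assuming a plan is represented as a list of action names (as in the plan examples later in this section); the names here are hypothetical:

def is_valid_plan(plan: list[str], tool_inventory: set[str], max_steps: int = 10) -> bool:
    # Heuristic 2: eliminate plans with more than max_steps steps.
    if len(plan) > max_steps:
        return False
    # Heuristic 1: eliminate plans containing actions not in the tool inventory.
    return all(action in tool_inventory for action in plan)

# A plan that requires google_search is invalid if the agent
# doesn't have access to that tool.
inventory = {"fetch_top_products", "fetch_product_info", "generate_response"}
print(is_valid_plan(["google_search", "generate_response"], inventory))  # False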
A plan can also be validated using AI judges. You can ask a model to evaluate whether the plan seems reasonable or how to improve it.
If the generated plan is evaluated to be bad, you can ask the planner to generate another plan. If the generated plan is good, execute it.
Figure 2. Decoupling planning and execution so that only validated plans are executed
If the plan consists of external tools, function calling will be invoked. Outputs from executing this plan will then again need to be evaluated. Note that the generated plan doesn’t have to be an end-to-end plan for the whole task. It can be a small plan for a subtask. The whole process looks like Figure 2.
Your system now has three components: one to generate plans, one to validate plans, and another to execute plans. If you consider each component an agent, this can be considered a multi-agent system. Because most agentic workflows are sufficiently complex to involve multiple components, most agents are multi-agent.
To speed up the process, instead of generating plans sequentially, you can generate several plans in parallel and ask the evaluator to pick the most promising one. This is another latency–cost tradeoff, as generating multiple plans simultaneously will incur extra costs.
Planning requires understanding the intention behind a task: what’s the user trying to do with this query? An intent classifier is often used to help agents plan. As shown in Chapter 5 in the section “Break complex tasks into simpler subtasks“, intent classification can be done using another prompt or a classification model trained for this task. The intent classification mechanism can be considered another agent in your multi-agent system.
Knowing the intent can help the agent pick the right tools. For example, for customer support, if the query is about billing, the agent might need access to a tool to retrieve a user’s recent payments. But if the query is about how to reset a password, the agent might need to access documentation retrieval.
Tip:
Some queries might be out of the scope of the agent. The intent classifier should be able to classify requests as IRRELEVANT so that the agent can politely reject those instead of wasting FLOPs coming up with impossible solutions.
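A minimal sketch of such an intent classifier, written as a prompt in the same style as the plan-generation prompt shown later in this section; the intent labels are hypothetical:

SYSTEM PROMPT:
Classify the user's query into one of 3 intents:
* BILLING
* PASSWORD_RESET
* IRRELEVANT
Respond with the intent label only.

Examples

Query: "Why was I charged twice last month?"
Intent: BILLING

Query: "Write me a poem about otters."
Intent: IRRELEVANT

Query: {USER INPUT}
Intent: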
So far, we’ve assumed that the agent automates all three stages: generating plans, validating plans, and executing plans. In reality, humans can be involved at any stage to aid with the process and mitigate risks.
A human expert can provide a plan, validate a plan, or execute parts of a plan. For example, for complex tasks for which an agent has trouble generating the whole plan, a human expert can provide a high-level plan that the agent can expand upon.
If a plan involves risky operations, such as updating a database or merging a code change, the system can ask for explicit human approval before executing or defer to humans to execute these operations. To make this possible, you need to clearly define the level of automation an agent can have for each action.
To summarize, solving a task typically involves the following processes. Note that reflection isn’t mandatory for an agent, but it’ll significantly boost the agent’s performance.
Plan generation: come up with a plan for accomplishing this task. A plan is a sequence of manageable actions, so this process is also called task decomposition.
Reflection and error correction: evaluate the generated plan. If it’s a bad plan, generate a new one.
Execution: take actions outlined in the generated plan. This often involves calling specific functions.
Reflection and error correction: upon receiving the action outcomes, evaluate these outcomes and determine whether the goal has been accomplished. Identify and correct mistakes. If the goal is not completed, generate a new plan.
You’ve already seen some techniques for plan generation and reflection in this book. When you ask a model to “think step by step”, you’re asking it to decompose a task. When you ask a model to “verify if your answer is correct”, you’re asking it to reflect.
Foundation models as planners
An open question is how well foundation models can plan. Many researchers believe that foundation models, at least those built on top of autoregressive language models, cannot. Meta’s Chief AI Scientist Yann LeCun states unequivocally that autoregressive LLMs can’t plan (2023).
While there is a lot of anecdotal evidence that LLMs are poor planners, it’s unclear whether it’s because we don’t know how to use LLMs the right way or because LLMs, fundamentally, can’t plan.
Planning, at its core, is a search problem. You search among different paths towards the goal, predict the outcome (reward) of each path, and pick the path with the most promising outcome. Often, you might determine that no path exists that can take you to the goal.
Search often requires backtracking. For example, imagine you’re at a step where there are two possible actions: A and B. After taking action A, you enter a state that’s not promising, so you need to backtrack to the previous state to take action B.
Some people argue that an autoregressive model can only generate forward actions. It can’t backtrack to generate alternate actions. Because of this, they conclude that autoregressive models can’t plan. However, this isn’t necessarily true. After executing a path with action A, if the model determines that this path doesn’t make sense, it can revise the path using action B instead, effectively backtracking. The model can also always start over and choose another path.
It’s also possible that LLMs are poor planners because they aren’t given the toolings needed to plan. To plan, it’s necessary to know not only the available actions but also the potential outcome of each action. As a simple example, let’s say you want to walk up a mountain. Your potential actions are turn right, turn left, turn around, or go straight ahead. However, if turning right will cause you to fall off the cliff, you might not consider this action. In technical terms, an action takes you from one state to another, and it’s necessary to know the outcome state to determine whether to take an action.
This means that prompting a model to generate only a sequence of actions like what the popular chain-of-thought prompting technique does isn’t sufficient. The paper “Reasoning with Language Model is Planning with World Model” (Hao et al., 2023) argues that an LLM, by containing so much information about the world, is capable of predicting the outcome of each action. This LLM can incorporate this outcome prediction to generate coherent plans.
Even if AI can’t plan, it can still be a part of a planner. It might be possible to augment an LLM with a search tool and state tracking system to help it plan.
The agent is a core concept in RL, which is defined in Wikipedia as a field “concerned with how an intelligent agent ought to take actions in a dynamic environment in order to maximize the cumulative reward.”
RL agents and FM agents are similar in many ways. They are both characterized by their environments and possible actions. The main difference is in how their planners work.
In an RL agent, the planner is trained by an RL algorithm. Training this RL planner can require a lot of time and resources.
In an FM agent, the model is the planner. This model can be prompted or finetuned to improve its planning capabilities, and generally requires less time and fewer resources.
However, there’s nothing to prevent an FM agent from incorporating RL algorithms to improve its performance. I suspect that in the long run, FM agents and RL agents will merge.
Plan generation
The simplest way to turn a model into a plan generator is with prompt engineering. Imagine that you want to create an agent to help customers learn about products at Kitty Vogue. You give this agent access to three external tools: retrieve products by price, retrieve top products, and retrieve product information. Here’s an example of a prompt for plan generation. This prompt is for illustration purposes only. Production prompts are likely more complex.
SYSTEM PROMPT:
Propose a plan to solve the task. You have access to 5 actions:
* get_today_date()
* fetch_top_products(start_date, end_date, num_products)
* fetch_product_info(product_name)
* generate_query(task_history, tool_output)
* generate_response(query)
The plan must be a sequence of valid actions.
Examples
Task: "Tell me about Fruity Fedora"
Plan: [fetch_product_info, generate_query, generate_response]
Task: "What was the best selling product last week?"
Plan: [fetch_top_products, generate_query, generate_response]
Task: {USER INPUT}
Plan:
There are two things to note about this example:
The plan format used here—a list of functions whose parameters are inferred by the agent—is just one of many ways to structure the agent control flow.
The generate_query function takes in the task’s current history and the most recent tool outputs to generate a query to be fed into the response generator. The tool output at each step is added to the task’s history.
Given the user input “What’s the price of the best-selling product last week”, a generated plan might look like this:
get_today_date()
fetch_top_products()
fetch_product_info()
generate_query()
generate_response()
You might wonder, “What about the parameters needed for each function?” The exact parameters are hard to predict in advance since they are often extracted from the previous tool outputs. If the first step, get_today_date(), outputs “2030-09-13”, the agent can reason that the next step should be called with the following parameters:
fetch_top_products(
start_date="2030-09-07",
end_date="2030-09-13",
num_products=1
)
Often, there’s insufficient information to determine the exact parameter values for a function. For example, if a user asks “What’s the average price of best-selling products?”, the answers to the following questions are unclear:
How many best-selling products does the user want to look at?
Does the user want the best-selling products last week, last month, or of all time?
This means that models frequently have to make guesses, and guesses can be wrong.
Because both the action sequence and the associated parameters are generated by AI models, they can be hallucinated. Hallucinations can cause the model to call an invalid function or call a valid function but with wrong parameters. Techniques for improving a model’s performance in general can be used to improve a model’s planning capabilities.
Tips for making an agent better at planning.
Write a better system prompt with more examples.
Give better descriptions of the tools and their parameters so that the model understands them better.
Rewrite the functions themselves to make them simpler, such as refactoring a complex function into two simpler functions.
Use a stronger model. In general, stronger models are better at planning.
Finetune a model for plan generation.
Function calling
Many model providers offer tool use for their models, effectively turning their models into agents. A tool is a function. Invoking a tool is, therefore, often called function calling. Different model APIs work differently, but in general, function calling works as follows:
Create a tool inventory. Declare all the tools that you might want a model to use. Each tool is described by its execution entry point (e.g., its function name), its parameters, and its documentation (e.g., what the function does and what parameters it needs).
Specify what tools the agent can use for a query.
Because different queries might need different tools, many APIs let you specify a list of declared tools to be used per query. Some let you control tool use further by the following settings:
required: the model must use at least one tool.
none: the model shouldn’t use any tool.
auto: the model decides which tools to use.
Function calling is illustrated in Figure 3. This is written in pseudocode to make it representative of multiple APIs. To use a specific API, please refer to its documentation.
Figure 3. An example of a model using two simple tools
Given a query, an agent defined as in Figure 3 will automatically generate what tools to use and their parameters. Some function calling APIs will make sure that only valid functions are generated, though they won’t be able to guarantee the correct parameter values.
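Here’s a minimal sketch of this flow in Python-style pseudocode, mirroring the steps above; the client, function, and field names are illustrative, not a specific provider’s API:

# 1. Create a tool inventory: entry point, parameters, and documentation.
tools = [
    {
        "name": "lbs_to_kg",
        "description": "Convert a weight from pounds to kilograms.",
        "parameters": {"lbs": "float, the weight in pounds"},
    },
]

# 2. Specify what tools can be used for this query, and let the
#    model decide whether and how to call them.
response = model.generate(
    messages=[{"role": "user", "content": "How many kilograms are 40 pounds?"}],
    tools=tools,
    tool_choice="auto",  # or "required" / "none", as described above
)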
For example, given the user query “How many kilograms are 40 pounds?”, the agent might decide that it needs the tool lbs_to_kg with one parameter value of 40. The agent’s response might look like this.
response = ModelResponse(
finish_reason='tool_calls',
message=chat.Message(
content=None,
role='assistant',
tool_calls=[
ToolCall(
function=Function(
arguments='{"lbs":40}',
name='lbs_to_kg'),
type='function')
])
)
From this response, you can invoke the function lbs_to_kg(lbs=40) and use its output to generate a response to the users.
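A minimal sketch of executing that tool call, assuming the response object above and a locally defined lbs_to_kg function:

import json

def lbs_to_kg(lbs: float) -> float:
    return lbs * 0.45359237  # kilograms per pound

# Dispatch each tool call requested by the model.
for tool_call in response.message.tool_calls:
    if tool_call.function.name == "lbs_to_kg":
        args = json.loads(tool_call.function.arguments)  # {"lbs": 40}
        result = lbs_to_kg(**args)                       # ~18.14
        # Feed `result` back to the model to generate the final response.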
Tip: When working with agents, always ask the system to report what parameter values it uses for each function call. Inspect these values to make sure they are correct.
Planning granularity
A plan is a roadmap outlining the steps needed to accomplish a task. A roadmap can be of different levels of granularity. To plan for a year, a quarter-by-quarter plan is higher-level than a month-by-month plan, which is, in turn, higher-level than a week-to-week plan.
There’s a planning/execution tradeoff. A detailed plan is harder to generate, but easier to execute. A higher-level plan is easier to generate, but harder to execute. An approach to circumvent this tradeoff is to plan hierarchically. First, use a planner to generate a high-level plan, such as a quarter-by-quarter plan. Then, for each quarter, use the same or a different planner to generate a month-by-month plan.
So far, all examples of generated plans use the exact function names, which is very granular. A problem with this approach is that an agent’s tool inventory can change over time. For example, the function to get the current date get_time() can be renamed to get_current_time(). When a tool changes, you’ll need to update your prompt and all your examples. Using the exact function names also makes it harder to reuse a planner across different use cases with different tool APIs.
If you’ve previously finetuned a model to generate plans based on the old tool inventory, you’ll need to finetune the model again on the new tool inventory.
To avoid this problem, plans can also be generated using a more natural language, which is higher-level than domain-specific function names. For example, given the query “What’s the price of the best-selling product last week”, an agent can be instructed to output a plan that looks like this:
get current date
retrieve the best-selling product last week
retrieve product information
generate query
generate response
Using more natural language helps your plan generator become robust to changes in tool APIs. If your model was trained mostly on natural language, it’ll likely be better at understanding and generating plans in natural language and less likely to hallucinate.
The downside of this approach is that you need a translator to translate each natural language action into executable commands. Chameleon (Lu et al., 2023) calls this translator a program generator. However, translating is a much simpler task than planning and can be done by weaker models with a lower risk of hallucination.
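A minimal sketch of such a translator, assuming the natural-language plan above; in a real system the mapping would be produced by a model rather than hardcoded:

# Map natural-language actions to executable function names.
ACTION_TO_FUNCTION = {
    "get current date": "get_today_date",
    "retrieve the best-selling product last week": "fetch_top_products",
    "retrieve product information": "fetch_product_info",
    "generate query": "generate_query",
    "generate response": "generate_response",
}

def translate(plan: list[str]) -> list[str]:
    return [ACTION_TO_FUNCTION[action] for action in plan]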
Complex plans
The plan examples so far have been sequential: the next action in the plan is always executed after the previous action is done. The order in which actions can be executed is called a control flow. The sequential form is just one type of control flow. Other types of control flows include the parallel, if statement, and for loop. The list below provides an overview of each control flow, including sequential for comparison:
- Sequential
Executing task B after task A is complete, possibly because task B depends on task A. For example, the SQL query can only be executed after it’s been translated from the natural language input.
- Parallel
Executing tasks A and B at the same time. For example, given the query “Find me best-selling products under $100”, an agent might first retrieve the top 100 best-selling products and, for each of these products, retrieve its price.
- If statement
Executing task B or task C depending on the output from the previous step. For example, the agent first checks NVIDIA’s earnings report. Based on this report, it can then decide to sell or buy NVIDIA stocks. Anthropic’s post calls this pattern “routing”.
- For loop
Repeat executing task A until a specific condition is met. For example, keep generating random numbers until a prime number is generated.
These different control flows are visualized in Figure 4.
Figure 4. Examples of different orders in which a plan can be executed
In traditional software engineering, conditions for control flows are exact. With AI-powered agents, AI models determine control flows. Plans with non-sequential control flows are more difficult to both generate and translate into executable commands.
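For illustration, here’s one way a plan with non-sequential control flows might be represented; this is a hypothetical structure, not any specific framework’s format:

# A plan as a tree of control-flow nodes instead of a flat list of actions.
plan = {
    "type": "sequential",
    "steps": [
        {"type": "action", "name": "fetch_top_products"},
        # Parallel: retrieve each product's info at the same time.
        {"type": "parallel",
         "steps": [{"type": "action", "name": "fetch_product_info"}]},
        # If statement: the branch taken depends on a model-evaluated condition.
        {"type": "if",
         "condition": "any product is under $100",
         "then": {"type": "action", "name": "generate_response"},
         "else": {"type": "action", "name": "generate_query"}},
    ],
}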
Tip:
When evaluating an agent framework, check what control flows it supports. For example, if the system needs to browse ten websites, can it do so simultaneously? Parallel execution can significantly reduce the latency perceived by users.
Reflection and error correction
Even the best plans need to be constantly evaluated and adjusted to maximize their chance of success. While reflection isn’t strictly necessary for an agent to operate, it’s necessary for an agent to succeed.
There are many places during a task process where reflection can be useful:
After receiving a user query to evaluate if the request is feasible.
After the initial plan generation to evaluate whether the plan makes sense.
After each execution step to evaluate if it’s on the right track.
After the whole plan has been executed to determine if the task has been accomplished.
Reflection and error correction are two different mechanisms that go hand in hand. Reflection generates insights that help uncover errors to be corrected.
Reflection can be done with the same agent with self-critique prompts. It can also be done with a separate component, such as a specialized scorer: a model that outputs a concrete score for each outcome.
First proposed by ReAct (Yao et al., 2022), interleaving reasoning and action has become a common pattern for agents. Yao et al. used the term “reasoning” to encompass both planning and reflection. At each step, the agent is asked to explain its thinking (planning), take actions, then analyze observations (reflection), until the task is considered finished by the agent. The agent is typically prompted, using examples, to generate outputs in the following format:
Thought 1: …
Act 1: …
Observation 1: …
… [continue until reflection determines that the task is finished] …
Thought N: …
Act N: Finish [Response to query]
Figure 5: A ReAct agent in action.
Figure 5 shows an example of an agent following the ReAct framework responding to a question from HotpotQA (Yang et al., 2018), a benchmark for multi-hop question answering.
You can implement reflection in a multi-agent setting: one agent plans and takes actions and another agent evaluates the outcome after each step or after a number of steps.
If the agent’s response failed to accomplish the task, you can prompt the agent to reflect on why it failed and how to improve. Based on this suggestion, the agent generates a new plan. This allows agents to learn from their mistakes.
For example, given a code generation task, an evaluator might evaluate that the generated code fails ⅓ of the test cases. The agent then reflects that it failed because it didn’t take into account arrays where all numbers are negative. The actor then generates new code, taking into account all-negative arrays.
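A minimal sketch of this generate-evaluate-reflect loop, with hypothetical generate, run_tests, and reflect helpers:

MAX_ATTEMPTS = 3
task = "Write a function that returns the largest number in a list."
feedback = ""
for attempt in range(MAX_ATTEMPTS):
    code = generate(task, feedback)      # actor: propose a solution
    passed, failures = run_tests(code)   # evaluator: score the outcome
    if passed:
        break
    # Self-reflection: analyze what went wrong (e.g., "didn't handle
    # all-negative arrays") and feed it into the next attempt.
    feedback = reflect(task, code, failures)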
This is the approach that Reflexion (Shinn et al., 2023) took. In this framework, reflection is separated into two modules: an evaluator that evaluates the outcome and a self-reflection module that analyzes what went wrong. Figure 6 shows examples of Reflexion agents in action. The authors used the term “trajectory” to refer to a plan. At each step, after evaluation and self-reflection, the agent proposes a new trajectory.
Figure 6. Examples of how Reflexion agents work.
Compared to plan generation, reflection is relatively easy to implement and can bring surprisingly good performance improvement. The downside of this approach is latency and cost. Thoughts, observations, and sometimes actions can take a lot of tokens to generate, which increases cost and user-perceived latency, especially for tasks with many intermediate steps. To nudge their agents to follow the format, both ReAct and Reflexion authors used plenty of examples in their prompts. This increases the cost of computing input tokens and reduces the context space available for other information.
Tool selection
Because tools often play a crucial role in a task’s success, tool selection requires careful consideration. The tools to give your agent depend on the environment and the task, but also on the AI model that powers the agent.
There’s no foolproof guide on how to select the best set of tools. Agent literature consists of a wide range of tool inventories. For example:
Toolformer (Schick et al., 2023) finetuned GPT-J to learn 5 tools.
Chameleon (Lu et al., 2023) uses 13 tools.
Gorilla (Patil et al., 2023) attempted to prompt agents to select the right API call among 1,645 APIs.
More tools give the agent more capabilities. However, the more tools there are, the harder it is to efficiently use them. It’s similar to how it’s harder for humans to master a large set of tools. Adding tools also means increasing tool descriptions, which might not fit into a model’s context.
Like many other decisions while building AI applications, tool selection requires experimentation and analysis. Here are a few things you can do to help you decide:
Compare how an agent performs with different sets of tools.
Do an ablation study to see how much the agent’s performance drops if a tool is removed from its inventory. If a tool can be removed without a performance drop, remove it.
Look for tools that the agent frequently makes mistakes on. If a tool proves too hard for the agent to use—for example, extensive prompting and even finetuning can’t get the model to learn to use it—change the tool.
Plot the distribution of tool calls to see what tools are most used and what tools are least used. Figure 7 shows the differences in tool use patterns of GPT-4 and ChatGPT in Chameleon (Lu et al., 2023).
Figure 7. Different models and tasks express different tool use patterns.
Experiments by Chameleon (Lu et al., 2023) also demonstrate two points:
Different tasks require different tools. ScienceQA, the science question answering task, relies much more on knowledge retrieval tools than TabMWP, a tabular math problem-solving task.
Different models have different tool preferences. For example, GPT-4 seems to select a wider set of tools than ChatGPT. ChatGPT seems to favor image captioning, while GPT-4 seems to favor knowledge retrieval.
Tip:
When evaluating an agent framework, evaluate what planners and tools it supports. Different frameworks might focus on different categories of tools. For example, AutoGPT focuses on social media APIs (Reddit, X, and Wikipedia), whereas Composio focuses on enterprise APIs (Google Apps, GitHub, and Slack).
As your needs will likely change over time, evaluate how easy it is to extend your agent to incorporate new tools.
As humans, we become more productive not just by using the tools we’re given, but also by creating progressively more powerful tools from simpler ones. Can AI create new tools from its initial tools?
Chameleon (Lu et al., 2023) proposes the study of tool transition: after tool X, how likely is the agent to call tool Y? Figure 8 shows an example of tool transition. If two tools are frequently used together, they can be combined into a bigger tool. If an agent is aware of this information, the agent itself can combine initial tools to continually build more complex tools.
Figure 8. A tool transition tree by Chameleon (Lu et al., 2023).
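As a sketch, tool-transition statistics can be estimated from an agent’s execution logs; the log format here is hypothetical:

from collections import Counter

# Each log entry is the sequence of tools the agent called for one task.
logs = [
    ["knowledge_retrieval", "query_generator", "response_generator"],
    ["knowledge_retrieval", "bing_search", "response_generator"],
]

# Count how often tool Y directly follows tool X.
transitions = Counter(
    (calls[i], calls[i + 1]) for calls in logs for i in range(len(calls) - 1)
)
print(transitions.most_common(3))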
Voyager (Wang et al., 2023) proposes a skill manager to keep track of new skills (tools) that an agent acquires for later reuse. Each skill is a coding program. When the skill manager determines that a newly created skill is useful (e.g., because it’s successfully helped an agent accomplish a task), it adds this skill to the skill library (conceptually similar to the tool inventory). This skill can be retrieved later to use for other tasks.
Earlier in this section, we mentioned that the success of an agent in an environment depends on its tool inventory and its planning capabilities. Failures in either aspect can cause the agent to fail. The next section will discuss different failure modes of an agent and how to evaluate them.
Agent Failure Modes and Evaluation
Evaluation is about detecting failures. The more complex a task an agent performs, the more possible failure points there are. Other than the failure modes common to all AI applications discussed in Chapters 3 and 4, agents also have unique failures caused by planning, tool execution, and efficiency. Some of the failures are easier to catch than others.
To evaluate an agent, identify its failure modes and measure how often each of these failure modes happens.
Planning failures
Planning is hard and can fail in many ways. The most common mode of planning failure is tool use failure. The agent might generate a plan with one or more of these errors.
1. Invalid tool
For example, it generates a plan that contains bing_search, which isn’t in the tool inventory.
2. Valid tool, invalid parameters
For example, it calls lbs_to_kg with two parameters, but this function requires only one parameter, lbs.
3. Valid tool, incorrect parameter values
For example, it calls lbs_to_kg with one parameter, lbs, but uses the value 100 for lbs when it should be 120.
Another mode of planning failure is goal failure: the agent fails to achieve the goal. This can be because the plan doesn’t solve a task, or it solves the task without following the constraints. To illustrate this, imagine you ask the model to plan a two-week trip from San Francisco to India with a budget of $5,000. The agent might plan a trip from San Francisco to Vietnam, or plan you a two-week trip from San Francisco to India that will cost you way over the budget.
A common constraint that is often overlooked by agent evaluation is time. In many cases, the time an agent takes matters less because you can assign a task to an agent and only need to check in when it’s done. However, in many cases, the agent becomes less useful with time. For example, if you ask an agent to prepare a grant proposal and the agent finishes it after the grant deadline, the agent isn’t very helpful.
An interesting mode of planning failure is caused by errors in reflection. The agent is convinced that it’s accomplished a task when it hasn’t. For example, you ask the agent to assign 50 people to 30 hotel rooms. The agent might assign only 40 people and insist that the task has been accomplished.
To evaluate an agent for planning failures, one option is to create a planning dataset where each example is a tuple (task, tool inventory). For each task, use the agent to generate a K number of plans. Compute the following metrics:
Out of all generated plans, how many are valid?
For a given task, how many plans does the agent have to generate to get a valid plan?
Out of all tool calls, how many are valid?
How often are invalid tools called?
How often are valid tools called with invalid parameters?
How often are valid tools called with incorrect parameter values?
Analyze the agent’s outputs for patterns. What types of tasks does the agent fail more on? Do you have a hypothesis why? What tools does the model frequently make mistakes with? Some tools might be harder for an agent to use. You can improve an agent’s ability to use a challenging tool by better prompting, more examples, or finetuning. If all fail, you might consider swapping out this tool for something easier to use.
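A minimal sketch of computing two of the metrics above, assuming each generated plan is a list of tool names (parameter-level checks would work similarly):

def plan_validity_rate(plans: list[list[str]], tool_inventory: set[str]) -> float:
    # Out of all generated plans, how many are valid?
    valid = sum(all(tool in tool_inventory for tool in plan) for plan in plans)
    return valid / len(plans)

def invalid_tool_call_rate(plans: list[list[str]], tool_inventory: set[str]) -> float:
    # Out of all tool calls, how many reference a tool not in the inventory?
    calls = [tool for plan in plans for tool in plan]
    return sum(tool not in tool_inventory for tool in calls) / len(calls)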
Tool failures
Tool failures happen when the correct tool is used, but the tool output is wrong. One failure mode is when a tool just gives the wrong outputs. For example, an image captioner returns a wrong description, or an SQL query generator returns a wrong SQL query.
If the agent generates only high-level plans and a translation module is involved in translating from each planned action to executable commands, failures can happen because of translation errors.
Tool failures are tool-dependent. Each tool needs to be tested independently. Always print out each tool call and its output so that you can inspect and evaluate them. If you have a translator, create benchmarks to evaluate it.
Detecting missing tool failures requires an understanding of what tools should be used. If your agent frequently fails on a specific domain, this might be because it lacks tools for this domain. Work with human domain experts and observe what tools they would use.
Efficiency
An agent might generate a valid plan using the right tools to accomplish a task, but it might be inefficient. Here are a few things you might want to track to evaluate an agent’s efficiency:
How many steps does the agent need, on average, to complete a task?
How much does the agent cost, on average, to complete a task?
How long does each action typically take? Are there any actions that are especially time-consuming or expensive?
You can compare these metrics with your baseline, which can be another agent or a human operator. When comparing AI agents to human agents, keep in mind that humans and AI have very different modes of operation, so what’s considered efficient for humans might be inefficient for AI and vice versa. For example, visiting 100 web pages might be inefficient for a human agent who can only visit one page at a time but trivial for an AI agent that can visit all the web pages at once.
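A sketch of tracking the efficiency metrics listed above from per-task logs; the record fields are hypothetical:

# Each record logs one completed task.
runs = [
    {"steps": 6, "cost_usd": 0.12, "seconds": 41},
    {"steps": 9, "cost_usd": 0.30, "seconds": 95},
]

def average(key: str) -> float:
    return sum(run[key] for run in runs) / len(runs)

print(f"avg steps: {average('steps'):.1f}, "
      f"avg cost: ${average('cost_usd'):.2f}, "
      f"avg latency: {average('seconds'):.0f}s")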
Conclusion
At its core, the concept of an agent is fairly simple. An agent is defined by the environment it operates in and the set of tools it has access to. In an AI-powered agent, the AI model is the brain that leverages its tools and feedback from the environment to plan how best to accomplish a task. Access to tools makes a model vastly more capable, so the agentic pattern is inevitable.
While the concept of “agents” sounds novel, they are built upon many concepts that have been used since the early days of LLMs, including self-critique, chain-of-thought, and structured outputs.
This post covered conceptually how agents work and different components of an agent. In a future post, I’ll discuss how to evaluate agent frameworks.
The agentic pattern often deals with information that exceeds a model’s context limit. A memory system that supplements the model’s context in handling information can significantly enhance an agent’s capabilities. Since this post is already long, I’ll explore how a memory system works in a future blog post.