LLMs之ToolUse: "ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration" — Translation and Commentary

Overview: ToolOrchestra proposes and validates the paradigm of using a small, well-trained Orchestrator to coordinate diverse tools (including stronger models). By jointly optimizing correctness, cost, and user preference, the authors show that on hard reasoning tasks the system can match or exceed much larger models while substantially cutting real running cost and latency, offering a practical route to scalable, controllable, and economical deployment of tool-augmented agents.

Background and pain points:
● High compute cost of agentic tasks: large language models (LLMs) remain limited in accuracy and computationally expensive on complex, multi-step reasoning tasks such as Humanity's Last Exam (HLE).
● Limits of a single large model: the conventional setup of one strong model plus a few utility tools underuses the potential of combining diverse tools with models of different capabilities, and suffers from self-preference/self-enhancement bias and from defaulting to the strongest available model, leading to overuse of expensive models.
● Missing controllability-efficiency trade-off: existing tool-use agents typically optimize only for accuracy, with no mechanism for jointly optimizing outcome correctness, compute/latency cost, and user tool preferences.

Proposed solution:
● Overall approach: ToolOrchestra trains a small "Orchestrator" language model (8B parameters in the paper) that, with a reinforcement-learned policy, decides during multi-turn reasoning when and in what order to invoke a variety of tools, including search, code interpreters, specialized LLMs, and larger generalist models. The goal is to match or exceed large-model performance at lower cost and with better alignment to user preferences.
● Unified tool interface: all tools (APIs, specialized LLMs, generalist LLMs, functions, etc.) are exposed to the Orchestrator through a unified JSON description containing a name, a description, and a parameter schema, so the Orchestrator can select and call tools in a standardized way.
● Triple reward design: training uses reinforcement learning with three reward components that jointly optimize the decision policy: outcome (answer correctness), efficiency (a monetized compute/latency penalty), and preference (alignment with the user's preferred tools).
● Data synthesis and ToolScale: to provide verifiable multi-turn tool-use training samples for RL, the authors build an automated data-synthesis pipeline and release the ToolScale dataset, covering complex environments and tasks across 10 domains.

Core ideas and steps:
● Problem formulation: multi-turn tool use is modeled as a Markov decision process (MDP). At each step the Orchestrator, conditioned on the history, reasons, takes an action (calls some tool), and receives an observation (the tool's return), until termination or a turn limit.
● Unified tool descriptions and capability profiling: for each candidate tool (including other LLMs), a concise description is generated by sampling tasks, collecting tool execution traces, and having an LLM summarize them, so that before calling a tool the Orchestrator "understands" what kinds of tasks it is good at.
● Reward components:
○ Outcome reward: the final answer is judged correct or not, in a binary or graded fashion (the paper uses GPT-5 as the judge to handle diverse outputs).
○ Cost/latency penalty: input/output tokens are converted into monetary cost via third-party API pricing, and a latency penalty is added, to encourage cheap and fast solutions.
○ User preference reward: the number of calls to each tool in a trajectory is counted and normalized against the user's preference vector, with rewards/penalties that encourage compliance with user-specified tool preferences.
● Training pipeline: behavior cloning or supervised pre-training on synthetic data, followed by end-to-end fine-tuning with policy-gradient RL under the reward design, using several tricks to stabilize training (details in the paper's appendix).

Advantages over existing methods:
● Significant cost efficiency: Orchestrator (8B) scores 37.1% on HLE, beating GPT-5's 35.1%, while being cheaper to run (the paper reports roughly 2.5x higher efficiency), and it achieves better or comparable performance at lower cost on other benchmarks as well.
● Finer-grained tool selection and composition: through the unified interface and the RL policy, the Orchestrator picks cheaper tools or specialized models across turns and calls the expensive large model only when necessary, striking the best performance-cost trade-off.
● User controllability: including user preference as a reward component gives the system an explicit optimization target for honoring which tools the user does or does not want used, improving controllability and trust.
● Strong generalization: good results on HLE, τ2-Bench (a function-calling benchmark), and FRAMES (factual reasoning), and robustness to unseen tools and tasks, indicating that what is learned is strategic scheduling ability rather than memorization of a single task.

Takeaways and recommendations:
● A "small Orchestrator + diverse tools" architecture is more viable for industrial deployment: it can markedly reduce overall API cost and latency while preserving or improving final task quality; operations teams should treat expensive models as on-demand resources rather than always-on defaults.
● Reward design is critical: the weights of outcome/cost/preference need to be tuned to the business priority (e.g., user experience vs. cost savings); the paper shows how to map token usage to monetary cost for a unified measure.
● Tool descriptions and capability estimation must be done well: unified JSON descriptions plus example-driven "profiling" of model tools help the Orchestrator make sensible choices; in practice, tool-capability metadata should be maintained and recalibrated regularly.
● Risks and caveats: the judge must be designed carefully (the paper uses GPT-5 as judge) to avoid judging bias, and data synthesis should cover edge cases so the Orchestrator does not learn spurious shortcuts on certain task types.

"ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration" — Translation and Commentary
Paper: https://arxiv.org/abs/2511.21689
Date: November 26, 2025
Authors: NVIDIA; The University of Hong Kong

Abstract
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On τ2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools.
These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.

Figure 1: ToolOrchestra shows consistently strong performance on HLE, FRAMES, and τ2-Bench with superior cost efficiency.

1. Introduction
Large language models (LLMs) have been reported to have made remarkable strides towards superhuman intelligence but remain of limited utility in complex agentic tasks such as those posed by the Humanity's Last Exam (HLE) [1]. Tool use is a promising avenue for the extension of their capabilities beyond what can be learned from the training data. By calling on external resources through search engines and code interpreters, tool use has been shown to enhance accuracy and reduce hallucinations [2, 3, 4, 5, 6, 7, 8, 9, 10].
Prior research on tool-use agents has primarily focused on equipping a single powerful model with utility tools such as web search or calculators. While effective in many scenarios, this approach underutilizes the potential of tools: humans, when reasoning, routinely extend themselves by calling upon resources of greater-than-human intelligence, from domain experts to sophisticated processes and software systems. Motivated by this observation, we propose the orchestration paradigm.
Under this paradigm, intelligence emerges not from a monolith but from a composite system. At the center of the system lies an orchestrator model, whose responsibility is to invoke the right tools for the given task, and to do so in the right order to accomplish the task. The crucial difference to the standard monolithic setup featuring a single powerful model is that in addition to deterministic utilities such as web search functions and code interpreters, models of various capabilities are made available to the orchestrator as intelligent tools. The use of tools of different levels of intelligence comes at varying costs, and the challenge for the orchestrator is then to dynamically decide on which tools to invoke in order to solve the task while respecting user preferences for various tools and minimizing the cost. By delegating narrowed-down sub-problems of a larger effort requiring intelligence to intelligent tools instead of handling the entire effort by a single generalist, orchestration teems with the promise of exhibiting higher intelligence than any of the system's tools and leading monolithic solutions alike.
One approach to implementing the orchestrator paradigm is to employ a language model as the orchestrator and allow it to invoke stronger models only when it deems it necessary.
This can be done naively by prompting an off-the-shelf language model or by training a general-purpose orchestrator. For the former, we find that relying on straightforward model prompting is brittle and introduces systemic biases. As shown in Figure 3 (left and middle), GPT-5 disproportionately delegates tasks to GPT-5-mini, while Qwen3-8B defers to GPT-5 at a markedly higher rate. This illustrates two present issues of prompting in the context of complex tool orchestration: (i) the overuse of developmentally-related variants of oneself, i.e., self-enhancement bias [11], and (ii) defaulting to the strongest available tool regardless of the cost or relative utility (see Appendix A for more details and §4 for a thorough comparison to baselines). As such, we conclude that the scenarios in which an orchestrating model may call on models and tools of capabilities both inferior and superior to its own are idiosyncratic in the context of model tool calling and warrant their own approach to training. In addition, controllability in tool-use agents remains underexplored along two axes: cost-efficiency and user preferences (cf. §7).
We address these shortcomings by proposing ToolOrchestra (shown in Figure 2), a novel method for training a small language model to act as the orchestrator – the "brain" of a heterogeneous tool-use agent.
Using ToolOrchestra, we produce the Orchestrator, an 8B-parameter model trained end-to-end with reinforcement learning (RL) to decide when and how to invoke more intelligent language models and various tools such as web search or code interpreters, and how to combine them in multi-turn reasoning. Our reward design balances three objectives – correctness of the final outcome, efficiency in resource usage, and alignment with user preferences – to yield a cost-effective and user-controllable tool-use policy. To aid RL training, we build an automatic data synthesis pipeline that generates thousands of verifiable multi-turn tool-use training examples with complex environments across 10 domains. We will make the resulting dataset, ToolScale, publicly available to facilitate further research on tool-use agent training.
In our experiments, we rigorously evaluate the merits of our approach on three challenging tasks. On HLE [1], a benchmark consisting of difficult questions across many disciplines, we find that Orchestrator substantially outperforms prior methods with far lower computational cost. We also test on τ2-Bench [12], a function-calling benchmark, where Orchestrator demonstrates the ability to schedule a variety of tools effectively, calling a large model (GPT-5) in only ∼40% of the steps and utilizing cheaper models or tools for the rest, yet still exceeding the performance of an agent that uses the large model for every step. Finally, additional evaluations on FRAMES [13], a factuality reasoning benchmark, provide further evidence of the versatility and robustness of our approach.
We observe that even though the training and testing tasks differ markedly, the RL-trained Orchestrator adapts its tool-use policy to new challenges, indicating a high degree of general reasoning ability.
Our contributions can be summarized as follows: (1) We introduce ToolOrchestra, a method for training a small language model to serve as the orchestrator of a diverse toolkit, including classical tools and more intelligent models. This dovetails with recent developments in the field testifying that small language models are often sufficiently powerful and far more economical in agentic systems [14, 15]. (2) We develop a novel reward training design that goes beyond accuracy. The resulting Orchestrator is trained end-to-end to balance task outcome correctness, efficiency in cost and latency, and alignment with user cost and tool preferences. (3) We demonstrate that Orchestrator trained by ToolOrchestra achieves state-of-the-art performance on challenging reasoning benchmarks, surpassing frontier models while using only a fraction of their compute and wall-clock time, and that it generalizes robustly to unseen tasks and tools.

Figure 2: Overview of Orchestrator. Given a task, Orchestrator alternates between reasoning and tool calling in multiple turns to solve it. Orchestrator interacts with a diverse tool set, including basic tools (web search, functions such as get_flight_status, etc.), specialized LLMs (coding models, math models, etc.) and generalist LLMs (GPT-5, Claude Opus 4.1, etc.). In training under ToolOrchestra, Orchestrator is jointly optimized by outcome, efficiency and preference rewards via reinforcement learning.

Figure 3: Tool-calling preferences exhibited by a prompted off-the-shelf or RL-trained model. GPT-5 tends to call GPT-5-mini most of the time, while Qwen3-8B relies heavily on GPT-5.

8. Conclusion
In this work, we presented ToolOrchestra, a method for training a small orchestration model to unify diverse tools and specialized models. By training Orchestrator end-to-end with reinforcement learning, we showed that it can learn to plan adaptive tool-use strategies guided by outcome quality, efficiency, and human preference rewards. This enables the agent to dynamically balance performance and cost, rather than relying on static heuristics or purely supervised approaches. To aid reinforcement learning, we also contribute a complex user-agent-tool synthetic dataset, ToolScale. Our experiments on challenging benchmarks demonstrate that our Orchestrator-8B attains state-of-the-art performance while operating at significantly lower cost compared to larger models.
Looking ahead, we envision more sophisticated recursive orchestrator systems that not only push the upper bound of intelligence but also further enhance efficiency in solving increasingly complex agentic tasks.
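To make the unified tool interface from the commentary above concrete, here is a minimal sketch of how both a deterministic utility and an "intelligent tool" (a stronger LLM) could be exposed to an orchestrator through the same JSON description of name, description, and parameter schema. The field layout follows common function-calling conventions and is an assumption; the paper's exact schema is not reproduced in this post.

```python
import json

# Hypothetical tool descriptions in a unified JSON format: a deterministic
# utility and a generalist LLM share the same schema, so the orchestrator
# can reason about and invoke them uniformly.
TOOLS = [
    {
        "name": "code_interpreter",
        "description": "Executes Python code and returns its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
    {
        "name": "gpt5_generalist",
        "description": "A strong generalist LLM; expensive, call sparingly.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

def render_tool_prompt(tools):
    """Serialize the tool list into the JSON block shown to the orchestrator."""
    return json.dumps(tools, indent=2)
```

A tool-capability description (the example-driven "profiling" the commentary mentions) would simply be appended to the `description` field of each entry.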
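The triple reward (outcome correctness, monetized compute/latency cost, user tool preference) can be sketched as a weighted sum, as below. The weights, the per-million-token prices, and the simple "fraction of calls to preferred tools" normalization are illustrative assumptions, not the paper's exact formulation.

```python
def efficiency_penalty(usage, price_per_mtok, latency_s, latency_weight=0.01):
    """Monetize token usage via per-million-token API prices, plus a latency term."""
    dollar_cost = sum(
        (tokens / 1e6) * price_per_mtok[model] for model, tokens in usage.items()
    )
    return dollar_cost + latency_weight * latency_s

def preference_reward(call_counts, preferred_tools):
    """Fraction of tool calls in a trajectory that went to user-preferred tools."""
    total = sum(call_counts.values())
    if total == 0:
        return 0.0
    return sum(n for t, n in call_counts.items() if t in preferred_tools) / total

def total_reward(correct, usage, price_per_mtok, latency_s, call_counts,
                 preferred_tools, w_outcome=1.0, w_eff=0.1, w_pref=0.1):
    """Combine outcome, efficiency and preference terms (weights are illustrative)."""
    outcome = 1.0 if correct else 0.0  # binary outcome reward (judge-verified)
    return (w_outcome * outcome
            - w_eff * efficiency_penalty(usage, price_per_mtok, latency_s)
            + w_pref * preference_reward(call_counts, preferred_tools))
```

Tuning `w_eff` up pushes the policy toward cheap tools; tuning `w_pref` up makes it track the user's tool preferences more strictly, which mirrors the business-priority trade-off discussed in the takeaways.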
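Finally, the multi-turn MDP loop (reason → call a tool → observe, until a final answer or the turn limit) can be sketched as follows. The tool registry and the trivial rule-based `policy` are stand-ins for illustration; in ToolOrchestra the trained Orchestrator model generates both the reasoning and the tool choice.

```python
def run_orchestrator(task, policy, tools, max_turns=8):
    """Multi-turn loop: the policy maps history -> action; an action is either a
    tool call or a final answer. Each observation is appended to the history."""
    history = [("task", task)]
    for _ in range(max_turns):
        action = policy(history)            # reasoning + action selection
        if action["type"] == "final":
            return action["answer"], history
        tool = tools[action["tool"]]        # look up tool in the unified registry
        observation = tool(**action["args"])
        history.append((action["tool"], observation))
    return None, history                    # turn limit reached without an answer

# Stand-in tool and a trivial hand-written policy (assumptions for this sketch).
tools = {"calculator": lambda expr: eval(expr, {"__builtins__": {}})}

def policy(history):
    if len(history) == 1:                   # first turn: delegate the arithmetic
        return {"type": "call", "tool": "calculator", "args": {"expr": "6*7"}}
    return {"type": "final", "answer": history[-1][1]}  # answer from observation

answer, trace = run_orchestrator("What is 6*7?", policy, tools)
```

Replacing the rule-based `policy` with an LLM call, and the registry with the JSON-described tool set, recovers the shape of the system described in the paper.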