FireAct: Toward Language Agent Fine-tuning


System2 Research¹, University of Cambridge², Monash University³, Princeton University⁴
Language and Technology Lab², Princeton Language and Intelligence⁴

*Indicates Equal Contribution
Teaser image

While language agents and language model fine-tuning are both popular topics, their intersection is understudied. This work takes an initial step to show multiple advantages of fine-tuning LMs for agentic uses, and opens up various new questions toward language agent fine-tuning.

Abstract

Recent efforts have augmented language models (LMs) with external tools or environments, leading to the development of language agents that can reason and act. However, most of these agents rely on few-shot prompting techniques with off-the-shelf LMs. In this paper, we investigate and argue for the overlooked direction of fine-tuning LMs to obtain language agents. Using a setup of question answering (QA) with a Google search API, we explore a variety of base LMs, prompting methods, fine-tuning data, and QA tasks, and find language agents are consistently improved after fine-tuning their backbone LMs. For example, fine-tuning Llama2-7B with 500 agent trajectories generated by GPT-4 leads to a 77% HotpotQA performance increase. Furthermore, we propose FireAct, a novel approach to fine-tuning LMs with trajectories from multiple tasks and prompting methods, and show having more diverse fine-tuning data can further improve agents. Along with other findings regarding scaling effects, robustness, generalization, efficiency and cost, our work establishes comprehensive benefits of fine-tuning LMs for agents, and provides an initial set of experimental designs, insights, as well as open questions toward language agent fine-tuning.

Method

Figure: Overview of the FireAct framework.

The FireAct framework includes two steps:
(a) During fine-tuning, a strong language model (e.g., GPT-4) generates task-solving trajectories for questions drawn from different datasets, using a variety of prompting methods. The successful trajectories are then converted into the ReAct format and used to fine-tune a smaller language model (e.g., Llama2-7B).
(b) During inference, the fine-tuned language model operates without few-shot prompts and can implicitly select an agent method, completing ReAct trajectories of flexible length that adapt to the complexity of the question. The example "3+4+5=" is an ad-hoc question shown only for illustration.
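
To make step (a) concrete, here is a minimal Python sketch of converting one successful trajectory into a chat-style ReAct fine-tuning example. The field names, the example question, and the output file name are illustrative assumptions, not FireAct's actual data schema.

import json

# A hypothetical GPT-4 trajectory: alternating thoughts, actions, and observations.
# Field names ("question", "steps", "answer") are illustrative, not FireAct's actual schema.
trajectory = {
    "question": "What is the capital of the country where the Nile ends?",
    "steps": [
        {"thought": "The Nile ends in Egypt, so I should confirm Egypt's capital.",
         "action": "search[capital of Egypt]",
         "observation": "Cairo is the capital and largest city of Egypt."},
    ],
    "answer": "Cairo",
}

def to_react_chat_example(traj):
    """Convert one successful trajectory into a chat-style fine-tuning example."""
    messages = [{"role": "user", "content": traj["question"]}]
    for step in traj["steps"]:
        # The assistant turn contains the thought and the tool call in ReAct format.
        messages.append({"role": "assistant",
                         "content": f"Thought: {step['thought']}\nAction: {step['action']}"})
        # The tool's result is fed back as the next observation turn.
        messages.append({"role": "user",
                         "content": f"Observation: {step['observation']}"})
    # Final turn: answer the question with the finish action.
    messages.append({"role": "assistant",
                     "content": f"Thought: I can answer now.\nAction: finish[{traj['answer']}]"})
    return {"messages": messages}

with open("fireact_train.jsonl", "w") as f:
    f.write(json.dumps(to_react_chat_example(trajectory)) + "\n")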

Results

Few-shot prompting results (HotpotQA EM).

           IO      CoT     ReAct
GPT-4      37.2    45.0    42.0
GPT-3.5    22.4    28.0    31.4
Prompting vs. fine-tuning (HotpotQA EM), with absolute/relative increases.

                 ReAct    FireAct    abs. / rel. diff
Llama-2-7B       14.8     26.2       +11.4 / 77%
Llama-2-13B      21.2     34.4       +13.1 / 62%
CodeLlama-7B     17.4     27.8       +10.4 / 60%
CodeLlama-13B    20.8     29.0        +8.2 / 39%
CodeLlama-34B    22.2     27.8        +5.6 / 25%
GPT-3.5          31.4     39.2        +7.8 / 25%

Fine-tuning significantly improves agent performance. Fine-tuning consistently and substantially improves HotpotQA EM over prompting alone. Even weaker language models benefit greatly: Llama-2-7B gains a remarkable 77%. Stronger models such as GPT-3.5 still improve by 25%, showing that the benefit holds across model scales. Fine-tuned Llama-2-13B outperforms all GPT-3.5 prompting baselines (IO/CoT/ReAct), suggesting that fine-tuning smaller, open-source language models can outperform prompting larger, commercial ones. Notably, even the strongest fine-tuned LM, GPT-3.5, surpasses GPT-4 + IO prompting but still falls behind GPT-4 + CoT/ReAct prompting, indicating room for further improvement.
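
All scores above are exact match (EM). As a reference, here is a minimal sketch of the answer normalization commonly used for HotpotQA-style EM (lowercasing, stripping punctuation and articles); we assume the evaluation here follows this convention, and the example predictions are made up.

import re
import string

def normalize(text: str) -> str:
    """Standard SQuAD/HotpotQA answer normalization: lowercase, strip punctuation,
    articles, and extra whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))

# Dataset-level EM is the mean of per-question scores, reported as a percentage.
preds = ["Cairo", "the Eiffel Tower"]
golds = ["Cairo.", "Eiffel tower"]
em = 100 * sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(preds)
print(em)  # 100.0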

Comparison of costs, robustness, and generalization for fine-tuned vs. prompted GPT-3.5.

           Cost per trial             Obs. robustness (EM)          Generalization
           Money ($)      Time (s)    Normal    None     Random     Bamboogle (EM)
FireAct    2.2 × 10⁻³     2.7         39.2      33.6     37.2       44.0
ReAct      2.6 × 10⁻³     9.0         31.4      20.8     22.6       40.8

Fine-tuning also reduces inference time and cost. Because fine-tuned LMs need no few-shot in-context examples, inference becomes more efficient, especially in agentic applications where the context accumulates over iterative rounds. Comparing fine-tuned with prompted GPT-3.5, inference time drops by 70% (9.0s to 2.7s per trial) and the monetary cost per trial also decreases (from $2.6 × 10⁻³ to $2.2 × 10⁻³), despite the higher per-token price of fine-tuned GPT-3.5 inference. While these costs may vary under different conditions (e.g., parallelism implementation), the benefit of a much smaller context is evident.
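
To make the context-accumulation argument concrete, here is a back-of-the-envelope sketch; all token counts (few-shot prefix size, per-turn content, number of turns) are illustrative assumptions rather than measurements from the paper.

# Rough token accounting for a multi-turn ReAct episode. Every turn re-sends the full
# context, so a fixed few-shot prefix is paid for on every LM call.
# All numbers below are illustrative assumptions, not figures from the paper.
few_shot_prefix = 1800   # tokens of in-context examples (prompted agent only)
per_turn_content = 250   # question + thoughts + actions + observations added per turn
turns = 5

def total_prompt_tokens(prefix: int) -> int:
    # Turn t re-reads the prefix plus everything accumulated so far.
    return sum(prefix + per_turn_content * t for t in range(1, turns + 1))

prompted = total_prompt_tokens(few_shot_prefix)   # 12750
finetuned = total_prompt_tokens(0)                # 3750
print(prompted, finetuned, f"{1 - finetuned / prompted:.0%} fewer prompt tokens")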

Robustness to noisy tools. The tools or environments that language agents interact with are not always trustworthy, which has led to safety concerns such as jailbreaking and prompt injection. Here we consider a simplified and harmless setup in which the search API returns, with probability 0.5, either 1) "None" or 2) a random search response (drawn from all previous experiments and trials), and ask whether language agents can still answer questions robustly. As the robustness columns of the table above show, the "None" setup turns out to be the more challenging one: it lowers ReAct EM by 33.8% but FireAct EM by only 14.2%. Interestingly, random observations hurt ReAct by a similar degree (a 28.0% drop) but barely affect FireAct (only a 5.1% drop), possibly because the fine-tuning trajectories already contain examples of noisy search results and of how GPT-4 "reacts" to such noise successfully. These initial results hint at the importance of more diverse learning support for robustness.
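
A minimal sketch of this noisy-observation setup, assuming a wrapper around a stand-in search function; google_search, the observation pool, and the mode argument are illustrative, not the paper's actual tooling.

import random

def google_search(query: str) -> str:
    """Stand-in for the real search tool (e.g., a search API call); hypothetical here."""
    return f"Top result snippet for: {query}"

# Pool of observations collected from earlier runs, used for the "random" corruption.
previous_observations = ["Cairo is the capital of Egypt.", "The Nile is 6,650 km long."]

def noisy_search(query: str, mode: str = "none", p: float = 0.5) -> str:
    """With probability p, corrupt the observation: return "None" or a random
    previously seen observation, matching the two robustness settings above."""
    if random.random() < p:
        if mode == "none":
            return "None"
        if mode == "random":
            return random.choice(previous_observations)
    return google_search(query)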

Generalization to new tasks. The generalization column of the table reports EM on Bamboogle, a test set of multi-hop questions that cannot be answered directly by searching Google. Both fine-tuned and prompted GPT-3.5 generalize reasonably to these questions, but fine-tuning outperforms prompting, suggesting an advantage in generalization. Combining few-shot prompts with fine-tuning further improves performance on these questions. However, fine-tuning on one QA dataset does not generalize well to other datasets with different question styles and answer formats, which motivates further experiments on multi-task fine-tuning.

Analysis

Figure: HotpotQA EM as a function of the number of fine-tuning trajectories, for different base LMs.

Effect of fine-tuning data scale. This analysis explores how FireAct performance scales with the number of fine-tuning trajectories (n ∈ {100, 200, 500, 1000}). GPT-3.5 appears very sample-efficient, requiring only 100 samples to reach an EM around 35, with only marginal gains beyond 200 samples. In contrast, Llama models cannot even learn the ReAct format from 100 or 200 samples; non-trivial EMs "emerge" with 500 samples, and most models (except CodeLlama-13B) improve further with 1,000 samples.
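
As a concrete illustration of how such a data-scaling study could be set up, here is a small sketch that subsamples a pool of trajectories into training sets of different sizes; the file names and chat format follow the earlier sketch and are assumptions, not FireAct's released pipeline.

import json
import random

# Subsample the pool of successful trajectories to study data scaling.
# "fireact_train.jsonl" follows the chat format sketched in the Method section.
with open("fireact_train.jsonl") as f:
    pool = [json.loads(line) for line in f]

random.seed(0)
for n in (100, 200, 500, 1000):
    subset = random.sample(pool, min(n, len(pool)))
    with open(f"fireact_train_{n}.jsonl", "w") as f:
        for example in subset:
            f.write(json.dumps(example) + "\n")
# Each file can then be fed to the same fine-tuning recipe (e.g., LoRA for Llama-2,
# or a hosted fine-tuning API for GPT-3.5) so that only the data size varies.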

Figure: Example questions and trajectories comparing a ReAct-only fine-tuned agent with multi-method fine-tuned agents.

Multi-method fine-tuning increases agent flexibility. Before presenting quantitative results, we offer two example questions that illustrate the benefit of multi-method FireAct fine-tuning. The first question (a) is simple, yet the ReAct-only fine-tuned agent (a1) issued an over-complicated search query, which led to distraction and a wrong answer. In contrast, an agent fine-tuned with both CoT and ReAct chose to solve the task in a single round, relying on confident internal knowledge. The second question (b) is harder, and the ReAct-only fine-tuned agent (b1) kept issuing search queries ending in "during the Libyan Civil War" without retrieving useful information. In contrast, an agent fine-tuned with both Reflexion and ReAct reflected on the problem and pivoted its search strategy, changing the time constraint to "during his rule", which led to the correct answer. The ability to implicitly choose a method for each problem is another key advantage of fine-tuning over prompting.
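
For intuition, the sketch below shows one plausible way to assemble such a multi-method training set, concatenating and shuffling trajectory files collected with different prompting methods; the file names and the method tag are illustrative assumptions, not FireAct's actual data layout.

import json
import random

# Build a multi-method training set by concatenating trajectories collected with
# different prompting methods. File names and mixture are illustrative assumptions.
method_files = {
    "react": "trajectories_react.jsonl",
    "cot": "trajectories_cot.jsonl",              # one-round "think then answer" examples
    "reflexion": "trajectories_reflexion.jsonl",  # includes reflection turns after failures
}

mixed = []
for method, path in method_files.items():
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            example["method"] = method  # tag kept for analysis; not shown to the model
            mixed.append(example)

random.seed(0)
random.shuffle(mixed)  # shuffle so no single method dominates early training steps
with open("fireact_train_multimethod.jsonl", "w") as f:
    for example in mixed:
        f.write(json.dumps(example) + "\n")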

Multi-method fine-tuning affects different LMs differently. Despite the intuitive benefit, mixing more methods does not always improve results, and the optimal mix depends on the base LM. For example, ReAct+CoT outperforms ReAct alone for GPT-3.5 and Llama-2 models but hurts the CodeLlama models. ReAct+CoT+Reflexion is the worst mix for CodeLlama-7/13B yet the best for CodeLlama-34B. These non-trivial results call for further study of the interaction between base LMs and fine-tuning data.

BibTeX

@misc{chen2023fireact,
      title={FireAct: Toward Language Agent Fine-tuning}, 
      author={Baian Chen and Chang Shu and Ehsan Shareghi and Nigel Collier and Karthik Narasimhan and Shunyu Yao},
      year={2023},
      eprint={2310.05915},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}