Disclaimers: This post contains several simplifications to help explain some core concepts of Large Language Models (LLMs). Only read footnotes if you want to dive deeperđ€ż.
In the previous article, we looked at what Large Language Models (LLMs) are and how they came to be. We ended up with the rise of GPT-3, a raw next-word predictor that was the first glimpse of the future to come.
Today, we are closing the gap between that âprehistoricâ time and the âAI agentsâ we are now used to. Letâs discuss the ideas you need to build Copilots or Assistants, starting from a raw next-word predictor.
As we will see, there are quite a few new concepts to cover and while none is particularly difficult, thinking about their interactions can be overwhelming. So grab a hot drink, get comfy and buckle up.
The first âconversationâ with the Computer
LLMs are different from any previous machine learning model for one key reason: their input changes their behavior. You can transform one model into a translator, a sentiment detector, or a typo corrector by changing its âpromptâ, i.e. their input, the text you feed it.
The very natural question that follows is: What is the optimal âinputâ to get the behavior I want?. In the early days (2020-2023), people experimented a lot and discovered many empirical âtricksâ to statically improve the odds of getting the desired behavior from the LLM. For example, to get a french-to-english translator, you could use the following prompt:
INPUT: "Bonjour, comment ça va ?"
OUTPUT: "Hello, how are you ?"
INPUT: "Maudit Français."
OUTPUT: "Bloody French."
INPUT: <INSERT ACTUAL TEXT TO TRANSLATE HERE>
OUTPUT:
Receiving this prompt, the LLM (at the time still a raw next-word predictor) would predict the next words of that paragraph and was very likely to return your translated text.
Something else that emerged quickly was the option to âhave a conversationâ with the LLM. By feeding it a prompt that looked like a conversation (or a play in book form), you could get it to follow the flow of that conversation and answer your questions. For example:
Sylvester: "Hi"
Tweety:
Would make the LLM generate a response like "Hello Sylvester, how are you today ?". Then, you could invoke the LLM a second time with:
Sylvester: "Hi"
Tweety: "Hello Sylvester, how are you today ?"
Sylvester: "I'm good, how about you ?"
Tweety:
Rinse and repeat and you have a conversation with the machine. Your first ever conversion of the Third Kind (đ€âđœ).
But keep in mind that these early LLMs were not trying to âhave a conversationâ, nor to be helpful. They were doing one thing: predicting the most likely next word in the piece of text you sent them. This crude setup didnât really help them stay consistent over long multi-turn conversations. And if your text looked like an online newspaper comment section or a Reddit thread, the LLM would predict accordingly (i.e. not so helpful most of the time).
We just highlighted two key characteristics of LLMs.
First, there is no concept of continuity from the point of view of the model itself. Once itâs done generating a response, âitâ disappears in the ether.
To mimic the continuity you expect from a conversation, the next time you invoke the model, you need to provide it with the full history of the conversation (or at least a sizable chunk) so the model knows what is going on and can predict the answer accordingly.
At their very core, they are âstatelessâ. Nothing survives a text generation.
Second, the total amount of text fed into the model during the âconversationâ increases quadratically with the number of âturnsâ. If you pay per usage, it means that having a 10-turn conversation costs 100 times more than a single-turn conversation.
If we normalize the size of each turn input and output to 1, we get the following sequence:
| Turn | Input Size | Output Size | Total Turn | Total Conversation |
|---|---|---|---|---|
| 1 | 1 | 1 | 2 | 2 |
| 2 | 2 | 1 | 3 | 5 |
| 3 | 3 | 1 | 4 | 9 |
| 4 | 4 | 1 | 5 | 14 |
| ⊠| ⊠| ⊠| ⊠| ⊠|
| n | n | 1 | n+1 | ~nÂČ |
This was a significant challenge in the early days and was partially solved with âinput cachingâ but letâs keep that for a future blog post đ.
Training a better conversational LLM
Now, we know how to âhave a conversationâ with a primitive next-word predictor. Letâs make a more advanced and useful LLM.
First, weâll keep the initial training step, now called pre-training. The foundation still is a very good next-word predictor. So gather all the text you are able to fathom: books (published or not), encyclopedias, all the web, all the scanned documents, all the transcripts of conversations, videos or radio shows, newspapers, codebases, etc. In all languages. Then, find a few bunch of GPUs and use them to train an initial model to predict the next word in all of that text. The pre-training text corpus is too huge to be properly reviewed or curated.
Then, we enter the next phase called supervised fine-tuning (SFT). This is a key step to reshape the LLM âbehaviorâ. The idea is simple. You gather a few thousand high-quality conversations mimicking the behavior you want the model to exhibit and you train the pre-trained model on them. While the pretraining dataset is very chaotic, the SFT dataset is very coherent and help changing the LLMs weights to make it less a generalistic next-word predictor and more a specialized conversational model.
Now, you have a LLM that tries to fake being helpful. âfakeâ because fundamentally the LLM still tries to predict the next word and when it gets it right, it gets its reward, else it doesnât1. We need to switch our âreward mechanismâ to a higher level one, one that would reward the LLM when it produces a correct answer or at least an answer appreciated by humans. We are entering the post-training phase.
The first post-training approach âdiscoveredâ in 20222 at OpenAI is called Reinforcement Learning from Human Feedback (RLHF) (paper). The idea is to ask a question to the LLM and make it generate 2 answers, then you ask a human to choose which one they prefer, finally you use that feedback to modify the LLMâs weights so it is more likely to produce the preferred answer in the future.
But LLMs are very slow learners, they need billions of such pair of answers and preferences. And human time is expensive. So instead of asking actual humans to give billions of feedback, you gather a few million human feedbacks and train a secondary Neural Network to predict which answer a human would prefer. If your Reward Model is good enough, you donât need an actual human in your training loop anymore. So you can scale to billions of feedbacks.
To get a post-training approach that actually rewards correct answers, we had to wait for almost 3 years (an eternity in AI time). In December 2024, the Allen Institute for AI and DeepSeek published a new approach called Reinforcement Learning from Verifiable Rewards (RLVR) (paper 1, paper 2). The idea is even simpler, you ask questions to the LLM whose answer can be easily and deterministically verified. Think of math problems, logic puzzles, coding problems, etc. If the answer is correct, you give a reward, else you donât. Again, youâll need to do that many many times. Quoting Karpathy:
[During RLVR], the LLMs spontaneously develop strategies that look like âreasoningâ to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies.
There you have it, the main ingredients to build an âhelpfulâ LLM: pre-training, SFT, RLHF and RLVR3. Now itâs up to you to mix them in the best way you can think of4.
Almost there, hang in there!
Tools and Reasoning
We are getting close! Our newly trained LLM is now trying its best to be âhelpfulâ (not because it likes you but because thatâs how it got its reward). But we still miss 2 things.
First, we need a way to let the LLM âactsâ (rather than just being able to answer us). Using the same tricks as above, you could get an early LLM to âmake a web searchâ with the following prompt:
INPUT: "What is the capital of France ?"
SEARCH: "capital of France"
OUTPUT: [REDACTED SEARCH RESULTS]
INPUT: <INSERT ACTUAL QUESTION HERE>
SEARCH:
The LLM would predict the next word after âSEARCH:â and was likely to produce a search query. Then, you could automatically parse that query, execute it on a search engine and feed the results back to the LLM (in another call).
As soon as 2021, an OpenAI team showed with WebGPT that it was possible to greatly improve the ability of LLMs to use tools. They modified the supervised fine-tuning by asking the LLM to reproduce human-made demonstrations of the desired behavior (a.k.a searching the web before answering) and the post-training processes by integrating a custom RLHF reward-model5, rewarding the LLM when it used the right tools for a given query.
Quick emphasis because itâs important. The tools name, parameters and description are all part of the prompt (or input) of the LLM.
Since then, the concept of âtoolsâ (or âfunction callingâ) has become one of few stable foundations of the LLM ecosystem. It became a key part of the post-training process too. An early form of Reinforcement Learning (RL) consisted in rewarding the LLM when it used the right tool (or combination of tools) for a given query. Nowadays, all major labs (OpenAI, Anthropic, âŠ) dedicate a significant part of their compute budget to the post-training steps improving their modelsâ ability to use important tools (like web_search, read_file or run_command). At inference time, they all provide you with a way to define your own tools so their LLMs can act the way you want them to.
Second, âReasoningâ emerged the same year in a similar manner from a Google team. It was a two-step discovery. First in 2021, they found that LLMs were more likely to produce a correct answer to a complex problem if given the option to write âintermediary stepsâ before producing its final answer (see paper). In 2022, they coined the term âChain-of-Thoughtâ (CoT) prompting when they got even better results by adding âintermediary stepsâ in the examples in the prompt (see paper).
Then, in late 2024, OpenAI found a way to leverage this strange property of early LLMs directly in their training process. To make their next generation of models smarter, they included a new step in their post-training process. The models in their o1 series got more reward when they produced a few intermediary âreasoningâ steps before answering. To distinguish the âchain of thoughtâ part from the final answer, they trained the model to generate a special token before and after its reasoning steps (see paper). As these models were trained to do so, they spontaneously exhibited the behaviors and the performance described in the original CoT prompting papers, but without any special techniques required from the user.
Finally harnessing the power of LLMs: Agents
At last, we have what we need: a conversational LLM trying to be correct and helpful with a native ability to use âtoolsâ and âreasonâ. Making an agent from there is easy: put the LLM in a loop where given a user query, it can freely use tools to interact with its environment and reason about the problem at hand. Et voilĂ , you have a Copilot.
The most popular AI coding tools (Claude Code, OpenAI Codex, âŠ) are relatively thin wrappers around this core idea. Their main purpose is to reduce the amount of human work required to leverage the modern LLMs, i.e. managing the LLM context, orchestrating LLM calls. Subagents, skills, instructions, saved prompts, even MCP servers can mostly be seen as convenience layers to help the human user to manage what goes in the LLMâs input.
Other kind of âharnessesâ are (re)appearing6 since beginning of 2026. As models improve and people get better at identifying the bottlenecks in their workflows, the well-established tools are challenged by new ideas and custom workflows. Anthropic recently published a blog post where they explore how to orchestrate a duo of agents (one generating code, the other providing feedback) interacting in a loop to achieve more autonomy.
Future articles will explore these developments in more detail. Thanks for reading this fairly dense and long article! I hope you enjoyed it đ.
Footnotes
-
During the training of a Machine Learning model,we use a sophisticated function called the âloss functionâ, it takes 2 inputs: the modelâs prediction and the âground truthâ (the correct answer). It outputs a number (the loss) that represents how wrong the model is. The reward is the inverse of the loss. The algorithm training the model (usually some variant of gradient descent) then uses this number to update the modelâs weights to be less wrong in the future. â©
-
RLHF was actually discovered way earlier in 2017 by a joint team from OpenAI and DeepMind. They applied the idea of learning a reward model from human feedback to improve the performance of a reinforcement learning agent playing Atari games (paper). 2022 was still the first time this idea was applied to LLMs. â©
-
Training LLMs is a complex, secretive and quickly evolving process. If the steps presented above are well documented, their exact combination is a well-kept secret in all the leading U.S. based labs (OpenAI, Anthropic, Google, etc.). They never share the key details of their training process (even when they publish papers), their Chinese competitors (DeepSeek, Alibaba, etc.) are more open and share a lot more. â©
-
How you combine these steps would deserve a whole other blog post as you can get fairly creative with it. For example, in the DeepSeek R1 paper, the authors used a âself-improvementâ loop where they ask the LLM to generate new RLVR questions and answers to autonomously expand its post-training dataset, then they trained on the newly generated dataset and repeat the process as long as they are able to measure improvements. â©
-
Technically, as WebGPT precedes ChatGPT, they also introduced the idea of using a reward model mimicking human-preferences during the post-training. â©
-
The âearly daysâ were very fertile and creative times. Arguably, Minecraftâs Voyager (2023) can be viewed as the ancestor of both Code-Mode / Programmatic tool calling and Karpathyâs
autoresearch. Reflexion (2023) introduced the idea of having one agent overseeing the action of another agent and providing feedback well before Anthropicâs âHarness design for long-running application developmentâ. â©