Which flavor of software are LLMs exactly?

Preamble: Long time no see! It’s been work, work, work lately. I even went to Google Cloud Next 2026, which was a blast (expect a future blog post…).

In my previous blog post, I introduced Software 2.0, a concept coined by Andrej Karpathy and more commonly known as Machine Learning (ML): a new software paradigm where human-written instructions are replaced by weights automatically tuned to get the desired behavior¹.

Let’s build upon that and discuss how Large Language Models (LLMs) fit this picture. LLMs are the foundation of what most people have called Artificial Intelligence (AI) for the last 4 years (yes Dad, even ChatGPT). They are technically a special kind of Neural Network (NN) and should fit nicely in the Software 2.0 category… but they don’t.

1. The nice and cozy pre-LLM era

Artificial Intelligence and Neural Networks have been around pretty much since the invention of computers². Neither is particularly new. Initially limited by hardware capabilities and by their tendency not to behave nicely during training, Neural Networks weren’t the sharpest tools in the shed of Machine Learning techniques for years. But around 2012, three distinct trends met. First, the hardware DID finally catch up. Second, researchers discovered a few small tweaks to known techniques that greatly improved the training process of Neural Networks, making it faster and more stable. Third, as the internet grew, bigger datasets than ever before were put together. In the following years, Neural Networks quickly gained a lot of interest and became the state-of-the-art (a.k.a. the best) method to classify images, detect objects or people in images, and much more.

But during that time, Neural Networks still respected most of the numerous “rules” of Machine Learning that practitioners either stole from statisticians (them again!) or painfully learned through experimentation. First, Machine Learning models were trained specifically for the tasks they would be used for. If you wanted a Neural Network to detect squirrels eating your precious flowers, you needed to train³ a Neural Network to detect squirrels, and you needed a dataset of squirrels to train it on. Second, as with other kinds of Machine Learning models, you had to limit the complexity of the model to match the amount of available training data. Indeed, Neural Networks are very good at memorizing data, and if you give them too much power (too many weights) for the amount of data they are trained on, they will just memorize the training data and not learn to generalize to new data. For our example, not generalizing would mean having a model only able to detect squirrels with exactly the same background, lighting and flowers as in the training data. Not super useful… Finding the right balance between model complexity and training data size was still important.
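To make the memorization risk concrete, here is a minimal sketch on made-up data. The tail-length feature, the numbers and both toy “models” are assumptions for illustration, not real squirrel detection: a lookup table stands in for a model with too much capacity, and a single learned threshold stands in for a model sized to the data.

```python
# Toy illustration of memorizing vs. generalizing (all data is made up).
# Each example is (tail_length_cm, label), with 1 = squirrel, 0 = not a squirrel.
train = [(10, 0), (12, 0), (18, 1), (20, 1)]
test = [(11, 0), (14, 0), (17, 1), (22, 1)]


def fit_memorizer(examples):
    """'Too much capacity': store every training example verbatim."""
    table = dict(examples)
    # Anything not seen during training falls back to "not a squirrel".
    return lambda x: table.get(x, 0)


def fit_threshold(examples):
    """'Right-sized': learn one number, the midpoint between the class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)


memorizer = fit_memorizer(train)
threshold_model = fit_threshold(train)

print("memorizer train/test:", accuracy(memorizer, train), accuracy(memorizer, test))
print("threshold train/test:", accuracy(threshold_model, train), accuracy(threshold_model, test))
```

Both toy models are perfect on the training data, but only the threshold model stays perfect on the unseen examples; the memorizer drops to 50% because it has nothing to say about tail lengths it never saw.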

Until then, everything was rosy. Neural Networks, while powerful, still behaved like the rest of the Machine Learning models. While there was a lot to learn and to adapt to, ML practitioners (me, at least) still felt more or less at home.

2. LLMs did not kindly knock at the door

Then, between 2017 and 2022, multiple breakthroughs occurred in succession and opened the new era we are in.

First and most famously, in 2017, a Google R&D team gave birth to a new Neural Network architecture called the Transformer (in what is now probably the most cited paper ever in the ML field). It’s hard to overstate the improvements this architecture brought: way easier to parallelize than the alternatives (important if you want to train on a lot of data), more stable to train than RNNs or LSTMs (the previous state of the art for sequence modeling), and extremely expressive (meaning it can learn all sorts of patterns in the data) while being more parameter-efficient⁴. The initial paper applied this new architecture to translation tasks and featured two Neural Networks that were relatively large for the time: ~65 million and ~213 million parameters, taking 12 hours and 3.5 days respectively to train on a small GPU cluster.

One year later, in 2018, a team at OpenAI was among the first to apply the Transformer architecture to predicting the next word given a piece of text: GPT-1, with 117M parameters. Four months later, the Google team replied with BERT (340M parameters), which could not generate text but was better at most tasks people cared about at the time.

The year after, in 2019, the OpenAI team wanted to see if they could get better results by “simply” increasing the size of the Neural Network and of its training data: GPT-2, with 1.5 billion parameters. And they kept doing that in 2020 with the now famous GPT-3 and its 175 billion parameters.

Beyond the raw parameter count, GPT-3 marked the first time people experienced “emergent capabilities”. The model became so good at predicting the next word that it also became “decent” at other tasks it wasn’t trained for: translation, question-answering and, even more disturbingly, following instructions and following examples⁵. In 2026, it seems pretty normal, but writing a piece of text that describes a task with a few examples, and having a model trained only to predict the next word in a sentence carry it out, was mind-blowing at the time (that was sci-fi for me). In the “modern” sense of the term, GPT-3 was the first Large Language Model (LLM).
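A prompt of that kind (a task description plus a few examples) can be sketched as plain string building. The translation task, the examples and the `build_few_shot_prompt` helper are hypothetical; no model is called here, and the resulting text would simply be sent to whichever LLM you use.

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a task description, worked examples, and an unfinished final line."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")
    # The last line is left open on purpose: a next-word predictor
    # "completes" it, and that completion is the answer.
    lines.append(f"English: {query}")
    lines.append("French:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    task="Translate English to French.",
    examples=[("cheese", "fromage"), ("squirrel", "écureuil")],
    query="flower",
)
print(prompt)
```

Nothing in the model changes between tasks: swapping the task description and the examples is all it takes to ask for something else.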

What did LLMs mean for my job?

Having the technical infrastructure to allow for multi-GPU training was quite rare in 2017. So even the initial “small” Transformer, with its 65M parameters and 12 hours of training on an 8-GPU cluster, would have been out of reach for most teams and individuals.

I agree that 12 hours doesn’t sound that bad, and you could argue that on 1 GPU it would have taken 4 days, not exactly “out of reach”. But something key to understand is that you don’t train it once, you train it dozens of times, mainly for two reasons. First, training Neural Networks is fundamentally a non-deterministic convergence process started from a “random” initialization… Sometimes it converges to a good solution, sometimes it doesn’t. You need to train it multiple times to be sure that the results you are seeing are not just (bad) luck.

Second, there are a lot of little things you can change when training Neural Networks: the “random” initialization, the convergence mechanism and so on. We call these hyperparameters. You can’t predict which set of hyperparameters will work best for a given Neural Network, so you need to “try them all” 🙃. Multiple times each, ideally.

There are technical solutions to speed up this process, like using a smaller subset of the data to quickly test a set of hyperparameters before training on the full dataset, or interrupting the training process early, but the whole search still requires a lot more compute and time than “one training”.
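To get a feel for how quickly “dozens of times” adds up, here is a back-of-the-envelope sketch. The hyperparameter grid and the number of seeds are assumptions chosen for illustration; only the 12 hours on 8 GPUs comes from the numbers above, and the single-GPU figure assumes the same linear scaling as the “4 days on 1 GPU” estimate.

```python
from itertools import product

# Hypothetical hyperparameter grid (values are illustrative assumptions).
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "warmup_steps": [2000, 4000],
    "dropout": [0.1, 0.3],
}
seeds_per_config = 3  # repeat each configuration to average out (bad) luck
hours_per_run = 12    # one full training run of the "small" Transformer...
gpus_per_run = 8      # ...on an 8-GPU cluster

# Every combination of one value per hyperparameter is one configuration.
configs = list(product(*grid.values()))
total_runs = len(configs) * seeds_per_config
gpu_hours = total_runs * hours_per_run * gpus_per_run

print(f"{len(configs)} configurations x {seeds_per_config} seeds = {total_runs} runs")
print(f"{gpu_hours} GPU-hours, i.e. {gpu_hours / 24:.0f} days on a single GPU")
```

Even this modest grid means 36 full trainings, or 3,456 GPU-hours: about 144 days on a single GPU, before any of the shortcuts mentioned above.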

And what about GPT-3 and its 175 billion parameters? Almost three thousand times bigger than that “small” Transformer. The training of GPT-3 was estimated to cost around $4.6 million in cloud compute. Out of reach would be an understatement.

But LLMs had, and still have, such great potential that many people whose job involved training custom ML models switched to simply “exploiting” already-trained LLMs. “Prompt engineering” and “context engineering” replaced “hyperparameter optimization”, “data augmentation” and “feature engineering”. Some skills were transferable, but not all.

If your job is changing because of AI, trust me, I feel you.

3. Software 3.0: Wait, is English code now?

Software 1.0 means having instructions meticulously laid down for the “Computer” to execute. We usually call these instructions “code”. Software 2.0 means that instead of writing code you use data and a training procedure to automatically “program” a black box (the ML model) that the “Computer” can use to do the job. The black box ingests an input (e.g. a picture) and detects if there is a squirrel in it. That’s all it does. If it expects an image, but somehow you feed it a piece of text, its output will still be “Squirrel” or “No Squirrel”.
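The contrast can be sketched in a few lines. Everything here is a toy assumption: the two features (`fluffiness`, `size`), the hand-written thresholds and the four training examples are invented, and a tiny perceptron stands in for the Neural Network.

```python
# Software 1.0: a human writes the rule.
def detect_v1(fluffiness, size):
    return "Squirrel" if fluffiness > 0.5 and size < 0.5 else "No Squirrel"


# Software 2.0: a human provides labeled data; a training procedure finds the weights.
# Each example is ((fluffiness, size), label), with 1 = squirrel.
DATA = [((1.0, 0.2), 1), ((0.9, 0.3), 1), ((0.2, 0.9), 0), ((0.1, 0.8), 0)]


def train_perceptron(data, epochs=10):
    """Classic perceptron rule: nudge the weights whenever a prediction is wrong."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), label in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = label - pred  # 0 when correct, +1 or -1 when wrong
            w[0] += error * x1
            w[1] += error * x2
            b += error
    return w, b


WEIGHTS, BIAS = train_perceptron(DATA)


def detect_v2(fluffiness, size):
    # The "black box": whatever numbers come in, one of two labels comes out.
    score = WEIGHTS[0] * fluffiness + WEIGHTS[1] * size + BIAS
    return "Squirrel" if score > 0 else "No Squirrel"


print(detect_v1(0.95, 0.1), "|", detect_v2(0.95, 0.1))
print(detect_v1(0.15, 0.85), "|", detect_v2(0.15, 0.85))
```

In `detect_v1` the behavior lives in the `if` statement someone typed; in `detect_v2` it lives in `WEIGHTS` and `BIAS`, which nobody typed. Either way, the only possible answers are “Squirrel” and “No Squirrel”.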

LLMs are different. Well, they are also a black box automatically trained on data that the “Computer” can use, but you can steer their behavior by changing their input 🤯. Let’s repeat that. When it comes to LLMs, English is very much a programming language-ish. You can modify the behavior of the “Computer” by describing what you want in English (beware, the “Computer” may not respect the instructions). In a recent talk, Andrej Karpathy⁶ argued that LLMs and what we build upon them deserve their own category: Software 3.0.

Where it became dizzying for me was when I realized that LLMs are able to write code in good ol’ Software 1.0 programming languages, like Python, C# or Rust. They are not as capable as human developers for now, but it creates a surreal feeling when I remind myself what these things are, the stack of technology they are built upon, and then the fact that they could soon either improve or degrade the very stack they depend on.

(Despite using these daily, I’m still baffled that any of this could work)

4. Conclusion

The implications are many, and we are still figuring them out. Software engineers and associated roles are going through a significant identity crisis. Schools and teachers are challenged in many of their core practices, and other professions are impacted to varying degrees. LLMs increase mental health risks, have significant environmental impacts, and enable new forms of cybercrime. They challenge the very definition of “work”, “quality” or “communication”. They change the pace at which a lot of previously slow things can happen. Many processes and practices are in shambles, ill-suited to the new world they find themselves in.

We are all living through weird, unstable and sometimes surreal times. There is no way to put the genie back in its bottle. Let’s try our best to understand what it is, how it works and how to use it, so we don’t end up being used by it.

A typical squirrel trying to eat my precious flowers, courtesy of Richard Sagredo

¹ Let’s mention that the process to train these weights still relies on good ol’ Software 1.0 instructions.
² According to Wikipedia, the first “artificial neurons” were designed in 1943, before computers were even a thing.
³ Or fine-tune a Neural Network to detect squirrels. Fine-tuning here means taking an existing Neural Network trained to do a similar (often harder) task, like recognizing up to 21,841 different kinds of animals and objects, and then re-tweaking its already-learned weights on your specific squirrel dataset. In order not to lose the initial performance of the trained model, the “re-tweaking” procedure is slightly different from a full training procedure, but it requires way less data and compute.
⁴ If you haven’t yet, consider watching the following videos about Transformers and the Attention Mechanism from 3Blue1Brown. Highest quality content on the subject I am aware of.
⁵ Describing LLMs as “next-word predictors” is a massive oversimplification. Since the GPT-3 paper, the training pipeline of LLMs has evolved. While there is still a “next-word prediction” training part (called “pre-training”), most of the resources dedicated to training are now spent on later steps (called “post-training”) that involve Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR). These techniques are not as well documented for now and definitely deserve their own blog post 😉.
⁶ Andrej Karpathy: ex-Director of AI at Tesla, OpenAI co-founder, also the guy who coined the terms “software 2.0” and “vibe coding” (link to wikipedia)
