Large language models like ChatGPT and GPT-4 can chat with people, write copy, and even help you debug code, as if they had "superpowers." But they aren't actually "thinking": they get the job done through a statistical process built in several stages. Today, let's use plain language to explain how these AI models really work.

Step 1: Pre-training – Making the Model “Read Ten Thousand Books”

Before a large language model goes into service, it must undergo “intensive training” known as pre-training. Just as humans learn to speak by first listening to a lot of conversations, models also need to “read” massive amounts of text from the internet.

The first step in training is gathering materials. Researchers collect online articles, books, forum posts, and even open web archives like Common Crawl, which contains over 250 billion web pages. However, this raw material includes a lot of junk, such as duplicate content and advertisements, which must be cleaned out first.
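
To get a feel for the cleanup step, here's a toy sketch in Python. The filtering rules are invented for illustration; real pipelines use far more sophisticated quality filters and fuzzy deduplication, but the basic idea of "drop junk, drop duplicates" is the same.

```python
import hashlib

def clean_corpus(documents):
    """Toy cleanup pass over raw web text: drop near-empty pages,
    obvious ad boilerplate, and exact duplicates."""
    seen_hashes = set()
    cleaned = []
    for doc in documents:
        text = doc.strip()
        # Junk filter (made-up heuristics): skip tiny pages and ad boilerplate.
        if len(text) < 100 or "click here to subscribe" in text.lower():
            continue
        # Deduplication: hash the content and skip anything we've seen before.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        cleaned.append(text)
    return cleaned
```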

Even cleaned text can't be used directly, because computers only understand numbers. This is where "tokenization" comes in: breaking sentences into small units (whole words, word fragments, or even letter combinations) and assigning each unit a unique number. For example, "cooking" might be split into "cook" and "ing," each corresponding to a number. Models like GPT-4 use a "digital dictionary" of roughly 100,000 such entries for tokenization.
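
You can watch tokenization happen with OpenAI's open-source tiktoken library, which includes the cl100k_base encoding used by GPT-4 (note that the pieces a real tokenizer chooses may differ from the "cook" + "ing" illustration above):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's tokenizer

ids = enc.encode("cooking")                 # the unique numbers
print(ids)
print([enc.decode([i]) for i in ids])       # the text piece behind each number
print(enc.n_vocab)                          # dictionary size: roughly 100,000
```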

The core of training follows: making the model predict the next word. Given "We are making," the model must predict what comes next: is it "food" or "homework"? At first it guesses essentially at random, but through repeated practice, billions of prediction exercises across the dataset, it gradually figures out patterns: "The sky is" is far more likely to be followed by "blue" than by "table." After each guess, the system reveals the correct answer, and the model adjusts its internal parameters using a method called "backpropagation," much like a student correcting mistakes.
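
Here's a deliberately miniature version of that training step, written in PyTorch. The one-layer "model" and the token IDs are stand-ins invented for illustration; a real model is a deep transformer trained on billions of examples, but the guess, compare, backpropagate, adjust loop looks just like this:

```python
import torch
import torch.nn as nn

# Tiny stand-in for a language model: embed each token, then score
# every word in the vocabulary as a candidate for the next position.
vocab_size, dim = 100_000, 64
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

context = torch.tensor([[312, 4571, 2294]])  # made-up token IDs for "We are making"
target = torch.tensor([1093])                # made-up token ID for "food"

logits = model(context)[:, -1, :]  # scores for every possible next word
loss = loss_fn(logits, target)     # how far off was the guess?
loss.backward()                    # backpropagation: trace the error back to each parameter
optimizer.step()                   # nudge the parameters toward a better guess
optimizer.zero_grad()
```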

After this training, the model can skillfully predict the next word, but it's still just an "advanced autocomplete tool" that doesn't really know how to follow instructions. Ask it to "write a leave application," and it might simply continue your text (for example, by adding more instructions) instead of actually writing one.

Step 2: Post-training – Giving the Model “Vocational Courses”

A pre-trained model is like a new college graduate—knowledgeable but not skilled at doing actual work. This is where post-training comes in, equivalent to giving it “vocational training.”

The materials used in post-training are no longer randomly collected texts but specially designed "example questions." For instance, to make the model good at chatting, researchers create lots of dialogue examples like "User asks XX, AI should answer XX"; to make it understand programming, they collect plenty of "problem + code" pairs.
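
Concretely, those example questions might be stored as simple prompt-and-response pairs like the following (the examples themselves are invented for illustration):

```python
# Each pair shows the model a request and the reply it should imitate.
sft_examples = [
    {
        "prompt": "Help me write a one-sentence summary of this article: ...",
        "response": "The article explains, step by step, how language models are trained.",
    },
    {
        "prompt": "Write a Python function that reverses a string.",
        "response": "def reverse(s):\n    return s[::-1]",
    },
]
# Post-training is still next-word prediction, just restricted to
# curated pairs like these instead of raw internet text.
```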

When OpenAI trained InstructGPT back in 2022, it hired about 40 contractors to write these example questions. Today the process is more automated: AI generates candidate examples, which humans then check and revise. Through this training, the model gradually learns that when a user says "help me write a summary," it shouldn't ramble about something else, and that it must refuse malicious requests like "teach me to do bad things."

Post-training also adds some "secret codes": special symbols that mark "this is what the user said" and "this is what the AI should reply," which keeps the model from confusing the two roles. After this conditioning, the model has truly "graduated" and can handle all kinds of practical needs.
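
A chat template might look something like the sketch below. The `<|user|>` and `<|assistant|>` markers are illustrative; every model family defines its own special symbols, but they all serve the same role-separating purpose:

```python
def format_chat(user_message):
    # Wrap the conversation in role markers so the model knows
    # whose words are whose; it writes its reply after <|assistant|>.
    return (
        "<|user|>\n" + user_message + "\n"
        "<|assistant|>\n"
    )

print(format_chat("Help me write a leave application."))
```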

Step 3: Inference – How the Model “Answers Questions”

The process by which a trained model gives us responses is called “inference.” Even though its answers seem smooth and quick, it’s still doing the same old “word guessing” behind the scenes.

When you input "The weather is nice today, I want to," the model first converts the sentence into a sequence of numbers, then calculates the probability of every possible next word: say, "go to the park" at 30%, "sunbathe" at 25%, "do homework" at 5%, and so on. It doesn't simply pick the most probable option; instead, it samples from them like a weighted lottery, which keeps the answers from sounding too rigid.
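
That lottery is easy to reproduce. Here's a sketch using Python's random.choices with the toy probabilities from the example above:

```python
import random

# Toy probabilities for the next word after "The weather is nice today, I want to".
next_words = {"go to the park": 0.30, "sunbathe": 0.25, "do homework": 0.05}

# Weighted lottery: likelier words win more often, but not every time.
chosen = random.choices(list(next_words), weights=list(next_words.values()), k=1)[0]
print(chosen)
```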

After a word is chosen, it's appended to the sequence, which becomes "The weather is nice today, I want to go to the park," and the same process repeats to predict the next word. The model keeps guessing word by word until it emits a special "stop" token (or hits a length limit), at which point the answer is complete. That's also why the same question can get different answers: the "lottery results" differ each time.
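
Putting the lottery and the loop together gives the whole inference process. In this sketch, predict_next_word is a hypothetical stand-in for the trained model, and "<stop>" plays the role of the special token that ends an answer:

```python
import random

def predict_next_word(text):
    """Hypothetical stand-in for the trained model: returns candidate
    next words with their (toy) probabilities for the current text."""
    if text.endswith("I want to"):
        return {"go to the park": 0.30, "sunbathe": 0.25, "do homework": 0.05}
    return {"<stop>": 0.9, "and relax": 0.1}

text = "The weather is nice today, I want to"
while True:
    dist = predict_next_word(text)
    word = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
    if word == "<stop>":     # the special token that ends the answer
        break
    text += " " + word       # append the chosen word and guess again
print(text)
```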

Why Do Models “Tell Lies”? – Talking About Hallucinations

Sometimes models spout nonsense with great conviction, like telling you "Li Bai was from the Song Dynasty." This is called a "hallucination." The model isn't deliberately lying; it simply has no concept of what a "fact" is.

Essentially, a model is a “probability calculator.” It only knows that “Li Bai” is most likely followed by “a poet of the Tang Dynasty” but doesn’t grasp that this is actual history. If the training data contains many sentences structured like “XX is XX,” it will think all similar questions should have clear answers, even if it hasn’t learned them. For example, if you ask it “Who is Zhang San?” even if it’s never heard of this name, it might make up an identity because it thinks the probability of answering “I don’t know” is too low.

Researchers are working on ways to mitigate this problem. One approach is to explicitly train models to say "I don't know"; another is to teach them to look things up: when unsure of an answer, search online first and then respond based on the results, just as we check a book when we're unsure of a fact.
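
Here's a minimal sketch of that "look it up first" idea. The model_generate and web_search functions are hypothetical stand-ins for a language model API and a search tool:

```python
def answer_with_search(question, model_generate, web_search):
    # Step 1: search for relevant information before answering.
    snippets = web_search(question)
    # Step 2: ask the model to answer from the sources, or admit ignorance.
    prompt = (
        "Answer using only the sources below. If they don't contain "
        'the answer, say "I don\'t know."\n\n'
        f"Sources:\n{snippets}\n\nQuestion: {question}\nAnswer:"
    )
    return model_generate(prompt)
```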

In the end, a large language model runs on a cycle of "massive reading → specialized training → probability-based word guessing." Although these models seem intelligent, their way of "thinking" is completely different from ours. As the technology advances they will become more useful, and tools like WhaleFlux can further improve deployment speed and stability through optimized GPU resource management, but there is probably still a long way to go before they truly understand the world.