How LLMs Actually Work: Tokens, Transformers and Why They Lie

Welcome to Grind Engineer, your guide to becoming a better software engineer! No fluff. Pure engineering insights.

Job openings have been added at the end of the article!

ChatGPT, Claude, Gemini. You use them daily. You build products on top of them. But most engineers treat them as a black box with a text input and a text output.

That is fine for casual use. It is a liability when you are building production systems. Understanding how an LLM actually works is what separates engineers who debug AI failures from engineers who stare at them confused.

No math. No PhD required. Just the mental model every software engineer should have.

TL;DR: LLMs break text into tokens, convert them into vectors, run them through a transformer that works out how each token relates to the others, then predict the next token, one at a time. Temperature controls how varied or predictable those predictions are. Hallucinations happen because the model is always predicting, never looking anything up.

1. Tokens and Tokenization

LLMs do not read words. They do not read characters. They read tokens.

A token is a chunk of text, roughly 3 to 4 characters of English on average. Common words are single tokens. Rare words split into multiple tokens. The exact splits depend on the tokenizer, so treat the examples below as illustrative.

"Hello" → ["Hello"]                         # 1 token
"unbelievable" → ["un", "believ", "able"]   # 3 tokens
"ChatGPT" → ["Chat", "G", "PT"]             # 3 tokens
"王" (Chinese character) → ["王"]            # 1 token, even though it is 3 bytes in UTF-8

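Why tokens instead of words or characters? A word-level vocabulary would need millions of entries and would still miss new words. Character-level input would make every sequence several times longer. Tokens sit in between: a vocabulary of 50,000 to 100,000 tokens covers most human language. GPT-4's tokenizer has a vocabulary of around 100,000 tokens. Other frontier models, Claude included, are generally assumed to sit in a comparable range.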
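To make the splitting concrete, here is a minimal sketch of a toy tokenizer. It assumes a tiny invented vocabulary (`TOY_VOCAB`) and uses greedy longest-match, which is simpler than the byte-pair encoding that production tokenizers use. Treat it as intuition, not as how any real tokenizer is implemented.

```python
# Toy vocabulary, invented for illustration; real vocabularies hold ~100,000 entries.
TOY_VOCAB = {"Hello", "un", "believ", "able", "Chat", "G", "PT"}


def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unknown characters become single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # No vocabulary entry starts here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


print(toy_tokenize("Hello", TOY_VOCAB))         # ['Hello']
print(toy_tokenize("unbelievable", TOY_VOCAB))  # ['un', 'believ', 'able']
print(toy_tokenize("ChatGPT", TOY_VOCAB))       # ['Chat', 'G', 'PT']
```

Anything outside the vocabulary degrades to one token per character, which is the toy version of "rare words cost more tokens."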

This has practical implications for engineers:

Fact	Implication
1 token ≈ 4 characters	A 1,000 word document ≈ 1,300 tokens
Rare words cost more tokens	Technical jargon, code, and non-English text cost more than plain English
You are billed per token	Verbose prompts cost more. Concise prompts are cheaper.
Context limits are in tokens	"128K context window" means 128,000 tokens, roughly a 100,000 word novel
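The rules of thumb in the table can be turned into a quick budgeting helper. This is a rough sketch: the 4-characters-per-token ratio only holds for English prose, and the price used here is a made-up placeholder, not any provider's real rate. For exact counts, use your provider's own tokenizer.

```python
CHARS_PER_TOKEN = 4  # rule of thumb from the table above; English prose only


def estimate_tokens(text: str) -> int:
    """Rough token count from character length, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a given number of tokens at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


prompt = "Summarize the following incident report in three bullet points."
tokens = estimate_tokens(prompt)
# HYPOTHETICAL price of $3 per million input tokens, for illustration only.
print(tokens, estimate_cost_usd(tokens, usd_per_million_tokens=3.0))
```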

2. Transformer Intuition (No Math)

Once text is tokenized, each token becomes an embedding: a list of numbers representing its meaning (covered in our previous article). But here is where transformers go further than simple embeddings.

The word "warm" means something different in "She gave me a warm hug" vs "The warm weather is lovely." A static embedding gives the same vector for "warm" in both sentences. A transformer gives different vectors.

How? Through attention.

Think of attention as every token asking two questions:

"Which other tokens in this sentence are relevant to understanding me?"
"How much should I pay attention to each of them?"

When the transformer processes "warm" in "She gave me a warm hug," it pays high attention to "hug" and "gave" (emotional context). In "The warm weather is lovely," it pays high attention to "weather" (physical context). These attention weights change the meaning vector of "warm" differently in each case.
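If you want to see the mechanism without the math, here is a small sketch of how attention weights fall out of vector similarity. The 2-D vectors are hand-made for this example (axis 0 loosely means "emotion", axis 1 loosely means "weather"). Real models learn separate query, key, and value projections with thousands of dimensions, so this only shows the scoring step.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: list[float], keys: dict[str, list[float]]) -> dict[str, float]:
    """Scaled dot-product score between one query and every key, normalized with softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys.values()]
    return dict(zip(keys.keys(), softmax(scores)))


# Hand-made 2-D vectors: axis 0 ~ "emotion", axis 1 ~ "weather". Not real embeddings.
warm_query = [1.0, 1.0]
hug_sentence = {"She": [0.1, 0.0], "gave": [1.0, 0.0], "hug": [3.0, 0.0]}
weather_sentence = {"The": [0.0, 0.1], "weather": [0.0, 3.0], "lovely": [0.5, 0.5]}

print(attention_weights(warm_query, hug_sentence))      # "hug" gets the largest weight
print(attention_weights(warm_query, weather_sentence))  # "weather" gets the largest weight
```

The same query for "warm" ends up weighted toward different neighbors in each sentence, which is why its final vector differs.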

This happens across multiple attention heads simultaneously. Each head can learn to pick up a different type of relationship: syntax, semantics, coreference, proximity. All heads run in parallel, and their outputs are combined.

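Then this whole process repeats across many layers. The largest GPT-3 model has 96 layers. Layer counts for Claude models are not public, but frontier models are generally assumed to be similarly deep. Each layer refines the meaning of each token by attending to the other tokens it can see (in GPT-style decoder models, that is the token itself and everything before it). By the final layer, each token's vector is a richly contextualized representation of its meaning within the input.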

💡 Key Insight: The transformer processes all tokens in your prompt in parallel, not one after another. It reads "bank" and resolves whether you mean a river bank or a financial bank by attending to the context around it. This parallel processing is why transformers run efficiently on GPUs and why context matters so deeply to LLM output quality.

3. Training vs Inference

These two phases are completely different operations.

Training:

A model is trained on a huge corpus, anywhere from hundreds of billions to trillions of tokens drawn from the internet, books, code, and other text. The training objective is simple: predict the next token. Given "The cat sat on the", predict "mat."

The model starts with random weights. It makes a prediction. The prediction is compared to the actual next token. The error is propagated back through the network (backpropagation), nudging the weights slightly in the direction that reduces it. This repeats for every token in the corpus. The weights converge to something that has compressed the statistical patterns of human language.
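The predict, compare, nudge loop can be sketched in a few lines. This toy assumes a three-word vocabulary and a single training example, and it starts from equal scores instead of random weights so the numbers are reproducible. A real model updates billions of weights, not three logits.

```python
import math

VOCAB = ["mat", "dog", "moon"]


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(logits: list[float], target_index: int, learning_rate: float = 0.5):
    """One gradient-descent step on cross-entropy loss; returns (new_logits, loss)."""
    probs = softmax(logits)
    loss = -math.log(probs[target_index])
    # Gradient of cross-entropy w.r.t. the logits is (probs - one_hot(target)).
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    new_logits = [x - learning_rate * g for x, g in zip(logits, grads)]
    return new_logits, loss


logits = [0.0, 0.0, 0.0]       # untrained: every token equally likely
target = VOCAB.index("mat")    # the actual next token after "The cat sat on the"
losses = []
for _ in range(20):
    logits, loss = train_step(logits, target)
    losses.append(loss)

print(round(losses[0], 3), round(losses[-1], 3))  # the loss shrinks as "mat" gains probability
```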

Training a frontier model costs tens to hundreds of millions of dollars in compute. GPT-4 training was estimated at over $100M. You do not train frontier models. You use them.

Inference:

Inference is what happens when you send a prompt. The model takes your input tokens, runs them through its frozen (fixed, unchanging) weights, and predicts the next token. Then that token is added to the context and the model predicts the next one. And the next. And the next.

This is called autoregressive generation: generating text one token at a time, each prediction conditioned on everything before it.
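Here is a minimal sketch of that loop. The `NEXT_TOKEN_PROBS` table is a hypothetical stand-in for a trained model's frozen weights, and it only looks at the last token. A real LLM conditions on the entire context at every step.

```python
# Hypothetical next-token table standing in for a trained model's frozen weights.
NEXT_TOKEN_PROBS = {
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"on": 0.9, "down": 0.1},
    "on": {"the": 1.0},
    "the": {"mat": 0.8, "sofa": 0.2},
    "mat": {"<end>": 1.0},
}


def generate(prompt_tokens: list[str], max_new_tokens: int = 10) -> list[str]:
    """Greedy autoregressive decoding: predict, append, repeat."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        distribution = NEXT_TOKEN_PROBS.get(context[-1], {"<end>": 1.0})
        next_token = max(distribution, key=distribution.get)  # greedy pick
        if next_token == "<end>":
            break
        context.append(next_token)  # the new token becomes part of the input
    return context


print(generate(["The"]))  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```

Note that the weights (the table) never change during generation. Only the context grows.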

	Training	Inference
When	Once, before release	Every time you send a prompt
Compute	Tens to hundreds of millions of dollars	Fractions of a cent to dollars per query
Weights	Being updated	Frozen
Goal	Learn patterns from data	Predict next token given input
Duration	Weeks to months	Milliseconds to seconds

4. Context Window

The context window is the working memory of an LLM. Everything the model can see at once: your system prompt, the conversation history, retrieved documents, tool outputs, and the current message.

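Think of it as a desk. Everything on the desk is visible to the model. The desk has a fixed size: once it fills up, something has to go. Chat applications usually drop or summarize the oldest messages, and an API request that exceeds the limit is typically rejected outright.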
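The desk analogy maps onto a trimming step that many chat applications run before each request. This is a sketch under simple assumptions: messages are plain strings, token counts are estimated at 4 characters per token, and the oldest messages are dropped first. The `trim_to_budget` helper is hypothetical, not part of any SDK.

```python
def trim_to_budget(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose estimated token total fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):       # walk from newest to oldest
        cost = max(1, len(message) // 4)     # rough estimate: 1 token ~ 4 characters
        if used + cost > max_tokens:
            break                            # everything older falls off the desk
        kept.append(message)
        used += cost
    return list(reversed(kept))              # restore chronological order


history = ["a" * 40, "b" * 40, "c" * 40]     # three messages, ~10 tokens each
print(len(trim_to_budget(history, max_tokens=25)))  # 2
```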

Context limits have grown dramatically:

Model	Context Window	Approximate equivalent
GPT-3 (2020)	2K tokens	~1,500 words
GPT-4 (2023)	8K to 32K tokens	~6,000 to 24,000 words
Claude 3 (2024)	200K tokens	~150,000 words (~a full novel)
Gemini 1.5 Pro (2024)	1M tokens	~750,000 words

Bigger context sounds better. It often is not, for two reasons:

Cost: Most APIs charge per token. A 200K token prompt costs 50x more in input tokens than a 4K token prompt.

Attention dilution: Full self-attention computes on the order of n² pairwise relationships for n tokens. With 200,000 tokens, that is 40 billion pairs, in every attention head of every layer. Attention gets spread thin. Tokens in the middle of long contexts are attended to less reliably than tokens at the start and end. This is the "lost in the middle" problem, named after a 2023 study (Liu et al., "Lost in the Middle: How Language Models Use Long Contexts"): models answer questions well when the relevant information sits at the beginning or end of a long context, but miss information buried in the middle.
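The quadratic growth is easy to check with plain arithmetic. The sketch below counts score pairs for full self-attention only. Production systems use various optimizations, so read it as a scaling intuition, not a latency model.

```python
def pairwise_relationships(num_tokens: int) -> int:
    """Full self-attention compares every token with every token: n * n scores."""
    return num_tokens * num_tokens


for n in (4_000, 32_000, 200_000):
    print(f"{n:>7,} tokens -> {pairwise_relationships(n):,} pairs")

# Input size grows 50x from 4K to 200K, but the pair count grows 2,500x.
print(pairwise_relationships(200_000) // pairwise_relationships(4_000))  # 2500
```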

The production rule: A smaller context window with strong retrieval (RAG) almost always outperforms a massive context window stuffed with everything. Give the model exactly what it needs, not everything you have.
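To show what "give the model exactly what it needs" looks like, here is a deliberately naive retrieval sketch. It ranks chunks by shared words with the question, which is a stand-in for the embedding similarity search a real RAG pipeline would use. The documents and the `retrieve` helper are invented for this example.

```python
def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by how many question words they share; return the best top_k."""
    question_words = set(question.lower().split())

    def overlap(chunk: str) -> int:
        return len(question_words & set(chunk.lower().split()))

    ranked = sorted(chunks, key=overlap, reverse=True)
    # Drop chunks with no overlap at all so irrelevant text never reaches the prompt.
    return [chunk for chunk in ranked[:top_k] if overlap(chunk) > 0]


docs = [
    "refunds are processed within five business days",
    "the office is closed on public holidays",
    "refunds require the original order number",
]
context = retrieve("how are refunds processed", docs)
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)
```

Only the two refund chunks make it into the prompt. The holiday chunk never costs a token.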

5. Temperature and Sampling

Once the transformer finishes processing your input, it produces a probability distribution over all tokens in the vocabulary. Every possible next token gets a probability. "The" might get 12%. "A" might get 8%. "Photosynthesis" might get 0.0001%.

Temperature controls how sharp or flat that distribution is.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Low temperature (0.1): sharper distribution
# Most probability collapses onto the top few tokens
# Predictable, near-deterministic output
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,  # required by the Messages API
    temperature=0.1,  # almost always picks the highest probability token
    messages=[{"role": "user", "content": "Write a SQL query to get top users"}]
)
print(response.content[0].text)

# High temperature (0.9): flatter distribution
# Probability spreads across more tokens
# Creative, varied, unpredictable output
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    temperature=0.9,  # may pick surprising but valid tokens
    messages=[{"role": "user", "content": "Write a poem about distributed systems"}]
)
print(response.content[0].text)
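Under the hood, temperature is usually applied by dividing the model's raw scores (logits) by the temperature before the softmax. Here is a minimal sketch with made-up logits for three candidate tokens. The function name is my own, not part of any SDK.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then normalize into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy argmax for 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At 0.1 almost all the probability lands on the top token. At 2.0 the three options move much closer together.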

Temperature	Effect	Use for
0.0	Always picks the highest probability token (greedy). Near-deterministic, though hosted APIs can still show small run-to-run differences.	Structured outputs, JSON, SQL, classification
0.1 to 0.3	Very predictable, slight variation	Code generation, factual Q&A, summarization
0.7 to 0.9	Creative, varied	Creative writing, brainstorming, marketing copy
Above 1.0	Increasingly random, often incoherent (not every API allows it)	Rarely useful in production

The engineer's rule: For any task that requires correctness (code, data extraction, structured output), use temperature 0 to 0.2. For any task that rewards creativity (writing, ideation), use 0.7 to 0.9.

6. Why Hallucination Happens

Hallucination is the most misunderstood aspect of LLMs. Engineers often frame it as a bug that will be fixed in the next version. It is not. It is a fundamental property of how LLMs work.

The model is always predicting the statistically most plausible next token. It is never looking anything up. It has no database query to run. No source of truth to consult. It is pattern matching at massive scale, and sometimes the most statistically plausible continuation of a sentence is factually wrong.

Three root causes:

1. The training data ended. The model's knowledge is frozen at its training cutoff. Ask it about something that happened after training and it will predict what sounds most plausible based on patterns, not facts.

2. The fact was rare in training data. If a fact appeared 3 times in the training corpus, the model's weights barely captured it. Ask about it and the model fills in with what seems statistically similar. Like asking someone to recall the 847th word they read last Tuesday.

3. The question looks like a pattern the model knows. "What year was Einstein born?" looks like "What year was [famous person] born?" which has a clear answer pattern. If the model saw enough confident wrong answers about Einstein in training data (or similar patterns that bleed together), it will confidently generate a wrong year.

The context window makes this worse at scale: attention gets diluted across thousands of tokens, and models sometimes generate text that seems plausible given the surrounding patterns but contradicts a specific fact buried 50,000 tokens back.

How to mitigate hallucinations in production:

Technique	How it helps
RAG	Grounds answers in retrieved documents. Model reads facts instead of recalling them.
Low temperature	Reduces random sampling. Model sticks closer to high confidence tokens.
Explicit uncertainty prompting	"If you are not sure, say 'I don't know.'" Gives the model explicit permission to abstain instead of guessing.
Output verification	For critical facts, verify programmatically against a source of truth.
Smaller, focused context	Prevents attention dilution. Give the model exactly what it needs.
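Of the techniques above, output verification is the easiest to show in code. This sketch assumes the model's answer has already been parsed into key/value claims and that you have a trusted record to compare against. Both dictionaries and the `verify_claims` helper are invented for illustration.

```python
def verify_claims(claims: dict[str, str], source_of_truth: dict[str, str]) -> dict[str, str]:
    """Label each model-produced claim as 'verified', 'contradicted', or 'unknown'."""
    report = {}
    for key, claimed_value in claims.items():
        if key not in source_of_truth:
            report[key] = "unknown"          # nothing to check against: escalate or drop
        elif source_of_truth[key] == claimed_value:
            report[key] = "verified"
        else:
            report[key] = "contradicted"     # likely hallucination: do not ship
    return report


# Both dictionaries are invented for illustration.
model_output = {"plan": "pro", "renewal_date": "2025-01-31", "seats": "12"}
database_row = {"plan": "pro", "renewal_date": "2025-03-31"}
print(verify_claims(model_output, database_row))
# {'plan': 'verified', 'renewal_date': 'contradicted', 'seats': 'unknown'}
```

The point of the design is that the model's output never reaches the user unchecked when a source of truth exists.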

What This Means For Engineers

LLMs are prediction machines, not knowledge bases. Every output is a statistical prediction. The model never "knows" something the way a database knows a row. Build your systems accordingly: use RAG for factual recall, verification for critical outputs, and treat LLM output as a first draft, not a final answer.
Temperature is a dial you control. Use it deliberately. Temperature 0 for structured output. Temperature 0.7 for creative tasks. Most engineers leave it at the default and wonder why their code generation is inconsistent or their creative output is boring.
Context window size is not free. Every token in context costs money, adds latency (n² attention), and dilutes focus. Build systems that give models exactly what they need, nothing more. A 10K token context with the right 10K tokens beats a 200K token context stuffed with marginally relevant data.

Job Openings

Software Engineer, New Grad @Stripe: Apply Here
Software Engineer, Payments and Risk @Stripe: Apply Here
Software Engineer, Data & AI @Stripe: Apply Here
Software Engineer (1+ YOE) @Stripe: Apply Here
Software Engineer 2, iOS @Uber: Apply Here

Follow me on YouTube · LinkedIn · X · Instagram to stay updated.

See you in the next one!
Scortier, Signing Off!
