mjburgess 13 days ago

If commenters wish to know what is not "guessing the next word", let me outline it.

Compare, "I like what you were wearing", "Pass me the salt", and "Have you been to London recently?" as generated by an LLM and as spoken by a person.

What is the reason each piece of text (in a whatsapp chat, say) is provided?

When the LLM generates each word it does so because it is, on average, the most common word in a corpus of text on which it was trained: "wearing" follows "I like what you were" because most people who were having these conversations, captured in the training data, were talking about clothes.

When a person types those words on a keyboard, the following are the causes: the speaker's mental states of recollection, preference, taste; the speaker's affective/attachment states with respect to their friend; the speaker's habituation to social cues; the speaker's imagining through recall what their friend was wearing; the speaker's ability to abstract from their memories into identifying clothing; and so on.

Indeed, the cause of a person speaking is so vastly different from generating a word based on historical frequencies that to suppose these are related seems incomprehensible.

The only reason the illusion of similarity is effective is because the training data is a text-based observation of the causal process in people: the training data is distributed by people talking (and so on). Insofar as you cannot just replay variations on these prior conversations, the LLM will fail and expose itself as actually insensitive to any of these things.

I'd encourage credulous fans of AI not to dehumanize themselves and others by the supposition that they speak because they are selecting an optimal word from a dictionary based on all prior conversations they were a part of. You aren't doing that.

  • tiborsaas 13 days ago

    If your point is LLM-s are different than humans, I guess that goes without saying? We are not even on the same level yet.

    > the most common word in a corpus of text on which it was trained

    I think you are downplaying the fine-grained knowledge that can be encoded in a huge corpus of text. LLM-s are capable of taking context into account and encoding that too, not simply how often word A comes after word B.

    When I'm in a conversation I'm also selecting the optimal word from a predefined dictionary. That's precisely what speaking in a given language is like. Sure, I'm thinking a bit ahead and I can tap into my memory, feelings and experiences, which influence everything.

    But the optimal part is derived from context for me, it changes which word I use when I talk to a colleague, family or friend, but I might want to say the same thing. For stock LLM-s everything must be defined in the prompt if we are talking about zero-shot inference.

    These models are opening up good insights into how language works and I don't find that too dehumanising. There's still plenty of room for me to be human and do non-AI things.

    I get the notion that if we understand fully how something works the magic is gone, this always happens to AI. Are we afraid that this might happen to us too?

    • rhdunn 13 days ago

      I've experienced LLMs forgetting details (no memory). This is especially a problem when the information is out of the context window, but I've seen it in other cases as well.

      I've experienced LLMs lacking spatial awareness, such as switching locations in a description despite no indication of moving to the new location. The same applies to other concepts that have a visual/spatial component.

      I've also experienced LLMs struggling to get subtext, some metaphors, etc., especially when used in casual conversations instead of as a question/answer style prompt.

      LLMs are great, but need more work to fix these gaps.

    • mjburgess 13 days ago

      > Are we afraid that this might happen to us too?

      No, I'm more frustrated by the pseudoscience that models of frequency associations in text are explanations of people (or anything else). The choice isn't between a pseudoscientific behaviouralism where animals have no presence in the world, no mental faculties, and so on vs. "magic".

      > When I'm in a conversation I'm also selecting the optimal word from a predefined dictionary

      Consider it this way: the probability distribution over all possible words for you speaking is parameterized on space and time: Pyou(x, t; ...). And for the LLM generating text, Pllm(historical data).

      So imagine plotting the live probability distributions of Pyou and Pllm for any given situation. As you think, imagine, move, recall, prefer, desire... the Pyou goes "wild" with dramatic discontinuous shifts in distribution brought about by these causes.

      Whereas the Pllm remains the same. It never changes. It never reacts to anything at all.

      The whole distribution over all prior text tokens, Pllm, is a stationary model of frequency associations. Yours is not. This makes all the difference in the world when claiming that Pllm somehow models, or is even relevant to, Pyou.
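
      To make the stationarity point concrete, here is a toy sketch (the numbers and "mental states" are invented for illustration, not anyone's actual model): the frozen distribution is fixed once by the corpus, while the person's distribution is rebuilt from their current state at every moment.

        # Toy sketch, not anyone's actual model: a frozen corpus-derived
        # distribution vs. one recomputed from changing internal state.
        import random

        corpus_counts = {"wearing": 70, "doing": 20, "reading": 10}   # invented frequencies
        total = sum(corpus_counts.values())
        p_llm = {w: c / total for w, c in corpus_counts.items()}      # fixed once "trained"

        def p_you(mood, memory):
            # stand-in for mental states: the distribution changes as they do
            if memory == "saw_friend_yesterday" and mood == "warm":
                return {"wearing": 0.9, "doing": 0.1}
            return {"doing": 0.6, "reading": 0.4}

        def sample(dist):
            return random.choices(list(dist), weights=list(dist.values()))[0]

        print(sample(p_llm))                                   # same distribution, forever
        print(sample(p_you("warm", "saw_friend_yesterday")))   # shifts with recall, mood, ...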

      • tiborsaas 13 days ago

        I think the main tension here is that you think people confuse an LLM with a person having real-world-like experiences? I get that, and it's an interesting phenomenon; it's just a helpful way to talk to these models.

        When I type in a query to ChatGPT, I know there's no real "you" in it, but it's just helpful for me to narrow down that Pllm probability space to give me a good enough answer I'm looking for. It's a helpful shortcut if I treat it like a person, but I should know it's a really good prediction machine which has access to most of the combined human knowledge in a latent space.

        I rarely go "wild", people rarely go "wild" with their train of thoughts and actions. Everybody is moving on a more or less predefined set of rails. How long that rail is and where it goes depends on the person and situation.

        Pyou is also limited in the sense that most of my outputs are history based, but all the sensory inputs I have gained over the years are now compressed into various neural circuits. I can act in surprising ways, but I'm more than a language model, although language is a huge part of me.

        I'm more like a Pyou(xyz, t, Pllm, I) where "I" is the magic soup of human experience encoded in some gray matter.

        You are right that most LLM-s are static, but we shouldn't ignore that complex systems (agents) can be built with LLM-s in which the language model is only an I/O layer to static knowledge. There could be other moving, dynamic parts which can be continuously updated and these parts can modify the behaviour of the system.

        • Jensson 13 days ago

          > I think the main tension here is that you think people confuse an LLM with a person having real-world-like experiences?

          Many people do make this mistake, you see it all the time with people asking the LLM about itself or saying it is sentient based on text it outputs when asked about what it thinks etc.

      • TeMPOraL 13 days ago

        Zoom in a bit. Freeze Pyou to a single conversation. A single spoken sentence, then another. There isn't enough time for emotions or imagination to shift. It's you in a particular situation, with some thought to express, words flowing out of your mouth.

        I don't know about you, but to me this moment feels exactly like being an LLM.

        I keep arguing that LLMs aren't similar to humans in entirety, but rather just to the "inner voice" - the bit that feeds your consciousness strings of words, which you utter, or consider, or send back if they make no sense.

        • mjburgess 13 days ago

          The problem here is that the actual Pyou isn't a distribution over all possible words... that's just the observer's model of what's going on -- ie., your friend supposes that you could say anything.

          Actually: (1) there's a very very small number of possible words you are considering; (2) you aren't considering 'words' at all, but future cognitive and sensory-motor states/vocalisation actions; (3) your vocalisation is moderated by a vast array of other causes/reasons (eg., being kind); etc.

          The P(next word|previous, LLM-model, corpus...) for an LLM isn't an abstraction; it's actually implemented in its training.

          The only sense in which, even in an instance, we seem to compute P(next|previous) is purely a radical abstraction which has nothing to do with any property we possess, but is an epistemic artefact from the outside.

  • eggdaft 13 days ago

    I think what this argument is missing is the emergent properties of the LLM.

    In order to “predict the next word”, the LLM doesn’t just learn the most likely word from a corpus for the preceding string. If that were true, it would not generalise outside of its training set.

    The LLM learns about the structure of the language, the context, and in the process of doing so constructs a model of the world as represented by words.

    Admittedly the model is still limited, but it seems to me that there is something more insightful to be gleaned here: that given enough data, and sufficient pressure to learn, excelling at scale on a relatively simple task leads indirectly to a form of intelligence.

    For me the biggest takeaway of LLMs might be that “intelligence is pretty cheap, actually” and that the human brain is not so remarkable as we’d like to believe.

    • mjburgess 13 days ago

      Each word is taken to a distribution over words; this is where the illusion of "context" largely comes from. Eg., "cat" is replaced by a weighted (cat, kitten, pet, mammal, ...), which is obtained via frequencies in a historical dataset.

      So technically the LLM is not doing P(next word | previous word) -- but rather P(associated_words(next word) | associated_words(previous), associated_words(previous_-1), ...).

      This means its search space for each conditional step is still extremely large in the historical corpus, and there's more flexibility to reach "across and between contexts" -- but it isn't sensitive to context; we just arranged the data that way.
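
      As a toy illustration of that "weighted word set" idea (the vectors below are invented; real embeddings are learned and have hundreds or thousands of dimensions):

        # Toy sketch: treat a word as a weighted set of related words via
        # embedding similarity. Vectors are invented; real embeddings are learned.
        import math

        embeddings = {
            "cat":    [0.90, 0.80, 0.10],
            "kitten": [0.85, 0.82, 0.15],
            "pet":    [0.70, 0.60, 0.20],
            "mammal": [0.60, 0.50, 0.40],
            "car":    [0.10, 0.20, 0.90],
        }

        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

        def associated_words(word):
            sims = {w: cosine(embeddings[word], v) for w, v in embeddings.items()}
            total = sum(sims.values())
            return {w: round(s / total, 3) for w, s in sims.items()}

        print(associated_words("cat"))   # "cat" becomes a weighted (cat, kitten, pet, mammal, ...)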

      Soon enough people with enough money will build diagnostic (XAI) models of LLMs that are powerful enough to show this process at work over its training data.

      To visualize roughly, imagine you're in a library and you're asked a question. The first word selects a very large number of pages across many books (and whole books), the second word selects both other books and pages within the books you already have. Keep going: each additional word you're asked is converted to a set of words, which finds more pages and books and also gives narrower paragraph samples from the ones you have. Finally, with the total set of pages and paragraphs you have to hand at the end of the question, you find the most probable next word.

      This process will eventually be visualised properly, with a real-world LLM, but it'll take a significant investment to build this sort of explanatory model, since you need to reverse from weights to training data across the entire inference process.

      • eggdaft 13 days ago

        The context comes from the attention mechanism, not from word embeddings.

        • mjburgess 13 days ago

          Run attention on an ordinal word embedding and see what happens

          • eggdaft 13 days ago

            Well yes, necessary but not sufficient, obviously.

    • euroderf 13 days ago

      > and that the human brain is not so remarkable as we’d like to believe.

      Well, it IS pretty seamlessly integrated with a very impressive suite of sensors.

      • danielbln 13 days ago

        Yes, our human sensor fusion is remarkable. The input signal of say our eyes is warped, upside down and low resolution apart from a tiny patch that races across the field of vision to capture high resolution samples (saccades). Yet, to us, it feels seamless and encompassing.

    • ryandvm 13 days ago

      Bingo.

      When I write some 100% bespoke code that is rather hastily composed and then paste it all into ChatGPT4 asking it to "refactor this code with a focus on testability and maintainability" and not only does it do so, but it does a pretty damn good job about it, it feels rather reductive to say "it's just providing the next most likely word".

      I mean, maybe that's how it works, but that statistical output clearly involves modeling what my code does and what I want it to do. Rather than make me think LLMs are a cheap trick, it just has me thinking, "shit - maybe that's all I do too."

      • Jensson 13 days ago

        Averaged faces are beautiful, averaged code is clean. Not sure how that is hard to believe. Just don't extrapolate it too far or it will get strange.

  • logicallee 13 days ago

    >When the LLM generates each word it does so because it is, on average, the most common word in a corpus of text on which it was trained:

    But ChatGPT 4 follows instructions passably well. For example I just asked it: "Construct a sentence of at least 10 words each of which is extremely grammatically unlikely to follow the word before it. (For example "be are isn't had" as each of those words is impossible after the word before it.) Do not give any explanation of how you have arrived at your answer, reply only with your answer. However, as you construct it ensure that you cannot think of any context in which each next word would ever come after the word before it. Reply with your constructed nonsense sentence only."

    Indeed it replied with a good nonsense sentence: "Dogs swimming beautifully reads soft under Wednesday during sky oranges" ("sky oranges" is unlikely, "under Wednesday" is nonsensical and ungrammatical), and when I complained that "dogs swimming" could be sensible, as can "swimming beautifully", it came up with an even more nonsensical sentence: "Apples slowly would butter river quickly seven whenever blue music".

    Do you think "Wednesday" is really the most likely word to follow "under" and "river" is really the most likely to follow "butter", or isn't it obvious that it was, for lack of a better word, "trying to" follow my prompt?

    https://chat.openai.com/share/50037af6-0f3e-4de3-aff7-53a7b9...

    • mjburgess 13 days ago

      Well ChatGPT is a mixture of LLMs, and no doubt a great deal more to make this work out (I'd suppose, eg., that they augment their datasets to have conversational framing, with actions/verbs etc. augmented -- or they achieve similar with models-on-top).

      Nevertheless, roughly consider a dataset D for which we have an approximate stochastic model of its conditional frequency associations: P(next|previous..., D) etc.

      Then if your prompt really got that reply, from this model, it would do so like this:

      "Construct" is first projected to an encoding which replaces it, effectively, with a set of related words (Construct, Make, Create, Write...) all weighted by how they co-occur with construct.

      Then we sample from D based on this word set, obtaining roughly, all conversations where these related words were used, call this Dc.

      Next take "a sentence" and replace it with its word-set, say, (Sentence, Phrase, Words, ...) and sample conversations from Dc in which these occur, Dcs..

      And so on. Since each token in your prompt actually corresponds to basically all possible words but weighted by association, each "filtering operation" actually selects vast amounts of the training data (space).

      Finally, consider the reverse problem: what words could this system possibly produce from this process that weren't relevant to your prompt? Given enough data (PBs of text from all possible digitized conversations, books, etc.) then a sensible-seeming answer becomes the only plausible one to generate.
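
      A deliberately crude sketch of that filtering picture, with an invented mini-corpus and invented association sets (real transformers don't literally search their training text at inference time; this only illustrates the narrowing-by-association idea):

        # Crude sketch of the "progressive filtering" picture. The corpus and
        # association sets are made up; inference does not search training data.
        corpus = [
            "construct a sentence about dogs",
            "write a list of unlikely items",
            "here is a list of fruit",
            "construct a nonsense phrase of ten words",
        ]

        related = {
            "construct": {"construct", "make", "write", "create"},
            "sentence":  {"sentence", "phrase", "words"},
        }

        candidates = corpus
        for token in ["construct", "sentence"]:
            word_set = related[token]
            candidates = [doc for doc in candidates if word_set & set(doc.split())]
            print(token, "->", candidates)
        # each token keeps only the "conversations" touching its association set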

      Now, I do think here PBs wouldn't be enough to generate a single statistical model that behaved this way -- so you need a mixture of them (ie., ChatGPT) and I suspect you also need a system for regulating discrete constraints such as quantities. I suspect many deployed LLMs have improved in this area due to models trained to be specifically sensitive to quantities.

      • danielbln 13 days ago

        There are plenty of LLMs that aren't MoE/ensemble, and there are also plenty of LLMs that are pure completion models, that haven't been fine-tuned/RLHF'd to be conversational. I would recommend you read up a bit more on how modern LLMs work, I get the feeling your intuition on that could improve.

        edit: I can't reply to the child comment as we've reached the thread limit, but I can say that LLMs are not trained on a tiny subset of data, they are trained on as much data as possible. An LLM becomes conversational/instruct due to fine tuning it with reinforcement learning data. GPT-3.5 is by all accounts not an ensemble model, Llama2/3 is NOT an ensemble model/MoE, yet will allow you to do in-context learning/few shot prompting effortlessly. As said, I think your intuition on how these LLMs work (as far as we know) needs readjustment.

        • mjburgess 13 days ago

          I don't see what I'm missing. I'm addressing why ChatGPT generated a response given a prompt. If another LLM had been used, something far simpler, the explanation would be different.

          If a highly simplified LLM will generate text against discrete quantitative constraints, under a variety of scenarios, then I've underestimated how highly structured the relevant training data must be.

          An LLM trained on a physics textbook isn't going to be conversational; one trained on Shakespeare will generate text in Elizabethan English.

          ie., in every case, the explanation of why any given response was generated is given by explaining the distribution of its dataset. So if a Shakespeare LLM generates "to be or otherwise to be not is alike everything ere annon", we will mostly be explaining how/why those words were used by Shakespeare.

          and if an LLM is small, and is actually discretely sensitive to quantities across a large variety of domains, my guess is that its training data has been specially prepared. This is just a guess about the nature of human communication though; it has nothing to do with LLMs. I just guess that we don't distribute "quantity tokens" in such a highly patterned way that a simple LLM model would work to find it

    • danielbln 13 days ago

      Yeah, I feel OP is ignoring the magic of in-context learning in their slightly reductive view on how LLMs work.

      • mjburgess 13 days ago

        The purpose of my original comment wasn't to accurately depict LLMs, but to introduce the properties of people that cause us to write/speak etc. which LLMs aren't sensitive to. The point was to answer the question, "in what ways arent we just doing the same?"

        The point of the LLM bit is that the property of the world that LLMs are sensitive to is the distribution of text tokens in their training data. Regardless of which features of this distribution any given AI model captures, it is necessarily a model of the dataset's actual P(token|tokens..).

        In the case of LLMs it's a very high dimensional model, so that P(word|previous words) is actually modelled by something like: P(word|P(prompt embedding space|answer embedding space) | ...) -- but this makes no difference to the "aren't we doing the same?" question. We don't use frequency associations between parts of a historical corpus when we speak.

        • barfbagginus 13 days ago

          I don't think the question of are we doing the same is meaningful except on the surface, where we focus on the function that is performed, and ignore what we know about the mechanism performing the function.

          On the surface, in the presence of in context learning, novel out of distribution contexts, and reactive coupling with a world context like a python repl, simulation, robot, or other source of empirical feedback, then yes, there is a sense in which LLMs do the same kinds of things, and can perform the same kinds of functions.

          Given an experimental and out of distribution context that no human has seen, an LLM can generate novel hypotheses, experiment to test these hypotheses, and converge on the truth. It doesn't matter if this functionality attains from a corpus conditioned token generator, or a biological network of spiking neurons. It's important to point out that both systems support that function, without appealing to reductions, which in both cases would trivialize and obscure the higher order functions. If we're physically reductive with LLMs that throws away the functionalist view, and reduces our ability to actually expect, predict, and elicit higher order functionalities.

        • TeMPOraL 13 days ago

          Everything you say here makes sense, except the last bit:

          > We don't use frequency associations between parts of a historical corpus when we speak.

          But that's the thing, it seems we do. Arguably, the very meaning of concepts is determined solely by associations with other concepts, in a way remarkably similar if not identical to frequency associations.

          • mjburgess 13 days ago

            No, no.. the semantics of words is not other words.

            Cavemen wander around, they fall over a pig, they point to the pig and say "pig". Other cavemen observe. Later, when they want a pig, they say "pig". No one here knows anything about pigs other than that there is something in the world which causes people to say "pig", and each caveman is able to locate that thing after a while.

            The vast majority of language is nothing more than this: words point outside themselves to the world, this pointing is grown in us through acquaintance with the world.

            Now, in general, the cause of my saying "pig" is not me falling over one. Suppose I say, to a friend, "I've always thought pigs were cute, until I saw a big one!"

            So here, "I" points at both me as a body, but also plausibly at my model of myself (etc.), "always" modifies "thought" ... so "I've always thought" ends up being a statement about how my own models of my self over time have changed.. and so on for "pigs" and the like.

            We do not know that this is what our words mean. We have no idea what we're referring to when we say "I've always thought" -- the nature of the world that our words refer to requires, in general, science to explain. Words are, at first, just a familiar way of throwing darts at a target which we can see, but not describe nor explain.

            It is this process which is entirely absent in an LLM. An LLM isn't throwing a dart at anything; it isn't even speaking. It's replaying historical darts matches between people.

            And this is just to consider reference. There are other causes of our using words much more complex than our trying to refer to things, likewise, these are absent from the LLM.

      • adammarples 13 days ago

        They're basically describing a Markov chain with word tokenisation. Which is so remarkably out of date compared to how a modern GPT works.

  • gizmo 13 days ago

    Humans are capable of intricate complex thought that involves "mental states of recollection, preference, taste". However, LLMs have demonstrated pretty conclusively that most of the thoughts we have do not require any of that. Language turns out to be much simpler than previously assumed.

    Thinking hard makes your brain hurt. It's exhausting. Most of the work we do, including programming, is not like that. Some of the work we do is fiendishly difficult, but much of it is more like word-completion based on prior experience.

    Evolutionary processes optimize for energy efficiency. We don't think at 100% brain power all the time because we can't afford to. It makes a ton of sense, in retrospect, that our brains have optimized for language to the point that very little compute is required. And even so, the brain still consumes 20% of our daily calories.

    Hard thinking is the exception and casual thinking is the norm. Why is it so hard to persuade people of anything on the internet? Because we mostly engage in LLM-like auto-completion. Very little actual thinking is involved and very few calories are spent.

  • ThinkingAgain 13 days ago

    Can it also explain the following: "An alien named abcdpqrs landed on earth yesterday. What is the name of the alien who landed yesterday on earth?"

    ChatGPT answers it correctly. abcdpqrs is perhaps not in the training set. If it is, we can pick some other name.

  • jameshart 13 days ago

    LLMs don’t ’replay variations on prior conversations’ though.

    Predicting the probable token in a conversation requires predicting the probable subject of the conversation, predicting the interlocutors’ relationship and manner of speaking to one another, predicting the state of recollection, preference and taste of the speaker, predicting the speaker’s mental model…

    If the LLM isn’t predicting all of those things then it will produce poor predictions of the next word; doing it well - and humans tend to agree that in a vast array of cases state of the art LLMs do predict tokens very well - requires that prediction model to predict all that context as well.

    • mjburgess 13 days ago

      Alas, no it doesn't. Language induces this sort of anthropomorphism in people, I guess, so consider images.

      Suppose I take a billion images of all the coffee cups in the world, at a set of angles on the cup, and then build an associative (ie., frequency) statistical model of their pixels (ie., statistical AI). Consider generating one pixel at a time, in sequence, through the image. My associative model tells me P(colour of next pixel | all previous).

      Now, I can generate coffee cup images similar to any variation or combination of the images in the dataset. Now, you might say, "well you can only do that if you have a model of a coffee cup" (rather than of pixels) -- if so, just generate a coffee cup at one of the angles not in the dataset. This will not happen, because the model has not been provided with enough information to do so.

      Namely, the model does not know the distance from the camera, the camera lens parameters, the angle to the coffee cup, etc. So there's literally a very very large infinity of possible objects at unseen angles. Consider that underneath a coffee cup, the bottom might be missing entirely, etc.

      Now it will appear to know all of these things, because it's just generating images with these same parameters (camera, angle, distance, etc.). But as soon as you want "a coffee further away than has been seen before", or "a coffee using a macro lens", etc., the whole thing will fall over.

      It is you, the viewer, who attributes 3D knowledge to the model, because under ordinary circumstances the cause of a photo is features of a 3D environment.
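
      As a deliberately crude caricature of that point, with 1-D "images" and raw counts standing in for the statistical model (real systems are vastly richer; the only point illustrated is the absence of information about unseen configurations):

        # Caricature: an associative model of P(next pixel | previous pixel),
        # built only from "front view" images, has literally no information
        # about pixel values it never saw (e.g. the unseen bottom of the cup).
        from collections import defaultdict, Counter

        front_views = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 2, 2, 0]]   # invented 1-D "images"

        model = defaultdict(Counter)
        for img in front_views:
            for prev, nxt in zip(img, img[1:]):
                model[prev][nxt] += 1

        print(model[1])   # Counter({1: 2, 0: 2}): can continue a seen configuration
        print(model[7])   # Counter(): an unseen configuration carries zero information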

      • jameshart 13 days ago

        You’re saying this with confidence as if there isn’t a large body of working image and video generation algorithms out there that can produce physically plausible images of objects transposed into circumstances that don’t exist in their training set. A coffee using a macro lens for example.

        Is it so hard to believe that such models have developed a sense for how light propagates through a scene, a sense for how physical objects change when viewed from different angles, a sense for how lens distortion interacts with light? For goodness’ sake, these same models have a sense of what Greg Rutkowski’s art style is - we are well beyond ‘they’re just remembering pixels from past coffee cups’

        • mjburgess 13 days ago

          > it so hard to believe that such models have developed a sense for how light propagates

          Well, it's not a matter of belief or otherwise. I'm a trained practitioner in statistics, AI, physics, and other areas, and you can show trivially that you cannot learn light physics from pixel distributions.

          Pixel distributions aren't stationary, and are caused by a very very large number of factors; likewise the physics of light for any given situation is subject to a large number of causes, all of them entirely absent from the pixel distributions. This is a pretty trivial thing to show.

          > have a sense of what Greg Rutkowski’s art style is

          Well what these models show is that when you have PBs of image data and TBs of associated text data, you can relate words and images together usefully. In particular, you can use patterns of text tokens to sample from image distributions, and combine and vary these samples to produce novel images.

          The patterns in text and images are caused by people speaking, taking photos, etc. Those patterns necessarily obtain in any generated output. As in, if you train an LLM/etc. on how to speak, using vast amounts of conversational data, it cannot do anything other than appear to speak: that is the only thing the data distribution makes possible.

          Likewise here, the image generator has a compressed representation of PBs of pixel data which can be sampled from using text. So when you say, "Greg Rutkowski" you select for a highly structured image space, whose structure the original artists placed there.

          The generative model itself is not imparting structure to the data; it isn't aware of style. It's sampling from structure that we placed there. When we did so it was because we were, eg., in the room and taking a photo; or imagining what it would be like to apply Pre-Raphaelite painting styles to 60s psychedelic colour palettes because we sensed that fashions of a century ago would now be regarded as cool.

          • TeMPOraL 13 days ago

            The point of shoving so much data at those models is to help them pick up on the "very very large number of factors".

            There was a story I saw on HN a few times in the past, but which I can't find anymore, of someone training a simple, dumb neural net to predict a product (or a sum?) of two numbers, and discovering to their surprise that, under optimization pressure, the network eventually picked up Fourier Transform.

            It doesn't seem out of realm of possibility for a large model to pick up on light propagation physics and basic 3D structure of our reality just from watching enough images. After all, the information is implicitly encoded there, and you can handwave a Bayesian argument that it should be extractable.

          • infecto 13 days ago

            Genuine question, what does it mean to be a trained practitioner in statistics, AI, physics and other areas?

            • mjburgess 13 days ago

              My undergrad/grad work is in Physics; I presently consult on statistics and AI (and other areas); I may soon start a part-time PhD on how to explain AI models. I am presently, as I type, avoiding rewriting a system to explain AI models because I dislike doing things I've done.

              It's quite hard to see the full picture of how these statistical models work without experience across a hard science, stats and AI itself. However, people with backgrounds in mathematical finance would also have enough context. But it's seemingly rare in physics, csci, stats, ai, etc. fields alone.

              I'd hope that most practitioners in applied statistics could separate properties of the data generating process from properties of its measures; but that hope is fading the more direct experience I have of the field of statistics. I had thought that, at least within the field, you wouldn't have the sort of pseudoscientific thinking that goes along with associative modelling. I think mathematical finance is probably the only area where you can reliably get an end-to-end picture on reality-to-stats models.

          • gizmo 13 days ago

            Humans have painted with wonky perspective and impossible shadows because they didn't know better for literally 50,000 years. And those humans were just as smart as we are. Just look at 13th century paintings. Does this prove that humans back then didn't understand what a coffee cup looks like when rotated? No. So what does this prove about midjourney? Nothing.

            • mjburgess 13 days ago

              I appreciate that when you're not an expert in physics, statistics and so on, all you have to go on are these circumstantial arguments: "two things that seem similar to me are alike, therefore they are alike in the same way".

              However, I am making no such argument. I am explaining that statistical models of pixel frequencies cannot model the causes of those frequencies. I am illustrating this point with an example, not proving it.

              If you want more detail about the reason it cannot: when the back of a coffee cup looks like the front, you can generate the back. But you cannot generate the bottom (assuming the bottom doesn't occur in the dataset) -- why? Because the pixel distributions for the bottom of a cup have zero information about the rest of it, and the model has no information about the bottom.

              If you want a "proof" you'd need at least to be familiar with applied mathematics and the like:

              Say the RGB value of each pixel, X, of photos of coffee cups obtains from a data generating process parameterized on: distance from camera, lens focal length, angle to cup, lighting conditions, etc. Now produce a model of such causes, call it Environment(distance, angle, cup albedo, ...).

              Then show that X ~ E | fixed-parameters induces a frequency distribution of pixels, f1(next|previous) = P(Xi...n | Xj...n); then show that any variation in a fixed parameter induces a completely different distribution, say f2, f3, f4, ... Now check the covariance distribution for most pairs of fs, and show that any given f is almost zero-informative about any other f.

              Having done this, compare with a non-statistical (eg., video game) model of Environment where parameters are varied, and show that all frames, say v, that the video game generates do have high covariance over the time of their sampling. The video game model covaries with most of f1..fn; the associative statistical model covaries only with f1, or a very small number of others.

              There's something very obvious about this if you understand how these statistical AI systems work: in cases where variations in the environment induce radically different distributions, the AI will fail; in cases where they are close enough, it will (appear to) succeed.

              The marketability of generative AI comes from rigging the use cases to situations where we don't need to change the environment. ie., you aren't exposed to the fact that when you generated a photo you could not have got the same one "at a different distance".

              If a video game was built this way it would be unplayable: consider every time you move the camera all the objects randomly change their apparent orientation, distance, style, etc.

              • gizmo 13 days ago

                Humans have those exact same constraints. For the longest time we could only speculate what the dark side of the moon looked like, for instance.

                Yes, LLMs are constrained in what output they can generate based on their training data. Just as we humans are constrained in the output we can generate. When we talk about things we don't understand we speak gibberish, just like LLMs.

                • krapp 12 days ago

                  >Humans have those exact same constraints. For the longest time we could only speculate what the dark side of the moon looked like, for instance.

                  That isn't the exact same constraint. We could speculate that the moon had a "dark side," because we understood what a moon was, and what a sphere was. LLMs cannot speculate about things outside of their existing data model, at all.

                  >When we talk about things we don't understand we speak gibberish, just like LLMs.

                  No we don't, wtf? We may create inaccurate models or theories, but we don't just chain together random strings of words the way LLMs do.

        • croon 13 days ago

          > Is it so hard to believe that such models have developed a sense for how light propagates through a scene...

          This specifically is the thing I usually notice in AI images (outside of the hand trope).

          I'm not GP, and at best a layman in the field, but it's not hard to believe it's possible to generate believable lighting, given enough training data, though if I'm not mistaken it would be through sheer volume of paired properties like "lighting/shadow here usually follows item here".

          But it's extremely inefficient, and not like we reason. It's like learning the multiplication table without understanding math: just pairing an infinite number of properties with each other.

          We on the other hand develop a grasp of where lighting exists (sun/lamp) and surmise where shadows fall and can muster any image in our mind using that model instead.

      • jddj 13 days ago

        Is that really true?

        I can go to a huggingface space right now and type in koala wearing a suit serving coffee at a republican rally and there's a reasonable chance I get a result that's something along those lines. Is that meaningfully different to "coffee using a macro lens"?

        • mjburgess 13 days ago

          Those models were not trained on the restricted dataset I'm talking about.

          I'm saying you deliberately construct a dataset which, say, does not include cups at various distances, angles, etc. but has as many as you like at a fixed range of these parameters (lens, distance, lighting, angle...).

          Now, you will get, from this model, just coffee cup images with these same parameters (eg., distance from the camera).

          Real-world generative systems are deliberately not constrained this way, and require many many PBs of images under various conditions to overcome this problem.

          Nevertheless you can actually still see this limitation: most generated photos etc. show subjects in "photographic distance/focus/etc. conditions", ie., it's hard to get a photo of a person who isn't framed as if they were the subject of a photo.

          Whereas, if you were in a room with a friend, you can take a photo at any angle/distance, even, say, from the top of their ears down. You will not get this freedom with a statistical model of pixel patterns.

          • jddj 13 days ago

            I can't argue with that, so I think unfortunately I may have missed the original point.

            The sun revolved around the earth for a long time until our own model was updated to include more data.

    • civilized 13 days ago

      > humans tend to agree that in a vast array of cases state of the art LLMs do predict tokens very well

      This argument is backwards. Humans don't measure the next token prediction ability of the agents they speak to, human or AI. We rate speakers on whether they seem to understand what we say in context and respond by contributing useful information and analysis.

      The attributes you're saying can be inferred from known superior next token prediction ability are the things we can actually detect and measure, at least qualitatively. Next token prediction quality is not measurable by humans in any human-meaningful way. Improving test cross entropy by 50% doesn't mean anything to us. It is irrelevant except as a mechanism to train LLMs.

      • TeMPOraL 13 days ago

        Point is that the simplest way to excel in next token prediction in the way humans consider correct - which is rated by how people feel the predictor mimics a human understanding - is to actually have a world model and other components of human understanding.

        Understanding and compression are the same thing. LLMs are fed a huge chunk of totality of human knowledge, and optimized to compress it well. They for sure aren't doing it by Huffman-encoding a multidimensional lookup table.

        • civilized 13 days ago

          > Point is that the simplest way to excel in next token prediction in the way humans consider correct - which is rated by how people feel the predictor mimics a human understanding - is to actually have a world model and other components of human understanding.

          This is a speculative theory for why a next token predictor might sound like it knows what it's talking about. Not something we actually know.

      • jameshart 13 days ago

        I mean, I think it was implied that humans judge the ‘next token prediction’ ability of LLMs as being good based on the quality of the overall output.

        • civilized 13 days ago

          In which case you have a trivial point rather than a backwards argument: "the output seems like it knows what's it talking about, and the easiest way to explain that is if it really knows what it's talking about."

  • chx 13 days ago

    https://hachyderm.io/@inthehands/112006855076082650

    > You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.

    > Alas, that does not remotely resemble how people are pitching this technology.

  • throw310822 13 days ago

    > When the LLM generates each word it does so because it is, on average, the most common word in a corpus of text on which it was trained: "wearing" follows "I like what you were" because most people who were having these conversations, captured in the training data, were talking about clothes.

    Yes. Now extend this to the context length of GPT-4 turbo- about 240 pages of text. So from your description, "wearing" is just the "most common word" to follow those 240 pages of (previously unseen, unique) text according to its training data. Quite simple, nothing to see here, I suppose.

astrange 13 days ago

> The assumption that most people make is that these models can answer questions or chat with you, but in reality all they can do is take some text you provide as input and guess what the next word (or more accurately, the next token) is going to be.

These two things cannot be compared or contrasted. It's very common to see people write something like "LLMs don't actually do <thing they obviously actually do>, they just do <dismissive description of the same thing>."

Typically, like here, the dismissive description just ignores the problem of why it manages to write complete novel sentences when it's only "guessing" subword tokens, why those sentences appear to be related to the question you asked, and why they are in the form of an answer to your question instead of another question (which is what base models would do).

  • vasco 13 days ago

    If someone asks me "What is your name?" my reply is also simply just guessing what token would go well next in the full text of the conversation.

    • miningape 13 days ago

      No you "know" what your name is and you retrieve that information. It's infinitely different to rolling a dice and picking a name based on what number comes up

      • astrange 13 days ago

        This is two different parts of the system.

        > No you "know" what your name is and you retrieve that information.

        The LLM has this (literally - most of it is a key value store). It outputs a probability for all possible next tokens.

        > It's infinitely different to rolling a dice and picking a name based on what number comes up

        The sampling algorithm running the LLM then does this part, and adds randomness to make it more "creative".

        So if you want factual information, don't make the sampling so random that it skips over the right answer.
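
        A rough sketch of that split, with made-up scores standing in for the model's output (real vocabularies have tens of thousands of tokens):

          # Rough sketch: the model scores every candidate token, then a separate
          # sampler decides how much randomness to apply. Scores are made up.
          import math, random

          logits = {"John": 6.0, "Jon": 2.5, "Steve": 0.5}   # hypothetical model output

          def sample(logits, temperature=1.0):
              if temperature == 0:                      # greedy: always the top token
                  return max(logits, key=logits.get)
              scaled = {t: v / temperature for t, v in logits.items()}
              z = sum(math.exp(v) for v in scaled.values())
              probs = {t: math.exp(v) / z for t, v in scaled.items()}
              return random.choices(list(probs), weights=list(probs.values()))[0]

          print(sample(logits, temperature=0))     # deterministic: "John"
          print(sample(logits, temperature=1.5))   # more "creative": occasionally not "John"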

      • realusername 13 days ago

        I feel like I work in similar ways: if you ask my name, there's a probability that I'll answer with the shortened version or the longer one, pretty much randomly, and there's no conscious effort about it.

        While the LLMs do have hallucinations, basic stuff like this will never trigger any.

        Where I feel the differences most isn't in the token concept but rather in the deep reasoning (which we don't use as much, in my opinion).

      • agumonkey 13 days ago

        Sometimes I'm tempted to think that my brain is as fuzzy if not more than an LLM

    • input_sh 13 days ago

      If asked repeatedly, are you gonna answer that question with a different name, depending on which "memory" you randomly pick from?

      • kevindamm 13 days ago

        Somebody with MPD might.

    • edflsafoiewq 13 days ago

      That's an interesting example, since an LLM has no concept of a "self" and literally does not know who it is. It can only answer it "correctly" if you prefix it with a prompt telling it who and what it is.

      • throaway893 13 days ago

        You also have been told your name, most likely by your parents. It also had to be explained to you what you are, at some point in your life.

    • poniko 13 days ago

      It's really not, though. Generally maybe, but you also give yourself a split second to think about whether it's a good time to lie, make a joke, or maybe just not reveal your name.

      • vasco 12 days ago

        And any of those choices will be your best guess at the next tokens in the text of the conversation, respecting not only the conversation but also your self-image and surroundings.

    • vrighter 13 days ago

      what if you're the only john in a world of steves? What would happen then?

      • vasco 12 days ago

        I'd say, hi I'm john

  • pottspotts 13 days ago

    I came here to make this comment as well.

    This line of reasoning that LLMs "only predict" the next token is akin to saying humans can only think or speak one word at a time. Yes, we use one token/word at a time, but it is the aggregate thought that matters, regardless of what underlies it.

    • winternewt 13 days ago

      I think the mistake people make is assuming that "probability" is a simple concept.

      If there are 50K possible tokens and I don't have any other information, I could make a naive estimate that every token has equal probability and start generating text that is just gibberish. With the simple single-token Markov-chain example I would estimate probabilities based on the previous token, and that probability estimate would be much better. If you use it for generating text it will look like something that is almost, but not quite, entirely unlike human speech. [1]

      The difference lies entirely in how accurately you model the world and what information you have available when estimating probabilities. Models like GPT4 happen to be very good at it because they encode a huge amount of knowledge about the world and take a lot of context into account when estimating the probability. That's not something to be taken lightly.

      [1] https://projects.haykranen.nl/markov/demo/
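
      For comparison, a minimal version of that single-token Markov chain might look like this (toy corpus; the linked demo does the same thing with more text):

        # Minimal bigram Markov chain: estimate P(next | previous) by counting,
        # then generate. Toy corpus; real LLMs condition on far more context.
        import random
        from collections import defaultdict, Counter

        corpus = "i like what you were wearing i like what you said".split()

        counts = defaultdict(Counter)
        for prev, nxt in zip(corpus, corpus[1:]):
            counts[prev][nxt] += 1

        word, out = "i", ["i"]
        for _ in range(8):
            if not counts[word]:          # dead end: no observed successor
                break
            word = random.choices(list(counts[word]), weights=list(counts[word].values()))[0]
            out.append(word)
        print(" ".join(out))   # almost, but not quite, entirely unlike human speech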

      • XorNot 13 days ago

        I am skeptical anyone saying this is making a mistake: it only ever really comes up when someone has specific priors they're wanting to litigate - best summarized by the timeless: you cannot make a man understand something when his paycheque depends on his not understanding it.

    • lm28469 13 days ago

      When the other camp is treating it like an oracle of truth and a sentient being it's hard to pick a side tbh.

  • theshrike79 13 days ago

    Yeah, an LLM is not a Markov Chain. The only similarity is that they string words together with weighted possibilities. That's about it.

    • astrange 13 days ago

      Well, it is a Markov chain if you do greedy sampling, which 99% of the time you do. So the weird part is why it still works so well.

      If you do beam search, RAG, tool usage, etc., then the whole system is no longer one.

  • photon_lines 13 days ago

    Yeah - most of the online descriptions aren't even remotely accurate nor close to explaining how LLMs like ChatGPT actually work. They are not simple 'next-word predictors' and most of the online tutorials / info don't go into fine-tuning nor the intricate details of chain of thought reasoning (which I personally believe plays a huge role in ChatGPT's amazing performance). If you want my own detailed description you can find it here: https://photonlines.substack.com/p/intuitive-and-visual-guid...

    • sk11001 13 days ago

      I think too many people confuse the base model (which can be called a next token predictor) with the fine-tuned chat model which is specifically modified to carry a conversation, be helpful and be as factually correct as possible.

mft_ 13 days ago

How does this concept explain (for example) an LLM’s ability to provide a precis of an article? Or to compare two blocks of text and highlight differences? Or to take an existing block of code and find and correct an error?

thefz 13 days ago

> On the other side, given the propensity of LLMs to hallucinate, I wouldn't trust any workflow in which the LLM produces output that goes straight to end users without verification by a human.

Yep. Nice article, though!

pietmichal 13 days ago

This was such a nice primer that inspired me to give Karpathy's series another try. Loved the explanation!

l5870uoo9y 13 days ago

Are there any open source implementations of neural network "functions"? And of the layering, transformers and attention mechanisms?

  • barfbagginus 13 days ago

    Here's a walkthrough nlp.seas.harvard.edu/annotated

    Here's a cool animated transformer, also open source prvnsmpth.github.io/animated

    Here's the "attention is all you need paper" with links to open source implementations paperswithcode.com/paper/attention-is-all

z7 13 days ago

"I'll begin by clearing a big misunderstanding people have regarding how Large Language Models work. The assumption that most people make is that these models can answer questions or chat with you, but in reality all they can do is take some text you provide as input and guess what the next word (or more accurately, the next token) is going to be."

What separates this from the following:

"I'll begin by clearing a big misunderstanding people have regarding how the human brain works. The assumption that most people make is that the brain can think, reason, and understand language, but in reality all it can do is process electrical and chemical signals."

  • seydor 13 days ago

    yeah, I think we are at a point of reflection in philosophy of mind, where we ponder whether our "thinking", our communication and dialogues, are mechanistic continuations taking place serially in one or more brains

    • z7 13 days ago

      Whether brains or LLMs, claiming that "in reality all they can do" is X seems dubious, given that a system can have properties that its constituents do not have on their own. As an explanation of the emerging complexity of what "they can do" it is fundamentally unsatisfying.

Cyphase 13 days ago

[flagged]

  • fredoliveira 13 days ago

    This type of comment blows my mind and frankly, I expect more from people who browse HN. Is this supposed to be a dig at the author? Since when is writing a book about another technology somehow indicative of a lack of expertise in anything else? If anything, the generality just means the guy is smart and able to apply his knowledge across multiple subject matters.

    • Cyphase 13 days ago

      I guess I knew I was taking a risk with that comment, but I was thinking more of it being low value; I didn't even consider that it might be taken this way. It wasn't a dig of any sort. It was just a late-night quote-unquote "interesting" observation from a Python fan, with simonw being known for Django and LLM research. I probably should have added something to make that clearer.

      • fredoliveira 13 days ago

        Fair, and I appreciate the explanation.

        I do think this provides value. Perhaps not to everyone here, or people who have been working on LLMs for a while. But to people who think they have no ability to grok (pun not intended) LLMs because they don't have the mathematical chops? I think it'll be valuable for them.

        • Cyphase 13 days ago

          Wow, there's really a kink in the wire somewhere between you and I. :)

          > I guess I knew I was taking a risk with that comment, but I was thinking more of it being low value; I didn't even consider that it might be taken this way.

          > ... I was thinking more of it being low value ...

          "It" being my comment; I was taking a risk because my intention with the comment was just a relatively bland observation. A low value _comment_.

          Again, no digs about the article whatsoever, at all, 0.000%. You are completely misunderstanding my intention. No hard feelings at all, just want to clarify that for the record. :)

          My comment was just a low value, passing observation, basically equivalent to if this article had been written by Sarah Wexler (made up name) and I had said, "Hmm, another person with the initials S.W. explaining LLMs..".