Xcelerate 15 days ago

As someone with a background in quantum chemistry and some types of machine learning (but not neural networks so much) it was a bit striking while watching this video to see the parallels between the transformer model and quantum mechanics.

In quantum mechanics, the state of your entire physical system is encoded as a very high dimensional normalized vector (i.e., a ray in a Hilbert space). The evolution of this vector through time is given by the time-translation operator for the system, which can loosely be thought of as a unitary matrix U (i.e., a probability preserving linear transformation) equal to exp(-iHt), where H is the Hamiltonian matrix of the system that captures its “energy dynamics”.
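
To make that concrete, here is a minimal numerical sketch (purely a toy with a random 4-dimensional Hamiltonian, nothing tied to any real system): build a Hermitian H, form U = exp(-iHt), and check that the evolved state keeps its norm.

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)

    # Random Hermitian "Hamiltonian" for a toy 4-dimensional system
    A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    H = (A + A.conj().T) / 2

    # Time-translation operator U = exp(-iHt) is unitary
    t = 1.3
    U = expm(-1j * H * t)
    print(np.allclose(U.conj().T @ U, np.eye(4)))    # True: U^dagger U = I

    # Evolving a normalized state (a "ray") preserves its norm
    psi = rng.normal(size=4) + 1j * rng.normal(size=4)
    psi /= np.linalg.norm(psi)
    print(np.isclose(np.linalg.norm(U @ psi), 1.0))  # True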

From the video, the author states that the prediction of the next token in the sequence is determined by computing the next context-aware embedding vector from the last context-aware embedding vector alone. Our prediction is therefore the result of a linear state function applied to a high dimensional vector. This seems a lot to me like we have produced a Hamiltonian of our overall system (generated offline via the training data), then we reparameterize our particular subsystem (the context window) to put it into an appropriate basis congruent with the Hamiltonian of the system, then we apply a one step time translation, and finally transform the resulting vector back into its original basis.
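
As a rough sketch of what I mean by a linear map applied to a high-dimensional vector (toy sizes and made-up names here, not the video's notation): the final context-aware embedding gets multiplied by an "unembedding" matrix and softmaxed into next-token probabilities.

    import numpy as np

    d_model, vocab_size = 8, 50                         # toy dimensions
    rng = np.random.default_rng(1)

    h_last = rng.normal(size=d_model)                   # last context-aware embedding
    W_unembed = rng.normal(size=(vocab_size, d_model))  # learned "unembedding" matrix

    logits = W_unembed @ h_last                         # linear map into vocabulary space
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                # softmax -> next-token distribution
    print(probs.argmax(), round(probs.sum(), 6))        # most likely token id, total prob 1.0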

IDK, when your background involves research in a certain field, every problem looks like a nail for that particular hammer. Does anyone else see parallels here or is this a bit of a stretch?

  • francasso 15 days ago

    I don't think the analogy holds: even if you forget all the preceding nonlinear steps, you are still left with just a linear dynamical system. It's neither complex nor unitary, which are two fundamental characteristics of quantum mechanics.

  • bdjsiqoocwk 15 days ago

    I think you're just describing a state machine, no? The fact that you encode the state in a vector and the steps as matrices is an implementation detail...?

    • Xcelerate 15 days ago

      Perhaps a probabilistic FSM describes the actual computational process better, since we don’t have a concept equivalent to superposition with transformers (I think?), but the framework of an FSM alone doesn’t seem to capture the specifics of where the model/machine comes from (what I’m calling the Hamiltonian), nor how a given context window (the subsystem) relates to it. The change of basis performed by the attention mechanism (to achieve context-awareness) seems to align better with existing concepts in QM.

      One might model the human brain as an FSM as well, but I’m not sure I’d call the predictive ability of the brain an implementation detail.

      • BoGoToTo 15 days ago

        | context window

        I actually just asked a question on the Physics Stack Exchange that is semi-relevant to this. https://physics.stackexchange.com/questions/810429/functiona...

        In my question I was asking about a hypothetical time-evolution operator that includes an analog of a light cone, which you could think of as a context window. If you had a quantum state that was evolved through time by this operator, then I think you could think of the speed of light as a byproduct of the width of the context window of some operator that progresses the quantum state forward by some time interval.

        Note I am very much hobbyist-tier with physics so I could also be way off base and this could all be nonsense.

        • ricardobeat 15 days ago

          I’m way out of my depth here, but wouldn’t such a function have to encode an amount of information/state orders of magnitude larger than the definition of the function itself?

          If this turns out to be possible, we will have found the solution to the Sloot mystery :D

          https://en.m.wikipedia.org/wiki/Sloot_Digital_Coding_System

          • DaiPlusPlus 14 days ago

            The article references patent “1009908C2” but I can’t find it on the Dutch patent site, nor in Google Patents search.

            The rest of the article has “crank” written all over it; almost certainly investor fraud too - it’d be straightforward to fake the claimed smartcard video thing to a nontechnical observer - though not quite as egregious as Steorn Orbo or Theranos.

          • fruktmix 12 days ago

            How can I not have heard of this before?! Sounds like the plot for a thriller movie.

    • feoren 14 days ago

      Not who you asked (and I don't quite understand everything) but I think that's about right, except in the continuous world. You pick an encoding scheme (either the Lagrangian or the Hamiltonian) to go from state -> vector. You have a "rules" matrix, very roughly similar to a Markov matrix, H, and (stretching the limit of my knowledge here) exp(-iHt) very roughly "translates" from the discrete stepwise world to the continuous world. I'm sure that last part made more knowledgeable people cringe, but it's roughly in the right direction. The part I don't understand at all is the -i factor: exp(-it) just circles back on itself after t=2pi, so it feels like exp(-iHt) should be a periodic function?

      • empiricus 14 days ago

        Yes, exp(-iHt) means the state vector is rotating as time passes, and it rotates faster when the Hamiltonian (energy) is bigger. This rotation gives the wave-like behavior. Slightly related, there is an old video of Feynman where he tries to teach quantum mechanics to some art students, and he explains this complex rotation and its effects without any reference to math.
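
        A tiny numerical illustration of that rotation (a hand-picked two-level toy, nothing more): each eigencomponent just picks up a phase exp(-iEt), and unless the energies are commensurate the overall state never exactly returns to where it started, which is why exp(-iHt) isn't simply periodic in general.

            import numpy as np

            # Toy two-level system: H diagonal with incommensurate energies
            E = np.array([1.0, np.sqrt(2.0)])
            psi0 = np.array([1.0, 1.0]) / np.sqrt(2)

            for t in [0.0, 2 * np.pi, 4 * np.pi]:
                psi_t = np.exp(-1j * E * t) * psi0   # each component rotates at its own rate
                print(t, abs(np.vdot(psi0, psi_t)))  # overlap with the initial state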

  • BoGoToTo 15 days ago

    I've been thinking about this a bit lately. If time is non-continuous, then could you model the time evolution of the universe as some operator recursively applied to the quantum state of the universe? If each application of the operator progresses the state of the universe by a single Planck time, could we even observe a difference between that and a universe where time is continuous?

    • tweezy 14 days ago

      So one of the most "out there" non-fiction books I've read recently is called "Alien Information Theory". It's a wild ride and there's a lot of flat-out crazy stuff in it but it's a really engaging read. It's written by a computational neuroscientist who's obsessed with DMT. The DMT parts are pretty wild, but the computational neuroscience stuff is intriguing.

      In one part he talks about a thought experiment modeling the universe as a multidimensional cellular automaton, where fundamental particles are nothing more than the information they contain, and where particles colliding is a computation that tells that node and the adjacent nodes how to update their state.

      Way out there, and I'm not saying there's any truth to it. But it was a really interesting and fun concept to chew on.

      • phrotoma 14 days ago

        Definitely way out there and later chapters are what I can only describe as wild conjecture, but I also found it to be full of extremely accessible foundational chapters on brain structure and function.

      • andoando 14 days ago

        I'm working on a model to do just that :) The Game of Life is not too far off either.

      • Gooblebrai 14 days ago

        You might enjoy his next book: Reality Switch.

    • pas 14 days ago

      This sounds like the Bohmian pilot wave theory (which is a global formulation of QM). ... Which might not be that crazy, since spooky action at a distance is already a given. And in cosmology (or quantum gravity) some models describe a region of space based only on its surface. So in some sense the universe is much less information-dense than we think.

      https://en.m.wikipedia.org/wiki/Holographic_principle

  • cmgbhm 14 days ago

    Not a direct comment on the question, but I had a math PhD as an intern before. One of his comments was that all this high-dimensional linear algebra stuff was super advanced math in the early 1900s and still has plenty of room for new CS discovery.

    Didn’t make the “what was going on then in physics” connection until now.

  • tpurves 14 days ago

    So what you are saying is that we've reached the point where our own most sophisticated computer models are starting to approach the same algorithms that define the universe we live in? Aka, the simulation is showing again?

  • lagrange77 15 days ago

    I only understand half of it, but it sounds very interesting. I've always wondered if the principle of stationary action could be of any help with machine learning, e.g. provide an alternative point of view / formulation.

seydor 15 days ago

I have found the YouTube videos by CodeEmporium to be simpler to follow: https://www.youtube.com/watch?v=Nw_PJdmydZY

The transformer is hard to describe with analogies, and TBF there is no good explanation of why it works, so it may be better to just present the mechanism, "leaving the interpretation to the viewer". Also, it's simpler to describe dot products as vectors projecting on one another.
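
For example, a quick sketch of that framing (toy 2-D vectors, nothing from the video): q.k equals |k| times the length of q's projection onto k, so a large attention score just means the query points in roughly the same direction as the key.

    import numpy as np

    q = np.array([2.0, 1.0])   # "query"
    k = np.array([3.0, 0.0])   # "key"

    dot = q @ k                               # 6.0
    proj_len = dot / np.linalg.norm(k)        # length of q projected onto k
    print(dot, proj_len * np.linalg.norm(k))  # same number, read two ways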

  • mjburgess 15 days ago

    The explanation is just that NNs are a stat fitting alg learning a conditional probability distribution, P(next_word|previous_words). Their weights are a model of this distribution. LLMs are a hardware innovation: they make it possible for GPUs to compute this at scale across TBs of data.

    Why does 'mat' follow from 'the cat sat on the ...'? Because 'mat' is the most frequent continuation in the dataset, and the NN is a model of those frequencies.

    Why is 'London in UK' "known" but 'London in France' isn't? Just because 'UK' occurs much more frequently in the dataset.

    The algorithm isn't doing anything other than aligning computation to hardware; the computation isn't doing anything interesting. The value comes from the conditional probability structure in the data -- and that structure comes from people arranging words usefully, because they're communicating information with one another.
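
    As a deliberately crude sketch of the kind of frequency model I mean (a counted bigram table -- obviously far simpler than a trained transformer, but the same "conditional structure in the data" point applies):

        from collections import Counter, defaultdict

        corpus = "the cat sat on the mat . the dog sat on the rug .".split()

        # Count P(next_word | previous_word) directly from raw frequencies
        counts = defaultdict(Counter)
        for prev, nxt in zip(corpus, corpus[1:]):
            counts[prev][nxt] += 1

        def predict(prev):
            total = sum(counts[prev].values())
            return {w: c / total for w, c in counts[prev].items()}

        print(predict("the"))   # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
        print(predict("sat"))   # {'on': 1.0}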

    • nerdponx 15 days ago

      I think you're downplaying the importance of the attention/transformer architecture here. If it was "just" a matter of throwing compute at probabilities, then we wouldn't need any special architecture at all.

      P(next_word|previous_words) is ridiculously hard to estimate in a way that is actually useful. Remember how bad text generation used to be before GPT? There is innovation in discovering an architecture that makes it possible to learn P(next_word|previous_words), in addition to the computing techniques and hardware improvements required to make it work.

      • JeremyNT 15 days ago

        > There is innovation in discovering an architecture that makes it possible to learn P(next_word|previous_words), in addition to the computing techniques and hardware improvements required to make it work.

        Isn't that essentially what mjburgess said in the parent post?

        > LLMs are a hardware innovation: they make it possible for GPUs to compute this at scale across TBs of data... The algorithm isn't doing anything other than aligning computation to hardware

        • nerdponx 14 days ago

          Not really, and no. Torch and CUDA align computation to hardware.

          If it were just a matter of doing that, we would be fine with fully-connected MLP. And maybe that would work with orders of magnitude more data and compute than we currently throw at these models. But we are already pushing the cutting edge of those things to get useful results out of the specialized architecture.

          Choosing the right NN architecture is like feature engineering: the exact details don't matter that much, but getting the right overall structure can be the difference between learning a working model and failing to learn a working model, from the same source data with the same information content. Clearly our choice of inductive bias matters, and the transformer architecture is clearly an improvement over other designs.

          Surely you wouldn't argue that a CNN is "just" aligning computation to hardware, right? Transformers are clearly showing themselves as a reliably effective model architecture for text in the same way that CNNs are reliably effective for images.

          • rsfern 14 days ago

            There’s some interesting work replacing scaled dot product attention and position embeddings with fixed format MLPs [0] - so I tend to lean towards thinking of classic transformers as having a reasonable enough inductive bias and the scalability to actually realize the amount of compute that’s needed

            0: https://arxiv.org/abs/2105.08050

          • mjburgess 14 days ago

            Err... no. MLPs are fundamentally sequential algorithms (backprop weight updating). All major innovations in NN design have been to find ways of designing the architecture to fit GPU compute paradigms.

            It was an innovation, in the 80s, to map image structure to weight structure, which is what underpins CNNs. That isn't what made CNNs trainable though... that was AlexNet, and just go read the paper... it's pretty upfront about how the NN architecture is designed to fit the GPU... that's the point of it.

      • mjburgess 15 days ago

        Yes, it's really hard -- the innovation is aligning the really basic dot-product similarity mechanism to hardware. You can use basically any NN structure to do the same task; the issue is that they're untrainable because they aren't parallelizable.

        There is no innovation here in the sense of a brand new algorithm for modelling conditional probabilities -- the innovation is in adapting the algorithm for GPU training on text/etc.

        • HarHarVeryFunny 15 days ago

          > Yes, it's really hard -- the innovation is aligning the really basic dot-product similarity mechanism to hardware. You can use basically any NN structure to do the same task; the issue is that they're untrainable because they aren't parallelizable.

          This is only partially true. I wouldn't say you could use *any* NN architecture for sequence-to-sequence prediction. You either have to model them as a potentially infinite sequence with an RNN of some sort (e.g. LSTM), or, depending on the sequence type, model them as a hierarchy of sub-sequences, using something like a multi-layered convolution or transformer.

          The transformer is certainly well suited to current massively parallel hardware architectures, and this was also a large part of the motivation for the design.

          While the transformer isn't the only way to do seq-2-seq with neural nets, I think the reason it is so successful is more than simply being scalable and well matched to the available training hardware. Other techniques just don't work as well. From the mechanistic interpretability work that has been done so far, it seems that learnt "induction heads", utilizing the key-based attention, and layered architecture, are what give transformers their power.

        • bruce343434 15 days ago

          I don't know why you seem to have such a bone to pick with transformers, but imo it's still interesting to learn about it, and reading your dismissively toned drivel of "just" and "simply" makes me tired. You're barking up the wrong tree, man; what are you on about?

          • mjburgess 15 days ago

            No issue with transformers -- the entire field of statistical learning, from decision trees to NNs, does the same thing... there's no mystery here. No person with any formal training in mathematical finance, applied statistics, hard experimental sciences on complex domains... etc. would be taken in here.

            I'm trying my best to inform people who are interested in being informed, against an entire media ecosystem being played like a puppet-on-a-string by ad companies. The strategy of these companies is to exploit how easy it is to strap anthropomorphic interfaces over models of word frequencies and have everyone lose their minds.

            Present the same models as a statistical dashboard, and few would be so adamant that their sci-fi fantasy is the reality.

            • jameshart 15 days ago

              “There’s no mystery here”

              Nobody’s claiming there’s ‘mystery’. Transformers are a well known, publicly documented architecture. This thread is about a video explaining exactly how they work - that they are a highly parallelizable approach that lends itself to scaling back propagation training.

              “No person with … formal training … would be taken in here”

              All of a sudden you’re accusing someone of perpetuating a fraud - I’m not sure who though. “Ad companies”?

              Are you seriously claiming that there hasn’t been a qualitative improvement in the results of language generation tasks as a result of applying transformers in the large language model approach? Word frequencies turn out to be a powerful thing to model!

              It’s ALL just hype, none of the work being done in the field has produced any value, and everyone should… use ‘statistical dashboards’ (whatever those are)?

            • fellendrone 14 days ago

              > models of word frequencies

              Ironically, your best effort to inform people seems to be misinformed.

              You're talking about a Markov model, not a language model with trained attention mechanisms. For a start, transformers can consider the entire context (which could be millions of tokens) rather than simple state to state probabilities.

              No wonder you believe people are being 'taken in' and 'played by the ad companies'; your own understanding seems to be fundamentally misplaced.

              • saeranv 14 days ago

                I think they are accounting for the entire context; they specifically write out:

                >> P(next_word|previous_words)

                So the "next_word" is conditioned on "previous_words" (plural), which I took to mean the joint distribution of all previous words.

                But, I think even that's too reductive. The transformer is specifically not a function acting as some incredibly high-dimensional lookup table of token conditional probabilities. It's learning a (relatively) small number of parameters to compress those learned conditional probabilities into a radically lower-dimensional embedding.

                Maybe you could describe this as a discriminative model of conditional probability, but at some point, we start describing that kind of information compression as semantic understanding, right?

                • nerdponx 14 days ago

                  It's reductive because it obscures just how complicated that `P(next_word|previous_words)` is, and it obscures the fact that "previous_words" is itself a carefully-constructed (tokenized & vectorized) representation of a huge amount of text. One individual "state" in this Markov-esque chain is on the order of an entire book, in the bigger models.

                  • mjburgess 14 days ago

                    It doesn't matter how big it is, its properties don't change. E.g., it never says, "I like what you're wearing" because it likes what I'm wearing.

                    It seems there's an entire generation of people taken in by this word "complexity", and it's just magic sauce that gets sprinkled over ad copy for big tech.

                    We know what it means to compute P(word|words), we know what it means that P("the sun is hot") > P("the sun is cold") ... and we know that by computing this, you aren't actually modelling the temperature of the sun.

                    It's just so disheartening how everyone becomes so anthropomorphically credulous here... can we not even get sun worship out of tech? Is it not possible for people to understand that conditional probability structures do not model mental states?

                    No model of conditional probabilities over text tokens, no matter how many text tokens it models, ever says "the weather is nice in August" because it means the weather is nice in August. It has never been in an August, or in weather; nor does it have the mental states for preference or desire... nor has its text generation been caused by the August weather.

                    This is extremely obvious: simply reflect on why the people who wrote those historical texts did so... and reflect on why an LLM generates this text... and you can see that even if an LLM produced, word for word, MLK's "I Have a Dream" speech, it does not have a dream. It has not suffered any oppression; nor organised any labour; nor made demands on the moral conscience of the public.

                    This shouldn't need to be said to a crowd who can presumably understand what it means to take a distribution of text tokens and subset them. It doesn't matter how complex the weight structure of an NN is: this tells you only how compressed the conditional probability distribution is over many TBs of all of text history.

                    • nerdponx 14 days ago

                      You're tilting at windmills here. Where in this thread do you see anyone talking about the LLM as anything other than a next-token prediction model?

                      Literally all of the pushback you're getting is because you're trivializing the choice of model architecture, claiming that it's all so obvious and simple and it's all the same thing in the end.

                      Yes, of course, these models have to be well-suited to run on our computers, in this case GPUs. And sure, it's an interesting perspective that maybe they work well because they are well-suited for GPUs and not because they have some deep fundamental meaning. But you can't act like everyone who doesn't agree with your perspective is just an AI hypebeast con artist.

                      • mjburgess 14 days ago

                        Ah, well, there are actually two classes of replies here, and maybe I'm confusing one for the other.

                        My claim regarding architecture follows just formally: you can take any statistical model trained via gradient descent and phrase it as a kNN. The only difference is how hard it is to produce such a model from fitting to data, rather than from rephrasing.

                        The idea that there's something special about architecture is, really, a hardware illusion. Any empirical function approximation algorithm, designed to find the same conditional probability structure, will in the limit t->inf approximate the same structure (i.e., the actual conditional joint distribution of the data).

                        • nerdponx 13 days ago

                          I think I see the crux of the disagreement.

                          > The idea that there's something special about architecture is, really, a hardware illusion. Any empirical function approximation algorithm, designed to find the same conditional probability structure, will in the limit t->inf approximate the same structure (i.e., the actual conditional joint distribution of the data).

                          But it's not just about hardware. Maybe it would be, if we had access to an infinite stream of perfectly noise-free training data for every conceivable ML task. But we also need to worry about actually getting useful information out of finite data, not just finite computing resources. That's the limit you should be thinking about: the information content of input data, not compute cycles.

                          And yes, when trying to learn something as tremendously complicated as a world-model of multiple languages and human reasoning, even a dataset as big as The Pile might not be big enough if our model is inefficient at extracting information from data. And even with the (relatively) data-efficient transformer architecture, even a huge dataset has an upper limit of usefulness if it contains a lot of junk noise or generally has a low information density.

                          I put together an example that should hopefully demonstrate what I mean: https://paste.sr.ht/~wintershadows/7fb412e1d05a600a0da5db2ba.... Obviously this case is very stylized, but the key point is that the right model architecture can make good use of finite and/or noisy data, and the wrong model architecture cannot, regardless of how much compute power you throw at the latter.

                          It's Shannon, not Turing, who will get you in the end.

                          • mjburgess 7 days ago

                            Text is not a valid measure of the world, so there is no "informative model", i.e., a model of the data generating process, to fit it to. There is no sine curve; indeed there is no function from world->text -- there is an infinite family of functions, none of which is uniquely sampled by what happens to be written down.

                            Transformers, certainly, aren't "informative" in this sense: they start with no prior model of how text would be distributed given the structure of the world.

                            These arguments all make radical assumptions that we are in something like a physics experiment -- rather than scraping glyphs from books and replaying their patterns.

                    • drdeca 14 days ago

                      Perhaps you have misunderstood what the people you are talking about, mean?

                      Or, if not, perhaps you are conflating what they mean with something else?

                      Something doesn’t need to have had a subjective experience of the world in order to act as a model of some parts of the world.

            • eutectic 15 days ago

              Different models have different inductive biases. There is no way you could build GPT4 with decision trees.

            • andoando 14 days ago

              I don't think that's true; modeling the data as vectorizations and having paths between them is entirely different from, say, building a decision tree. The attention head is also something entirely different.

              And the model is capable of doing much more than just making simple predictions; it can build generalized abstractions and form coherent outputs on text/logic it's never seen.

            • divan 15 days ago

              Do you have a blog or anything to follow?

              • mjburgess 15 days ago

                I may start publishing academic papers in XAI as part of a PhD; if I do, I'll share somehow. The problem is the asymmetry of bullshit: the size of the paper necessary for academics to feel that claims have been evidenced is book-length for critique but 2pg for "novel contributions".

          • kordlessagain 15 days ago

            Somebody's judgment weights need to be updated to include emoji embeddings.

        • YetAnotherNick 15 days ago

          No. This is blatantly false. The belief that recurrent models can't be scaled is untrue. People have recently trained Mamba with billions of parameters. The fundamental reason why transformers changed the field is that they are a lot more scalable context-length-wise, and LSTMs, LRUs, etc. don't come close.

          • HarHarVeryFunny 14 days ago

            Yes, but pure Mamba doesn't perform as well as a transformer (and neither did LSTMs). This is why you see hybrid architectures like Jamba = Mamba + transformer. The ability to attend to specific tokens is really key, and it is what is lost in recurrent models where sequence history is munged into a single state.

            • YetAnotherNick 14 days ago

              That's my point. It doesn't perform in terms of loss, even though it performs well enough in terms of compute.

          • mjburgess 15 days ago

            > they are a lot more scalable context-length-wise

            Sure, we're agreeing. I'm just being less specific.

            • YetAnotherNick 15 days ago

              Scalable as in loss-wise scalable, not compute-wise.

    • IanCal 15 days ago

      This is wrong, or at least a simplification to the point of removing any value.

      > NNs are a stat fitting alg learning a conditional probability distribution, P(next_word|previous_words).

      They are trained to maximise this, yes.

      > Their weights are a model of this distribution.

      That doesn't really follow, but let's leave that.

      > Why does 'mat' follow from 'the cat sat on the ...'? Because 'mat' is the most frequent continuation in the dataset, and the NN is a model of those frequencies.

      Here's the rub. If how you describe them is all they're doing then a sequence of never-before-seen words would have no valid response. All words would be equally likely. It would mean that a single brand new word would result in absolute gibberish following it as there's nothing to go on.

      Let's try:

      Input: I have one kjsdhlisrnj and I add another kjsdhlisrnj, tell me how many kjsdhlisrnj I now have.

      Result: You now have two kjsdhlisrnj.

      I would wager a solid amount that kjsdhlisrnj never appears in the input data. If it does pick another one, it doesn't matter.

      So we are learning something more general than the frequencies of sequences of tokens.

      I always end up pointing to this but OthelloGPT is very interesting https://thegradient.pub/othello/

      While it's trained on sequences of moves, what it does is more than just "sequence a,b,c is followed by d most often".

      • mjburgess 15 days ago

        Any NN "trained on" data sampled from an abstract complete outcome space (e.g., a game with formal rules, mathematical sequences, etc.) can often represent that space completely. It comes down to whether you can form conditional probability models of the rules, and that's usually possible because that's what abstract rules are.

        > I have one kjsdhlisrnj and I add another kjsdhlisrnj, tell me how many kjsdhlisrnj I now have.

        1. P(number-word|tell me how many...) > P(other-kinds-of-words|tell me how many...)

        2. P(two|I have one ... I add another ...) > P(one|...) > P(three|...) > others

        This is trivial.

        • IanCal 15 days ago

          Right, learning more abstract rules about how things work is the goal and where the value comes in. Not all algorithms are able to do this, even if they can do what you describe in your first comment.

          That's why they're interesting: OthelloGPT is interesting because it builds a world model.

          • mjburgess 15 days ago

            It builds a model of a "world" whose structure is conditional probabilities; this is circular. It's like saying you can use a lego model to build a model of another lego model. All the papers which "show" NNs building "world" models aren't using any world. It's lego modelling lego.

            The lack of a world model only matters when the data NNs are trained on aren't valid measures of the world that data is taken to model. All the moves of a chess game are a complete model of chess. All the books ever written aren't a model of, well, anything -- the structure of the universe isn't the structure of text tokens.

            The only reason all statistical algorithms, including NNs, appear to model the actual world is because patterns in data give this appearance: P(The Sun is Hot) > P(The Sun is Cold) -- there is no model of the sun here.

            The reason P("The Sun is Hot") seems to model the sun is because we can read the English words "sun" and "hot" -- it is we who think the machine which generates this text does so semantically... but the people who wrote that phrase in the dataset did so; the machine is just generating "hot" because of that dataset.

            • IanCal 15 days ago

              OthelloGPT is fed only moves and builds a model of the current board state in its activations. It never sees a board.

              > It's like saying you can use a lego model to build a model of another lego model.

              No, it's like using a description of piece placements and having a picture in mind of what the current model looks like.

              • mjburgess 15 days ago

                The "board" is abstract. Any game of this sort is defined by a series of conditional probabilities:

                {P(Pawn_on_square_blah|previous_moves) ... etc.}

                What all statistical learning algorithms model is sets of conditional probabilities. So any stat alg is a model of a set of these rules... that's the "clay" of these models.

                The problem is the physical world isn't anything like this. The reason I say "I liked that TV show" is because I had a series of mental states caused by the TV show over time (and so on). This isn't representable as a set of conditional probs in the same way.

                You could imagine, at the end of history, there being a total set of all possible conditional probabilities: P(I liked show|my_mental_states, time, person, location, etc.) -- this would be uncomputable, but it could be supposed.

                If you had that dataset then yes, NNs would learn the entire structure of the world, because that's the dataset. The problem is that the world cannot be represented in this fashion, not that NNs couldn't model it if it could be. A decision tree could, too.

                P(I liked the TV show) doesn't follow from any dataset ever collected. It follows from my mental states. So no NN can ever model it. They can model frequency associations of these phrases in historical text documents: this isn't a model of the world.

                • IanCal 15 days ago

                  > Any game of this sort is defined by a series of conditional probabilities: {P(Pawn_on_square_blah|previous_moves) ... etc.}

                  That would always be 1 or 0, but also that data is not fed into OthelloGPT. That is not the dataset. It is not fed board states at all.

                  It learns it, but it is not the dataset.

                  • mjburgess 15 days ago

                    It is the dataset. When you're dealing with abstract objects (i.e., mathematical spaces), they are all isomorphic.

                    It doesn't matter if you "feed in" 1+1+1+1 or 2+2 or sqrt(16).

                    The rules of chess are encoded either as explicit rules or by contrast classes of valid/invalid games. These are equivalent formulations.

                    When you're dealing with text tokens it does matter if "Hot" frequently comes after "The Sun is..." because reality isn't an abstract space, and text tokens aren't measures of it.

                    • IanCal 15 days ago

                      > It is the dataset.

                      No. A series of moves alone provides strictly less information than a board state or state + list of rules.

                      • mjburgess 15 days ago

                        If the NN learns the game, that is itself an existence proof of the opposite (by obvious information-theoretic arguments).

                        Training is supervised, so you don't need bare sets of moves to encode the rules; you just need a way of subsetting the space into contrast classes of valid/invalid.

                        It's a lie to say the "data" is the moves; the data is the full outcome space: ({legal moves}, {illegal moves}), where the moves are indexed by the board structure (necessarily, since moves are defined by the board structure -- it's an abstract game). So there are two deceptions here: (1) supervision structures the training space; and (2) the individual training rows have sequential structure which maps to board structure.

                        Complete information about the game is provided to the NN.

                        But let's be clear, OthelloGPT still generates illegal moves -- showing that it does not learn the binary conditional structure of the actual game.

                        The deceptiveness of training a NN on a game whose rules are conditional probability structures and then claiming the very-good-quality conditional probability structures it finds are "World Models" is... maddening.

                        This is all just fraud to me; frauds dressing up other frauds in transparent clothing. LLMs trained on the internet are being sold as approximating the actual world, not 8x8 boardgames. I have nothing polite to say about any of this

                        • IanCal 15 days ago

                          > It's a lie to say the "data" is the moves; the data is the full outcome space: ({legal moves}, {illegal moves})

                          There is nothing about illegal moves provided to othellogpt as far as I'm aware.

                          > Complete information about the game is provided to the NN.

                          That is not true. Where is the information that there are two players provided? Or that there are two colours? Or how the colours change? Where is the information about invalid moves provided?

                          > But let's be clear, OthelloGPT still generates illegal moves -- showing that it does not learn the binary conditional structure of the actual game.

                          Not perfectly, no. But that's not at all required for my point, though it is relevant if you try and use the fact that it learns to play the game as proof that moves provide all information about legal board states.

                          • mjburgess 15 days ago

                            How do you think the moves are represented?

                            All abstract games of this sort are just sequences of bit patterns, each pattern related to the full legal space by a conditional probability structure (or, equivalently, as set ratios).

                            Strip away all the NN b/s and anthropomorphic language and just represent it to yourself using bit sets.

                            Then ask: how hard is it to approximate the space from which these bit sets are drawn using arbitrarily deep conditional probability structures?

                            It's trivial.

                            The problem the author sets up about causal structures in the world cannot be represented as a finite sample of bit-set sequences -- and even if it could, that isn't the data being used.

                            The author hasn't understood the basics of what the 'world model' problem even is.

      • pas 14 days ago

        How does it work underneath?

        "kjsdhlisrnj" is in the context, it gets tokenized, and now when the LLM is asked to predict/generate next-token sequences, somehow "kjsdhlisrnj" is there too. It learns patterns. Okay, sure, they get encoded somehow, but during inference how does this lead to application of a recalled pattern on the right token(s)?

        Also, can it invent new words?

        • empath-nirvana 14 days ago

          Yes. This is one of the first things people did with ChatGPT:

          > can you help me invent a language for a fictional setting with talking frogs. Give me a sample sentence that frogs might say in their language, along with a translation.

          ChatGPT: Creating a language for a fictional setting can be a fun and imaginative task! For your setting with talking frogs, we'll consider the environment and characteristics that might influence their language. Here’s a brief outline of the linguistic features and a sample sentence:

          Linguistic Features of the Frog Language

          Phonetics: Since frogs are known for their croaks and ribbits, their language might include a lot of guttural and clicking sounds, which can be represented in writing with characters like "k", "g", "q", and clicks like "!" or "ǂ".

          Syntax: We might imagine that frogs are direct and to the point, so their sentence structure could be subject-verb-object (SVO), similar to English, for straightforward communication.

          Vocabulary: Words would likely relate to their daily life—water, flies, lily pads, weather changes, predators, etc.

          Grammar: Simplified grammar with minimal use of tenses and conjugations to reflect an emphasis on the present moment, which is typical for many animalistic languages in fiction.

          Sample Language: Froglan

          Froglii (hello)
          Lok (water)
          Krik (eat)
          Zraak (fly - the insect)
          !quu (danger)
          ǂogg (jump)

          Sample Sentence in Froglan: "Froglii! Lok krik zraak ǂogg."

          Translation: "Hello! Jump eat fly water."

          This sentence, structured in a somewhat English syntax for ease of understanding, would literally mean that a frog is greeting another and then mentioning that it will jump towards water to eat flies. The sentence structure and vocabulary can be further refined and expanded based on how deeply you want to dive into the language creation!

    • albertzeyer 15 days ago

      You are speaking more about n-gram models here. NNs do far more than that.

      Or if you just want to say that NNs are used as a statistical model here: Well, yea, but that doesn't really tell you anything. Everything can be a statistical model.

      E.g., you could also say "this is exactly the way the human brain works", but it doesn't really tell you anything how it really works.

      • cornholio 15 days ago

        > "this is exactly the way the human brain works"

        I'm always puzzled by such assertions. A cursory look at the technical aspects of an iterated attention - perceptron transformation clearly shows it's just a convoluted and powerful way to query the training data, a "fancy" Markov chain. The only rationality it can exhibit is that which is already embedded in the dataset. If trained on nonsensical data it would generate nonsense and if trained with a partially non-sensical dataset it will generate an average between truth and nonsense that maximizes some abstract algorithmic goal.

        There is no knowledge generation going on, no rational examination of the dataset through the lens of an internal model of reality that allows the rejection of invalid premises. The intellectual food is already chewed and digested in the form of the training weights, with the model just mechanically extracting the nutrients, as opposed to venturing into the outside world to hunt.

        So if it works "just like the human brain", it does so in a very remote sense, just like a basic neural net works "just like the human brain", i.e. individual biological neurons can be said to be somewhat similar.

        • pas 14 days ago

          If a human spends the first 30 years of their life in a cult they will be also speaking nonsense a lot - from our point of view.

          Sure, we have a nice inner loop, we do some pruning, picking and choosing, updating, weighting things based on emotions, goals, etc.

          Who knows how complicated those things will prove to model/implement...

      • mjburgess 15 days ago

        My description is true of any statistical learning algorithm.

        The thing that people are looking to for answers, the NN itself, does not have them. That's like looking to Newton's compass to understand his general law of gravitation.

        The reason that LLMs trained on the internet and every ebook have the structure of human communication is because the dataset has that structure. Why does the data have that structure? That requires science; there is no explanation "in the compass".

        NNs are statistical models trained on data -- drawing analogies to animals is a mystification that causes people's ability to think clearly to jump out the window. No one compares stock price models to the human brain; no banking regulator says, "well, your volatility estimates were off because your machines had the wrong thoughts". This is pseudoscience.

        Animals are not statistical learning algorithms, so the reason that's uninformative is because it's false. Animals are in direct causal contact with the world and uncover its structure through interventional action and counterfactual reasoning. The structure of animal bodies, and the general learning strategies, are well known, and have nothing to do with LLMs/NNs.

        The reason that I know "The cup is in my hand" is not because P("The cup is in my hand"|HistoricalTexts) > P(not "The cup is in my hand"|HistoricalTexts)

        • vineyardmike 15 days ago

          > The reason that I know "The cup is in my hand" is not because P("The cup is in my hand"|HistoricalTexts) > P(not "The cup is in my hand"|HistoricalTexts)

          I mostly agree with your points, but I still disagree with this premise. Humans (and other animals) absolutely are statistical reasoning machines. They're just advanced ones which can process more than "text" - they're multi-modal.

          As a super dumb-simple set of examples: Think about the origin of the phrase "Cargo Cult" and similar religious activities - people will absolutely draw conclusions about the world based on their learned observations. Intellectual "reasoning" (science!) really just relies on more probabilities or correlations.

          The reason you know the cup is in your hand is because P("I see a cup and a hand"|HistoryOfEyesight) + P("I feel a cylinder shape"|HistoryOfTactileFeeling) + .... > P(Inverse). You can pretend it's because humans are intelligent beings with deep reasoning skills (not trying to challenge your smarts here!), but humans learn through trial and error just like a NN with reinforcement learning.

          Close your eyes and ask a person to randomly place either a cup from your kitchen or a different object in your hand. You can probably tell which one it is. Why? Because you have learned what it feels like, and learned countless examples of cups that are different, from years of passive practice. That's basically deep learning.

          • mjburgess 15 days ago

            I mean something specific by "statistics": modelling frequency associations in static ensembles of data.

            Having a body which changes over time that interacts with a world that changes over time makes animal learning not statistical (call it, say, experimental). That animals fall into skinner-box irrational behaviour can be modelled as a kind of statistical learning, but it actually isn't.

            It's a failure of ecological salience mechanisms in regulating the "experimental learning" that animals engage in. E.g., with the cargo cults, the reason they adopted that view was because their society had a "big man" value system based on material acquisition, and western warring powers seemed Very Big and so were humiliating. In order to retain their status they adopted (apparently irrational) theories of how the world worked (gods, etc).

            From the outside this process might seem statistical, but it's the opposite. Their value system made material wealth have a different causal salience which was useful in their original ecology (a small island with small resources), but it went haywire when faced with the whole world.

            Eventually these mechanisms update with this new information, or the tribe dies off -- but what's going wrong here is that the very very non-statistical learning ends up describable that way.

            This is, indeed, why we should be very concerned about people skinner-boxing themselves with LLMs.

            • data_maan 14 days ago

              > Having a body which changes over time that interacts with a world that changes over time makes animal learning not statistical (call it, say, experimental). That animals fall into skinner-box irrational behaviour can be modelled as a kind of statistical learning, but it actually isn't.

              RL is doing just this, simulating an environment. And we can have an agent "learn" in that environment.

              I think tying learning to a body is too restrictive. The

              You strongly rely on the assumption that "something else" generates the statistics we observe, but scientifically, there exists little evidence whether that "something else" exists (see e.g. the Bayesian brain).

              • mjburgess 14 days ago

                You need some way of inducing distributions in reality, i.e., making the ice cube.

                If you're just subject to time-varying, random, stochastic, perceptual distributions you have no way of estimating the properties of the data generating process (reality).

                You need to be the one in control of the distribution in order to study it: this is the lesson of the whole history of science as an experimental discipline.

                • drdeca 12 days ago

                  What about astronomy and cosmology?

            • vineyardmike 15 days ago

              > Having a body which changes over time that interacts with a world that changes over time makes animal learning not statistical (call it, say, experimental).

              The "experiment" of life is what defines the statistical values! Experimentation is just learning what the statistical output of something is.

              If I hand you a few dice, you'd probably be able to guess the statistical probability of every number for a given roll, because you've learned that through years of observation, building a mental model. If I hand you a weighted die, suddenly your mental model is gone, and you can re-learn experimentally by rolling it a bunch. How can you explain experimental learning except "statistically"?

              > they adopted (apparently irrational) theories of how the world worked (gods, etc)

              They can be wrong without being irrational. Building an airport doesn't make planes show up, but planes won't show up without an airport. If you're an island nation with little understanding of the global geopolitical environment of WWII, you'd have no idea why planes started showing up on your island, but they keep showing up, and only at an airport. It seems rational to assume they'd continue showing up to airports.

              > that animals fall into skinner-box irrational behaviour can be modelled as a kind of statistical learning, but it actually isn't

              What is it if not statistical?

              Also skinner boxes are, in a way, perfectly rational. There's no way to understand the environment, and if pushing a button feeds you, then rationally you should push the button when hungry. Humans like to think we're smart because we've invented deductive reasoning, and we quote "correlation is not causation" as if we're not just learning to predict the world around us from past experiences.

              • mjburgess 15 days ago

                For dice the ensemble average is the time-average: if you roll the dice 1000 times the probability of getting a different result doesn't change.

                For almost everything in the world, action on it, changes it. There are vanishingly few areas where this isn't the case (most physics, most chemistry, etc.).

                Imagine trying to do statistics but every time you sampled from reality the distribution of your sample changes not due to randomness, but because reality has changed. Now, can you do statistics? No.

                It makes all the difference in the world to have a body and hold the thing you're studying. Statistics is trying to guess the shape of the ice cube from the puddle; animal learning is making ice cubes.

        • Demlolomot 15 days ago

          If learning in real life over 5-20 years shows the same result as an LLM being trained on billions of tokens, then yes, it can be compared.

          And there are a lot of people out there who do not do a lot of reasoning.

          After all, optical illusions exist; our brain generalizes.

          The same thing happens with words, like the riddle about the doctor operating on a child where we discover that the doctor is actually a woman.

          And while llms only use text, we can already see how multimodal models become better, architecture gets better and hardware too.

          • mjburgess 15 days ago

            I don't know what your motivation for the comparison is; mine is science, i.e., explanation.

            I'm not interested that your best friend emits the same words in the same order as an LLM; I'm more interested that he does so because he enjoys your company, whereas the LLM does not.

            Engineers overstep their mission when they assume that because you can substitute one thing for another, and sell a product in doing so, this is informative. It isn't. I'm not interested in whether you can replace the sky with a skybox and have no one notice -- who cares? What might fool an ape is everything, and what that matters for science is nothing.

            • Demlolomot 15 days ago

              My thinking is highly influenced by brain research.

              We don't just talk about an LLM; we talk about a neural network architecture.

              There is a direct link to us (neural networks).

    • michaelt 15 days ago

      That's not really an explanation that tells people all that much, though.

      I can explain that car engines 'just' convert gasoline into forward motion. But if the person hearing the explanation is hoping to learn what a cam belt or a gearbox is, or why cars are more reliable now than they were in the 1970s, or what premium gas is for, or whether helicopter engines work on the same principle - they're going to need a more detailed explanation.

      • mjburgess 15 days ago

        It explains the LLM/NN. If you want to explain why it emits words in a certain order you need to explain how reality generated the dataset, ie., you need to explain how people communicate (and so on).

        There is no mystery why an NN trained on the night sky would generate nightsky-like photos; the mystery is why those photos have those patterns... solving that is called astrophysics.

        Why do people, in reasoning through physics problems, write symbols in a certain order? Well, explain physics, reasoning, mathematical notation, and so on. The ordering of the symbols gives rise to a certain utility in imitating that order -- but it isn't explained by that order. That's circular: "LLMs generate text in the order they do, because that's the order of the text they were given".

        • michaelt 15 days ago

          That leaves loads of stuff unexplained.

          If the LLM is capable of rewording the MIT license into a set of hard-hitting rap battle lyrics, but the training dataset didn't contain any examples of anyone doing that, is the LLM therefore capable of producing output beyond the limits of its training data set?

          Is an LLM inherently constrained to mediocrity? If an LLM were writing a novel, does its design force it to produce cliche characters and predictable plotlines? If applied in science, are they inherently incapable of advancing the boundaries of human knowledge?

          Why transformers instead of, say, LSTMs?

          Must attention be multi-headed? Why can't the model have a simpler architecture, allowing such implementation details to emerge from the training data?

          Must they be so big that leading performance is only in the hands of multi-billion-dollar corporations?

          What's going on with language handling? Are facts learned in an abstract enough way that they can cross language barriers? Should a model produce different statements of fact when questioned in different languages? Does France need a French-language LLM?

          Is it reasonable to expect models to perform basic arithmetic accurately? What about summarising long documents?

          Why is it that I can ask questions with misspellings, but get answers with largely correct spelling? If misspellings were in the training data, why aren't they in the output? Does the cleverness that stops LLMs from learning misspellings from the training data also stop them from learning other common mistakes?

          If LLMs can be trained to be polite despite having examples of impoliteness in their training data, can they also be trained to not be racist, despite having examples of racism in their training data?

          Can a model learn a fact that is very rarely present in the training data - like an interesting result in an obscure academic paper? Or must a fact be widely known and oft-repeated in order to be learned?

          Merely saying "it predicts the next word" doesn't really explain much at all.

          • mjburgess 15 days ago

            Which conditional probability sequences can be exploited for engineering utility cannot be known ahead of time; nor is it explained by the NN. It's explained by investigating how the data was created by people.

            Train an NN to generate pictures of the night sky: which ones can be used for navigation? Who knows, ahead of time. The only way of knowing is to have an explanation of how the solar system works and then check the pictures are accurate enough.

            The NN which generates photos of the night sky has nothing in it that explains the solar system, nor does any aspect of an NN model the solar system. The photos it was trained on happened to have their pixels arranged in that order.

            Why those arrangements occur is explained by astrophysics.

            If you want to understand what ChatGPT can do, you need to ask OpenAI for their training data and then perform scientific investigations of its structure and how that structure came to be.

            Talking in terms of the NN model is propaganda and pseudoscience: the NN didn't arrange the pixels, gravity did. Likewise, the NN isn't arranging rap lyrics in that order because it's rapping: singers are.

            There is no actual mystery here. It's just that we are prevented from accessing the data by OpenAI, and we struggle to explain the reality which generated that data -- which requires years of actual science.

            • pas 14 days ago

              It has a lot of things already encoded regarding the solar system, but it cannot really access that knowledge; it cannot - as far as I know - run functions on its own internal encoded data, right? If it does something like that, it's because it learned that higher-level pattern from the training data.

              The problem with NN arrangements in general is that we don't know whether it's actually pulling out some exact training data (or a useful so-far-unseen pattern from the data!) or whether it's some distorted confabulation. (Clever Hans all over again: if I ask ChatGPT to code me a nodeJS IMAP backup program it does, but the package it gleefully imports/require()s is made up.)

              And while the typical artsy arts have loose rules, where making up new shit based on what people wish for is basically the only one, in other contexts that's a hard no-no.

    • forrestthewoods 15 days ago

      I find this take super weak sauce and shallow.

      This recent $10,000 challenge is super super interesting imho. https://twitter.com/VictorTaelin/status/1778100581837480178

      State of the art models are doing more than “just” predicting the probability of the next symbol.

      • mjburgess 15 days ago

        You underestimate the properties of the sequential-conditional structure of human communication.

        Consider how a clever 6yo could fake being a physicist with access to a library of physics textbooks and a shredder. All the work is done for them. You'd need to be a physicist to spot them faking it.

        Of course, LLMs are in a much better position than having shredded physics textbooks -- they have shreddings of all books. So you actually have to try to expose this process, rather than just gullibly prompt using confirmation bias. It's trivial to show they work this way, both formally and practically.

        The issue is, practically, gullible people aren't trying.

        • astrange 14 days ago

          You can program algorithms into transformer networks, up to the limit of how many computations you get.

          https://srush.github.io/raspy/

          Are you going to do computer reductionism too and say computers can't do arithmetic, they just run electricity through silicon?

          • mjburgess 14 days ago

            Sure, electronic computers largely don't implement arithmetic.

            The value of electronic computers derives from the fact they can "differentially activate" electronic devices (graphics cards -> lcd screens, etc.). If they weren't electronic, they'd be basically useless -- since electricity is essentially the only "power transmission" force which we can reliably use.

            I do think much of the metaphorical language we use around these devices completely mystifies them. These abstractions we use fail all the time, and reveal themselves as deceptions. We ought, often, get back to reality.

            The reality of NNs, implemented on electronic devices, is that they can accept digitally encoded electronic signals and output likewise, where the variations in the output signals are models of conditional probability structures.

            This, as an analogy for learning, for animals, for people, etc., is madness; provably so. It's a convenient pseudoscience for an era obsessed with the power of electricity, rather than steam, to drive devices of automation. Were we in the steam age, The Brain would be hydraulic. The gullible in the intelligentsia are always obsessed by the baubles and trinkets of elite appeal.

        • forrestthewoods 14 days ago

          I don’t find your model either convincing or useful.

    • sirsinsalot 15 days ago

      It isn't some kind of Markov chain situation. Attention cross-links the abstract meaning of words, subtle implications based on context and so on.

      So, "mat" follows "the cat sat on the" in a way that reflects the entire worldview of the dataset used for training; it isn't just the next-word probability based on one or more previous words ... it's based on the probabilities over all preceding meaning, and the meanings behind those meanings, and so on.
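
      A toy contrast (purely illustrative, made-up corpus): a bigram/Markov model can only count what followed the single previous word, while a transformer conditions its distribution on the whole prefix.

        from collections import Counter, defaultdict

        corpus = "the cat sat on the mat . the dog sat on the rug .".split()

        # Bigram model: next-word counts given ONLY the previous word.
        bigram = defaultdict(Counter)
        for prev, nxt in zip(corpus, corpus[1:]):
            bigram[prev][nxt] += 1
        print(bigram["the"])  # cat/mat/dog/rug all tied -- "cat" earlier in the
                              # sentence can't tip the prediction toward "mat".

        # A transformer instead models P(next | "the cat sat on the"): the whole
        # prefix is in play, so earlier meaning can steer the next word.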

    • seydor 15 days ago

      People specifically would like to know what the attention calculations add to this learning of the distribution

      • ffwd 15 days ago

        Just speculating, but I think attention enables differentiation of semantic concepts for a word or sentence within a particular context. For any total set of training data you have a smaller number of semantic concepts (say you have 10,000 words; they might contain 2,000 semantic concepts, and those concepts are defined by the sentence structure and surrounding words, which is why they carry a particular meaning). Attention then lets the model differentiate those contexts at different levels (words, etc.). Also, the fact that you can apply attention at runtime/inference means you can generate the context from the prompt, which enables the flexibility of variable prompt / variable output -- but you lose the precision of giving an exact prompt and getting an exact answer.

        • ffwd 15 days ago

          I'm not one to whine about downvotes but I just have to say, it's a bad feeling when I can't even respond to the negative feedback because there is no accompanying comment. Did I misinterpret something? Did you? Who will ever know when there is no information. :L

    • astrange 14 days ago

      LLMs don't work on words, they work on sequences of subword tokens. "It doesn't actually do anything" is a common explanation that's clearly a form of cope, because you can't even explain why it can form complete words, let alone complete sentences.

    • fspeech 14 days ago

      There are an infinite number of distributions that can fit the training data well (e.g., one that completely memorizes the data and therefore replicates the frequencies). The trick is to find the distributions that generalize well, and here the NN architecture is critical.

    • fellendrone 15 days ago

      > Why does, 'mat' follow from 'the cat sat on the ...'

      You're confidently incorrect by oversimplifying all LLMs to a base model performing a completion from a trivial context of 5 words.

      This is tantamount to a straw man. Not only do few people use untuned base models, it completely ignores in-context learning that allows the model to build complex semantic structures from the relationships learnt from its training data.

      Unlike base models, instruct and chat fine-tuning teaches models to 'reason' (or rather, perform semantic calculations in abstract latent spaces) with their "conditional probability structure", as you call it, to varying extents. The model must learn to use its 'facts', understand semantics, and perform abstractions in order to follow arbitrary instructions.

      You're also confabulating the training metric of "predicting tokens" with the mechanisms required to satisfy this metric for complex instructions. It's like saying "animals are just performing survival of the fittest". While technically correct, complex behaviours evolve to satisfy this 'survival' metric.

      You could argue they're "just stitching together phrases", but then you would be varying degrees of wrong:

      For one, this assumes phrases are compressed into semantically addressable units, which is already a form of abstraction ripe for allowing reasoning beyond 'stochastic parroting'.

      For two, it's well known that the first layers perform basic structural analysis such as grammar, and later layers perform increasing levels of abstract processing.

      For three, it shows a lack of understanding in how transformers perform semantic computation in-context from the relationships learnt by the feed-forward layers. If you're genuinely interested in understanding the computation model of transformers and how attention can perform semantic computation, take a look here: https://srush.github.io/raspy/

      For a practical example of 'understanding' (to use the term loosely), give an instruct/chat tuned model the text of an article and ask it something like "What questions should this article answer, but doesn't?" This requires not just extracting phrases from a source, but understanding the context of the article on several levels, then reasoning about what the context is not asserting. Even comparatively simple 4x7B MoE models are able to do this effectively.

    • raindear 14 days ago

      But why do transformers perform better than older language models, including other neural language models?

    • nextaccountic 15 days ago

      > Why does, 'mat' follow from 'the cat sat on the ...' because 'mat' is the most frequent word in the dataset; and the NN is a model of those frequencies.

      What about cases that are not present in the dataset?

      The model must be doing something besides storing raw probabilities to avoid overfitting and enable generalization (imagine that you could have a very performant model - when it works - but it sometimes would spew "Invalid input, this was not in the dataset so I don't have a conditional probability and I will bail out")

  • blt 14 days ago

    As a computer scientist, the "differentiable hash table" interpretation worked for me. The AIAYN paper alludes to it by using the query/key/value names, but doesn't explicitly say the words "hash table". I guess some other paper introduced them?

  • nerdponx 15 days ago

    > TBF there is no good explanation why it works

    My mental justification for attention has always been that the output of the transformer is a sequence of new token vectors such that each individual output token vector incorporates contextual information from the surrounding input token vectors. I know it's incomplete, but it's better than nothing at all.
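
    A rough numpy sketch of that picture (single head, toy shapes, no masking or multi-head machinery -- just the core idea that every output vector is a context-weighted mix of the value vectors):

      import numpy as np

      def self_attention(X, Wq, Wk, Wv):
          """Single-head self-attention: each output row mixes all value rows."""
          Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project tokens to queries/keys/values
          scores = Q @ K.T / np.sqrt(K.shape[-1])     # (S, S) pairwise relevance
          weights = np.exp(scores - scores.max(-1, keepdims=True))
          weights /= weights.sum(-1, keepdims=True)   # row-wise softmax
          return weights @ V                          # contextualized token vectors

      rng = np.random.default_rng(0)
      S, d = 5, 16                                    # toy sequence length and width
      X = rng.normal(size=(S, d))                     # stand-in for input embeddings
      Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
      out = self_attention(X, Wq, Wk, Wv)             # (5, 16): one enriched vector per token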

    • eurekin 15 days ago

      > TBF there is no good explanation why it works

      I thought the general consensus was: "transformers allow neural networks to have adaptive weights".

      As opposed to previous architectures, where every edge connecting two neurons always has the same weight.

      EDIT: a good video, where it's actually explained better: https://youtu.be/OFS90-FX6pg?t=750&si=A_HrX1P3TEfFvLay

    • rcarmo 15 days ago

      You're effectively steering the predictions based on adjacent vectors (and precursors from the prompt). That mental model works fine.

rayval 15 days ago

Here's a compelling visualization of the functioning of an LLM when processing a simple request: https://bbycroft.net/llm

This complements the detailed description provided by 3blue1brown

  • bugthe0ry 15 days ago

    When visualised this way, the scale of GPT-3 is insane. I can't imagine what GPT-4 would look like here.

    • spi 15 days ago

      IIRC, GPT-4 would actually be a bit _smaller_ to visualize than GPT-3. Details are not public, but from the leaks GPT-4 (at least, some by-now old version of it) was a mixture of experts, with each expert having around 110B parameters [1]. So, while the total number of parameters is bigger than GPT-3's (1800B vs. 175B), it is "just" 16 copies of a smaller (110B-parameter) model. So if you wanted to visualize it in any meaningful way, the plot wouldn't grow bigger - or it would, if you included all the different experts, but they are just copies of the same architecture with different parameters, which is not all that useful for visualization purposes.

      [1] https://medium.com/@daniellefranca96/gpt4-all-details-leaked...

      • joaogui1 14 days ago

        Mixture of Experts is not just 16 copies of a network; it's a single network where, for the feed-forward layers, the tokens are routed to different experts, but the attention layers are still shared. Also, there are interesting choices around how the routing works, and I believe the exact details of what OpenAI is doing are not public. In fact I believe someone making a visualization of that would dispel a ton of myths around what MoEs are and how they work.
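
        A toy sketch of that routing idea (top-1 routing, made-up sizes; real MoEs use learned routers with top-2 selection, load-balancing losses, etc.) -- only the feed-forward block is expert-specific, and each token is dispatched independently:

          import numpy as np

          def moe_ffn(x, router_W, experts):
              """Top-1 mixture-of-experts feed-forward: route each token to one expert."""
              choice = (x @ router_W).argmax(-1)               # expert index per token
              out = np.empty_like(x)
              for e, (W1, W2) in enumerate(experts):
                  sel = choice == e
                  out[sel] = np.maximum(x[sel] @ W1, 0) @ W2   # that expert's own MLP
              return out

          rng = np.random.default_rng(0)
          d, hidden, n_experts, tokens = 32, 64, 4, 10
          experts = [(rng.normal(size=(d, hidden)), rng.normal(size=(hidden, d)))
                     for _ in range(n_experts)]
          router_W = rng.normal(size=(d, n_experts))
          y = moe_ffn(rng.normal(size=(tokens, d)), router_W, experts)   # (10, 32)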

tylerneylon 15 days ago

Awesome video. This helps to show how the Q*K matrix multiplication is a bottleneck, because if you have sequence (context window) length S, then you need to store an SxS size matrix (the result of all queries times all keys) in memory.
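
Back-of-the-envelope numbers for that S x S score matrix (assuming fp16, one head, one layer -- batch size, heads and layers all multiply this further):

  # 2 bytes per fp16 entry, S^2 entries
  for S in (2_048, 32_768, 1_000_000):
      print(f"S={S:>9,}: {2 * S * S / 1e9:.1f} GB")
  # S=    2,048: 0.0 GB
  # S=   32,768: 2.1 GB
  # S=1,000,000: 2000.0 GB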

One great way to improve on this bottleneck is a new-ish idea called Ring Attention. This is a good article explaining it:

https://learnandburn.ai/p/how-to-build-a-10m-token-context

(I edited that article.)

  • danielhanchen 15 days ago

    Oh, with Flash Attention you never have to construct the (S, S) matrix at all (also covered in the article). Since it's softmax(Q @ K^T / sqrt(d)) @ V, you can form the final output in tiles.
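
    A simplified numpy sketch of the tiling idea (not the real fused CUDA kernel, and ignoring the backward pass): walk over K/V in blocks while carrying running softmax statistics, so only an (S, block) tile of scores ever exists.

      import numpy as np

      def tiled_attention(Q, K, V, block=128):
          """softmax(Q @ K^T / sqrt(d)) @ V computed block-by-block over K/V,
          rescaling with a running (max, sum) so no (S, S) matrix is materialized."""
          S, d = Q.shape
          out = np.zeros_like(V, dtype=np.float64)
          m = np.full((S, 1), -np.inf)                 # running row-wise max
          l = np.zeros((S, 1))                         # running softmax denominator
          for j in range(0, K.shape[0], block):
              s = Q @ K[j:j+block].T / np.sqrt(d)      # (S, block) tile of scores
              m_new = np.maximum(m, s.max(-1, keepdims=True))
              p = np.exp(s - m_new)
              scale = np.exp(m - m_new)                # rescale old accumulators
              l = l * scale + p.sum(-1, keepdims=True)
              out = out * scale + p @ V[j:j+block]
              m = m_new
          return out / l

      # sanity check against the dense formulation
      rng = np.random.default_rng(0)
      Q, K, V = (rng.normal(size=(300, 64)) for _ in range(3))
      s = Q @ K.T / np.sqrt(64)
      w = np.exp(s - s.max(-1, keepdims=True)); w /= w.sum(-1, keepdims=True)
      assert np.allclose(tiled_attention(Q, K, V), w @ V)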

    In Unsloth, memory usage scales linearly (not quadratically) due to Flash Attention (+ you get 2x faster finetuning, 80% less VRAM use + 2x faster inference). Still O(N^2) FLOPs though.

    On that note, on long contexts, Unsloth's latest release fits 4x longer contexts than HF+FA2 with +1.9% overhead. So 228K context on H100.

YossarianFrPrez 15 days ago

This video (with a slightly different title on YouTube) helped me realize that the attention mechanism isn't exactly a specific function so much as it is a meta-function. If I understand it correctly, Attention + learned weights effectively enables a Transformer to learn a semi-arbitrary function, one which involves a matching mechanism (i.e., the scaled dot-product.)

  • hackinthebochs 15 days ago

    Indeed. The power of attention is that it searches the space of functions and surfaces the best function given the constraints. This is why I think linear attention will never come close to the ability of standard attention, the quadratic term is a necessary feature of searching over all pairs of inputs and outputs.

abotsis 15 days ago

I think what made this so digestible for me were the animations. The timing, how they expand/contract and unfold while he’s speaking.. is all very well done.

bilsbie 15 days ago

I finally understand this! Why did every other video make it so confusing!

  • chrishare 15 days ago

    It is confusing, 3b1b is just that good.

    • visarga 15 days ago

      At the same time it feels extremely simple

      attention(Q, K, V) = softmax(Q K^T / √d_k) @ V

      is just half a line; the multi-head, masking and positional stuff are just toppings

      we have many basic algorithms in CS that are more involved; it's amazing we get language understanding from such simple math

      • diedyesterday 13 days ago

        Do not be fooled by the simplicity; the magic itself is in the many learned Q, K and V matrices (each of which is huge), which depend on the language(s). The formula is just the form of applying those matrices/transformations: making the embedding for the last token of a context "attend to" (hence attention) all of the information contained in the context so far - at all layers of meaning, not just syntactic or semantic meaning but logical, scientific, poetic, discoursal, etc., which is why attention is multi-head.
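
        A compact numpy sketch of the multi-head part (toy sizes, no masking): each head runs the same attention formula in its own slice of the embedding, and the learned Wq/Wk/Wv/Wo matrices are where all the capacity lives.

          import numpy as np

          def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
              """Each head attends in a d/n_heads subspace; outputs are
              concatenated and mixed by Wo."""
              S, d = X.shape
              hd = d // n_heads
              Q, K, V = X @ Wq, X @ Wk, X @ Wv
              heads = []
              for h in range(n_heads):
                  q, k, v = (M[:, h*hd:(h+1)*hd] for M in (Q, K, V))
                  s = q @ k.T / np.sqrt(hd)
                  a = np.exp(s - s.max(-1, keepdims=True))
                  a /= a.sum(-1, keepdims=True)              # per-head attention pattern
                  heads.append(a @ v)
              return np.concatenate(heads, axis=-1) @ Wo

          rng = np.random.default_rng(0)
          S, d, H = 4, 32, 8
          X, Wq, Wk, Wv, Wo = (rng.normal(size=shape)
                               for shape in [(S, d), (d, d), (d, d), (d, d), (d, d)])
          y = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=H)   # (4, 32)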

        Any complex function can be made to look simple in some representation (e.g its Fourier series or Taylor series, etc.).

      • bilsbie 15 days ago

        For me I never had too much trouble understanding the algorithm. But this is the first time I can see why it works.

  • ur-whale 15 days ago

    > Why did every other video make it so confusing!

    In my experience, with very few notable exceptions (e.g. Feynman), researchers are the worst when it comes to clearly explaining to others what they're doing.

    I'm at the point where I'm starting to believe that pedagogy and research are generally mutually exclusive skills.

    • namaria 15 days ago

      It's extraordinarily difficult to imagine how it feels not to understand something. Great educators can bridge that gap. I don't think it's correlated with research ability in any way. It's just a very rare skill set, to be able to empathize with people who don't understand what you do.

  • thomasahle 15 days ago

    I'm someone who would love to get better at making educational videos/content. 3b1b is obviously the gold standard here.

    I'm curious what things other videos did worse compared to 3b1b?

    • bilsbie 15 days ago

      I think he had a good, intuitive understanding that he wanted to communicate and he made it come through.

      I like how he was able to avoid going into the weeds and stay focused on leading you to understanding. I remember another video where I got really hung up on positional encoding and I felt like I couldn't continue until I understood that. Or other videos that overfocus on matrix operations or softmax, etc.

  • thinkingtoilet 15 days ago

    Grant has a gift of explaining complicated things very clearly. There's a good reason his channel is so popular.

  • Al-Khwarizmi 15 days ago

    Not sure if you mean it as rhetorical question but I think it's an interesting question. I think there are at least three factors why most people are confused about Transformers:

    1. The standard terminology is "meh" at most. The word "attention" itself is just barely intuitive, "self-attention" is worse, and don't get me started about "key" and "value".

    2. The key papers (Attention is All You Need, the BERT paper, etc.) are badly written. This is probably an unpopular opinion. But note that I'm not diminishing their merits. It's perfectly compatible to write a hugely impactful, transformative paper describing an amazing breakthrough, but just don't explain it very well. And that's exactly what happened, IMO.

    3. The way in which these architectures were discovered was largely by throwing things at the wall and seeing what stuck. There was no reflection process that ended in a prediction that such an architecture would work well, which was then empirically verified. It's empirical all the way through. This means that we don't have a full understanding of why it works so well; all explanations are post hoc rationalizations (in fact, lately there is some work implying that other architectures may work equally well if tweaked enough). It's hard to explain something that you don't even fully understand.

    Everyone who is trying to explain transformers has to overcome these three disadvantages... so most explanations are confusing.

    • cmplxconjugate 15 days ago

      >This is probably an unpopular opinion.

      I wouldn't say so. Historically it's quite common. Maxwell's EM papers used such convoluted notation that they are quite difficult to read. It wasn't until they were reformulated in vector calculus that they became infinitely more digestible.

      I think though your third point is the most important; right now people are focused on results.

  • Solvency 15 days ago

    Because:

    1. good communication requires an intelligence that most people sadly lack

    2. because the type of people who are smart enough to invent transformers have zero incentive to make them easily understandable.

    most documents are written by authors subconsciously desperate to mentally flex on their peers.

    • penguin_booze 15 days ago

      Pedagogy requires empathy, to know what it's like to not know something. Good educators often draw on experiences the listener is already familiar with, and then bridge the gap. This skill is orthogonal to mastery of the subject itself, which I think is the reason most descriptions sound confusing, inadequate, and/or incomprehensible.

      Often, the disseminating medium is one-sided, like a video or a blog post, which doesn't help either. A conversational interaction would help the expert sense why someone outside the domain finds the subject confusing ("ah, I see what you mean"...), discuss common pitfalls ("you might think it's like this... but no, it's more like this...") etc.

    • WithinReason 15 days ago

      2. It's not malice. The longer you have understood something the harder it is to explain it, since you already forgot what it was like to not understand it.

shahbazac 15 days ago

Is there a reference which describes how the current architecture evolved? Perhaps from very simple core idea to the famous “all you need paper?”

Otherwise it feels like lots of machinery created out of nowhere. Lots of calculations and very little intuition.

Jeremy Howard made a comment on Twitter that he had seen various versions of this idea come up again and again - implying that this was a natural idea. I would love to see examples of where else this has come up so I can build an intuitive understanding.

  • HarHarVeryFunny 14 days ago

    Roughly:

    1) The initial seq-2-seq approach was using LSTMs - one to encode the input sequence, and one to decode the output sequence. It's amazing that this worked at all - encode a variable length sentence into a fixed size vector, then decode it back into another sequence, usually of different length (e.g. translate from one language to another).

    2) There are two weaknesses of this RNN/LSTM approach - the fixed size representation, and the corresponding lack of ability to determine which parts of the input sequence to use when generating specific parts of the output sequence. These deficiencies were addressed by Bahdanau et al in an architecture that combined encoder-decoder RNNs with an attention mechanism ("Bahdanau attention") that looked at each past state of the RNN, not just the final one.

    3) RNNs are inefficient to train, so Jakob Uszkoreit was motivated to come up with an approach that better utilized available massively parallel hardware, and noted that language is as much hierarchical as sequential, suggesting a layered architecture where at each layer the tokens of the sub-sequence would be processed in parallel, while retaining a Bahdanau-type attention mechanism where these tokens would attend to each other ("self-attention") to predict the next layer of the hierarchy. Apparently in initial implementation the idea worked, but not better than other contemporary approaches (incl. convolution), but then another team member, Noam Shazeer, took the idea and developed it, coming up with an architecture (which I've never seen described) that worked much better, which was then experimentally ablated to remove unnecessary components, resulting in the original transformer. I'm not sure who came up with the specific key-based form of attention in this final architecture.

    4) The original transformer, as described in the "attention is all you need paper", still had a separate encoder and decoder, copying earlier RNN based approaches, and this was used in some early models such as Google's BERT, but this is unnecessary for language models, and OpenAI's GPT just used the decoder component, which is what everyone uses today. With this decoder-only transformer architecture the input sentence is input into the bottom layer of the transformer, and transformed one step at a time as it passes through each subsequent layer, before emerging at the top. The input sequence has an end-of-sequence token appended to it, which is what gets transformed into the next-token (last token) of the output sequence.

    • krat0sprakhar 14 days ago

      Thank you for this summary! Very well explained. Any tips on what resources you use to keep updated on this field?

      • HarHarVeryFunny 14 days ago

        Thanks. Mostly just Twitter, following all the companies & researchers for any new announcements, then reading any interesting papers mentioned/linked. I also subscribe to YouTube channels like Dwarkesh Patel (interviewer) and Yannic Kilcher (AI news), and search out YouTube interviews with the principals. Of course I also read any AI news here on HN, and sometimes there may be interesting information in the comments.

        There's a summary of social media AI news here, that sometimes surfaces something interesting.

        https://buttondown.email/ainews/archive/

jiggawatts 15 days ago

It always blows my mind that Grant Sanderson can explain complex topics in such a clear, understandable way.

I've seen several tutorials, visualisations, and blogs explaining Transformers, but I didn't fully understand them until this video.

  • chrishare 15 days ago

    His content and impact is phenomenal

mastazi 15 days ago

That example with the "was" token at the end of a murder novel (at 3:58 - 4:28 in the video) is genius: really easy for a non-technical person to understand.

  • hamburga 15 days ago

    I think Ilya gets credit for that example — I’ve heard him use it in his interview with Jensen Huang.

justanotherjoe 15 days ago

It seems he brushes over the positional encoding, which for me was the most puzzling part of transformers. The way I understood it, positional encoding is much like dates. Just like dates, there are repeating minutes, hours, days, months... etc. Each of these values has a shorter 'wavelength' than the next. The values are then used to identify the position of each token. Like, 'oh, I'm seeing January 5th tokens. I'm January 4th. This means this is after me'. Of course the real positional encoding is much smoother and doesn't have abrupt ends like dates/times, but I think this was the original motivation for positional encodings.

  • nerdponx 15 days ago

    That's one way to think about it.

    It's a clever way to encode "position in sequence" as some kind of smooth signal that can be added to each input vector. You might appreciate this detailed explanation: https://towardsdatascience.com/master-positional-encoding-pa...

    Incidentally, you can encode dates (e.g. day of week) in a model as sin(day of week) and cos(day of week) to ensure that "day 7" is mathematically adjacent to "day 1".
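
    A short sketch of both tricks (assuming the original fixed sinusoidal scheme from the paper; many newer models learn positions or use rotary embeddings instead):

      import numpy as np

      def sinusoidal_positions(seq_len, d_model):
          """Each pair of dimensions is a sin/cos pair with its own wavelength --
          like second/minute/hour hands marking where a token sits in the sequence."""
          pos = np.arange(seq_len)[:, None]              # (S, 1)
          i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
          angle = pos / 10_000 ** (i / d_model)
          pe = np.zeros((seq_len, d_model))
          pe[:, 0::2] = np.sin(angle)
          pe[:, 1::2] = np.cos(angle)
          return pe                                      # added to the token embeddings

      pe = sinusoidal_positions(seq_len=2048, d_model=128)

      # Same trick for a cyclic feature like day-of-week: day 7 ends up next to day 1.
      day = np.arange(1, 8)
      day_feat = np.stack([np.sin(2 * np.pi * day / 7), np.cos(2 * np.pi * day / 7)], axis=1)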

bjornsing 15 days ago

This was the best explanation I’ve seen. I think it comes down to essentially two aspects: 1) he doesn’t try to hide complexity and 2) he explains what he thinks is the purpose of each computation. This really reduces the room for ambiguity that ruins so many other attempts to explain transformers.

stillsut 14 days ago

In training we learn a.) the embeddings and b.) the KQ/MLP-weights.

How well do Transformers perform given learned embeddings but only randomly initialized decoder weights? Do they produce word soup of related concepts? Anything syntactically coherent?

Once a well-trained, high-dimensional representation of tokens is established, can they learn the KQ/MLP weights significantly faster?

rollinDyno 15 days ago

Hold on, every predicted token is only a function of the previous token? I must have something wrong. That would mean the whole novel has to fit inside the embedding of "was", which is of length 12,288 in this example. Is it really possible that this space is so rich that a single point in it can encapsulate a whole novel?

  • jgehring 15 days ago

    That's what happens in the very last layer. But at that point the embedding for "was" got enriched multiple times, i.e., in each attention pass, with information from the whole context (which is the whole novel here). So for the example, it would contain the information to predict, let's say, the first token of the first name of the murderer.

    Expanding on that, you could imagine that the intent of the sentence to complete (figuring out the murderer) would have to be captured in the first attention passes so that other layers would then be able to integrate more and more context in order to extract that information from the whole context. Also, it means that the forward passes for previous tokens need to have extracted enough salient high-level information already since you don't re-compute all attention passes for all tokens for each next token to predict.
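
    That last point is easiest to see in the KV cache used at inference time (a hedged sketch, not any particular implementation): earlier tokens' keys and values are stored and reused, so each new prediction leans on whatever those earlier passes already extracted.

      import numpy as np

      def decode_step(x_new, cache, Wq, Wk, Wv):
          """Append the new token's key/value to the cache and attend over everything
          seen so far -- earlier tokens' K/V are reused, not recomputed."""
          q = x_new @ Wq
          cache["K"] = np.vstack([cache["K"], x_new @ Wk])
          cache["V"] = np.vstack([cache["V"], x_new @ Wv])
          s = q @ cache["K"].T / np.sqrt(q.shape[-1])
          w = np.exp(s - s.max()); w /= w.sum()
          return w @ cache["V"], cache               # context-aware vector for the new token

      rng = np.random.default_rng(0)
      d = 16
      Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
      cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}
      for _ in range(5):                             # pretend we're generating 5 tokens
          out, cache = decode_step(rng.normal(size=(1, d)), cache, Wq, Wk, Wv)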

    • causal 14 days ago

      > you don't re-compute all attention passes for all tokens for each next token to predict.

      You don't? I imagine the attention maps could be pretty different between n and n+1 tokens.

      Edit: Or maybe you just meant you don't compute attention Σ(n) times for each new token?

  • diedyesterday 13 days ago

    > "Is it really possible that this space is so rich as to have a single point in it encapsulate a whole novel?"

    Not with this GPT. The context size would not allow keeping attention to the total meaning of more than 2048 tokens (as reflected in the transformed embedding of that context's last token). A substantial part of a novel would require a much larger context size, which then presumably needs a higher-dimensional embedding/semantic space.

  • causal 14 days ago

    I read this comment yesterday and keep thinking about it. That final token really must "comprehend" everything leading up to it, right? In which case longer context lengths are just trying to pack more meaning into that embedding state.

    Which means the embedding model must do a lot of the lifting to be able to accurately represent meaning across long contexts so well. Now I want to know more about how those models are derived.

  • vanjajaja1 15 days ago

    at that point what it has is not a representation of the input, it's a representation of what the next output could be. i.e. it's a lossy process and you can't extract what came in the past, only the details relevant to next-word prediction

    (is my understanding)

    • rollinDyno 15 days ago

      If that point were a representation of only the next token, and predicted tokens were a function of only the preceding token, then the vector of the new token wouldn't have the information needed to produce further tokens that keep the novel going.

  • faramarz 15 days ago

    it's not about a single point encapsulating a novel, but how sequences of such embeddings can represent complex ideas when processed by the model's layers.

    each prediction is based on a weighted context of all previous tokens, not just the immediately preceding one.

    • rollinDyno 15 days ago

      That weighted context is the 12,288-dimensional vector, no?

      I suppose that when each element in the vector is 16 bits, the space is immense and capable of holding a novel in a single point.

      • faramarz 12 days ago

        GPT-4 is configurable up to 96 layers, each running their own embeddings. I think it was a business choice to afford the compute while they scale.

      • causal 14 days ago

        But if I understand correctly, GPT-4 reduces that to a 1536-dimensional vector. Roughly 1/8th. It's counterintuitive to me.

  • evolvingstuff 14 days ago

    You are correct, that is an error in an otherwise great video. The k+1 token is not merely a function of the kth vector, but rather all prior vectors (combined using attention). There is nothing "special" about the kth vector.

kordlessagain 15 days ago

What I'm now wondering about is how intuition to connect completely separate ideas works in humans. I will have very strong intuition something is true, but very little way to show it directly. Of course my feedback on that may be biased, but it does seem some people have "better" intuition than others.

thomasahle 15 days ago

I like the way he uses a low-rank decomposition of the Value matrix instead of Value+Output matrices. Much more intuitive!

  • imjonse 15 days ago

    It is the first time I've heard about the Value matrix being low-rank, so for me this was the confusing part. Codebases I have seen also have value + output matrices, so it is clearer that Q, K, V are similar sizes and there's a separate projection matrix that adapts to the dimensions of the next network layer. UPDATE: He mentions this in the last sections of the video.

cs702 14 days ago

Fantastic work by Grant Sanderson, as usual.

Attention has won.[a]

It deserves to be more widely understood.

---

[a] Nothing has outperformed attention so far, not even Mamba: https://arxiv.org/abs/2402.01032

mehulashah 15 days ago

This is one of the best explanations that I've seen on the topic. I wish there were more work, however, not on how Transformers work, but on why they work. We are still figuring it out, but I feel that the exploration is not at all systematic.

spacecadet 15 days ago

Fun video. Much of my "art" lately has been dissecting models, injecting or altering attention, and creating animated visualizations of their inner workings. Some really fun shit.

  • j_bum 15 days ago

    Link? Sounds fun and reminds me of this tweet [0]

    [0] https://x.com/jaschasd/status/1756930242965606582

    • spacecadet 15 days ago

      Nah someone down voted it. And yes, it looks like that + 20 others that are animated.

      • CamperBob2 15 days ago

        Downvotes == empty boats. If "Empty Boat parable" doesn't ring a bell, Google it...

        • globalnode 15 days ago

          unless an algorithm decides to block or devalue the content, but yeah i looked it up, very interesting parable, thanks for sharing.

kjhenner 14 days ago

The first time I really dug into transformers (back in the BERT days) I was working on a MS thesis involving link prediction in a graph of citations among academic documents. So I had graphs on the brain.

I have a spatial intuition for transformers as a sort of analog to a message passing network over a "leaky graph" in an embedding space. If each token is a node, its key vector sets the position of an outlet pipe that it spews value to diffuse out into the embedding space, while the query vector sets the position of an input pipe that sucks up value other tokens have pumped out into the same space. Then we repeat over multiple attention layers, meaning we have these higher order semantic flows through the space.

Seems to make a lot of sense to me, but I don't think I've seen this analogy anywhere else. I'm curious if anybody else thinks of transformers in this way. (Or wants to explain how wrong/insane I am?)