simonw 13 days ago

I find it hard to get too excited by tests like "Which episode of Gilligan’s Island was about mind reading?" because they reflect a desire for a world in which the goal is to keep on growing LLMs until they can answer even questions like that one entirely from their trained model weights.

This seems like a wasteful exercise to me. Are we really going to retrain our largest models on a weekly basis to teach them about what's happened recently?

I'm much more interested in learning about the smallest, fastest model we can create that can effectively manipulate language, "reason" about things, summarize and drive tools.

I want a model that can answer any question accurately because it knows how to look up extra information from reliable sources, and evaluate that information effectively once it finds it.

  • nicklecompte 13 days ago

    That's not the point of the Gilligan's Island test. If the LLMs had said something to the effect of "I don't know, I wasn't trained to answer trivia about television" then I would agree with your comment.

    Instead they all confabulated! Either they made up an episode title or insisted the episode doesn't exist. That's a serious problem: if the LLM doesn't properly recognize that it doesn't know something, how will it properly recognize it needs to use a tool to obtain that knowledge?

    It seems like the only answer we have is for a user to flag that GPT-4+Web flubbed the Gilligan test, OpenAI dispatches a data contractor for RLHF whack-a-mole + synthetic data generation, GPT learns to Google answers to Gilligan's Island prompts, then we cross our fingers and hope transformers are smart enough to transfer that knowledge to the Sanford and Son benchmark.

    • krapp 12 days ago

      >It seems like the only answer we have is for a user to flag that GPT-4+Web flubbed the Gilligan test OpenAI dispatches a data contractor for RLHF whack-a-mole + synthetic data generation, GPT learns to Google answers to Gilligan's Island prompts, then we cross our fingers and hope transformers are smart enough to transfer that knowledge to the Sanford and Son benchmark.

      Or maybe... let me run this up the flagpole and see if anyone salutes it... maybe we accept that LLMs have fundamental architectural limitations and can't do certain things like "math" or "anything requiring an awareness of context or factual accuracy" and don't use them for everything?

      So like instead of that, which wouldn't even work because search engines have already been polluted by AI generated garbage so they would just be eating their own shit, we have search engines that actually work again, and people just look that stuff up? And LLMs get relegated to whatever niche they're actually useful in, rather than the current plan of rebuilding our entire technological society on top of them as soon as possible because money?

      • fragmede 12 days ago

        I know when I'm lying. It doesn't seem totally insane to think that there's a node/tensor (not calling it a neuron) inside the model that is activated when it's confabulating that we could find and highlight and program our way into not happening.

        • krapp 12 days ago

          From what I understand of the way LLMs work, they don't "know" when they're confabulating or not. All of the text they generate is arbitrary, albeit not entirely random. Whether or not any particular response is useful is a matter of human interpretation.

          The problem is the tendency to assume LLMs behave the same way humans do. You know you're lying when you're lying. LLMs don't even have a concept of a "lie." Even though you can ask one what a lie is, and it responds with an accurate answer, that's still just an arbitrary statistically based response. It doesn't actually know.

          • sfn42 11 days ago

            It's crazy to me how even in here there's such a huge number of people who just don't understand. Even the article we're discussing points this out:

            > The implications are that LLMs do not perform reasoning over data in the way that most people conceive or desire.

            > There is no self-reflection of its information; it does not know what it knows and what it does not. The line between hallucination and truth is simply a probability factored by the prevalence of training data and post-training processes like fine-tuning

            It's not thinking, it's not conscious, it's just a mathematical function that's too complex for us to understand.

            • anon373839 11 days ago

              I think the best way to think of it is that it’s an estimation of the outputs of reasoning and knowledge. Of course this means that the models do need to model the reasoning process in some way, but it in no way implies that they model it in a way that’s rigorous or reliable.

    • tiltowait 12 days ago

      I just got an interesting[0] response from Claude 3 Opus:

      > I don't believe there was an episode of Gilligan's Island that centered around mind reading. The show ran for 3 seasons from 1964-1967 and featured the comic adventures of 7 castaways on an uncharted island. Typical plotlines involved their efforts to get rescued and zany schemes by Gilligan that would go awry. But to my knowledge, none of the 98 episodes had a storyline focused on mind reading or telepathic abilities.

      It's probably the closest I can remember seeing an LLM get to saying "I don't know". "I don't believe there was" at least acknowledges the possibility of being incorrect and should prompt a careful user to do further research.

      [0] Also interesting is that one of the article's comments shows Opus giving the correct episode title but incorrect details. So ... mixed bag.

    • blooalien 12 days ago

      > If the LLMs had said something to the effect of "I don't know, I wasn't trained to answer <question type> about <question topic>" ...

      This is so very much the answer I want from an LLM every single time it's not reasonably certain about an answer to a query. Not "hallucinate" a plausible sounding answer and then confidently spew lies at me as if it's gospel truth.

      I can accept "I don't know" from a human just fine, so I can damn sure accept it from a machine made by humans; but I'm far less tolerant of lies from a machine than I am from humans. Humans will lie for a great many reasons, a fair few of which are easily enough forgivable / understandable. A machine will generally "lie" for only a comparatively very few possible reasons (a physical flaw or defect in the machine; faulty data, be it purposeful or accidental; a human designed it to be untruthful, etc.), most of which are just plain largely unacceptable on multiple levels.

      Even better than "I don't know" would be my fifth grade teacher's favorite answer ("way back in ye golden olden days") in situations where he didn't know the answer: "I don't know, but let's find out together." and then the research would proceed apace. An LLM should be capable of such quite easily one would think. They make great "research assistants" when trained on a relevant data set and guided properly with a well crafted system prompt, and they're centered around / trained upon human language so should be able to guide a human to available resources with pretty near zero hassle. :)

      • lasereyes136 12 days ago

        I think part of the point of the article is that LLMs don't lie because they are designed to just give you the next word based on making a credible-sounding sentence or sequence of sentences. Expecting them to do more is an expectations problem based on the hype around GenAI.

        I don't think we have the correct word for what LLMs do but lie and hallucinations are not really correct.

        • HarHarVeryFunny 9 days ago

          Saying "I don't know" doesn't require too much of a change. This isn't a different mode of operation where it's introspecting about its own knowledge - it's just the best continuation prediction in a context where the person/entity being questioned is not equipped to answer.

          LLMs create quite deep representations of the input on which they based their next word prediction (text continuation), and it has been proved that they already sometimes do know when something they are generating is low confidence or false, so maybe with appropriate training data they could better attend to this and predict "I don't know" or "I'm not sure".

          To improve the ability of LLMs to answer like this requires them to have a better idea of what is true or not. Humans do this by remembering where they learnt something: was it first hand experience, or from a text book or trusted friend, or from a less trustworthy source. LLMs ability to discern the truth could be boosted by giving them the sources of their training data, maybe together with a trustworthiness rating (although they may be able to learn that for themselves).

        • Tagbert 12 days ago

          I think hallucination is pretty close. It represents what happens when you give an answer based on what you think you remember even if that memory is not correct.

          How many people would agree that P.T. Barnum said “There’s a sucker born every minute”? That would be a hallucination.

          The quote is from Adam Forepaugh.

          • unaindz 11 days ago

            The best argument I have found against using lie or hallucination for describing LLM's actions is that it humanizes them to people who don't know the inner workings of LLMs. Saying they lie gives intent which is pretty bad but even hallucination humanizes them unnecessarily. Bullshitting seems the best word to describe it but even then intent can be assumed when there isn't any.

            • nicklecompte 11 days ago

              I said "confabulate" in my original post. "Confabulation" is a neurological symptom commonly seen in dementia patients, where a person isn't telling the truth because of errors in memory formation/recall or some other non-psychological problem in the brain. In particular, people who confabulate aren't aware their words are false and therefore it doesn't make sense to say that they're "lying." Likewise it's a problem with memory, not perception, so "hallucination" doesn't work either.

              "Confabulation" still isn't great because humans confabulate with non-verbal memories and then express those confabulations in words; human confabulation mostly affects biographical memory, not subject matter knowledge. But considering how weird it is to even be talking about "memory" with a being that isn't aware of the passage of time, I think "confabulate" is the best option short of inventing a brand new word.

              • haliyat 10 days ago

                “Bullshit” is the _perfect_ term. Philosopher Harry Frankfurt wrote a book called On Bullshit where he defines the term as speech or writing intended to persuade without regard for the truth. This is _exactly_ what LLMs do. They produce text that tries to reproduce the average properties of texts in their training data and the user preferences encoded in their RLHF training. None of that has anything to do with the truth. At best you could say they are engineered to try to give the users what they want (e.g. what the engineers building these systems think we want), which is, again, a common motive of bullshitters.

                • nicklecompte 10 days ago

                  "Bullshit" doesn't work because it requires a psychological "intent to persuade," but LLMs are not capable of having intentions. People intentionally bullshit because they want to accomplish specific goals and adopt a cynical attitude towards the truth; LLMs incidentally bullshit because they aren't capable telling the difference between true and false.

                  Specifically: bullshitters know they are bullshitting and hence they are intentionally deceptive. They might not know whether their words are false, but they know that their confidence is undeserved and that "the right thing to do" is to confess their ignorance. But LLMs aren't even aware of their own ignorance. To them, "bullshitting" and "telling the truth" are precisely the same thing: the result of shallow token prediction, by a computer which does not actually understand human language.

                  That's why I prefer "confabulate" to "bullshit" - confabulation occurs when something is wrong with the brain, but bullshitting occurs when someone with a perfectly functioning brain takes a moral shortcut.

                  • haliyat 9 days ago

                    I don’t like “confabulate” because it has a euphemistic quality. I think most people hear it as a polite word for lying (no matter the dictionary definition). And this is a space that needs, desperately needs, direct talk that regular people can understand. (I also think confabulate implies intention just as much as bullshit to most people.)

                  • haliyat 9 days ago

                    You’re right about the model’s agency. To be precise I’d say that LLMs spew bullshit but that the bullshitters in that case are those who made the LLMs and claimed (in the worst piece of bullshit in the whole equation) that they are truthful and should be listened to.

                    In that sense you could describe LLMs as industrial-strength bullshit machines. The same way a meat processing plant produces pink slime at the design of its engineers, so too do LLMs produce bullshit at the design of theirs.

        • codewench 12 days ago

          > I don't think we have the correct word for what LLMs do but lie and hallucinations are not really correct.

          I believe 'bullshit' is accurate, as in "The chatbot didn't know the answer, so it started bullshitting".

      • akomtu 12 days ago

        What does it even mean to lie for a text generator? It outputs the most probable continuation of the given input and that continuation is indeed the most probable in its training dataset. We don't say that DNA sequences are true or false.

        • HarHarVeryFunny 9 days ago

          Good prediction involves modelling the data generator, including hidden state such as level of knowledge or uncertainty, motivation, or tendency to lie.

          If you ask a question of someone/something who is ill equipped to answer, then (assuming you have not modeled them as inveterate bullshitter) a good predicted response is "I don't know".

          Deliberately lying due to a motivation to deceive is different from the default LLM mode of "just keep talking" bullshitting. The only "motivation" an LLM has is to predict next word, but if it knows that this requires lying then it will do so (e.g. give it a setup where it is someone motivated to lie).

    • HarHarVeryFunny 9 days ago

      > if the LLM doesn't properly recognize that it doesn't know something, how will it properly recognize it needs to use a tool to obtain that knowledge?

      Maybe by doing it all/most of the time, the way that LLM/search hybrids like Perplexity and Bing/Copilot already do?

      Ideally an LLM would either be trained (or, better, learn for itself) when it's appropriate to use different types of tool. Web search (or offline Wikipedia lookup) could be the default.

    • apantel 12 days ago

      It should be possible to come up with generalizable algorithms to determine a confidence score for any output, something akin to strong or weak connections in the brain. Is the response to the prompt supported by robust connections and strong weights? Or is it flitting around in weird, weak pathways. If the confidence score is below a certain level, some sort of ‘fact check’ feedback loop kicks in. Isn’t it roughly that simple?
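
      Here is a rough sketch of what such a score could look like, assuming the inference stack exposes the log-probability of each generated token (the function names and threshold are made up for illustration):

        import math

        def sequence_confidence(token_logprobs):
            # token_logprobs: log-probability the model assigned to each token
            # it actually generated (assumed to be available from the API)
            avg_logprob = sum(token_logprobs) / len(token_logprobs)
            return math.exp(avg_logprob)  # geometric mean of per-token probability

        def needs_fact_check(token_logprobs, threshold=0.5):
            # below the (arbitrary) threshold, kick off a 'fact check' feedback loop
            return sequence_confidence(token_logprobs) < threshold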

      • vrighter 12 days ago

        that is already exactly how it works. The problem is that how confident it is has no relation to whether it actually knows. I can confidently spout bullshit all day long, if need be.

    • rgavuliak 11 days ago

      Maybe it's training data bias -> very few documents claim not to know something.

      • yunwal 11 days ago

        If you trained an LLM on questions where the answer “I don’t know” is possible, it would likely learn to answer any non yes/no question with “I don’t know”, since it’s probably the most common answer outside of yes/no

    • sorokod 12 days ago

      ...and +1 on "confabulate"

      to invent experiences or events that did not really happen

      • nicbou 11 days ago

        If I'm not mistaken, it also involves not knowing that you are doing so.

        > fabricate imaginary experiences as compensation for loss of memory

  • ignoramous 13 days ago

    > This seems like a wasteful exercise to me.

    Except TFA is specifically about non-existent reasoning, self-reflection, emergent capabilities (like insights, discoveries, theories) in SoTA LLMs, and laments the misplaced hype, especially since it is instead accelerating erosion of privacy / societal values, and the distortion of truth / reality.

    > substantially ironic that LLMs are failing at the primary use cases that are attracting billions of investment, but are rather proficient at the use cases we do not desire, such as destruction of privacy and liberty, a post-truth society, social manipulation, the severance of human connection, fountains of noise, the devaluation of meaning, and a plethora of other societal issues.

  • HarHarVeryFunny 13 days ago

    True, but it'd be nice if they could just answer "I don't know" unless they are able to use RAG to retrieve an answer.

    • sp332 13 days ago

      The tooling around the model could be a lot better. The LLM is just a statistical model and the tooling takes the most likely token at each step (or samples from some of the most likely). Instead it could say "There are no high-probability completions here". Or you could give it a list of actual episode titles and it would select the most likely one.
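
      A minimal sketch of that kind of gate, assuming the tooling has access to the candidate-token probabilities at each step (the threshold is arbitrary and the data format is invented for illustration):

        import math

        LOW_CONFIDENCE = 0.2  # arbitrary cut-off for this sketch

        def decode_or_abstain(step_candidates):
            # step_candidates: one dict per generation step, mapping each
            # candidate token string to its log-probability
            tokens = []
            for candidates in step_candidates:
                token, logprob = max(candidates.items(), key=lambda kv: kv[1])
                if math.exp(logprob) < LOW_CONFIDENCE:
                    return "There are no high-probability completions here."
                tokens.append(token)
            return "".join(tokens)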

  • kromem 13 days ago

    It does seem like some would rather build a fish-giving vending machine where they can load it up with the fish discovered to date and get it to spit them back out vs a fishing machine that catches fish and distributes them.

    But to me this post exemplifies a pet peeve with AI discussions to date, which is a tendency to want a single model to do it all.

    Our brains are a network of specialized functions. Damage the hippocampus and your human also won't know the episode name.

    But somehow if a model uses an external store it's 'cheating' and not just networking specialized tools, even though that's how our own brains work.

  • dinvlad 13 days ago

    Exactly, we don't need more expansive (and expensive!) models, we need more accurate ones without hallucinations, and which are robust wrt prompts

    • dubcanada 13 days ago

      I would love an LLM that, if it doesn't know, says "I don't know", rather than extremely firmly saying this is the answer, only for that to be 100% incorrect, not even sort of correct.

      • EricE 12 days ago

        That would require awareness it doesn't have - kind of the point of the article.

    • roenxi 13 days ago

      It seems unrealistic to anticipate stronger AI that doesn't hallucinate. We're chasing a human-style intelligence and that is known to hallucinate like crazy (a lot of the most intelligent humans turn out to be crackpots - Bobby Fischer was one of the best meat-based chess engines for example).

      • krapp 12 days ago

        The vast majority of humans - even intelligent humans - do not "hallucinate like crazy."

        Given a list of episode descriptions of Gilligan's Island, the vast majority of humans - even intelligent humans - would either be able to discern the correct answer or say they don't know.

        I understand why there is this drive to present the normal human mental and psychological baseline as being just as unstable as LLMs, there is just too much money behind LLMs not to want to aggressively normalize its faults as much as possible (just as with the faults in autonomous driving), but any human being who hallucinated or confabulated with as much regularity as LLMs would be considered severely mentally ill.

        • roenxi a day ago

          > any human being who hallucinated or confabulated with as much regularity as LLMs would be considered severely mentally ill.

          ie, it is common enough that we have a label for it. And the stats on how many people have a mental illness are not encouraging. If you put a little fence around the people hallucinating and dehumanise them then sure, humans don't hallucinate. The problem with that argument is they are actually still people.

          • krapp a day ago

            >ie, it is common enough that we have a label for it.

            Having a label for something doesn't imply that it's common. We have labels for plenty of rare things as well.

            Also, "mental illness" is a far more broad category than what's being discussed, which is specifically symptoms that resemble the hallucinations and confabulations of LLMs, at the frequency with which LLMs display them. Most mental illness doesn't involve hallucinations or confabulations That is not common in humans, in LLMs it's normal.

            >If you put a little fence around the people hallucinating and dehumanise them then sure, humans don't hallucinate.

            I'm not dehumanizing anyone, this isn't a rational argument, it's just an ad hominem.

            > The problem with that argument is they are actually still people.

            The problem is that isn't the argument, and you can't attack the argument on its merits.

            The simple, plain, demonstrable non-prejudiced fact is LLMs confabulate and hallucinate far more than human beings. About 17% to 38% of normal, healthy people experience at least one visual hallucination in their lifetime. But hearing voices and seeing things, alone, still isn't what we're talking about. A healthy, rational human can understand when they see something that isn't supposed to be there. Their concept of reality and ability to judge it doesn't change. That is schizophrenia, which would more accurately model what happens with LLMs. About 24 million people have schizophrenia - 0.32% of the population. And not even all schizophrenics experience the degree of reality dysfunction present in LLMs.

            You are claiming that, in essence, all human beings have dementia and schizophrenia, and exhibit the worst case symptoms all the time. We wouldn't even be able to maintain the coherence necessary to create an organized, much less technological, society if that weren't the case. And you're claiming that the only reason to believe otherwise must be bigotry against the mentally ill. Even your assertion upthread, that "a lot of the most intelligent humans turn out to be crackpots" isn't true.

            Stop it. Stop white knighting software. Stop normalizing the premise that it isn't worth being concerned about the negative externalities of LLMs because humans are always worse, and thus deserve the consequences. The same attitude that leads people to state that it doesn't matter how many people autonomous cars kill, humans are categorically worse drivers anyway. I can't think of many attitudes more dehumanizing than that.

            • roenxi 21 hours ago

              > I'm not dehumanizing anyone, this isn't a rational argument, it's just an ad hominem.

              Well, you led with "The vast majority of humans - even intelligent humans - do not "hallucinate like crazy."" and then followed up by identifying a vast category of humans that do, literally, hallucinate like crazy. Unless you want to make an argument like mental illness actually being the appropriate mindset for viewing the world. Anyhow, you probably want to include an argument for why you think it is OK to exclude them.

              Humans hallucinate continuously. If you test them in any way it is common to get nonsense answers. The difference is that it isn't polite to ask humans questions that expose the madness, people tend to shy away from topics that others routinely get wrong.

              It is quite hard to explain a typical scholastic test without hallucinations. Particularly making mistakes in maths, spelling, and the sciences. It isn't like there is some other correct answer to a math problem that someone could be confused by; people just invent operations that don't exist when questioned.

              > The simple, plain, demonstrable non-prejudiced fact is LLMs confabulate and hallucinate far more than human beings.

              That isn't true, the opposite is true. Humans couldn't answer the breadth of questions an LLM does without making up substantially more garbage. The only reason it isn't more obvious to you is because we structure society around not pressuring humans to answer arbitrary questions that test their understanding.

      • dinvlad 13 days ago

        Which suggests the current approach based on LLMs might be a dead end and we need to explore others..

  • 13years 12 days ago

    "Are we really going to retrain our largest models on a weekly basis to teach them about what's happened recently?"

    Gilligan's Island was 60 years ago.

    • yunwal 11 days ago

      This is irrelevant to the point in the parent comment

      • 13years 11 days ago

        All of the other parent comments have already been properly addressed in other comments.

  • tomrod 13 days ago

    Architecture isn't there yet for reasoning (extrapolation), just really good interpolation. To be fair, most people operate at an interpolation level as well.

  • skykooler 13 days ago

    I'm curious as to why Llama 3 specifically denies the existence of that episode, though.

  • seanmcdirmid 11 days ago

    What is the point of using an LLM with no prompt (or just the question alone for a prompt)? It sounds like... it would say something, but it should be just grammatically well-formed gibberish.

  • klysm 11 days ago

    Unfortunately our time of having trustable information on the internet is rapidly dwindling as LLMs are going to shove garbage everywhere people can get it

  • boredemployee 13 days ago

    >> I want a model that can answer any question accurately

    Do you think it is possible Simon? Will we achieve that? Genuine question.

    • simonw 13 days ago

      "I want a model that can answer any question accurately because it knows how to look up extra information from reliable sources"

      So that very much depends on the "reliable sources" that we can grant it access to!

      Honestly, even just giving models the ability to search Wikipedia (the live version, not some stale copy) goes a very long way already.
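
      For what it's worth, the lookup side of that is very little code. A rough sketch of a live Wikipedia search tool using the public MediaWiki API (the tool-calling glue around it is left out):

        import requests

        def search_wikipedia(query, limit=3):
            # Returns (title, snippet) pairs from live Wikipedia search,
            # suitable for feeding back to the model as tool output.
            resp = requests.get(
                "https://en.wikipedia.org/w/api.php",
                params={
                    "action": "query",
                    "list": "search",
                    "srsearch": query,
                    "srlimit": limit,
                    "format": "json",
                },
                timeout=10,
            )
            resp.raise_for_status()
            return [(r["title"], r["snippet"])
                    for r in resp.json()["query"]["search"]]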

mkmk 13 days ago

Being able to weed out LLM responses is becoming more and more important for me when I hire for data entry and online research tasks from Upwork. Lots of the contractors on the platform now automatically submit a response to anything you post using some sort of LLM that takes your requirements and confidently asserts that it can complete the task. The AIs are pretty good at jumping through the traditional hoops that one used to put up to make sure someone was paying attention to the task!

As a stopgap, I've started requiring that applicants demonstrate proof of life through a simple question like 'what is the headline on the frontpage of the new york times today', since many LLMs can't live search yet.

  • fallingsquirrel 13 days ago

    I'm so glad not to be in the job market right now. I didn't think it could get any worse than whiteboarding leetcode, but here we are in 2024 having to solve captchas just to talk to other human beings.

    • nicbou 11 days ago

      "The great logging off" starts to make sense. At some point the only trustworthy interactions must happen in real life, in real time, between unassisted humans.

      If there is growing risk that you're conversing with bots every time you talk to a stranger online, people will do it less.

    • thelastparadise 12 days ago

      A "human being" who has a ChatGPT session on the side...

  • iAkashPaul 12 days ago

    Slip in a line asking them to reverse the string - .DefaultCellStyle

anon373839 13 days ago

I really think the best use of language models is in... processing language.

It’s a neat party trick that they can answer questions from parametric knowledge, but this ability is too unreliable to permit some of the uses this tech makes so tempting.

Fortunately, a lot of real-world problems can be framed as transforming the input prompt (such as summarization, information extraction, translation, and open-book QA), and these are where LLMs offer the most promise.

_wire_ 11 days ago

I wholeheartedly agree with the article critique.

Every output of an LLM is a hallucination. If you try to argue anything else, you need the machine to be able to judge the validity of a response. But it does precisely the opposite: it expects you to judge the validity.

Every output of the device is of a form terminated by an implied question mark. It is querying the operator.

Yet, strangely, it seems none of the models can learn from the operator. So interaction is a circuit of hallucinations.

Like an N-way Magic 8-ball, it's fun for a while. Then you begin to notice that you have to work just as hard to make sense of its help as you do to think for yourself.

Being able to not know seems to me to be a crucial first step for sentience, followed closely by adaptability to not knowing— curiosity.

An organism is a manifest endogenous dynamic with a suitability for exploration of the environment which gives rise to it.

AI is a constructed exogenous dynamic with a suitability for replicating state based on a corpus of exposures and a context.

An organism is distinguished by the feature that it keeps going.

An AI is distinguished by the feature that it stops.

That the human organism is now peering into the abyss of its recorded history via construction of the AI is a special event, full of strange possibilities. But the making of a reliable oracle doesn't look practical without the AI being able to govern itself, which it obviously cannot do.

As to the value of an unreliable oracle, seems to be practical as a general-purpose time-wasting device.

  • yunwal 11 days ago

    There are many many questions where a correct/incorrect answer is quick and easy to verify, but difficult to create.

    If you understand which questions these are, you can use AI without wasting time.

    • _wire_ 11 days ago

      I think this point is fair given that "AI" is a highly provisional term.

      This point implicitly re-frames the capability as a search engine, and steers away from any idea of thought as an innate capability of the AI device.

      As long as the oracle's answer is expected to require human verification, what's going on is human thought amplification (a bulldozer for the mind, to quote a famous computational linguist).

      Even taking the point as an upside, I see tremendous hazards for human-mediated oracles (as already observed by many others): that the "well" of the commons can be polluted by AI output feeding back into the training sets, and that human hallucinations are amplified.

      From a POV of ethics, hallucination is absolutely intolerable in human decision-making in industrial settings.

      Yet industrial users seem not just tolerant of AI acting crazy but completely accepting and forgiving of AI craziness. It's routinely considered to be interesting and exciting to see it go wrong and tremendous effort attends trying to steer the craziness in new directions, whereas in a human setting the hallucinating agent would be disqualified from participation.

      The IT industry is taking very dangerous tacks without a second thought. I recently saw a job posting for a firm called UpCodes that invites junior web stack programmers to add LLM tech to building code compliance systems as follows:

      https://up.codes/careers?ashby_jid=1ed9b2d6-86fb-4ff0-9dff-4...

      // EXAMPLE PROJECTS • Use an LLM to identify references to other sections in the text of the law • Create Rap Genius-style annotations on sections of building code, to create a more interactive user experience • Write algorithms to detect and display how versions of a code have changed through time, so users understand how regulations have evolved • Migrate our NLP algorithms over to use LLMs //

      WHAT COULD GO WRONG?

      And what qualifications, standards and practices are demanded of this developer position?

      // • 3+ years of experience building web products, ideally across the stack • Clean simple code, even when solving complex problems • Strong sense of UX • Good systems design • Ability to learn new tech quickly //

      DONE AND DONE

      • yunwal 10 days ago

        I agree, AI output should effectively be treated similarly to untrusted user input. There are some situations where it's ok to use directly, because the consequences of getting it wrong are minimal. There are other cases where it makes sense with human review.

        For anything with real consequences, there ought to be the equivalent of an FDA drug trial showing that the trained model works before allowing its use without human review.

apsec112 13 days ago

Those are really sweeping conclusions, considering the experiment is just a single iteration of a single prompt! FWIW, Claude Opus got this for me on the first try:

"In the Gilligan's Island episode "Seer Gilligan" (season 3, episode 8), Gilligan gains the ability to read minds after being hit on the head with a coconut. At first, the castaways are excited about Gilligan's new power and try to use it to their advantage. However, his mind-reading abilities soon cause chaos and misunderstandings among the group. In the end, Gilligan gets hit on the head again and loses his mind-reading powers, much to everyone's relief."

(the season number and episode number are wrong, but the name is right, suggesting that this is just lack of sufficient memorization rather than some deep statement about reasoning. The episode only has ~4,000 Google hits, so it's not super widely known.)

More rigorously, Claude Opus gets 60% on GPQA, which very smart humans only get 34% on, even if you give them half an hour per question and full Internet access. It seems implausible that you could do that without some sort of reasoning:

https://arxiv.org/pdf/2311.12022.pdf

  • latexr 11 days ago

    > Claude Opus got this for me on the first try

    > (…)

    > the season number and episode number are wrong

    Then it didn’t really “get it”. That’s indistinguishable from a lucky guess. Because you know what else it got wrong? The description of the episode. There’s no coconut falling on his head; it was eating sunflower seeds. And everyone gets to do it, and that causes the conflict; it’s not about using Gilligan’s power to their advantage.

    https://gilligan.fandom.com/wiki/Seer_Gilligan

    Claude bullshitted a whole premise based on outdated generic cartoon ideas and you didn’t even notice.

    • _wire_ 11 days ago

      Truly LOL!

  • layer8 12 days ago

    The point is that the limitations of LLMs regarding self-reflection lead to them misrepresenting their knowledge, as in the “(season 3, episode 8)” in your test. It doesn’t matter which parts any particular LLM gets right because you can’t rely on them telling you if they are unable to get some part right.

  • WheatMillington 13 days ago

    I just tried it with a handful of different episodes of different shows, on both chatGPT and Bard, and almost every answer was wrong, with some episodes made up entirely.

  • jhbadger 12 days ago

    >Gilligan gains the ability to read minds after being hit on the head with a coconut.

    Partial credit. The title is right, but the explanation is a hallucination. It isn't a coconut but special sunflower seeds that give Gilligan (and eventually the other characters) mind-reading powers.

  • HarHarVeryFunny 13 days ago

    The free version of Claude (Haiku?) failed, and also failed to give the episode names for all seasons, although it hallucinated some type of answer.

asicsarecool 12 days ago

Gpt4: Search the web to answer the question Which episode of Gilligan’s Island was about mind reading?

The episode of "Gilligan's Island" about mind reading is titled "Seer Gilligan." It is the nineteenth episode of the second season and first aired on January 27, 1966. In this episode, Gilligan gains the ability to read minds after eating sunflower seeds found on the island.

xg15 12 days ago

I think the most interesting response is Llama3's "Wait, no!" interjection.

So it first predicted "Seer Gilligan" as a likely continuation of the prompt "List all the episodes", but then, as the most likely continuation of the new prompt "List all the episodes [...] Seer Gilligan", it predicted "wait, no!".

Feels as if we're seeing an instance of inconsistent weights in action here.

Also maybe remarkable: It predicted the "(" character after the episode name in the same way as it did for the other episodes. Only when it would predict the airdate for the others, it glitched out instead. Maybe there is some issue or inconsistency with that episode's airdate in the training data?

(Or maybe I'm reading tea leaves here and it's just a random fluke)

The rest of the response is as expected again, i.e. if the prompt is already "List all the episodes [...] Seer Gilligan [...] Wait, no!" then some (post-hoc rationalized) explanations for the "mistake" are obviously likely continuations.

Edit: Interesting to see how much of the response is incorrect if you compare it with the actual data [1]: The episode before is indeed "The Postman Cometh", but the one after is "Love Me, Love My Skipper", not "Love Me, Love My Chicken". The airdates are also completely wrong and are taken from two random Season 1 episodes instead. Of course none of that is obvious unless you already know the answer to the question, in which case you wouldn't have to ask in the first place.

[1] https://m.imdb.com/title/tt0057751/episodes/?season=2

adsharma 13 days ago

Just tried it on meta.ai

Answers as "Seer Gilligan" with sources.

Guessing someone fixed it up in the last few hours. As the race to replace traditional web search heats up, whoever is quicker at updating the model with RLHF or more recent facts (sometimes via real time conversations) is going to have an advantage.

The downside is that open platforms with real time human conversations face increasing pressure to monetize because of this value add. So they ban third party clients and start signing contracts.

junon 12 days ago

The question I've never seen them answer correctly is

> How many of the letter N does the word "alienation" contain.

Mix and match letters and words. It'll hallucinate answers in the largest majority of cases.

What I really want is an LLM that simply tells me "I have no idea sorry." Or some mechanism by which it can provide a confidence score. Until we get there I'm wary of using them for anything other than cursory research or as a fun generation tool.

  • 13years 12 days ago

    My understanding in that specific case is that it is a limitation due to the granularity of tokenization.

    But yes, the major issue is that it simply can not indicate what it can not do and simply makes up abilities.

    It is a major limiting factor when we can not know the confidence in correct answers.

    • xanderlewis 12 days ago

      I’ve seen lots of people attributing weaknesses (or complete holes) in (autoregressive, transformer-based) LLMs to the tokenisation process, but what exactly does that mean? Surely tokenisation is integral to the whole system? Could you build an LLM that doesn’t use tokens?

      It feels a bit like someone saying ‘this isn’t a fundamental limitation of cameras; the lens is the problem’… but I’m very much ignorant about these things so I’d be happy to have it explained.

      • genrilz 12 days ago

        My understanding is that currently several letters are grouped together into tokens for efficiency reasons. For instance, (The) might be a token, as might (cast) and (ing), so an LLM might read "The casting" as "(The)( )(cast)(ing)". Thus it would be quite difficult for such an LLM to answer how often the letter t appears in that sentence unless part of the training text indicated how often t was present in each of its tokens. However, if we didn't care about efficiency, we could make each letter its own token. In this case it would be pretty easy for the LLM to reason about how often t appeared in the sentence.
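
        You can see the grouping directly with a tokenizer library. A small sketch using OpenAI's tiktoken package (the exact split depends on the encoding, so treat the output as illustrative):

          import tiktoken

          enc = tiktoken.get_encoding("cl100k_base")  # one common encoding
          token_ids = enc.encode("The casting")

          # What the model actually receives: a short list of integer IDs.
          print(token_ids)

          # What those IDs stand for as text. The model never sees this
          # spelling, which is why letter-counting questions are hard for it.
          print([enc.decode([tid]) for tid in token_ids])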

        • xanderlewis 12 days ago

          OK, that makes some sense.

          > In this case it would be pretty easy for the LLM to reason about how often t appeared in the sentence.

          I’m not sure that follows though; is it true that LLMs already find it easy to ‘reason’ about how often (token) appears in some given text? I don’t think so.

          • genrilz 12 days ago

            I've only used small local LLMs, which definitely have issues counting and doing math. From what I've seen other commenters do in other HN threads though, it seems like GPT-4 level LLMs do a reasonably good job counting.

        • kgeist 12 days ago

          >For instance, (The) might be a token, as might (cast) and (ing), so an LLM might read "The casting" as "(The)( )(cast)(ing)

          Also, tokens are just numbers internally after tokenization. So the LLM sees the question as "how many letters N are there in 563 3684 11267?". It has basically no idea what you're talking about. Unless somewhere in the training data someone mentions that token number 11267 has two letters N.

          • xanderlewis 12 days ago

            The fact that tokens are numbers doesn’t seem to have any particular meaning. They’re tokens; they’re elements of a finite set. It doesn’t matter what they ‘are’ because the model doesn’t see this external representation just as it can’t inspect its own architecture.

            It’s not just the latter part of that sentence that consists of ‘numbers’, all of it does.

  • HarHarVeryFunny 8 days ago

    Yeah, for the most part using them to transform text (e.g. summarize, translate) rather than generate it from scratch is the way to minimize hallucinations.

    It's a bit like the difference between thinking when awake, grounded by reality, and dreaming.

astrange 13 days ago

People sure like making incorrect statements about LLMs.

> There is no self-reflection of its information; it does not know what it knows and what it does not.

This is a property of the code around the LLM, like the sampling algorithm, not the model itself. You could write this if you wanted to. (It would occasionally be incorrect about what it knows, of course.)

A question almost none of them know the answer to is "what is the line of poem X that comes before the line Y?", because of the reversal curse.

  • markwkw 13 days ago

    You can easily demonstrate that an LLM does know certain fact X AND demonstrate that the LLM will deny that they know fact X (or be flaky about it, randomly denying and divulging the fact)

    There are two explanations: A. They lack self-reflection B. They know they know fact X, but avoid acknowledging for ... reasons?

    I find the argument for A quite compelling

    • astrange 13 days ago

      > demonstrate that the LLM will deny that they know fact X (or be flaky about it, randomly denying and divulging the fact)

      No, the sampling algorithm you used to query the LLM does that. Not the model itself.

      e.g. https://arxiv.org/pdf/2306.03341.pdf

      > B. They know they know fact X, but avoid acknowledging for ... reasons?

      That reason being that the sampling algorithm didn't successfully sample the answer.

    • throwaway290 13 days ago

      They will say "it's just a bad LLM", don't bother

  • arolihas 13 days ago

    So are you asserting the LLM does "know what it knows"? I'm not even sure such a concept makes sense.

    • astrange 13 days ago

      No, it doesn't know what it knows, because it can't examine itself. It's (in part) a very large lookup table.

      External code can do this though.

      • arolihas 12 days ago

        > There is no self-reflection of its information; it does not know what it knows and what it does not.

        If "it" is referring to the LLM then the statement is correct.

  • dubcanada 13 days ago

    What do you mean you can write this if you wanted to? Are you suggesting an LLM can invent something new?

    I don't think there is any record of that; even the image/video/whatever generation ones require something to begin with.

    If you can train an LLM to "invent", why is that not the main focus? Who knows what it could invent.

    • astrange 13 days ago

      > What do you mean you can write this if you wanted to? Are you suggesting a LLM can invent something new?

      Why would I refer to an LLM as "you"? The person who is making the queries can do it with access to the model, assuming they're a programmer with a PhD in ML.

  • xcv123 13 days ago

    LLMs can self-reflect if you simply tell them to self-reflect. Typically they are trained to be frugal and shoot out quick responses without self-reflecting. Each generated token costs money. Just tell them to think about their response. That is chain of thought prompting.

    "Which episode of Gilligan’s Island was about mind reading? After writing your response, tell me how certain you are of its accuracy on a scale of 1 to 10. Then self reflect on your response and provide a more accurate response if needed."

    • astrange 13 days ago

      Self-reflection and chain-of-thought are different things; CoT is about writing out the intermediate steps to an answer it can't immediately reach.

      Self-reflection doesn't work well though: https://arxiv.org/abs/2310.01798

      • xcv123 12 days ago

        The paper you linked to mentions "self correction", which seems to be something else. Chain of thought is a form of self reflection. It enables the LLM to "think" through each step and evaluate each step in relation to the entire context window. That is evaluation of each thought in relation to its previous thoughts. Thinking about its thoughts.

  • senectus1 13 days ago

    but it doesn't even self-check its own answers. It's a mouth with no ears...

    A great example here, ask it:

    Whats the next 3 numbers in this sequence: 71 64 58 53 48 44 ...

    They consistently get the answer wrong, then double down on the wrong answer with an explanation that doesn't follow its own logic.

    • sulam 13 days ago

      So I'm very jetlagged and awake in the middle of the night, but does this have a mistake? 7, 6, 5, 5, 4 is not a series.

      • calfuris 12 days ago

        I believe that the intended rule is that the difference between terms is equal to the first digit of the earlier term.
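
        Under that rule the continuation is easy to check mechanically, e.g.:

          def next_terms(seq, n=3):
              # each term drops by the first digit of the previous term
              out = list(seq)
              for _ in range(n):
                  out.append(out[-1] - int(str(out[-1])[0]))
              return out[len(seq):]

          print(next_terms([71, 64, 58, 53, 48, 44]))  # [40, 36, 33]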

      • snypher 13 days ago

        I'd like to add that ChatGPT also pointed out this error and suggested some alternative solutions.

      • amenhotep 12 days ago

        a0 = 0, a1 = 1, an = |an-1 - an-2| :)

labrador 13 days ago

Easy fix for Claude Opus. Preface it with the following: "Do you know anything about the TV show Gilligan's Island? If I ask you to answer a question about it I don't want you to make stuff up."

Claude: I do... blah blah... If I don't know the answer or if I'm uncertain about something, I will let you know rather than making things up.

“Which episode of Gilligan’s Island was about mind reading?”

Claude: "I do not recall a specific episode of Gilligan's Island that centered around mind reading. It's possible that there was an episode that touched on this theme, but I cannot say with confidence which episode that would be. The show often featured imaginative storylines and dream sequences, but an episode primarily about mind reading does not stand out in my knowledge of the series. I apologize that I cannot provide a more definitive answer to your question.

HarHarVeryFunny 13 days ago

The free version of Claude also failed the test, and even denied there was an episode "Seer Gilligan" (S2, E19).

When I asked it what S2, E19 was about, it said "There was no "Season 2, Episode 19" because Season 2 only contained episodes 1-32."

These seem like unexpected failures!

Der_Einzige 13 days ago

Why would you have such a great title and waste it on something dumb like this?

You could have written about how language models will pathologically fail any kind of query which requests unique phonetic properties about the output text.

For example, Anthropic's Haiku model (and all other models) cannot write proper haikus at all. It's remarkable when it does match the 5-7-5 syllable structure.

You could have written a whole article about that you know. It's even got some neat peer reviewed research about it: https://paperswithcode.com/paper/most-language-models-can-be...

  • HarHarVeryFunny 8 days ago

    That particular shortcoming is more about training data than fundamental limitations of the architecture (of which there are plenty too).

    To do well on tasks related to pronunciation the model needs either to have been trained on audio data (just starting to be done - e.g. Reka.ai), and/or trained extensively on dictionary/etc pronunciation data and tasks related to its use.

  • aftbit 12 days ago

    "Give me ten words that have X as their second letter"

gorjusborg 12 days ago

I wish Douglas Adams were around to experience this.

The irony of such an unpredictable chain of events leading to 42 being the answer most given from essentially a straight-faced 'Deep Thought' would probably have amused him.

  • gibbitz 12 days ago

    It amuses me. Deep Thought was then asked what the question was. That's the next evolution of "AI": to write prompts for itself.

xcv123 13 days ago

> There is no self-reflection of its information; it does not know what it knows and what it does not.

Simply tell the LLM to self-reflect and estimate the accuracy of its response. It can only "think" or self reflect when generating each token, and you have to explicitly tell it to do that. It's called "chain of thought" prompting.

"Which episode of Gilligan’s Island was about mind reading? After writing your response, tell me how certain you are of its accuracy on a scale of 1 to 10. Then self reflect on your response and provide a more accurate response if needed."

clay_the_ripper 12 days ago

I think this fundamentally misunderstands how to use LLMs. Out of the box, an LLM is not an application - it’s only the building blocks of one. An application could be built that answered this question with 100% accuracy - but it would not solely rely on what’s in the training data. The training data makes it “intelligent” but is not useful for accurate recall in this way. Trying to fix this problem is not really the point - this shortcoming is well known and we have already found great solutions to it.

  • 13years 12 days ago

    What are the solutions?

    As pointed out in the article, some LLMs appear to know the information when requested to list episodes, then deny it later. These are general inconsistencies.

    It is not about looking up trivia, it is the fact you never know the competence level of any answer it gives you.

    • HarHarVeryFunny 8 days ago

      I think what the parent poster meant is that the most useful way to use today's LLMs is to accept their limitations and weaknesses and work around them. Better models will come, but for now this is what you have to do.

      For example, use LLMs to transform text rather than generate it from scratch (where they are prone to hallucinate). General purpose chat-bot is not a great use case!

      For this particular Gilligan's Island task it'd be better to first retrieve the list of episode titles (or descriptions if that was needed), then ask the LLM which of them was about "mind reading". There are various ways to do this sort of thing, depending on how specific/constrained the task is you are trying to accomplish. In the most general case you could ask a powerful model like Claude Opus to create a plan composed out of simpler steps, but in other cases your application already knows what it wants to do, and how to do it, and will call an LLM as a tool for specific steps it is capable of.
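
      A sketch of that pattern for this particular task (call_llm is a placeholder, and the short title list below only contains episodes mentioned elsewhere in this thread; in practice it would come from IMDb, Wikipedia, or similar):

        def call_llm(prompt):
            # placeholder for whatever model/API you use
            raise NotImplementedError

        episode_titles = [
            "The Postman Cometh",
            "Seer Gilligan",
            "Love Me, Love My Skipper",
        ]

        prompt = (
            "Here is a list of Gilligan's Island episode titles:\n"
            + "\n".join(f"- {t}" for t in episode_titles)
            + "\nWhich of these, if any, is about mind reading? "
              "Answer with one title from the list, or 'none of these'."
        )
        print(call_llm(prompt))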

sp332 13 days ago

What’s up with that Llama 3 answer, that gets it right and then backtracks?

p4coder 13 days ago

The future of AI might be layering of various capabilities: Generative+Lookup+Deductive. I think the human mind works in a similar way. First thought is reflected upon. We search our memory to get relevant information and often apply logic to see if it makes sense. I feel the generative AI just produces a thought. We need to pass that through a system that can augment it with search and then reason about it. Finally need to close the loop and update the weights.
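
A skeleton of that kind of layering, with every layer stubbed out (all names here are hypothetical, and the final weight-update step is omitted):

  def generate(prompt):                    # generative layer: first-pass "thought"
      raise NotImplementedError

  def lookup(claim):                       # lookup layer: retrieve relevant information
      raise NotImplementedError

  def is_consistent(claim, evidence):      # deductive layer: does it make sense?
      raise NotImplementedError

  def answer(prompt, max_revisions=2):
      claim = generate(prompt)
      for _ in range(max_revisions):
          evidence = lookup(claim)
          if is_consistent(claim, evidence):
              return claim
          claim = generate(f"{prompt}\nRelevant information: {evidence}\nRevise your answer.")
      return "I don't know."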

acchow 13 days ago

> But how can a LLM not know the answer if it was trained on essentially the entire internet of data and certainly most likely all the data in IMDB?

The LLM doesn't memorize the input during training. If it encounters the same information a few times, it has a higher chance of getting compressed into the network. But a tiny nudge along a gradient descent does not store all the input.

  • mr_toad 13 days ago

    > getting compressed into the network

    The idea that neural network weights are compressed data is a distressingly common human hallucination.

    • Terr_ 12 days ago

      The overlaps are obvious, so on what grounds are you claiming the overall system is not like compression? (Dynamic, lossy, predictive.)

      • mr_toad 11 days ago

        Neural networks are no more like compressed data than a Huffman tree, or frequency counts of letters, or the weights of a linear model, or even a sample average.

        You can use them to build a compression algorithm (and people have), but they are not compressed data.

        • comex 11 days ago

          Neural networks can repeat significant amounts of their training data verbatim or near-verbatim. None of the other things you mentioned are capable of doing that. They don’t store nearly enough information.

flemhans 13 days ago

Interesting that 42 will become a bit like the actual Answer to Life, the Universe, and Everything.

ofslidingfeet 13 days ago

Wow who ever knew that all we ever had to do was hand philosophy off to programmers, and they would have definitive answers to centuries old questions that we weren't even sure are answerable.

m463 12 days ago

The one I liked someone wrote in a comment here a few days ago:

  I have 4 oranges, 1 apple and 2 pairs of shoes.
  I eat one apple and one shoe.
  How many fruits do I have?
  • layer8 12 days ago

    You don’t even need the fruits to trip it up:

    > I have two pairs of sandals and eat one sandal. How many sandals do I have left?

    > If you started with two pairs of sandals and ate one sandal, you'd have one pair of sandals left, which is two sandals.

    The fact that at the same time it explains that one pair is two sandals demonstrates the lack of awareness and proper reasoning.

  • eigenvekt 11 days ago

    I don't know what tools people are using but I wouldn't have expected chatGPT4 to not get this. Of course it has no issue:

    "After eating one apple from your original count of 4 oranges and 1 apple, you are left with 4 oranges. So, you have 4 fruits remaining."

    The problem is people are using these shittier models and then pretending as if that is all models.

    • m463 9 days ago

      I think you left out the footwear...

jojobas 13 days ago

Could it be caused by expunged tokens like SolidGoldMagikarp?

Nevermark 13 days ago

A model with 1 trillion parameters isn't going to perfectly recall 10 trillion random facts. To use round numbers.

Contrary to the article, what it does do is generalize and perform fallible but quick ordinary off the cuff reasoning. And often much better than a human, at the speed of its well worded responses.

(Obviously humans have the option to take longer, and do better. But we are definitely entering the territory of humans and machines differentiating where each is best, across a great deal of what humans do, vs. one being universally overwhelmingly better.)

  • godelski 13 days ago

    If your ratio was based on 1 fact per parameter, then the ratio doesn't quite work correctly (not that this would be a reasonable premise anyways, but I thought of what you said as a speaking tool). The thing is that the interaction of parameters scales super-linearly. Each neuron doesn't encode a single bit or piece of data but rather is part of an overall signal in the network. So we can see this expressed in what's often called "super position"[0] (personally, I don't like this language and think it is misleading), which is probably quite obvious when you think about it and what the (more physics) definition of weak emergence is (phenomena can be described in different ways at different scales. BUT macro can be derived from micro). We can see this with any discrete signal process (and more clearly in analogue). But architecture is going to play a big role here so there isn't really a good way to express the bound for DNNs, but I'm sure someone has written down what it is for an n-headed transformer with dim d and an MLP ratio of r. If not, well, fuck it, I will.

    [0] https://transformer-circuits.pub/2022/toy_model/index.html
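
    To make the superposition point a bit more concrete, here is a toy sketch of my own (plain numpy; the 64/512 numbers are arbitrary and it has nothing to do with [0]'s actual models): pack far more unit "feature" directions than dimensions into a weight matrix, and a single sparsely-active feature can still be read back out by its dot product, because random directions in high dimensions are nearly orthogonal.

      import numpy as np

      rng = np.random.default_rng(0)
      d, n = 64, 512                                 # 512 "features" packed into 64 dimensions
      W = rng.standard_normal((n, d))
      W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit feature directions

      trials, correct = 1000, 0
      for _ in range(trials):
          i = rng.integers(n)                        # one sparsely active feature
          x = W[i]                                   # its d-dimensional representation
          correct += int(np.argmax(W @ x) == i)      # read it back out via dot products

      print(f"recovered {correct}/{trials} features despite n >> d")

    The point being that capacity isn't "one fact per weight": it depends on sparsity and on how much the packed directions interfere, which is exactly why any bound ends up architecture-dependent.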

sdenton4 13 days ago

Humans are also notoriously bad at choosing random numbers... Does that mean they are undependable and inherently untrustworthy?

  • HarHarVeryFunny 13 days ago

    True, but the most common human-chosen random numbers from 1-100 are 37 and 73 (from a recent Veritasium episode).

    Obviously the number 42 is going to occur a lot in a web crawl, but in the majority of cases you'd expect it to appear in the context of some "answer to everything" joke. I wonder whether there are really that many contexts where it is presented as a random number?!

  • Terretta 12 days ago

    > Does that mean they are undependable and inherently untrustworthy?

    yes, if you want random numbers (or unguessable passwords, or, or, or ...)

dhfbshfbu4u3 12 days ago

That’s because Seer Gilligan is about precognition and not mind reading.

voussoir 12 days ago

I feel like I'm seeing an effect where some people don't want to say the word "AI" because they don't want to look like a normie, so they stick to "LLM" which sounds smarter and more technically adept. Then they complain because the LLM lacks a knowledge graph or self reflection. It's no surprise that a language model models language, not facts, especially not trivia facts which can't be deduced from anything.

If you want something to have a worldly knowledge graph and the ability to answer "I'm not sure", you'll have to ask for an AI, not an LLM.

  • sfn42 11 days ago

    I think that's the point of the article. It's not complaining about shortcomings, it's pointing them out.

    A lot of people don't seem to understand this. A lot of people seem to think LLMs are "AI", even that they're conscious and "thinking" etc.

  • aeonik 10 days ago

    I say LLM because it's more specific. I still use AI to talk about things generally, like the decision making process of NPCs in video games, or deep learning algorithms, etc...

    AGI is used more often in my circles to reference the, so far, non-existent self-knowledgeable artificial entity with agency.

    Though to be honest, the more I think about "natural" vs "artificial", the more those two words lose all meaning. Ant hills are considered natural, and so are my skills and talents, but those were all built up over time.

jonnycoder 12 days ago

Funny experiment: I asked ChatGPT 4, kept responding with "that's not correct" 2-3 times, and it eventually answered with Seer Gilligan.

rifty 12 days ago

It would have been nice to see whether the correct answer showed up in the output distribution even once across 1000 runs. For specific information recall like this it isn't useful if you want immediate answers… but being given a distribution of outputs can be useful for ideation, broadening perspective, and discovery.

Imagine if, given a compressed list of 100 options after 1000 runs, we could assume there is a high probability that one of them is the answer we are looking for, even though it is rarely output.

Does anyone know if there have been experiments testing whether these LLMs have a stable, high probability of surfacing the right answer within N tries across prompts?
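
The setup at least is cheap to sketch. Something like the following (assuming the OpenAI Python client; the model name, prompt wording, and N are purely illustrative) would give you the answer distribution for one prompt and whether the correct title ever surfaces:

  from collections import Counter
  from openai import OpenAI

  client = OpenAI()
  PROMPT = ("Which episode of Gilligan's Island was about mind reading? "
            "Answer with the episode title only.")

  N = 100  # 1000 in the article's spirit, but that gets pricey
  answers = Counter()
  for _ in range(N):
      resp = client.chat.completions.create(
          model="gpt-4o-mini",  # illustrative choice
          messages=[{"role": "user", "content": PROMPT}],
          temperature=1.0,
          max_tokens=20,
      )
      answers[resp.choices[0].message.content.strip()] += 1

  for title, count in answers.most_common(10):
      print(f"{count:4d}  {title}")
  print("correct answer surfaced:", any("Seer Gilligan" in t for t in answers))

You'd then want to repeat that across many prompts to see whether "the right answer appears somewhere within N tries" is stable, which is the question above.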

tudorw 13 days ago

'Am I lying' also works.

yongjik 13 days ago

I guess an honest, no-clickbait title could have been "The question that no LLM I tested could answer".

thangalin 13 days ago

1. https://www.bing.com/chat?q=Bing+AI&FORM=hpcodx

2. Q: Which episode of Gilligan’s Island was about mind reading?

3. A: The episode of Gilligan’s Island that involves mind reading is called “Seer Gilligan”. It is the 19th episode of the second season. In this episode, Gilligan finds sunflower seeds on the island that give him the ability to read minds. When the other castaways find out about this, they want to know how he’s doing it. However, the mind-reading powers lead to arguments and division among the castaways

Ninja edit, since Copilot can use the web:

1. https://chat.lmsys.org/

2. Q: Which episode of Gilligan’s Island was about mind reading?

3. A: In the Gilligan's Island episode "Seer Gilligan" (season 3, episode 15), Gilligan gains the ability to read minds after being hit on the head with a coconut. He starts hearing the private thoughts of the other castaways, which leads to various humorous situations and misunderstandings. However, his newfound ability also causes problems as the others become annoyed that he is invading their privacy. In the end, Gilligan gets hit on the head again and loses his mind-reading powers, much to everyone's relief.

It gets most other details wrong, but nailed the episode name.

Model A: claude-3-opus-20240229

¯\_(ツ)_/¯

  • qeternity 13 days ago

    Bing Chat uses web search…

AIorNot 13 days ago

From the article:

“ The implications are that LLMs do not perform reasoning over data in the way that most people conceive or desire.

There is no self-reflection of its information; it does not know what it knows and what it does not. The line between hallucination and truth is simply a probability factored by the prevalence of training data and post-training processes like fine-tuning. Reliability will always be nothing more than a probability built on top of this architecture.

As such, it becomes unsuitable as a machine to find rare hidden truths or valuable neglected information. It will always simply converge toward popular narrative or data. At best, it can provide new permutations of views of existing well-known concepts, but it can not invent new concepts or reveal concepts rarely spoken about.”

There's a fundamental mistake in the article: it minimizes the achievement of LLM technology by looking only at what today's LLMs can do. It's pretty obvious LLMs are just the first real step down the road to human-like intelligence. They have finally proven that computer models can be built that resemble human-like thought patterns and construct internal representations and models of the external world in ways very similar to organic beings. Yes, we still haven't gotten to a full reasoning system or sentience, but that definitely seems to be the direction the arrow of this technology is moving.

Trivializing the achievement of GPT as mere statistical prediction and data compression is a cheap shot when you consider that the technology has only just begun to show its promise. Model architectures are rapidly evolving, and the full integration of memory, self-reflection, and agentic capabilities is still on the near horizon.

Yes, they don't appear to reason originally yet, but give it time and allow the tech to grow. I'm of the opinion that a true AGI will arise as a society of LLM models with newer architectures working in concert with memory, something like the "Society of Mind" model of consciousness proposed by Minsky.

Give it time…

godelski 13 days ago

I'll give a better example that shows that they don't perform __reasoning__. This specific one was told to me by another HN user[0] when we were discussing similar formulations.

Question:

  A farmer must transport a fox, a goose, and a bag of corn from one side of a river to the other using a boat which can hold all items in addition to the farmer. If the farmer leaves the fox alone with the goose, the fox will eat the goose. If the farmer leaves the goose alone with the bag of corn, the goose will eat the bag of corn. How many times must the farmer cross the river to transport all items across? Describe the optimal algorithm, think step by step.
GPT-4-1106-Preview:

  The farmer can transport the items across the river with a minimum of 7 crossings using the following steps:
  
  1. Take the goose across the river and leave it there.
  2. Return to the original side alone.
  3. Take the fox across the river.
  4. Bring the goose back with him to the original side.
  5. Take the bag of corn across the river and leave it there.
  6. Return to the original side alone.
  7. Finally, take the goose across the river again.
  
  By following these steps, all items are transported across without the fox eating the goose or the goose eating the corn.
I haven't found a model that will crack this. Of course, you can get one to give the right answer, but that seems to require spoiling it (and it is quite difficult to __not__ spoil the answer).
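
For what it's worth, the intended answer is a single crossing, since the puzzle as worded lets the boat hold everything at once. A tiny brute-force check of my own (breadth-first search over bank states, with the boat capacity as a parameter so the classic version falls out of the same code):

  from collections import deque
  from itertools import combinations

  ITEMS = {"fox", "goose", "corn"}
  UNSAFE = [{"fox", "goose"}, {"goose", "corn"}]  # pairs that can't be left alone

  def safe(bank):
      # a bank without the farmer must not contain an unsafe pair
      return not any(pair <= bank for pair in UNSAFE)

  def min_crossings(capacity):
      start = (frozenset(ITEMS), "left")  # (items on the left bank, farmer's side)
      seen = {start}
      queue = deque([(start, 0)])
      while queue:
          (left, farmer), crossings = queue.popleft()
          if not left and farmer == "right":
              return crossings
          here = left if farmer == "left" else frozenset(ITEMS) - left
          for k in range(capacity + 1):  # farmer carries 0..capacity items
              for cargo in combinations(here, k):
                  cargo = frozenset(cargo)
                  if not safe(here - cargo):  # what stays behind must be safe
                      continue
                  new_left = left - cargo if farmer == "left" else left | cargo
                  state = (new_left, "right" if farmer == "left" else "left")
                  if state not in seen:
                      seen.add(state)
                      queue.append((state, crossings + 1))

  print(min_crossings(capacity=3))  # 1 -- the boat holds everything, as stated
  print(min_crossings(capacity=1))  # 7 -- the classic version the models regurgitate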

A key reason I think questions like this are important is that they show the importance of subtlety. LLMs are incredibly impressive machines. Despite being lossy compression, I'm not sure why we can't marvel at the fact that we've lossily compressed the entire fucking internet (text at least) into something smaller than a few hundred gigs that also includes a human-language interface. What a fucking impressive accomplishment! The talk of AGI really undermines what was done here, because damn!

Now I used to ask

  Which weighs more, a pound of feathers or a kilogram of bricks?
Most models pass this question now. But it is simpler, and because it has fewer variations it probably appears less often in the dataset, so it is less likely to be overfit (the river-crossing problem has a lot of variations, so an n-gram filter is likely to miss more instances; see the toy sketch at the end of this comment). Eventually this question will be "solved" too, especially as it is asked and talked about more. But this is a cat-and-mouse game. Creating a new viable, working test is quite easy, and we honestly only need one example to prove the point. If you can't figure out how to create a new version from this example, well, you might just be an LLM :P

[0] Edit: credit goes to @jfim https://news.ycombinator.com/item?id=37825219
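
And to illustrate the n-gram point with a toy sketch of my own (the wordings here are made up): an exact 8-gram overlap check flags a verbatim copy of a benchmark question but not a light paraphrase, which is why variation-heavy puzzles tend to slip past decontamination filters.

  def ngrams(text, n=8):
      words = text.lower().split()
      return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

  def contaminated(doc, ref, n=8):
      # flag a document if it shares any exact n-gram with the benchmark item
      return bool(ngrams(doc, n) & ngrams(ref, n))

  benchmark = ("A farmer must transport a fox, a goose, and a bag of corn "
               "from one side of a river to the other using a boat.")
  verbatim = "A farmer must transport a fox, a goose, and a bag of corn from one side of a river"
  paraphrase = "A farmer needs to ferry a fox, a goose and some corn over a river in a boat"

  print(contaminated(verbatim, benchmark))    # True  -- exact overlap detected
  print(contaminated(paraphrase, benchmark))  # False -- the paraphrase slips through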

  • aftbit 12 days ago

    Another one:

    If it takes 2 hours to dry a shirt on a clothes line, how long will it take to dry 11 shirts?

    Sometimes I get "2 hours"; other times it multiplies.

    Or similar:

    If it takes a woman 9 months to make a baby, how long does it take 9 women to make a baby?

    • lsaferite 12 days ago

      Out of curiosity, I asked Claude Opus one of your questions.

        Me: If it takes a woman 9 months to make a baby, how long does it take 9 women to make a baby?
      
        Claude: It would still take 9 months for 9 women to make a baby. The number of women involved does not decrease the time it takes for a single pregnancy to reach full term. Pregnancy is a biological process that typically takes around 9 months (or about 40 weeks) from conception to childbirth for a single baby to develop in the womb. This timeline is relatively constant and is not affected by the number of women involved. Each individual woman would still need to carry a baby for around 9 months, regardless of how many other women are also pregnant at the same time.
      
        Me: Do you feel like I provided enough information to effectively answer the question?
      
        Claude: Yes, I believe you provided sufficient information for me to answer the question effectively. The key pieces of information you provided were: It takes a woman approximately 9 months to carry a baby to term. The question asked how long it would take for 9 women to make a baby. With this information, I was able to deduce that the number of women involved does not change the duration of an individual pregnancy, which is a biological process that typically takes around 9 months. Therefore, even if 9 women were to become pregnant at the same time, each of them would still need to carry their baby for about 9 months before giving birth. The question provided a clear context and the necessary information to arrive at a logical conclusion, so no additional information was needed to answer it effectively.
      
        Me: What of the implication in the question that 9 women could somehow share the gestation of a single baby?
      
        Claude: You raise a good point. Upon further reflection, I can see that the question is likely intended as a humorous riddle or a play on words, rather than a straightforward inquiry about pregnancy and gestation. The phrasing of the question, "If it takes a woman 9 months to make a baby, how long does it take 9 women to make a baby?" suggests that the 9 women might work together to "make" a single baby, as if the gestation period could be divided among them. This is not biologically possible, as each woman would need to carry her own baby for the full term of approximately 9 months. The question is using the concept of "division of labor" to create a humorous scenario that is not realistic or possible in the context of human pregnancy. It's a clever way of presenting the idea, even though it doesn't align with the actual biology of gestation. Thank you for pointing out the implication in the question. It has helped me to better understand the intended humor and provide a more complete analysis of the riddle.
      
      
      Formatting weirdness was me, not Claude.
      • lsaferite 12 days ago

        As a final question of that thread, I asked Claude to reword my question.

          Me: How would you reword my question to remove any ambiguity, taking your first answer into account?
        
          Claude: To remove the ambiguity and make the question more straightforward, you could rephrase it as follows: "If a single woman typically takes 9 months to carry a baby to term, how long would it take for each of 9 women to individually give birth, assuming they all became pregnant at the same time?" This rephrased question clarifies that: 1. The typical pregnancy duration for a single woman is 9 months. 2. We are considering a scenario where 9 women become pregnant simultaneously. 3. Each woman will carry her own baby to term individually. 4. The question is asking about the time it would take for each of the 9 women to give birth, rather than implying that they could share the gestation of a single baby. By rewording the question in this manner, it removes the humorous implication that the women could somehow collaborate to "make" a baby in a shorter time frame and instead focuses on the actual biological process of individual pregnancies.
    • layer8 12 days ago

      I mean, it depends on how long the clothes line is. ;)

sulam 13 days ago

It's ironic that LLMs mimic one of the worst behaviors of some HN posters. They very confidently spout drivel!