unraveller 11 days ago

This is a thorough "how to", but it is missing a "why for" for any of the chosen starting elements.

I don't understand why you would use an old dataset that worked for llama2 and just fine-tune llama3 on it. Isn't it most likely that the new model has already covered everything it missed last time around, and that the old dataset is only valuable for the last gen?

  • factorymoo 11 days ago

    This might be an unfair statement, but it really feels like none of these blogs know why. They copy/paste each other (you often see the same errors in multiple notebooks/blogs) and I have a feeling no one really deeply understands what they're doing.

    • unraveller 11 days ago

      Found my answer for the why, thanks to the issues in the latest dolphin fine-tune. They do these types of fine-tunes mainly to reduce refusal rates and increase intelligence. They did the knee-jerk rerun of the same old data this time, as I suspected, just for the lols, to see where open source is at.

      Spoiler alert: fine-tunes won't be better until the data quality is better than Meta's instruction fine-tune. Give it a few weeks.

      Why does [dolphin-l3-8B] perform substantially worse in some tests?

      Essentially, it's trained like this:

        Llama-3-8B-base_model --> Llama-3-8B-Instruct
        Llama-3-8B-base_model --> dolphin-2.9-llama3-8B
      
      And not like this:

        Llama-3-8B-Instruct --> dolphin-2.9-llama3-8B
      
      https://huggingface.co/cognitivecomputations/dolphin-2.9-lla...
    • jackblemming 11 days ago

      Most of the entire field of machine learning is “try shit and see what works”. So it seems like they’re par for the course.

      • v3ss0n 11 days ago

        Same as the software engineering field, too.

        • littlestymaar 11 days ago

          It's even worse for AI given that nobody really understands why anything works.

        • sinuhe69 11 days ago

          I wonder what we don’t understand from the SE POV?

    • ijk 11 days ago

      One additional problem with breathless tutorials about doing things with AI is that they are more likely than average to have been written with ChatGPT. Which, given the knowledge cutoff for most models, is not where I'd personally turn for data on recent technical developments, but is par for the course for the kind of low-effort copy-paste bloggers doing it for attention.

      This particular one seems to be from someone who is documenting their learning process, which is a valuable contribution but, obviously, not a source of great authority on the hows and whys.

  • sa-code 11 days ago

    Thank you for saying this! The number of people who actually need to fine-tune, versus just using RAG, is really small. People who are not familiar with the field often jump to fine-tuning as an option.

    • Foobar8568 11 days ago

      I am still unsure where to stand on fine-tuning vs RAG. I feel that for live data RAG would be preferable, but for data that updates daily/weekly, fine-tuning.

      Another aspect I am unsure about is multi-user use of a model, e.g. can we have concurrent queries against one model, or do the queries have to be queued?

      • bigfudge 11 days ago

        Fine-tuning doesn't "add content" the way RAG does, though. They're not really comparable in that way.
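
        To make the distinction concrete, here is a minimal RAG sketch (the embedding model and the toy docs list are placeholders, not a recommendation): the model's weights never change, and the new content rides in through the prompt at query time.

          import numpy as np
          from sentence_transformers import SentenceTransformer

          # Toy corpus standing in for whatever content you want the model to "know"
          docs = ["Policy: refunds are allowed within 30 days.",
                  "Policy: support hours are 9am-5pm CET."]
          embedder = SentenceTransformer("all-MiniLM-L6-v2")
          doc_vecs = embedder.encode(docs, normalize_embeddings=True)

          def build_prompt(question: str, k: int = 1) -> str:
              # Retrieve the k most similar docs (cosine similarity on unit vectors)
              q_vec = embedder.encode([question], normalize_embeddings=True)[0]
              top = np.argsort(doc_vecs @ q_vec)[::-1][:k]
              context = "\n".join(docs[i] for i in top)
              # The "added content" is just text prepended to the query
              return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

          print(build_prompt("When can I get a refund?"))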

        • Foobar8568 11 days ago

          So it's more about optimizing for specific tasks in a domain?

  • blackoil 11 days ago

    The dataset may not be public. All large companies have millions of internal documents, and an internal LLM can be trained on them.

    • bradfox2 11 days ago

      QLoRA won't work well for adding knowledge from private data.

      Parameter-efficient methods are not useful for these cases at the 8B scale without a more complex training procedure that periodically merges the adapters back into the base weights. Maybe at the 70B scale.
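
      A rough sketch of that merge-back loop with peft (the model name, rank, step counts, and toy batch are illustrative assumptions; a real run needs a proper data pipeline plus sharding/offload, this just shows the control flow):

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer
        from peft import LoraConfig, get_peft_model

        name = "meta-llama/Meta-Llama-3-8B"  # assumes access to the gated repo
        tok = AutoTokenizer.from_pretrained(name)
        base = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
        batch = tok("some private document text", return_tensors="pt")

        lora_cfg = LoraConfig(r=16, lora_alpha=16,
                              target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

        for cycle in range(4):                        # merge back every N steps
            model = get_peft_model(base, lora_cfg)    # fresh low-rank adapter
            opt = torch.optim.AdamW(
                (p for p in model.parameters() if p.requires_grad), lr=1e-4)
            for step in range(100):                   # N steps on the adapter only
                loss = model(**batch, labels=batch["input_ids"]).loss
                loss.backward()
                opt.step()
                opt.zero_grad()
            base = model.merge_and_unload()           # fold adapter into base weights

      The point of the periodic merge is that each new adapter starts from zero, so the accumulated change to the base weights is no longer confined to a single low-rank subspace.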

      • tpurves 11 days ago

        What scale of company do you need to be to actually afford, and get a return on investment from, retraining base models with your own proprietary knowledge and docs? Especially considering the implications of continually retraining?

        • sdesol 11 days ago

          I was under the impression that you wouldn't. If you want access to proprietary knowledge, you would use RAG + LLM.

        • bradfox2 11 days ago

          The only experience I have is first-hand: what my company is doing for our client base. We are doing continuous pretraining, plus the rest of the alignment-stack training, on about 10B private tokens plus private customer data, to produce custom private models for companies in the 500 to 3000 employee range. We built and operate a single-rack cluster that cost mid six figures in order to be able to do this.

          These models get combined with RAG for highly specific technical doc authoring and other uses.

          • tpurves 11 days ago

            This is very helpful context on what works right now, thanks for sharing.

        • littlestymaar 11 days ago

          I don't think anyone has the answer to this question yet.

      • anonymousDan 11 days ago

        Can you point to any literature on this by any chance? I would be really interested to see some in depth analysis.

        • bradfox2 a day ago

          I don't have the arXiv link bookmarked, but there was a paper on pretraining with LoRA. It involved merging the adapters back every n steps, with good results.

  • imjonse 11 days ago

    The "why for" is usually learning/gaining experience/FOMO.

iAkashPaul 11 days ago

With unsloth's optimizations you can do llama-3-8b QLoRA fine-tuning on an 8GB card (mine's a 2070S) with 900MB to spare at a batch size of 4.
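
The setup is roughly this (the pre-quantized repo, rank, sequence length, and toy dataset below are illustrative, not my exact config):

    from unsloth import FastLanguageModel
    from datasets import Dataset
    from trl import SFTTrainer
    from transformers import TrainingArguments

    # 4-bit base weights keep the memory footprint in 8GB territory
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-3-8b-bnb-4bit",
        max_seq_length=2048,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        use_gradient_checkpointing=True,   # important for fitting activations
    )

    # Toy dataset so the sketch runs end to end
    dataset = Dataset.from_dict({"text": ["### Q: hi\n### A: hello"] * 64})

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        args=TrainingArguments(per_device_train_batch_size=4,  # the batch size of 4
                               gradient_accumulation_steps=4,
                               max_steps=60, fp16=True,  # Turing cards lack bf16
                               output_dir="outputs"),
    )
    trainer.train()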

SunlitCat 11 days ago

Since the crypto (currency) craze of 2017, every time I hear "consumer GPU" somewhere in a story that has nothing to do with gaming, it sends a chill down my spine.

  • j0hnyl 9 days ago

    RIP your spine for the foreseeable future.