Show HN: Data Bonsai: a Python package to clean your data with LLMs

github.com

47 points by alvin_r_h 17 days ago

I've been using LLMs to clean data for my fine-tuning projects, and decided to build a package for it as a side project. Check it out here: https://github.com/databonsai/databonsai

Some features:

- categorization (labelling), transformation and decomposition (text into structured format)

- validates LLM outputs

- batch mode groups the inputs/outputs so you don't resend the prompt (schema, few-shot examples) for every row of data, which saves a significant number of tokens
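A rough sketch of the batching and validation ideas, for anyone curious. This is not databonsai's actual code: the function names, the `##` separator, and the prompt wording are all assumptions made up for illustration. The point is that the schema and few-shot examples appear once per batch rather than once per row, and that the reply is checked before it is trusted.

```python
from typing import Dict, List, Tuple

SEPARATOR = "##"  # hypothetical delimiter; inputs must not contain it


def build_batch_prompt(
    categories: Dict[str, str],
    examples: List[Tuple[str, str]],
    inputs: List[str],
) -> str:
    """Build one prompt that carries the schema and examples once for many rows."""
    schema = "\n".join(f"- {name}: {desc}" for name, desc in categories.items())
    shots = "\n".join(f"Input: {text}\nLabel: {label}" for text, label in examples)
    rows = SEPARATOR.join(inputs)
    return (
        f"Classify each input into exactly one of these categories:\n{schema}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Inputs are separated by '{SEPARATOR}'. Reply with one label per input, "
        f"in the same order, separated by '{SEPARATOR}'.\n\n"
        f"Inputs: {rows}"
    )


def parse_batch_response(
    response: str, categories: Dict[str, str], expected: int
) -> List[str]:
    """Split the model's reply and reject it if it does not match the batch."""
    labels = [part.strip() for part in response.split(SEPARATOR)]
    if len(labels) != expected:
        raise ValueError(f"expected {expected} labels, got {len(labels)}")
    unknown = [label for label in labels if label not in categories]
    if unknown:
        raise ValueError(f"labels outside the schema: {unknown}")
    return labels
```

With this shape, a failed validation can be handled by retrying the batch or falling back to one row per request; which of those is appropriate depends on the model and the data.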

There are some similarities to the Instructor repo, but this is simpler and made for datasets. Would love any feedback/suggestions (and a star if you like it!)

msp26 17 days ago

Oh I'm interested to see how your batch prompt works. I've used the idea for a while and feel that it's very underrated.

ShamelessC 17 days ago

Looks handy. How reliable would you say it is?

  • trehans 17 days ago

    Interested in knowing this as well

    • alvin_r_h 13 days ago

      GPT-4 and Claude models work great, but they cost money. Some users were very interested in running this on Ollama, but that didn't work very well with any of the batch methods.

2024throwaway 17 days ago

ExtractTransformer looks like it has a lot of potential. Going to try this out tomorrow at $DAYJOB.

flyingwheels01 15 days ago

Excellent, thanks for sharing! Will definitely give it a shot!