Launch HN: Superb AI (YC W19) – AI-Powered Training Data
I’m Hyun, one of the co-founders of Superb AI (https://www.superb-ai.com) in the YC W19 batch. We use AI to semi-automatically collect and label training data for tech companies, and help them implement machine-learning based features faster.
Almost all the magical AI features we see are actually built on training data produced by humans. Companies build software tools and farm the work out to large groups of people who click and type repeatedly to label the raw data. That's how training data is made, and up to this point it's a very large portion of what an AI is. The process has worked, but it is slow and prone to error. Moreover, the size of training datasets has grown exponentially over the past few years, since more data almost always means better AI performance. AI engineers now handle datasets as large as tens of millions of images, so there is a great need for a better way to build training data.
We started out as a team of five — Hyun (myself, CEO), Jungkwon (CTO), Jonghyuk and Moonsu (AI Engineers), and Hyundong (Operations) — and after about a year, we are now a team of thirteen. We have backgrounds in robotics, computer vision, data mining, and algorithmic programming, and we all worked together at a corporate AI research lab for around two years. While working on projects ranging from self-driving to StarCraft game AI, we experienced first-hand how building training data was one of the biggest hurdles to developing new AI-based products and changing people's lives. We initially tried to solve this problem from an academic perspective and published a research paper on using coarsely labeled training data for machine learning. Soon we decided to work on a more practical and widely applicable solution, and that's why we started this company.
So how do we use AI to solve this problem? We take two main approaches. First, we try to automate as many pieces of the data-building pipeline as possible using AI. To do so, we split the process into many smaller steps. Take an image-labeling task, for example: putting bounding-box labels around each object can be split into 1) scanning the image to find the type and location of each object, 2) drawing a bounding box around each one, and 3) refining and validating all annotations. Some of these smaller task chunks can be completely automated with AI. For others, we build AI tools that assist humans. And for the really difficult ones, we have human workers do them manually. It's a tricky problem because we need to understand which tasks AI can do better than humans and vice versa, and carefully choose which to automate.
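As a rough illustration (not our actual system), the routing step could look like the sketch below: a detection model proposes boxes with confidence scores, high-confidence proposals are auto-accepted, and the rest go to human annotators. The `Box` type, `AUTO_ACCEPT` threshold, and `route` function are all hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Box:
    """One bounding-box proposal from a detection model (hypothetical schema)."""
    label: str
    x: float
    y: float
    w: float
    h: float
    confidence: float

# Assumed threshold; in practice this would be tuned per project.
AUTO_ACCEPT = 0.90

def route(proposals: List[Box]) -> Tuple[List[Box], List[Box]]:
    """Split model proposals into auto-accepted boxes and boxes
    that need human review, based on model confidence."""
    auto, needs_review = [], []
    for box in proposals:
        if box.confidence >= AUTO_ACCEPT:
            auto.append(box)
        else:
            needs_review.append(box)
    return auto, needs_review
```

The interesting design question is where to set the threshold: too high and humans review nearly everything, too low and model errors slip into the dataset unreviewed.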
Second, we improve our AI components throughout a given data-building project using a human-in-the-loop approach. The key is that we bootstrap: we feed portions of the training data we make back into our AI so we can fine-tune it on the fly over the duration of a project. For example, we may start a project with a baseline AI ("version 1"), and for every 20% of a particular training dataset we produce, we iterate and keep fine-tuning, so that by the end of the project we have AI "version 5" trained specifically for that project. As our AI components improve, the human contribution shrinks over time, until very little human intervention is needed. Ultimately, we want to make it work the way humans learn: we watch others do something a few times and quickly learn to do it ourselves. As our technology improves, our AI will be able to learn from only a few examples of human demonstration.
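The iteration loop above can be sketched minimally as follows. This is a toy illustration under stated assumptions: `fine_tune` is a placeholder that just bumps a model "version" counter, and the human-review step is omitted; a real pipeline would train an actual model and merge human corrections.

```python
from typing import List, Tuple

def fine_tune(model_version: int, labeled_chunk: list) -> int:
    """Placeholder for one fine-tuning pass over newly labeled data.
    Here it simply returns the next model version."""
    return model_version + 1

def build_dataset(raw_data: list, model_version: int,
                  n_chunks: int = 5) -> Tuple[List[tuple], int]:
    """Label the data in chunks (e.g. 20% each for n_chunks=5),
    fine-tuning the assisting model after every chunk."""
    dataset = []
    chunk_size = len(raw_data) // n_chunks
    for i in range(n_chunks):
        chunk = raw_data[i * chunk_size:(i + 1) * chunk_size]
        # The current model proposes labels; humans would refine them here.
        labeled = [(item, f"label_by_v{model_version}") for item in chunk]
        dataset.extend(labeled)
        # Feed the freshly labeled chunk back in: v1 -> v2 -> ... -> v5.
        model_version = fine_tune(model_version, labeled)
    return dataset, model_version
```

Each chunk is labeled by a progressively better model, which is why the manual share of the work can shrink as the project goes on.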
We've found that using AI in these two ways makes the process not only faster but also more accurate. One reason existing manual labeling services have accuracy problems is that humans have to do too much: they spend hours doing the same clicks over and over. By making it almost painless for humans to figure out what to do, we help them get through more data while staying less exhausted and less cognitively loaded.
Our current customers, including LG Electronics, come from industries ranging from autonomous vehicles and consumer electronics to physical security and manufacturing. Most tech companies have a shortage of AI experts and must develop machine-learning-based features with very few of them. As a result, they lack the resources to build their own automated data pipeline and often rely on outsourced manual labor. We can deliver training data much faster, and with higher quality, than vendors that rely extensively on manual labor.
We are extremely grateful to have the chance to introduce ourselves to the HN community and hear your feedback. And we're happy to answer any of your questions. Thank you!