Launch HN: Superb AI (YC W19) – AI-Powered Training Data

47 points by hyunsoo90 9 days ago

Hi HN,

I’m Hyun, one of the co-founders of Superb AI (https://www.superb-ai.com) in the YC W19 batch. We use AI to semi-automatically collect and label training data for tech companies, and help them implement machine-learning based features faster.

Almost all the magical AI features we see are actually built on training data produced by humans. Companies build software tools and farm the work out to a bunch of people who click and type repeatedly to label the raw data. That's how training data is made, and it's a very large portion of what an AI is today. The process has worked so far, but it is slow and prone to error. Moreover, training datasets have grown exponentially over the past few years, since more data almost always means better model performance. AI engineers are now handling datasets of tens of millions of images, so there is a great need for a better way to build training data.

We started out as a team of five: Hyun (myself, CEO), Jungkwon (CTO), Jonghyuk and Moonsu (AI Engineers), and Hyundong (Operations). After about a year, we are now a team of thirteen. We have backgrounds in robotics, computer vision, data mining, and algorithmic programming, and we all worked together at a corporate AI research lab for around two years. While working on projects ranging from self-driving to StarCraft game AI, we experienced first-hand how building training data was one of the biggest hurdles to developing new AI-based products and changing people's lives. We initially approached the problem from an academic perspective and published a research paper on using coarsely labeled training data for machine learning [0]. Soon after, we decided to work on a more practical and widely applicable solution, and that's why we started this company.

So how do we use AI to solve this problem? We take two main approaches. First, we try to automate as many pieces of the data building pipeline as possible using AI. To do so, we split the process into many smaller steps. Take an image-labeling task, for example: putting bounding-box labels around each object can be split into 1) scanning the image to find the type and location of each object, 2) drawing bounding boxes around each one, and 3) refining and validating all annotations. Some of these smaller task chunks can be completely automated with AI. For others, we build AI tools that assist humans. And for the really difficult ones, we have human workers do the work manually. It's a tricky problem because we need to understand which tasks AI can do better than humans and vice versa, and carefully choose which ones to automate.
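
As a very rough sketch of what such a staged pipeline could look like (illustrative only; the function names, stubbed detector output, and confidence threshold are made up for this example, not our actual code):

    # Illustrative staged labeling pipeline: stage 1 is automated, stage 2 is
    # AI-assisted, and low-confidence cases fall back to human annotators.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Box:
        label: str
        x: float
        y: float
        w: float
        h: float
        confidence: float

    def detect_objects(image) -> List[Box]:
        # Stage 1 (automated): a detector proposes object types and locations.
        # Stubbed output; a real system would run a trained model here.
        return [Box("car", 10, 20, 50, 30, 0.96),
                Box("person", 70, 15, 20, 40, 0.55)]

    def refine_box(box: Box) -> Box:
        # Stage 2 (AI-assisted): tighten box edges; a human can nudge the result.
        return box

    def needs_human_review(box: Box, threshold: float = 0.8) -> bool:
        # Stage 3 (manual fallback): low-confidence proposals go to an annotator.
        return box.confidence < threshold

    def label_image(image) -> List[Box]:
        boxes = [refine_box(b) for b in detect_objects(image)]
        for box in boxes:
            if needs_human_review(box):
                # In practice this would enqueue the box for a human annotator.
                print(f"queueing '{box.label}' (conf={box.confidence:.2f}) for review")
        return boxes

    print(label_image(image=None))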

Second, we try to improve our AI components over the course of a data building project using a human-in-the-loop approach. The key is that we feed portions of the training data we make back into our AI so that we can fine-tune it on the fly over the duration of a project. For example, we may start a project with a baseline AI ("version 1"), and for every 20% of a particular training dataset we complete, we iterate and keep fine-tuning, so that by the end of the project we have an AI "version 5" that is specifically trained for that project. As our AI components improve, the human contribution gets smaller and smaller over time, and in the end we need very minimal human intervention. Ultimately, we want it to work the way humans learn: we watch others do something a few times and quickly learn to do it ourselves. As our technology improves, our AI will be able to learn from only a few examples of human demonstration.
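
A minimal sketch of that loop, using made-up placeholder functions (annotate_with_model, fine_tune) rather than our real system:

    # Illustrative human-in-the-loop schedule: label the dataset in 20% chunks,
    # have humans verify the model's proposals, and fine-tune after each chunk.
    def annotate_with_model(model_version, chunk):
        # Model proposes labels; humans correct/verify them (stubbed out here).
        return [(item, f"label-from-v{model_version}") for item in chunk]

    def fine_tune(model_version, verified_labels):
        # Fine-tuning step; in reality this would update the model's weights.
        return model_version + 1

    def run_project(dataset, num_chunks=5):
        model_version = 1                      # baseline AI, "version 1"
        chunk_size = len(dataset) // num_chunks
        all_labels = []
        for i in range(num_chunks):
            chunk = dataset[i * chunk_size:(i + 1) * chunk_size]
            labeled = annotate_with_model(model_version, chunk)
            all_labels.extend(labeled)
            if i < num_chunks - 1:             # last chunk needs no further tuning
                model_version = fine_tune(model_version, labeled)
        return model_version, all_labels

    final_version, labels = run_project(list(range(100)))
    print(f"finished with AI version {final_version} and {len(labels)} labels")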

We found that using AI in these two ways makes the process not only faster but also more accurate. One reason existing manual labeling services have accuracy problems is that humans have to do too much: they spend hours on end doing the same clicking over and over. By making it almost painless for humans to figure out what to do, we let them get through more data while staying less exhausted and less cognitively loaded.

Our current customers, including LG Electronics, come from industries ranging from autonomous vehicles and consumer electronics to physical security and manufacturing. The large majority of tech companies are short on AI experts and have to develop machine-learning based features with very few of them. As a result, these companies do not have the resources to build their own automated data building pipelines and often rely on outsourced manual labor. We can deliver training data much faster, and at higher quality, than vendors that rely extensively on manual labor.

We are extremely grateful to have the chance to introduce ourselves to the HN community and hear your feedback. And we're happy to answer any of your questions. Thank you!

[0] http://proceedings.mlr.press/v70/kim17a.html

patentatt 9 days ago

If your AI can accurately label my training data, why wouldn’t I just use your AI for my application?

  • hyunsoo90 9 days ago

    Great question! Our AI is trained specifically for data annotation, and that leads to a few differences. For example, our data annotation pipeline is not "end-to-end" and would not be suitable for real-time deployment -- it works in multiple stages, and some stages are done by human workers (e.g. editing or verifying the output of the AI), while others are automated.

  • jonghyuk0605 9 days ago

    Hi, I'm Jonghyuk, one of the co-founders. One point I would like to add is that a 90% accurate AI model may not be very useful for an application, but with the right data pipeline and a well-designed system, we can extract quite a bit of a boost out of it for data annotation.

    • jeromebaek 9 days ago

      Numerically, how much is "quite a bit of a boost"?

      • hyunsoo90 9 days ago

        It depends on how accurately our AI performs on a particular task, but as a back-of-the-envelope calculation: if we had a 90% accurate AI, human annotators would only have to work on the remaining 10%, giving us a 10x boost. Obviously there is some overhead not accounted for in that calculation, but with our current technology we can get up to a 10x boost, depending on the type of data.
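
        For completeness, here's that arithmetic as a tiny snippet (purely illustrative; the 2% overhead figure below is an assumption, not a measurement):

            # If the AI handles a fraction `a` of the work, humans only touch
            # the remaining (1 - a), plus some review/QC overhead.
            def speedup(a, overhead=0.0):
                return 1.0 / ((1.0 - a) + overhead)

            print(speedup(0.9))         # ~10x with no overhead
            print(speedup(0.9, 0.02))   # ~8.3x once a small overhead is added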

        • treis 9 days ago

          How do you know which are the 90% it got right and which are the 10% it got wrong?

          • hyunsoo90 9 days ago

            We have both AI-assisted and manual inspections in the pipeline. A good analogy would be an assembly line where humans and machines collaborate not only to build things but also to do quality control (i.e. a vision inspection system plus manual inspection).

            • thoughtstheseus 9 days ago

              Do other training data providers use ML/AI to do initial screens?

              • hyunsoo90 9 days ago

                As far as I know, some do but most don't.

urs2102 9 days ago

Congratulations on your launch! I think this space is going to be super interesting, and letting companies build products like this by simply adding an API to their pipeline is awesome. I had a question: how would you describe the differences between you and Scale API [0]?

[0]: https://scale.ai

  • hyunsoo90 9 days ago

    Thank you! Scale is doing great, especially in the autonomous vehicles vertical. We are different from Scale and other data annotation companies in a few ways: one, we really focus on building AI tools that can automate data annotation all the way from label proposals to quality control, and two, all annotations are done in-house without relying on any crowdsourced labor. We think it's difficult to guarantee quality with crowdsourced workers, and we see this a lot with Amazon Mechanical Turk. We can do this because our AI tools give a huge speed boost, so our in-house workers can go through the data a lot faster. We also collect and provide raw data (limited to images at the moment) for companies that do not have their own.

  • alphagrep12345 9 days ago

    Interesting. I didn't know about Scale. How would someone figure out that companies need these tools and decide to provide them as APIs? I was under the impression that all the self-driving startups have their own people solving these challenges. Why would anyone use Scale?

    • pwaivers 9 days ago

      It's really a matter of productivity. Creating training datasets is a labor-intensive process, and it may not be worth the cost to do it in-house if the work can be outsourced instead.

pwaivers 9 days ago

This is great. Thank you for posting.

What humans do you have tagging the images, after the AI portion? Are they the 13 employees you have now?

Do you think that you will need to focus on a couple of verticals, so that your AI will have more of an impact?

  • hyunsoo90 9 days ago

    Thank you for the interest! We have a team of ~50 in-house annotators who use our AI tools, and we are looking to expand to Southeast Asia, namely Vietnam or the Philippines, to set up an annotator workforce there in the near future.

    As you mentioned, we see that some data labeling companies focus on a few verticals like autonomous vehicles. These verticals are very data-hungry and we do have clients in them, but I also see a huge opportunity in the less "AI-savvy" verticals such as consumer electronics, physical security, and factory automation. They not only have a huge need for AI, they also lack the AI talent to build it themselves. So one natural possibility is that we extend our service and actually deliver the AI built on top of the training data we make. To do that, we will need to automate that piece as well using AutoML or meta-learning, which the co-founders already have experience with (AI building AI!). It's also possible that we stay focused on just the training data piece for a few of these verticals.

mukeshyadavnitt 9 days ago

Speaking from my previous experience running a software testing company, I am skeptical about scaling an in-house professional team. Is an in-house workforce cost-effective?

  • hyunsoo90 9 days ago

    Thanks for pointing this out. Although crowdsourced labor might be the most cost-effective option, we don't think it can guarantee the level of quality we get from an in-house team. We believe our AI will be able to automate a lot of the pieces and make our in-house team cost-effective. I'm sure you have a lot of experience scaling a workforce; please let me know if you have any advice for us!

LeicaLatte 9 days ago

One of the more interesting startups to come out of W19 in my opinion.

How big is your human-in-the-loop piece? Staff size?

  • hyunsoo90 9 days ago

    Thank you so much! The human-in-the-loop piece really starts to kick in for larger volumes of data, where we can iterate the fine-tuning process several times. As an example, we were able to achieve a 30% speed boost for a client after a few cycles of the loop. We are a team of 13, including 9 engineers.

howon92 8 days ago

Cool! What are the differences between SuperbAI and ScaleAPI?