Ask HN: Any data scraping project ideas you can share?

32 points by dchuk 6 years ago

I've done a good amount of scraping over the years, but haven't done much recently. I'm getting an itch to do some side projects in this area as well, so I'm interested to hear if anyone has a need for data that they can't currently get, or can't get in a clean, structured way.

One example I've thought of recently (because of my own 9-5 job needs) is to scrape all the heavy-duty truck companies' sites and expose make/model data (and images) via an API paired with a VIN decoder. Each OEM obfuscates its vehicle data in one way or another (JS widgets, only in PDFs, etc.) and as far as I can tell, there aren't any API-based data sources for heavy-duty/commercial vehicles.

Any other ideas?
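
For the VIN decoder half of that idea, the structural part might be sketched in Python roughly as follows. This assumes 17-character VINs and the North American check-digit rule (position 9). The function names `split_vin`, `check_digit` and `is_valid_vin` are made up for illustration, and mapping the sections to actual make/model/spec data would still depend on the scraped OEM data.

```python
import string

# Letter-to-number transliteration used by the check-digit rule.
# I, O and Q never appear in a VIN, so they are deliberately absent.
_VALUES = {d: int(d) for d in string.digits}
_VALUES.update(zip("ABCDEFGH", range(1, 9)))
_VALUES.update(zip("JKLMN", range(1, 6)))
_VALUES.update({"P": 7, "R": 9})
_VALUES.update(zip("STUVWXYZ", range(2, 10)))

# Positional weights; position 9 (the check digit itself) has weight 0.
_WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)


def split_vin(vin):
    """Split a 17-character VIN into its three standard sections."""
    vin = vin.strip().upper()
    if len(vin) != 17:
        raise ValueError("expected a 17-character VIN")
    return {
        "wmi": vin[0:3],   # World Manufacturer Identifier (positions 1-3)
        "vds": vin[3:9],   # Vehicle Descriptor Section (positions 4-9)
        "vis": vin[9:17],  # Vehicle Identifier Section (positions 10-17)
    }


def check_digit(vin):
    """Compute the expected position-9 check digit ('0'-'9' or 'X')."""
    vin = vin.strip().upper()
    if len(vin) != 17:
        raise ValueError("expected a 17-character VIN")
    total = sum(_VALUES[ch] * weight for ch, weight in zip(vin, _WEIGHTS))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)


def is_valid_vin(vin):
    """Return True if the VIN is well formed and its check digit matches."""
    vin = vin.strip().upper()
    if len(vin) != 17 or any(ch not in _VALUES for ch in vin):
        return False
    return vin[8] == check_digit(vin)
```

For example, `is_valid_vin("1M8GDM9AXKP042788")` should pass, while changing the last character should fail the check. Not every market enforces the check digit, so treat a failure as a warning rather than proof of a bad VIN.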

staticautomatic 6 years ago

Aggregate data on weather and soil composition to identify areas of land around the world most similar to famous wine producing regions.

richardknop 6 years ago

Well, the obvious scraping business ideas would be hotel rooms and airplane tickets. You might think this area is already saturated, but I think there’s still room for disruption, and you can capture niches that giants like Expedia / Skyscanner / Agoda / Booking don’t handle well. Or you could do B2B with these companies.

Also, what about scraping restaurant menus and offering a food search engine?

  • BjoernKW 6 years ago

    There's probably a reason those sites don't handle niches very well: There's no scalable business model for those.

    Travel / accommodation is a highly competitive as well as intentionally convoluted market. The participants don't want their customers to easily find the best offer. Sites like Expedia or Booking aren't complex and difficult to use because those companies don't know about UX. It's precisely the opposite. Problem is, the goals of their UX most of the time don't align with the users'.

  • dchuk 6 years ago

    Can you share any examples/ideas for niches that those sites don't already handle well?

    • anywherenotes 6 years ago

      I just booked a vacation on Expedia. What I really wanted was to find a list of rooms by price (airfare included) which can comfortably sleep 4 people - so I needed 4 actual beds. I tried looking at bigger rooms, and it looked like booking 2 cheap rooms was less expensive, but I'm still not sure if that's true. So basically you could see if you can accommodate large parties. Also, when I booked the rooms, I think it made me select one type, but in theory, I might have wanted one oceanfront room and one without the view.

dchuk 6 years ago

Here's another idea I just thought of randomly:

Scrape and monitor a company's competitors' job listings for them. Some of that data might be difficult to get given the nature of job sites and Craigslist and such, but it could be interesting to accumulate all of that (including from the competitor's own site) so you can get an idea of when they are hiring.

Maybe.

  • gyvastis 6 years ago

    Sorry if I've missed the main idea behind this, but why would that be relevant?

    • BjoernKW 6 years ago

      Looking at companies' hiring data is a great way to monitor competition.

      If they try to hire for more positions than in the past, they're probably growing; conversely, they might be stagnating if it's the other way round.

      If they hire people with specific skills, it might also tell you what they're up to right now, like going public or working on a new, supposedly secret project. Take Apple, a notoriously secretive company, for instance: previous new projects like the iPhone, the Apple Watch and most notably a self-driving car were first revealed by their own job postings.

      • peternicky 6 years ago

        This is in my opinion the same as buying at the height of a bubble; by the time you get this data on your competitor, you will be way behind. Why not spend resources on improving the offering?

        • BjoernKW 6 years ago

          True. Still, keeping tabs on the competition is big business.

    • ambivalents 6 years ago

      If you're unhappy with your current workplace and want to do similar work in the same industry, competitors are a great place to start.

remyp 6 years ago

Am I the only one that worries about licensing and legal issues when it comes to web scraping? I'd be terrified to build a product around it since one lawsuit would threaten the core business.

  • zapperdapper 6 years ago

    Me too. Just look at the legal battle between LinkedIn and hiQ.

    I have scraped, but not on the level where it would draw attention. Use the APIs if you can - TripAdvisor does offer an API, as does Bing (for search results).

    Unless you are scraping at huge scale, I actually think the bigger problem is the lack of semantics - companies like Diffbot[1] are using AI/ML to try to solve this issue.

    [1] - https://www.diffbot.com

  • gyvastis 6 years ago

    Everything that you can open in the browser you can scrape without any technical problem. Keep in mind, though, that the number of requests you send to those parties shouldn't vary greatly from what a regular user would send. A user doesn't open 1000 pages in 60 seconds.
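
A rough Python sketch of that kind of pacing, enforcing a minimum gap between requests, could look like this. `Throttle` and `fetch_all` are hypothetical names, the 2-second interval in the usage note is an arbitrary assumption, and the actual download function is passed in rather than tied to any particular HTTP library.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the pacing can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(urls, throttle, fetch):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any callable that takes a URL and returns the page content.
    """
    pages = {}
    for url in urls:
        throttle.wait()
        pages[url] = fetch(url)
    return pages
```

Usage would be something like `fetch_all(urls, Throttle(2.0), my_fetch)`, where `my_fetch` wraps whatever HTTP client you prefer. This only addresses request rate; it says nothing about a site's terms of service or robots.txt.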

    • anywherenotes 6 years ago

      Don't most sites claim they own the data? Like, could you legally scrape Reddit and make your own site?

zapperdapper 6 years ago

Some of the scraping projects I've done in the past have been where the article content I wanted was on a large site with great content but the site was awful to read - due to horrible combinations of pop-ups, colour schemes, adverts etc etc. I would spider and download content, process, and build my own database/simple CMS to make reading the content offline a much better experience.

If you're just looking for a personal project, something like that might work for you...

joshribakoff 6 years ago

It's a huge undertaking. You'd be competing with ACES/PIES; those guys charge about $10k a year, last I checked. So there's definitely room to undercut them, if you can somehow get all that data.

  • dchuk 6 years ago

    Thanks for the reply. I was thinking of just using the scraped data to create an API that can take a VIN and give you the specs of that heavy duty vehicle (engine, weight, body type options, etc). I couldn't even imagine how crazy it would be to try and collect all of the parts data for every vehicle.

howscrewedami 6 years ago

Scrape product information from ebay and other auction sites. Have a machine learning model that compares auction price vs. real price (or usual auction price). If the auction price is good... buy the products and flip them. In other words, you're basically building a system to help you find the best possible products to flip.
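
As a starting point before any machine learning, the comparison step could be approximated with a plain median baseline. The Python sketch below swaps the ML model for that simpler technique. `find_bargains`, the 30% discount threshold and the minimum of 3 past sales are all assumptions for illustration, and the price history is expected to come from whatever you scraped.

```python
from statistics import median


def find_bargains(listings, sold_prices, discount=0.3, min_samples=3):
    """Flag listings priced well below the median of past sale prices.

    listings:    iterable of (item_id, current_price) pairs
    sold_prices: dict mapping item_id -> list of past sale prices
    discount:    required fraction below the median (0.3 means 30% cheaper)
    min_samples: skip items with too little history to trust the median
    """
    bargains = []
    for item_id, price in listings:
        history = sold_prices.get(item_id, [])
        if len(history) < min_samples:
            continue  # not enough data to estimate a usual price
        usual = median(history)
        if price <= usual * (1 - discount):
            bargains.append((item_id, price, usual))
    return bargains
```

This ignores fees, shipping and item condition, all of which would eat into a flip margin, so a real system would need to fold those in before buying anything.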

dhruvkar 6 years ago

Searching for flights using rewards miles.

I think United & American used to have APIs that were shut down, so you'd need to scrape account data and flights. It would work best as a desktop app. Other airlines have APIs, but not sure how deep they are.

Huge pain point, especially when trying to combine different rewards programs.

  • ezekg 6 years ago

    I had something similar a while back but it was eventually shut down by the airline’s legal department. If it’s not provided via a public API, I doubt scraping will turn out any different than my project.

    • dhruvkar 6 years ago

      If it's a desktop app and the crawling is not centralized, I doubt they'd be able to do much about it.

      Lot of crawl-heavy SEO tools work this way.

      • ezekg 6 years ago

        Idk, I had a free open source command line project that scraped flight data and that got shut down. It might have been the particular airline, though, because they specifically disallow scraping by all third parties, including e.g. Google Flights, SkyScanner, etc.

        • dhruvkar 6 years ago

          Interesting. As a consumer, I'd love to see something in that space. I hate paying for most things, but a desktop app that allows me to search for rewards miles would be something I would pay a yearly fee for.

speps 6 years ago

Scrape websites like TripAdvisor, Amazon, etc. for the ratings and compute an actual rating not based on averages. I've seen a few articles on how ratings are shown on those websites recently and they never seem to actually reflect the truth.
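
One common alternative to a raw mean is a Bayesian (damped) average, which pulls items with few ratings toward a prior so that two 5-star reviews don't outrank hundreds of 4.5-star ones. The Python sketch below is only one possible take on this idea. `bayesian_average` and `rank_items` are hypothetical names, and the prior mean and prior weight are tuning assumptions, not values taken from any of those sites.

```python
def bayesian_average(ratings, prior_mean, prior_weight):
    """Damped mean: behaves as if `prior_weight` extra ratings of
    `prior_mean` had been cast alongside the real ones."""
    if prior_weight <= 0:
        raise ValueError("prior_weight must be positive")
    total = prior_weight * prior_mean + sum(ratings)
    return total / (prior_weight + len(ratings))


def rank_items(items, prior_mean, prior_weight):
    """Return item names sorted from best to worst damped rating.

    items: dict mapping item name -> list of numeric ratings
    """
    return sorted(
        items,
        key=lambda name: bayesian_average(items[name], prior_mean, prior_weight),
        reverse=True,
    )
```

With a prior mean of 3.5 and a prior weight of 10, an item with two 5-star ratings scores 3.75, so it ranks below an item with 200 ratings averaging 4.5. A sensible prior mean is often the average rating across the whole scraped dataset.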

gandutraveler 6 years ago

I have been trying to scrape travel destination and recommendation data from TripAdvisor and other travel sites. I read about Scrapy in another HN post yesterday and have been trying to get it running. Would appreciate any help on this.

sam.xenai [at] gmail [dot] com

  • gyvastis 6 years ago

    Did you check out GitHub for open-source solutions? I'm sure the biggest names in the market are already covered when it comes to scrapers.

skate22 6 years ago

Streaming sources for music from popular artists that is not available on Spotify (like mixtapes).

uptownfunk 6 years ago

Good quality, accurate, high resolution stock / options / futures data

rosha 6 years ago

I recently had a similar idea, which I ended up putting into practice: I built a small search engine, "UkookU", for used vehicles from scratch. It did not take me long to get the data and maintain it, because I did not need to build any scrapers or worry about site blocking/abuse. I used a third-party scraping API called ProxyCrawl (https://proxycrawl.com), which allowed me to make a proof of concept of the idea for free, and that shortened the time I needed to build the Hadoop engine, etc.

I have recently been thinking about building a recruiting talent pool service based on aggregated data from LinkedIn, Google, Bing, Yahoo, Facebook and many other sites, and I am pitching something around it, as I can get all the data with ProxyCrawl.

I have also recently been thinking about doing something with keyword ontologies and small markets around me, using Google/Yandex data, and offering it as a free, helpful tool in the form of a mobile app. How do you normally get your data for your projects?

JaneGrey2508 6 years ago

I do custom electronics, robotics, and embedded software development - I specialize in quickly turning ideas into prototypes. I've built custom automation equipment for chemistry labs, sensors that are in use in household/utility applications, control circuitry for construction equipment, 3D printing electronics, and data acquisition equipment. No project too small. Few projects too large. Deep discounts for open source hardware work. You can visit this https://goo.gl/29VL2t