gerdesj 6 years ago

Very cool. I'm giving it a quick go whilst doing yet another patchathon on customer systems.

First impressions are that it is very quick and gives some great advice. I'm not really a web dev but even I can see how this can make a good audit tool. Looks great as well.

I suggest caution against using the term "best practice" though. It's one of my pet hates - there is good practice and there is bad practice, but it's a brave person who claims to know best practice. I think my hatred of that term stems from seeing it plastered all over older MS docs and the usual crop of "me too" copy n paste blog postings that litter the web, not to mention various forum postings. We're all bloody experts who know best in this game 8)

  • seanwilson 6 years ago

    > First impressions are that it is very quick and gives some great advice. I'm not really a web dev but even I can see how this can make a good audit tool. Looks great as well.

    Awesome, thanks for trying it!

    > I suggest caution against using the term "best practice" though. It's one of my pet hates - there is good practice and there is bad practice, but it's a brave person who claims to know best practice.

    Yes, I'll admit I don't love the phrase myself. I don't like it when someone pushes what's really just their opinion as the one and only way to do things, but the phrase does get the point across when you've got a character limit.

    I've been really careful with the rules in Checkbot so far, and the guide (https://www.checkbot.io/guide/) has links to where web experts like Google, Mozilla, OWASP and W3C recommend each practice. For example, making sure every page has a title, HTTPS is enabled and compression is used is difficult to argue against. Let me know if you think any of the rules need changes, however.

  • keithnz 6 years ago

    I agree, it runs smoothly and is nicely presented. The term "best practice" does irk me. But weirdly enough, I think the use of that term is why I decided to give it a go.

    • acct1771 6 years ago

      Love-hate relationship with marketing is probably pretty common in this crowd!

ratata 6 years ago
  • wgjordan 6 years ago

    Similar feature-set at first glance, but not open-source, not free after beta ends, and created/maintained by an unknown solo developer. Nice site design, but I think the cards are pretty heavily stacked against this one.

    • seanwilson 6 years ago

      The big difference is Checkbot crawls whole websites as opposed to checking a single page at a time. The Checkbot interface is designed around helping you hunt down issues that impact groups of pages and pages you didn't think to check. For example, this lets you find duplicate title/description/content issues and root out pages with broken links and invalid HTML you don't look at often.

      • wgjordan 6 years ago

        That does sound like a key differentiating feature, thanks for clarifying. While I'd probably prefer to hook up an open-source web-crawler to lighthouse (e.g., something like github/lightcrawler [1]), I could see SEO/marketing experts in particular paying for a user-friendly all-in-one solution like this versus cobbling something together from open-source tools.

        [1] https://github.com/github/lightcrawler

        • seanwilson 6 years ago

          Yes, so I think the UI is a really important factor here in terms of productivity and ease of use. For example, after Checkbot has scanned your localhost/development site and identified a page has the same title as other pages, you can edit your site, hit the "recrawl" button for that page and confirm your fix worked in a few seconds. Users I've worked with so far have really appreciated this fast and simple workflow.

  • kaycebasques 6 years ago

    Which is also available in Chrome DevTools via the Audits panel.

    • PudgePacket 6 years ago

      I've never been able to use that. Every time I run the test it says the initial page load is 10s, but I'm running it against a static website hosted locally... how is it taking so long? Even if I visited my site from the other side of the world over 3G, it would be faster than Lighthouse accessing it locally.

queezey 6 years ago

Very cool.

I noticed that it's mangling some of my URLs, though.

`/!0ead1aEq` is getting turned into `/%210ead1aEq` (the exclamation point is getting percent-encoded), which leads to a bunch of spurious 404 errors.

https://tools.ietf.org/html/rfc3986#section-3.3

  • seanwilson 6 years ago

    Ah, thanks, that's a good bug report! I'll get this fixed.

  • seanwilson 6 years ago

    Hmm, so when crawling, URLs are normalised and the URL library I'm using is normalising the "!" to "%21". Could you send me a working URL to test on? My email is sw@seanw.org if you want to use that.
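
    For reference, RFC 3986 allows sub-delims like "!" to appear literally in path segments, so the fix is probably to only percent-encode characters outside that allowed set. A rough sketch of the idea (illustrative only, not Checkbot's actual code):

    ```typescript
    // Per RFC 3986 §3.3, pchar = unreserved / pct-encoded / sub-delims / ":" / "@",
    // so "!" may appear literally in a path segment and shouldn't become "%21".
    // (Sketch only: it ignores input that's already percent-encoded.)
    const PCHAR = /^[A-Za-z0-9\-._~!$&'()*+,;=:@]$/;

    function encodePathSegment(segment: string): string {
      // Only encode characters that genuinely need it; leave allowed ones alone.
      return Array.from(segment)
        .map((ch) => (PCHAR.test(ch) ? ch : encodeURIComponent(ch)))
        .join("");
    }

    console.log(encodePathSegment("!0ead1aEq"));                    // "!0ead1aEq"
    console.log(new URL("https://example.com/!0ead1aEq").pathname); // "/!0ead1aEq" - the WHATWG URL parser also keeps "!"
    ```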

seanwilson 6 years ago

Hi, I didn't get much feedback when I posted last time so I'm giving it another try. This is aimed at helping web developers follow SEO, performance and security best practices so I'd love to know what the community thinks. Can you think of any changes that would make Checkbot more helpful? Did you notice any bugs? Thanks!

  • idle_processor 6 years ago

    - A regex blacklist for URL structure would be very helpful.

    - Acceptable title length seems a bit short. 70 characters seems closer to what we're allowed today than the old 60 (per http://www.bigleap.com/blog/5-tips-take-advantage-googles-ne...).

    Might also be worth segmenting URLs with query parameters in them into a low priority batch to check later (or skip).

    (E.g., when spidering a WordPress site, the crawler wastes time on .../article/?replytocom=* URLs. URL filtering solves this, somewhat, but it might require multiple passes to identify all of the problematic query strings.)
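
    To illustrate the kind of regex blacklist I mean (the patterns here are just hypothetical examples):

    ```typescript
    // Hypothetical URL blacklist: skip comment-reply and tracking-parameter
    // noise before queueing a URL for crawling.
    const blacklist: RegExp[] = [
      /[?&]replytocom=/, // WordPress comment-reply links
      /[?&]utm_[a-z]+=/, // tracking parameters
    ];

    const shouldSkip = (url: string) => blacklist.some((re) => re.test(url));

    console.log(shouldSkip("https://example.com/article/?replytocom=42")); // true - skipped
    console.log(shouldSkip("https://example.com/article/"));               // false - crawled
    ```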

  • cdawg0 6 years ago

    Cool, I'm checking it out right now. Can you share what your tool provides that others don't?

    • seanwilson 6 years ago

      The major one is that instead of manually checking pages one at a time, Checkbot lets you easily test 1,000s of pages in a few minutes to root out issues you'd normally miss. As you're doing web crawls from your own machine, you can also crawl any site you want as often as you want including localhost/development, staging and production sites. This lets you identify issues early and confirm fixes during development before problems go live.

      • idle_processor 6 years ago

        > The major one is that instead of manually checking pages one at a time, Checkbot lets you easily test 1,000s of pages in a few minutes to root out issues you'd normally miss.

        ScreamingFrog does this too, but it's still nice that your work is out there to add to the suite.

      • cdawg0 6 years ago

        Interesting. So the intent is to tackle problems before deployment. Do you plan on any devtools integrations so it can be used as part of an automatic CI/CD process? Also, does Checkbot dig into all dependencies or skip them like some others?

        • seanwilson 6 years ago

          > Interesting. So the intent is to tackle problems before deployment. Do you plan on any devtools integrations so it can be used as part of an automatic CI/CD process?

          A lot of the time, website owners don't know there's a problem until their search results or Google Search Console updates. So I'm seeing it being used by developers to check localhost/development sites, then on staging for other problems, then on production when changes are made there. A command-line version to support CI/CD is something I'm really interested in as well.

          > Also, does Checkbot dig into all dependencies or skip them like some others?

          Can you expand on what you mean here?

      • ajeet_dhaliwal 6 years ago

        Who’s the customer you have in mind who will want to do this for thousands of pages?

        • seanwilson 6 years ago

          People involved in web development, SEO and marketing. It doesn't need to be in the thousands, but checking a large number of pages for problems has obvious productivity benefits over checking one page at a time. It also lets you identify issues that impact multiple pages, like duplicate titles, descriptions and content, that you'd normally miss.

mkorsak 6 years ago

Seems really cool, I like this. One issue I'm having so far though is that after crawling my site it found one 404, and the link is (my domain)/page-not-found-test. It also says there are 0 inlinks to it, so I have no idea where it's getting this page from. It doesn't exist, and I've never linked to anything like it?

mahesh_rm 6 years ago

I quickly tested it. It's staying in my Chrome. Good job.

  • seanwilson 6 years ago

    Thanks, let me know how you get on!

santoshmaharshi 6 years ago

Very nicely thought through and implemented. I agree with you on the best practice point. Already forwarded it to a few folks and it's going to stay in my browser.

  • seanwilson 6 years ago

    Thanks! Let me know if you can think of any improvements I can make.

Raphmedia 6 years ago

Works pretty well. Incidentally, I just locked myself out of my own server. This doubles as a security tester! :o)

  • seanwilson 6 years ago

    In the left sidebar at the start, you can modify the number of URLs crawled per second if that helps!

chasers 6 years ago

It's nice and fast! Are you actually using chrome to render each page or making requests some other way?

  • seanwilson 6 years ago

    Thanks! All requests are done from your own browser if that's what you mean.

    • chasers 6 years ago

      No, I mean rendering the whole page with JS and all, but from your FAQ it seems like you're not. Which is why it's fast, ha.

rambojazz 6 years ago

Is there a version that works on Firefox?

wpasc 6 years ago

Awesome

scrollaway 6 years ago

Really cool stuff. Here's some initial feedback:

"Avoid internal link redirects" -> All the errors I'm getting on my site are due to the login-wall on some of the pages because it's detecting /account/login/?next=... links as internal redirects.

"Use unique titles" / "Set page descriptions" / "Avoid URL parameters" / "Avoid thin content pages" / etc -> Same problem as above with login walls ("Sign in to ..."). I get why, but it's adding a ton of noise. I added /account/login to URLs to ignore but it didn't achieve anything, I'm guessing I must have misunderstood the syntax or it has to be an exact match or some such?

"Avoid inline JavaScript" -> I'm making quite a bit of use of the <script type="application/json" id="foo"/> pattern, which allows me to declare json objects in the body I can later parse in my scripts. This pattern doesn't have all the issues carried with inline js. Can you ignore script tags where the type= is unknown or application/json\?

HSTS preload: This is picking up individual pages on the checked site as errors, even though HSTS preload is really a domain-wide thing.

"Hide server version data" -> This is picking up "server: cloudflare" as an error. Means no site behind cloudflare will ever pass this which seems overkill.

"Use lowercase URLs" -> So, on my site you can access objects with IDs like youtube's (/id/vZEz7JoNnfgVo...). It's picking up all those as errors. Feels wrong?

UI: Not a fan of the "x inlinks / y outlinks / headers / recrawl / html / copy" links below the URLs on the results page. Low contrast, unclear what I'm clicking and where it's gonna take me. The "copy" button: What am I copying? (Clearly the URL as I tried it, but that'd be more useful as a clipboard button next to the URL for example)

Finally, I ran it on my company's blog and it ended up crawling a ton of the company's various exosites on different domains which wasn't super useful, especially since none of it showed up in the final results.

Hey, this is a really great tool. Fast, slick UI and very clear what it does. I'll keep an eye on it and would love to see what else it can do in the future.

Pricing: It's hard to see myself paying for this; not because it's not worth it (I think it is easily worth a dozen USD per site checked), but because it's so easy to look at what it does and think "Yeah, but, I can probably do all that myself, and if I don't, it's not so important that I need to pay for a tool to tell me what to fix". I think this is the curse of developing products targeted at developers: devs will tend to think "I can do this myself // I don't need this". In fact, if you hook me up with a free account, I'll use it a bunch ;)

Shoot me an email (see profile) if you want to talk through some more feedback (especially UX feedback). You just provided me with a pretty cool service for free so I feel I have to give back :)

  • seanwilson 6 years ago

    Awesome, thanks for the detailed feedback! As you can probably imagine, tweaking the rules to work with every imaginable website configuration is an ongoing process so this is super helpful.

    > All the errors I'm getting on my site are due to the login-wall on some of the pages because it's detecting /account/login/?next=... links as internal redirects.

    > "Use unique titles" / "Set page descriptions" / "Avoid URL parameters" / "Avoid thin content pages" / etc

    Allowing Checkbot to log in could help, but I'll look into how to improve this.

    > "Avoid inline JavaScript" -> I'm making quite a bit of use of the <script type="application/json" id="foo"/>

    Ah, thanks, this is an easy fix. I'm planning to add structured data checks in the future as well because checking you've configured these correctly on all your pages is cumbersome.

    > "Hide server version data" -> This is picking up "server: cloudflare" as an error. Means no site behind cloudflare will ever pass this which seems overkill.

    Yes, for what it's worth this is defined as "low priority" internally and the rule description is written to emphasise this. I could change it to only fire when there are version numbers in the headers, perhaps. I agree knowing you're using Cloudflare isn't a big deal, but some servers will advertise very specific OS and PHP versions, for example.
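
    Roughly, the check would become something like this (a sketch of the idea only, not the final rule):

    ```typescript
    // Only flag Server/X-Powered-By style values that leak a version number.
    const hasVersionNumber = (headerValue: string) => /\d+\.\d+/.test(headerValue);

    console.log(hasVersionNumber("cloudflare"));             // false - no version leaked
    console.log(hasVersionNumber("Apache/2.4.41 (Ubuntu)")); // true  - flag this
    console.log(hasVersionNumber("PHP/7.2.24"));             // true  - flag this
    ```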

    > "Use lowercase URLs" -> So, on my site you can access objects with IDs like youtube's (/id/vZEz7JoNnfgVo...). It's picking up all those as errors. Feels wrong?

    Yes, I'll need to think about how to avoid that case. It's a good general rule when you're writing human-readable URLs, however, so I wouldn't want to disable it completely.

    > UI: Not a fan of the "x inlinks / y outlinks / headers / recrawl / html / copy" links below the URLs on the results page. Low contrast, unclear what I'm clicking and where it's gonna take me. The "copy" button: What am I copying? (Clearly the URL as I tried it, but that'd be more useful as a clipboard button next to the URL for example)

    Hmm, any more suggestions on what to change here? I made these links prominent because they were common user actions and added tooltips to them to help describe what they do. I agree it's not completely obvious what they do at first but there's only so much space available. I experimented with only showing these when you hover over a table cell. I do want to add more shortcuts in the future such as a quick way to look up a URL on Google or on archive.org so I'll likely have a "more" button for extra options later.

    > Finally, I ran it on my company's blog and it ended up crawling a ton of the company's various exosites on different domains which wasn't super useful, especially since none of it showed up in the final results.

    Can you give more details here? Checkbot will probe <a href="..."> links to check they're working for example but shouldn't spider sites that are considered external. I originally had it crawling subdomains of the start URL but changed that default because it wasn't what most people wanted.

    > Shoot me an email (see profile) if you want to talk through some more feedback (especially UX feedback).

    Great, let's keep in contact (see my profile as well)! Hopefully it's obvious UX is important to me too. It's been challenging to find a balance in showing the right amount of information on the screen while battling with the horizontal space constraints you get with long URLs. The "Avoid temporary redirects" report is a good example of this, e.g. for each row you want to know the redirect status code, the start URL, the redirect destination and the redirect path.

    • scrollaway 6 years ago

      > Allowing Checkbot to log in could help, but I'll look into how to improve this.

      Wouldn't help in my scenario FWIW; my site is OAuth-only login.

      > Hmm, any more suggestions on what to change here?

      I'd move the copy link into a clipboard button next to the URL (like GitHub's clipboard button next to URLs), and make the remainder of the links prominent buttons. I would also avoid taking users to a separate page when clicking any of them; rather, open a "sub view" below the URL (e.g. a nested list).

      If you have less-often-used actions, you could also add a "..." menu on the right side, or next to the buttons.

      > Can you give more details here

      Try it against articles.hsreplay.net and look at all the URLs it ends up checking; you'll see what I mean. It didn't end up spidering all of hsreplay.net, but it did go through a ton of it.

      • seanwilson 6 years ago

        > Wouldn't help in my scenario FWIW; my site is OAuth-only login.

        Would being able to set cookies or sending custom headers help?

        > Try it against articles.hsreplay.net and look at all the URLs it ends up checking; you'll see what I mean. It didn't end up spidering all of hsreplay.net, but it did go through a ton of it.

        Hmm, so if you check "Explore" -> "External URLs" there's a ton of external links being checked for this URL. I'm not sure what you could do here except excluding hsreplay.net URLs from being checked.

        Thanks for the other tips and examples, I'm actively working on this.

        • scrollaway 6 years ago

          > Would being able to set cookies or sending custom headers help?

          I don't think so. And if it's for SEO purposes, I don't care; these pages won't get crawled by Google anyway. I'm OK with ignoring the URLs, but it'd be nice if, for example, you detected there's a bunch of redirects to a pattern that contains "~login~" and asked the user if they want to add the login URL to the blocklist. I didn't have much success adding it myself.

          • seanwilson 6 years ago

            > these pages won't get crawled by Google anyway.

            Hmm, how do you indicate this to Google? I'm thinking about how you could tell Checkbot to ignore pages like this.

            > I didn't have much success adding it myself.

            The "URL patterns to ignore" setting is just a JavaScript regex string if that helps. It needs some help text at a minimum.

            A common scenario I see as well is you start a crawl, see the URLs flying by and think "oops, don't want to crawl those URLs". A cancel button would help but hopefully more can be done.

            • scrollaway 6 years ago

              > Hmm, how do you indicate this to Google? I'm thinking about how you could tell Checkbot to ignore pages like this.

              I don't. I'm not sure how Google picks up on the fact that they're login walls. Maybe it's heuristics? Someone better at SEO than me could explain.

              Re URLs to ignore: OK, I see, that wasn't clear though. My two-fold suggestion is 1. add a way to specify plain loose matching (e.g. just /account/login/, skipping the URL parameters, etc.) and 2. let me add that pattern after the URLs have already been crawled (and make sure I can see which URLs are affected).

              Keep in mind that this is pretty raw feedback and you know your product better than me, but I definitely don't think the URLs-to-ignore setting is usable right now.

              I'm off to bed, I hope all that helped. :)

              • seanwilson 6 years ago

                > I don't. I'm not sure how Google picks up on the fact that they're login walls. Maybe it's heuristics? Someone better at SEO than me could explain.

                Google is likely following the redirect to the login page and seeing that the login page text isn't relevant to your search results, at a guess. It's not a big deal if Google hides those pages, but in Checkbot perhaps those are pages you want to examine, so I'll need to think about what to do here.

                > Keep in mind that this is pretty raw feedback and you know your product better than me

                This kind of feedback is amazing so keep it coming! Knowing the first thing you thought before taking the time to fully investigate a feature is super useful because most users would be gone already if they were confused.

                > but I definitely don't think the URLs-to-ignore setting is usable right now.

                Yes, fully agree with that. I think if you're dealing with regexes, you want a way to test them as it's too easy to make a mistake.

ramon 6 years ago

Hi, I have downloaded it already but haven't tested it yet. I will test it as soon as I can.