bbarnett 16 days ago

I have no such examples for you, as I have no idea of your true intent, but I do have raw info.

You are creating a weird one-or-other scenario here.

For example, many people lease baremetal hardware, and use that. And that isn't "cloud". Nor is it on prem. Others buy hardware but rack it at a colo... buying upstream egress and so forth.

Then there are things such as CDNs, so even if you host the primary site on prem, you might use a CDN for larger images or videos, which predates any use of the word "cloud".

My point is, it isn't as binary as one or the other.

  • mateo-maza 16 days ago

    The idea for this comparison came when I read in an article that 40% of internet downtime is due to hardware issues (not sure if that's true). The experiment aims to compare uptime between a website using a reliable cloud provider and one managed in-house for compliance or other reasons. I understand your point about the variety of configurations that exist beyond simple cloud vs. on-premises. However, my focus is specifically on comparing traditional on-premises data center management against established cloud services. Here's the project's repo if that helps you believe what I'm saying: https://github.com/mateomaza/cloudXground. I don't even have the knowledge to do something wrong with the info, tbf, I mean just look at my GitHub.
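A minimal sketch of the kind of uptime probe such a comparison needs (Python stdlib only; the URLs are hypothetical placeholders, not taken from the repo):

```python
import time
import urllib.request
import urllib.error

# Hypothetical endpoints standing in for the two deployments.
TARGETS = {
    "cloud": "https://cloud-site.example.com/health",
    "on_prem": "https://onprem-site.example.com/health",
}

def probe(url, timeout=5):
    """Return (ok, elapsed_seconds) for a single HTTP check."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        ok = False
    return ok, time.monotonic() - start

def run_once(targets):
    """Probe every target once; returns {name: (ok, elapsed)}."""
    return {name: probe(url) for name, url in targets.items()}
```

Run `run_once(TARGETS)` from cron every minute or so and log the results; downtime percentage is then just the fraction of failed probes per target.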

    • LinuxBender 16 days ago

      One more factor to consider in your project is whether the cloud or non-cloud providers are doing live kernel patching. Most cloud providers are incentivized to do live kernel patching to avoid interrupting their customers, and thus avoid rebooting the hardware. Full reboots at the hardware layer are where most borderline hardware issues are detected, during the power-on self-test. Some on-prem companies are doing live patching now to avoid having to swap hardware as often.
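One rough way to spot that pattern on a single Linux box (a sketch under the assumption of standard Linux paths, not a monitoring tool): if the newest kernel image installed in /boot is newer than the kernel actually running, and uptime is long, the host is deferring reboots and may be live patching instead.

```python
import os
import re
from pathlib import Path

def running_kernel():
    """Kernel release the machine actually booted with, e.g. '6.1.0-18-amd64'."""
    return os.uname().release

def uptime_days():
    """Days since last boot, read from /proc/uptime (Linux only)."""
    with open("/proc/uptime") as f:
        seconds = float(f.read().split()[0])
    return seconds / 86400

def installed_kernels(boot=Path("/boot")):
    """Kernel versions that have images in /boot, parsed from names
    like 'vmlinuz-6.1.0-18-amd64'. Sorting here is lexical, which is
    good enough for eyeballing but not true version ordering."""
    return sorted(
        m.group(1)
        for p in boot.glob("vmlinuz-*")
        if (m := re.match(r"vmlinuz-(.+)", p.name))
    )
```

If `installed_kernels()[-1] != running_kernel()` while `uptime_days()` is large, the box has updates staged but has not gone through a power-on self-test in a long time.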

      • mateo-maza 15 days ago

        Wow, this is extremely good info, I'm going to investigate this for the comparison. Thanks for the tip, your comment was really helpful!

    • solardev 13 days ago

      FWIW, it seems to me that a test like this would more likely compare network architecture & topology (and plain dumb luck) rather than hardware failure rates. Since you're not controlling for other variables, and your sample size is likely to be tiny, it's probably just looking at the results of failover handling (or lack thereof) and regional network issues.

      A cloud, after all, is just a set of someone else's data centers, abstracted and managed for large-scale resale. In the cloud you can have load balancers and reverse proxies and CDNs and database clones and robust DNS setups all handling failovers across multiple data centers. You can do the same thing across several smaller data centers that you rent or colo in. You can also do the same thing across several bare-metal Raspberry Pis you install at friends' houses around the world. But none of those will really be a good test of hardware failure rates, because they're subject to so many other things that can go wrong: buggy code, a mistake in config, different kernel versions or distro patches, firewall settings, small ISP outages, cut wires... you won't really be able to tell what is hardware-related vs software vs environmental vs user error.
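A probe that at least separates the DNS and TCP layers can rule some of that out. A minimal sketch (Python stdlib; a real monitor would also split out TLS and HTTP as separate layers):

```python
import socket

def classify_failure(host, port=443, timeout=5):
    """Roughly attribute an outage to a layer: returns 'dns', 'tcp', or 'up'.

    'dns' means the name didn't resolve, 'tcp' means it resolved but
    nothing accepted the connection, 'up' means a TCP handshake succeeded.
    """
    try:
        addr = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4]
    except socket.gaierror:
        return "dns"
    try:
        with socket.create_connection(addr[:2], timeout=timeout):
            return "up"
    except OSError:
        return "tcp"
```

Logging this alongside each failed HTTP probe at least distinguishes "the registrar/DNS broke" from "the host is unreachable", even if it still can't prove a hardware cause.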

      It's not even really a good test of overall reliability because it's not an apples-to-apples comparison between host setups. It's entirely possible for a few homebrewed PCs set up by an experienced admin to be faster and more reliable than an unpatched EC2-micro setup in a single data center/availability zone. It's also possible for a free cloud host to be faster and more reliable than some enterprise's botched $50k rollout that they didn't architect well.

      High uptime isn't (just) the result of reliable hardware, but good failover and route management. Hardware will fail no matter what, but good hosts -- whether it's a 100-person team at a big cloud or a single wizened beardadmin -- anticipate that, plan for it, code for it, replicate your data, and regularly test the whole stack (and charge accordingly, or at least use economies of scale to benefit smaller customers with the lessons learned from bigger ones).

      In a simple network test like yours, the geographic proximity (in network hops) between the source and destination is likely to have a far bigger impact (by reducing network variability, caching and firewall/anti-bot issues) than any actual hardware failures on either side. It's pretty rare for even cheap off-the-shelf hardware to fail out of the blue, and you're unlikely to catch it happening in real time unless the whole website goes down and the sysadmin just doesn't care enough to fix it, or it regularly breaks by design (like Low-Tech Magazine's solar server). On the other hand, a well-configured, redundant host probably has hardware failures all day long, every day, but you won't even notice because your traffic just seamlessly routes around it.

      What you're doing is like visiting a corner store and a big grocery store every day, looking for the same can of soda. Most days both will have it in stock, but some days the big grocery store might be out. But it doesn't really tell you much about either store's behind-the-scenes logistics. Maybe the big grocer has a vertically integrated international freight network that brings in 10,000 cans every week, but they're often out because of their 50,000 customers. Meanwhile the corner store guy just buys a few cases from Costco once a year, but you happen to be their only customer out of 50 or so who likes that soda, so they never have to restock. It doesn't really tell you anything except that the testing methodology isn't sufficient for measuring that particular failure mode.

bschmidt1 15 days ago

https://www.exactchange.network is on my Raspberry Pi in my San Francisco apartment.

The front-ends of the sites therein (e.g. https://www.playshadowvane.com/) are all on Vercel, with the backend/APIs running on exactchange on the on-prem Pi.

  • mateo-maza 15 days ago

    Awesome! Thanks, really appreciate it :)

    • bschmidt1 15 days ago

      Sure, curious to see your findings!

      • mateo-maza 15 days ago

        :D I'll make a post and hit you up when I have something. As someone advised me here, it would be important for the comparison to consider whether the on-premises example does live kernel patching. Could you share that info with me, please?

        • bschmidt1 15 days ago

          Who said that? I would ignore them. Live patching just means that your OS updates while it's still running. To answer the question: Raspbian doesn't do that out of the box; you have to restart after an update (I haven't updated or even restarted the Pi in several months).

          Ubuntu has the option if you use Ubuntu Pro, and a lot of other distros support it too, but I update my OS so rarely that it's a non-issue. I wouldn't even consider it when comparing on-prem vs cloud.

          My hunch (as someone who uses both daily) is that on-prem is 2-10x faster than affordable cloud options within a region, but in some cases, like international traffic or high-end multi-core computing, cloud should pull away from on-prem. Curious to see what you find :D
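If anyone wants to test that hunch, a minimal sketch (Python stdlib; the example URLs in the comments are hypothetical, and the warm-up call avoids counting one-time DNS/TLS setup):

```python
import statistics
import time
import urllib.request

def timed(fn):
    """Run fn once and return elapsed wall-clock seconds."""
    start = time.monotonic()
    fn()
    return time.monotonic() - start

def median_latency(fetch, n=20):
    """Median of n timed calls, after one discarded warm-up call."""
    timed(fetch)  # warm-up: DNS cache, TCP/TLS session setup, etc.
    return statistics.median(timed(fetch) for _ in range(n))

# Hypothetical usage against the two deployments:
# cloud = lambda: urllib.request.urlopen("https://cloud.example.com/").read()
# onprem = lambda: urllib.request.urlopen("https://onprem.example.com/").read()
# print(median_latency(onprem) / median_latency(cloud))
```

Medians rather than means keep a single slow outlier (a dropped packet, a GC pause) from dominating the comparison.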

          • mateo-maza 14 days ago

            I see, someone commented that on this post. Their argument is that most hardware issues are detected during the power-on self-test, and live kernel patching avoids that. To be completely honest, I'm too inexperienced to know whether it's pointless to consider or not. Your point does make sense to me: if you only update the OS every several months, it can't be a crucial factor in the comparison.

            Really interesting that you have found on-prem 2-10x faster within a region, thanks for that info. I'll let you know how things go with the experiment, cheers :D