karakanb 5 years ago

I was setting up a CI/CD pipeline for a small project and testing some changes in the test step before deploying. I wanted to collect env variables in a global location in the pipeline and set some defaults for the tests, but I used the same variable names for both, so the global values meant for the production deployment were picked up by the tests as well, and the tests started by wiping out the connected database.

As if this wasn't enough, I didn't understand what was going on, because tests that used to take 5 seconds were taking 3 minutes at that point, so I ran them at least 10 times to make sure it wasn't a pipeline glitch. This also helped me make sure that the database was completely wiped.

  • shoo 5 years ago

    how could we structure things differently to prevent anything remotely like this from ever happening?

    • gtsteve 5 years ago

      When we migrate databases, we execute a Docker container with a specific environment variable set to let it know that it's time to migrate. The CI environment doesn't actually have a network connection to the DB; it just updates the Docker images, and when the deploy task executes it starts this Docker task remotely and polls for success or failure (a sketch of this pattern follows at the end of this exchange).

      I think Kubernetes has some nicer features that could make this easier but I haven't had a chance to try this out yet.

      In fact, the CI environment and production environments are on different AWS accounts. This way if you accidentally execute tests with the wrong connection strings (and we have done this before), they just fail because they can't connect.

      • shoo 5 years ago

        > the CI environment and production environments are on different AWS accounts

        yep, that's one way to structurally enforce isolation (unless you go out of your way to configure cross-account permissions)

        there's probably a bit of a tradeoff between things being safe and isolated, and things being seamlessly automated.

        • gtsteve 5 years ago

          Indeed. I didn't find it all that hard to configure a role in the production account that the CI environment can assume, with permission to launch containers. I locked it down as much as possible, but if the CI environment gets hacked, someone could still launch a container in the production environment.

          I did think of a way to make this more secure, but I haven't done it yet: you could write a script that waits for input from a website where you have to log in via SAML. The script then passes the AWS credentials for the role back to the deployment task. Effectively, it's a second authentication step, where you'd need to be authenticated with a security key for the deployment to proceed.

          I haven't quite figured out all the details yet but that might help you get the assurance you need.
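
          A minimal sketch of that deploy step, assuming the remote Docker task runs on ECS and is launched through the cross-account role described above; the role ARN, cluster, task definition, container name, and the RUN_MIGRATIONS variable are all hypothetical:

              import time
              import boto3

              # Assume the locked-down cross-account role (hypothetical ARN).
              creds = boto3.client("sts").assume_role(
                  RoleArn="arn:aws:iam::123456789012:role/ci-migration-launcher",
                  RoleSessionName="ci-deploy",
              )["Credentials"]

              ecs = boto3.client(
                  "ecs",
                  aws_access_key_id=creds["AccessKeyId"],
                  aws_secret_access_key=creds["SecretAccessKey"],
                  aws_session_token=creds["SessionToken"],
              )

              # Launch the container with the "time to migrate" variable set.
              task_arn = ecs.run_task(
                  cluster="production",
                  taskDefinition="app-migrate",
                  overrides={"containerOverrides": [{
                      "name": "app",
                      "environment": [{"name": "RUN_MIGRATIONS", "value": "1"}],
                  }]},
              )["tasks"][0]["taskArn"]

              # Poll until the task stops, then succeed or fail on its exit code.
              while True:
                  task = ecs.describe_tasks(cluster="production", tasks=[task_arn])["tasks"][0]
                  if task["lastStatus"] == "STOPPED":
                      code = task["containers"][0].get("exitCode")
                      if code != 0:
                          raise SystemExit(f"migration failed with exit code {code}")
                      break
                  time.sleep(10)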

    • chris11 5 years ago

      One safeguard you can add is to use a separate database for the tests, with different environment variable names and DB roles, so the unit tests can never access a non-test DB (a sketch follows below).
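
      A minimal sketch of the env-var half of that safeguard, assuming Python tests; the TEST_DATABASE_URL name and the "prod" substring check are hypothetical, and the role half would be enforced by granting the test DB role access to the test database only:

          import os

          def test_database_url() -> str:
              url = os.environ.get("TEST_DATABASE_URL")
              if url is None:
                  # Fail loudly instead of falling back to DATABASE_URL, which
                  # a global pipeline scope might have pointed at production.
                  raise RuntimeError("TEST_DATABASE_URL is not set; refusing to run tests")
              if "prod" in url:
                  raise RuntimeError(f"TEST_DATABASE_URL looks like production: {url!r}")
              return url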

photonios 5 years ago

We just did a large migration, transitioning our deployment pipeline and front-facing web server to a new setup. I stayed at work till like 4 am. Everything went smoothly.

The next day I woke up and realized I had forgotten to enable password protection on the staging environment. I opened up my laptop, navigated to the dashboard, added the flag, and reloaded the config. 5 seconds later, my phone started beeping. Tons of alerts that the production website was down and responding with "403 access denied". Immediately, dozens of people jumped on Slack: "website down?!".

Turns out, with my sleepy head, I had added the flag to the production website instead of the staging website, and thus enabled password protection on the live website. As soon as I noticed, I undid it; in the end, downtime was roughly a minute.

Suffice it to say, I got slapped for this fuck up.

  • exolymph 5 years ago

    Oh noooo! Can't believe you had to stay that late at work. Poor planning leading up to the migration, or just unexpected issues?

    • photonios 5 years ago

      No other time to pull it off. We run a large website in a different part of the world, so we're several time zones behind. We could only do this migration during low traffic, which for us meant the middle of the night.

      Sounds worse than it is. Got to work in the afternoon, got pizza, and took a couple of extra free days in exchange.

xtagon 5 years ago

The longer I think about it, the dumber the things I'll probably remember, but one that comes to mind is unintentionally configuring a test suite to use a live SMTP connection instead of a mock one. When the tests ran, thousands of e-mails to invalid e-mail addresses were actually sent, bounced, and temporarily hurt the "reputation" score on the transactional e-mail service.
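
A minimal sketch of the guard that prevents this, assuming the app sends mail through smtplib; the helper function and addresses are hypothetical stand-ins:

    import smtplib
    from unittest import mock

    def send_low_credit_email(address):          # hypothetical app code
        with smtplib.SMTP("mail.example.com") as conn:
            conn.sendmail("noreply@example.com", [address],
                          "Subject: Your credits are low\n\nUpgrade your account!")

    def test_low_credit_email_is_not_really_sent():
        # Patch the SMTP class so no live connection can ever be opened.
        with mock.patch("smtplib.SMTP") as smtp_cls:
            send_low_credit_email("bounce@invalid.example")
            smtp_cls.return_value.__enter__.return_value.sendmail.assert_called_once()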

Raed667 5 years ago

While onboarding, I was given a shell script that was supposed to take a copy of the data in production, anonymize it, and set it up locally on my machine.

It took two parameters: the first was the IP address of the production database, the second my own IP address.

I guess it was inevitable for those to get mixed up, and I guess nobody had bothered to prevent write access to production.
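
A minimal sketch of the validation that would have caught the swap; the production address is hypothetical, and named flags instead of positional parameters would help just as much:

    import ipaddress
    import sys

    PROD_DB_IP = ipaddress.ip_address("10.0.5.20")  # hypothetical

    def main(source_ip: str, dest_ip: str) -> None:
        src, dst = ipaddress.ip_address(source_ip), ipaddress.ip_address(dest_ip)
        if dst == PROD_DB_IP:
            sys.exit("refusing: destination is the production database")
        if src == dst:
            sys.exit("refusing: source and destination are the same host")
        # ... dump from src, anonymize, restore to dst ...

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])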

Torgo 5 years ago

I stored an entire 30GB PostgreSQL production database on an AWS ephemeral volume (which gets automatically wiped on restart) because I mistook its device ID for that of a regular volume I thought I had attached.

Found my mistake months later, was able to fix it before disaster struck.

Adamantcheese 5 years ago

One time I tried to use Visual Studio's git integration and managed to delete the entire remote repository and all commit history. Good thing I kept religious zip backups because I was bad at git back then.

AnimalMuppet 5 years ago

Bringing up an embedded system that was its own development environment. I messed up and deleted the boot file.

The system still booted, because the boot sector pointed to the sectors that the deleted file had occupied, and the contents of those sectors had not been over-written. I ran that way for two days until I got it to a point that I could restore the boot file. I spent those two days afraid that any step would over-write the wrong sectors and the system would be dead.

sethammons 5 years ago

My very first project at my new job. First "real" programming job. Due to bad assumptions about dates in the DB, I got a user stuck in a loop. We emailed them some 400k times: "Your credits are low, upgrade your account!" I felt terrible. Our only way to reach out? Email :(

  • exolymph 5 years ago

    400k emails?! Oh my god. Did you ever hear back from the user?

elken 5 years ago

Not my screw-up, but: working for a fairly large e-commerce site, we had a cron job to pick up 'stuck' orders and push them down to the warehouse every 30 minutes.

For a combination of reasons there was one order which was picked up every time the job ran.

This went on for a very long time and nobody in the company realised. The warehouse staff were all agency workers who dispatched both B2B and B2C orders, so sending out a pallet wasn't unusual for them. It was only when the customer got in touch asking us to please stop sending the same thing that we were alerted.

Apparently she had ordered the items to her friend's office and they were getting very frustrated having to deal with them.

Even worse, the customer was in China, and the company had to order a shipping container to retrieve the stock.

ice303 5 years ago

I spent many hours trying to convince the IT director to do this at an off-peak time and with proper planning, but he just wouldn't listen to me.

Hyper-V live migration from an HP EVA4400 to a 3PAR storage array. 45 hosts migrated without a hitch. The one that couldn't fail, an SAP production server, failed hard. The EVA crashed: both controllers went offline in the middle of the migration. A couple of seconds later, one of the PSUs shut off; the other one was waiting for a replacement part. My face turned white. Huge downtime to recover everything from a tape backup. A couple of days later, I had a major burnout.

It was a really bad day at work :(

droptablemain 5 years ago

I once crashed a production server by accidentally running an infinite loop.

heelix 5 years ago

I misspelled "pharmaceutical", which was part of the company name and also appeared on the splash screen of the PowerBuilder application. Nobody noticed before several cases of floppy disks were delivered to us. The app required a dozen floppies or so per set... so we had a lot of 'spares' lying around. My personal cone of shame.

AwesomeFaic 5 years ago

Formatting a USB stick from Terminal, only to realize I had referenced the wrong drive. I killed the process too late; the MacBook crashed and failed to boot. Apple couldn't save the drive or recover anything.

Thankfully all but the last day's work was on Git, but I was using the laptop as temporary photo storage (it was my only device with an SD card reader at the time) and lost 9 months of important photos.

potta_coffee 5 years ago

Transitioning from military to civilian life is difficult, especially right after a deployment. I shot myself in the foot a few times because I just didn't fit in. To be honest, I was a real asshole. For instance, shredding someone to pieces as if they were a PFC who'd just done something really stupid - it just doesn't work in the "real world".

jimrhods23 5 years ago

DELETE * FROM <table>

Yeah, I forgot the WHERE.

This was 15 years ago at my first job and I haven't done it since. I was lucky that we had backups.

  • sethammons 5 years ago

    I learned this early on for mutating queries: I type the "WHERE" clause first, then go back to the head of the line and type the rest of the query.

  • photonios 5 years ago

    I wonder whether tooling that is commonly used to query a database should detect these common mistakes and warn you. `psql` could easily pop up a warning "Are you sure you wanna drop this entire table?".

    An alternative would be for databases to add an option to prevent accidental deletion that, when enabled, would make it impossible to truncate or delete entire tables. I would enable such a setting on my production databases.

    • Torgo 5 years ago

      JetBrains DataGrip does put up a warning like this before you can run a DELETE on a table unadorned by a WHERE clause.

      • jimrhods23 5 years ago

        DataGrip is my #1 DB tool for my team for reasons like this.

    • dragonwriter 5 years ago

      A front end that required (or perhaps just had a mode that required) DELETE or UPDATE queries to be preceded by a SELECT, and then styled as "DELETE IT;" or "UPDATE IT SET ...;", reusing the FROM and (if present) WHERE clauses from the preceding SELECT, would be interesting (a sketch appears at the end of this subthread).

      • photonios 5 years ago

        This seems like such a good idea that I doubt nobody has ever tried to implement something like it.

        At least for PostgreSQL, someone tried:

        https://www.postgresql.org/message-id/12104.1150425319%40sss...

        But it doesn't seem like it got much serious attention.

        • dragonwriter 5 years ago

          That seems to be someone asking for a back-end change to prevent such queries, not a front-end change.

          I think front-end makes sense, where SQL is a UI, not back-end, where it is an API.
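
      A minimal sketch of the SELECT-then-"DELETE IT" front end proposed above, as a client-side wrapper; the parsing is deliberately naive (a real tool would use a proper SQL parser), and `execute` stands in for whatever actually runs the SQL:

          import re

          class SafeShell:
              def __init__(self, execute):
                  self.execute = execute   # callable that runs the real SQL
                  self.last_target = None  # "FROM ... [WHERE ...]" of the last SELECT

              def run(self, statement: str) -> None:
                  stmt = statement.strip().rstrip(";")
                  select = re.match(r"(?is)^select\b.*?(\bfrom\b.+)$", stmt)
                  if select:
                      self.last_target = select.group(1)
                      self.execute(stmt)
                  elif re.match(r"(?is)^delete\s+it$", stmt):
                      if self.last_target is None:
                          raise RuntimeError("DELETE IT requires a preceding SELECT")
                      self.execute(f"DELETE {self.last_target}")
                  elif re.match(r"(?is)^(delete|update)\b", stmt):
                      raise RuntimeError("SELECT the rows first, then use DELETE IT")
                  else:
                      self.execute(stmt)

          shell = SafeShell(print)  # print stands in for execution
          shell.run("SELECT * FROM users WHERE id = 42;")
          shell.run("DELETE IT;")   # runs: DELETE FROM users WHERE id = 42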

    • Jeremy1026 5 years ago

      We have code in place to reject any SQL query without a WHERE clause (sketched below).

      • seanwilson 5 years ago

        I'm always surprised this isn't the default. Imagine if "rm" deleted everything when you didn't specify a file.
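
      A minimal sketch of such a guard, as a naive string check (a real implementation would parse the statement); MySQL ships something similar as the `sql_safe_updates` session option:

          import re

          def guard(statement: str) -> str:
              stmt = statement.strip()
              if re.match(r"(?is)^(delete|update)\b", stmt) and not re.search(r"(?i)\bwhere\b", stmt):
                  raise ValueError(f"refusing mutating query without WHERE: {stmt!r}")
              if re.match(r"(?is)^(truncate|drop)\b", stmt):
                  raise ValueError(f"refusing destructive statement: {stmt!r}")
              return stmt

          guard("DELETE FROM users WHERE id = 42")  # passes through
          try:
              guard("DELETE FROM users")            # rejected
          except ValueError as err:
              print(err)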

turtlegrids 5 years ago

Not taking the 83(b) election...