tootie 5 years ago

The new game of Docker Golf. I once spent like a day trying to debug an issue with pruning dev dependencies from my prod Docker image before I stopped to realize how much money I was wasting to save $0.0001 of cloud disk space. It is kinda fun though.

  • drinchev 5 years ago

    That's something I ask myself every day. Is it worth pushing the big O notation to the limit to save a couple of megabytes of RAM, or should I simply deliver and have a happy boss?

    I've always been fond of "premature optimisation is the root of all evil" [1], but still... wasting resources makes me feel bad. I feel like the guy that buys 10 plastic bottles of 0.5l water instead of 2x2.5 litres.

    The guys building (and owning) Slack, cryptocurrencies, and e-commerce websites are on the other end, though. I can hardly remember a day when my MacBook's fan doesn't spin like crazy just because I'm visiting an HTML page to read some text.

    In other words: yeah, you can spare some precious time to optimise that, because you don't want to be the person who spins up the fans of your colleagues' Macs (by having a Docker image with your app running on my Mac). Energy is what you're really saving, and that's priceless in today's polluted world.

    1: http://wiki.c2.com/?PrematureOptimization

    • hinkley 5 years ago

      Make it Work, Make it Right, Make it Fast.

      On several projects the management and sales team were at odds with the dev team about app performance. They wanted it faster and the dev team was full of premature optimization cargo cult members. By cult member, I mean people who say that phrase to get out of thinking.

      The line between right and fast is blurry. There's a big chunk of refactorings that improve both readability and performance, and if you work with those you get better at your job, avoid the cult members, and please the business.

      Now that I’ve typed that last paragraph I want to go through Refactoring and categorize the ones that fall under “right and fast”...

      • jay-anderson 5 years ago

        Very much agreed. As with all things, balance is needed. It's often worth thinking about performance during initial design. There's the idea that fixing bugs earlier in development is cheaper; the same applies to performance issues. That's not to say you should spend an inordinate amount of time focusing on just that, but it needs to be given its due.

    • cyphar 5 years ago

      The thing is, Docker Golf is possibly the worst way to optimise the real costs of container images -- because it comes at the cost of maintenance (distributions exist for a reason).

      A much better solution is to fix the image format so that it better handles the use cases it needs to support. (This is what I'm working on at the moment.)

  • nine_k 5 years ago

    Cloud disk space for an executable file costs nothing compared to the potential losses if that executable is used to crack into your installation.

    That could happen either via a hole in the executable itself, or simply by an attacker using it to greatly simplify penetration and/or privilege escalation.

  • weberc2 5 years ago

    If there is a practical advantage, I think it's reducing the attack surface. I recently made an image that was just a single statically-linked binary, some certs, and an /etc/passwd file on top of a scratch base image. I did this because our organization's compliance team has only greenlighted CentOS and scratch base images (no idea whether or not this is a particularly good policy), and I opted for scratch because it was only marginally more effort for several small gains (reduced attack surface, faster deploys, faster builds, faster pulls, etc.).
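
    For reference, a minimal sketch of that kind of image, assuming a statically linked binary called `server`, a CA bundle, and a one-line passwd file already sit in the build context (all three names are illustrative, not from the parent comment):

      FROM scratch
      # CA bundle so outbound TLS still works in the otherwise empty image
      COPY ca-certificates.crt /etc/ssl/certs/
      # Single non-root user entry, e.g. "app:x:1000:1000::/:/sbin/nologin"
      COPY passwd /etc/passwd
      COPY server /server
      USER app
      ENTRYPOINT ["/server"]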

  • schwuk 5 years ago

    It's not (only) about the storage space; it's the upload too. That time investment will get paid back over multiple deployments.

    That said, now that I have "good" broadband, I pay a lot less attention to image size.

  • CardenB 5 years ago

    Amdahl’s law in practice

  • always_good 5 years ago

    Reminds me of how I'll spend an afternoon avoiding a copy in Rust. It's kinda fun, but hard to remain deluded that you're doing it for serious technical purposes.

    • heavenlyblue 5 years ago

      Then the next day you realise the next feature actually requires a copy anyway... :)

rubenhak 5 years ago

This is great! Thanks for sharing. I can recommend a small change to get the most out of Docker layer caching and reduce build times.

In the "builder" you only need package.json and package-lock.json files to install dependencies. The rest of the sources can be copied in the "scratch-node" image. This would make caching work till the last line where only code changes are included. Code is modified much more frequently than dependencies.

Your sample could look like this:

FROM node as builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm install --prod

FROM astefanutti/scratch-node
COPY --from=builder /app/node_modules /node_modules
COPY ./ ./
ENTRYPOINT ["./node", "index.js"]

martin-adams 5 years ago

That's pretty cool. What it appears to do is build Node statically in one builder container, then start from a scratch container and copy the one binary and one user config into it. So what's built is extremely minimal.

  • web007 5 years ago

    Multi-stage Docker builds are an underutilized pattern.

    Go ahead, go crazy and add all of the dev dependencies you need to build your package. Once you've done that, take the built package and put it into another container that has only the runtime dependencies.

    The ideal use-case for this is compiling Go, since you end up with a 1GB build container and a 12MB single-binary production container if you compile with static linking. Just beware that when going the FROM scratch route you get nothing to go with it: you can't shell into the container or run "ps" or "lsof" for debugging, because none of those exist.
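
    For the Go case, a minimal sketch of that two-stage flow (the image tag, paths, and output name are assumptions, not from the parent comment):

      FROM golang:1.13 AS builder
      WORKDIR /src
      COPY . .
      # CGO off so the result is a fully static binary that can run on scratch
      RUN CGO_ENABLED=0 go build -o /app .

      FROM scratch
      COPY --from=builder /app /app
      ENTRYPOINT ["/app"]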

    • cyphar 5 years ago

      What's really frustrating is that this isn't a new idea at all. RPMs have had BuildRequires forever, and debs have had a similar concept for just as long.

      Sure, multistage builds are useful, but you could always get the same features (and more) by making packages for whatever distribution you use and installing them in your container. I get that it's not as easy as writing a shell script in your Dockerfile to build all your dependencies, but sometimes better solutions aren't free.

      (Also this is one of the reasons why the layer model of deduplication is flawed, and why I'm working on improving it. You shouldn't have to care how large your logical image size is.)

    • weberc2 5 years ago

      > Just beware that when going the FROM scratch route you get nothing to go with it: you can't shell into the container or run "ps" or "lsof" for debugging, because none of those exist.

      Your image processes run on the host anyway, so just `ps` or `lsof` from the host. I've never had to exec into a Go/scratch container.

      • amazingman 5 years ago

        > Your image processes run on the host anyway, so just `ps` or `lsof` from the host.

        You can't do this if you don't have access to the host and root privileges on it.

    • snorremd 5 years ago

      Agreed. Makes the Dockerfile much more portable as you don't need build-tooling on the host to make a Docker image of the program. It is also really great when doing CI because you can use the Docker image layer cache to cache build/dev dependencies.

tzaman 5 years ago

Honestly, the size doesn't matter.

I was once in the camp of small Docker images, but realized it's simply not worth the tradeoff, since there's only one upside to them, and that upside is fast transfer of images.

However, that argument becomes pointless when using a proper CI/CD stack. As a developer, you don't normally upload images yourself; you push changes to GitHub, then Jenkins/Travis/whatever takes over, builds the image, and pushes it to production/staging/whatever. Since the CD tool of choice is usually also in the cloud, we don't have to worry about image size, nor do any of the CD vendors charge for data transfer.

I'd rather have bigger images (I base mine off Debian now; it used to be Alpine) and not have to worry about a lack of ported tools and libraries, than vice versa.

  • moltar 5 years ago

    What if you need to release a hot fix and your image is 1Gb?

    • tzaman 5 years ago

      1. I push a hotfix to GitHub.
      2. Jenkins (which is on Google Cloud) builds it, and it already has all the Docker steps cached from previous builds, so it's fast.
      3. Jenkins pushes the image to the Google Cloud repo, which is almost instantaneous.
      4. Kubernetes (also on Google Cloud) pulls the image and makes a new deployment.

      No big deal. :)

    • cyphar 5 years ago

      And this is why we need a better image format so that people don't cripple their images to get around the misuse of tar archives.

WD-42 5 years ago

This is neat! However, I don't think I've ever seen a node project whose node_modules wasn't at least 10x the size of one of these images.

  • vidarh 5 years ago

    Size is only part of the equation; having fewer binaries to worry about security updates for is another.

    • kpcyrd 5 years ago

      I see that argument a lot, but this assumes that vulnerabilities in binaries that are never executed are magically exploitable from the internet.

      It doesn't really matter if a container contains a 5 year old imagemagick binary if that binary is never used by anything. It's the equivalent of a bug in unreachable code.

      • vidarh 5 years ago

        No, it assumes that there is a risk that other vulnerabilities may allow you to trigger local executables, and the less code is accessible, the more remote that possibility becomes.

      • phamilton 5 years ago

        Unless the exploit makes unreachable code reachable.

        Security (and privacy) are largely about minimizing surface area.

        • kpcyrd 5 years ago

          The "surface" in "attack surface" implies reachability.

          Your argument doesn't make any sense, how would "the exploit" make an unreachable vulnerability reachable without being able to execute the vulnerable code in the first place?

          Please don't say "using a different vulnerability that allows us to execute arbitrary code".

          • phamilton 5 years ago

            > Please don't say "using a different vulnerability that allows us to execute arbitrary code".

            Sorry, but I'm going there anyway. Imagine two different exploits. One is a remote code execution exploit and the other is a privilege escalation exploit.

            Let's say your application has an exploit and an attacker manages to obtain a reverse shell (imagemagick, XML parsing, etc. have all had multiple such exploits over the years). If you're running things correctly, that reverse shell is not privileged. It's the apache user or something. Not a good situation to be in; they can do a lot of damage, but at least some things are safe. They don't have root.

            Now the attacker finds that you have X11 installed. It's an old version that was installed by default. It happens to have a root privilege escalation exploit via fonts. Now the attacker has root.

            That's what I mean by surface area. Thinking in terms of "have we been compromised" isn't sufficient. Being able to contain the attack is important, and dead code lying around factors into how well you can contain it.

          • vidarh 5 years ago

            Plenty of real security flaws have involved finding ways of obtaining the ability to execute binaries already present on a system.

            Such flaws have not even always required a direct connection - years ago someone found a flaw in common USENET software that let them execute command lines via specially crafted newsgroup posts, and effectively get a really slow (store and forward via multiple servers slow) interactive shell.

            Their ability to exploit it was directly dependent on what else was reachable from a shell. Run it in a chroot without binaries, and they could do quite little. Run it somewhere the attacker had access to tools and they suddenly had a shell behind your firewall.

            The increased risk from more binaries is not hypothetical; it's something many of us have experienced first hand.

            I've personally reviewed more than one set of logs from intrusion attempts where the attackers had found a way to execute commands but were unable to do harm: they were fumbling around looking for ways to penetrate further and didn't find any of the tools they needed.

ajuhasz 5 years ago

There’s also google cloud’s distroless project: https://github.com/GoogleContainerTools/distroless

We've used the Node.js containers in production and investigated the other languages, but never deployed them. We did have some issues with devops not being able to log into the running containers, but we always found a solution that, I believe, ended up being a better long-term pattern for ops.

mikepurvis 5 years ago

A scratch container doesn't even have busybox in it, does it? If not, this wouldn't be able to run npm, much less install anything which has bindings to other libraries.

Definitely a cute experiment, but probably of limited real-world use. I wonder what the smallest _practical_ node container would look like?

  • james-mcelwain 5 years ago

    `npm` shouldn't be included in the final container anyway. Best practice for Docker is to have separate build and run containers. The build container can contain anything you want, but the run container should have only the bare minimum required to execute the artifact produced by the build container.

  • philplckthun 5 years ago

    I suppose you could apply the same builder/scratch separation as this Dockerfile, install using npm/yarn/etc., and copy over the result.

    It's certainly only practical when every byte matters. At that point it might also make sense to prebundle the dependencies and copy that over.

    It'd be interesting to see whether this becomes relevant if serverless-like constraints suddenly apply to a Docker cloud service

    • misterdata 5 years ago

      That's exactly what the README suggests doing under 'Usage'. First it says 'FROM node', which is the regular 'fat' image; in this image it then runs `npm install`. Further on there is another 'FROM' which starts from the minimal image and copies over all the JS code (including the node_modules directory that npm populated) from the 'fat' image. This would of course fail for bindings that require libraries outside the node_modules directory, although it is relatively easy to add more COPY statements to put those in the appropriate locations too, as sketched below.
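
      As a sketch of that extra-COPY idea (the library name and path below are purely illustrative assumptions, and the image name and entrypoint simply mirror the sample earlier in the thread):

        FROM node as builder
        WORKDIR /app
        COPY . .
        RUN npm install --prod

        FROM astefanutti/scratch-node
        COPY --from=builder /app /
        # Native addons sometimes link against shared libraries that live outside
        # node_modules; copy those from the 'fat' image into the same locations:
        COPY --from=builder /usr/lib/x86_64-linux-gnu/libvips.so.42 /usr/lib/x86_64-linux-gnu/
        ENTRYPOINT ["./node", "index.js"]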

    • orliesaurus 5 years ago

      wait, do you mean adding another layer for npm installing other deps?

      • jopsen 5 years ago

        No, doing npm install in a different container and copying over the result. Docker has multi-stage builds, which let you build in one container and copy the output into a different, final container.

        I suspect this could be relevant in many places; keeping images small also hardens security.

        • orliesaurus 5 years ago

          Oh, what are the advantages of doing that in a different container as opposed to a different layer? Trying to understand the pros/cons. Isn't it faster to add extra layers than extra containers?

          • heroic 5 years ago

            This is the builder pattern in Docker. You do things in one "throw-away" container, then take the results from that container and copy them into another. The final container gets only one layer now, instead of the dozens your "throw-away" container may have had.

          • vsviridov 5 years ago

            Layers are add-only. If you remove a file, it still exists in the parent layer, propagating bloat.

          • wolfgang42 5 years ago

            Different layer means you have to carry around that layer in your final image. If you do a multi-stage build, you can copy over just the bits you need into the final image.

bloopernova 5 years ago

Figured I would ask here: have any DevOps or build folks had to deal with compliance audits regarding their Docker containers?

It's something certain developers I've encountered seem to ignore, even when creating something that might handle health or financial information.

Did you have to build your docker images from scratch, or did the security audit folks certify upstream images? What about updates?

  • choosegoose 5 years ago

    We use Twistlock (https://www.twistlock.com/) as it does CVE scanning, and you can set up rules for compliance, binary monitoring, and a whole plethora of other security/auditing-type things. It also has a Jenkins plugin so you can fail builds if developers introduce a certain threshold of CVEs/compliance failures (the only way to actually get a team to care about security).

    Our security folks haven't really decided what to do with containers, although some people are just using RHEL7 base images since it's "enterprise-y". Our group personally uses Alpine base images. If we have something like a Java service hosted by Tomcat, we build Alpine, then build Tomcat on top of it, and then build our "service" container. While most people are fine pulling from Docker Hub, we work in closed-loop environments and have a private Docker registry where we host our "chain" of Docker images, which are versioned and updated regularly.

    • bloopernova 5 years ago

      Thank you, I will check that out.

  • tonyhb 5 years ago

    Docker actually does this for you via Docker Cloud when you store your images there. It performs static analysis of all the files/libs in your container to check for vulnerabilities, as opposed to the simpler `dpkg` list check, which is not accurate.

    • bloopernova 5 years ago

      Cool, so at least some folks are thinking about this sort of thing. Thank you for the response!

  • FrenchTouch42 5 years ago

    Assuming you have a shell within your container, InSpec can be a great tool (https://www.inspec.io/) as you can pass a "docker" target for your compliance profiles.

    Works great for us :)

  • orf 5 years ago

    Gitlab has a bunch of features around this: licensing, package scanning and docker CVE scanning.

    • bloopernova 5 years ago

      Cool, I had no idea. Thank you!

tlrobinson 5 years ago

A 15MB Node.js Docker image, to which most people will add 200MB of dependencies from npm :)

alexnewman 5 years ago

Good, this way I can't debug my Docker container by jumping into it. Also, I wouldn't actually want to call any OS features, so this protects me from that. Just kidding. This is a cool project, but it probably isn't super production-y.

  • vidarh 5 years ago

    Pretty much everything you might need to do to debug it can either be done from the outside, or achieved by copying binaries in temporarily, or by running the same image with a volume mounted.

moltar 5 years ago

Does it work with statically linked modules?

kowdermeister 5 years ago

What does scratch mean in this context?

  • granra 5 years ago

    The scratch image in docker is an image containing nothing.

nurettin 5 years ago

but will it work on armv7 or later?

11235813213455 5 years ago

no npm, no shell, not practical :)

  • bloopernova 5 years ago

    Why? Containers are supposed to be immutable. They shouldn't need npm, a shell, or other stuff, really.

    (not looking to argue, if there's another reason to keep these tools around, please let me know, I'm always looking to learn more)