> When we generate a Tails ISO image, our source code and the Debian packages we include are assembled into a binary ISO image, much like when the ingredients of the recipe are mixed together, one obtains the meal.
Since Tails is based on Debian, and Debian has put a lot of effort into reproducible builds, I guess it's thanks to Debian (and by that I mean everyone contributing to Debian) that it was feasible for them to do this.
Happy to hear about this. All distros ought to strive for reproducible builds.
This is a big step forward for free software and another great reason why open source is better than closed.
I can't fathom why compilers include so much metadata in builds and don't have a flag to disable it.
Why isn't there an option to completely leave out the build path, timestamps and anything else that isn't necessary for the program to function?
Then, besides having different compilers or versions thereof, reproducible builds would be a triviality.
Am I right in assuming that the developers of GCC have political reasons for it?
I studied this issue a bit several years ago, though I wasn't in contact with the toolchain developers. I think the simplest answer is that reproducible builds only became an explicit priority in the last few years, while many of these compilers have existed for longer than that. The metadata is also useful for debugging purposes, and debuggability is traditionally a high priority for compiler authors.
People working on reproducible build projects have generally been able to modify compilers and/or create post-compilation tools to have the effect of removing the metadata. It's true that they could benefit from more help from upstream toolchains, though.
Indeed. Provenance is a more common concern than reproducibility. Consider: even once you have a reproducible build, it will be important to track down where a given binary came from if it does differ.
To that end, I would expect the tooling to be able to distinguish that the metadata about a build should not count towards its "byte similarity."
Most people working on reproducible builds would probably suggest using digital signatures for provenance purposes (and never using any binaries that didn't come with a digital signature). However, present-day digital signatures don't necessarily include some of the metadata you might want (such as which person and/or machine was responsible for the compilation, not just which organization or entity).
There are tools that can remove the metadata after the fact, but they aren't necessarily standardized enough to allow for a widely understood, comparable hash value. For example:
https://packages.debian.org/sid/strip-nondeterminism
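To make the "comparable hash value" point concrete, here is a rough sketch of hashing an archive while ignoring its timestamps. It is not how strip-nondeterminism works internally; the helper names `build_tar` and `normalized_digest` are made up for illustration, and a tar archive stands in for whatever build artifact you actually care about.

```python
import hashlib
import io
import tarfile


def build_tar(files, mtime):
    """Create an in-memory tar archive whose entries carry the given mtime."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime  # the only thing that differs between "builds"
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def normalized_digest(tar_bytes):
    """Hash member names and file contents only, ignoring timestamps and ownership."""
    h = hashlib.sha256()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            h.update(member.name.encode("utf-8") + b"\0")
            if member.isfile():
                data = tar.extractfile(member).read()
                h.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguity
                h.update(data)
    return h.hexdigest()


# Hypothetical example: the same sources "built" at two different times.
sources = {"bin/hello": b"\x7fELF...", "README": b"hello\n"}
build_a = build_tar(sources, mtime=1_000_000_000)
build_b = build_tar(sources, mtime=1_500_000_000)

raw_hashes_match = hashlib.sha256(build_a).digest() == hashlib.sha256(build_b).digest()
normalized_match = normalized_digest(build_a) == normalized_digest(build_b)
```

The catch raised above still applies: two parties only get the same digest if they agree on exactly which fields are ignored, which is the standardization problem.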
I get how this can accomplish the goal of validating that "Person X signed off on binary Y." However, it complicates the question of "looking at binary Y, how can I determine where/how it was built?"
Yes, trusting metadata is just trusting something in the binary. It is very convenient, though. (Well, where it is ever used, it is convenient.)
I think a compromise would be to combine them. Put an envelope over the binary that can have a "content" section and a "content-signature" section. The content should be reproducible, and the signature would vouch for it. You could then add "metadata" fields and each of those could additionally be signed, without having any change in the binary that is produced.
Honestly, I expect this is how most builds are done. I have not looked into it, though.
I don't see how that's a compromise. It just looks like an obvious solution to me. :) In short, you trust X to tell you about Y's provenance and include that in the signature. The rest is just details.
Completely agreed. I just want all of this included as part of Y. That is, don't make me track two files, when one works. :)
Please tell me this is how things are done.
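For what it's worth, the single-file envelope described in this subthread could look roughly like the sketch below. Everything here is an assumption for illustration: the JSON layout, the function names, and especially the use of HMAC from the standard library as a stand-in for a real public-key signature. I don't know of a packaging format that does exactly this.

```python
import base64
import hashlib
import hmac
import json


def sign(key: bytes, payload: bytes) -> str:
    """Stand-in for a real digital signature (HMAC-SHA256, assumption for the sketch)."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def make_envelope(binary: bytes, content_key: bytes,
                  metadata: dict, metadata_key: bytes) -> str:
    """Wrap a reproducible binary plus separately signed metadata in one file."""
    content_digest = hashlib.sha256(binary).hexdigest()
    envelope = {
        "content": base64.b64encode(binary).decode("ascii"),
        "content-signature": sign(content_key, binary),
        "metadata": [],
    }
    for field, value in sorted(metadata.items()):
        # Bind each metadata field to this exact content so it cannot be
        # copied onto a different binary.
        payload = json.dumps([content_digest, field, value]).encode("utf-8")
        envelope["metadata"].append(
            {"field": field, "value": value, "signature": sign(metadata_key, payload)}
        )
    return json.dumps(envelope, sort_keys=True)


def verify_content(envelope_json: str, content_key: bytes) -> bytes:
    """Return the binary if its signature checks out, otherwise raise ValueError."""
    envelope = json.loads(envelope_json)
    binary = base64.b64decode(envelope["content"])
    expected = sign(content_key, binary)
    if not hmac.compare_digest(expected, envelope["content-signature"]):
        raise ValueError("content signature mismatch")
    return binary


# Two builders with different provenance metadata wrap the same binary.
binary = b"reproducible build output"
env_1 = make_envelope(binary, b"org-key", {"builder": "alice", "host": "ci-1"}, b"alice-key")
env_2 = make_envelope(binary, b"org-key", {"builder": "bob", "host": "laptop"}, b"bob-key")
```

The "content" and "content-signature" fields come out identical for both builders, so byte comparison of the binary still works, while the provenance lives in the "metadata" entries of the same file.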
I don't know why you blame GCC. It doesn't stuff any such metadata into binaries on its own, but because it was asked to do so. Also, I'm pretty sure most reproducibility issues have nothing to do with a C compiler anyway.
> [GCC] doesn't stuff any such metadata into binaries on its own
Correction: when you build with debug information enabled (-g), GCC does put the current working directory into the binary. This could be fixed with the -fdebug-prefix-map=… option, but when Debian folks tried it, they ran into other reproducibility problems:
https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00513.html
Anyway, my point is that the road to reproducible builds is not obstructed just by one big thing (GCC). There are lots of obstacles of various shapes and sizes:
https://tests.reproducible-builds.org/debian/index_issues.ht...
The parts of the compiler that kill reproducibility, I had thought, were the optimizations, which can be a non-deterministic search of an optimization space nowadays.
Even if a compiler used "random" inputs for profiling, they could come from a deterministic pseudo random number generator that returns the same result on each compilation.
Yeah, that is what I guessed down thread. I just latched onto "non-deterministic."
Please provide a citation. To my knowledge, that is not true.
Debian's reproducible build project has not encountered any issue with optimizations as far as I know, so it seems like your claim would need significant evidence.
To be fair, I did have a qualifier there. :)
I could have sworn I saw this discussed somewhere before, but I can't find an article now. I can find that some branch prediction is non-deterministic. Though, in all of those cases, you can specify the random seed, so not likely to be an actual problem.
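The seeded-generator point from this subthread is easy to demonstrate. This is only a toy sketch: `pick_profiling_inputs` is a hypothetical function, not anything a real compiler does, and it just shows that a pseudo-random generator with a fixed seed yields the same "random" choices on every run.

```python
import random


def pick_profiling_inputs(seed: int, count: int = 5) -> list:
    """Return `count` pseudo-random values drawn from a generator fixed by `seed`."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    return [rng.randrange(1_000_000) for _ in range(count)]


# Same seed, same sequence, regardless of when or where this runs.
first_run = pick_profiling_inputs(seed=42)
second_run = pick_profiling_inputs(seed=42)
runs_match = first_run == second_run
```

So randomness inside a tool is only a reproducibility problem when the seed itself comes from something like the clock or the process ID.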