bringtheaction 6 years ago

> When we generate a Tails ISO image, our source code and the Debian packages we include are assembled into a binary ISO image, much like when the ingredients of the recipe are mixed together, one obtains the meal.

Since Tails is based on Debian, and Debian has put a lot of effort into reproducible builds, I guess it's thanks to Debian (and by that I mean everyone contributing to Debian) that it was feasible for them to do this.

Happy to hear about this. All distros ought to strive for reproducible builds.

chaz6 6 years ago

This is a big step forward for free software and another great reason why open source is better than closed.

madez 6 years ago

I can't fathom why compilers include so much metadata in builds and don't have a flag to disable it.

Why isn't there an option to completely leave out the build path, timestamps and anything else that isn't necessary for the program to function?

Then, besides having different compilers or versions thereof, reproducible builds would be a triviality.
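
For timestamps specifically, one convention that does exist is the SOURCE_DATE_EPOCH environment variable from the Reproducible Builds project: a tool that wants to embed a date reads it from that variable instead of the wall clock. A minimal sketch of how a build step might honor it, assuming a generated header file; `build_timestamp` and `render_version_header` are hypothetical names for illustration, not part of any real toolchain.

```python
import os
import time


def build_timestamp(environ=os.environ):
    """Return the timestamp to embed in a build artifact.

    If SOURCE_DATE_EPOCH is set, use it so that repeated builds embed the
    same value; otherwise fall back to the current time (not reproducible).
    """
    epoch = environ.get("SOURCE_DATE_EPOCH")
    if epoch is not None:
        return int(epoch)
    return int(time.time())


def render_version_header(version, environ=os.environ):
    """Render a small generated file that embeds the build date (hypothetical)."""
    stamp = time.gmtime(build_timestamp(environ))
    date = time.strftime("%Y-%m-%dT%H:%M:%SZ", stamp)
    return f'#define VERSION "{version}"\n#define BUILD_DATE "{date}"\n'
```

With the variable set, two runs produce byte-identical output; without it, the output changes every second.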

Am I right in assuming that the developers of GCC have political reasons for it?

  • schoen 6 years ago

    I studied this issue a bit several years ago, though I wasn't in contact with the toolchain developers. I think the simplest answer is that reproducible builds only became an explicit priority in the last few years, while many of these compilers have existed for much longer than that. The metadata is also useful for debugging purposes, and debuggability is traditionally a high priority for compiler authors.

    People working on reproducible build projects have generally been able to modify compilers and/or create post-compilation tools to have the effect of removing the metadata. It's true that they could benefit from more help from upstream toolchains, though.

    • taeric 6 years ago

      Indeed. Provenance is a more common concern than reproducibility. Consider that even once you have a reproducible build, it will be important to track down where a build came from if it does differ.

      To that end, I would expect the tooling to be able to distinguish that the metadata about a build should not count towards its "byte similarity."

      • schoen 6 years ago

        Most people working on reproducible builds would probably suggest using digital signatures for provenance purposes (and never using any binaries that didn't come with a digital signature). However, present-day digital signatures don't necessarily include some of the metadata you might want (such as which person and/or machine was responsible for the compilation, not just which organization or entity).

        There are tools that can remove the metadata after the fact, but they aren't necessarily standardized enough to allow for a widely-understood comparable hash value. For example,

        https://packages.debian.org/sid/strip-nondeterminism
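
        To illustrate the general idea of after-the-fact normalization (this is not how strip-nondeterminism itself is implemented, only a sketch in the same spirit), here is a small Python example that rewrites a ZIP archive with sorted entries and a fixed timestamp before hashing it. The function names and the choice of fixed date and permissions are assumptions made for the example.

```python
import hashlib
import io
import zipfile

# Earliest timestamp the ZIP format can store; any fixed value would do.
FIXED_DATE = (1980, 1, 1, 0, 0, 0)


def normalize_zip(data: bytes) -> bytes:
    """Rewrite a ZIP archive with sorted entries and fixed per-file metadata."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for name in sorted(src.namelist()):
            info = zipfile.ZipInfo(name, date_time=FIXED_DATE)
            info.external_attr = 0o644 << 16  # fixed permissions (assumption)
            dst.writestr(info, src.read(name))
    return out.getvalue()


def normalized_digest(data: bytes) -> str:
    """SHA-256 of the archive after normalization."""
    return hashlib.sha256(normalize_zip(data)).hexdigest()
```

        Two archives with the same file contents but different timestamps or entry order then hash to the same value. The comparison only means something if everyone normalizes the same way, which is the standardization problem mentioned above.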

        • taeric 6 years ago

          I get how this can accomplish the goal of validating that "Person X signed off on binary Y." However, it complicates the question of "looking at binary Y, how can I determine where/how it was built?"

          Yes, trusting metadata is just trusting something in the binary. It is very convenient, though. (Well, where it is ever used, it is convenient.)

          I think a compromise would be to combine them. Put an envelope over the binary that can have a "content" section and a "content-signature" section. The content should be reproducible, and the signature would vouch for it. You could then add "metadata" fields and each of those could additionally be signed, without having any change in the binary that is produced.

          Honestly, I expect this is how most builds are done. I have not looked into it, though.
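
          A rough sketch of the envelope I have in mind, in Python. Everything here is an assumption for illustration: the field layout, the function names, and especially the use of HMAC-SHA256 as a stand-in for a real digital signature, so that the example needs only the standard library.

```python
import hashlib
import hmac
import json


def sign(key: bytes, payload: bytes) -> str:
    # HMAC-SHA256 stands in for a real digital signature (assumption).
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def make_envelope(content: bytes, metadata: dict, key: bytes) -> dict:
    """Wrap reproducible content with its signature and signed metadata fields."""
    digest = hashlib.sha256(content).hexdigest()
    fields = []
    for name, value in sorted(metadata.items()):
        # Each metadata signature also covers the content digest, so a field
        # cannot be moved onto a different binary unnoticed.
        payload = json.dumps([digest, name, value]).encode()
        fields.append({"name": name, "value": value,
                       "signature": sign(key, payload)})
    return {
        "content": content.hex(),
        "content-signature": sign(key, content),
        "metadata": fields,
    }


def verify_content(envelope: dict, key: bytes) -> bool:
    """Check the content signature without looking at any metadata."""
    content = bytes.fromhex(envelope["content"])
    return hmac.compare_digest(envelope["content-signature"],
                               sign(key, content))
```

          Two builders producing the same bytes get identical "content" and "content-signature" sections and differ only in "metadata", so the reproducibility check and the provenance information live in one file without interfering with each other.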

          • andrewflnr 6 years ago

            I don't see how that's a compromise. It just looks like an obvious solution to me. :) In short, you trust X to tell you about Y's provenance and include that in the signature. The rest is just details.

            • taeric 6 years ago

              Completely agreed. I just want all of this included as part of Y. That is, don't make me track two files, when one works. :)

              Please tell me this is how things are done.

  • jwilk 6 years ago

    I don't know why you blame GCC. It doesn't stuff any such metadata into binaries on its own, but because it was asked to do so. Also, I'm pretty sure most reproducibility issues have nothing to do with a C compiler anyway.

    • jwilk 6 years ago

      > [GCC] doesn't stuff any such metadata into binaries on its own

      Correction: when you build with debug information enabled (-g), GCC does put the current working directory into the binary. This could be fixed with the -fdebug-prefix-map=… option, but when Debian folks tried it, they ran into other reproducibility problems:

      https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00513.html

      Anyway, my point is that the road to reproducible builds is not obstructed just by one big thing (GCC). There are lots of obstacles of various shapes and sizes:

      https://tests.reproducible-builds.org/debian/index_issues.ht...

    • taeric 6 years ago

      The parts of the compiler that kill reproducibility, I had thought, were the optimizations, which can be a non-deterministic search of an optimization space nowadays.

      • madez 6 years ago

        Even if a compiler used "random" inputs for profiling, they could come from a deterministic pseudo random number generator that returns the same result on each compilation.
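
        A toy sketch of what I mean, in Python rather than inside any real compiler. The names `seeded_rng` and `pick_probe_inputs` are made up for the example, and deriving the seed from a hash of the source is just one possible choice.

```python
import hashlib
import random


def seeded_rng(source: bytes) -> random.Random:
    """Create a PRNG whose seed depends only on the source being compiled."""
    seed = int.from_bytes(hashlib.sha256(source).digest()[:8], "big")
    return random.Random(seed)


def pick_probe_inputs(source: bytes, count: int = 5) -> list:
    """Draw 'random' inputs that are identical on every run for the same source."""
    rng = seeded_rng(source)
    return [rng.randrange(1_000_000) for _ in range(count)]
```

        The inputs still look random, but the same source always yields the same sequence, so whatever decisions depend on them repeat exactly.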

        • taeric 6 years ago

          Yeah, that is what I guessed down thread. I just latched onto "non-deterministic."

      • TheDong 6 years ago

        Please provide a citation. To my knowledge, that is not true.

        Debian's reproducible build project has not encountered any issue with optimizations as far as I know, so it seems like your claim would need significant evidence.

        • taeric 6 years ago

          To be fair, I did have a qualifier there. :)

          I could have sworn I saw this discussed somewhere before, but I can't find an article now. I can find that some branch prediction is non-deterministic. Though, in all of those cases, you can specify the random seed, so not likely to be an actual problem.