How Not to Release Historic Source Code

77 points by zdw 12 days ago

nialv7 12 days ago

This article irritates me in several ways:

* Complaining about someone doing a good thing

  Remember, Microsoft doesn't have to release this! Microsoft could choose never to release it, and they would be completely within their rights to do so. Having something released at all is better than nothing. Calling it "not good enough" and asking for more sounds very entitled to me.

* Blaming the wrong tool

  Git doesn't care about file encoding; it handles binary files perfectly fine. The most it will do is convert line endings, if you ask it to. The binary data was likely botched during encoding conversion, not by git.

* There are much better ways to achieve what the author wants

  I can see 2 possibilities. Either the files were botched when they were imported into the repo, in which case the original files are intact and nothing is lost. They could just ask for the original files, and they will most likely get them. Or the files were like this even before they were imported, perhaps corrupted at some point in the past. In that case, writing an article to complain really won't achieve anything.
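A minimal sketch of the "encoding conversion, not git" point, in Python. It assumes the original files used a DOS code page such as CP437 (the thread does not say which one), and the sample bytes are invented for illustration. It shows that pushing non-UTF-8 bytes through a lossy UTF-8 decode destroys them, while a conversion using the correct code page can be undone exactly.

```python
# Invented sample: a comment line framed by CP437 box-drawing bytes
# (0xC4 is a horizontal line character in CP437).
original = b"; \xc4\xc4\xc4 MSDOS.ASM \xc4\xc4\xc4\r\n"


def lossy_utf8_roundtrip(data: bytes) -> bytes:
    """Decode as UTF-8 with replacement, then re-encode (the destructive path)."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


def correct_conversion(data: bytes) -> bytes:
    """Decode with the assumed code page, then encode as UTF-8 (reversible)."""
    return data.decode("cp437").encode("utf-8")


damaged = lossy_utf8_roundtrip(original)
converted = correct_conversion(original)

# Every invalid byte was replaced by U+FFFD (bytes EF BF BD), so the
# original byte values cannot be recovered from the damaged copy.
print(damaged != original)

# The code-page-aware conversion can be reversed byte for byte.
print(converted.decode("utf-8").encode("cp437") == original)
```

Storing `original` as-is, with no conversion step at all, would keep the bytes intact in any tool that treats files as opaque data.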

HeatrayEnjoyer 12 days ago

Legal rights maybe.

kazinator 12 days ago

> But please please don’t mutilate historic source code by shoving it into (stupid) git. First of all, git does not preserve timestamps, which causes irreversible damage.

Git isn't stupid. A version control system must touch files to the current time whenever it changes them, otherwise timestamp-based incremental builds won't be correct.

Git records dates on a commit (which is an entire baseline of files). You can set that to whatever you want.

You can use that to record the file times, by creating commits which add the specific files, using their exact times.
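A rough Python sketch of that idea, assuming one commit per file so each commit can carry that file's modification time. The helper names (`commit_env`, `plan_commits`, `import_with_times`) are hypothetical; the only Git features relied on are the `GIT_AUTHOR_DATE` and `GIT_COMMITTER_DATE` environment variables, which `git commit` honours.

```python
import os
import subprocess
from datetime import datetime, timezone
from email.utils import format_datetime


def commit_env(mtime: float) -> dict:
    """Environment that makes `git commit` record `mtime` as both dates."""
    # RFC 2822 style date, e.g. "Thu, 07 Apr 2005 22:13:13 +0000".
    stamp = format_datetime(datetime.fromtimestamp(mtime, tz=timezone.utc))
    env = dict(os.environ)
    env["GIT_AUTHOR_DATE"] = stamp
    env["GIT_COMMITTER_DATE"] = stamp
    return env


def plan_commits(repo_dir, paths):
    """Return (path, mtime) pairs, oldest first; paths are relative to repo_dir."""
    pairs = [(p, os.stat(os.path.join(repo_dir, p)).st_mtime) for p in paths]
    return sorted(pairs, key=lambda pair: pair[1])


def import_with_times(repo_dir, paths):
    """Hypothetical helper: commit each file using its own mtime as the date."""
    for path, mtime in plan_commits(repo_dir, paths):
        subprocess.run(["git", "add", "--", path], cwd=repo_dir, check=True)
        subprocess.run(
            ["git", "commit", "-m", f"Import {path}"],
            cwd=repo_dir,
            env=commit_env(mtime),
            check=True,
        )
```

Calling `import_with_times("/path/to/repo", ["MSDOS.ASM", "IO.ASM"])` (made-up names) would need an existing repository and files whose mtimes are still the original ones. The dates end up in the commit metadata; a later checkout still stamps the working files with the current time.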

codetrotter 12 days ago

Sounds like uploading a zip to the releases page of the GitHub page, and linking it from the readme would be a good way to fix this. Then you have something close to the original that OP wants, and also have the git repo that people can clone and fork and experiment with.

AshamedCaptain 12 days ago

I just find it ridiculous that a "Github" git repo would be easier to experiment with than a zip file. They have not imported any history, so git brings zero benefits. And they're not going to accept commits or issue reports, so Github doesn't bring much more either.
But this is Microsoft, so they'll feed you Github even when it just makes things more difficult.
- derefr 12 days ago
  
  > I just find it ridiculous that a "Github" git repo would be easier to experiment with than a zip file.
  I don't know if experimentation was the point, per se.
  Putting the files in a Github repo, at least puts the files on the web — where they're readable and linkable and syntax-highlit and internally searchable and web-spider indexable and Github-BigQuery-dataset-legible.
  I can read source code in a Github repo on my phone on the bus (and I sometimes do! In anger, in upstream repos when trying to debug sudden production issues while on my way home!) I can't really read code in a zip file until I get home to my laptop.
  
  AshamedCaptain 12 days ago
  
  > I can read source code in a Github repo on my phone on the bus (and I sometimes do! In anger, in upstream repos when trying to debug sudden production issues while on my way home!) I can't really read code in a zip file on my phone.
  I find this doubly ironic, because _every single phone_ I've ever had since the 90s has had no problem being able to download, unzip and view plain text files. Yet not a single phone I've ever had is able to decently browse the JS-loaded monster of a website that github.com is. Not even the one I'm using right now. Even on my desktop it is still much easier to view a plain text file, not to mention more responsive and reflowable, than the Github website.
  
  Zambyte 12 days ago
  
  Ok I'll bite, what kind of phone do you have? I have a phone from 6 years ago and I use the GitHub web UI on it almost daily. It's far from great but it certainly is usable.
  
  AshamedCaptain 12 days ago
  
  Why would a phone that can't browse Github.com be more troubling than a phone that can't open plain text files in a sub-megabyte .zip file, as derefr apparently has? The Github.com front page itself is already several megabytes. By your own text your 6 year old phone (more than capable) "is far from great". That's the point...
  My phone can check out this repository (or unzip the corresponding .zip file), and even build MS-DOS from it under an emulation layer. Browsing Github.com is on the other hand a struggle. But it should be obvious; there is _no way_ browsing plain text files wouldn't be far more responsive and pleasant than browsing such a loaded website. Github is just a disservice here.
  
  Zambyte 11 days ago
  
  > By your own text your 6 year old phone (more than capable) "is far from great".
  It's far from great more for design reasons than for performance reasons. I barely tolerate GitHub on my powerhouse of a desktop (I don't use it for any personal projects). If you want to download the repository to browse it locally, go for it.
  Still curious what kind of phone you have since you didn't answer that, and I still browse GitHub (reading issues, discussions, and some code) daily on my 6 year old phone.
  
  derefr 11 days ago
  
  I mean, I probably literally "can" read text files inside a .zip file on my iPhone. It's not that it's impossible for it to do that. But I don't think I currently have an app that knows how to do it. iOS Files.ipa doesn't do that.
  As such, starting from zero, to "explore" the code in a .zip file on my phone, I'd have to:
  1. download the .zip file in my browser;
  2. go into the App Store, seek out some alternative file browser app that supports archive extraction/browsing and text-file viewing and isn't some dumb scam, and install it;
  3. go back to the browser and/or the Files app, and "share" the downloaded archive file from there into this separate filer app I've downloaded (because such apps — other than Apple's specially-deigned one — can only see things handed into their specific sandbox.)
  4. open the filer app;
  5. open the archive inside the app (which would either extract it or traverse into it — really a roll of the dice†);
  6. open the file within the archive, and read it;
  7. remember to delete the archive, and potentially the extracted worktree from the archive, when I'm done.
  All this vs., clicking a link to the repo page; clicking a link on the repo page to one of the files in the repo; and reading. All just within the browser the phone already comes with. Maybe (in theory, if I didn't have a newer phone) chugging a bit as I scroll — but I'm reading the code, not searching it; I only need the page to refresh as quickly as I can take in the whole previous page.
  † And IIRC (from playing around with such filer apps with Android smart TVs), such filer apps are mostly going to only offer the ability to expand archives, rather than to traverse them. This is fine for tiny archives; but it's a very bad idea if you've got the sort of "several snapshots of internal builds/releases of the SDK scraped off of people's HDs" archive that you usually see in the corporate-software anthropology space. These snapshots usually each contain, among other things, binary tools and binary assets; and due to there being multiple snapshots in the archive, you've got a large number of redundant (often byte-identical!) copies of them (that due to the compression distance, haven't been compressed together.) Such archives therefore often add up to several GBs of data, and millions of files. You don't want to expand that onto your phone's teeny little 256GB disk just to poke around in one or two files!
  ---
  ...of course, that being said, when I'm on my laptop and on my home wi-fi, I don't read any more than trivial code on Github; that'd be silly.
  Instead, if I want to poke around the files of a repo, I just pop open a shell, clone the repo, and open the resulting worktree in an IDE.
  In theory, the best mobile workflow would do the same (but in a way that's more careful with mobile data. So maybe with a shallow git clone — or even a theoretical "thin client" clone that uses a FUSE-like layer in the app to fetch git objects one-at-a-time, just-in-time, as the IDE attempts to read them.)
  But note that the backend for that "best mobile workflow" would still be something that looks like https+git or git+ssh — not something that looks like an application/x-zip response from a webserver. So, in this ideal world, you'd still want the code to live in a hosted git repo somewhere!
  
  talldayo 12 days ago
  
  That sounds like a browser issue. I just whizzed through a 2000-SLOC file from llama.cpp on my 5 year old Android phone with no discernible hitching.
  
  ses1984 12 days ago
  
  Well at least from GitHub you can edit the url or click a button to download and view the raw text.
- novos 12 days ago
  
  It does make it easily browsable and linkable on the web. So you can't say it has no benefits.
- Zambyte 12 days ago
  
  I also found it confusing that they are doing version control via version numbered directories instead of using the version control system and tagging versions. If they don't think that's discoverable enough using the web UI... they should probably make it discoverable enough using the web UI.
  
  mynameisvlad 12 days ago
  
  Or they are trying to create point in time backups of disparate code bases that most certainly did not come from the same place and therefore have no notion of history. Using tags/branches does nothing but separate the versions needlessly.
  
  Zambyte 12 days ago
  
  Really they should just be separate repositories (because as you say, they are independent code bases), but I think if you want to force them to be in the same repository, they should at least be on different commits that are tagged instead of what they're currently doing.
  Whether or not you decide to base the commits on each other is something else that can be considered. If you really don't want the commits to be based on each other, you can just make new orphan branches.
1970-01-01 12 days ago

If this is going to occur on a regular basis, we will eventually run into copy protection mechanisms, which are also worth preserving. A bit-for-bit copy, uploaded to something like museum.github.com would be a much better archival record.
https://nerdlypleasures.blogspot.com/2015/11/ibm-pc-floppy-d...

jwnin 12 days ago

Mistakes were made, but articles like this don't incentivize volunteers to continue to spend more time on preservation work.

rnd0 12 days ago

I agree. Reading it, especially given that people are working on fixes (a buildable version of the code is already out there), it seems a bit aggressive.
Getting corporations to volunteer to take PR and legal risks to share things with the public is hard enough. I understand his frustration (this is the third time it's happened) but he could really have stood to dial it down a notch IMO.
Descon 12 days ago

Yeah, this reads like an unnecessarily harsh cr of what was probably a really cool project for an intern. This kind of attitude can really turn people off of sharing at all.

gtirloni 12 days ago

> First of all, git does not preserve timestamps, which causes irreversible damage

> the misguided use of git not only made some comment lines too long for MASM, but it also actively destroyed the original source code.

Nothing was mutilated or destroyed. These are not paper records. Nothing was lost. Someone just has to also release it as a zip.

LocalH 12 days ago

Original time stamps were intentionally destroyed. From a comment by starfrost on the article:
"The reason I cannot do timestamps is because data protection law mandates anonymisation of source files, at least that is the policy."
- skissane 12 days ago
  
  I don’t think timestamps themselves are PII, they tell you when but not who.
  I think the problem is that legal is demanding they edit the files to remove names of former employees, since that is PII, and doing so destroys the timestamps (and can damage the file format if not done carefully with respect to encodings etc.)
  Simple solution: upload a DIR of the original floppies. That will tell us the original timestamps, without revealing that PII
  Advanced solution: someone create a toolkit for redacting disk images which records the exact byte locations redacted. The whole “former employee names are PII so we have to censor them” thing is a bit stupid, but sometimes you get further by doing the best you can within legal’s constraints instead of fighting them
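A minimal Python sketch of what the core of such a redaction toolkit might look like. The function name `redact`, the filler byte, and the sample name "J. DOE" are all invented for illustration. The approach overwrites each match with filler of the same length, so every other byte keeps its original offset, and it returns a log of the exact spans touched.

```python
def redact(data: bytes, secrets, filler: bytes = b"X"):
    """Overwrite every occurrence of each secret with same-length filler.

    Returns (redacted_bytes, log), where log is a sorted list of
    (offset, length) pairs for each overwritten span. The total length
    never changes, so untouched bytes stay at their original offsets.
    """
    if len(filler) != 1:
        raise ValueError("filler must be exactly one byte")
    out = bytearray(data)
    log = []
    for secret in secrets:
        if not secret:
            continue  # an empty pattern would match everywhere
        start = data.find(secret)
        while start != -1:
            out[start:start + len(secret)] = filler * len(secret)
            log.append((start, len(secret)))
            start = data.find(secret, start + len(secret))
    return bytes(out), sorted(log)


# Invented sample standing in for a disk image or source file.
image = b"; Written by J. DOE 1983\r\nMOV AX,BX ; J. DOE\r\n"
redacted, log = redact(image, [b"J. DOE"])
print(redacted)
print(log)
```

Publishing the log next to the redacted image would let anyone verify that only the listed byte ranges differ from the original, without revealing what was in them.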
  
  theoldlove 12 days ago
  
  Seems weird to remove the names of former employees. What’s sensitive about that information? In lots of other industries people insist on proper credits acknowledging their work.
  
  skissane 12 days ago
  
  All it takes is one person in legal over-reacting
  And getting this done is relying on the good will of the corporation, a lot of other corps will just say point-blank “No way”. The risk with pushing back against legal, even if they are overreacting, is the end result could be no more releases like this
  
  tomjakubowski 12 days ago
  
  Personally I'd rather disavow at least 50% of the code I've shipped than take credit for it
  
  1970-01-01 12 days ago
  
  It was common for MS to include credits, but only as an Easter egg:
  https://en.wikipedia.org/wiki/List_of_Easter_eggs_in_Microso...
- userbinator 12 days ago
  
  That makes zero sense. Why erase the timestamps when, according to other comments on the source release here on HN, there are still authors' names and initials in the files?
  It's not like we don't know the people who wrote MS-DOS either.
- autoexec 12 days ago
  
  I'm not sure what 'anonymisation of source files' even entails. We know what the files are, and where they came from. I could see scrubbing comments for specific names or something but what more would accurate timestamps tell us except when the files were originally last modified? That information is useful for historical interest reasons but what are the risks? Anyone know what the data protection law is that requires this?
kimixa 12 days ago

Also the files have been modified to remove some personal information - as discussed on the last HN post about this.
For "purists", should the timestamp now be the time those modifications were made? Surely backdating files with modified contents isn't correct.
And brave to assume modified times are actually stored somewhere accurately - how many files from over 30 years ago do you have that kept perfect metadata through multiple storage and filesystem changes?

accrual 12 days ago

I personally would like not just a .zip but images (.bin/.img/.ima) of the floppy disks. In my opinion that would be the best way to preserve the source, outside of maybe a flux image. Having everything in GitHub is naturally the easiest way to browse and interact with the source, but purely for preservation, I'd love to have the images.

saagarjha 12 days ago

Surely Git supports files with arbitrary bytes in them? Why would it cause issues with those files?

WirelessGigabit 12 days ago

Because Git sometimes tries to be too smart with line endings.
- Xelynega 12 days ago
  
  Which has nothing to do with the UTF-8 encoding the writer of the article blamed, and is an optional feature you have to configure.
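To make the line-ending point concrete, here is a simplified Python model of CRLF normalization. It is not Git's actual implementation, just an illustration of the kind of conversion that happens when the feature is switched on (for example through `core.autocrlf` or `text` attributes). The sample bytes are invented.

```python
def to_lf(data: bytes) -> bytes:
    """Simplified model of check-in normalization: CRLF becomes LF."""
    return data.replace(b"\r\n", b"\n")


def to_crlf(data: bytes) -> bytes:
    """Simplified model of check-out conversion: every LF becomes CRLF."""
    return to_lf(data).replace(b"\n", b"\r\n")


consistent = b"MOV AX,BX\r\nRET\r\n"
mixed = b"MOV AX,BX\r\nRET\n"  # last line ends with a bare LF

# A file with uniform CRLF endings survives the round trip.
print(to_crlf(to_lf(consistent)) == consistent)  # True

# A file with mixed endings does not: the bare LF comes back as CRLF.
print(to_crlf(to_lf(mixed)) == mixed)  # False
```

With no conversion configured, or with a `.gitattributes` entry such as `* -text`, the bytes are stored and checked out unchanged.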