Parsing Excel Spreadsheets with Swift's Codable Protocols

103 points by maxdesiatov 5 years ago

sheetjs 5 years ago

First off: awesome work!

Correctly reading the data from XLSX is a lot more complex than described or implemented here, mostly because Excel is so robust in reading files and there are many sloppy writers. If you're interested, there's a many-thousand page ECMA-376 specification: https://www.ecma-international.org/publications/standards/Ec...

- to correctly get the first worksheet, you actually need to parse the workbook.xml file and look into the sheets array to find the corresponding relationship IDs. This is explained in section 18.2.20 (page 1579 in the part 1 PDF). iOS Numbers used to write worksheets in the opposite order, which messes up the naive attempt to read the relationships file in order.

- the attribute "s" in a cell is an index into the styles table, while the cell type "s" corresponds to the shared string table. If you're curious, its in section 18.1.3.4 (page 1604 in the part 1 PDF)

PS: We build and maintain parsers and writers for spreadsheets in JavaScript (https://github.com/SheetJS/js-xlsx/ is our most popular project), including a CLI script to convert files to CSV. Some of our users use JSC in the context of Swift applications, ingesting data in JS and returning a CSV for further processing in Swift.

ergothus 5 years ago

> there's a many-thousand page ECMA-376
Honestly, I'd love to see if there are organizational tips on managing (and using!) a document that large.
I feel like technical writing is the closest to coding we get in plain languages, but there are still critical differences. (technical writing is trying to give instruction to a human, while coding is giving instructions to code while giving a lot more context and description to a human). Specs cross this line a bit more - it's giving descriptions to a human with the intention of giving instructions)
Unfortunately, while I can find good code and bad code, I tend to find bad specs and WORSE specs. I can see progress (the various HTML5 and related specs are vastly better than previous versions, for example), but anytime I go in with a question (which is admittedly rare) I spend a lot of time finding the salient part compared to related-but-missing-the-vital-piece part, which is actually the exact same problem that I think the most common problem in maintainable code: making it easy to not only know how, but WHERE.
Are there lessons from specs we can learn? Do the good ones have some sort of "concept" section that makes the reading of it easier? Does each subsection do that?
- ken 5 years ago
  
  When I worked at Boeing, technical specs were built in a giant object/hierarchical database. There would be objects for Requirements, and in the design you could link Systems to the Requirements that they implemented. Then you could trace back from the design to see what requirements weren't implemented, or what systems had no apparent purpose, or who edited a particular requirement last.
  When they needed a hard copy of The Spec, they could export the hierarchy of requirements (automatically numbered, of course) as HTML or MSWord or whatever.
  It was, of course, a giant nearly-unreadable mess, like "Req 1.2.3.4.5: The Foo system shall have a Bar module.", and then 25 more sub-requirements that all start out with "The Foo module's Bar subsystem shall ___" -- and made no sense unless you had just read the previous 30 pages, anyway.
  It was not a good system, but it had hints of a good system in it. I think it was a good concept but the UI was terrible and people didn't seem to take much care when working on it.
  
  Serow225 5 years ago
  
  DOORS ?
  
  watersb 5 years ago
  
  Ahh! I had not encountered that one for 20 years. Wow.
- zimablue 5 years ago
  
  The advice is that it should only exist in machine-readable form and the reason why Microsoft does it this way is to ensure lock-in. See also MSSQL, for which there exists no complete machine-readable spec ANYWHERE (as far as I could tell when I spent 2 days looking a few years ago).
  
  13of40 5 years ago
  
  >lock-in
  A company that spends millions of dollars employing technical writers to publicly document a format probably isn't conspiring to keep that format secret. Maybe the macaronis aren't the shape you wanted but you got the macaronis.
  
  scarejunba 5 years ago
  
  IIRC it was in response to many government sources requiring open formats (a good instinct) so just because they did it doesn’t mean they wanted to. They may have been forced to, and done the minimum as a result.
  
  userbinator 5 years ago
  
  I believe it was the antitrust lawsuits that forced them too, and it shows; if you look at all the MS protocol/format documents and compare them to something like RFCs and ITU/IEEE/ANSI standards, the MS docs are noticeably harder to read with their verbosity, weird syntax notations and conventions, and almost look as if they were deliberately obfuscated.
- edrocks 5 years ago
  
  I wrote an XLSX(spreadsheet) writer in golang a few years ago and still maintain it. I also deal with a bunch of other several hundred to thousand page docs semi regularly and the best advice I have is to make use of the index, bookmarking pages, and a ton of cmd+f searching for various keywords.
  I was talking with a lawyer friend the other day who confirmed a good index is really a must for long documents.
  
  ergothus 5 years ago
  
  In my experience, a good index doesn't just show where any usage is, it also covers where it is significant (bolding, sometimes subcategorizing).
  In the big texts (granted, for me these are almost always RPG books, but I've read/used a lot of those) that difference is essential, and the difference between the good ones (e.g. GURPS) and the bad ones (most WW books...the ones that HAVE indexes) is very noticeable and definitely impacts not just ease-of-use, but effective-ness-of-use. Being able to really get/refresh uses of different bits has a direct impact on whether I apply those concepts consistently and correctly or whether I do something that works well enough.
  This definitely describes what happens when changing code too, but we don't have that same option for "significant usage". We can get every usage, usage numbers counted by file, and definitions, but not when a use is significant. We rely on tools to get those numbers, because any kind of manually supported index (of code) is doomed to failure, much as most texts fail to have indexes, or at least fail to have very useful ones.
  I wonder if that "significant usage" is something we can do something about. What does that even mean? I've been looking at code linguistically a lot and I definitely can see how we can use syntax to better indicate the focus of code (vs trivial but necessary side bits), is there a way to mechanically note that? It would still require coders to write that way, but frankly I feel we need better (or at least more clear) best practices on that front anyway. If we expand our lexicon of constructs, and mechanically make use of it, it becomes testable, reliable, and still more communicative that the current "this feels good to me, therefore I declare the code 'more readable'" approach.
- amaccuish 5 years ago
  
  > Are there lessons from specs we can learn? Do the good ones have some sort of "concept" section that makes the reading of it easier? Does each subsection do that?
  Look at the OpenDocument Format. Does the same as Office Open XML (the microsoft one) in only a few hundred pages.
  
  tzs 5 years ago
  
  ODF is a bit more than a few hundred pages. Here's the page counts for ODF 1.2:
  102 pages, OpenDocument-v1.2-os.pdf 846 pages, OpenDocument-v1.2-os-part1.pdf 234 pages, OpenDocument-v1.2-os-part2.pdf 35 pages, OpenDocument-v1.2-os-part3.pdf
  That's 1217 pages.
  It is quite a bit smaller than OOXML. OOXML is about 5-6 times the size of ODF. There are three reasons for this.
  1. ODF has cleaner and leaner markup. OOXML is uglier and more verbose, leading to more things that the spec has to document..
  2. The OOXML spec goes into more detail for a lot of things that the two have in common.
  3. The OOXML spec seems to have a lot more introductory or primer material.
  The impression I got from looking at both, but not actually trying to use either of them, is that if I had to implement a full featured office suite solely from the spec, with no reference implementations or example documents, it would be easier if I picked ODF, but would probably have better interoperability if I picked OOXML. The strongest impression I got, from both of them, though, is that there is no freaking way I want to implement an office suite!
maxdesiatov 5 years ago

I appreciate such detailed feedback, thank you! Parsing of workbook.xml is implemented in the library itself in `parseWorksheetPaths` function I mentioned it briefly in the article, but decided to omit it as main focus was on Codable protocols.
I will definitely update the "s" attribute parsing to have a more sensible name. Will also link to the standard from the README file, although not sure that will help with a document of this size.
marcruser 5 years ago

I laughed when I started reading this comment and then looked at your username. I've used your "js-xlsx" library and stepped through quite a bit of the code. I still can't really understand how you begun to write that library, and I'm curious how you approach reading the Open XML documentation. Do you have a large team of engineers maintaining that codebase?
Unknoob 5 years ago

Looks like this guy excels in his field.
Jokes aside, I can't fathom having to consult a thousand page manual to deal with this stuff, I can barely read the README.md for a framework that I want to include in my project.

conradev 5 years ago

The one thing I don't love about Swift's Codable is the lack of customizability in the "magic" part: the part where the compiler generates the Encodable/Decodable implementations. Most notably, the compiler can't generate implementations for enums. The only thing that Swift supports customizing without fully implementing the methods for Encodable and Decodable is the name of the keys, using a custom CodingKeys type.

Serde, an equivalent third-party crate in Rust, supports a lot of customization which I find invaluable. It can (de)serialize values like this with ease:

    {"type": "location", "value": {"latitude": 0, "longitude": 0}}

in a very small amount of code:

    #[derive(Serialize, Deserialize)]
    #[serde(tag = "type", content = "value")]
    #[serde(rename_all = "lowercase")]
    enum Value {
        Location { latitude: f64, longitude: f64 },
        String(String),
        ...
    }

Serde also supports customizing serialization on a per-field basis without having to implement the entire protocol, which is nice:

    #[derive(Deserialize, Serialize)]
    struct Record {
        #[serde(with = "chrono::serde::ts_seconds")]
        updated: DateTime<Utc>
    }

I really hope that Swift has better ways (like the above) to customize Codable in the future. I find myself implementing the protocol myself in 90% of cases, whereas I very rarely have to do that for Serde.

wild_preference 5 years ago

Serde offers more out of the box functionality with its annotations, but it's also bit uglier than Swift's Codable when you do need to implement it yourself.
If it wasn't for a blog post that gave some simple examples, I couldn't even figure it out myself.
Possibly not fair to compare them this way, though. No clue what the trade-offs are.
- conradev 5 years ago
  
  That's very fair criticism. Implementing custom serde serializers is painful, but thankfully I only have to do it in rare circumstances.
maxdesiatov 5 years ago

The main reason for that is lack of hygienic macros in Swift. Currently you can use code generation tools like Sourcery and SwiftGen, but I expect 1st-class meta-programming support to come after Swift 5.0 release. After ABI stability I imagine macros are pretty high on the priority list of the core team.
- conradev 5 years ago
  
  They might also go in the opposite direction and add runtime APIs to construct types dynamically. Various initiatives (like the Python interop for TensorFlow) are pulling Swift in that direction:
  https://github.com/apple/swift-evolution/blob/master/proposa...
  https://github.com/apple/swift-evolution/blob/master/proposa...
- slavapestov 5 years ago
  
  I have a hard time envisioning full-fledged macros coming to Swift, but moving Equatable/Hashable/Codable synthesis to a library should be possible with more restricted meta-programming capabilites. See Joe Groff's talk for details: https://www.skilled.io/u/swiftsummit/swift-s-reflective-unde...
  
  saagarjha 5 years ago
  
  > I have a hard time envisioning full-fledged macros coming to Swift, but moving Equatable/Hashable/Codable synthesis to a library should be possible with more restricted meta-programming capabilites.
  I've heard many people on the mailing lists talk about wanting to add hygienic macros–so when you're talking about "full-fledged macros", are you talking about unhygienic or hygienic ones?
  > See Joe Groff's talk for details
  I'm actually curious how much is possible, in terms of reflection capability, both currently and in the future. How close can I get to the dynamism of say the Objective-C runtime? Can I grab a function pointer knowing its mangled name? Can I list every class in a binary by consulting the runtime metadata? Can I "swizzle"?
  
  slavapestov 5 years ago
  
  > How close can I get to the dynamism of say the Objective-C runtime?
  Well, right now the compiler emits a lot of metadata that is mostly used for runtime generics, dynamic casting, and the (somewhat limited) Mirror type. But these features have enough generality that a surprising amount of stuff has to be encoded. However there’s no nice API for looking at it yet.
  > Can I grab a function pointer knowing its mangled name?
  That’s just dlsym().
  > Can I list every class in a binary by consulting the runtime metadata?
  The metadata is there, but there’s no exposed API for doing this.
  > Can I "swizzle"?
  There’s an experimental thing for this now: https://github.com/apple/swift/pull/20333
  
  saagarjha 5 years ago
  
  > > Can I grab a function pointer knowing its mangled name?
  > That’s just dlsym().
  Sorry, I should have been more clear: can I grab a function pointer to an unexported function?
  > > Can I "swizzle"?
  > There’s an experimental thing for this now: https://github.com/apple/swift/pull/20333
  This is interesting; I haven't been really been following the lists recently but I took the time to read the linked thread and the pull request. While the functionality contained in the pull request is interesting, it's not quite the same as swizzling since it's done unconditionally at load time rather than runtime (so it's much more similar to DYLD_INTERPOSE in that sense). While this covers many of the cases when swizzling is necessary, it leaves out a rather important one where the correct method override is selected at runtime, generally conditionally.
  Also, as a sidenote, it seems that there is some sort of motivation to have compiler type checks for the replacement, i.e. @_dynamicReplacement(for: bar())–how will this actually work in practice? If I compile a bundle without access to the source of the application I'm going to be loaded into, how would bar() be accessible to the compiler at all?

oflannabhra 5 years ago

The Codable protocol in Swift is game-changing. Max Howell (homebrew author) recently started a series of articles[1] in which he used Codable structs as the shared data model in both the backend and frontend of his new app Canopy. Even though communication is through HTTP and JSON, he never even has to touch it.

[1] - https://medium.com/@mxcl/server-side-swift-making-canopy-2ed...

mpweiher 5 years ago

> Codable protocol in Swift is game-changing
Which is really weird, considering Objective-C had automatic "activation/passivation" from the beginning (early 80s), and it was pretty trivial to adapt similar mechanisms later.
- maxdesiatov 5 years ago
  
  Except that in Objective-C it's a runtime feature, which is by definition slower. Swift's Codable implementation is generated by the compiler or is hand-written with an obvious benefit of a stronger type system.
  
  mpweiher 5 years ago
  
  Have you measured this? (I have)
aaaaaaaaaab 5 years ago

Data model != request/response structs
It’s a very bad idea to intermingle the two.