fmap 7 years ago

Some people have mentioned parser generators, but so far nobody has mentioned Menhir (http://gallium.inria.fr/~fpottier/menhir/).

It's an LR(1) parser generator for OCaml and Coq with a lot of extremely interesting features, such as genuinely good debugging support for grammars and the ability to generate error messages by example.

What this means is that after you write down your grammar, Menhir will give you examples of all the possible syntax errors that could occur. You can then write error messages for each case and get a parser with built-in error reporting for syntax errors.

This works a lot better than you'd think and I really wonder why nobody else implements this feature. Or for that matter, why they're not advertising it on the webpage! If you want to know more, look in the manual, section 11.

YorickPeterse 7 years ago

Having written a parser generator myself (https://github.com/YorickPeterse/ruby-ll), plenty of handwritten parsers, and having messed with parser combinators, I've concluded that I really don't like parser combinators. Perhaps this was due to the APIs of the libraries I worked with (e.g. Nom in Rust), but they just led to incredibly verbose and hard-to-work-with code.

Since writing recursive descent parsers is not that hard (at least if your grammar is LL(1); see https://github.com/YorickPeterse/inko/blob/a76a8c23f901c5b2a... for an example), I would personally go with this approach whenever possible.
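
A minimal sketch of the style (Python here for brevity; the linked parser is for Inko and written differently): one function per grammar rule, each deciding what to do from a single token of lookahead.

    # Recursive descent sketch for the LL(1) grammar:
    #   list -> INT (',' INT)*
    # Tokens are (kind, text) pairs, e.g. ("INT", "42") -- a made-up format.
    class Parser:
        def __init__(self, tokens):
            self.tokens = tokens
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def expect(self, kind):
            tok = self.peek()
            if tok is None or tok[0] != kind:
                raise SyntaxError(f"expected {kind}, got {tok}")
            self.pos += 1
            return tok

        def parse_list(self):
            values = [int(self.expect("INT")[1])]
            while self.peek() is not None and self.peek()[0] == ",":
                self.expect(",")
                values.append(int(self.expect("INT")[1]))
            return values

    # Parser([("INT", "1"), (",", ","), ("INT", "2")]).parse_list() == [1, 2]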

Parser generators definitely have their use cases (and allow you to write a parser much faster), but error reporting is often tricky to get right.

  • saltcured 7 years ago

    I think parser generators are generally easier to use with dynamic languages. Most of the code-generation tooling can be hidden by interpreters or JIT compilers, and dynamic typing and generics can make the action code much more concise.

    I learned and used bison+flex back in the 90s. More recently, I've enjoyed the PLY package for Python. It's essentially the same algorithms and abstractions, but it turns the usual bison grammar inside-out by embedding grammar fragments in your Python action code instead of embedding action code in a grammar file.

    Your grammar is a Python module. You write Python function definitions which feel almost like you are writing the interesting subset of a recursive-descent parser, i.e. the important reduction steps, while hand-waving about all the input-testing and buffering you would have to do in a bespoke recursive descent parser. The code comment on each function attaches the grammatical rule which fires that action. The PLY parser-generator function you call at the end of your grammar module uses Python introspection to find all the production rules and stitch the parser together.
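
    To make that concrete, a toy PLY grammar module might look something like this (a hypothetical fragment, not from a real project):

        import ply.lex as lex
        import ply.yacc as yacc

        # Lexer: a tuple of token names, then one rule per token.
        tokens = ("NUMBER", "PLUS")
        t_PLUS = r"\+"
        t_ignore = " \t"

        def t_NUMBER(t):
            r"\d+"
            t.value = int(t.value)
            return t

        def t_error(t):
            raise SyntaxError(f"bad character {t.value[0]!r}")

        # Parser: the docstring on each p_* function is the production it fires on.
        def p_expr_plus(p):
            "expr : expr PLUS term"
            p[0] = p[1] + p[3]

        def p_expr_term(p):
            "expr : term"
            p[0] = p[1]

        def p_term_number(p):
            "term : NUMBER"
            p[0] = p[1]

        def p_error(p):
            raise SyntaxError(f"syntax error at {p!r}")

        lexer = lex.lex()     # introspects the t_* definitions
        parser = yacc.yacc()  # introspects the p_* functions
        # parser.parse("1 + 2 + 3", lexer=lexer) == 6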

    My only coding style complaint is how PLY abuses the function __doc__ strings to embed the grammar rules. I think using a decorator interface would have been a more "pythonic" interface to attach the grammatical rules to each action function...

    • dnr 7 years ago

      I used PLY to write a parser for a moderately complex internal language a while ago at work and it was a great experience. I ended up with very readable code. (And with only a little hackery I was able to write a tool to perform source code transformations while preserving comments and formatting around the changes.) I'd definitely use it again. The only potential downside is speed: it's Python, so there's only so fast it can go compared to other languages.

  • bd82 7 years ago

    I also slightly dislike parser combinators, mainly because you usually cannot easily debug them by placing a breakpoint.

    In the best-case scenario, the specific parser combinator library has good tracing support to enable debugging. However, imho this is still inferior to simple breakpoints and debugging "directly" in one's favorite IDE.

    • aidenn0 7 years ago

      The only parser combinator library I've used (smug, for Common Lisp) had little trouble with breakpoints (any parse with significant backtracking was annoying, but breakpoints worked fine). Is it common that one cannot use breakpoints for debugging in other parser-combinator libraries?

      • bd82 7 years ago

        I'm not familiar with Smug, but I would guess that the combinators are implemented using Lisp macros which are then expanded into "real" source code (Lisp lists) that is evaluated directly and can halt on breakpoints.

        In programming languages which do not have true macros, the combinator API normally creates a data structure representing the grammar which is then interpreted, which usually makes debugging harder and the parsing slower.
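
        A crude illustration of the difference (hypothetical Python shapes, just to show where a breakpoint can land):

            # Style 1: combinators return plain closures. A breakpoint inside
            # this `parse` stops at exactly this rule.
            def literal(s):
                def parse(text, pos):
                    return (s, pos + len(s)) if text.startswith(s, pos) else None
                return parse

            # Style 2: combinators build a grammar data structure, and every
            # rule runs through this one interpret() frame, so a breakpoint
            # here fires for the entire grammar at once.
            def interpret(node, text, pos):
                kind, arg = node
                if kind == "lit":
                    return (arg, pos + len(arg)) if text.startswith(arg, pos) else None
                if kind == "seq":
                    out = []
                    for child in arg:
                        r = interpret(child, text, pos)
                        if r is None:
                            return None
                        value, pos = r
                        out.append(value)
                    return out, pos
                raise ValueError(f"unknown node kind {kind!r}")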

    • kmicklas 7 years ago

      I would argue that if you're trying to debug a parser with a tracing debugger, you're doing something wrong.

  • abecedarius 7 years ago

    A design I like combines a combinator library with an optional facade that gives you a convenient grammar syntax a la a parser generator -- for example, Lua's LPEG works this way. LPEG is a little awkward imo in how it integrates semantic actions, and I tried to do better in my own library (which unfortunately isn't really ready to share and has been on the back burner).

gnuvince 7 years ago

I tried to use nom for one of my projects, but I found that it had two major problems. First, the macro syntax was hard to use, hard to learn, and hard to debug. Sometimes you had to put an expression in parentheses, even when it seemed unnecessary, just so the macros could parse your code. I also found the macros hard to compose, which is ironic for a parser combinator library.

My other problem was with error handling. I wanted to have my own `Error` enum to represent errors in my program. Nom supports a custom error type, but I now had to write out type annotations in many places with the turbofish syntax (e.g. foo::<&[u8], Ast, Error>()), and they are quite long annotations. In addition, some of the macros would not work—or at least, I couldn't get them to work—with my custom error type, and I had to revert to using functions in a way that was absolutely not compositional.

In the end, since the format I was parsing was simple and regular (Erlang's External Term Format), I wrote a parser by hand and removed the dependency on nom. The result is just as fast and produces meaningful error messages. It's actually less tedious, because I am using simple functions that have the syntax and semantics that I already know; maybe if I had a more complex format to parse I would sing a different tune, but for the moment I prefer to avoid using nom.
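
For flavor, the hand-rolled approach for a couple of ETF tags really is just this kind of thing (a Python sketch of the idea, not my actual Rust code; tag values are from the ETF documentation):

    import struct

    def parse_etf(data):
        if data[0] != 131:  # every external term starts with version byte 131
            raise ValueError("not an external term")
        term, _ = parse_term(data, 1)
        return term

    def parse_term(data, pos):
        tag = data[pos]
        if tag == 97:  # SMALL_INTEGER_EXT: one unsigned byte
            return data[pos + 1], pos + 2
        if tag == 98:  # INTEGER_EXT: 32-bit big-endian signed integer
            (n,) = struct.unpack_from(">i", data, pos + 1)
            return n, pos + 5
        raise ValueError(f"unsupported tag {tag} at offset {pos}")

    # parse_etf(bytes([131, 97, 42])) == 42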

  • oever 7 years ago

    I've used nom to implement a Turtle parser. The result was fast (it runs the test suite in 50ms) and the code is pretty readable. Certainly much more readable than if I'd used a homegrown set of functions. I've documented the specification grammar rules next to the nom rules. They match up well.

    Nom can be tricky to debug, but Rust can host tests in the same file as the code, so debugging by writing new tests is convenient.

    The syntax takes some getting used to. The hardest part for me was learning to order the grammar alternatives correctly. The order matters for performance and correctness.

shakna 7 years ago

> Many low-level parsers ‘in the wild’ are written by hand, often because they need to deal with binary formats, or in a quest for better performance.

I think the main reason for hand-rolled parsers, at least in my experience, has been error reporting.

Most parsing frameworks make it difficult for you to tell the user simple things like a misspelled variable, or that something does exist but isn't in scope where it's being called, and basically helpful messages like that.

Error management is only briefly mentioned, and sort of glossed over in Nom's [0] paper. Mostly because Nom was designed for efficiency... But efficiency is not why GCC rolled their own parsers.

[0] http://spw15.langsec.org/papers/couprie-nom.pdf

  • mikhailfranco 7 years ago

    It's possible to provide rich information and flexible error reporting using error combinators, which merge results from failed parser combinators. The error combinators are in some sense dual to the parser combinators.
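
    A sketch of the idea (hypothetical Python API, not any particular library): an alternation combinator whose failure case merges the failures of its branches, keeping the furthest position reached and the union of what was expected there.

        # Parsers return ("ok", value, pos) or ("err", pos, expected_set).
        def alt(p1, p2):
            def parse(text, pos):
                r1 = p1(text, pos)
                if r1[0] == "ok":
                    return r1
                r2 = p2(text, pos)
                if r2[0] == "ok":
                    return r2
                # The error-merging part: prefer the branch that got further;
                # on a tie, union the expectations into one richer error.
                _, pos1, exp1 = r1
                _, pos2, exp2 = r2
                if pos1 != pos2:
                    return r1 if pos1 > pos2 else r2
                return ("err", pos1, exp1 | exp2)
            return parse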

    • shakna 7 years ago

      How could an error combinator possibly approach something like a misnamed variable? It would need access to the current scope's environment, or a version of it, to suggest what the spelling error might be. Not-yet-in-scope errors are similar, but need both past and future scopes as well.

      I don't know of any parser combinator that can handle those very human mistakes, because they require partial evaluation to understand.

      I might be mistaken, but if I'm not, parser combinators are great for saving time, but if you have the time to do it right, then a hand-rolled parser will give your users a much better experience.

      • tomp 7 years ago

        Misnamed variables are actually one of the simplest problems to deal with while parsing. The solution is simple - you don't! The purpose of the parser is to parse the syntax, so you can simply use it to build a syntax tree and postpone the resolution of names until the next step in the compiler's pipeline (also useful if your language supports macros).
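
        A sketch of what that looks like (hypothetical Python, made-up AST shape): the parser emits bare identifier nodes, and a later pass that actually has the scope in hand produces the friendly message.

            import difflib

            # The parser only records ("var", name, position) nodes; name
            # resolution happens in a separate pass over the finished tree.
            def check_names(node, scope):
                if node[0] == "var":
                    _, name, pos = node
                    if name not in scope:
                        hint = difflib.get_close_matches(name, scope, n=1)
                        extra = f", did you mean '{hint[0]}'?" if hint else ""
                        raise NameError(f"unknown name '{name}' at {pos}{extra}")
                elif node[0] == "call":
                    for child in node[1]:
                        check_names(child, scope)

            # check_names(("var", "aply", 7), ["apply", "list"])
            # -> NameError: unknown name 'aply' at 7, did you mean 'apply'?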

        • thesz 7 years ago

          In some areas of the world (VHDL/Ada), you have to maintain a symbol table for correct lexing.

          These languages have constructions like OBJECT_NAME'ATTRIBUTE (the object name can be anything with a visible name in the scope, and the attributes depend on its kind) and RECORD_TYPE_NAME'SLICE, where we create a value of RECORD_TYPE_NAME using a slice, which is a collection of values for the record's fields.

          There are also character literals, which have forms like 'a', '1', '#', ''''. Just what you would expect here.

          The combination of record value construction and lexer forms for character literals makes it hard to parse something like MY_RECORD'('1','0',false, 10).

          A symbol table and/or dynamic parser construction allows you to parse these texts easily and specify the syntax straightforwardly. The other approach, with construction of a separate AST, complicates things a lot.

        • pornel 7 years ago

          C is a well-known exception to this. The grammar depends on typedef name resolution. `foo * bar` and `(foo)-1` parse as different things depending on whether `foo` is a typedef or a variable.

  • gizmo686 7 years ago

    >Most parsing frameworks make it difficult for you to tell the user simple things like a misspelled variable, or that something does exist but isn't in scope where it's being called, and basically helpful messages like that.

    I agree that generated parsers make error reporting difficult, but I do not think these examples are relevant. If the problem is a misspelled or out-of-scope identifier, the parser should still be able to parse the program, which would be syntactically valid.

    • nokcha 7 years ago

      > If the problem is a misspelled or out-of-scope identifier, the parser should still be able to parse the program

      That's not necessarily true for all languages. For example, in C, "(e1)&e2" is parsed differently depending on how e1 is declared. If e1 is declared as

          typedef int* e1;
      
      then the "&" in "(e1)&e2" is parsed as the address-of operator, but if e1 is declared as

          int e1;
      
      then "(e1)&e2" is parsed as a bitwise AND.

      • swift 7 years ago

        This is a problem for a lot of languages, unfortunately, and it breaks the clean separation between syntax and semantics that we're hoping for when we write a formal grammar.

        I've had some success in the past with using GLR or another algorithm that can handle ambiguity, and then choosing among the possible parse trees in another pass that takes semantic information into account. How applicable that is really depends on the language, though; if things are so ambiguous that you're getting an exponential growth in possible parse trees, you may not want to use this approach.

    • shakna 7 years ago

      > which would be syntactically valid.

      Not necessarily.

      Usually, we're used to seeing something like:

          print("Something")
      
      But there are statement based languages where instead you have:

          print "Something"
      
      Still simple enough to parse.

      However, what about user function calls?

          something 1 2
          3 4
      
      If a newline doesn't break a call, and you don't depend on indentation, but instead on the arity of the original function definition, accidentally writing:

          someting 1, 2
          3 4
      
      Can be impossible to parse correctly, especially if said language also supports first-class functions.

          func apply with func:caller array:values
            do
              ...code...
            done
      
          aply + list
            do
              10 10 10 10
            done
      
      The syntax in the above example is still correct, but because the arity of 'aply' can't be determined, the parser can't actually continue.

      The grammar might be a little insane, but I've had to work with similar grammars in several financial languages. And the parser does poop out, and when it does, you really want a nice error to result.

nly 7 years ago

It's not just about the language or tools you use. You can produce a memory-safe, beautifully structured parser and use whatever framework you like, but if it doesn't do full recognition of your input before you do any processing then you're likely boned.

Not only is there a high risk that you'll still be passing bad input through to your database, another program, or an unsafe library deep down in your application (YAML in Ruby anyone?), but you get what Meredith Patterson calls a "weird machine" that is programmable by an attacker.

There's a lot more of this at:

http://langsec.org/

Stacks of papers, articles, videos, examples etc.

  • CarolineW 7 years ago

    > ... if it doesn't do full recognition of your input before you do any processing then you're likely boned.

    There are languages where you need to process some of what you've already parsed before a later part of the source can be disambiguated. How does that sit with what you've said here?

    • cyphar 7 years ago

      The argument that langsec.org makes is that any such languages are missing security properties. In fact they regard most ad-hoc languages as insecure due to the lack of verification, which they believe to be the root cause of many security problems.

bd82 7 years ago

The article mentioned using Parser Combinators as the sweet spot between Hand Written Parsers and Parser Generators.

Another possible sweet spot would be using something I call a "Parsing DSL", which is a sort of cross between a parser combinator and a parser generator.

TLDR: See the JavaScript Parsing DSL library (Chevrotain) I've authored: https://github.com/SAP/chevrotain

Details: A Parsing DSL means using an API similar to hand-building a parser, but without a lot of the cruft associated with hand-building, while enjoying higher-level abstractions from the Parsing DSL library such as: 1. automatic ambiguity detection, 2. lookahead calculation, 3. grammar diagrams, 4. auto-complete, 5. automatic error recovery, 6. and more...

Under V8 (Chrome/Node) this is much faster than any other library tested, and even substantially faster than a naive hand-built parser. http://sap.github.io/chevrotain/performance/

(Benchmarked using a simple grammar [JSON])

  • e12e 7 years ago

    This seems to share some ideas with ometa[1], but perhaps arriving at those ideas from another angle?

    [1] http://www.tinlizzie.org/ometa/

    https://github.com/alexwarth/ometa-js

    See also ohm:

    https://github.com/harc/ohm

    • bd82 7 years ago

      I'm less familiar with ometa.

      Chevrotain shares two main ideas/concepts with Ohm.

      1. Separation of grammar and semantics, but in a less opinionated manner as it does not enforce the separation as Ohm does (it is still possible to embed actions directly in the grammar).

      2. Grammar Inheritance.

      While those ideas are not common, they are also not that rare (for example, the same concepts exist in ANTLR). I think there are three big conceptual differences.

      1. In Chevrotain, performance is considered a major feature, which results in it being two orders of magnitude faster (in the benchmark linked above).

      2. Chevrotain attempts to provide capabilities relevant for writing IDEs, for example automatic error recovery/tolerance and syntactic content assist.

      3. Internal vs External DSL -

      From an implementation perspective there is a vast difference, as Ohm is an external DSL while Chevrotain is an internal DSL. In practical (user) terms this means that you can place a breakpoint directly in a Chevrotain grammar, but you cannot do so in Ohm. Or that you will need a separate editor to edit an Ohm grammar, while you can use any JavaScript editor to create a Chevrotain grammar.

      It also means that Ohm could be ported to different target runtimes (Like Antlr actually is) while Chevrotain can only run in an ECMAScript engine.

AgentME 7 years ago

Parser combinators are awesome! I've played with them just a bit. They seem to match my intuition of how a parser's implementation ought to resemble an EBNF grammar.

Just for fun, using the Bennu library[0] I wrote a JSON parser[1]. (Not intended for production use of course; well-optimized JSON parsers exist and browsers kinda ship with them now. If you're looking for a Bennu example or are thinking of experimenting with extending JSON in wacky ways for fun, then it's neat to mess with.) With that specific library, I ended up with some messy parts when shuffling values through the library's stream abstraction, but I got the hang of it, and the parts I thought were messy were at least straightforward and didn't have issues like hidden edge cases. Something cool is that parsers made with the library and its stream abstraction automatically work incrementally too.

[0] http://bennu-js.com/

[1] https://github.com/AgentME/bennu-json/blob/701d17bc4872469dc..., right on my favorite part.

laydn 7 years ago

Is there a tool out there that generates a grammar by analyzing source code? For example, I'd like to feed in the Linux kernel source code and get the C grammar as the output.

  • corndoge 7 years ago

    recurrent neural networks, kind of

    • joeyo 7 years ago

      That gives me a kind of funny idea: train a sequence-to-sequence LSTM network on code, written in two (or more) programming languages, that implements the same functionality. You'd need a big corpus, and the code would probably have to be a little more complicated than "hello world", but I don't see why it wouldn't work, in principle.

      • p1esk 7 years ago

        Why would you want to mix two languages?

        • joeyo 7 years ago

          Ah, sorry, I meant two programs in two languages that implement the same functionality. I was thinking of a NN version of, say, Python's 2to3.

pjmlp 7 years ago

For me, writing parsers like it's 2017 means actually using parser generators like JetBrains MPS or ANTLR, instead of using bison and yacc as some kind of quality measure.

  • UK-AL 7 years ago

    Honestly, well-written recursive descent parsers are often far easier to understand than parser generator stuff.

    There's a reason nearly every production parser is hand rolled, and it's not simply for performance reasons.

    • pjmlp 7 years ago

      Visual debugging tools help.

    • soberhoff 7 years ago

      If you're generating the parser then you only really need to understand the grammar. Nothing can beat that.

      • UK-AL 7 years ago

        And error messaging, being able to place breakpoints, incremental parsing, etc., etc.

        • soberhoff 7 years ago

          I don't see how a better understanding of the generated parser helps improve those things. Either you can't improve them, or you improve them by modifying the grammar.

          • angstrom 7 years ago

            Having used ANTLR recently I would tend to not only agree, but say most cases of disagreement are circle jerking instead of getting to a quick solution and shipping. I just updated our grammar 2 weeks ago. Effort? Update the *.g4 grammar file and run a build target in Ant. Done. Next problem. Performance-wise it's parsing deeply nested SQL queries in sub-millisecond time, which meets our need. If someone requires better performance than that, I'd be curious what they're doing.

          • swift 7 years ago

            This is true ideally, but there are several factors which make it not always true in practice.

            (1) Often your grammar admits inputs which are not in your language. Less often, it forbids some inputs which are actually valid. This is usually because an accurate grammar would be extremely complex to write and inefficient to parse using a generic algorithm. Languages which combine a context-free grammar with parsing decisions based on semantic information fall into this category (think C++); the "true" grammar would usually be context sensitive and quite hard to work with.

            (2) You are sometimes compiling a layered language. Again, consider C++. The program you're compiling is generally actually written in the C preprocessor language! Conceptually, it's translated to C++, and then the C++ code is compiled in a separate phase. In practice, that layering will make it difficult to produce good error messages, and a formal grammar for the combined language would be a mess to say the least.

            (3) The structure of the grammar may just be awkward for producing good error messages, and refactoring it to eliminate the problem may not be possible, or it may only be possible by making problem (1) even worse.

            In general, the decision of what goes in the lexer, what goes in the grammar, what is handled in semantic actions, and what is dealt with at a later semantic layer is always a trade off. Each layer has different characteristics in terms of computational power, performance, static analyzability, programming flexibility, composability, modularity, and maintainability. It doesn't make sense to insist that you solve all problems at one particular layer. You look at the problem you need to solve and decide how it'll be partitioned between these different layers to maximally benefit from their strengths and minimize their weaknesses.

            That's engineering.

          • UK-AL 7 years ago

            You can't do those things using a parser generator. They're not part of the grammar.

            • CalChris 7 years ago

              In ANTLR4, they are most definitely not part of the grammar, which they were in previous ANTLRs. That makes the grammar easier to read/change/reuse/... Everything else is listener and visitor methods, which are easy to write, debug, error-check, ....

            • soberhoff 7 years ago

              You could theoretically put all kinds of meta-information into the grammar, even arbitrary code snippets that are pasted into the generated parser. I'm not saying that that's a good idea - it undermines much of the value of a parser generator - but generally speaking it can be done.

  • pjc50 7 years ago

    The article makes no mention of bison (which is a parser generator) and yacc, and presents a rationale for using parser combinators?

    I agree that bison/yacc is a terrible user experience in a modern C++ or even modern C environment.

    • pjmlp 7 years ago

      It was more a kind of heads up, because I think pure coding is too low-level as a parsing tool, and most still think the only alternative to manual parsing is bison/yacc.

    • to3m 7 years ago

      I like bison and I never minded flex. (ANTLR? Did not enjoy; would not use again. But it was a while ago now and maybe newer versions are improved.)

      I wonder if I'm the only one...

      • CalChris 7 years ago

        I've used yacc/lex and bison/flex, but ANTLR4 is the bee's knees. It is a considerable improvement on previous ANTLRs. Separating out grammars (ANTLR grammar file) from actions (Java files) makes everything better.

      • pjc50 7 years ago

        Error handling involves quite a bit of manual work, and memory management if you're allocating nodes during the parse while also trying to do error recovery is even worse. It's best dealt with by a slab or pool allocator. It's not very OO-friendly and you can only have one instance running at once. That was my experience of trying to manage several parsers for Verilog, VHDL, DEF and similar ecad formats.

  • vram22 7 years ago

    On a side note, I once met and chatted a bit with Dr. Terence Parr (creator of ANTLR) at a talk he gave in Mumbai.

    He had come there both to talk about ANTLR and to recruit students for his course at the University of San Francisco (IIRC).

    Here's a photo I took of him with a possible future student:

    https://www.flickr.com/photos/vram/31351410/

    Nice chap.

  • lower 7 years ago

    In what way is ANTLR better than bison?

    • pjmlp 7 years ago

      LL(*), with IDE tooling and visual debugger for all parsing phases.

    • CalChris 7 years ago

      ANTLR combines lexical analysis and parsing, flex+bison, into a single tool. To me, ANTLR4 sets the bar pretty high. I'm not opposed to learning a new tool like parser combinators, but I'd need an article that showed the differences between ANTLR4 and PCs, showing how things were much easier in PCs, rather than just that it was 2017.

      I'd add that ANTLR has really good documentation.

      • yorwba 7 years ago

        ANTLR is a parser generator, isn't it? Do you mean "parser combinators" where you wrote "generators"?

        • CalChris 7 years ago

          Ugh. Fixed. Thanks.

          But I really would want to read the ANTLR4 vs PCs article. I'm very happy with ANTLR4. But tools is tools.

    • wslh 7 years ago

      Multiple language targets

  • 991982391 7 years ago

    PostgreSQL uses bison and is of higher quality than all that C++ bloat that you are constantly hyping here.

tankfeeder 7 years ago

This is a real CSV parser in PicoLisp which understands everything: https://bitbucket.org/mihailp/tankfeeder/src/default/csv.l Real 2017 parsing.

  • burntsushi 7 years ago

    I don't think so? That doesn't appear to support escaping quotes by doubling them (instead, it only supports the \" variety), but doubling quotes is the standard escaping mechanism for CSV data.

    • jwilk 7 years ago

      The problem with CSV is that nothing is standard, not even the eponymous comma as separator.

      Seriously, stop using CSV.

      • burntsushi 7 years ago

        > The problem with CSV is that nothing is standard, not even the eponymous comma as separator.

        So? The parent claimed it understood everything. I was pointing out that it doesn't. There are plenty of CSV parsers out there that can pretty much handle anything you throw at them, and they work pretty well in practice.

        > Seriously, stop using CSV.

        I'm a consumer, not a producer. So I'll continue right on using it, thank you very much.

rtpg 7 years ago

Integrating with existing C tooling on larger projects is still one of the hardest parts of introducing Rust into a project.

It would be amazing if someone wrote a gcc/clang wrapper that could detect rust files and "inline replace" them with C file equivalents

  • pjmlp 7 years ago

    Which is why Bjarne went with C compatibility in C++, even though he would rather have had something more Simula-like.

    C++ would never have survived at AT&T if it wasn't zero-friction compatible with C.

    Cyclone, C+@ and Limbo are sadly three examples where things did not go that well at AT&T.

joeyo 7 years ago

It's a bit of a special case, but I am a huge fan of Kaitai [1] for generating parsers for reading binary data formats. You declaratively describe the format in a YAML-based DSL and "compile" it into a parser that can target a variety of languages (C++, Java, Python, etc.).
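
For flavor, a format description reads something like this (a made-up toy format, not a real spec; see the Kaitai docs for the full feature set):

    # toy_image.ksy -- hypothetical format: two little-endian u16 dimensions,
    # then width*height bytes of pixel data
    meta:
      id: toy_image
      endian: le
    seq:
      - id: width
        type: u2
      - id: height
        type: u2
      - id: pixels
        size: width * height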

1. http://kaitai.io/

ioquatix 7 years ago

I'm surprised no one has mentioned Ragel: http://www.colm.net/open-source/ragel/

Ragel generates very fast code, is modular by design and has good error handling. It supports both compiled and scripted languages (e.g. can generate both Ruby and C) which is useful if you need a fallback.

I've really enjoyed using it to implement the following:

- A template language for Ruby: https://github.com/ioquatix/trenni/blob/master/parsers/trenn...

- A SGML parser for Ruby: https://github.com/ioquatix/trenni/blob/master/parsers/trenn...

- A HTTP V1 protocol parser: https://github.com/kurocha/async-http/blob/master/source/Asy...

- A URI parser: https://github.com/kurocha/uri/blob/master/source/URI/RFC398...

I'll admit, it does take a while to understand how to correctly handle ambiguity when dealing with callbacks/events during parsing, but generally speaking, once you get a bit of experience with how things work, it becomes a very powerful tool in your toolbox.

  • allengeorge 7 years ago

    I thought that Ragel was no longer open source, and supported C/C++ only (admittedly, the second point isn't that big a deal, given the article and its audience).

  • jwilk 7 years ago

    Cloudbleed was caused by a bug in a Ragel-based parser.

    • rhencke 7 years ago

      It was not caused by a bug in Ragel, though.

      • staticassertion 7 years ago

        This is always going to lead to the same place - whether it's the developer's fault or the tool's.

        The author of this post wants uncompromising safety, and Ragel does not allow for that - programmer error can introduce memory unsafety. They even explicitly call out Cloudbleed.

      • wuch 7 years ago

        To provide a little bit more context, here is a detailed description of this bug [0]. Curiously, they suggest that the problem could have been avoided by changing the check for EOF to ">=". This is not true at all in the case of C, at least as far as the language semantics are concerned. When a pointer goes two past the end of an array, you are already in undefined-behaviour land.

        [0] https://blog.cloudflare.com/incident-report-on-memory-leak-c...

      • ioquatix 7 years ago

        A simple `assert(p<pe);` in a debug build would have caught the issue. I think you could argue that ragel should include something like this by default.

tbabb 7 years ago

> An unreliable programming language generating unreliable programs constitutes a far greater risk to our environment and to our society than unsafe cars, toxic pesticides, or accidents at nuclear power stations. Be vigilant to reduce that risk, not to increase it. – C.A.R Hoare

No. There is no programming language that prevents the programmer from writing bogus code. Blaming software instability on the programming language would be like blaming unsafe building designs on the architect's drafting tools. Yes, shitty drafting tools can make certain kinds of mistakes easier, but the task of having to design something with careful thought and good engineering principles does not ever go away, no matter what kind of compass and ruler you're using.

Also, unsafe software IS dangerous, but putting it next to those directly life-threatening things actually undermines the message by the contrast.

wslh 7 years ago

Nobody mentions OMeta? Not the best performance but fast for playing.

  • bd82 7 years ago

    Is this what you are referring to? https://github.com/alexwarth/ometa-js

    Is it still relevant if it has not been updated for several years?

    Perhaps OMeta's younger brother (Ohm) should have been referenced instead: https://github.com/harc/ohm Unfortunately, while it seems to have some very nice features, particularly the separation of grammar and semantics, its performance is underwhelming.

    See a benchmark I've created of JSON parsers implemented using many parsing libraries:

    http://sap.github.io/chevrotain/performance/

    On V8 (Chrome 60) it is the slowest by far, in most cases by two orders of magnitude...

  • carapace 7 years ago

    I'd like to mention Grako/Tatsu: https://pypi.python.org/pypi/tatsu

    > TatSu (the successor to Grako) is a tool that takes grammars in a variation of EBNF as input, and outputs memoizing (Packrat) PEG parsers in Python.

    (I'm familiar with Grako, haven't used Tatsu yet.)
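
    Usage is pleasantly small, something like this going by the TatSu docs (a toy grammar; details may be off since I've only used Grako):

        import tatsu

        # Grammar in the EBNF variant: { } is repetition, $ is end of input.
        grammar = r"""
            start = expression $ ;
            expression = term { '+' term } ;
            term = /\d+/ ;
        """

        parser = tatsu.compile(grammar)
        ast = parser.parse("1 + 2 + 3")  # a nested structure of the matched pieces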

gens 7 years ago

How about using a parser generator? Yacc, Ragel, bison, etc.[0]?

Nooooo, we have to use rust, because we are not "2017" if we don't.

[0] https://en.wikipedia.org/wiki/Comparison_of_parser_generator...

PS I use C. I say this to align my post with the "Rust is the best and i use rust" mentality of the article.

EDIT: Also, this article says absolutely nothing about actually writing a parser.

  • pornel 7 years ago

    The article mentions Cloudbleed, which did what you're suggesting (used Ragel), and still managed to create a serious vulnerability.

    • gens 7 years ago

      Yes, I remember that. Ragel is still better than a generic programming language for writing parsers, by a long shot. (And Cloudbleed was their fault, not Ragel's.)

      Writing a parser in a memory-safe language does nothing for its correctness, as memory safety is not the only security problem. One should use a domain-specific language, as simplicity is very important for safety.

      That said, THIS ARTICLE DOES NOT SAY ANYTHING ABOUT HOW TO ACTUALLY WRITE A PARSER!

      edit: Rust is the new systemd, full of fanboys pushing it forcefully on others (or, as they say, "gently nudging"). I have not much to say against rust itself (except that it is not for "systems programming"), but the fanboys are idiots.

      • staticassertion 7 years ago

        > and cloudbleed was their fault, not ragels

        Meaningless statement - 'fault' in these discussions is impossible to objectively attribute. Is every vulnerability the fault of the developer writing the code, or the tool that facilitated the code? Some would argue the former, some the latter.

        The end result is the same - using these tools you can produce vulnerabilities, and your users get exploited all the same. So is fault really important?

        > Writing a parser in a memory-safe language does nothing for it's correctness, as memory safety is not the only security problem. One should use a domain specific language, as simplicity is very important for safety.

        You have two statements here - one is that memory safety is not the whole of safety, and one is that simplicity is important to safety beyond memory safety.

        The first feels... weak. No kidding - security encompasses more than memory safety. But eliminating memory safety issues limits what you have to worry about.

        As for simplicity leading to security, this isn't a supported argument. But as for encoding semantic security issues into your program (security issues outside of memory safety), I would argue a strong type system is the right way to go here, something Rust certainly provides.

        > edit: Rust is the new systemd, full of fanboys pushing it forcefully on others (or, as they say, "gently nudging"). I have not much to say against rust itself (except that it is not for "systems programming"), but the fanboys are idiots.

        blah blah blah, literally just vapor at this point, not worth addressing.

        • gens 7 years ago

          You did not address the actual points I made.

          >Ragel is still better then a generic programming language for writing parsers, by a long shot.

          >That said, THIS ARTICLE DOES NOT SAY ANYTHING ABOUT HOW TO ACTUALLY WRITE A PARSER !

          And, if I can dare to comment on your statement...

          >As for simplicity leading to security, this isn't a supported argument. But as for encoding semantic security issues into your program (security issues outside of memory safety), I would argue a strong type system is the right way to go here, something Rust certainly provides.

          This is, for one, a bastardization of what I stated, which was that a DOMAIN-SPECIFIC LANGUAGE IS (usually) BETTER FOR THE DOMAIN IT IS SPECIFIC TO THAN A LANGUAGE THAT IS NOT SPECIFIC TO THAT DOMAIN. This is particularly evident in parsing, and to say otherwise is dishonest.

          As for simplicity being related to security, that should also be self-evident. If not from common sense, then from examples of more complex (needlessly complex or otherwise) programs having more faults (on average).

          >blah blah blah, literally just vapor at this point, not worth addressing.

          Your post is exactly what a fanboy would write. A post that goes point by point, "bashing" them for being "wrong", "imprecise", or "not strong enough of an argument", all the while ignoring the actual "meat" of the argument (the meat being that A LANGUAGE FOR WRITING PARSERS IS BETTER FOR WRITING PARSERS THAN A GENERIC LANGUAGE). It is the equivalent of "link plz" (meaning "scientific paper or GTFO", as if actually using your own brain is a bad thing). In short, you sir are a fanboy.

          • gens 7 years ago

            I would like to apologize for calling people idiots. It is not a nice thing to say.

andreasgonewild 7 years ago

Rust still means too much ceremony; there is a line beyond which specifying more details makes the whole thing more difficult to understand and maintain, and that's besides taking more time to write in the first place. Not everyone is into checking all the boxes all the time, for good reasons; otherwise we'd all be coding Ada by now. Trying to shame others into adopting that mindset can only cause more division.

  • burntsushi 7 years ago

    No, I don't think so. I maintain a popular command line tool that needs to be extremely fast, and Rust was an excellent choice for that. It's made it so much easier to maintain than equivalent C tools, and has made it easier for me to reach feature parity by nearly effortlessly depending on the work that so many others have done in the ecosystem.

    It's not bug free mind you, but that class of bugs doesn't include things like buffer overruns, double frees or memory leaks (note: Rust does not guarantee absence of memory leaks, but RAII pretty much makes sure it doesn't happen in practice).

    • andreasgonewild 7 years ago

      I would argue that writing it in C++ would have gotten you there faster, assuming a decent level of experience with the language. There's no ecosystem in the world that beats seamless integration with C and no language that even comes close to being as pragmatic.

      • burntsushi 7 years ago

        > no language that even comes close to being as pragmatic.

        That's what I'm disagreeing with. I find Rust to be extremely pragmatic. There are other competing tools written in C++, and from where I'm standing, their maintenance story is quite a bit harder. Conversely, my tool works seamlessly on Windows, Mac and Linux.

        • andreasgonewild 7 years ago

          If multi-platform is your biggest problem, I can see how that helps. I don't really care that much myself as long as my code compiles on the unixes. I sadly agree that static linking is a pain in the behind in C/C++ land. Still, the language I'm using interacts with the way I'm thinking; if that part isn't working out, all the features in the world isn't going to help.

          • burntsushi 7 years ago

            Indeed. Windows support was critical, and I have a ton of users there as a result of it being so easy to distribute a Rust application to them. Microsoft even picked it up and is now shipping it in VS Code. :-)

            > Still, the language I'm using interacts with the way I'm thinking; if that part isn't working out, all the features in the world isn't going to help.

            Sure, and there was a lot of friction between me and Rust in the beginning. But my experience, and a lot of others' experience as far as I'm aware, is that the friction settles down quite a bit. But yeah, experiences can vary there!

  • bschwindHN 7 years ago

    What's your alternative when writing parsers which need to handle untrusted input? This isn't a general "use Rust everywhere" article, it's focusing on these kinds of parsers and how we can do better against very real issues.

    • andreasgonewild 7 years ago

      But that has nothing to do with Rust; this post and others are more or less claiming between the lines that the only thing that can save us from buggy software is the Rust way, which is bullshit. It didn't work for Java, or Erlang, or Haskell; and it won't work for Rust; simply because the solution is experience, not more rules.

    • pjmlp 7 years ago

      Make use of parser generators like JetBrains MPS or ANTLR.

      The generated parsers should take the use case of untrusted input into account.

      Even if they don't do it currently, it should be easier to extend a generator to take it into account while using the same grammar specification.

  • dochtman 7 years ago

    I disagree with your "too much ceremony" premise in general, but in particular I don't think it holds true for nom parsers. Here are two examples I've been working on:

    https://github.com/djc/askama/blob/master/askama_derive/src/...

    https://github.com/djc/tokio-imap/blob/master/src/parser.rs

    If you look at the parsing logic that makes up most of those files, I don't think you can reasonably argue there is a lot of ceremony going on.

    • andreasgonewild 7 years ago

      But I can, and I do; that's still too much noise for me. I'm sure the code is fine by Rust standards, but to me it looks like Perl with control issues. I don't need any help from my programming language with micromanaging the code I write or the way I write it. I'll ask when I want assistance, and until then I prefer if they stay the hell out of my way. I won't claim to write bug-free software; the only way to write bug-free software is to not write it at all; spending half my energy on sweet talking a psychotic compiler around every single detail is a perfect recipe for missing the bigger picture and ending up solving the wrong problems.

  • naasking 7 years ago

    > Not everyone is into checking all the boxes, all the time; for good reasons;

    Pray tell, what are these reasons?

    > otherwise we'd all be coding Ada by now

    That's not why we're not all writing Ada now.

    • IshKebab 7 years ago

      > what are these reasons?

      Because it takes longer, is more complicated and more difficult to design. You can't tell me that writing Rust is as easy as writing Go for example.

      I'm not saying Rust's heavy typing and borrow checker don't have advantages. Of course they do. But I do think too many people pretend it is always an obvious choice to take them. In many situations a simpler, less 'perfect' language like Go, Python, or even C++ is better.

      For example I don't think anyone is going to be writing AAA games in Rust any time soon, and not just because of language momentum.

      • naasking 7 years ago

        > Because it takes longer, is more complicated and more difficult to design.

        So you mean it requires properly solving the hard problems implied by your application and your desired solution, instead of ignoring them and permitting latent safety and security violations.

        > You can't tell me that writing Rust is as easy as writing Go for example.

        GC will always be easier than lifetime checking. But that's not what you claimed: you said people have good reasons for not wanting to check all the (safety) boxes. Go also requires you to check all of its safety boxes too, so it isn't an example of what you claimed.

        Only languages like C and C++, which permit violating type safety, are examples where one might be justified in not wanting to check the safety boxes. And you have unsafe Rust for when you really need it.

        > In many situations a simpler, less 'perfect' language like Go, Python, or even C++ is better.

        C++ is not simpler. It's hilarious that some people think "familiar" somehow means "simpler".

        • riwsky 7 years ago

          You present a good argument for ticking all the safety boxes, but the original comment you're citing doesn't specify safety; you're refuting an argument they didn't make. To count Go as an example, one could count "zero-cost abstractions" or "no runtime" as a box that Go doesn't check.

          • naasking 7 years ago

            Rust's "boxes" are there for safety reasons. I don't see what else could have possibly been meant.

            And Go has a runtime. The fact that it has green threads and runs a GC makes for a heavier runtime than C or Rust.

            I'm not sure I get your argument about Go's boxes. The original post was clearly talking about the "extra" work a developer has to do to "satisfy" the compiler and run their code. It's a common complaint from those coming from less safe languages. I don't see how I could interpret the meaning as you suggested and still make sense of the post.

      • burntsushi 7 years ago

        > You can't tell me that writing Rust is as easy as writing Go for example.

        Sure I can! I've been writing Go and Rust daily for years. Both come pretty easy to me. But that's because I've been using both for quite a long time now. Do I sometimes stumble and wonder, "How do I represent my data in the cleanest way in Rust?" Sure! But the same happens in Go too.

        Rust was definitely harder to learn, and it took me longer to reach proficiency. With Go, I can't recall much of a learning period. It was pretty much off to the races on day 1. With Rust, there were some growing pains for a few weeks before things clicked.

        However, I don't actually disagree with your larger point! I am only one person, and what I find easy or hard might be completely different for another person. For example, I had written quite a bit of Haskell and C before coming to Rust, which turned out to be pretty good preparation. Not everyone has that background, which might make the learning experience worse (or better) than mine.

  • pjmlp 7 years ago

    Ada compilers targeted enterprise and military customers with deep pockets; combined with the industry's adoption of UNIX, where Ada compilers were something extra to buy while the C compiler[0] required for any UNIX-related systems programming came bundled, this did not help make the language more widespread.

    [0] Until they started selling the SDK as a developer's edition addon

  • Peaker 7 years ago

    performance and safety are just 2 checkboxes.

    They're trying to establish that having both of those requires Rust's approach - and I think they're making a convincing argument.

bbmario 7 years ago

If you guys were going to write an interpreted language AND/OR a transpiler, which toolset would you use?