Bizarro 6 years ago

That's why I like Clojure's philosophy of a few data structures and many functions operating on them.

Of course, Alan Perlis wrote in 1982: "It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures."

But pipelining outputs into inputs (you'll see it a lot in F# too) is just a simple, flexible concept.

As Rich Hickey once said, "We're data processors".
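
For example, a minimal sketch of that style in Python (made-up field names; just plain dicts and lists flowing through small functions):

  # Plain data in, plain data out -- no classes needed.
  orders = [
      {"item": "widget", "qty": 3, "price": 2.50},
      {"item": "gadget", "qty": 0, "price": 9.99},
  ]

  def only_nonempty(rows):
      return [r for r in rows if r["qty"] > 0]

  def add_total(rows):
      return [{**r, "total": r["qty"] * r["price"]} for r in rows]

  def grand_total(rows):
      return sum(r["total"] for r in rows)

  # Pipeline: each function's output is the next one's input.
  print(grand_total(add_total(only_nonempty(orders))))  # 7.5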

I think we somehow took a wrong path when OO got embraced as wholly as it did back in the 90s or so.

  • sharpercoder 6 years ago

    Maintaining state and mutating state is useful and neatly done in OO. That is, when done right, which is hard. I certainly agree that for processing data, good functional languages reign supreme. They are usually terser while being more expressive.

    It seems that different problems need their own tools. Compilers typically fit a data processing pipeline. Yet a use case such as editing an AST (e.g. a text editor showing incomplete syntax state) could probably be better served by OO.

    Most current software then simply uses the wrong tool for the job: it does data processing with OO, and state representation and mutation with pure functional idioms. A problem that is not yet solved very well in the industry is the interplay between these idioms. .NET with C# and F# comes to mind, which allows for workflows that easily leverage both OO and functional styles.

    • agumonkey 6 years ago

      I have yet to see the "best" design for the stateful parts of systems. IMO FP gives you better foundations: start stateless, then see where you can group things into private mutable memory. Not that I know how to do it, but at least I can reason about it instead of invoking innate black art or long experience.

    • mbrodersen 6 years ago

      > Maintaining state and mutating state is useful and neatly done in OO

      I started laughing when I read "neatly done". I assume you don't mind long-range state-mutating effects making your code a lot more complex to reason about than it needs to be. Or the "grab a banana and you grab the whole planet" dependency issues OO almost inevitably introduces. Or ... (I could go on). OO with mutable state is anti-modular.

  • simonbyrne 6 years ago

    > I think we somehow took a wrong path when OO got embraced as wholly as it did back in the 90s or so.

    I think the particular paradigm that was adopted by C++, Java, etc. was a bad one. Having now worked quite a lot in a language with multiple dispatch (Julia), I find it really cumbersome to work in class-based single dispatch languages.

    • Bizarro 6 years ago

      Yep, because it forces you into awkward hierarchies that most people get wrong. And even if you get them right, they tend to be awkward a lot of the time.

      I was always a big fan of CLOS (of Common Lisp) and Dylan (poor Dylan).

      By the way, the spiritual successor to Dylan (IMHO), Stanza (http://lbstanza.org/), doesn't seem to get much love.

    • ZephyrP 6 years ago

      Multiple (virtual) dispatch is a central feature of C++ (and I suspect Java as well).

      • groovy2shoes 6 years ago

        Virtual dispatch in C++ and Java is dynamic dispatch, but it is singly so rather than multiply so as found in CLOS, Dylan, Julia, etc. If you want multiple dispatch in C++ and its kin, you need to jump through hoops with visitor patterns and such.
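
        To make the distinction concrete, here is a minimal Python sketch of what multiple dispatch amounts to: method selection keyed on the run-time types of both arguments. (The registry is hand-rolled and exact-type only; CLOS, Dylan, and Julia do this natively, with subtype-aware selection.)

          # Dispatch table keyed on the runtime classes of *both* arguments.
          class Vehicle: pass
          class Car(Vehicle): pass
          class Truck(Vehicle): pass

          _methods = {}

          def collide_method(a_type, b_type):
              def register(fn):
                  _methods[(a_type, b_type)] = fn
                  return fn
              return register

          @collide_method(Car, Truck)
          def _(a, b): return "car hit a truck"

          @collide_method(Truck, Car)
          def _(a, b): return "truck hit a car"

          def collide(a, b):
              # Look up using the dynamic types, not the declared ones.
              return _methods[(type(a), type(b))](a, b)

          c, t = Car(), Truck()   # statically we only "know" they are Vehicles
          print(collide(c, t))    # car hit a truck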

      • foota 6 years ago

        The difference confused me for a bit, but consider the following:

          Print(Vehicle v){
            print("vehicle");
          }
        
          Print(Car c){
            print("car");
          }
        
          Print(Boat b){
            print("boat");
          }
        
          Vehicle test = Boat();
          
          Print(test);
        
        
        In a language with multiple dispatch I believe this prints boat, whereas with single dispatch (plain overload resolution on the static type) it prints vehicle.

        • lmitchell 6 years ago

          Not quite. Multiple dispatch (as mentioned in sibling comments) is basically 'virtual on both object and argument types', so like:

            class Vehicle {
              virtual void Collide(Vehicle other);
            };

            class Car : public Vehicle {
              void Collide(Truck other) { /* hit a truck */ }
              void Collide(Car other) { /* hit another car */ }
            };

            class Truck : public Vehicle {
              void Collide(Truck other) { /* hit another truck */ }
              void Collide(Car other) { /* hit a car */ }
            };

            Vehicle c = Car();
            Vehicle t = Truck();

            c.Collide(t); // with multiple dispatch, calls Car::Collide(Truck)
          • seanmcdirmid 6 years ago

            You are using static overloading on the argument rather than dynamic dispatch. If the static type of truck is obscured to a super class, your behavior won't be as expected. In general, overloading shouldn't be used as a replacement for dynamic dispatch; rather, it should be used to accommodate multiple incompatible types (e.g. foo(X) and foo(Y) but never foo(Z) if X <: Z and Y <: Z, unless foo(Z) maintains the behavior of foo(X) and foo(Y) given an x or a y under subsumption).

            Multiple dispatch is typically meant to be purely dynamic in meaning, as wiki puts it:

            > Multiple dispatch or multimethods is a feature of some programming languages in which a function or method can be dynamically dispatched based on the run-time (dynamic) type or, in the more general case some other attribute, of more than one of its arguments.[1]

            (https://en.wikipedia.org/wiki/Multiple_dispatch)

            • yorwba 6 years ago

              The static type of both c and t is Vehicle, but by dynamic multiple dispatch Car::Collide(Truck) is selected. That's not how C++ works, but the example was exactly about illustrating how C++ doesn't work.

      • barrkel 6 years ago

        Multiple dispatch is overriding virtual member functions on argument types, not just receiver type.

        Overloading and overriding are duals; the former is statically determined, the latter dynamically determined. Just like you can overload on the full argument list, multiple dispatch lets you override on the full argument list. This lets you do things like implement the visitor pattern without double dispatch. And C++ doesn't support it without libraries.

      • zenhack 6 years ago

        Unless it's via some new-fangled C++ feature I haven't heard of yet, I'm pretty sure C++ doesn't have multiple dispatch. Can you elaborate on what you're referring to?

        • bstamour 6 years ago

          C++ doesn't have multiple dispatch. I think the OP was referring to multiple inheritance, which is not the same thing.

          • ZephyrP 6 years ago

            I was thinking of runtime single dispatch with compile-time type of the arguments to that function.

            • bstamour 6 years ago

              So virtual function overloading?

              Can you give a small bit of sample code to demonstrate what you mean?

              • ZephyrP 6 years ago

                Just an ordinary virtual function that discriminates on its arguments (at compile time).

                • zenhack 6 years ago

                  Yeah, multiple dispatch implies more than one argument involved in the dispatch decision at runtime. So that's not the same thing.

  • stochastic_monk 6 years ago

    Similarly, I like Python and C++’s emphasis on duck-typing. I use STL containers often, and provide similar interfaces to my custom data structures for flexible, general, fast code.

    • jonathanyc 6 years ago

      I wish there were tools to auto generate documentation on what interface a duck-typed consumer expects, e.g. what std::vector<T> expects of T.
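
      In Python, typing.Protocol can act as that kind of machine-checkable documentation of a duck-typed interface; a minimal sketch (the Sortable name is made up):

        from typing import Protocol, Any

        class Sortable(Protocol):
            """Documents what our hypothetical sort() expects of its elements."""
            def __lt__(self, other: Any) -> bool: ...

        def sort(items: list[Sortable]) -> list[Sortable]:
            return sorted(items)   # only relies on __lt__

        sort([3, 1, 2])            # ints satisfy the protocol structurally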

      • stochastic_monk 6 years ago

        Well, C++17 concepts enforce requirements, but hopefully metaclasses will do more in 20.

        • bstamour 6 years ago

          Concepts are also C++20. They were voted in, however, shortly after 17 shipped.

          • stochastic_monk 6 years ago

            Good point. I was thinking of the rudimentary, template-based use of concepts, not the polished, C++20 version.

            • bstamour 6 years ago

              Ah gotcha. Yes in the current situation we can do nearly everything that the concepts proposal for C++20 does with enable_if, void_t, and a few other helpers. Though concepts have the benefit of letting you "say what you mean", instead of relying on template magic.

      • orf 6 years ago

        This should be more than possible with Python's type annotations, when done right. E.g. Iterable[str] can accept a generator, a list, a tuple, any class with the right __iter__, and of course a string itself.
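
        A quick sketch of that point (the function is made up; the behaviour is standard Python):

          from typing import Iterable

          def join_upper(parts: Iterable[str]) -> str:
              # Only relies on being able to iterate and get strings.
              return ",".join(p.upper() for p in parts)

          join_upper(["a", "b"])         # list
          join_upper(("a", "b"))         # tuple
          join_upper(c for c in "ab")    # generator
          join_upper("ab")               # a str is itself an Iterable[str]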

      • wirrbel 6 years ago

        This is what I like about Rust and Rust's traits. You write a generic function or type and specify the properties of the generic type parameter, for example that it must implement a particular trait.

  • mbrodersen 6 years ago

    > That's why I like Clojure's philosophy of a few data structures and many functions operating on them.

    Which in practice means that a few data structures that are not perfect for solving a specific problem are forced (with repeated hammer blows) to bend to the problem. Like old-style Lisp alists used as a map. Probably the worst possible structure for representing a map. But yeah, it kinda works as long as you don't have much data.

    • kazinator 6 years ago

      The association list is a poor data structure for representing a map if the most important requirement is fast access to any key of a large map.

      However, the association list is an excellent data structure if an important requirement is to be able to cheaply extend a map with new keys functionally: without altering the original. It also provides the requirement that duplicate keys added to the front shadow existing ones. It is also very simple; it is easy to verify that operations are correct.

      With the above properties, association lists directly provide an easy and correct implementation model for lexical environments.

      The empty top-level lexical environment is an empty list. Adding a frame with new variables, including ones that shadow existing variables, is just consing pairs onto the list. The assoc operation correctly implements lexical variable lookup. The environment for making a lexical closure is just a copy of the assoc list pointer that you have now. Because the bindings are independent pair objects in the association list, individual bindings can be located and selected, and combined into another association list. A newly created association list environment can trivially capture an arbitrary subset of bindings from another one without mutating it.
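
      A minimal sketch of that model in Python, with nested pairs playing the role of cons cells (names are illustrative):

        # An environment is an association list: a chain of (binding, rest) pairs.
        EMPTY = None

        def extend(env, name, value):
            # "Consing" a binding onto the front; the old environment is untouched.
            return ((name, value), env)

        def lookup(env, name):
            while env is not None:
                (key, value), env = env
                if key == name:
                    return value       # first match wins => shadowing
            raise NameError(name)

        outer = extend(EMPTY, "x", 1)
        inner = extend(extend(outer, "y", 2), "x", 10)   # shadows x

        print(lookup(inner, "x"))   # 10
        print(lookup(outer, "x"))   # 1 -- the outer environment is unchanged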

      Even if we don't use this representation in production use, it provides an important reference model. Anything more complicated has to produce the same results; we can start with assoc based environments, build a lot of code, and then use that as regression test cases for going to a different environment representation.

      It's also good enough for pattern matching and logic languages, where the number of variables isn't large but the search has to do things that are complicated enough without a heavyweight environment structure.

    • kbp 6 years ago

      > alists used as a map. Probably the worst possible structure for representing a map.

      They're more flexible than hash tables and often faster; what makes them so awful? They both have trade-offs that make them better and worse in different contexts, depending what you're optimising for.

    • lispm 6 years ago

      > Probably the worst possible structure for representing a map.

      Not really.

      But native hashtables had been added to Lisp in the 70s...

      • kazinator 6 years ago

        Under the right circumstances, a lookup table could be worse. In a simple interpreter, I would hate to pass down a vector of pairs for the environment representation; it would have to be wholly duplicated to extend the environment.

  • marcosdumay 6 years ago

    I remember my undergrad classes where the types of software engineering diagrams were presented. There were some very informative useful data transformation diagrams that were just glossed over in order to reach the useless UML shit.

    I took a very important lesson from those classes, but not the one the professor was trying to explain.

  • kazinator 6 years ago

    It's worth reading all of those Perlis epigrams:

    103. Purely applicative languages are poorly applicable.

    Give me a hundred functions on one data structure, but make thirteen of them useful mutators. :)

  • skybrian 6 years ago

    If you want to avoid OO you'll also want to avoid treating functions as first-class values. As soon as you have structs that contain functions, you can build OO.

    • adrianN 6 years ago

      OO is a way of designing programs. You can avoid it in most languages if you want to and you can do it in most languages if you want to. The language only determines the amount of boilerplate you need.

    • skybrian 6 years ago

      Since I'm getting downvoted:

      Folks, this is Chapter 3 of Structure and Interpretation of Computer Programs. Technically you don't even need a struct; closures and mutability will do.

      And how do you serialize data that has function pointers in it?
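
      A minimal sketch of that SICP-style construction in Python (the bank account is the classic example):

        def make_account(balance):
            # The closure's captured variable *is* the private state.
            def deposit(amount):
                nonlocal balance
                balance += amount
                return balance

            def withdraw(amount):
                nonlocal balance
                if amount > balance:
                    raise ValueError("insufficient funds")
                balance -= amount
                return balance

            # A plain dict of functions acts as the "object".
            return {"deposit": deposit, "withdraw": withdraw}

        acct = make_account(100)
        acct["deposit"](50)    # 150
        acct["withdraw"](30)   # 120

      (And indeed, trying to pickle such a dict of local closures fails, which is the serialization problem above.)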

      • coldtea 6 years ago

        You're not being downvoted because people are unaware of that fact. You are being downvoted because people disagree that it's important.

        Yes, you can create objects with closures, but when people talk against OO programming, they mean the paradigm, not merely the possibility that a language can do objects. Forth and C can do object oriented programming just fine too, if one is so inclined.

      • HumanDrivenDev 6 years ago

        I am almost positive Rich Hickey hasn't read that book. The only thing more stunning than his ignorance when he talks about OO is the amount of young programmers who believe everything he says.

        It blew my mind when I realised I could write objects from scratch in languages with closures, even statically typed ones! Of course each object has its own v-table in the SICP code, so probably not something you should use in production.

        • pjmlp 6 years ago

          I always find it ironic when such young devs bash OOP, given that Clojure has protocols and multi-methods.

          Or when they bash Java and the JVM, without which Clojure would not exist in the first place.

evincarofautumn 6 years ago

For my main compiler project I’ve actually been moving away from this “pipeline” style because it’s inflexible and not great for incremental operations.

Inspired by the Forth dictionary, gradually I’ve arrived at a design where the compiler behaves like a database server—it takes requests in the form of program fragments and produces responses in whatever form you want by querying the database and generating intermediate results as needed. The pipeline is still there, it’s just not hard-coded, and modelled in a more relational way. (I plan on doing a detailed blog post on this after I land the code.)

Ordinary compilation? Push a bunch of files and pull the generated code of anything referenced transitively from the program entry point or library exports. Documentation or TAGS generation? Pull the appropriate metadata. Syntax highlighting? Pull the lexical classes and source spans of the tokens—no parsing needs to happen. In general, if you want an intermediate result, you can just ask for it—there’s no hard-coded code path for “run the pipeline up to this point and bail”.

The advantage is that this is amenable to producing partial results even when there are program errors—a lexical error doesn’t prevent parsing, it just produces an “error” token that can be displayed in an editor without compromising higher-level features like syntax highlighting, autocomplete, navigation, &c.
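
A rough sketch of that query style in Python, with memoized derived queries over pushed sources (the names are made up; this is not the actual design):

  class CompilerDB:
      def __init__(self):
          self.sources = {}   # pushed inputs
          self.cache = {}     # memoized intermediate results

      def push(self, path, text):
          self.sources[path] = text
          self.cache.clear()  # a real system would invalidate selectively

      def query(self, kind, path):
          key = (kind, path)
          if key not in self.cache:
              self.cache[key] = self._compute(kind, path)
          return self.cache[key]

      def _compute(self, kind, path):
          if kind == "tokens":
              return self.sources[path].split()
          if kind == "tags":
              # Depends on "tokens"; pulled on demand, never run eagerly.
              return sorted(set(self.query("tokens", path)))
          raise KeyError(kind)

  db = CompilerDB()
  db.push("main.src", "foo bar foo")
  print(db.query("tags", "main.src"))   # ['bar', 'foo']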

  • seanmcdirmid 6 years ago

    Many pipeline approaches include error tolerance. I specialize in incremental lexing and parsing for IDEs, and even have experience with some state-based error recovery (so, for example, unbalanced braces remember previous brace associations, or a bad transient parse error doesn’t destroy the good trees from the previous run). I don’t see how a database approach would help; you just need to incrementalize the structures in your pipeline (or have reproducible persistent markers like Eclipse’s elastic positions).

    • evincarofautumn 6 years ago

      Ah yeah, you can have incremental structures without an incremental pipeline or vice versa, they’ve just worked well together for me.

      • seanmcdirmid 6 years ago

        You can make your pipeline mostly oblivious to the incremental nature of each stage, which has some SE benefits. However, I like to run all my phases as damage/repair work lists after the initial incremental lex stage: token changes dirty the trees associated with them, putting them on a list for reparsing, which can then cause re-typechecking of those nodes, putting those on a work list, though the parsing worklist is cleaned before the type checking one, which is cleaned before the execution (for incremental live programming) one.

        I’ve never seen two IDE-oriented incremental compilers use the same approach actually, there is a lot of diversity here.

        • evincarofautumn 6 years ago

          That’s an interesting approach. Are you generally working in imperative languages? Last I recall you were using…Scala?

          I’m new to it, but it seems like there’s a lot of innovation quietly happening in this space.

          My approach is more like a declarative build system, I guess. To add TAGS support, for example, you add an output that can generate a TAGS file, and declare that it depends on a resolved AST (one with fully qualified names). Then you can make queries like “pull tags from {this set of files, the whole database, everything in this namespace, &c.}”. Dependencies are resolved, and errors reported, mostly automatically unless you want to change the default behaviours.

          If you want to regenerate tags after changing some input, then as an optimisation you can compute only the diff resulting from whatever dependency was changed upstream (usually a file, ultimately) at the memory cost of caching intermediates (which right now means leaving the server running).

          • seanmcdirmid 6 years ago

            I haven't done anything with Scala in a long time.

            Tree-level incrementality is important for large projects with < 50 ms response times. But it does require memoizing lots of information.

vicpara 6 years ago

It's a good article. Indeed, compiling a programming language is such a complex job that it is frankly quickly dismissed and taken for granted. My view is that learning about and writing a simple compiler will make most developers a notch better. At the very least you'll understand why parsing HTML or XML with regex is such a bad idea :)

Another golden resource is the Dragon book: https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniq...

  • mojuba 6 years ago

    > My view is that learning about and writing a simple compiler will make most developers a notch better.

    Agreed. One of the interesting insights I got while building my own compiler was that despite what every compiler book says, there are languages that can be compiled on the fly without building an AST. I'd say for most languages it's an unnecessary step that makes compilation slower and more resource hungry. However, compiling without an AST is a next-level skill, another notch I think.

    • justinpombrio 6 years ago

      > One of the interesting insights I got while building my own compiler was that despite what every compiler book says, there are languages that can be compiled on the fly without building an AST.

      PL (but not compiler) researcher here. What you're saying sounds to me like "One of the interesting insights I got while programming my own game was that despite what every game programming book says, there are games that can be programmed without defining any functions." I mean, yes, you can do that, and it may in fact be more efficient, but it's going to cause trouble down the line if you ever want to do something more complex.

      • munificent 6 years ago

        > What you're saying sounds to me like "One of the interesting insights I got while programming my own game was that despite what every game programming book says, there are games that can be programmed without defining any functions."

        Ex-game programmer and current PL programmer here. I think a better analogy is that you can make a game without having a scene graph. And, indeed, you can. And it works fine for many simple games, though at some scale it starts to be a hindrance.

      • compiler-guy 6 years ago

        Yeah, the grandparent is only true if you don't want to do interesting things with the language, like nearly any but the most basic peephole optimization.

        But a fun fact, if you know about single-pass compiling, you can figure out a lot of why the original C syntax is what it is. A declaration is a "reserve some memory" statement. Local variables had to be declared at the beginning of scope so that the compiler could deallocate them at scope close. Without an AST, the compiler had to produce instructions immediately.

        With a true AST, you don't need such a restrictive syntax.

        There are many other possible examples.
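
        A tiny sketch of the single-pass idea in Python: a recursive-descent parser for a toy +/* grammar that emits stack-machine instructions as it parses, with no AST in between (the opcodes are made up):

          # expr := term ('+' term)* ; term := NUM ('*' NUM)*
          def compile_expr(tokens, out):
              compile_term(tokens, out)
              while tokens and tokens[0] == "+":
                  tokens.pop(0)
                  compile_term(tokens, out)
                  out.append("ADD")       # emitted immediately, no tree built

          def compile_term(tokens, out):
              out.append(("PUSH", int(tokens.pop(0))))
              while tokens and tokens[0] == "*":
                  tokens.pop(0)
                  out.append(("PUSH", int(tokens.pop(0))))
                  out.append("MUL")

          code = []
          compile_expr("1 + 2 * 3".split(), code)
          print(code)   # [('PUSH', 1), ('PUSH', 2), ('PUSH', 3), 'MUL', 'ADD']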

      • HumanDrivenDev 6 years ago

        well there's no need for an AST in a concatenative language - that might be what he's talking about.

      • mojuba 6 years ago

        No, it's not the same thing. A compiler without the AST phase is itself more compact and arguably more beautiful, but again, not every language is compilable this way (C++ being a notable example).

    • robertelder 6 years ago

      My experience has been that you can get incredibly far (much farther than you'd think) without any form of AST, just iterating over the raw parse tree and tokens. But the problem you eventually encounter is that if you want to support all the different cases in a 'real' and messy evolved language like C, you get a combinatorial explosion when you try to cover all the corner cases of the language. A great example of this would be trying to figure out how to write a function to iterate over the tokens that make up a struct definition in C:

      http://blog.robertelder.org/magical-world-of-structs-typedef...

      This is the reason for needing the extra 'abstract' syntax tree layer that lets you generalize over all those different cases and avoid having 20-level nested if statements in your compiler.

    • seanmcdirmid 6 years ago

      Compiling without an AST is actually more like an old school skill than a next generation one: very common in the 60s and 70s, very rare today.

  • verletx64 6 years ago

    Can I ask, as somebody who hasn't dived into compilers yet (not out of not wanting to, but more of a 'so much to learn' scenario):

    How does it make a developer better? I see this repeated often, but I don't really see how.

    • slx26 6 years ago

      For example: once you see what the compiler does, you can write readable code without sacrificing efficiency.

      When you don't really know what's going on under the hood, consciously or unconsciously you write code based on assumptions, which a lot of the time are wrong. It's very hard to exemplify because the mental model programmers have about programming can be very different, and therefore the benefits of a better understanding provided by learning about compilers can have different effects on different people. And still, everyone who has tried recommends learning about compilers.

      Programming languages are our most fundamental tool as programmers, and yet it's pretty much impossible to make a fair assessment of why they are designed the way they are, why they work the way they do, why they provide the abstractions and structures they provide, until you try for yourself. Understand programming languages better => use them more naturally.

    • magpi3 6 years ago

      Compilers are a great example of taking a very complicated task and breaking it down into simpler (though not necessarily simple) steps. Just going through the practice of building one might change how you look at other projects you’ve worked on in the past.

      And I think the insights you take from designing a compiler translate into data processing as a whole, as other commenters have noted.

    • Ono-Sendai 6 years ago

      You get an insight into why current languages and compilers are designed the way they are. For example, why does C have #include? Probably because it's the easiest possible way of including/referencing source code from another file.

    • kqr 6 years ago

      It's surprisingly practical to view a lot of data mangling tasks through one's compiler glasses.

  • mbrodersen 6 years ago

    The dragon book is not recommended. It is very out of date. I write (among other things) high performance compilers for a living. And the Dragon book is not useful at all.

  • kazinator 6 years ago

    Sunday before last (twelve days ago) I started working on a VM, assembler/disassembler and compiler for TXR Lisp, just early morning and evening spare time.

    Assembler and VM handle lexical closures, exception handling, unwind-protect, function calls (of course) and global variable access, and dynamic environments (special variables). Compiler handles a significant number of important special forms: let, lambda, function calls, assignment and others. lambda handles optional parameters with default values, and any parameter can be a special variable, correctly saved and restored, and excluded from the lexical environment.

    VM closures are hooked into the rest of the environment; they are represented as a new kind of function and can be used in the normal ways.

    https://www.reddit.com/r/lisp/comments/84j8l0/txr_lisp_vm_co...

    This stuff is not terribly difficult.

    Compiling Lisp is in many respects easier than interpretation. A huge simplifying factor is that in the compiler, you can mutate an environment object as you compile: that is to say, add a binding, compile something, add a binding, compile something ... The compiled something, even if it contains a lexical closure, does not see any of the bindings you didn't add yet. Can't do that in the interpreter; not with the same environment representation. If you interpret something which captures a closure and then extend the captured environment, you've screwed up.

    I have to grapple with a couple of messy special forms. That's how it goes when you've been interpreting for years and now you're introducing a compiler. One form is the one which underlies the evaluation of the string quasiliteral syntax. Unfortunately, the special form is deeply interpretive; it passes the environment into some functions which walk the syntax of the quasiliteral doing eval here and there, which come up with the pieces of strings that catenate together to form the output. I'm unraveling that stuff for compiling; coming up with a way to hoist the evals out into static code, and some function API that can then just work with the results of the evaluation, yet maintain compatibility with the interpreted semantics.

    The annoying thing is that just the partially complete transformation code for this crud is 60 lines of Lisp so far, whereas the entire compiler (so far), which accomplishes so much more, is only 378. That's a big percentage, in other words. It's an annoying thing in programming that sometimes code which is very specialized and achieves little in the "big picture" burns more LOCs than you'd think it deserves.

    • dfox 6 years ago

      One thing I tried to do in dfsch is to use the same IR format (in the sense of being interpretable by the interpreter) across all the stages. What I found out is that this is surprisingly hard to do in a Lisp-1 language while preserving the semantics, to the point that I have somewhat abandoned the language and am thinking about making it a CL-style, fully lexically scoped (i.e. with support for labels and similar special forms) Lisp-2.

      (A CPS-style IR does not make much sense for dfsch, as it explicitly and intentionally does not support call/cc, since the main use-case is easy and straightforward C interoperability.)

      Edit: dfsch distinguishes mutable pairs, immutable pairs (which are somewhat redundant and in fact consume a combination of tag bits that could be used for something more useful, like strings), and CDR-coded lists, which are always immutable. CDR-coded lists have an additional slot that records where they came from, for things that were either parsed from source or are the result of macro-expansion, which in my opinion meets the intention of R5RS. On the other hand, the very aggressive constant folding done by the macro-expansion stage probably does not.

tomxor 6 years ago

I find this interesting from the perspective of the design mentality. I’ve read so many comments here starting with "I am a PL researcher" or "I am a compiler designer", often with the qualification "but not the other one", that it seems like unless you are doing a Forth-style language you’re almost never both...?

My interpretation of the author’s main point is: breaking compilation down into a pipeline of blocks has obvious advantages, but much like when we break any large implementation problem down into blocks, we can lose sight of the whole - in short, it’s important to maintain a holistic view while simultaneously deciding where to "cut" your problem. But that holistic view to me seems to mostly end at the intersection between compiler and language.

When I started my job it involved a lot of one-way design implementation. As the designs became more adventurous I gradually found myself questioning them more, until now, where I see basically no division between design and implementation - when a "design" for something comes my way I basically pull it to pieces and interrogate the author in order to understand all of the decisions behind it, and then rebuild it (with the author) with implementation and practicality in mind. It always benefits everyone involved. I feel like this previous division that I destroyed is quite similar to the division between compiler and language, but I am mostly an outsider to that world so I could be wrong.

skybrian 6 years ago

Compiler design is somewhat unusual just because there are so many different cases to deal with. An AST might have 20 different kinds of expressions, and anything traversing it needs to deal with them all.

This means that whatever solution you come up with needs to scale to lots of cases, more than you can comfortably remember. And for a language that's not stable, you need to update all the tools when the AST changes, to handle the new construct.

(Lisp isn't immune; it calls them special forms and they're uniformly represented as lists, but you still have to deal with them for any kind of deep analysis.)
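
A small sketch of the shape of the problem in Python (3.10 match syntax; the node names are illustrative): every traversal has to enumerate every node kind, and adding a kind means touching every traversal.

  from dataclasses import dataclass

  @dataclass
  class Num: value: int

  @dataclass
  class Add: left: object; right: object

  @dataclass
  class Neg: operand: object
  # ...a real AST would have another 17 of these...

  def evaluate(node):
      # Every tool (printer, type checker, codegen, ...) repeats a match like this.
      match node:
          case Num(value=v):         return v
          case Add(left=l, right=r): return evaluate(l) + evaluate(r)
          case Neg(operand=o):       return -evaluate(o)
          case _:                    raise TypeError(f"unhandled node {node!r}")

  print(evaluate(Add(Num(1), Neg(Num(4)))))   # -3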

  • seanmcdirmid 6 years ago

    You can traverse the AST via internal functions (in OO terminology, virtual methods that make the case matching implicit on this). It really isn't much of a problem if the functions are typecheck, codegen, execute, etc...

    I don't really get the appeal of using giant mega case matches in a compiler.

    • qznc 6 years ago

      AST transformations are an exemplary instance of the Expression Problem. Solutions in mainstream languages, e.g. the Polyglot framework for Java, lead to ugly, overengineered code. Writing AST transformations with multimethods might be nice.

archi42 6 years ago

I think the OO part is a bit ranty, but other than that it's a nice article: software engineers can really learn a lot from compiler design. I never regretted taking that lecture (we excluded C typedefs and some other meanies). If you're bored and want to learn, write an LLVM frontend, and maybe a simple optimization (e.g. constant propagation or dead code elimination).

There is a tutorial for a functional language in the LLVM docs, or just do a subset of C, or your own DSL ;)

Veedrac 6 years ago

This all assumes that you want most code to have the properties of compilers, but this seems misguided to me. Compilers don’t handle complexity well; they’re unreliable and not stable against small random perturbations, naïve approaches tend to be inefficient and complex ones global, etc. A lot of this is just that they’re solving a difficult task (converting terrible C code into less terrible assembly), admittedly.

  • runevault 6 years ago

    I wonder if there are any ways to build simple-to-compile but still useful languages, outside of just doing a Lisp or something Lisp-like.

    • munificent 6 years ago

      Almost every language is less thorny than C because of C's context-sensitivity and the preprocessor. Pascal is a classic language that's useful but easy to compile. Any of Wirth's successor languages are too (occam, etc.). SML is a step up, but still not intractable. Java 1.0 might be fun.

      • runevault 6 years ago

        Hm fair point. BTW I'd just like to say thank you for Crafting Interpreters, been going through it (in c#) and enjoying it immensely.

        • munificent 6 years ago

          Thank you! That's great to hear. :)

    • shakna 6 years ago

      FORTH is probably the classic "easy to compile" language, and is quite a decent language in and of itself.

    • UncleMeat 6 years ago

      The hard part of writing a compiler is not the lexing, parsing, type checking, or code generation. The hard part is the optimizing bit.

      There do exist languages that are more amenable to whole-program analysis, but the things that cause problems are usually not obvious. Java, weirdly, is among the easier languages to analyze because its memory model happens to be very easy to work with (excluding nulls).

  • vicpara 6 years ago

    Language design goes hand in hand with compiler design. No wonder we have so many languages being born every year: a different mix of ideas being tried in each of them, or forgotten lessons being rediscovered.

    Just taking a look at what D, Rust, Lisp and Scala propose will give one the chance to ask crucial questions about the implications of each guarantee, its pros and cons, and how it will all work out.

evmar 6 years ago

Another place where lexing and parsing interact is JavaScript. Consider the lexer encountering the text "/x;".

If it occurs in code like:

   let a = /x;
then it needs to continue lexing a regular expression.

If it occurs in code like:

   let a = b /x;
then it needs to lex it as division sign, identifier x, semicolon.
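
A toy sketch in Python of how a lexer can make that decision: it keeps one bit of state about whether the previous token could end an expression (a simplification of what real JavaScript lexers track):

   # A slash after something that can end an expression is division;
   # otherwise it starts a regular expression literal.
   ENDS_EXPRESSION = {"IDENT", "NUMBER", "RPAREN", "RBRACKET"}

   def classify_slash(prev_token_kind):
       return "DIVIDE" if prev_token_kind in ENDS_EXPRESSION else "REGEX_START"

   print(classify_slash("EQUALS"))   # let a = /x;   -> REGEX_START
   print(classify_slash("IDENT"))    # let a = b /x; -> DIVIDE
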
chubot 6 years ago

Regarding splitting lexing and parsing, I argue the opposite point here:

When Are Lexer Modes Useful? http://www.oilshell.org/blog/2017/12/17.html

I assert that scannerless parsing is a bad idea, for reasons of efficiency and code complexity. He's making some argument around generated parsers which I don't understand.

I believe he's conflating the issue of whether you generate your parser with whether you have a separate parser and lexer. (The OSH parser is hand-written, while the lexer is generated. This is actually the opposite of what you see in many languages -- e.g. in Python the parser is generated, while the lexer is hand-written.)

There's no reason that you can't produce good error messages with a separate parser and lexer. In fact I believe you'll get better error messages. I use a "lossless syntax tree" representation which allows good error messages:

From AST to Lossless Syntax Tree http://www.oilshell.org/blog/2017/02/11.html

(On the one hand, it's like a parse tree, but less verbose. On the other, it's like an AST, but it doesn't lose information.)
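
A minimal sketch of the lossless idea in Python (the field names are mine, not Oil's): each token records only its span in the original source, so the exact text, whitespace and comments included, can always be recovered:

  from dataclasses import dataclass

  @dataclass
  class Token:
      kind: str
      start: int    # byte offset into the original source
      length: int

  SOURCE = "x  =  1  # note the extra spaces"

  def text_of(tok):
      # Nothing is copied into the token; the source itself stays the truth.
      return SOURCE[tok.start:tok.start + tok.length]

  tokens = [Token("NAME", 0, 1), Token("OP", 3, 1), Token("NUMBER", 6, 1)]
  print([text_of(t) for t in tokens])   # ['x', '=', '1']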

----

Regarding intermediate representations, I had a lot of success with Zephyr ASDL, which is a DSL for using algebraic data types in procedural/OOP languages:

http://www.oilshell.org/blog/tags.html?tag=ASDL#ASDL

I also don't really see the strong contrast between OOP and FP he's making. Oil uses both styles -- the IR is at the center, using a functional style (data without methods), but the other components are objects/modules/interfaces/encapsulated state/data hiding.

IMO OOP and FP are the same thing when done right: they're about being principled about state. The way I see it, a pure dependency injection style is pretty much equivalent to functional programming.

Python is described with ASDL: https://github.com/python/cpython/blob/master/Parser/Python....

ASDL has some relation to "Stanford University Intermediate Format", which sounds like it could have been a precursor to LLVM IR. In other words it happens to be used in interpreters right now, but it was designed for compilers.

https://www.cs.princeton.edu/research/techreps/TR-554-97

https://suif.stanford.edu/suif/

---

I also wrote about "the Lexer Hack" here: http://www.oilshell.org/blog/2017/12/15.html

Bizarro 6 years ago

The use of middleware seems to embrace this concept too for very flexible systems. Just pass it along the assembly line and you don't care who's next in line or who was before you.

blt 6 years ago

The parser/lexer separation feels especially useless in a Lisp interpreter, where the parsing is trivial.

  • kazinator 6 years ago

    It's there. E.g. ANSI Lisp has the concept of a token. Reading multiple objects until a closing ) via read-delimited-list is like parsing. The actions which accumulate token constituent characters into a token and produce some object from it, like a number or symbol, are like lexing.