May 20, 2008

Fear of parsers?

Martin Fowler posts a lot about DSLs these days. In a recent post about ParserFear, he comments on an apparently typical reaction against creating one's own DSLs. That reaction seems to be that parsers are hard to write, and that it's easier to use XML instead because with XML, you get the parser for free. Martin Fowler then continues to explain why in his experience, parsers are not hard to implement, by contrasting a specific XML case with an alternative design using Antlr.

I think he misses the point, though. It's indeed the case that, taken by themselves, parsers are not very hard to write, especially if you stick to simple grammars. However, they could just be a too high investment for too little return.

This reminds me of a different story: A couple of years ago, Erich Gamma answered a few questions about patterns in one of his talks. One question was about which patterns he would not include anymore in the Design Patterns book. Among others, he mentioned the Singleton and the Visitor pattern, and his explanation for not including them was that he deems them too complicated.

Most people react puzzled when they hear this story. Yes, everybody who has tried to implement visitors knows that they are quite complicated, but in contrast, singletons seem extremely simple and straightforward to implement. However, the major point here is that they are too complicated for what they achieve: A singleton only guarantees that you get exactly one instance of a class, not more, not less. You might as well just introduce a global variable with that one instance and don't bother going through the minutiae of implementing the Singleton pattern correctly (which has border cases that you can get wrong after all, depending on what language you have to implement it in).

That's the major point: The effort has to be compared against the benefits you achieve. The same holds for writing parsers for DSLs. A domain-specific syntax simply doesn't buy you that much, but just creates another layer of code that needs to be maintained and can create follow-up problems, for example, when the syntax you designed happens to be too inflexible to be adapted to change requests in future versions of your code.

This is also the main reason why Lispers like s-expression. The rules for s-expressions are extremely simple, but at the same time also very flexible: The first element in a list determines the meaning of an expression, and all other elements are interpreted in terms of that first element. The same in XML: The tag determines the meaning of an expression, and everything that is nested inside it is interpreted in terms of the tag. An advantage of Lisp over XML is that you don't even need separate reading and processing steps of DOMs, since s-expressions are seamlessly embedded in the language itself.

So the main point of "parser fear" is not that parsers are hard, but just too hard for what they buy you.

15 comments:

evrim said...

Dear Mr Costanza,

After reading your post, I thought you might be interested in Core-server parsers. We have implemented several RFC's and HTML parsers using parser/render grammar and composable stream.

Streams:
http://labs.core.gen.tr/repos/core-server/src/streams.lisp

Parser Grammar and Parser Expander:
http://labs.core.gen.tr/repos/core-server/src/grammar.lisp
http://labs.core.gen.tr/repos/core-server/src/parser.lisp

Several Parser Examples:
http://labs.core.gen.tr/repos/core-server/src/rfc/

Kind regards,
evrim _at_ core.gen.tr

elarson said...

My take is that the effort for others to use some DSL is also too expensive. When you account the implementation, docs and maintenance, it is very hard to justify using a DSL.

ferruccio said...

With XML, you don't really get the parser for free. The parser just moves upstream. Instead of having to parse a character stream, you wind up "parsing" the XML parser's output, so your parser is operating a slightly higher level of abstraction.

The real benefit of XML is that because it's so well known, new users can starting working with your data immediately. There's no learning curve for the syntax.

Casper Bang said...

What's the alternative to using visitor though? Seems to me it's the best way to work on a hierarchy without actually having to modify the individual classes.

Pascal Costanza said...

The alternative to using a visitor? Using a programming language where "modifying" a class doesn't hurt that much. ;) CLOS comes to mind. But even Smalltalk and Ruby are not too terrible.

The technical term is "open classes", but it seems hard to google for it. There are a few papers on MultiJava, though, that discuss open classes. C# also should have "partial classes", right?

For CLOS programmers, that's pretty much a given, since it's based on generic function, and so methods are not tied to classes, but are defined outside of them. So adding new functionality to existing classes is trivial.

Peter Seibel has a pretty good introduction on generic functions, which you might be interested in.

Casper Bang said...

Every time I mention partial classes or extension methods or similar stuff, I am immediately put in place by Java purists.
So for those of us confined by corporate language/tool policies (and thus, the will and capabilities of Sun), there seem to be few other options than stay in the existing pattern soup land.

grant rettke said...

The worst part about DSLs is the fantasy that gets peddled regarding them. For example, the business will write all of the logic in their little language that you came up with!

No one could ever justify the amount of work required to realize such claims.

The rest of the applications of DSLs, of course, would make sense, and require not so much work that folks will just use them.

Pascal Costanza said...

It's easy to stay in Javaland and not have to put up with the silly Java language itself. Just use a 'scripting' language. (A very effective tactic for getting a different programming language in is to not call it a programming language.)

Apart from that, even within the confines of Java, visitors are too complicated for what they achieve. Modifying classes manually is not so bad after all, especially with proper IDE support.

Steve Yegge said...

I was a big fan of DSLs a few years ago. Then I wrote a very small one, and was kinda shocked at how much work it was. The ANTLR grammar wasn't too hard, and it was fun to write, but documenting the damn thing was incredibly time-consuming. And it wound up being a significant chunk of code: some 6000 lines of it. And this was a pretty teeny DSL: http://www.cabochon.com/player/autobag

As a direct result, nowadays I'm much more cautious about creating a DSL. I'm always on the lookout for places to use them, but my bar is higher.

RegExps and XPath are the canonical examples of how DSLs can have truly dramatic impact. They give me hope that there are other domains for which "Universal DSLs", so to speak, could drastically simplify everyone's code.

But I think Pascal's point is pretty valid: before you embark on a DSL you'd better be darn sure it's going to have a massive payoff, since it'll be more work than you expect. Educating your users may be your biggest obstacle - they need to see the benefit or they won't bother to learn it.

Most of the time I think you should just be using a better general-purpose language.

I still wish people weren't afraid of parsers, though. Parsers are empowering. Writing a DSL for others == lots of work. Writing a code-processing tool for yourself == easy productivity boost.

grant rettke said...

Do macros in Lisp count as DSLs or are those too easy? ;)

Pascal Costanza said...

Macros by themselves are just a tool. But, yes, you can build DSLs with macros. See this post about DSLs in Lisp, and especially, watch Rainer Joswig's quicktime movie demonstration.

You will see that DSLs in Lisp are pretty much non-invasive and seamless.

samantha said...

XML? XML was designed to capture and transmit hierarchical data relatively transparently. That's it.

A DSL on the other hand is a language designed to be much more expressive than raw code or data sets in a particular problem domain. Yes you may be able to express the problem domains semantics somehow encode in this XML data language. But you would be imposing a large impedance mismatch. In principle you could write any imaginable program in XML. But who could say that it read as easily as the original language, was easier to understand or maintain? XML is not a panacea. Just the fact that it is widely known does not make it automatically a prized tool or likely good implementation technology for all purposes.

Steve Knight said...

I'm not sure if it's entirely appropriate but I'm going to invoke
Greenspun's Tenth Rule of Programming: "Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified bug-ridden slow implementation of half of Common Lisp."

Which almost sums up what I want to say. It seems that where I have seen DSLs implemented as grammar-food for lex/bison/yacc they have had significant overheads to learn and to maintain.

Conversely from what I've seen (and done) using DSLs in CL, both learning and maintenance overheads are greatly reduced.

Steve

Jerry Boetje said...

A long time ago, when a graduate student of Eugene Charniak, I was having difficulty getting started in building a complex project. He suggested that I devise a language to describe the project and then implement the language. Of course, I took his advice only after exhausting all alternatives. But once I took his advice and create a DSL in CL, building the project was much simplified. Using DSLs are a great way to help understand the system under design. But as others have noted, it's much easier in CL.

Aidan Kehoe said...

XML has the advantage, usually ignored by people who dislike it, that it has well-defined, widely-implemented support for non-ASCII data. JSON doesn't have that. S-expressions don't have that.