2006

Java and XML: Let's not use them together

My dissatisfaction with both java and XML is fairly well documented in previous posts. To recap two major points:

  • Java isn't flexible enough, both syntactically and with respect to its type system. (Let's leave aside the lack of a reasonable lambda-style syntax for the moment, which is higher on my List of Things That Make Me Cuss When I'm Programming In Java, but is not as relevant for this article.)
  • XML is, by design, horrifically redundant. (This is almost acceptable for what it was intended to be: a non-human readable data format that didn't have the negative connotations associated with s-expressions. It is totally unacceptable now that people are forced to look at it all day long.)

What I'd like to look at in this post is why I think XML has become such a huge part of java development and why I think that is unfortunate. The short answer to the first part is:


People need Domain Specific Languages (DSLs)



Why do people need DSLs? Because there are whole chunks of applications that don't need a full, general programming language, and for which a full, general programming language is poorly suited. O/R mapping is a good and common enough example. Build tools are another: you want a syntax that encapsulates the common operations so you don't end up generating reams of general code to do basic activities.

As noted in this paper, there is a continuum between libraries and DSLs. So, why have DSLs at all? Why not simply design libraries?

The answer to that question is: syntax. As much as academics might scoff at it, syntax matters, and it matters a lot.

This is precisely why Ruby is enjoying so much success right now: Ruby's syntax and evaluation rules are so flexible that they allow you to create minimal, expressive DSLs with very little effort. You simply have to get your head around how meta-programming in Ruby works, and you are off to the races. Ruby on Rails is a DSL for building web applications, and a pretty darned good one.
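To make that concrete, here is a minimal sketch of the kind of DSL I mean: a toy build description in a couple dozen lines. Everything in it (BuildSpec, build, task, depends_on) is a made-up name for illustration and not from any real library. The whole trick is instance_eval, which runs the block with a BuildSpec as self, so a bare "task" reads like a keyword.

```ruby
# Hypothetical mini-DSL for describing build steps (illustrative names only).
class BuildSpec
  def initialize
    @tasks = {}
  end

  # `task :name, depends_on: [...] do ... end` records a named step.
  def task(name, depends_on: [], &action)
    @tasks[name] = { depends_on: depends_on, action: action }
  end

  # Run a task after its dependencies; returns the names in the order they ran.
  def run(name, done = [])
    return done if done.include?(name)
    entry = @tasks.fetch(name)
    entry[:depends_on].each { |dep| run(dep, done) }
    entry[:action].call if entry[:action]
    done << name
  end
end

# Evaluate the block with a fresh BuildSpec as `self`, so `task` needs no receiver.
def build(&block)
  spec = BuildSpec.new
  spec.instance_eval(&block)
  spec
end

log = []
spec = build do
  task :compile do
    log << "compiling"
  end
  task :test, depends_on: [:compile] do
    log << "testing"
  end
end

p spec.run(:test)  # => [:compile, :test]
```

The library is still sitting there underneath, but the person writing the build description never has to look at it.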

Java is, of course, much more locked down than Ruby. And this isn't necessarily a bad thing. The Java designers were coming from a world replete with horrible C macro-kludges, so it's understandable that they decided to leave out syntactic extensions. If every man is a language designer, you end up with a ton of badly designed languages. But you also end up with a few very well designed ones. And I'm not entirely convinced that the vast majority of useful, small DSLs aren't simply badly designed languages that answer a particular specific need, akin to German's relationship with soldiering.

In any event, that's wandering a bit off point. The facts on the ground today are that Java developers have been in need of a way to design and implement DSLs for a while now (even when they don't call it that) and the accepted way to do it has become XML. Why?

My theory is this: in Java, DSLs usually start out as a library, then progress to a library with a smidgen of configuration. XML became the de facto standard for config files during the .com boom, property files apparently not being cool enough, so config information ended up in XML files. Additionally, XSDs give us rudimentary tools for verifying language syntax (though not semantics).

All fine and well. I might pick another syntax for structured configuration (say, YAML), but whatever. XML is reasonably suited for simple declarative programming.

But then we Java developers started doing more and more in those config files and, at some point, they began to cross over that invisible line and become semantically crucial parts of our applications. They no longer simply contained a few flags used to slightly modify runtime behavior. They became an XML-based programming language for crucial subsystems.

This is unfortunate, for many reasons. Among them:

  • We have traded Java, a language that, while certainly not beautiful, is at least plausible, for one that was never designed for human consumption. XML is utterly miserable to use in large quantities. See ant build files, and weep.
  • We now have to think in two different syntaxes. I maintain that this is a difficult transition for a significant portion of the programming populace.
  • We cannot have any sort of locality with related java code. Again, the syntax is so utterly foreign that it is like mixing Japanese and English. Even if we could put it in the same file, or add IDE support to navigate from one to the other, it wouldn't work well.
  • And, most interestingly to me, it becomes difficult to communicate whatever type information we have built into our DSL to Java. We have two choices I can see:
    • We can do Java code generation off of our XML-based DSLs, which everyone hates. Among other things, it takes time, requires a lot of infrastructure work and can introduce some nasty build dependencies. We do a fair amount of this at Guidewire.
    • We can communicate with our DSL via non-typesafe mechanisms (usually hashes and strings). This is the preferred mechanism because it is the easiest. Simply do nothing! But then one wonders why we spend so much time crucifying ourselves on the cross of type safety in java, when increasing amounts of our application code reside across this great type-unsafe divide.


So, that outlines why I think we ended up with so much XML in our Java applications, and why I view that as an unfortunate thing. Now the hard part: what can be done about it?

Frankly, I have no idea.

My first reaction is that we need to open up java with a type-safe macro language to allow for syntactic extensions. But as nonchalant as ruby has made me about language extensions, it still seems insane in java. The macro (meta?) language needs the ability to communicate with the java type system easily, making it easy to generate coherent error messages.

I realize, of course, that I may simply be saying something as absurd as "let's make hard problems easy," but I have to believe that there is a better way than the current state of things.

I'm going to spend some quality time with O'caml/camlp4 over the next month and see if I get anything out of it.

Related Links:
  • Wikipedia link on DSLs
  • Camlp4 - O'caml's DSL support framework

Mail, an Odyssey

A longtime member of my list of Things To Do has been to move my home email server from my UW-IMAP/mbox based server to a Courier-IMAP/Maildir based system. Earlier this weekend, I came to the realization that I am no longer interested in maintaining a linux server. The bloom is off the rose. I have a perfectly good (in fact, amazingly good) hosting company in Dreamhost. I'll let the pros, who I'm paying anyway, worry about server uptime, security patching, backups and all the rest. Dreamhost already has Courier set up so all I need to do is get my existing mail onto their servers.

One small problem:

I have on the order of 10,000 emails going back six years, all in mbox format.

Attempt 1:
Oh, I'll just run my home imap server and my dreamhost imap server side by side in Mail.app, and copy the emails over.

Er...

Boy that's slow.

And boy is Mail.app bogging badly.

Holy CRAP that's slow!

And now my dreamhost terminal connection is barely responding.

ABORT, ABORT, ABORT!


Deep breath. OK.

Attempt 2:
So clearly IMAP to IMAP communication isn't going to work well. So, what else can we do? Well, I can just scp the mbox files to my local machine, import them into Mail.app as local messages, and upload them from here to dreamhost...

Hmmm. Well, mbox import sucks in Mail.app. But I'll wait it out, because I'm sure it will be a smokin' upload.

taps fingers
...

goes off to do other things
...

And we're back. OK, everything is loaded into mail. Admittedly, the CPU is pegged and the machine is barely usable, but that's because Spotlight is indexing all this temporary content like a coked up rabbit.

So, kill Spotlight's various processes and hope that the machine doesn't hang. Well, it doesn't hang per se, but it has stopped accepting click events.

Expletive Removed
...

OK, log out, and log back in. Miracle of miracles, the UI works and the email is locally available. Ready to try to upload. Here we go...

Err....

It's slower than the IMAP to IMAP copy?!?

I'm struggling with how this is even technically possible...

ABORT, ABORT, ABORT!


Deep breaths. Deep, Deep breaths. OK.

Attempt 3:
OK, Mail.app clearly sucks at this sort of thing. Let's download Thunderbird, which makes my eyes bleed, but has a reputation for a pretty solid IMAP implementation.

Local mbox import is a dream, performance-wise, but pretty technically obscure. Always knew that CS master's would come in useful at some point.

Try uploading again.

After about 10 minutes, my head is in my hands, and I'm quietly sobbing. I hit the cancel button.

Attempt 4:
Now, it's pretty clear that we've established that IMAP, whatever its virtues, is not a protocol that sings when dealing with bulk operations. So why don't we use protocols that are? I scp my mbox files up to the dreamhost server.

Er... Now what? Part of the whole point of moving them over via IMAP copy was to convert from mbox to Maildir.

Well, there is an mb2md perl script for converting from mbox to Maildir. But first things first. If I'm going to have to deal with all this stuff locally, it's time to do some research on exactly how the Maildir file format works...

2 hours of googling


OK. Now that I have a vague understanding of how it works, let's run the script.

nothing happens


Hmmm. Is it working? It's clearly doing something, since my shell is bogged down, and top says it isn't dormant. But I'm a funny sort of user: I like to have feedback.

ctrl-c


I dug around and found the inner loop in the perl script, and stuck a print statement in, which made me feel better.

The entire conversion took about 40 minutes, making me more sympathetic to the plight of IMAP, and causing me to pause and reflect on the confounding usefulness of the perl programming language.

So, now I've got my email in the right format, but not in the IMAP Maildir directory. So let's just copy them in.

Er...

Holy crap that copy is slow.

Is this stuff network mounted or something?
shrug Well, whatever the case, it's not getting faster than this.

An hour later, the copy is complete, and I'm even more willing to let IMAP's problems slide.

OK, now that everything is in, it's time to log in with Thunderbird and start reorganizing my imported email...

Select 100 mails, create a new folder, and move them over.

And wait.

And wait.

And...
server time out.

G*DD**MIT!

OK, OK. Look at the folders. Well, the messages were moved (at least some of them), but they weren't deleted from the original folder.

Now I'm in a really terrifying state, where I don't know how inconsistent the data has become.

ABORT, ABORT, ABORT!


Attempt 5:
The key realization hits me: mutt, my favorite term-based email client, speaks Maildir! Dreamhost has it installed, of course. I put the appropriate hieroglyphs in my .muttrc, and blam, I'm slicing and dicing my email locally, with only minimal bog.

Furthermore, mutt has built in functionality for removing redundant emails from email directories, so I can just organize as I wish, and not worry about creating duplicates since I can open the folders up and remove them easily.
Beautiful.

Now if I had just realized all that back at attempt 1.

Greenspan's legacy

http://tinyurl.com/bzrr8

Nothing to see here. Everything is fine. Inflation is low and the economy is humming along.

Can I offer you a credit card?

All Ordered Combos

Got a bit obsessed with this last night. The wife is gone, so I have to find something to do with myself.

[Image: Ruby code listing, not preserved]

It's a bit convoluted, although not bad and nowhere near what the java implementation would look like.
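Since the listing itself was an image, here is a guess at the general shape of an inject-based version, not the original code. It assumes "ordered combos" means every subsequence of the array with the original order preserved, and the method name all_ordered_combos is made up for the sketch.

```ruby
# Hypothetical reconstruction: every subsequence of `items`, order preserved.
def all_ordered_combos(items)
  # Start with only the empty combo. Each element doubles the list by
  # appending itself to a copy of every combo built so far.
  items.inject([[]]) do |combos, item|
    combos + combos.map { |combo| combo + [item] }
  end
end

p all_ordered_combos([1, 2, 3])
# => [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]
```

The result has 2**n entries for an array of n items, all held in memory at once.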

OK, OK, OK, last one, I promise

[Image: Ruby code listing, not preserved]

Nowhere near as elegant as the inject method below, but this covers both assignment and block usages, so you can pick your poison. If you pass in a block then you have linear rather than exponential memory usage, although the run time is of course equivalent.

OK, I'm done. This is my final answer.

Wait... Maybe we should add an optional argument with a default value that limits the output, to prevent inadvertently calling it with an array that will take forever to return...
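The listing for this post was an image too, so what follows is only a sketch of the idea and not the original code. It assumes "ordered combos" means every subsequence of the array with order preserved; the name each_ordered_combo and the default limit of 1,000 are both invented for illustration. With a block, each combo is yielded and then dropped, so only one is alive at a time. Without a block, everything is collected into an array for assignment.

```ruby
# Hypothetical reconstruction. `limit` caps how many combos are produced;
# the default of 1_000 is an arbitrary value picked for this sketch.
def each_ordered_combo(items, limit = 1_000, &block)
  # No block given: collect the combos into an array (the assignment usage).
  unless block
    results = []
    each_ordered_combo(items, limit) { |combo| results << combo }
    return results
  end

  # There are 2**n subsequences; the bits of each integer below pick
  # which elements belong to that combo.
  total = [2**items.length, limit].min
  total.times do |mask|
    combo = []
    items.each_with_index { |item, i| combo << item if mask[i] == 1 }
    block.call(combo)
  end
  nil
end

all = each_ordered_combo([1, 2, 3])       # assignment usage
each_ordered_combo([1, 2]) { |c| p c }    # block usage, prints one combo per line
first_three = each_ordered_combo([1, 2, 3], 3)
```

The block form still does the same exponential amount of work; it just never holds more than one combo at once.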
