So, last time we talked about the parser, and the last paragraph of that post noted that the parser was actually done. I've since cleaned it up and rewritten it in applicative style (though it still has the two caveats I mentioned), and it clocks in at around 40-50 LOC. Not bad for a semi-not-so-simple format. I promised last time we'd try to talk about the parser proper, but I started working on the GMane aggregator and simply _had_ to talk about TagSoup and the utter insanity of parsing unstructured HTML.
It's such a god-forsaken mess that I've decided to forgo even trying to write the GMane aggregator, and to go instead with an option I had planned for a far later version: using an email account and HackMail to manage the thing.
Now, it's not TagSoup's fault that parsing unstructured HTML is so hard. In fact, it's mostly HTML's fault. The format is annoying (angle brackets, divs and spans, all sorts of ugliness), which makes it hard to read, especially when it's unstructured and thus poorly indented. By and large, you can't count on TagSoup's output to be consistent, because its input may as well be line noise. I really liked the hope that TagSoup gave me, but it turned out to be false hope. Good god, it's so horrible.
Now, I told you my plan is to use HackMail. Well, there's a caveat here too. I left that project in a state wherein the parser wasn't parsing anything, since I was trying to migrate to the hsemail package. That package is fully RFC compliant, but strict RFC compliance means that parsing emails from the file system doesn't really work. (That is roughly how HackMail works: one uses getmail to pipe email into HackMail, which dumps it into a file system/maildir/mbox/whatever.) The RFC says that an EOL is "\r\n", for instance, but *nix systems store files with just a "\n", Mac stores them differently, etc. Also, certain nonstandard but common practices have popped up in email, namely quoting names in From fields. E.g. "From: Fredette, Joe <jfredett@domain.tag>" is sometimes written as ‘From: "Fredette, Joe" <jfredett@domain.org>’. None of these things is hard to deal with, but together they require some fairly extensive modification to the parser. Since parsers are what I've been doing lately, I think I'm going to dive in and do an RFC-semi-compliant `hsemail-nonstandard` package (name may be changed later), which will be used with HackMail to solve these issues.
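To make the two problems concrete, here is a minimal sketch of the kind of preprocessing involved. It is not code from hsemail or HackMail; `toCRLF` and `splitFrom` are hypothetical names made up for illustration, and the From handling is deliberately naive (it ignores escaped quotes, comments, and address lists).

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- | Rewrite bare "\n" and bare "\r" line endings as "\r\n", the EOL
-- the RFC expects. Existing "\r\n" pairs are left as they are.
toCRLF :: String -> String
toCRLF [] = []
toCRLF ('\r' : '\n' : rest) = '\r' : '\n' : toCRLF rest
toCRLF ('\r' : rest)        = '\r' : '\n' : toCRLF rest
toCRLF ('\n' : rest)        = '\r' : '\n' : toCRLF rest
toCRLF (c : rest)           = c : toCRLF rest

-- | Split the value of a From field into (display name, address),
-- accepting the display name with or without surrounding quotes.
-- Returns Nothing when there is no <...> part at all.
splitFrom :: String -> Maybe (String, String)
splitFrom s =
  case break (== '<') s of
    (name, '<' : rest) ->
      case break (== '>') rest of
        (addr, '>' : _) -> Just (unquote (trim name), addr)
        _               -> Nothing
    _ -> Nothing
  where
    trim = dropWhileEnd isSpace . dropWhile isSpace
    -- Strip one pair of enclosing double quotes, if present.
    unquote ('"' : xs) | not (null xs), last xs == '"' = init xs
    unquote xs = xs
```

With this sketch, both spellings of the From field above yield the same display name, `Fredette, Joe`, and a file read from disk can be passed through `toCRLF` before a strict parser ever sees it.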
The benefit of using HackMail for all of this is that, with HackMail as a library, we could simply download the actual mail messages directly and use HackMail's built-in filtering mechanisms to sort, and even format, many of the various emails automatically. Since we're parsing emails, which are a saner format than HTML (though not by much), we should be able to parse this stuff with minimal difficulty.
For reference, HackMail provides a Filter monad, in which you are given several simple combinators that you then use to build up filtering systems to sort email, ideally into a simple, flat file system. Why a file system? Because why reinvent the wheel? A file system is an excellent way to store large amounts of data. That's what they're designed for! Most people who really sort their email tend to sort it into a mail client's internal pseudo file system anyway. The only real downside is that it becomes somewhat more difficult to search through one's emails in the usual way. However, this problem is mitigated by the fact that you can filter email in various very powerful ways (you have the entirety of Haskell to help you with that). Beyond that, a simple rethinking of how you search for email reveals this method of storing it to be better for searching, since you can now easily parallelize any search algorithm to simply run over each of the read-only files in the file system at the same time.
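As a rough illustration of that last point, here is a sketch of a search that runs over every file in one flat directory concurrently. It is not part of HackMail; `searchMailDir` is a hypothetical name, the "search algorithm" is just a substring match, and it assumes one message per file. It uses only `base`, `bytestring`, `directory`, and `filepath`.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, try)
import Control.Monad (filterM)
import qualified Data.ByteString.Char8 as B
import System.Directory (doesFileExist, listDirectory)
import System.FilePath ((</>))

-- | Return the files directly inside 'dir' whose contents contain
-- 'needle'. Each file is searched in its own lightweight thread.
searchMailDir :: String -> FilePath -> IO [FilePath]
searchMailDir needle dir = do
  names <- listDirectory dir
  files <- filterM doesFileExist (map (dir </>) names)
  boxes <- mapM spawn files      -- start one search per file
  hits  <- mapM takeMVar boxes   -- wait for every result
  return [f | (f, True) <- zip files hits]
  where
    pat = B.pack needle

    -- Fork a thread that reports its answer through a fresh MVar.
    -- A file that cannot be read simply counts as "no match".
    spawn path = do
      box <- newEmptyMVar
      _ <- forkIO $ do
        result <- try (containsPattern path) :: IO (Either SomeException Bool)
        putMVar box (either (const False) id result)
      return box

    containsPattern path = do
      contents <- B.readFile path
      return $! pat `B.isInfixOf` contents
```

For the threads to actually run on several cores, the program has to be built with `-threaded` and run with `+RTS -N`. Spawning one thread per file is fine for a sketch, but a very large mail store would probably want a bounded pool of workers so that it does not hold too many files open at once.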
HackMail does have some support (in the form of an extensible type class for specifying storage formats) for more than just the flat file-system approach, but those other formats are not recommended, especially in light of HackMail's once and future cousin, Mailhack. Mailhack is (whenever I get to it) going to be the mail client/sendmail bundle to HackMail's getmail[1]/procmail replacement. More about that another day.
In any case, all that said, I hope to get the HackMail part of HWN2 working in short order. ATM HackMail also uses Hint for dynamic recompilation of its config file. I'm thinking I might need to rework that as well. Fortunately there is "Dyre", which looks to be an actual framework for what I want to do. Mostly I need to clean out the bit-rot I've let grow in the old project. I've neglected her too long.