The Lowly Mathematician

September 22, 2009

HWN: Tagsoup, Hackmail, and my now persistent headache.

Filed under: Uncategorized — jfredett @ 1:26 am

So, last time we talked about the parser, and that last paragraph noted that the parser was actually done. I’ve since cleaned it up and rewritten it in applicative style (though it still has the two caveats I mentioned), and it clocks in at around 40-50 LOC. Not bad for a semi-not-so-simple format. I promised last time we’d try to talk about the parser proper, but I started working on the GMane aggregator and simply _had_ to talk about tagsoup and the utter insanity of parsing unstructured HTML.

It’s such a god-forsaken mess, I’ve decided to forgo even trying to write GMane and go with an option I had aimed to implement in a far later version: using an email account and HackMail to manage the thing.

Now, it’s not TagSoup’s fault that parsing unstructured HTML is so hard; in fact, it’s mostly HTML’s fault. The format is annoying (angle brackets, divs and spans, all sorts of ugliness), which makes it hard to read, especially when it’s unstructured and thus poorly indented. By and large, you can’t count on TagSoup to be consistent, because its input may as well be line noise. I really like the false hope that TagSoup gave me, but good god, it’s so horrible.

Now, I told you my plan is to use HackMail; well, there’s a caveat here too. I left that project in a state wherein the parser wasn’t parsing anything, since I was trying to migrate to the hsemail package. This package is fully RFC compliant, but RFC compliance means that trying to parse emails from the file system doesn’t really work. (That is roughly how HackMail works: one uses getmail to pipe email into HackMail, which dumps email into a file system/maildir/mbox/whatever.) The RFC says that an EOL is "\r\n", for instance, but *nix systems store files with just a "\n", Mac stores them differently, etc. Also, certain nonstandard but common practices have popped up in email, namely quoting names in From fields. E.g. "From: Fredette, Joe <jfredett@domain.tag>" is sometimes written as ‘From: "Fredette, Joe" <jfredett@domain.org>’. All these things can be dealt with easily, but require some extensive modification to the parser. Since parsers have been what I’ve been doing lately, I think that I’m going to dive in and do an RFC-semi-compliant `hsemail-nonstandard` package (name may be changed later) which will be used with HackMail to solve these issues.
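The line-ending half of that problem can be handled before a strict parser ever sees the message. Here is a minimal sketch, assuming the message has been read from disk as a plain `String`. The name `toCRLF` is made up for illustration and is not part of hsemail or HackMail.

```haskell
-- Hypothetical helper (not part of hsemail): rewrite every line ending
-- to "\r\n", the EOL the RFC asks for, whatever the file system gave us.
toCRLF :: String -> String
toCRLF []                 = []
toCRLF ('\r' : '\n' : cs) = '\r' : '\n' : toCRLF cs  -- already CRLF, keep it
toCRLF ('\r' : cs)        = '\r' : '\n' : toCRLF cs  -- bare CR
toCRLF ('\n' : cs)        = '\r' : '\n' : toCRLF cs  -- bare LF (*nix files)
toCRLF (c : cs)           = c : toCRLF cs
```

Running this over a message before parsing leaves input that already uses "\r\n" unchanged, so it should be safe to apply unconditionally. It does nothing about the From-field quoting issue, which still needs changes in the parser itself.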

The benefit of using HackMail for all of this is that, using HackMail as a library, we could simply download the actual mail messages directly and use HackMail’s built-in filtering mechanisms to sort out and even format many of the various emails automatically. Since we’re parsing emails, which are a saner format than HTML (though not by much), we should be able to parse this stuff with minimal difficulty.

For reference, HackMail provides a Filter monad, in which you are given several simple combinators which you then use to build up filtering systems to sort email, ideally in a simple, flat file system. Why a file system? Because why reinvent the wheel? A file system is an excellent way to store large amounts of data; that’s what they’re designed for! Most people who really sort their email tend to sort it into a mail client’s internal pseudo file system anyway. The only real downside is that it becomes somewhat more difficult to search through one’s emails in the usual way. However, this problem is mitigated by the fact that you can filter email in various very powerful ways (you have the entirety of Haskell to help you with that). Moreover, a simple rethinking of how you search for email reveals this method of storing email to be better for searching, since you can now easily parallelize any search algorithm to simply run over each of the read-only files in the file system at the same time.
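To make the "search every file at once" idea concrete, here is a rough sketch using only `base`. It does not use HackMail's Filter monad; `matches` and `searchFiles` are hypothetical names, and a plain substring test stands in for whatever search you actually want.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Data.List (isInfixOf)

-- Pure part: does a message body mention the search term?
matches :: String -> String -> Bool
matches term body = term `isInfixOf` body

-- Start one lightweight thread per file, then collect the answers in order
-- and return the paths whose contents contain the term.
-- Sketch only: there is no error handling, so an unreadable file would
-- leave its MVar empty and the collecting thread waiting on it.
searchFiles :: String -> [FilePath] -> IO [FilePath]
searchFiles term paths = do
  slots <- mapM spawn paths
  hits  <- mapM takeMVar slots
  pure [p | (p, True) <- zip paths hits]
  where
    spawn path = do
      slot <- newEmptyMVar
      _ <- forkIO $ do
        body <- readFile path
        putMVar slot $! matches term body  -- force the search in this thread
      pure slot
```

Because the mail files are read-only, the threads share nothing and need no locking. To actually use several cores, the program would have to be built with GHC's `-threaded` flag and run with `+RTS -N`.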

HackMail does have some support (in the form of an extensible type class for specifying storage formats) for more than just the flat file-system approach, but these are not recommended, especially in light of HackMail’s once and future cousin, Mailhack. Mailhack is (whenever I get to it) going to be the Mail client/sendmail bundle to HackMail’s getmail[1]/procmail replacement. More about that another day.

In any case, all that said, I hope to get the HackMail part of HWN2 working in short order. ATM HackMail also uses Hint for dynamic recompilation of its config file. I’m thinking I might need to rework that as well; fortunately there is "Dyre", which looks like an actual framework for what I want to do. Mostly I need to clean out the bit-rot I’ve let grow in the old project. I’ve neglected her too long.

September 16, 2009

HWN: A Simple, Extensible format for stories.

Filed under: Uncategorized — jfredett @ 11:28 am

When I start to design a DSL for storing data, like most people, I look for the least unit of information that "means" something. In the case of HWN, that means a "Story". In this post, I will share my notes on the DSL that will be used to store stories for the HWN. Be warned, I’ve not edited this too much for popular consumption, so it may reference something as being the case without ever really mentioning it to be the case. Just bear with it, I’m pretty sure it all makes sense. :)

A "Story" is defined as a Tag, a Title, some header information informed by the tag, (things like url of the story location, author’s name, etc), and a summary in some format (format being stored as another header tag). To wit:

Tag: Title {
    $url = http://www.wherever.com/
    $author = Joe Schmoe
    $format = markdown

    Lorem Ipsum Summarium ovus Textus ...
}

or, a more practical example:

Blog: HWN: A Simple, Extensible format for stories {
    $url = http://www.lowlymath.net/?p=46
    $author = Joe Fredette
    $blog_title = The Lowly Mathematician
    $format = markdown

    {$author} of {$blog_title} {$url: wrote} about the format of his new DSL for HWN.
    It was a _great_ post.
}

In the first example, we have an arbitrary story followed by some summarizing text. In the second (more interesting) example, we have the same setup, but also some extra formatting options piled on top of markdown in the summary, namely {$hdr: as-text}, where the value of $hdr is placed as-is unless as-text is given. So, for instance, in the {$author} case, "Joe Fredette" is actually placed. However, in the {$url: wrote} case, a link with the text "wrote", pointing at whatever $url contains, is placed. So {$author: foo} would create a link to "Joe Fredette" (which is obviously invalid). This allows for a very simple transformation before parsing the summary as markdown or whatever $format is.

The benefits of this format are manifold. For one, it’s easy to add new stories to a .hwn file; it boils down to just adding another atomic unit to the list. Similarly, it’s easy to add new headers to any given story, since each tag should only use certain headers and ignore the others. So, for instance, during the formatting phase, when each story is turned into, say, a `wiki` format, each tag will require certain headers, but it will ignore the others. The phase before that will replace all of the {}’s in the summary with the appropriate html/whatever, so that there is no need to keep the unused headers around. All told, the parsing->printing process looks like:

  1. read file, parse
  2. transform summary {}-tags (brace-tags) to internal representation as part of the summary.
  3. parse summary format, so that summary is an internal representation (possibly pandoc-related stuff for both steps 2 and 3).
  4. determine output format, pretty print to this format.
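Step 2 can be sketched as a plain text-to-text pass. This is only an illustration: it assumes headers are kept in an association list and that links are rendered as markdown, and it skips the internal representation the real pipeline would use. `Headers` and `expandTags` are hypothetical names.

```haskell
import Data.Char (isSpace)

-- Headers as a simple association list, e.g. [("author", "Joe Fredette")].
type Headers = [(String, String)]

-- Expand every brace-tag in a summary.
--   {$hdr}        becomes the header's value
--   {$hdr: text}  becomes a markdown link, [text](value)
-- Unknown headers and unterminated tags are left untouched.
expandTags :: Headers -> String -> String
expandTags _ [] = []
expandTags hs s@('{' : '$' : rest) =
  case break (== '}') rest of
    (tag, '}' : after) -> render tag ++ expandTags hs after
    _                  -> s  -- no closing brace: leave the rest as it is
  where
    render tag =
      let (name, label) = break (== ':') tag
          original      = "{$" ++ tag ++ "}"
      in case lookup name hs of
           Nothing  -> original
           Just val -> case label of
             ':' : txt -> "[" ++ trim txt ++ "](" ++ val ++ ")"
             _         -> val
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse
expandTags hs (c : cs) = c : expandTags hs cs
```

With the headers from the practical example, `{$author}` expands to the author's name and `{$url: wrote}` to a markdown link whose text is "wrote". Leaving unknown tags alone, rather than failing, matches the fault-tolerant spirit of the format.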

That roughly covers it. It’s a little bit complicated, since we’re allowing for some complexity within the summary, but the overall format is simple, which is another nice selling point. Furthermore, it’s pretty easy to write, since indentation won’t matter, the only important whitespace is EOLs after the header and the double EOL between the headers and the summary.
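Since only those line breaks matter, a single story can be picked apart line by line. The following is a rough sketch and not the actual HWN parser (which is written in applicative style). The `Story` record and all function names are assumptions, and it handles exactly one story whose closing "}" sits on a line of its own.

```haskell
import Control.Monad (guard)
import Data.Char (isSpace)
import Data.List (dropWhileEnd, intercalate)

-- Hypothetical representation of one story.
data Story = Story
  { storyTag     :: String
  , storyTitle   :: String
  , storyHeaders :: [(String, String)]
  , storySummary :: String
  } deriving (Show, Eq)

trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

blank :: String -> Bool
blank = all isSpace

-- "$name = value": split on the first '=', so values may contain '='.
parseHeader :: String -> Maybe (String, String)
parseHeader line =
  case trim line of
    '$' : rest
      | (name, '=' : val) <- break (== '=') rest -> Just (trim name, trim val)
    _ -> Nothing

-- Parse "Tag: Title {", header lines, a blank line, the summary, then "}".
parseStory :: String -> Maybe Story
parseStory src = do
  first : rest      <- Just (lines src)
  (tag, ':' : more) <- Just (break (== ':') first)  -- tag ends at the first ':'
  '{' : revTitle    <- Just (reverse (trim more))   -- first line must end in '{'
  let body = dropWhileEnd blank rest                -- ignore trailing blank lines
  guard (not (null body) && trim (last body) == "}")
  let (hdrLines, afterHdrs) = span (not . blank) (init body)
  headers <- mapM parseHeader hdrLines
  let summary = intercalate "\n" (map trim (dropWhile blank afterHdrs))
  pure (Story (trim tag) (trim (reverse revTitle)) headers summary)
```

Indentation is thrown away on purpose, following the claim above that it does not matter. A summary format that cares about leading whitespace would need the raw lines kept instead.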

So, to recap, we have the following nice features for the DSL:

  • Easy to extend to add new tags
  • Easy to extend to add new headers in existing/new tags. They will show up as X_headers "header_name" "field" until the parser is appropriately extended.
  • Easy to add new stories to the issue, just append to the file.
  • (Relatively) easy to parse.
  • Fault tolerant (adding a header that isn’t used in the summary or in the formatting of a story doesn’t break things. Adding a story tag that isn’t recognized simply doesn’t render that story, and could throw a warning/error if desired).
  • Composable, two .hwn files cat’d together are still a valid hwn file.

The downsides are:

  • You actually have to write a parser for it, it’s not something you can just tweak a little and get a haskell datatype.
  • You can’t add multiple stories at once (that is, it’s not threadsafe, since we’re appending information to the file. This could be addressed in the future, but it shouldn’t be a big problem for now since only one person will be adding to the file at a time).

There are likely to be more points where this methodology fails or is not optimal, but I think this is a good start. Next time I’ll talk about some TH code I use to generate headers and some parsing stuff for them; hopefully the parser will be done (or mostly done). Also, I’ve set up a repo on patch-tag for this stuff, so feel free to poke around and send patches if you like. If there is a lot of interest, I’ll try to set up a Trac somewhere.

Currently, the parser is done with two caveats:

  1. You can’t end a summary with a substitution, it causes an error due to the double "}".
  2. You can’t do the {$foo} substitution just yet.

September 15, 2009

HWN, we can rebuild it. We have the technology.

Filed under: Uncategorized — jfredett @ 1:41 am

So, as many of you may know, I recently took over as the editor of the Haskell Weekly News. The HWN is a pretty popular newsletter in the Haskell community; it aggregates posts from Planet Haskell and the myriad mailing lists associated with mainstream Haskell stuff. One of the benefits (and burdens, in many cases) of the job is the editing of all this data. With so much content floating around, we have to organize and present data in a meaningful, clean, precise way, presenting enough so as not to be boring or irrelevant, but not aggregating so much that we make mountains of data which no one really cares about.

This newsman gig is a tricky business.

Luckily, as I mentioned, there are tools for this sort of thing. Now, I don’t begrudge my tools; they are reliable, if quirky, and they get things done. However, they are also fickle little beasts; they bleat at the slightest hint of misuse. They are on the older side, the oldest being `publish.hs`, which I’m fairly sure was created along with the HWN proper. The gmane and planethaskell scripts, as well as the quote scripts, were all written later. Much work has obviously gone into these tools and I don’t begrudge the authors any of the credit, but it’s time for a change.

Among the issues with the tools is their dependence on Haskell for syntax. "What?" you shout, "but Joe, I thought you loved the EDSLs. Said they were the greatest thing since Monadic IO." The short answer is I do, but this EDSL is not so great. The long answer is that I think this particular task is better suited to its own DSL. To this end, I’ve come up with the following things I want to fix.

Firstly, the current system is a mess of individual files, which must be merged (by hand) into a single file. What would be lovely is a file to which we simply append to add content, rather than having to generate new files each time I want to add more content.

Next, the current "main" file, `content.wiki`, is more or less a series of constructors for a Haskell Record. It stores stories as elements of a list written as a simple Haskell datatype. While this has the benefit of being easy to parse (it’s just a couple of substitutions with regexes), it does have the downside of being poorly extensible, touchy, and generally hard to use with large amounts of data.

Finally, the current method of selecting stories has the same problem: it’s simple, and it works. The problem is we’re at a local optimum with all of these things. The former editors didn’t have time to really work on this stuff, which is understandable; they have jobs. Jobs are more important than community newsletters. However, I do not have a job, I go to school. Therefore, I have time (other people call it "Class time" I’m still not sure why… and there always seems to be someone talking, very distracting when you’re trying to write code or do math) to make this stuff awesome.

For the first iteration, my goal is to get a parser running and port the current tools: `gmane`, `planethaskell`, and `quotes` to dump to that format. After that, I intend to either rewrite the `publish` script to effectively be a prettyprinter for the AST of my DSL (I love TLAs), or to just write a conversion function from my AST to the current record type. It all kind of depends on whether I really want to get v1.1.0 out or what. The rough plan is as follows (this is right out of my notes).

  • For v1.1

It should work, be able to do everything the current tools can do, but use the new `.hwn` format.

It should be Cabalized, Haddockized, and have some basic correctness testing in place (samples for the parser, etc.)

  • For v1.3

New versions of the `gmane`, `planethaskell`, and `quotes` tools which are built with a cleaner interface, more features, etc. Details TBD

More testing, better documentation (Nice READMEs, HACKING guides, etc.)

  • For v1.5

Clean, consistent command line interface. Nothing fancy, just aggregate the interface of all the tools into one, more general ncurses-esque tool.

Generalize the `gmane`, `planethaskell` tools to work with any gmane list, and any RSS (Atom and others will come later) feed (in the hopes of eventually making this a generally useful tool, rather than a haskell-specific one).

  • For v1.7

An ability to specify new output formats/layouts/etc, effectively some kind of stylesheet.

An ability to specify different formats for the summary fields in a given tag.

  • For v1.9

Uses Hackmail (if I ever finish it) as a mailing list aggregator, so that any ML (not just those indexed by gmane) can be used, assuming an email account can be furnished. This will take over for the `gmane` part of the project.

Can deal with RSS, Atom, similar feeds in the aggregator portion (formerly the `planethaskell` part of the project).

  • For v2.0

Good documentation, good test coverage, and final touches to the CLI. Then let the public at it.

  • For post v2.0

A GUI

The ability to aggregate reddit (which technically comes for free with the RSS/Atom stuff, but perhaps) using the proper API to allow in-HWN voting, etc.

More automation. Announces should be part of the HWN by default, and only culled in the event of an error. (This is because I’ve never seen a package announcement not included in the HWN). Particularly lively discussions could be tagged somehow (eg, if a thread goes past <threshold> replies, stick it in the "Discussion" section.) etc.

Some of this stuff won’t make any sense, but as I continue to write about the progress and future plans of the HWN software and newsletter proper, it will start to make more sense. Next post will have some stuff about the DSL format I’m using, why I think it will work better, why it looks the way it does, etc. Hopefully you’ve all had fun listening to me ramble. Stay tuned, it’s going to be a fun time.

August 11, 2009

Dwarf Fortress: A Review

Filed under: Code, FP, Functional, Haskell, Imperative, Random — jfredett @ 2:59 am

I’ve been playing Dwarf Fortress lately; it’s a very odd game. You can find it here. Or, if you prefer (as I do), the wiki. It’s a bit roguelike, a bit dungeonkeepery, a bit of everything, and quite immersive. Read on for my review and critique.

(more…)

April 7, 2009

Card games, scorekeeping, and … Associated Datatypes?

Filed under: FP, Functional, Haskell, Polymorphism, Type Classes, Type Theory — jfredett @ 10:34 pm

(more…)

March 28, 2009

Basic Math: Algebra 102

Filed under: Math — jfredett @ 4:18 am

Last time we got all the way to very basic quadratic equations. I didn’t tell you they were quadratics, we just did them. This week, we’ll talk about quadratics in much more detail, describing three different methods to solve them, and why you might want to use any given method at any given time.

(more…)

September 8, 2008

Grad School Tags

Filed under: Uncategorized — jfredett @ 12:39 am

You can see all the various sites I’ve bookmarked about Gradschool @:

http://feeds.delicious.com/v2/rss/jfredett/gradschool

 

It’s an RSS feed, but also the only sane url-link to the specific tag. It’s just my del.icio.us space under the tag “gradschool”

Down the Graduate School Rabbit Hole

Filed under: Uncategorized — jfredett @ 12:31 am

So, I know my posts are quite sparse to begin with, and sporadic at that. But, sadly, I’m here to tell you, my faithful and equally sporadic readers, that problem will be getting worse.

Here’s my situation: I transferred last year from one school (WPI) to another, cheaper school (WSC)[1]. In doing so, two things happened.

  1. I had to redo a fair deal of work. Many credits didn’t transfer, weren’t applicable, or became electives due to requirement differences
  2. I suffered a bit of a blow to my now fragile GPA, I’ve had 10 classes so far, and due to the rapid change of material and teaching style, as well as the new stresses and frustrations of repeating much of what I already knew, or taking classes I had no interest in, I did not do very well in most of them. My GPA is a pitiful 2.6… (I got a fuckload of C’s.)

Now, fortunately, I have good grades in my Core courses, most of the outliers (Philosophy excepted, I am stunningly good at that, it seems) are the ones that caused the brunt of the average-destruction. My hope is to pull these grades up with all my vigor. I have calculated that, in the 10-12 classes I have left, I need to get consistent B’s and B+’s to recover to a “safe” 3.2 or so (what I had when I left WPI). The various websites I’ve read say that this is a good average, and combined with me doing un-fucking-believable on the GRE’s, I should be able to get into a decent grad school.

I’ll try to keep Lowlymath.net[2] updated with my adventures in grad school application, and hopefully it will be a resource to anyone else hunting for graduate education.

No little thing like a bad GPA will keep me down, if necessary, I’ll take every math class WSC, WPI, and the rest of the consortium[3] offers to pull up my grade! Someday you’ll all be able to say “Whatsup Doc?” to me, and I will be quite pleased about it!

/Joe

 

 

[1] That’s Worcester Polytechnic Institute and Worcester State College, respectively.

[2] I’m actually posting to both blogs at once right now, so the Lowlymath Readers can just mentally replace “Lowlymath.net” with “this blog”

[3] The Consortium is a group of 6 schools which allow for “easy” crossregistration of classes and (supposedly) “easy” transfers between institutions. Though the latter is somewhat of a misrepresentation, given my experience.

July 30, 2008

Evolve — The History Channel may have finally done something right.

Filed under: Uncategorized — jfredett @ 12:23 pm

I just watched this show, on THC. I actually liked it, the title of the show was “Evolve” and it has to be the first show I’ve seen on THC that didn’t bugger up the science much at all. It was unabashedly pro-evolution, one of the biologists on the show (whose name escapes me) called out the ID frakwittery without any qualms. It was fantastic.

Everyone should watch this show. Hold off on buying episodes till they air a few more, but I guarantee, if this is going to be the road the show takes, I’ll be buying a full season.

Fan-frigging-tastic.

 

On another note, hopefully I’ll start being able to do some more blogging, I’ve been working on some videos, I’ve put a few up on the tubes. In any case, check it out, subscribe if you like, hopefully I’ll be internet-famous someday, like thunderfoot or edwardcurrent. :)

Oh, one more tube related thing. In case you people feel that I hate all religious people, you should check out DonExodus2 on the tubes, he’s a PhD biologist, anticreationist christian.

If all Christians were like DonExodus, I would be out of a job.

June 2, 2008

Basic Math: Algebra 101

Filed under: Math — jfredett @ 12:53 am

In high schools across the country, kids learn algebra. I know quite a few parents who have some big problems with algebra. For some, it’s just because it’s been a while since they’ve needed to use it; for others, it’s always been a hard thing, but it doesn’t have to be. In this post, I’ll take you through some of the basic concepts of algebra as if you had never seen it before. It’ll be a bit of a whirlwind review of the idea, but (hopefully) written well enough that someone who has never fully understood algebra can follow along. Oh, yes, by the way, there will be word problems. :)

(more…)
