Day 212: More about counting things with fancy math.

January 01, 2014

Hacker School: day 212.

(My batch ended on August 22, 2013, but as they say, never graduate.)

In a previous post I talked about how the algebraic structure of a statistic you’re aggregating can give you hints about how to distribute it across a cluster. Such is the premise of Twitter’s neat little library Algebird, which I have continued to poke around with in my free time.

(NOTE: this post will contain a good amount of math, but you can peek at the definitions here to help you along if you like.)

I’m still kind of a noob, and a lot of this has been unclear to me:

Which structures have which guarantees?
What are the big wins for underpinning everything with abstract algebra? i.e., what’s wrong with the old way of programming these things?
It is not always clear what algebraic structure should underpin some computations. For example, matrix rings could also probably be considered matrix semirings, because we don’t really ever use the additive inverse of a matrix — so which should it be?

The main advantage of this mathematically-informed programming model, at least as I judge it, is this.

A lot of count aggregation problems — more than you’d think — end up being monoids. Especially linear algebra stuff, like finding the person who has the most “second-followers”, i.e., people who are followers of your followers. Writing these aggregation problems in terms of map reduce jobs or Storm topologies is intensely unpleasant, and really hides the underlying algebraic structure of the computation, which can be concisely described using something like Algebird.

Thus, if you write your code using Algebird, it can be transformed cleanly into map reduce jobs or Storm topologies, which is a nice thing to get for free.

There are some nice code examples of in this interesting talk by Oscar Boykin.

This talk is actually doubly interesting because Boykin spends a fair amount of time contrasting the CS applications of category-theory-like concepts (e.g., monads, monoids, rings, etc.) to their mathematical counterpoints.

His main point seems to be that these mathematical abstractions are provided in a library like Algebird because they make for clean code, rather than because they can be used to make bulletproof theorems about your code. So, while you can leverage the intuition that something is a monad, it’s usually a different matter to prove that our instance of monad is a monad a la category theory, or that something is really bijective (which would be hard to do in the presence of, e.g. a third-party library that might throw exceptions everywhere).

Although at this point it seems clear that most aggregation problems happen to propagate via monoids, it’s still not clear to me what is a monad, or what other than matrices and polynomials (which don’t seem useful) have rings.

Computer "scientist"

Day 212: More about counting things with fancy math.

Hacker School: day 212.