Day 207: How to count things.

December 27, 2013

Hacker School: day 207.

(My batch ended on August 22, 2013, but as they say, never graduate.)

I’ve been playing with Brushfire, which is a machine learning library that distributes the learning of decision trees across a cluster.

Most of ML is basically aggregating counts of stuff, and Brushfire is no different. What’s sort of interesting is the way Brushfire does this — it uses a fascinating little library called Algebird.

The idea behind Algebird is this.

Sometimes you need to count so many things that one machine can’t process the counts quickly enough.
The problem of distributing an aggregation problem across a cluster of computers can be made a lot easier if you know about the algebraic structure of the aggregation problem.
Which means that if you phrase problems using Algebird’s data structures, it is almost trivial to turn them into MapReduce jobs or Storm topologies (or something else).

When I say algebraic structure here, I mean something slightly different than what you learned about in grade school. I’m talking about general, abstract mathematical properties of the aggregation problem. For example, consider that in normal addition, it doesn’t matter in which order you add some numbers. If I want to aggregate this sum:

\[ 1 + 2 + 3 + 4 + 5 + 6 \]

I can decide to add them up in basically whatever order I want. Like starting from the left:

\[ ((((1 + 2) + 3) + 4) + 5) + 6 \]

Or from the right:

\[ 1 + (2 + (3 + (4 + (5 + 6)))) \]

Or in a different order entirely:

\[ 6 + 4 + 1 + 2 + 3 + 5 \]

In the parlance of this “abstract” algebra, addition on the natural numbers is commutative — i.e., that it does not matter in which order the numbers are added — and associative — i.e., you can put the parentheses anywhere you want, meaning that any pair can be added first.

Practically speaking, the use of knowing that addition has these algebraic properties means that if you make a cluster of computers add up these numbers, it doesn’t matter which computer adds up which numbers, as long as they all get added eventually.

Another example is finding the largest element in a stream — it doesn’t matter which computer processes which numbers, as long as they all report the largest number they’ve seen, and as long as you process all the numbers, you can continually aggregate until you have one number, which is always your largest number.

In the terminology of algebra, the natural numbers are a monoid under addition and the “largest number” operation.

Monoids are just one type of algebraic data structure that Algebird provides interfaces for. There are many more. Each of these has different guarantees for how they can be distributed across a cluster, meaning that if you can phrase your aggregation problem in terms of these, it is much easier to distribute your problem across a cluster.

Here is a good summary of what other sorts of problems can be approached using these strucutres, but a general guideline seems to be that they are particularly good for problems that look kind of like counting. Frequency estimation, top n frequent items, SGD, etc.

Computer "scientist"

Day 207: How to count things.

Hacker School: day 207.