If you have the means

Sunday, January 24th, 2010

Sometimes you want the maximum of a set of numbers, but you don’t want to deal with the sharp corners of the true max function (it’s continuous, but the kinks where its derivative jumps kill estimation algorithms), so you go with a soft maximum: a smooth function, no corners, that approximates the true max.

John Cook recommends one based on natural logs:

softmax(x1, x2, …, xn) = log( exp(x1) + exp(x2) + … + exp(xn) )

I had something else in mind — from a totally different context — but no alarm bells went off until I read his example:

Suppose you have three numbers: –2, 3, 8. Obviously the maximum is 8. The soft maximum is 8.007.

Wait, his soft maximum is greater than the actual maximum? I had just assumed we’d want a function that gave us a value slightly less than the true maximum.
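(And it always will be greater: every exp term is positive, so the sum is strictly bigger than exp of the true max, and the log of the sum is strictly bigger than the max itself.)

Here’s a minimal Python sketch if you want to check; the function name and the subtract-the-max trick are my additions, not anything from Cook’s post:

    import math

    def softmax_lse(xs):
        # log(sum(exp(x))) computed stably: subtract the true max before
        # exponentiating so exp can't overflow, then add it back outside
        # the log; algebraically a no-op.
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    print(softmax_lse([-2, 3, 8]))  # 8.00676..., which rounds to Cook's 8.007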

Of course, I had also assumed all non-negative numbers, which is why I had something like this in mind:

softmax(x1, x2, …, xn) = sqrt( (x1² + x2² + … + xn²)/n )

At the time, I was daydreaming about something between an arithmetic mean and a max function, and squaring the values before averaging them and then un-squaring that mean seemed simple enough.
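In code, that square-average-unsquare idea is just a root mean square (Python again; the function name is mine):

    import math

    def soft_max_rms(xs):
        # Square, average, un-square: the quadratic mean.
        return math.sqrt(sum(x * x for x in xs) / len(xs))

    print(soft_max_rms([2, 3, 8]))  # ≈ 5.07, comfortably below the true max of 8

Note the non-negative inputs: squaring a −2 would quietly turn it into a 2.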

It’s generalizable, of course. Raising each x to the third power, averaging, and then taking the cube root of that mean gets us closer to the max (still not very close, though) and works for negative numbers, since odd powers preserve sign. As we move from the third, to fifth, to seventh, to ninth power, we move closer and closer to the max of Cook’s example: 5.61, then 6.43, then 6.84, then 7.08.
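A sketch of that progression (the helper name is mine; the sign juggling is needed because Python’s ** returns a complex number for a negative base and a fractional exponent):

    import math

    def odd_power_mean(xs, p):
        # Mean of p-th powers, then the p-th root. Odd p preserves the
        # sign of negative inputs, so this works for mixed-sign data.
        s = sum(x ** p for x in xs) / len(xs)
        # Take the root of the magnitude, then restore the sign.
        return math.copysign(abs(s) ** (1 / p), s)

    for p in (3, 5, 7, 9):
        print(p, round(odd_power_mean([-2, 3, 8], p), 2))
    # 3 5.61
    # 5 6.43
    # 7 6.84
    # 9 7.08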

I didn’t realize it at the time, but this is what real mathematicians call a power mean or generalized mean. Now, I did recognize that it asymptotically approached the maximum function as the power went to infinity — and the minimum function as the power went to negative infinity — but I did not realize that it produced the harmonic mean at a power of –1 and the geometric mean as the power approached 0. (Naturally it produces the arithmetic mean at a power of 1, but that’s not especially interesting.)
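To put the whole family in one place, here’s a sketch of the generalized mean for positive inputs (the p = 0 case uses its standard limit form, the exponential of the mean of the logs, which is exactly the geometric mean):

    import math

    def generalized_mean(xs, p):
        # Power mean M_p for positive inputs: harmonic at p = -1,
        # geometric as p -> 0, arithmetic at p = 1, quadratic at p = 2;
        # M_p -> max as p -> +infinity and min as p -> -infinity.
        if p == 0:
            return math.exp(sum(math.log(x) for x in xs) / len(xs))
        return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

    for p in (-1, 0, 1, 2):
        print(p, round(generalized_mean([2, 3, 8], p), 2))
    # -1 3.13
    # 0 3.63
    # 1 4.33
    # 2 5.07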
