Erik McClure: Mathematical Notation Is Awful

July 30, 2016

Today, a friend asked me for help figuring out how to calculate the standard deviation over a discrete probability distribution. I pulled up my notes from college and was able to correctly calculate the standard deviation they had been unable to derive after hours upon hours of searching the internet and trying to piece together poor explanations from questionable sources. The crux of the problem was, as I had suspected, the astonishingly bad notation involved with this particular calculation. You see, the expected value of a given distribution $$X$$ is expressed as $$E[X]$$, which is calculated using the following formula:
\[ E[X] = \sum_{i=1}^{\infty} x_i p(x_i) \]
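For a concrete reading of that sum, here is a minimal Python sketch. It assumes a finite distribution stored as a dict mapping each value $$x_i$$ to its probability $$p(x_i)$$; the dict layout and the name `expected_value` are illustrative choices, not anything standard.

```python
from fractions import Fraction

def expected_value(pmf):
    """Sum x_i * p(x_i) over every outcome of a finite pmf (dict: value -> probability)."""
    return sum(x * p for x, p in pmf.items())

# Hypothetical example: a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(die))  # 7/2
```

Using `Fraction` keeps the arithmetic exact, so the result is not blurred by floating-point rounding.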
The standard deviation is the square root of the variance, and the variance is given in terms of the expected value.
\[ Var(X) = E[X^2] - (E[X])^2 \]
Except that $$E[X^2]$$ is of course completely different from $$(E[X])^2$$, but it gets worse, because $$E[X^2]$$ makes no notational sense whatsoever. With any other function in math, doing $$f(x^2)$$ means going through and substituting $$x^2$$ for $$x$$. In this case, however, $$E[X]$$ actually doesn't have anything to do with the resulting equation, because $$X \neq x_i$$, and as a result, the equation for $$E[X^2]$$ is this:
\[ E[X^2] = \sum_i x_i^2 p(x_i) \]
Only the first $$x_i$$ is squared. $$p(x_i)$$ isn't, because it doesn't make any sense in the first place. It should really be just $$P_{Xi}$$ or something, because it's a discrete value, not a function! It would also explain why the $$x_i$$ inside $$p()$$ isn't squared - because it doesn't even exist, it's just a gross abuse of notation. This situation is so bloody confusing I even explicitly laid out the equation for $$E[X^2]$$ in my own notes, presumably to prevent me from trying to figure out what the hell was going on in the middle of my final.
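One way to make this less mysterious is to treat $$E$$ as taking a function that gets applied to each value before weighting, so $$E[X^2]$$ reads as "square each $$x_i$$, leave the probabilities alone". A rough Python sketch of that reading, assuming a finite distribution stored as a value-to-probability dict (the names `expect` and `variance` and the example numbers are made up for illustration):

```python
import math

def expect(pmf, g=lambda x: x):
    """E[g(X)]: apply g to each value, then weight by that value's probability."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2."""
    return expect(pmf, lambda x: x ** 2) - expect(pmf) ** 2

# Hypothetical non-uniform distribution; probabilities sum to 1.
pmf = {0: 0.5, 1: 0.25, 2: 0.25}

mean = expect(pmf)                               # E[X]
mean_of_square = expect(pmf, lambda x: x ** 2)   # E[X^2]: only the values are squared
square_of_mean = mean ** 2                       # (E[X])^2: a different number
std = math.sqrt(variance(pmf))                   # standard deviation
```

Written this way, the probabilities are never touched by `g`, which is exactly why $$p(x_i)$$ is not squared.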

That, however, was only the beginning. Another question required them to find the covariance between two separate discrete distributions, $$X$$ and $$Y$$. I have never actually done covariance, so my notes were of no help here, and I was forced to return to Wikipedia, which gives this helpful equation.
\[ cov(X,Y) = E[XY] - E[X]E[Y] \]
Oh shit. I've already established that $$E[X^2]$$ is impossible to determine because the notation doesn't rely on any obvious rules, which means that $$E[XY]$$ could evaluate to god knows what. Luckily, Wikipedia has an alternative calculation method:
\[ cov(X,Y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - E(X))(y_i - E(Y)) \]
This almost works, except for two problems. One, $$\frac{1}{n}$$ doesn't actually work because we have a nonuniform discrete probability distribution, so we have to multiply each term by the probability mass function $$p(x_i,y_i)$$ instead. Two, Wikipedia refers to $$E(X)$$ and $$E(Y)$$ as the means, not the expected values. This gets even more confusing because, at the beginning of the Wikipedia article, it used brackets ($$E[X]$$), and now it's using parentheses ($$E(X)$$). Is that the same value? Is it something completely different? Calling it the mean would be confusing because the average of a given data set isn't necessarily the same as finding what the average expected value of a probability distribution is, which is why we call it the expected value. But naturally, I quickly discovered that yes, the mean and the average and the expected value are all exactly the same thing! Also, I still don't know why Wikipedia suddenly switched to $$E(X)$$ instead of $$E[X]$$ because it still means the exact same goddamn thing.
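Spelled out, swapping $$\frac{1}{n}$$ for the joint probability gives $$cov(X,Y) = \sum_i (x_i - E[X])(y_i - E[Y])\,p(x_i,y_i)$$, where the sum runs over every pair of values. Here is a small Python sketch of that, assuming the joint distribution is given as a dict from `(x, y)` pairs to probabilities (a layout and set of numbers chosen purely for illustration). It also computes $$E[XY] - E[X]E[Y]$$, reading $$E[XY]$$ as the sum of $$x_i y_i\,p(x_i,y_i)$$, so the two forms can be compared on this example.

```python
import math

def marginal_means(joint):
    """E[X] and E[Y] from a finite joint pmf given as {(x, y): probability}."""
    ex = sum(x * p for (x, y), p in joint.items())
    ey = sum(y * p for (x, y), p in joint.items())
    return ex, ey

def covariance(joint):
    """Probability-weighted sum of (x - E[X]) * (y - E[Y])."""
    ex, ey = marginal_means(joint)
    return sum((x - ex) * (y - ey) * p for (x, y), p in joint.items())

def covariance_shortcut(joint):
    """Same quantity via E[XY] - E[X]E[Y]."""
    ex, ey = marginal_means(joint)
    exy = sum(x * y * p for (x, y), p in joint.items())
    return exy - ex * ey

# Hypothetical joint distribution; probabilities sum to 1.
joint = {(0, 0): 0.5, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.25}

direct = covariance(joint)
shortcut = covariance_shortcut(joint)
```

On this example both functions return the same value, which is at least some reassurance that the two Wikipedia formulas describe one quantity.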

We're up to, what, five different ways of saying the same thing? At least, I'm assuming it's the same thing, but there could be some incredibly subtle distinction between the two that nobody ever explains anywhere except in some academic paper locked up behind a paywall that was published 30 years ago, because apparently mathematicians are okay with this.

Even then, this is just one instance where the ambiguity and redundancy in our mathematical notation has caused enormous confusion. I find it particularly telling that the most difficult part about figuring out any mathematical equation for me is usually to simply figure out what all the goddamn notation even means, because usually most of it isn't explained at all. Do you know how many ways we have of taking the derivative of something?

$$f'(x)$$ is the same as $$\frac{dy}{dx}$$ or $$\frac{df}{dx}$$ or even $$\frac{d}{dx}f(x)$$, which is the same as $$\dot x$$, which is the same as $$Df$$, which is technically the same as $$D_xf(x)$$ and also $$D_xy$$, which is also the same as $$f_x(x)$$ provided x is the only variable, because taking the partial derivative of a function with only one variable is exactly the same as taking the derivative in the first place, and I've actually seen math papers abuse this fact instead of using some other sane notation for the derivative. And that's just for the derivative!

Don't even get me started on multiplication, where we use $$2 \times 2$$ in elementary school, $$*$$ on computers, but use $$\cdot$$ or simply stick two things next to each other in traditional mathematics. Not only is using $$\times$$ confusing as a multiplicative operator when you have $$x$$ floating around, but it's a real operator! It means cross product in vector analysis. Of course, the $$\cdot$$ also doubles as meaning the Dot Product, which is at least nominally acceptable since a dot product does reduce to a simple multiplication of scalar values. The Outer Product is generally given as $$\otimes$$, unless you're in Geometric Algebra, in which case it's given by $$\wedge$$, which of course means AND in binary logic. Geometric Algebra then re-uses the cross product symbol $$\times$$ to instead mean commutator product, and also defines the regressive product as the dual of the outer product, which uses $$\nabla$$. This conflicts with the gradient operator in multivariable calculus, which uses the exact same symbol in a totally different context, and just for fun it also defined $$*$$ as the "scalar" product, just to make sure every possible operator has been violently hijacked to mean something completely unexpected.

This is just one area of mathematics - it is common for many different subfields of math to redefine operators into their own meaning and god forbid any of these fields actually come into contact with each other because then no one knows what the hell is going on. Math is a language that is about as consistent as English, and that's on a good day.

I am sick and tired of people complaining that nobody likes math when they refuse to admit that mathematical notation sucks, and is a major roadblock for many students. It is useful only for the advanced mathematics that takes place in university graduate programs and research laboratories. It's hard enough to teach people calculus, let alone expose them to something useful like statistical analysis or matrix algebra that is relevant in our modern world, when the notation looks like Greek and makes about as much sense as the English pronunciation rules. We simply cannot introduce people to advanced math by writing a bunch of incoherent equations on a whiteboard. We need to find a way to separate the underlying mathematical concepts from the arcane scribbles we force students to deal with.

Personally, I understand most of higher math by reformulating it in terms of lambda calculus and type theory, because they map to real world programs I can write and investigate and explore. Interpreting mathematical concepts in terms of computer programs is just one way to make math more tangible. There must be other ways we can explain math without having to explain the extraordinarily dense, outdated notation that we use.

11 comments:

Anonymouz - Jul 31, 2016, 5:27:00 AM
I end up understanding math by re-writing it in code. It seems to be the only way that I can finally understand how it's supposed to work. The only issue is that of course it fails with numbers like infinity; however, I still do my best to plod along with some kind of actual implementable logic.

John Connors - Jul 31, 2016, 6:07:00 AM
Mathematicians should give up on algebraic notation as a mess; instead go with prefix s-expressions the way our new AI overlords will eventually (sensibly) insist upon. :-) I do wonder if we are making a mistake as fundamental as the Roman-conditioned Europe ignoring Arabic numerals.

Eustáquio Rangel - Jul 31, 2016, 7:46:00 AM
This is a very interesting recent article about this: https://aeon.co/videos/maths-notation-is-needlessly-complex-it-can-and-should-be-better

Unknown - Jul 31, 2016, 7:19:00 PM
Well, the good thing is that people define their notation in papers, due to overlap. The different styles in some sense prepare you for, eh, different styles. Finance, on the other hand, is witchcraft - all letters have been defined to mean exactly the same thing all the time, and you're just supposed to know what they mean. Then they invented a bunch of more letters, that sound Greek, just for fun, but then, since it is hard to typeset new letters, they use existing Greek letters... yeah, f*ck finance..

pftq - Aug 29, 2016, 5:32:00 AM
Funny story on my end is I learned more calculus from programming games than I did in class. I literally averaged Ds and Fs in my first semester before the teacher stopped requiring "proper" work, and then it was As from there because my answer was never wrong, just my notation/proofs.

Carlos Manzanedo - Nov 25, 2016, 11:20:00 AM
Absolutely with you on this. Lately I find myself thinking how cool it would be if scientific papers (especially in discrete maths) looked more like Jupyter notebooks. This is not necessarily the best example, but I was quite excited by how the LIGO observations of gravitational waves were explained in this format: https://losc.ligo.org/s/events/GW150914/GW150914_tutorial.html

Unknown - Nov 25, 2016, 3:57:00 PM
Looks like your course notes are bad. That expression for expected value doesn't make sense; a quick Google search shows a Wikipedia article with the correct definition: https://en.wikipedia.org/wiki/Expected_value#Univariate_discrete_random_variable.2C_countable_case

Xavier Y. Poisson - Dec 5, 2016, 6:15:00 AM
Hi there! I just read your post and wanted to help; I'm a PhD student in pure math so I can really relate to your struggles. Before I start, I'd like to just say that all the equations you pulled from Wikipedia are in fact correct, all this stuff does make sense, and I hope to convince you below that the notation, once understood, is actually very efficient and precise. I would guess that your confusions mostly stem from poor instruction (e.g. professors placing too much emphasis on computations rather than defining things carefully). I'll just start from the top and work my way down, and then you can ask me if anything I say is unclear.

Alright, so at the top you're discussing X, which you refer to as a "distribution", but really, it is an object called a (discrete) random variable. Many courses will gloss over what exactly is meant by a random variable, and say something vague, like "it's just some random numerical quantity". However, as the existence of your post shows, this can easily lead one into a swamp of confusion. So, here's the straight dope: a random variable is really a (measurable) *function*, X, from a *probability space* (what's this? see below), call it Ω, to the set of real numbers. Ignoring the "measurable" part (which is just a technical assumption), this means that X is a thingy which eats sample points (in other words, elements of the probability space Ω aka sample space), and spits out real numbers. For example, consider rolling two dice. The sample space here would be the set {(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),(2,1),(2,2),...,(6,5),(6,6)}, which of course has 6*6=36 elements. An example of a discrete random variable would be the function X that sends a sample point to the sum of the two values, so for example X((6,5)) would just be 6+5 = 11. In a book you would see them write "Define a random variable X to be the sum of the two numbers rolled on the dice", or something. It's important to note that this is sort of misleading, because they're really defining X to be the sum NOT of two numbers, but of two random variables! (Which ones are they?) So there, now you know exactly what the X you're talking about above actually *is*.
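To make the dice example concrete, here is a small Python sketch of a random variable as a plain function on a sample space. It assumes fair dice, so each of the 36 sample points gets probability 1/36; the names `omega`, `X` and `distribution` are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs of faces from two dice.
omega = list(product(range(1, 7), repeat=2))

def X(outcome):
    """Random variable: a function from a sample point to a real number."""
    first, second = outcome
    return first + second

def distribution(rv):
    """Probability of each value of rv, assuming every sample point is equally likely."""
    pmf = {}
    for point in omega:
        value = rv(point)
        pmf[value] = pmf.get(value, 0) + Fraction(1, len(omega))
    return pmf

pmf_X = distribution(X)
print(len(omega))   # 36
print(X((6, 5)))    # 11
print(pmf_X[7])     # 1/6
```

The table `pmf_X` is the thing the post calls a "distribution"; `X` itself is just the function that produces it.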

Above I should have also told you what "probability space" means. Well, it's basically a set, equipped with a "probability measure": a way of assigning probabilities (numbers between 0 and 1) to certain subsets which people call "events". Why only *certain* subsets you ask? Uhhh... this is a technicality and not really relevant to understanding things for now. It's because by invoking the Axiom of Choice and constructing some awful monsters, you can show that in a lot of natural scenarios you can't find a consistent notion of measure that works for *all* subsets. You can read more about this in any introductory treatment of the Lebesgue measure.