Working In Uncertainty

Taming probability notation


This article proposes that we should make probability notation simpler and more consistent to avoid confusing learners and everyone else.

Improving the three types of probability notation

It's not obvious, but there are three basic types of notations for 'probabilities' in probability theory:

  1. Generic notation using P, p, or Pr for everything.

  2. Specific functions, using a variety of names invented to represent particular distributions, such as fx.

  3. Distribution families (e.g. Normal(x|μ, σ²)), where specific distributions are selected by specifying particular parameter values (e.g. μ = 2.3 and σ² = 9.4).

The approach I suggest below is to use the generic notation more strictly, which tends to make it a bit more lengthy, but to use specific functions more often and more systematically to compensate for this extra writing.

More complete generic notation

The format suggested below is inspired by Z (see Spivey, for example), a mathematical style developed for specifying computer systems. It also has similarities with proposals for notation by Carroll Morgan and by Maarten Fokkinga.

The format looks like this: P[X, A, B], where P is the symbol used every time to show that this is a probability, X is the name of the probability space involved while A and B are sets used in the probability space. If the probability space has the usual three elements, so that X = (Ω, S, μ), then:

P[X, A, B] = μ[A ∩ B]/μ[A].

In other words, this is the probability, using the probability measure μ, that the truth lies in B given that the truth lies in A. (Remember that Ω is the set of possible truths, S is a set of sets of these possible truths, and μ is a function that gives a probability number for each set in S.)
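
To make the definition concrete, here is a minimal sketch for a finite probability space. Everything in it (the fair-coin space, the names mu and p) is illustrative rather than part of the proposal; mu is modelled as a function on sets of outcomes, as in the definition above:

```python
from fractions import Fraction

# A finite probability space X = (omega, s, mu). Here omega is the set of
# possible truths and mu gives a probability for each subset of omega.
omega = frozenset({"heads", "tails"})

def mu(event):
    """Uniform measure on a fair coin: each outcome has probability 1/2."""
    return Fraction(len(event), len(omega))

def p(mu, a, b):
    """P[X, A, B] = mu[A ∩ B] / mu[A]: the probability of B given A."""
    return mu(a & b) / mu(a)

# An 'unconditional' probability still conditions on omega:
# p(mu, omega, frozenset({"heads"})) -> 1/2
```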

If the probability is not considered as conditional on anything, it is still in fact conditional on something in Ω being true, so we can write:

P[X, Ω, B].

Here are some familiar probabilities in old notation and the complete notation I am suggesting:

Old notation                                 Complete notation
P(A)                                         P[X, Ω, A]
P(heads)                                     P[X, Ω, {heads}]
P(t ≤ T)                                     P[X, Ω, {t : Ω | t ≤ T}]
P(A|B)                                       P[X, B, A]
P(Z = 3), where Z is a 'random variable'     P[X, Ω, {ω : Ω | Z[ω] = 3}]

An advantage of the stricter notation is that you can avoid making mistakes when two or more probability spaces are involved in a problem. This might be because you are working with the views of two or more people, each one having a different view of the probabilities, represented by a different probability space. For example, when analysing a negotiation, the two parties might have different views of the outcome from a particular settlement and it would be helpful to be able to distinguish between them. Perhaps both parties analyse the future in the same way but just have different views as to how likely different outcomes are:

XAdam = (Ω, S, μa) and XBob = (Ω, S, μb)

Or perhaps they analyse the future differently, so that not even their sets of possible truths agree:

XAdam = (Ωa, Sa, μa) and XBob = (Ωb, Sb, μb)

We also want to be explicit about probability spaces when we build one from another.

I like the way this notation continually reminds us that there is a probability space involved and that all probabilities are conditional. I also prefer the rigour of the set builder notation used to specify the sets involved. The stricter notation is consistent and gives more information. The old notation for random variables (e.g. P(Z = 3)) is a particularly misleading abuse of notation.

Systematic and frequent use of specific functions

Both the old and the complete versions of generic probability notation are extremely flexible and powerful. However, they both have two limitations. One is that they can be lengthy when written down. The other is that they represent only individual probabilities, not whole distributions, and in practical applications we nearly always want to work with whole distributions.

It is helpful to avoid using generic notation all the time by introducing specific functions with individual names, rather than trying to make P do all the work.

Defining these specific functions produces more compact notation but requires some care in thinking of function names that are easy to remember and then providing clear definitions if needed. When working on a particular problem it is usually easy to learn the type and meaning of the functions you create.

The following examples again assume a probability space, X, defined as X = (Ω, S, μ). Also, notice that I am using square brackets for functions to avoid confusion with the curved brackets used to show order of calculation.

  1. Old notation:        P(A)
     Generic notation:    P[X, Ω, A]
     Specific notation:   f[A]
     Definition:          f : ℙ Ω → ℝ
                          ∀ A : ℙ Ω • f[A] = P[X, Ω, A]

  2. Old notation:        P(B|A)
     Generic notation:    P[X, A, B]
     Specific notation:   g[A][B]
     Definition:          g : ℙ Ω → (ℙ Ω → ℝ)
                          ∀ A, B : ℙ Ω • g[A][B] = P[X, A, B]

  3. Old notation:        P(Z ≤ F)
     Generic notation:    P[X, Ω, {ω : Ω | Z[ω] ≤ F}]
     Specific notation:   Zf[F]
     Definition:          Zf : ℝ → ℝ
                          ∀ F : ℝ • Zf[F] = P[X, Ω, {ω : Ω | Z[ω] ≤ F}]

  4. Old notation:        P(x | y)
     Generic notation:    P[X, {y}, {x}]
     Specific notation:   f[y][x]
     Definition:          f : ℝ → (ℝ → ℝ)
                          ∀ x, y : ℝ • f[y][x] = P[X, {y}, {x}]

In example number 1, the specific notation does little more than eliminate the need to explicitly specify the probability space and conditioning set. The function, f, takes as input a subset of Ω (the 'outcome space' from the probability space X) and returns the probability that the truth lies in that set, according to the μ from the probability space.

In example 2, the idea of a conditional probability distribution is captured as a function, g, that takes as input a set from Ω and returns another function, this one taking as input a second subset of Ω and returning the probability that the truth lies in that second set, given that it is known to lie in the first. The function, g, is used by giving the inputs one after the other, as shown above.

Example 3 shows a typical situation involving a so-called 'random variable'. The old notation is read as saying 'the probability of the random variable, Z, being less than or equal to F'. However, technically, Z is a function that takes as input an item from Ω and returns a Real number. The generic notation shows this idea using the standard rules of set builder notation. The function Zf simply returns the probability of the random variable returning a number less than or equal to the input.

Example 4 is another conditional probability distribution, this time probably based on a joint probability density distribution, with the notation representing the probability density of a particular value of x given a particular value of y.
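
The currying in example 2 can be sketched directly. The following is illustrative only, using a hypothetical fair six-sided die as the probability space; g takes the conditioning set first and returns another function, exactly as in the definition above:

```python
from fractions import Fraction

# Illustrative finite space: a fair six-sided die.
omega = frozenset(range(1, 7))

def mu(event):
    """Uniform measure: each face has probability 1/6."""
    return Fraction(len(event), 6)

def g(a):
    """g[A] returns a function B -> P[X, A, B], as in example 2."""
    def given_a(b):
        return mu(a & b) / mu(a)
    return given_a

evens = frozenset({2, 4, 6})
# The inputs are given one after the other:
# g(evens)(frozenset({4, 5, 6})) -> 2/3
```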

Standard notation for distribution families

The notation for distribution families is really that of conditional distributions so instead of writing Normal(x|μ, σ²) we can write Normal[μ, σ²][x]. In this example, Normal is a function that, given values for its parameters, returns a probability density function.

An advantage of this style is that it is possible to talk about the function Normal[μ, σ²] without referring to a particular value produced by it, which would be written as Normal[μ, σ²][x].
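
As an illustrative sketch (not part of the proposal), the curried view of a distribution family translates directly into a function returning a function. Here normal(mu, sigma2) plays the role of Normal[μ, σ²], and the closure it returns plays the role of the density Normal[μ, σ²][x]:

```python
import math

def normal(mu, sigma2):
    """Normal[mu, sigma2]: given parameter values, return a density function."""
    def density(x):
        return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    return density

standard = normal(0.0, 1.0)   # a particular density selected from the family
# standard(0.0) is the density at x = 0, i.e. 1 / sqrt(2 * pi)
```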

Some longer examples

Flipping a fair coin

Flipping a fair coin, and using the usual assumptions of equal probabilities, we might define a probability space, X, as follows:

  Coin Probability Space  

X = (Ω, S, μ)

Ω = {heads, tails}

S = {{}, {heads}, {tails}, {heads, tails}}

μ = {({}, 0), ({heads}, 0.5), ({tails}, 0.5), ({heads, tails}, 1)}


The probability of 'heads' can then be written as:

P[X, Ω, {heads}]

To set up a probability distribution with a specific name we can start with a space-saving abbreviation for the results:

HT ::= heads | tails

and then write:

  Abbreviation fc  

Coin Probability Space

fc : HT → ℝ

dom[fc] = {heads, tails}

∀ r : HT | r ∈ dom[fc] • fc[r] = P[X, Ω, {r}]


Note that all the objects and rules defined in the Coin Probability Space schema (the box) earlier are imported into this schema at the start, just by writing Coin Probability Space.

Alternatively, using lambda notation, we could replace the last three lines:

  Abbreviation fc  

Coin Probability Space

fc = λ r : HT | r ∈ HT • P[X, Ω, {r}]


The lambda notation for defining functions can be read as 'fc is the function that maps a result, r, from the set 'heads-or-tails', to its probability P[X, Ω, {r}].'

Another alternative is to give the probabilities directly rather than refer back to the probability space. Either technique could be used:

  Abbreviation fc  

fc : HT → ℝ

∀ r : HT | r ∈ dom[fc] • fc[r] = 0.5


  Abbreviation fc  

fc = λ r : HT | r ∈ HT • 0.5


Or, since this is a very small distribution we could just write:

  Abbreviation fc  

fc = {(heads, 0.5), (tails, 0.5)}


This style uses the idea that a function is really just a set of paired inputs and outputs.

With all these definitions the function is the same. It lets us write things like:

fc[heads] = 0.5
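
The last definition, where the function is written directly as its set of (input, output) pairs, has an exact counterpart in code. This is only an illustration: a Python dict is precisely such a set of pairs:

```python
from fractions import Fraction

# fc written directly as its set of paired inputs and outputs,
# as in the final schema above.
fc = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}

assert fc["heads"] == Fraction(1, 2)   # fc[heads] = 0.5
assert sum(fc.values()) == 1           # the probabilities cover omega
```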

Bayesian modelling

Bayesian modelling of data is a good area for using specific functions.

The probability space for most Bayesian methods combines potential models with data we could potentially observe, used as evidence of how likely each model is to be the best model of the bunch. Since the type is quite complicated, here is an abbreviation for it:

[MODEL, DATUM]

BAYES_SPACE == ℙ (MODEL × DATUM) × ℙ ℙ (MODEL × DATUM) × (ℙ (MODEL × DATUM) → ℝ)

We can now define the probability space, giving the definition the name Bayes Space so that it can be reused later.

  Bayes Space  

(Ω, S, μ) : BAYES_SPACE

ms : ℙ MODEL

ds : ℙ DATUM

isProbSpace[(Ω, S, μ)]

Ω = ms × ds


In addition to a Bayes Space, we also need functions representing views, before and after considering the data, of the probability that each of the set of models is the best model. We also need a function giving the probability of observing particular data assuming each model is true.

  Basic Functions  

Bayes Space

v0, v1 : MODEL → ℝ

fd : MODEL → (DATUM → ℝ)

dom[v0] = ms

v0 = (λ m : MODEL | m ∈ ms • P[(Ω, S, μ), Ω, {(mx, dx) : MODEL × DATUM | mx = m}])

dom[fd] = ms

∀ m : MODEL | m ∈ ms • dom[fd[m]] = ds

∀ m : MODEL, d : DATUM | m ∈ ms ∧ d ∈ ds •

      fd[m][d] = P[(Ω, S, μ), {(mx, dx) : MODEL × DATUM | mx = m}, {(mx, dx) : MODEL × DATUM | dx = d}]

dom[v1] = ms

∀ m : MODEL, d : DATUM | m ∈ ms ∧ d ∈ ds •

      v1[m] = P[(Ω, S, μ), {(mx, dx) : MODEL × DATUM | dx = d}, {(mx, dx) : MODEL × DATUM | mx = m ∧ dx = d}]


These definitions start off by importing the elements of Bayes Space. This makes available Ω, S, μ, ms, and ds, with the relationships established between them.

Then each function is defined with statements that establish the domain of the function (i.e. the inputs it can handle) and the rule that maps inputs to outputs. In these cases the rule uses the probability space.
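
The relationship between v0, fd, and v1 can be sketched on a tiny finite Bayes space. This example is hypothetical throughout: three candidate models for a coin's heads rate, data consisting of the result of one toss, and a uniform prior view:

```python
from fractions import Fraction

# Hypothetical models: three possible heads rates for a coin.
ms = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
v0 = {m: Fraction(1, 3) for m in ms}   # prior view: all models equally likely

def fd(m):
    """fd[m][d]: probability of observing d heads in one toss given model m."""
    return lambda d: m if d == 1 else 1 - m

def v1(d):
    """Posterior view after observing datum d, by Bayes' rule."""
    total = sum(v0[m] * fd(m)(d) for m in ms)
    return {m: v0[m] * fd(m)(d) / total for m in ms}

post = v1(1)   # view after observing one head
# The high-rate model gains weight: post[Fraction(3, 4)] -> 1/2
```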

One of the easiest ways to do a Bayesian analysis is using 'conjugate priors.' The beauty of this technique is that the distributions representing views before and after using evidence can be taken from the same distribution family. All that changes is the value of the parameters that select a particular distribution from the distribution family.

The simplest example is that of tossing an unfair coin to learn about the rate at which it turns up heads, long term. Our initial view of the relative probabilities of each possible rate of heads can be represented by a probability density distribution from the beta family. Our view of the relative probabilities of each possible rate of heads after considering the evidence from some tosses of that coin can also be represented by a distribution from the beta family. The beta distribution has two parameters that select a particular distribution: α and β. We can call the values of those parameters before and after considering evidence (α0, β0) and (α1, β1) respectively.

A second distribution family is also used in this analysis. The binomial family is used to state the probability of getting a certain number of heads from a series of tosses, assuming the probability of heads is the same on every toss. Two parameters are used to select a particular distribution from the binomial family. They are the number of trials (i.e. tosses) and the probability of 'success' on each trial.

The function M simply maps models to particular Real numbers. For example, if you think the rate of heads is 0.3 then the associated Real number is 0.3. It's almost too obvious to mention, but there is a logical difference between a model and a Real number.

  Conjugate Functions  

Bayes Space

Basic Functions

beta : (ℝ × ℝ) → (ℝ → ℝ)

binom : (ℕ × ℝ) → (DATUM → ℝ)

M : MODEL → ℝ

α0, β0, α1, β1 : ℝ

n, d : ℕ

dom[M] = ms

α0 = 0

β0 = 0

dom[beta[α0, β0]] = rng[M]

∀ m : MODEL | m ∈ ms • beta[α0, β0][M[m]] = v0[m]

dom[binom] = {(n', r) : ℕ × ℝ | 0 ≤ r ∧ r ≤ 1}

∀ m : MODEL | m ∈ ms • dom[binom[n, M[m]]] = ds

∀ m : MODEL, dx : DATUM | m ∈ ms ∧ dx ∈ ds • binom[n, M[m]][dx] = fd[m][dx]

dom[beta[α1, β1]] = rng[M]

α1 = α0 + d

β1 = β0 + (n - d)

∀ m : MODEL | m ∈ ms • beta[α1, β1][M[m]] = v1[m]
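
The conjugate update in the schema above is just arithmetic on the parameters, which the following sketch illustrates. The function names and the uniform beta(1, 1) starting point are illustrative assumptions, not part of the schema:

```python
import math

def beta_pdf(alpha, beta):
    """beta[alpha, beta]: parameter values select a density over rates in [0, 1]."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return lambda r: norm * r ** (alpha - 1) * (1 - r) ** (beta - 1)

def update(alpha0, beta0, n, d):
    """The schema's conjugate update: alpha1 = alpha0 + d, beta1 = beta0 + (n - d)."""
    return alpha0 + d, beta0 + (n - d)

# Starting from a uniform prior beta(1, 1) and observing 7 heads in 10 tosses:
alpha1, beta1 = update(1, 1, 10, 7)   # -> (8, 4)
posterior = beta_pdf(alpha1, beta1)   # the new view of the heads rate
```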


Final thoughts

These examples give a flavour of the notation that can be used, but probably also look rather complicated and perhaps even intimidating at first. Bear in mind that these examples give vastly more information than typical writing about probabilities and distributions. Also, reading each statement carefully and understanding what it says provides a much clearer understanding of probabilities than can usually be achieved. Brevity is not always the key to clarity — not if brevity is achieved by leaving the reader to guess the rest.

References

Fokkinga, Maarten M. (2006) Z-style notation for Probabilities. In: Second Twente Data Management Workshop, TDM: Uncertainty in Databases, Enschede, pp. 19-24.

Morgan, C. (2012). Elementary Probability Theory in the Eindhoven Style. Mathematics of Program Construction, Lecture Notes in Computer Science, Volume 7342, pp 48-73.

Spivey, J.M. (1989). The Z Notation. Prentice-Hall, Englewood Cliffs, NJ. Available online at: http://spivey.oriel.ox.ac.uk/mike/zrm/zrm.pdf







Words © 2014 Matthew Leitch