# Axioms of Probability

The axiomatic foundation of probability theory was laid by Kolmogorov, one of the greatest mathematicians of the 20th century, who advanced many very different fields of mathematics.

 Definition (Probability Space) A probability space is a triple ${\displaystyle (\Omega ,\Sigma ,\Pr )}$, where ${\displaystyle \Omega }$ is a set, called the sample space, and ${\displaystyle \Sigma \subseteq 2^{\Omega }}$ is the set of all events, satisfying:
 (A1). ${\displaystyle \Omega \in \Sigma }$ and ${\displaystyle \emptyset \in \Sigma }$. (The certain event and the impossible event.)
 (A2). If ${\displaystyle A,B\in \Sigma }$, then ${\displaystyle A\cap B,A\cup B,A-B\in \Sigma }$. (The intersection, union, and difference of two events are events.)
 A probability measure ${\displaystyle \Pr :\Sigma \rightarrow \mathbb {R} }$ is a function that maps each event to a nonnegative real number, satisfying:
 (A3). ${\displaystyle \Pr(\Omega )=1}$.
 (A4). If ${\displaystyle A\cap B=\emptyset }$ (such events are called disjoint events), then ${\displaystyle \Pr(A\cup B)=\Pr(A)+\Pr(B)}$.
 (A5*). For any decreasing sequence of events ${\displaystyle A_{1}\supset A_{2}\supset \cdots \supset A_{n}\supset \cdots }$ with ${\displaystyle \bigcap _{n}A_{n}=\emptyset }$, it holds that ${\displaystyle \lim _{n\rightarrow \infty }\Pr(A_{n})=0}$.

The sample space ${\displaystyle \Omega }$ is the set of all possible outcomes of the random process modeled by the probability space. An event is a subset of ${\displaystyle \Omega }$. The statements (A1)--(A5) are axioms of probability. A probability space is well defined as long as these axioms are satisfied.

Example
Consider the probability space defined by rolling a fair die with six faces. The sample space is ${\displaystyle \Omega =\{1,2,3,4,5,6\}}$, and ${\displaystyle \Sigma }$ is the power set ${\displaystyle 2^{\Omega }}$. For any event ${\displaystyle A\in \Sigma }$, its probability is given by ${\displaystyle \Pr(A)={\frac {|A|}{6}}.}$
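As a small sketch (not part of the original notes; the names `omega` and `pr` are ours), the die example can be written out in code, with axioms (A3) and (A4) checked by enumeration:

```python
from fractions import Fraction

# The die example as code: the sample space and the probability
# measure Pr(A) = |A|/6, using exact fractions.
omega = frozenset({1, 2, 3, 4, 5, 6})

def pr(event):
    """Probability of an event A, as the exact fraction |A|/6."""
    return Fraction(len(event), len(omega))

assert pr(omega) == 1                      # (A3): Pr(Omega) = 1
A, B = frozenset({1, 2}), frozenset({5})   # two disjoint events
assert pr(A | B) == pr(A) + pr(B)          # (A4): additivity for disjoint events
```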
Remark
• In general, the set ${\displaystyle \Omega }$ may be continuous, but we only consider discrete probability in this lecture, thus we assume that ${\displaystyle \Omega }$ is either finite or countably infinite.
• In many cases (such as the above example), ${\displaystyle \Sigma =2^{\Omega }}$, i.e. the set of events consists of all subsets of ${\displaystyle \Omega }$. But in general, a probability space is well-defined for any ${\displaystyle \Sigma }$ satisfying (A1) and (A2). Such a ${\displaystyle \Sigma }$ is called a ${\displaystyle \sigma }$-algebra defined on ${\displaystyle \Omega }$.
• The last axiom (A5*) is redundant if ${\displaystyle \Sigma }$ is finite, so it is essential only when there are infinitely many events. The role of axiom (A5*) in probability theory is like that of Zorn's Lemma (or equivalently the Axiom of Choice) in axiomatic set theory.

Laws for probability can be deduced from the above axiom system. Denote ${\displaystyle {\bar {A}}=\Omega -A}$.

 Proposition ${\displaystyle \Pr({\bar {A}})=1-\Pr(A)}$.
Proof.
 Since ${\displaystyle A}$ and ${\displaystyle {\bar {A}}}$ are disjoint and ${\displaystyle A\cup {\bar {A}}=\Omega }$, Axiom (A4) gives ${\displaystyle \Pr({\bar {A}})+\Pr(A)=\Pr(\Omega )}$, which is equal to 1 according to Axiom (A3). Thus ${\displaystyle \Pr({\bar {A}})+\Pr(A)=1}$, and the proposition follows.
${\displaystyle \square }$

Exercise: Deduce other useful laws for probability from the axioms. For example, ${\displaystyle A\subseteq B\Longrightarrow \Pr(A)\leq \Pr(B)}$.

# Notation

An event ${\displaystyle A\subseteq \Omega }$ can be represented as ${\displaystyle A=\{a\in \Omega \mid {\mathcal {E}}(a)\}}$ with a predicate ${\displaystyle {\mathcal {E}}}$.

The predicate notation of probability is

${\displaystyle \Pr[{\mathcal {E}}]=\Pr(\{a\in \Omega \mid {\mathcal {E}}(a)\})}$.
Example
We again consider the probability space defined by rolling a six-sided die. The sample space is ${\displaystyle \Omega =\{1,2,3,4,5,6\}}$. Consider the event that the outcome is odd.
${\displaystyle \Pr[{\text{ the outcome is odd }}]=\Pr(\{1,3,5\})}$.

During the lecture, we mostly use the predicate notation instead of subset notation.
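The predicate notation translates directly into code. In this sketch (our own names, assuming the uniform die space), an event is given by a predicate and `pr` computes the probability of the subset it defines:

```python
from fractions import Fraction

# Predicate notation Pr[E] as code: an event is a predicate E, and
# Pr[E] = Pr({a in Omega | E(a)}). Uniform six-sided die assumed.
omega = range(1, 7)

def pr(predicate):
    favorable = [a for a in omega if predicate(a)]
    return Fraction(len(favorable), len(omega))

assert pr(lambda a: a % 2 == 1) == Fraction(1, 2)  # Pr[the outcome is odd]
```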

# The Union Bound

We are familiar with the principle of inclusion-exclusion for finite sets.

 Principle of Inclusion-Exclusion Let ${\displaystyle S_{1},S_{2},\ldots ,S_{n}}$ be ${\displaystyle n}$ finite sets. Then {\displaystyle {\begin{aligned}\left|\bigcup _{1\leq i\leq n}S_{i}\right|&=\sum _{i=1}^{n}|S_{i}|-\sum _{i<j}|S_{i}\cap S_{j}|+\sum _{i<j<k}|S_{i}\cap S_{j}\cap S_{k}|-\cdots +(-1)^{n-1}\left|\bigcap _{1\leq i\leq n}S_{i}\right|.\end{aligned}}}

The principle can be generalized to probability events.

 Principle of Inclusion-Exclusion for Probability Let ${\displaystyle {\mathcal {E}}_{1},{\mathcal {E}}_{2},\ldots ,{\mathcal {E}}_{n}}$ be ${\displaystyle n}$ events. Then {\displaystyle {\begin{aligned}\Pr \left[\bigvee _{1\leq i\leq n}{\mathcal {E}}_{i}\right]&=\sum _{i=1}^{n}\Pr[{\mathcal {E}}_{i}]-\sum _{i<j}\Pr[{\mathcal {E}}_{i}\wedge {\mathcal {E}}_{j}]+\sum _{i<j<k}\Pr[{\mathcal {E}}_{i}\wedge {\mathcal {E}}_{j}\wedge {\mathcal {E}}_{k}]-\cdots +(-1)^{n-1}\Pr \left[\bigwedge _{1\leq i\leq n}{\mathcal {E}}_{i}\right].\end{aligned}}}

We only prove the basic case for two events.

 Lemma For any two events ${\displaystyle {\mathcal {E}}_{1}}$ and ${\displaystyle {\mathcal {E}}_{2}}$, ${\displaystyle \Pr[{\mathcal {E}}_{1}\vee {\mathcal {E}}_{2}]=\Pr[{\mathcal {E}}_{1}]+\Pr[{\mathcal {E}}_{2}]-\Pr[{\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2}]}$.
Proof.
 The following equalities are due to Axiom (A4). {\displaystyle {\begin{aligned}\Pr[{\mathcal {E}}_{1}]&=\Pr[{\mathcal {E}}_{1}\wedge \neg ({\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2})]+\Pr[{\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2}];\\\Pr[{\mathcal {E}}_{2}]&=\Pr[{\mathcal {E}}_{2}\wedge \neg ({\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2})]+\Pr[{\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2}];\\\Pr[{\mathcal {E}}_{1}\vee {\mathcal {E}}_{2}]&=\Pr[{\mathcal {E}}_{1}\wedge \neg ({\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2})]+\Pr[{\mathcal {E}}_{2}\wedge \neg ({\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2})]+\Pr[{\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2}].\end{aligned}}} The lemma follows directly.
${\displaystyle \square }$

A direct consequence of the lemma is the following theorem, the union bound.

 Theorem (Union Bound) Let ${\displaystyle {\mathcal {E}}_{1},{\mathcal {E}}_{2},\ldots ,{\mathcal {E}}_{n}}$ be ${\displaystyle n}$ events. Then {\displaystyle {\begin{aligned}\Pr \left[\bigvee _{1\leq i\leq n}{\mathcal {E}}_{i}\right]&\leq \sum _{i=1}^{n}\Pr[{\mathcal {E}}_{i}].\end{aligned}}}

This inequality is known as Boole's inequality, but it is usually referred to by its nickname, the "union bound". The bound holds for arbitrary events, even if they are dependent. Due to this generality, the union bound is extremely useful in probabilistic analysis.
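As a quick sanity check (a sketch of our own, on the six-sided-die space), both the two-event inclusion-exclusion identity and the union bound can be verified by enumeration:

```python
from fractions import Fraction

# Enumerate the six-sided-die space to check the two-event
# inclusion-exclusion identity and the union bound.
omega = set(range(1, 7))

def pr(event):
    return Fraction(len(event), len(omega))

E1 = {a for a in omega if a % 2 == 1}  # the outcome is odd
E2 = {a for a in omega if a <= 3}      # the outcome is at most 3

# Inclusion-exclusion (the lemma): equality with the correction term.
assert pr(E1 | E2) == pr(E1) + pr(E2) - pr(E1 & E2)
# Union bound: dropping the correction term leaves an upper bound.
assert pr(E1 | E2) <= pr(E1) + pr(E2)
```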

# Independence

 Definition (Independent events) Two events ${\displaystyle {\mathcal {E}}_{1}}$ and ${\displaystyle {\mathcal {E}}_{2}}$ are independent if and only if {\displaystyle {\begin{aligned}\Pr \left[{\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2}\right]&=\Pr[{\mathcal {E}}_{1}]\cdot \Pr[{\mathcal {E}}_{2}].\end{aligned}}}

This definition can be generalized to any number of events:

 Definition (Independent events) Events ${\displaystyle {\mathcal {E}}_{1},{\mathcal {E}}_{2},\ldots ,{\mathcal {E}}_{n}}$ are mutually independent if and only if, for any subset ${\displaystyle I\subseteq \{1,2,\ldots ,n\}}$, {\displaystyle {\begin{aligned}\Pr \left[\bigwedge _{i\in I}{\mathcal {E}}_{i}\right]&=\prod _{i\in I}\Pr[{\mathcal {E}}_{i}].\end{aligned}}}

Note that in probability theory, "mutual independence" is not equivalent to "pairwise independence", which we will learn about in the future.
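The gap between the two notions can already be seen in a tiny example (our own sketch, not from the notes): with two fair coin flips, the events "first flip is HEADs", "second flip is HEADs", and "the flips disagree" are pairwise independent but not mutually independent.

```python
from fractions import Fraction
from itertools import product

# Two fair coin flips: events that are pairwise independent
# but not mutually independent.
omega = list(product([0, 1], repeat=2))

def pr(event):
    return Fraction(sum(1 for a in omega if event(a)), len(omega))

E1 = lambda a: a[0] == 1      # first flip is HEADs
E2 = lambda a: a[1] == 1      # second flip is HEADs
E3 = lambda a: a[0] != a[1]   # the two flips disagree

both = lambda X, Y: (lambda a: X(a) and Y(a))

# Every pair is independent ...
assert pr(both(E1, E2)) == pr(E1) * pr(E2)
assert pr(both(E1, E3)) == pr(E1) * pr(E3)
assert pr(both(E2, E3)) == pr(E2) * pr(E3)
# ... but the three events are not mutually independent.
assert pr(lambda a: E1(a) and E2(a) and E3(a)) != pr(E1) * pr(E2) * pr(E3)
```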

# Conditional Probability

In probability theory, the word "condition" is used as a verb: "conditioning on the event ..." means assuming that the event occurs.

 Definition (conditional probability) The conditional probability that event ${\displaystyle {\mathcal {E}}_{1}}$ occurs given that event ${\displaystyle {\mathcal {E}}_{2}}$ occurs is ${\displaystyle \Pr[{\mathcal {E}}_{1}\mid {\mathcal {E}}_{2}]={\frac {\Pr[{\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2}]}{\Pr[{\mathcal {E}}_{2}]}}.}$

The conditional probability is well-defined only if ${\displaystyle \Pr[{\mathcal {E}}_{2}]\neq 0}$.

For independent events ${\displaystyle {\mathcal {E}}_{1}}$ and ${\displaystyle {\mathcal {E}}_{2}}$, it holds that

${\displaystyle \Pr[{\mathcal {E}}_{1}\mid {\mathcal {E}}_{2}]={\frac {\Pr[{\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2}]}{\Pr[{\mathcal {E}}_{2}]}}={\frac {\Pr[{\mathcal {E}}_{1}]\cdot \Pr[{\mathcal {E}}_{2}]}{\Pr[{\mathcal {E}}_{2}]}}=\Pr[{\mathcal {E}}_{1}].}$

This supports our intuition that for two independent events, whether one of them occurs does not affect the chance of the other.
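A concrete instance on the die space (a sketch with our own names; `cond_pr` implements the definition above):

```python
from fractions import Fraction

# Conditional probability on the six-sided-die space.
omega = set(range(1, 7))

def pr(event):
    return Fraction(len(event), len(omega))

def cond_pr(E1, E2):
    """Pr[E1 | E2] = Pr[E1 and E2] / Pr[E2]; requires Pr[E2] != 0."""
    assert pr(E2) != 0
    return pr(E1 & E2) / pr(E2)

odd = {1, 3, 5}
at_most_2 = {1, 2}
# These two events happen to be independent, so conditioning
# leaves the probability unchanged.
assert cond_pr(odd, at_most_2) == pr(odd)
```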

# Law of total probability

The following fact is known as the law of total probability. It computes the probability by averaging over all possible cases.

 Theorem (law of total probability) Let ${\displaystyle {\mathcal {E}}_{1},{\mathcal {E}}_{2},\ldots ,{\mathcal {E}}_{n}}$ be mutually disjoint events, and ${\displaystyle \bigvee _{i=1}^{n}{\mathcal {E}}_{i}=\Omega }$ is the sample space. Then for any event ${\displaystyle {\mathcal {E}}}$, ${\displaystyle \Pr[{\mathcal {E}}]=\sum _{i=1}^{n}\Pr[{\mathcal {E}}\mid {\mathcal {E}}_{i}]\cdot \Pr[{\mathcal {E}}_{i}].}$
Proof.
 Since ${\displaystyle {\mathcal {E}}_{1},{\mathcal {E}}_{2},\ldots ,{\mathcal {E}}_{n}}$ are mutually disjoint and ${\displaystyle \bigvee _{i=1}^{n}{\mathcal {E}}_{i}=\Omega }$, events ${\displaystyle {\mathcal {E}}\wedge {\mathcal {E}}_{1},{\mathcal {E}}\wedge {\mathcal {E}}_{2},\ldots ,{\mathcal {E}}\wedge {\mathcal {E}}_{n}}$ are also mutually disjoint, and ${\displaystyle {\mathcal {E}}=\bigvee _{i=1}^{n}\left({\mathcal {E}}\wedge {\mathcal {E}}_{i}\right)}$. Then ${\displaystyle \Pr[{\mathcal {E}}]=\sum _{i=1}^{n}\Pr[{\mathcal {E}}\wedge {\mathcal {E}}_{i}],}$ which according to the definition of conditional probability, is ${\displaystyle \sum _{i=1}^{n}\Pr[{\mathcal {E}}\mid {\mathcal {E}}_{i}]\cdot \Pr[{\mathcal {E}}_{i}]}$.
${\displaystyle \square }$

The law of total probability provides a standard tool for breaking a probability into sub-cases, which often simplifies the analysis.
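For instance (an illustrative sketch of our own, assuming two fair dice), the probability that the sum is 7 can be computed by conditioning on the value of the first die:

```python
from fractions import Fraction
from itertools import product

# Law of total probability on two fair dice: compute Pr[sum = 7]
# by conditioning on the value of the first die.
omega = list(product(range(1, 7), repeat=2))

def pr(pred):
    return Fraction(sum(1 for a in omega if pred(a)), len(omega))

def cond_pr(pred, given):
    return pr(lambda a: pred(a) and given(a)) / pr(given)

is_seven = lambda a: a[0] + a[1] == 7
total = sum(
    cond_pr(is_seven, lambda a, i=i: a[0] == i) * pr(lambda a, i=i: a[0] == i)
    for i in range(1, 7)
)
assert total == pr(is_seven)  # the two computations agree
```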

# A Chain of Conditioning

By the definition of conditional probability, ${\displaystyle \Pr[A\mid B]={\frac {\Pr[A\wedge B]}{\Pr[B]}}}$. Thus, ${\displaystyle \Pr[A\wedge B]=\Pr[B]\cdot \Pr[A\mid B]}$. This suggests that we can compute the probability of a conjunction (AND) of events via conditional probabilities. Formally, we have the following theorem:

 Theorem Let ${\displaystyle {\mathcal {E}}_{1},{\mathcal {E}}_{2},\ldots ,{\mathcal {E}}_{n}}$ be any ${\displaystyle n}$ events. Then {\displaystyle {\begin{aligned}\Pr \left[\bigwedge _{i=1}^{n}{\mathcal {E}}_{i}\right]&=\prod _{k=1}^{n}\Pr \left[{\mathcal {E}}_{k}\mid \bigwedge _{i<k}{\mathcal {E}}_{i}\right].\end{aligned}}}
Proof.
 It holds that ${\displaystyle \Pr[A\wedge B]=\Pr[B]\cdot \Pr[A\mid B]}$. Thus, let ${\displaystyle A={\mathcal {E}}_{n}}$ and ${\displaystyle B={\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2}\wedge \cdots \wedge {\mathcal {E}}_{n-1}}$; then {\displaystyle {\begin{aligned}\Pr[{\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2}\wedge \cdots \wedge {\mathcal {E}}_{n}]&=\Pr[{\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2}\wedge \cdots \wedge {\mathcal {E}}_{n-1}]\cdot \Pr \left[{\mathcal {E}}_{n}\mid \bigwedge _{i<n}{\mathcal {E}}_{i}\right].\end{aligned}}} Recursively applying this equation to ${\displaystyle \Pr[{\mathcal {E}}_{1}\wedge {\mathcal {E}}_{2}\wedge \cdots \wedge {\mathcal {E}}_{n-1}]}$ until only ${\displaystyle {\mathcal {E}}_{1}}$ is left, the theorem is proved.
${\displaystyle \square }$
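A typical use of this chain of conditioning (an example of our own, not from the notes) is sampling without replacement. In the sketch below, the probability that the first three cards drawn from a small deck of 4 red and 4 black cards are all red is computed by the chain rule and checked against brute-force enumeration:

```python
from fractions import Fraction
from itertools import permutations

# Chain rule: Pr[first three draws are red]
#   = Pr[E1] * Pr[E2 | E1] * Pr[E3 | E1, E2]
chain = Fraction(4, 8) * Fraction(3, 7) * Fraction(2, 6)

deck = ['R'] * 4 + ['B'] * 4
perms = list(permutations(range(8)))  # treat the 8 cards as distinct
favorable = sum(1 for p in perms if all(deck[p[i]] == 'R' for i in range(3)))
assert chain == Fraction(favorable, len(perms))
```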

# Random Variable

 Definition (random variable) A random variable ${\displaystyle X}$ on a sample space ${\displaystyle \Omega }$ is a real-valued function ${\displaystyle X:\Omega \rightarrow \mathbb {R} }$. A random variable X is called a discrete random variable if its range is finite or countably infinite.

For a random variable ${\displaystyle X}$ and a real value ${\displaystyle x\in \mathbb {R} }$, we write "${\displaystyle X=x}$" for the event ${\displaystyle \{a\in \Omega \mid X(a)=x\}}$, and denote the probability of the event by

${\displaystyle \Pr[X=x]=\Pr(\{a\in \Omega \mid X(a)=x\})}$.
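Since a random variable is just a function on the sample space, this notation is easy to realize in code. A sketch (our own names, assuming two fair dice) with ${\displaystyle X}$ being the sum of the two dice:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# A random variable as a function on the sample space:
# X((a, b)) = a + b for two fair dice.
omega = list(product(range(1, 7), repeat=2))

def X(a):
    return a[0] + a[1]

counts = Counter(X(a) for a in omega)

def pr_X_eq(x):
    """Pr[X = x] = Pr({a in Omega | X(a) = x})."""
    return Fraction(counts.get(x, 0), len(omega))

assert pr_X_eq(7) == Fraction(1, 6)
assert sum(pr_X_eq(x) for x in counts) == 1
```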

# Independent Random Variables

Independence can also be defined for random variables:

 Definition (Independent variables) Two random variables ${\displaystyle X}$ and ${\displaystyle Y}$ are independent if and only if ${\displaystyle \Pr[(X=x)\wedge (Y=y)]=\Pr[X=x]\cdot \Pr[Y=y]}$ for all values ${\displaystyle x}$ and ${\displaystyle y}$. Random variables ${\displaystyle X_{1},X_{2},\ldots ,X_{n}}$ are mutually independent if and only if, for any subset ${\displaystyle I\subseteq \{1,2,\ldots ,n\}}$ and any values ${\displaystyle x_{i}}$, where ${\displaystyle i\in I}$, {\displaystyle {\begin{aligned}\Pr \left[\bigwedge _{i\in I}(X_{i}=x_{i})\right]&=\prod _{i\in I}\Pr[X_{i}=x_{i}].\end{aligned}}}

Note that in probability theory, "mutual independence" is not equivalent to "pairwise independence", which we will learn about in the future.

# Expectation

Let ${\displaystyle X}$ be a discrete random variable. The expectation of ${\displaystyle X}$ is defined as follows.

 Definition (Expectation) The expectation of a discrete random variable ${\displaystyle X}$, denoted by ${\displaystyle \mathbf {E} [X]}$, is given by {\displaystyle {\begin{aligned}\mathbf {E} [X]&=\sum _{x}x\Pr[X=x],\end{aligned}}} where the summation is over all values ${\displaystyle x}$ in the range of ${\displaystyle X}$.

### Linearity of Expectation

Perhaps the most useful property of expectation is its linearity.

 Theorem (Linearity of Expectations) For any discrete random variables ${\displaystyle X_{1},X_{2},\ldots ,X_{n}}$, and any real constants ${\displaystyle a_{1},a_{2},\ldots ,a_{n}}$, {\displaystyle {\begin{aligned}\mathbf {E} \left[\sum _{i=1}^{n}a_{i}X_{i}\right]&=\sum _{i=1}^{n}a_{i}\cdot \mathbf {E} [X_{i}].\end{aligned}}}
Proof.
 By the definition of expectation, it is easy to verify (try to prove it yourself) that for any discrete random variables ${\displaystyle X}$ and ${\displaystyle Y}$, and any real constant ${\displaystyle c}$: ${\displaystyle \mathbf {E} [X+Y]=\mathbf {E} [X]+\mathbf {E} [Y]}$ and ${\displaystyle \mathbf {E} [cX]=c\mathbf {E} [X]}$. The theorem follows by induction.
${\displaystyle \square }$

The linearity of expectation gives an easy way to compute the expectation of a random variable if the variable can be written as a sum.

Example
Suppose we have a biased coin for which the probability of HEADs is ${\displaystyle p}$. Flipping the coin ${\displaystyle n}$ times, what is the expected number of HEADs?
It seems straightforward that it must be ${\displaystyle np}$, but how can we prove it? We could apply the definition of expectation and compute by brute force. A more convenient way is by the linearity of expectation: Let ${\displaystyle X_{i}}$ indicate whether the ${\displaystyle i}$-th flip is HEADs. Then ${\displaystyle \mathbf {E} [X_{i}]=1\cdot p+0\cdot (1-p)=p}$, and the total number of HEADs after ${\displaystyle n}$ flips is ${\displaystyle X=\sum _{i=1}^{n}X_{i}}$. Applying the linearity of expectation, the expected number of HEADs is:
${\displaystyle \mathbf {E} [X]=\mathbf {E} \left[\sum _{i=1}^{n}X_{i}\right]=\sum _{i=1}^{n}\mathbf {E} [X_{i}]=np}$.
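The brute-force computation mentioned above can be carried out for small ${\displaystyle n}$ and compared against ${\displaystyle np}$. A sketch of our own (the values `n = 4` and `p = 1/3` are arbitrary choices):

```python
from fractions import Fraction
from itertools import product

# Brute-force E[number of HEADs] over all 2^n outcomes,
# compared against the linearity-of-expectation answer n*p.
n, p = 4, Fraction(1, 3)

expectation = Fraction(0)
for flips in product([0, 1], repeat=n):         # all 2^n outcomes
    h = sum(flips)                              # number of HEADs
    expectation += h * p**h * (1 - p)**(n - h)  # weight by outcome probability

assert expectation == n * p
```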

The real power of the linearity of expectations is that it does not require the random variables to be independent, thus can be applied to any set of random variables. For example:

${\displaystyle \mathbf {E} \left[\alpha X+\beta X^{2}+\gamma X^{3}\right]=\alpha \cdot \mathbf {E} [X]+\beta \cdot \mathbf {E} \left[X^{2}\right]+\gamma \cdot \mathbf {E} \left[X^{3}\right].}$

However, do not exaggerate this power!

• For an arbitrary function ${\displaystyle f}$ (not necessarily linear), the equation ${\displaystyle \mathbf {E} [f(X)]=f(\mathbf {E} [X])}$ does not hold in general.
• For variances, the equation ${\displaystyle \mathrm {var} (X+Y)=\mathrm {var} (X)+\mathrm {var} (Y)}$ does not hold without further assuming the independence of ${\displaystyle X}$ and ${\displaystyle Y}$.
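The first caveat already fails for the simplest nonlinear function. A tiny counterexample of our own, with ${\displaystyle X}$ uniform on ${\displaystyle \{0,1\}}$ and ${\displaystyle f(x)=x^{2}}$:

```python
from fractions import Fraction

# Counterexample to E[f(X)] = f(E[X]): X uniform on {0, 1}, f(x) = x^2.
E_X = Fraction(1, 2)                # E[X]
E_X_sq = Fraction(0**2 + 1**2, 2)   # E[X^2] = (0 + 1)/2 = 1/2
assert E_X_sq != E_X**2             # 1/2 != 1/4
```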

# Conditional Expectation

Conditional expectation can be accordingly defined:

 Definition (conditional expectation) For random variables ${\displaystyle X}$ and ${\displaystyle Y}$, ${\displaystyle \mathbf {E} [X\mid Y=y]=\sum _{x}x\Pr[X=x\mid Y=y],}$ where the summation is taken over the range of ${\displaystyle X}$.

### The Law of Total Expectation

There is also a law of total expectation.

 Theorem (law of total expectation) Let ${\displaystyle X}$ and ${\displaystyle Y}$ be two random variables. Then ${\displaystyle \mathbf {E} [X]=\sum _{y}\mathbf {E} [X\mid Y=y]\cdot \Pr[Y=y].}$
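The theorem can be checked by direct enumeration. A sketch of our own, on two fair dice, with ${\displaystyle X}$ the sum and ${\displaystyle Y}$ the first die:

```python
from fractions import Fraction
from itertools import product

# Law of total expectation on two fair dice:
# X = the sum of the dice, Y = the first die.
omega = list(product(range(1, 7), repeat=2))

def X(a):
    return a[0] + a[1]

def Y(a):
    return a[0]

E_X = Fraction(sum(X(a) for a in omega), len(omega))

total = Fraction(0)
for y in range(1, 7):
    block = [a for a in omega if Y(a) == y]
    pr_y = Fraction(len(block), len(omega))
    e_cond = Fraction(sum(X(a) for a in block), len(block))  # E[X | Y = y]
    total += e_cond * pr_y

assert E_X == total  # both sides of the law agree
```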