
# Probability Theory, Maximum Likelihood Estimate & Maximum a Posteriori Estimation

## Overview

These are my notes on:

- Lecture 1 Basic of Probability Theory, MSBD5012
- Chapter 1.2 Probability Theory, Pattern Recognition and Machine Learning
- And many other resources ...

## Table of Contents

- Sample Space and Event of a Random Experiment
- Probability and Probability Distribution
- Joint Probability
- Marginal Probability
- Conditional Probability
- Independence
- Test of Independence
- Bayes' theorem
- Expectation and Variance
- Likelihood in Machine Learning
- Maximum Likelihood Estimation (MLE)
- Maximum a Posteriori Estimation (MAP)

## Sample Space and Event of a Random Experiment

A `Random Experiment` is a process with uncertain outcomes. The set of all possible outcomes of the random experiment is called the `Sample Space` $\Omega$.

- If rolling one die is the random experiment, the sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$.

A `Random Variable` is a function that maps each outcome in the sample space to a value in a new sample space:

- The simplest one is the identity function: let the `Random Variable` $X$ be the outcome of rolling one die (the random experiment). $X$ then takes values in $\Omega_{X} = \{1, 2, 3, 4, 5, 6\}$ as well.

An `Event` is a subset of a sample space $\Omega$:

- $(X=2)$ describes the `Event` subset $\Omega_{X=2}=\{2\}$: we got the value $2$ from the random experiment of rolling one die.
- $(X=\mathbf{Odd})$ describes the `Event` subset $\Omega_{X=\mathbf{Odd}}=\{1, 3, 5\}$: we got an odd value from the random experiment of rolling one die.

A `Composed Experiment` consists of repetitions of the same random experiment, and each repetition is called a `trial`.
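The definitions above can be sketched in a few lines of Python. This is only an illustration: the helper names (`identity`, `parity`, `event`) are made up for this sketch, and `parity` is a second, hypothetical random variable added to express the "odd" event.

```python
# Sample space of rolling one die
omega = {1, 2, 3, 4, 5, 6}


def identity(outcome):
    """Random variable X: maps each outcome to itself."""
    return outcome


def parity(outcome):
    """A hypothetical second random variable: maps each outcome to 'Odd' or 'Even'."""
    return "Odd" if outcome % 2 == 1 else "Even"


def event(random_variable, value, sample_space):
    """Return the subset of outcomes for which the random variable equals `value`."""
    return {o for o in sample_space if random_variable(o) == value}


event_x_is_2 = event(identity, 2, omega)      # {2}
event_x_is_odd = event(parity, "Odd", omega)  # {1, 3, 5}
print(event_x_is_2, event_x_is_odd)
```

Both events are plain subsets of `omega`, which matches the definition of an event as a subset of the sample space.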

## Probability and Probability Distribution

Let's consider a random variable $X$ with $M$ discrete outcomes $\Omega_{X}=\{x_1, x_2, ..., x_m, ..., x_M\}$.

$P(X=x_m)$ denotes the `probability` of the event of $X$ being $x_m$. For example, if $X$ is the smartphone model we found in a random experiment, $P(X=\mathbf{IPhoneSE})$ is the probability of finding an IPhoneSE smartphone in the random experiment.

Taking all outcomes into account, $P(X)$ denotes the `probability distribution` of $X$:

- Probability Mass Function (Histogram) for discrete outcomes.
- Probability Density Function for continuous outcomes.

## Joint Probability

Let's say there are two random variables $(X, Y)$ in a random experiment.

- $X$ has $M$ outcomes $\Omega_{X} =\{x_1, x_2, ..., x_m, ..., x_M\}$.
- $Y$ has $L$ outcomes $\Omega_{Y} =\{y_1, y_2, ..., y_l, ..., y_L\}$.

Then each observation in the random experiment is a pair of events $(X=x_m, Y=y_l)$. $P(X=x_m, Y=y_l)$ denotes the `joint probability` of the two events, $(X=x_m)$ and $(Y=y_l)$, occurring at the same time.

For example, if $X$ and $Y$ are the smartphone model and the gender we found in a random experiment, respectively, $P(X=\mathbf{IPhoneSE}, Y=\mathbf{Male})$ indicates the probability of finding a male with an IPhone SE in the random experiment.

After obtaining the joint probability of each $(X=x_m, Y=y_l)$ pair, we can get the `joint probability distribution` $P(X, Y)$. To illustrate this idea, the following code generates 1000 $(X=x_m, Y=y_l)$ pairs (`xs` and `ys`) and estimates the joint distribution from their counts:

```
from collections import Counter

import numpy as np
import pandas as pd

# Take 1000 samples from the integers in [1, 5), i.e. {1, 2, 3, 4}
xs = [np.random.randint(1, 5) for _ in range(1000)]
# Take 1000 samples from the integers in [1, 3), i.e. {1, 2}
ys = [np.random.randint(1, 3) for _ in range(1000)]
# Count the (x, y) pairs and convert them to a dataframe
joint_dist_df = pd.DataFrame.from_dict(Counter(zip(xs, ys)), orient="index").reset_index()
# The second column holds raw counts until it is normalized below
joint_dist_df.columns = ["Event", "P(X, Y)"]
joint_dist_df["X"] = [e[0] for e in joint_dist_df["Event"]]
joint_dist_df["Y"] = [e[1] for e in joint_dist_df["Event"]]
# Compute the joint probability by dividing each count by the total count
joint_dist_df["P(X, Y)"] /= joint_dist_df["P(X, Y)"].sum()
joint_dist_df
```

The above code generates the following dataframe:

## Marginal Probability

Based on the observations of $(X=x_m, Y=y_l)$ above, we can calculate the `marginal probability` $P(X=x_m)$ by marginalizing out $Y$:

$$P(X=x_m) = \sum_{l=1}^{L} P(X=x_m, Y=y_l)$$

For example, if $X$ and $Y$ are the smartphone model and the gender we found in a random experiment, respectively, we can compute $P(X=\mathbf{IPhoneSE})$ by adding up the corresponding joint probabilities for each gender.

Applying this to all of $X$'s outcomes, we get the `marginal probability distribution` $P(X)$:

$$P(X) = \sum_{Y} P(X, Y)$$

This is called the `sum rule of probability theory`. The following code shows how to compute the marginal probability distribution:

```
marginal_x_dist_df = joint_dist_df.groupby("X")["P(X, Y)"].sum().reset_index()
marginal_x_dist_df.columns = ["X", "P(X)"]
marginal_x_dist_df
```

The above code generates the following dataframe:

## Conditional Probability

If we filter the observations by a particular outcome (e.g., $X=x_m$), we can calculate the probability of observing $Y=y_l$ among the filtered observations $(X=x_m)$. This is called the `conditional probability` $P(Y=y_l|X=x_m)$.

The following code shows how to compute the `conditional probability distribution` $P(Y|X)$:

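```
for x_i in range(1, 5):
    print("For x_i =", x_i)
    # Filter observations based on x_i
    _df = joint_dist_df[joint_dist_df["X"] == x_i].copy()
    # Attach the marginal P(X) so it can be compared with the sum below
    _df = pd.merge(_df, marginal_x_dist_df, how="left", on="X")
    # Calculate the conditional probabilities; the sum over Y equals P(X=x_i)
    _sum = _df["P(X, Y)"].sum()
    _df["P(Y|X)"] = _df["P(X, Y)"] / _sum
    # `display` is available in Jupyter/IPython; use print(_df) elsewhere
    display(_df)
```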

The above code generates the following dataframe:

From the result we can also observe that the conditional probability $P(Y=y_l|X=x_m)$ can be calculated with:

$$P(Y=y_l|X=x_m) = \frac{P(X=x_m, Y=y_l)}{P(X=x_m)}$$

And the conditional probability distribution can be written as:

$$P(Y|X) = \frac{P(X, Y)}{P(X)}$$

Rearranged, $P(X,Y)=P(Y|X)P(X)$ is called the `product rule of probability theory`.
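As a small self-contained check of the sum rule and the product rule, here is a sketch on a hand-written joint table. The probabilities are hypothetical numbers chosen only so that they sum to 1; they are not taken from any data.

```python
from collections import defaultdict

# Hypothetical joint distribution P(X, Y); the numbers are made up and sum to 1
joint = {
    ("IPhoneSE", "Male"): 0.10,
    ("IPhoneSE", "Female"): 0.15,
    ("Other", "Male"): 0.40,
    ("Other", "Female"): 0.35,
}

# Sum rule: P(X) = sum over Y of P(X, Y)
marginal_x = defaultdict(float)
for (x, y), p in joint.items():
    marginal_x[x] += p

# Conditional probability: P(Y|X) = P(X, Y) / P(X)
conditional_y_given_x = {(x, y): p / marginal_x[x] for (x, y), p in joint.items()}

# Product rule: P(Y|X) P(X) recovers every entry of the joint table
for (x, y), p in joint.items():
    assert abs(conditional_y_given_x[(x, y)] * marginal_x[x] - p) < 1e-12

print(dict(marginal_x))
print(conditional_y_given_x[("IPhoneSE", "Male")])
```

For each fixed $X$, the conditional probabilities over $Y$ sum to 1, which is what makes $P(Y|X=x_m)$ a probability distribution over $Y$.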

## Independence

There is a special case of the `product rule` above. If two events $X=x_m$ and $Y=y_l$ are independent, knowing $X=x_m$ in advance won't change the probability of having $Y=y_l$; therefore:

- The conditional probability becomes $P(Y=y_l|X=x_m) = P(Y=y_l)$
- The conditional probability distribution becomes $P(Y|X) = P(Y)$

Applying this to the `product rule` above, we get:

$$P(X=x_m, Y=y_l) = P(Y=y_l)P(X=x_m)$$

if $X=x_m$ and $Y=y_l$ are independent.
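The factorization can be checked cell by cell. The sketch below builds one joint table that factorizes and one that does not, both with the same marginals. The function name `is_independent`, the tolerance, and the shifted amounts are assumptions made for this illustration.

```python
from itertools import product

# Hypothetical marginals: X uniform over {1, 2, 3, 4}, Y uniform over {1, 2}
p_x = {x: 0.25 for x in range(1, 5)}
p_y = {y: 0.5 for y in range(1, 3)}

# Under independence the joint table is the product of the marginals
joint_independent = {(x, y): p_x[x] * p_y[y] for x, y in product(p_x, p_y)}


def is_independent(joint, p_x, p_y, tol=1e-9):
    """Check P(X=x, Y=y) == P(X=x) P(Y=y) for every pair, up to `tol`."""
    return all(abs(joint[(x, y)] - p_x[x] * p_y[y]) <= tol for x, y in product(p_x, p_y))


# A dependent table with the same marginals: move mass between four cells
joint_dependent = dict(joint_independent)
joint_dependent[(1, 1)] += 0.05
joint_dependent[(1, 2)] -= 0.05
joint_dependent[(2, 1)] -= 0.05
joint_dependent[(2, 2)] += 0.05

print(is_independent(joint_independent, p_x, p_y))  # True
print(is_independent(joint_dependent, p_x, p_y))    # False
```

The second table shows that identical marginals do not imply independence: the marginals alone do not determine the joint distribution.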

## Test of Independence

```
# Take 1000 samples between [1, 5)
xs = [np.random.randint(1, 5) for _ in range(1000)]
# Take 1000 samples between [1, 3)
ys = [np.random.randint(1, 3) for _ in range(1000)]
```

In the Python coding example, we generated 1000 $(X=x_m, Y=y_l)$ pairs (`xs` and `ys`) uniformly and independently. So why are we not getting exactly $P(X, Y) = P(X) \times P(Y)$?

Let's find out the reason by looking at the marginal probabilities $P(X)$ and $P(Y)$. Ideally we should see $P(X=x_m)=0.25$ and $P(Y=y_l)=0.5$ from the code output, but it is not the case. Actually, if we increase the sample size, $P(X)$ and $P(Y)$ will get closer to the ideal values.

So the limitation lies in counting observations: probabilities obtained from a finite number of observations are only estimates, and they carry sampling error.
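To see the effect of the sample size, here is a minimal sketch using only the standard library. It draws $X$ uniformly from $\{1, 2, 3, 4\}$ and reports the largest gap between the counted frequencies and the ideal value $0.25$. The helper names, the seed, and the sample sizes are arbitrary choices for this illustration.

```python
import random
from collections import Counter


def empirical_distribution(n_samples, rng):
    """Estimate P(X) for X uniform over {1, 2, 3, 4} by counting n_samples draws."""
    counts = Counter(rng.randint(1, 4) for _ in range(n_samples))  # randint is inclusive
    return {x: counts[x] / n_samples for x in range(1, 5)}


def max_abs_error(n_samples, rng):
    """Largest gap between the estimated P(X=x) and the ideal value 0.25."""
    estimate = empirical_distribution(n_samples, rng)
    return max(abs(p - 0.25) for p in estimate.values())


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
for n in (100, 10_000, 100_000):
    print(n, round(max_abs_error(n, rng), 4))
```

The gap typically shrinks roughly like $1/\sqrt{N}$, so exact equality with the ideal values should not be expected at any finite sample size.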

We can also use the Chi-squared test to check whether two categorical variables are independent.

- The null hypothesis: $X$ and $Y$ are independent.
- The alternative hypothesis: $X$ and $Y$ are dependent.

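```
from sklearn.feature_selection import chi2
# The null hypothesis is that they are independent.
# P <= 0.05: Reject the null hypothesis.
# P > 0.05: Fail to reject the null hypothesis.
# Note: this scorer treats each feature value as a non-negative frequency,
# so it is not the same as a chi-squared test on a contingency table of counts.
chi2(np.array(xs).reshape(-1, 1), np.array(ys).reshape(-1, 1))
# > (array([0.88852322]), array([0.34587782]))
```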

The test returns a P-value of 0.346; therefore, we cannot reject the null hypothesis that $X$ and $Y$ are independent.
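For comparison, the textbook Pearson chi-squared statistic on the contingency table of $(x, y)$ counts can be written by hand. This is a sketch, not the method used above: the function name `pearson_chi2` and the seed are assumptions, and instead of a P-value it compares the statistic with 7.815, the 5% critical value of the chi-squared distribution with 3 degrees of freedom.

```python
import random
from collections import Counter


def pearson_chi2(xs, ys):
    """Pearson chi-squared statistic and degrees of freedom for two categorical samples."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    x_counts, y_counts = Counter(xs), Counter(ys)
    statistic = 0.0
    for x in x_counts:
        for y in y_counts:
            expected = x_counts[x] * y_counts[y] / n  # count expected under independence
            statistic += (observed[(x, y)] - expected) ** 2 / expected
    dof = (len(x_counts) - 1) * (len(y_counts) - 1)
    return statistic, dof


rng = random.Random(0)  # arbitrary seed
xs = [rng.randint(1, 4) for _ in range(1000)]
ys = [rng.randint(1, 2) for _ in range(1000)]
statistic, dof = pearson_chi2(xs, ys)

# 5% critical value of the chi-squared distribution with 3 degrees of freedom
CRITICAL_VALUE_DOF3 = 7.815
print(statistic, dof, "reject" if statistic > CRITICAL_VALUE_DOF3 else "fail to reject")
```

In practice a library routine such as `scipy.stats.chi2_contingency` would usually be preferred, since it also returns the P-value and the expected counts.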

## Bayes' theorem

Recall the `product rule` $P(X=x_m,Y=y_l)=P(Y=y_l|X=x_m)P(X=x_m)$. Since the joint probability is symmetric, $P(X=x_m,Y=y_l) = P(Y=y_l,X=x_m)$, we can deduce `Bayes' theorem` like this:

$$P(Y=y_l|X=x_m)P(X=x_m) = P(X=x_m|Y=y_l)P(Y=y_l)$$

$$P(Y=y_l|X=x_m) = \frac{P(X=x_m|Y=y_l)P(Y=y_l)}{P(X=x_m)}$$

Here,

- $P(Y=y_l|X=x_m)$ describes the probability of finding our target $(Y=y_l)$ given the `Evidence` $(X=x_m)$. It is also called the `Posterior` Probability.
- $P(Y=y_l)$ describes the probability of finding our target $(Y=y_l)$ before knowing the `Evidence` $(X=x_m)$. It is also called the `Prior` Probability.

If the new `Evidence` $(X=x_m)$ is value-adding, we should see the `Posterior` $P(Y=y_l|X=x_m)$ deviate from the `Prior` $P(Y=y_l)$. In other words, the new `Evidence` $(X=x_m)$ can update our degree of belief.

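Also:

$$P(Y=y_l|X=x_m) = \frac{P(X=x_m|Y=y_l)}{P(X=x_m)} P(Y=y_l)$$

This shows how strongly the `Evidence` $(X=x_m)$ speaks for $(Y=y_l)$: the `Posterior` is the `Prior` scaled by the ratio $P(X=x_m|Y=y_l)/P(X=x_m)$. Therefore, this ratio can be understood as the `Support` of the evidence for our target.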
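A short numeric sketch may make the roles of prior, support, and posterior concrete. All three input probabilities below are hypothetical numbers chosen for illustration, and the target is treated as binary ($Y=y_l$ or not) so that the evidence can be computed with the sum rule.

```python
# Hypothetical numbers, chosen only for illustration
prior = 0.01           # P(Y=y_l)
likelihood = 0.90      # P(X=x_m | Y=y_l)
likelihood_not = 0.05  # P(X=x_m | Y!=y_l)

# Sum rule plus product rule give the probability of the evidence, P(X=x_m)
evidence = likelihood * prior + likelihood_not * (1 - prior)

support = likelihood / evidence  # P(X=x_m | Y=y_l) / P(X=x_m)
posterior = support * prior      # P(Y=y_l | X=x_m)

print(round(evidence, 4), round(support, 2), round(posterior, 4))
# 0.0585 15.38 0.1538
```

Here the support is greater than 1, so the evidence raises our belief in the target from 1% to roughly 15%. A support below 1 would lower it instead.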

## Expectation and Variance

Given a random variable $X$ that has $M$ outcomes $\Omega_{X} =\{x_1, x_2, ..., x_m, ..., x_M\}$ and a probability distribution $P(X)$ over the $M$ different outcomes, the `Expectation` of $f$ over $x$ is defined as:

$$E_{x}[f(x)] = \sum_{m=1}^{M} P(X=x_m) f(x_m)$$

In most cases, we won't know the probability distribution $P(X)$, but we can approximate the `Expectation` from observations. Let's say we have a dataset with $N$ observations $\{x_1, x_2, ..., x_n, ..., x_N\}$ that were drawn from the probability distribution $P$ with replacement. The `Expectation` of $f$ can be approximated by the $N$ observations:

$$E_{x}[f(x)] \approx \frac{1}{N}\sum_{n=1}^{N} f(x_n)$$

The count of observations with $x_n=x_m$ over the sample size $N$ is the approximate probability $P(X=x_m)$.

The `Variance` of $f$ over $x$ is defined as:

$$\mathrm{Var}_{x}[f(x)] = E_{x}\left[\left(f(x)-E_{x}[f(x)]\right)^2\right]$$

$f(x)-E_{x}[f(x)]$ is the deviation of $f(x)$ from its `Expectation`. The `Variance` is the `Expectation` of the squared deviation.
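As an illustration, the sketch below computes the exact expectation and variance for a fair die and then approximates both from simulated observations. The choice of a fair die, the identity function for `f`, the seed, and the sample size are all assumptions made for this example.

```python
import random

outcomes = [1, 2, 3, 4, 5, 6]
p = {x: 1 / 6 for x in outcomes}  # fair die


def f(x):
    return x  # identity; any function of x could be used here


# Exact values from the definitions
expectation = sum(p[x] * f(x) for x in outcomes)
variance = sum(p[x] * (f(x) - expectation) ** 2 for x in outcomes)

# Approximation from N observations drawn with replacement
rng = random.Random(0)  # arbitrary seed
samples = [rng.choice(outcomes) for _ in range(100_000)]
approx_expectation = sum(f(x) for x in samples) / len(samples)
approx_variance = sum((f(x) - approx_expectation) ** 2 for x in samples) / len(samples)

print(expectation, variance)
print(approx_expectation, approx_variance)
```

The exact values are $3.5$ and $35/12 \approx 2.917$; the sample-based values should land close to them, with a gap that shrinks as the number of observations grows.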

## Likelihood in Machine Learning

Let's say we have a dataset with $N$ observations $\bold{d} = \{d_1,d_2...d_N\}$ from a distribution $P(\bold{D})$. We want to estimate a model with 2 parameters $\bold{w}=\{w_0,w_1\}$ from a distribution $P(\bold{W})$.

$L(\bold{W}=\bold{w}|\bold{D}=\bold{d}) = P(\bold{D}=\bold{d}|\bold{W}=\bold{w})$ is the `Likelihood` of the parameter $\bold{w}$ given the dataset $\bold{d}$.

In a training process, we have a varying $\bold{w}$ for a given fixed $\bold{d}$; therefore, the `Likelihood` is a function of the model parameter $\bold{w}$. For example, given a training dataset $\bold{d}$ and 2 models $\bold{w}_a$ and $\bold{w}_b$, $\bold{w}_a$ is better if $P(\bold{D}=\bold{d}|\bold{W}=\bold{w}_a) > P(\bold{D}=\bold{d}|\bold{W}=\bold{w}_b)$.

Also, since $P(\bold{D}=\bold{d}|\bold{W}=\bold{w})$ is a function of $\bold{w}$ (varying $\bold{w}$) in this setting, it is not a probability distribution over $\bold{w}$: its values need not sum (or integrate) to 1 over $\bold{w}$. $P(\bold{D}=\bold{d}|\bold{W}=\bold{w})$ is a probability distribution only if it is a function of $\bold{d}$ (varying $\bold{d}$).
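A minimal sketch of these two points, under simplifying assumptions that are not part of the notes: a Bernoulli (coin-flip) model with a single parameter `w` instead of two parameters, independent observations, and a small made-up dataset.

```python
import math

# Hypothetical dataset d: 1 = heads, 0 = tails (7 heads out of 10)
d = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]


def likelihood(w, data):
    """L(w | d) = P(d | w) for independent Bernoulli observations with P(1) = w."""
    return math.prod(w if x == 1 else 1 - w for x in data)


# Comparing two candidate parameters on the same fixed dataset
w_a, w_b = 0.7, 0.3
print(likelihood(w_a, d) > likelihood(w_b, d))  # True: w_a explains d better

# Varying w with d fixed: the values do not sum to 1 over w
grid = [i / 10 for i in range(11)]
print(sum(likelihood(w, d) for w in grid))
```

The sum over the grid of `w` values is far below 1, which is one way to see that the likelihood, viewed as a function of $\bold{w}$, is not a probability distribution.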

## Maximum Likelihood Estimation (MLE)

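There are $N$ observations $\{d_1,d_2...d_N\}$ in the dataset $\bold{D}=\bold{d}$:

$$P(\bold{D}=\bold{d}|\bold{W}=\bold{w}) = P(d_1, d_2, ..., d_N|\bold{W}=\bold{w})$$

Assume those samples are independent:

$$P(\bold{D}=\bold{d}|\bold{W}=\bold{w}) = \prod_{n=1}^{N} P(d_n|\bold{W}=\bold{w})$$

To find the best $\bold{W}=\bold{w}$, we are going to maximize the `Likelihood`:

$$\bold{w}_{MLE} = \arg\max_{\bold{w}} \prod_{n=1}^{N} P(d_n|\bold{W}=\bold{w})$$

Because the logarithm is monotonically increasing, the objective is also equivalent to maximizing the `Log-likelihood`:

$$\bold{w}_{MLE} = \arg\max_{\bold{w}} \sum_{n=1}^{N} \log P(d_n|\bold{W}=\bold{w})$$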
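As a rough illustration of maximizing the log-likelihood, the sketch below runs a grid search for a Bernoulli model. The model, the single parameter `w`, the made-up dataset, and the grid resolution are assumptions for this example; real models are usually fitted with an optimizer rather than a grid.

```python
import math

d = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]  # hypothetical data: 7 ones out of 10


def log_likelihood(w, data):
    """Sum of log P(d_n | w) for independent Bernoulli observations with P(1) = w."""
    return sum(math.log(w if x == 1 else 1 - w) for x in data)


# Candidate parameters strictly inside (0, 1) to avoid log(0)
grid = [i / 1000 for i in range(1, 1000)]
w_mle = max(grid, key=lambda w: log_likelihood(w, d))

print(w_mle, sum(d) / len(d))
```

For this model the maximizer coincides with the fraction of ones in the data, which is the known closed-form maximum likelihood estimate for a Bernoulli parameter.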

## Maximum a Posteriori Estimation (MAP)

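In `MLE`, the `Likelihood` $P(\bold{D}=\bold{d}|\bold{W}=\bold{w})$ is a function of $\bold{w}$ (varying $\bold{w}$), so it is not a probability distribution over $\bold{w}$. To have a function of $\bold{w}$ that is a probability distribution given $\bold{D}=\bold{d}$, we use Bayes' theorem to obtain the `Posterior` probability of the model parameter $\bold{w}$, and `MAP` picks the $\bold{w}$ that maximizes it:

$$\bold{w}_{MAP} = \arg\max_{\bold{w}} P(\bold{W}=\bold{w}|\bold{D}=\bold{d}) = \arg\max_{\bold{w}} \frac{P(\bold{D}=\bold{d}|\bold{W}=\bold{w})P(\bold{W}=\bold{w})}{P(\bold{D}=\bold{d})}$$

Since $P(\bold{D}=\bold{d})$ is positive and does not depend on $\bold{w}$, `MAP` becomes:

$$\bold{w}_{MAP} = \arg\max_{\bold{w}} P(\bold{D}=\bold{d}|\bold{W}=\bold{w})P(\bold{W}=\bold{w})$$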

The `Prior` $P(\bold{W}=\bold{w})$ can be considered external knowledge about the parameter. `MAP` is like `MLE` regularized by external knowledge.

For example, the external knowledge can come from a human expert. Under the `MAP` setting, a bad model parameter ${\bold{w}_{bad}}$ can then be avoided by assigning a low $P(\bold{W}=\bold{w}_{bad})$. An example of bad parameters: a model that relies heavily on gender information for predicting credit default.

When the `Prior` $P(\bold{W}=\bold{w})$ is a uniform distribution, `MAP` becomes `MLE`, because the `Prior` probability is the same no matter what $\bold{w}$ is.
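The relationship between `MAP` and `MLE` can be sketched with a Bernoulli model and a Beta prior. These modelling choices, the made-up dataset, the prior parameters, and the grid search are all assumptions for illustration and do not come from the notes.

```python
import math

d = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]  # hypothetical data: 7 ones out of 10
grid = [i / 1000 for i in range(1, 1000)]  # candidates strictly inside (0, 1)


def log_likelihood(w, data):
    """Sum of log P(d_n | w) for independent Bernoulli observations with P(1) = w."""
    return sum(math.log(w if x == 1 else 1 - w) for x in data)


def log_beta_prior(w, a, b):
    """Log of an unnormalized Beta(a, b) density; the normalizer does not affect the argmax."""
    return (a - 1) * math.log(w) + (b - 1) * math.log(1 - w)


def map_estimate(data, a, b):
    """argmax over the grid of log P(d | w) + log P(w)."""
    return max(grid, key=lambda w: log_likelihood(w, data) + log_beta_prior(w, a, b))


w_mle = max(grid, key=lambda w: log_likelihood(w, d))
w_map_uniform = map_estimate(d, 1, 1)   # Beta(1, 1) is the uniform prior
w_map_informed = map_estimate(d, 5, 5)  # prior belief that w is near 0.5

print(w_mle, w_map_uniform, w_map_informed)
```

With the uniform prior the two estimates coincide, while the Beta(5, 5) prior pulls the estimate from the data-only value towards 0.5, which is the regularizing effect described above.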