# Week 1 - Probability theory and Statistics review

Hello! Welcome to our course on Model-Based Machine Learning! We hope you enjoy it and learn many powerful new techniques.

As usual in Python, the first thing to do is to import the necessary packages. Let's start with some usual ones for now...
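For example (assuming we will use NumPy, pandas, and Matplotlib throughout — adjust to your own setup):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)  # fix the seed so the results below are reproducible
```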

## Part 1 - Random variables, independence, conditionality, Bayes theorem

We are going to work on a 2D world for now, and a uniform distribution.

So, let's create N 2D points:
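One possible sketch, assuming the points live in the unit square $[0,1]^2$ (which is what makes the quarter-circle comparison below work):

```python
import numpy as np

np.random.seed(0)
N = 10000
# N points, each an (x, y) pair drawn uniformly from [0, 1] x [0, 1]
points = np.random.uniform(0, 1, size=(N, 2))
```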

Can you count how many such points fall inside a circle centered at (0,0) and radius 1?

Consider a={a given point (x,y) falls inside the circle centered at (0,0) with radius 1}

Calculate the probability of a, p(a)
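A minimal sketch (the points are regenerated here so the cell is self-contained):

```python
import numpy as np

np.random.seed(0)
N = 10000
points = np.random.uniform(0, 1, size=(N, 2))

# a point (x, y) is inside the circle when x^2 + y^2 < 1
inside = (points[:, 0] ** 2 + points[:, 1] ** 2) < 1
p_a = inside.mean()  # fraction of points inside = empirical p(a)
print(p_a)
```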

Visualizations are often quite useful. Can you visualize both the points as well as the circle line?
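One way to do it with Matplotlib — the circle line is just the parametric curve $(\cos\theta, \sin\theta)$ for $\theta \in [0, \pi/2]$:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

np.random.seed(0)
points = np.random.uniform(0, 1, size=(1000, 2))

plt.scatter(points[:, 0], points[:, 1], s=5)
# quarter-circle arc of radius 1 centered at (0, 0)
theta = np.linspace(0, np.pi / 2, 100)
plt.plot(np.cos(theta), np.sin(theta), "r")
plt.gca().set_aspect("equal")
plt.show()
```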

Just to link with your trigonometry knowledge: what is the area of this circle (radius=1)? And what is the area of a quarter of that circle (the part that you just drew)?

What is the relationship between that value and the probability that you just calculated (you can vary the number of points, N, to check your theory)?

Calculate p(b)

So, now you have p(a) and p(b). Do you want to try calculating p(a,b)?

Are they independent? Is there any reason to believe that they aren't?

To check this, try two approaches:

- p(a,b)=p(a)*p(b)

- Just count the points that fulfill both the a and b constraints (and divide by the total number of points... ;-) ).

They are different! Why?... If you are in doubt, take a look at the picture...

Maybe they are not independent, after all... if so, their joint probability is instead

p(a,b)=p(a|b)*p(b)

Please calculate p(a|b), in order to get the right value...

Now, calculate p(a,b)=p(a|b)*p(b)

Compare with the values above. Does it make more sense now?

Another well-known property is Bayes' theorem:

$p(b|a)=\frac{p(a|b)p(b)}{p(a)}$

Calculate p(b|a) using Bayes' theorem, and calculate it directly from the data. Are they converging to the same value?

Look at the following table:

Marginalization allows us to compute distributions over just the variables we are interested in. For example, the table above represents p(W, T) in a non-normalized way; we could "marginalize out weather", i.e. calculate p(T), by doing

$p(T)=\sum_W p(W, T)$

Marginalize out weather (Hint: the pandas package has a nice "sum()" function associated with the DataFrame object f that you can use...)

Marginalize out temperature
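A sketch with a made-up (non-normalized) counts table — the actual table f from the lecture may have different labels and values:

```python
import pandas as pd

# hypothetical joint counts of Weather (rows) and Temperature (columns)
f = pd.DataFrame(
    {"cold": [10, 30, 20], "warm": [25, 10, 5]},
    index=["sunny", "rainy", "cloudy"],
)

total = f.values.sum()

# marginalize out weather: sum over the rows, leaving p(T)
p_T = f.sum(axis=0) / total

# marginalize out temperature: sum over the columns, leaving p(W)
p_W = f.sum(axis=1) / total
print(p_T)
print(p_W)
```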

## Part 2 - Independence

Let's get further into the concept of **independence**

To begin with, let's create a small dataset of size N. We will start with two random variables, z and t, both uniformly distributed.

Each random variable will be an integer value in {0, 1, 2} (you can change this later, of course).

Let's calculate the probability of each different value of z. Take a look at this code. It may be useful later

Actually, we're going to reuse the above code quite a lot. Let's make a new function, then...
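A minimal version of what such a function could look like (this is an assumption of the original code, using NumPy's unique-value counts):

```python
import numpy as np

def probabilities(values):
    """Empirical probability of each distinct value in `values`."""
    vals, counts = np.unique(values, return_counts=True)
    return dict(zip(vals, counts / len(values)))

np.random.seed(0)
z = np.random.randint(0, 3, size=10000)  # uniform over {0, 1, 2}
t = np.random.randint(0, 3, size=10000)
print(probabilities(z))
```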

Let's try it with our variables Z and T

OK. So, we have the probability of each value, according to this dataset.

Assuming that both z and t are uniformly distributed, the calculation for the probability of each should be trivial, right? What is it?

Does it match the above? If you're in doubt, you can increase the value of N, the dataset size. As N grows, it should approximate your calculation...

Let's now think of the joint distribution of z and t. In other words, how these two variables seem to vary together. The first thing to do is to "align" them, i.e., create a new random variable (z_t) that is a pair of observations, z and t.

So, let's use the function above to calculate the joint distribution
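A sketch of the alignment step, pairing the observations with zip and counting frequencies (here with collections.Counter, for self-containment):

```python
import numpy as np
from collections import Counter

np.random.seed(0)
N = 10000
z = np.random.randint(0, 3, size=N)
t = np.random.randint(0, 3, size=N)

# align z and t into pairs: the new random variable z_t
z_t = list(zip(z, t))

# empirical joint distribution p(z, t)
joint = {pair: c / N for pair, c in Counter(z_t).items()}
print(joint)
```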

Are Z and T independent? If yes, the formula for their joint distribution is trivial. Please calculate it (and check whether it holds).

Let's create two more variables, both based on Z.
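For instance (the multipliers betax = 2 and betay = -1 are assumptions for illustration — the lecture may use other values):

```python
import numpy as np

np.random.seed(0)
z = np.random.randint(0, 3, size=10000)

# two deterministic functions of z: x is exactly betax * z, y exactly betay * z
betax, betay = 2, -1
x = betax * z
y = betay * z
```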

Are they independent? What do you think? A good way to see is by plotting... do you want to do it?

What should it look like when variables are independent? And when they are correlated?

Let's check their joint distribution, then...

Notice that, for all other value combinations (e.g. x,y=0,-2), the probability is zero.

If x and y are independent, then the product of their marginal probabilities should be equal to the values above, right? Do you want to calculate?

Should the values match at all? Let's try instead to calculate the conditional probabilities of x and y, given z.

To do it, directly on the data, we need to calculate, for EACH value of z, the distributions for both x and y.

This means that we need to organize the data accordingly. How about using a dictionary?
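One way to build such a dictionary, keyed by the values of z, using boolean masks (the variables x, y, z are assumed to come from the previous steps and are regenerated here):

```python
import numpy as np

np.random.seed(0)
z = np.random.randint(0, 3, size=10000)
x, y = 2 * z, -1 * z  # assumed deterministic relations, as above

# for each value of z, keep the corresponding x and y observations
z_dict = {zv: (x[z == zv], y[z == zv]) for zv in np.unique(z)}
```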

To make sure you understand the code above, don't forget to check z_dict carefully...

The code below uses the dictionary to calculate the conditional marginal and joint distributions. Check it carefully.
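A sketch of what such code could look like, comparing $p(x,y\mid z)$ with $p(x\mid z)\,p(y\mid z)$ for each value of z (the helper and variable names are assumptions):

```python
import numpy as np
from collections import Counter

def dist(values):
    """Empirical distribution as a dict value -> probability."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

np.random.seed(0)
z = np.random.randint(0, 3, size=10000)
x, y = 2 * z, -1 * z  # assumed deterministic relations

for zv in np.unique(z):
    xs, ys = x[z == zv], y[z == zv]
    p_x = dist(xs.tolist())          # p(x | z = zv)
    p_y = dist(ys.tolist())          # p(y | z = zv)
    p_xy = dist(list(zip(xs, ys)))   # p(x, y | z = zv)
    for (xv, yv), p in p_xy.items():
        # with conditional independence, p should equal p_x[xv] * p_y[yv]
        print(zv, (xv, yv), p, p_x[xv] * p_y[yv])
```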

So, what do you conclude? Is it true that x is independent of y given z?

The example above was particularly odd, because there is NO randomness involved (i.e. x is **exactly** betax times z). As a consequence, the probabilities are quite extreme.

Try to redo the whole exercise by adding a little randomness to x and y (don't forget to keep them as integers). Notice how the "intensity" of the conditional independence property varies as you add more or less randomness...

Finally, let's try a different perspective on conditional independence: why does it actually matter?

Particularly, we want to understand the meaning of the following sentence:

**"If we know z, then knowing about y tells us nothing about x"**

Using your small dataset (x, y, z), you will try to make linear regression prediction models. Our goal is to "know about x", i.e. x will be the target (or dependent) variable in our linear regression.

Compare the following models (e.g. using sklearn's LinearRegression model):

- $x=\beta_1*z$
- $x=\beta_1*y+\beta_2*z$
- $x=\beta_1*y$

Notice the coefficients: what do they mean?
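A sketch with sklearn, using noisy versions of x and y (with fully deterministic data all three models fit perfectly; the noise scheme here is an assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
N = 10000
z = np.random.randint(0, 3, size=N)
# noisy integer versions of x and y (assumed noise scheme)
x = 2 * z + np.random.randint(-1, 2, size=N)
y = -1 * z + np.random.randint(-1, 2, size=N)

# fit the three models and compare coefficients and R^2 scores
for name, X in [("z only", z.reshape(-1, 1)),
                ("y and z", np.column_stack([y, z])),
                ("y only", y.reshape(-1, 1))]:
    m = LinearRegression().fit(X, x)
    print(name, m.coef_, m.score(X, x))
```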

## Part 3 - Expectation of discrete variables

Calculate the expected value of x, using

$E(x)=\sum_x xp(x)$

Compare it with the mean of x (x.mean())
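A minimal sketch (using an arbitrary discrete variable; any integer-valued x from the previous parts works the same way):

```python
import numpy as np
from collections import Counter

np.random.seed(0)
x = np.random.randint(0, 3, size=10000)

n = len(x)
# E(x) = sum over distinct values of value * p(value)
E_x = sum(v * c / n for v, c in Counter(x.tolist()).items())
print(E_x, x.mean())
```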

## Part 4 - Well-known probability distributions

In this lecture, we talk about a few well-known probability distributions. Let's define them. Take a careful look at the expressions and the parameters.

Notice that these functions exist in many other Python packages; you don't need to redefine them all the time...

It's very useful to plot them, as we do below (this is the code used to generate the pictures in the slides)

First, the **Gaussian**
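A sketch of the Gaussian pdf and its plot (the parameter values are arbitrary choices, not necessarily the ones from the slides):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-5, 5, 200)
for mu, sigma in [(0, 1), (0, 2), (1, 0.5)]:
    plt.plot(xs, gaussian_pdf(xs, mu, sigma), label=f"mu={mu}, sigma={sigma}")
plt.legend()
plt.show()
```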

Now, the **Poisson**. Remember that it is a discrete probability distribution (the x axis can only take discrete values: the non-negative integers).

Try to use the parameter "bins=..." in the plt.hist call, to see the influence of different bin sizes on the shape of the histogram.
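For instance, sampling from a Poisson and histogramming it with one bin per integer (lam = 3 is an arbitrary choice):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

np.random.seed(0)
samples = np.random.poisson(lam=3, size=10000)

# one bin centered on each integer value, so the discreteness is visible
plt.hist(samples, bins=np.arange(samples.max() + 2) - 0.5)
plt.show()
print(samples.mean())  # should be close to lam
```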

Let's try the **exponential**

And finally, the **Beta**
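A quick sketch for both, via sampling and histograms (the parameter choices are arbitrary):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

np.random.seed(0)
exp_samples = np.random.exponential(scale=2.0, size=10000)  # rate = 1/scale
beta_samples = np.random.beta(a=2, b=5, size=10000)         # support is [0, 1]

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(exp_samples, bins=50)
ax1.set_title("Exponential(scale=2)")
ax2.hist(beta_samples, bins=50)
ax2.set_title("Beta(2, 5)")
plt.show()
```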

**The Central limit theorem**

A very important theorem in all of statistics is called the Central Limit Theorem, which says that

The distribution of the sampling means approaches a normal distribution as the sample size gets larger — no matter what the shape of the population distribution.

Try to test yourself the Central Limit Theorem. Essentially, generate multiple random datasets, each time following the same parameters, and a probability density function (pdf) of your choice. Test with each of the pdfs created above.
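One possible experiment, with the exponential distribution (any of the pdfs above would do):

```python
import numpy as np

np.random.seed(0)
n_datasets, sample_size = 5000, 100

# the mean of each of 5000 exponential datasets of size 100
means = np.random.exponential(scale=1.0, size=(n_datasets, sample_size)).mean(axis=1)

# the CLT predicts these means are approximately Normal(1, 1/sqrt(100)),
# even though the exponential itself is very skewed
print(means.mean(), means.std())
```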

We will now work with bivariate distributions, i.e. instead of having one value at a time, we have two.

Let's use the Bivariate normal distribution for it
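A sketch, with an assumed covariance matrix showing positive correlation between the two coordinates:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

np.random.seed(0)
mean = [0, 0]
cov = [[1.0, 0.8],
       [0.8, 1.0]]  # off-diagonal terms: covariance between the two variables
samples = np.random.multivariate_normal(mean, cov, size=5000)

plt.scatter(samples[:, 0], samples[:, 1], s=3)
plt.gca().set_aspect("equal")
plt.show()
```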

Try changing the values of that matrix, and see the results...

Just to make sure you don't make mistakes in the future, notice that the code below is ONLY for a covariance matrix of independent normal variables.

Just to make sure you got it, sample from an equivalent (independent) gaussian, now with the np.random.multivariate_normal function mentioned above.
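A sketch with a diagonal covariance matrix, so the two coordinates are independent (the variances 1 and 4 are arbitrary):

```python
import numpy as np

np.random.seed(0)
# diagonal covariance: zero correlation between the two coordinates
cov = np.diag([1.0, 4.0])
samples = np.random.multivariate_normal([0, 0], cov, size=10000)
print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])  # should be close to 0
```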

## Part 5 - Maximum likelihood estimation

Time now to work on the concept of maximum likelihood, a fundamental one in statistical modeling.

Let's create a dataset...

Now, let's follow the slides, and create the function that returns the Gaussian loglikelihood of a dataset, given a pair of parameters:

$-\frac{n}{2}(\log(2\pi)+\log(\sigma^2)) -\frac{1}{2\sigma^2}\sum_i (x_i-\mu)^2$

Confirm that the function has no mistakes...
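A minimal sketch of such a function, with a sanity check against scipy.stats.norm (the dataset here is an arbitrary synthetic one):

```python
import numpy as np
from scipy.stats import norm

def gaussian_loglik(x, mu, sigma2):
    """Gaussian log-likelihood of dataset x, given mean mu and variance sigma2."""
    n = len(x)
    return (-n / 2 * (np.log(2 * np.pi) + np.log(sigma2))
            - np.sum((x - mu) ** 2) / (2 * sigma2))

np.random.seed(0)
x = np.random.normal(2.0, 1.5, size=1000)

# cross-check: the sum of per-point log-densities should give the same number
check = norm.logpdf(x, loc=2.0, scale=1.5).sum()
print(gaussian_loglik(x, 2.0, 1.5 ** 2), check)
```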

So, we want to find the values of $\mu$ and $\sigma$ that maximize that function in our dataset... Do you want to try to find them?

A simple way is to try MANY such values (e.g. a range of possible values). Go ahead!
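One possible grid search (the grid ranges and the log-likelihood helper are assumptions; the dataset is the same synthetic one as above):

```python
import numpy as np

def gaussian_loglik(x, mu, sigma2):
    n = len(x)
    return (-n / 2 * (np.log(2 * np.pi) + np.log(sigma2))
            - np.sum((x - mu) ** 2) / (2 * sigma2))

np.random.seed(0)
x = np.random.normal(2.0, 1.5, size=1000)

# try MANY (mu, sigma) pairs and keep the one with the highest log-likelihood
best = None
for mu in np.linspace(0, 4, 81):
    for sigma in np.linspace(0.5, 3, 51):
        ll = gaussian_loglik(x, mu, sigma ** 2)
        if best is None or ll > best[0]:
            best = (ll, mu, sigma)

print(best[1], best[2])   # should be near the sample mean / sample std
print(x.mean(), x.std())
```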

Compare the values you obtained with the sample mean and sample variance.

As mentioned in the lecture, this task falls in the realm of Optimization. Python also has tools for optimizing a function, which are much cleverer than blindly trying many values for $\mu$ and $\sigma$. We will not explore them here, but in case you're curious, just try the methods in:

scipy.optimize

Now, just for illustration, let's try a case where the data comes from a combination (a *mixture*) of two different distributions.
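A sketch of generating such data (the component parameters and the 50/50 mixing weights are arbitrary choices):

```python
import numpy as np

np.random.seed(0)
N = 10000
# each point comes from component 0 or component 1, with probability 0.5 each
which = np.random.rand(N) < 0.5
data = np.where(which,
                np.random.normal(-2.0, 0.7, size=N),  # component 0
                np.random.normal(3.0, 1.0, size=N))   # component 1
```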

Plot the respective histogram...

Just getting the mean and standard deviation, assuming a single Gaussian, would be a mistake... check it yourself!

Plot the distribution you estimated on top of the histogram, just to see the difference