# Probability and Statistics

## Statistics

### Types of Data

- **Nominal/Categorical**: No ordering.
- **Ordinal**: Have a ranking/ordering, but not a mathematically meaningful one (e.g. disagree, neutral, agree).
- **Interval**: Meaningful differences, but no true zero point, so we can't say one value is e.g. twice another.
- **Ratio**: Meaningful differences and a true zero point.

### Summary statistics

**Mean and Median**

**Geometric Mean**

Uses:

- In finance to calculate average growth rate
- As a filter to reduce image noise
- Matthews Correlation Coefficient (MCC) in deep learning
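As a minimal sketch (the growth factors are illustrative), the geometric mean of a set of growth factors is the $n$-th root of their product:

```python
import numpy as np

# Hypothetical yearly growth factors (1.10 means +10%)
growth = np.array([1.10, 0.95, 1.20])

# Geometric mean: nth root of the product of n values
gmean = growth.prod() ** (1 / len(growth))
print(gmean)  # the "average" growth factor per year
```

Note that the arithmetic mean of these factors would overstate the effective per-year growth, which is why the geometric mean is preferred for compounded rates.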

**Harmonic Mean**

Uses:

- As the F1 score in deep learning, a frequently used metric for classifiers

**Note**: `recall` and `precision` are necessary to evaluate deep learning models.
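As a minimal sketch (the precision and recall values here are illustrative), F1 is the harmonic mean of precision and recall:

```python
# Illustrative values, not from any real model
precision = 0.8
recall = 0.6

# Harmonic mean of two values: 2ab / (a + b)
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ~0.686, pulled toward the smaller of the two
```

The harmonic mean is dominated by the smaller value, so a model cannot score a high F1 by excelling at only one of precision or recall.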

### Measures of Variation

**Mean Deviation**

**Biased Sample Variance**

**Unbiased Sample Variance**

- Using $n-1$ instead of $n$ is known as **Bessel's correction**.
- **Motivation**: The **true** population variance ($\sigma^2$) is the scatter of the population around the true population mean ($\mu$). However, we don't know $\sigma^2$ or $\mu$, so we estimate them from the dataset we have (the sample). The mean of the sample is $\overline{x}$; that's our estimate for $\mu$. It's then natural to calculate the mean of the squared deviations around $\overline{x}$ and call that our estimate for $\sigma^2$. That's $s^2$. But deviations measured around $\overline{x}$ are systematically smaller than deviations around $\mu$, so the uncorrected $s^2$ underestimates $\sigma^2$. The claim is that if we apply Bessel's correction, the estimated variance of the sample will be closer to the true population variance.
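A quick NumPy sketch (synthetic data) showing the two estimators; `ddof=1` applies Bessel's correction:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=10)

biased = sample.var()          # divides by n
unbiased = sample.var(ddof=1)  # divides by n-1 (Bessel's correction)

# The corrected estimate is always larger: unbiased = biased * n / (n - 1)
print(biased, unbiased)
```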

### Missing Data

In many cases, if there's enough missing data that dropping it would bias the dataset, the safe thing to do is to replace the missing values with the mean or median. We should consult descriptive statistics and a histogram or box plot to decide which.
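A possible sketch using pandas (the column name and values are made up): impute with the median when the distribution is skewed or has outliers, otherwise the mean is reasonable:

```python
import numpy as np
import pandas as pd

# Toy column with missing values; the 100 is an outlier that skews the mean
df = pd.DataFrame({"age": [25, 30, np.nan, 100, np.nan, 35]})

# Mean imputation is pulled up by the outlier; median imputation is not
df["age_mean"] = df["age"].fillna(df["age"].mean())      # fills with 47.5
df["age_median"] = df["age"].fillna(df["age"].median())  # fills with 32.5
```

The gap between 47.5 and 32.5 illustrates why the shape of the distribution should drive the choice.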

### Correlation

- The association between the features in a dataset
- Positive: if one goes up, the other tends to go up as well
- Negative: the inverse of positive
- In traditional ML, highly correlated features were undesirable, as they didn't add new information
- In deep learning, where the network itself learns a new representation of the data, it's less critical to have uncorrelated features.
- In part, this is why images work well as inputs to deep networks, but not with traditional ML

- The **Pearson correlation coefficient** returns a number $r \in [-1, +1]$, indicating the strength of linear correlation. A correlation of zero means no association, possibly independent features. We say 'possibly', because there might be **nonlinear** dependencies.
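A small NumPy sketch (synthetic data) showing that $r$ can be near zero even when there is a strong nonlinear dependence:

```python
import numpy as np

x = np.linspace(0, 10, 50)
y_linear = 2 * x + 1        # perfect linear relationship
y_nonlinear = (x - 5) ** 2  # strong but purely nonlinear dependence

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_nonlinear = np.corrcoef(x, y_nonlinear)[0, 1]
print(r_linear)     # ~1.0
print(r_nonlinear)  # ~0.0, despite y being fully determined by x
```

Here `x` is symmetric around 5, so the linear covariance with the parabola cancels out exactly, which is the 'possibly' in the text above.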

### Hypothesis testing

To understand if two sets of data are from the same parent distribution or not, we might look into summary statistics.

In hypothesis testing, we have two hypotheses:

- The *null hypothesis* ($H_0$): the two datasets are from the same parent distribution; there is nothing special to differentiate them.
- The *alternative hypothesis* ($H_\alpha$): the two groups are not from the same distribution.

Hypothesis testing **doesn’t tell us definitely whether ($H_0$) is true, it only gives us evidence in favor of accepting or rejecting it**.

#### The t-test

The t-test depends on t, the test statistic. This statistic is compared to the t-distribution and used to generate a p-value, a probability we’ll use to reach a conclusion about the null hypothesis.

The t-test is a parametric test that assumes:

- The data is independent and identically distributed (i.i.d), i.e. the data is a fair random sample
- The distribution of the data is normal

One suggestion is to use both the t-test and the Mann-Whitney U test together to decide whether to accept the null hypothesis. In general, if the non-parametric test claims evidence against the null hypothesis, we should probably accept that evidence. This process obviously has huge caveats and needs more careful consideration.
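A sketch of running both tests with SciPy (synthetic data; interpreting the resulting p-values is still up to us):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=30)  # group 1
b = rng.normal(loc=0.8, scale=1.0, size=30)  # group 2, shifted mean

t_stat, t_p = stats.ttest_ind(a, b)      # parametric t-test
u_stat, u_p = stats.mannwhitneyu(a, b)   # non-parametric Mann-Whitney U

# Compare the two p-values; agreement strengthens whatever call we make
print(t_p, u_p)
```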

The t-test has different versions. Examples:

- Welch's test, which does *not* assume that the variances of the two datasets are equal

The t-score, and an associated variable known as degrees of freedom, generate the appropriate t-distribution curve. To get a p-value we calculate the area under the curve.

The p-value tells us the probability of seeing a difference between the two means as large as the one we see, or larger, if the null hypothesis is true. We typically reject the null hypothesis if p is below a threshold we've chosen.

When we reject ($H_0$), we say that the difference is statistically significant. A usual (and problematic) threshold is $\alpha = 0.05$. It's problematic because it's too generous; $0.001$ could be better. At $p=0.05$ all we have is a suggestion, and we should repeat the experiment. If all experiments have a p-value <= 0.05, then rejecting the null hypothesis might start to make sense.

If the p-value is small, it can mean one of two things:

- The null hypothesis is false
- A random sampling error has given us samples that fall outside what we might expect

**Confidence Intervals**

The confidence interval gives bounds within which we believe the true population difference in the means will lie. We typically report 95% confidence intervals. Any CI that includes zero signals to us that we cannot reject the null hypothesis.

The 95% confidence interval is such that if we could draw repeated samples from the distribution that produced the two datasets, 95% of the calculated confidence intervals would contain the true difference between the means. It is **not** the range that includes the true difference in the means at 95% certainty.
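A sketch (synthetic data; uses the pooled-variance standard error, which assumes equal variances) of computing a 95% CI for the difference in means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 40)
b = rng.normal(0.5, 1.0, 40)

na, nb = len(a), len(b)
diff = a.mean() - b.mean()

# Pooled-variance standard error of the difference in means
pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
se = np.sqrt(pooled_var * (1 / na + 1 / nb))

t_crit = stats.t.ppf(0.975, na + nb - 2)  # two-sided 95%
ci = (diff - t_crit * se, diff + t_crit * se)
print(ci)  # if this interval excludes zero, we may reject H0
```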

Usefulness:

- Checking if zero lies within the CI to make a call on $H_0$
- Its width gives info on the magnitude of the effect

The CI will be narrow when the effect is large because small CIs imply a narrow range encompassing the true effect.

A p-value less than $\alpha$ will also have a $CI_{\alpha}$ that doesn't include zero (the value under $H_0$). The two will not contradict each other.

**Effect size**

It's one thing to have a statistically significant p-value, and another for the difference it represents to be of meaningful magnitude. A popular measure for the effect size is Cohen's d.
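A small sketch of Cohen's d using the pooled standard deviation (the data values are illustrative):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled std dev."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1)
                  + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

a = [2.0, 3.0, 4.0, 5.0]
b = [1.0, 2.0, 3.0, 4.0]
print(cohens_d(a, b))  # ~0.775; conventionally a medium-to-large effect
```

Unlike the p-value, d does not shrink as the sample size grows, which is what makes it a measure of magnitude rather than of evidence.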

## References

- Math for Deep Learning by Ronald T. Kneusel