
Wednesday, June 24, 2020

Dungeons and Dragons: Advantage

D20 and Random Events

In the game Dungeons and Dragons, the success or failure of an event is determined by rolling a 20-sided die (D20): higher is better. If you need to roll 11 or higher, you have a 50% chance of success. If another event requires 10 or better, you now have a 55% chance of success. Each point the required roll goes up or down is worth 5%. Often, rolling a 20 is a critical success and a 1 is a critical failure; critical means it's extra good/bad.
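
Those success chances are easy to check in R by counting how many faces of a D20 meet each target number:

mean(1:20 >= 11)  # need 11 or higher
## [1] 0.5
mean(1:20 >= 10)  # need 10 or higher
## [1] 0.55
1/20              # each point of the target is worth 5%
## [1] 0.05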

Advantage

Sometimes things are really going your way and you roll with advantage. Sometimes things are not in your favor and you roll with disadvantage. When rolling with advantage, roll two dice and pick the higher. When rolling with disadvantage, roll two dice and pick the lower. How do advantage and disadvantage affect the chance of getting a 20 or 1?

When rolling without advantage or disadvantage, the probability of getting a 20 or a 1 is 1/20 = 0.05 or 5%. The probability of not getting a 20 is 1 - 1/20 = 19/20.

P20    <- 0.05
Pnot20 <- 1-P20

When you have advantage, to not get a 20, you have to not roll a 20 twice. The probability of getting a 20 is 1 minus the probability of not getting a 20 twice, and as you can see below, it is almost 10%. Advantage about doubles your chance of getting a 20. But you probably guessed that, since you're rolling twice. :) The chance of getting a 1 is 0.05^2 = 0.0025, almost 0!

Pnot20*Pnot20
## [1] 0.9025
1 - Pnot20*Pnot20
## [1] 0.0975
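
The chance of a critical failure with advantage is the chance that both dice come up 1, and the per-die chance of a 1 is also 0.05:

P20*P20  # both dice must roll a 1
## [1] 0.0025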

How will advantage and disadvantage affect the average roll? This time, let's estimate the answer using a simulation; 10,000 trials will give a good estimate.

Trials <- 10000

Below we simulate rolling two D20s.

set.seed(1)
x <- sample(1:20,Trials,replace = TRUE)
y <- sample(1:20,Trials,replace = TRUE)

With advantage, we roll two D20s and pick the max. With disadvantage, we roll two D20s and pick the min.

RollsWithAdvantage    <- apply(cbind(x,y), 1, max)
Advantage             <- mean(RollsWithAdvantage)
RollsWithDisadvantage <- apply(cbind(x,y), 1, min)
Disadvantage          <- mean(RollsWithDisadvantage)

Below are the simulated advantage mean, the calculated mean of a regular D20 roll, and the simulated disadvantage mean. Advantage adds about 3.4 points and disadvantage subtracts about 3.4 points, or roughly +/- 17%.

Advantage
## [1] 13.8863
mean(1:20)
## [1] 10.5
Disadvantage
## [1] 7.1287

Below are 3 figures with histograms. The first histogram is made with 10,000 simulated D20 advantage rolls. Advantage moves a lot of probability to the right. The second histogram is made using 10,000 simulated regular D20 rolls. That histogram is approximately flat, with each number near 0.05 (5%), which is what we calculated. The last histogram is made with 10,000 simulated D20 disadvantage rolls. Disadvantage moves an equal amount of probability to the left.
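
The histogram figures aren't reproduced here, but they can be recreated from the simulated rolls above. Below is a minimal sketch (the three-panel layout and titles are just one way to draw it):

# One panel per histogram; freq = FALSE makes bar heights proportions
par(mfrow = c(3,1))
hist(RollsWithAdvantage,    breaks = 0:20, freq = FALSE, main = "Advantage")
hist(x,                     breaks = 0:20, freq = FALSE, main = "Regular D20")
hist(RollsWithDisadvantage, breaks = 0:20, freq = FALSE, main = "Disadvantage")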

The probability of getting 11 or better with advantage is 1 minus the probability of getting 10 or less twice. Rolling with advantage moves a 50% chance to a 75% chance!

p10orLess <- 0.5
1 - p10orLess^2
## [1] 0.75

Sunday, June 2, 2019

What is a Statistic?

A statistic is a mathematical operation on a data set, performed to get information from the data.

Below is the R code that generates 20 random samples. The samples are uniformly distributed between 0 and 1. Uniformly distributed means all values in the range are equally likely, like how heads and tails are equally likely when you flip a coin.

x <- round(runif(20),2)
x
##  [1] 0.90 0.27 0.37 0.57 0.91 0.20 0.90 0.94 0.66 0.63 0.06 0.21 0.18 
## [14] 0.69 0.38 0.77 0.50 0.72 0.99 0.38

One thing we might want to know about this data set is the expected value (EV). The EV is the typical value of the data. Since we know the data is uniformly distributed between 0 and 1, the EV is just the middle of the range, 0.5. In practice, we have the data but don't know the distribution it came from. So, we can't calculate the EV, but we can use the average as an estimate of the EV. The R code below computes the average.

sum(x)/20
## [1] 0.5615

The average of the data is about 0.56. The average is not the only way to estimate the EV; there are many ways! We could just use the first element of the data set, 0.90, as the estimate. Another way is to find the maximum of the data and divide it by 2. The code below finds the max of the data and divides it by 2. Notice that max/2 produces an estimate closer to the true value of 0.5 than the average does.

max(x)/2
## [1] 0.495

There are many ways to estimate the EV, but some are better than others. Estimators of random data are also random, so the only way to compare estimators is statistically. The case above, where max/2 produced a better estimate than the average, could have been luck. The code below generates 20 data samples from the same uniform distribution 100 times. We then estimate the EV in three ways: the average, the first sample, and max(x)/2.

trials <- 100
avg     <- matrix(0,1,trials)
first   <- matrix(0,1,trials)
halfMax <- matrix(0,1,trials)
for (indx in 1:trials) {
  x <- runif(20)
  avg[indx]     <- sum(x)/20
  first[indx]   <- x[1]
  halfMax[indx] <- max(x)/2
}

The figure below contains 100 averages plotted with circles and the true EV plotted as a line at 0.5. Most of the averages are between 0.4 and 0.6 and many are closer. The average seems to be a good estimate of the EV, which shouldn't be surprising since the average is the most common statistic.

This figure shows the estimates based only on the first element of the data set. The estimates range from just over 0 to almost 1. A few estimates are close to 0.5, but most are not. This is not a good estimator of the EV. That is probably not surprising, since this estimate is based on only one sample while the average is based on 20.

This figure is the plot of the estimates based on max/2. Most of the estimates are very close to the true value. It looks even better than the average! This may seem surprising, since it looks like it is based on only one sample, the maximum sample, but it actually makes use of all the samples. This is because you need to look at all the samples to know which one is the maximum.
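
The three figures aren't reproduced here; they can be recreated from the simulation above. Below is a minimal sketch (axis labels and titles are just one choice):

# Plot each set of 100 estimates as circles with the true EV drawn as a line at 0.5
par(mfrow = c(3,1))
plot(as.numeric(avg),     ylim = c(0,1), ylab = "estimate", main = "Average");      abline(h = 0.5)
plot(as.numeric(first),   ylim = c(0,1), ylab = "estimate", main = "First sample"); abline(h = 0.5)
plot(as.numeric(halfMax), ylim = c(0,1), ylab = "estimate", main = "max/2");        abline(h = 0.5)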

The average is a good choice (and often the best choice) for estimating the EV, but it is not always the best estimator. In the case of this particular distribution, max/2 is a better estimator of the EV than the average.
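
One way to quantify "better" is the root mean squared error of each estimator around the true EV of 0.5; smaller is better. This is a quick check you can run after the simulation loop above:

# RMSE of each estimator; max/2 should typically come out smallest here
sqrt(mean((avg     - 0.5)^2))
sqrt(mean((first   - 0.5)^2))
sqrt(mean((halfMax - 0.5)^2))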

So, we now know what a statistic is and we worked with a few. As a field of study, statistics uses visualization, probability and other math tools to find the best ways to get information from data.

Friday, May 24, 2019

Random Autocorrelation Sequences R version

What is an autocorrelation sequence?

Autocorrelation sequences (ACSs) are super common when doing anything in probability and statistics. An autocorrelation sequence is a sequence of measurements of how similar a sequence is to itself. In math, the autocorrelation sequence r[k] is

r[k] = E[x[n] x[n+k]] for k = 0, 1, ..., N-1,

where N is the number of data samples, E is the expected value, x[n] is a data sample and k is the lag. The lag is the separation in samples.
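
As a concrete example of the definition, here is a small sketch that estimates r[k] from a data vector by replacing the expected value E with a sample average (the function name estimate_acs is just for illustration):

# Estimate r[k] = E[x[n] x[n+k]] by averaging products of samples k apart
estimate_acs <- function(x) {
  N <- length(x)
  sapply(0:(N-1), function(k) mean(x[1:(N-k)] * x[(1+k):N]))
}
estimate_acs(rnorm(100))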

Why make a random autocorrelation sequence?

When testing an algorithm or conducting simulations, it is often useful to use a random ACS. Generating a random ACS can be difficult because ACSs have a lot of special properties, and if you select a sequence at random, the chance that it is a valid ACS is small.

Trick to making a random autocorrelation sequence

We can use the following property of ACSs to make generating random ACSs easy. The ACS and the power spectral density (PSD) are Fourier transform (FT) pairs. For our purpose here, a PSD is just a function that is positive everywhere. "FT pair" means the FT of an ACS is a PSD and the inverse FT of a PSD is an ACS.

So we can generate a random ACS using the following steps. First, generate a random sequence. Second, square each element, so the sequence is positive. Finally, find the inverse FT of the squared sequence.

The R code that produces a random ACS

The ACS could be any size, but in this case we want a 9 element sequence.

N   <- 9                         # length of the ACS
PSD <- rnorm(N)^2                # squaring makes the random sequence positive, so it is a valid PSD
ACS <- fft(PSD, inverse = TRUE)  # the inverse FT of a PSD is an ACS

The line below outputs the ACS and as you can see it is a complex sequence.

ACS

[1] 0.6183715+0.0000000i -0.1375219+0.1960568i -0.1672163-0.2084656i 0.2199730-0.0977208i
[5] -0.0281983+0.2615475i -0.0281983-0.2615475i 0.2199730+0.0977208i -0.1672163+0.2084656i
[9] -0.1375219-0.1960568i

What if I want a real ACS?

If you want a real ACS then the PSD has to be even. So, let's make the sequence even!

PSDeven <- c(PSD,PSD[N:2])
PSDeven

[1] 0.39244438 0.03372487 0.69827518 2.54492084 0.10857537 0.67316837 0.23758708 0.54512337
[9] 0.33152416 0.33152416 0.54512337 0.23758708 0.67316837 0.10857537 2.54492084 0.69827518
[17] 0.03372487

Notice the recomputed ACS below is still stored as complex. Numerical error leaves some imaginary dust we need to clean up.

ACS <- fft(PSDeven,inverse = TRUE)/N
ACS

[1] 1.1931381+0i 0.1714080+0i -0.3200109+0i -0.5007558+0i -0.2372697+0i 0.2647028+0i
[7] 0.4700409+0i 0.3009945+0i -0.3750372+0i -0.3750372+0i 0.3009945+0i 0.4700409+0i
[13] 0.2647028+0i -0.2372697+0i -0.5007558+0i -0.3200109+0i 0.1714080+0i

Clean up the small imaginary part with Re() and now we are ready to plot.

ACS <- Re(ACS)

Plot of ACS

The figure below is a plot of the ACS from lag k = 0 to 16. In textbooks the ACS would have been plotted from k=-8 to 8, with r[0] in the center.

This is the plot of the ACS in the textbook style. Notice that the value at lag 0, r[0], is positive and larger than the other lags, a standard property of ACSs. All is well!
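
The plots aren't reproduced here; below is a minimal sketch of both versions, using the 17-point real ACS from above:

# Natural order: lags k = 0 to 16
plot(0:16, ACS, type = "h", xlab = "lag k", ylab = "r[k]")

# Textbook order: lags k = -8 to 8 with r[0] in the center
plot(-8:8, c(ACS[10:17], ACS[1:9]), type = "h", xlab = "lag k", ylab = "r[k]")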

Sunday, May 19, 2019

Predicting the future is hard. There are three mistakes* that are very easy to make when trying to make predictions. First is overly aggressive rounding. Second is considering too few hypotheses. Finally, mistaking developing a hypothesis for testing a hypothesis.

Overly Aggressive Rounding

People are comfortable when an event is not going to happen (0%) or is going to happen (100%). People understand, but are not happy with, an event that is 50%. When people don't think about it too much, the default is to round to 0%, 100%, or 50%. As an example, every year before the start of the football season, the local sports talk radio tries to predict the Patriots' final season record. It goes something like this. Let's say they are going to play 10 "bad" teams and 6 "good" teams. They assign a win for each bad team and figure 3 and 3 for the good teams, for a final record of 13 and 3, a very good record. Then the bad teams sometimes win and everyone is shocked! In reality, the bad teams have a better than 0% chance to win. Let's say it's 20%. So a better estimate of the record would be 10*80% + 6*50% = 8 + 3 = 11 wins and 5 losses. In the case of games, who cares**? But this same faulty logic is applied in other, more important, real world cases.

*There are more than that, but these are the ones I see in the wild most often.
** People who bet money.
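
For the record prediction above, the arithmetic is easy to check in R:

wins <- 10*0.80 + 6*0.50  # 10 games at an 80% win chance plus 6 at 50%
wins
## [1] 11
16 - wins                 # expected losses
## [1] 5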

