Sunday, June 2, 2019

What is a Statistic?

A statistic is a mathematical operation on a data set, performed to get information from the data.

Below is the R code that generates 20 random samples, rounded to two decimal places. The samples are uniformly distributed between 0 and 1. Uniformly distributed means every value in that range is equally likely, much as heads and tails are equally likely when you flip a coin.

x <- round(runif(20),2)
x
##  [1] 0.90 0.27 0.37 0.57 0.91 0.20 0.90 0.94 0.66 0.63 0.06 0.21 0.18 
## [14] 0.69 0.38 0.77 0.50 0.72 0.99 0.38

One thing we might want to know about this data set is the expected value (EV). The EV is the long-run average of the values the distribution produces, so it serves as a typical value of the data. Since we know the data is uniformly distributed between 0 and 1, the EV is just the middle of the range, 0.5. In practice, we have the data, but don't know the distribution the data came from. So, we can't calculate the EV, but we could use the average as an estimate of the EV. The R code below computes the average.
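For reference, the value 0.5 follows from the definition of the expected value of a continuous uniform distribution. The symbols a, b and f(x) (the lower end, upper end and density of the distribution) are notation introduced here for illustration and do not appear in the original post.

```latex
E[X] = \int_a^b x \, f(x) \, dx
     = \int_a^b \frac{x}{b-a} \, dx
     = \frac{a+b}{2},
\qquad \text{so for } a = 0,\ b = 1: \quad E[X] = \frac{0+1}{2} = 0.5
```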

sum(x)/20
## [1] 0.5615

The average of the data is about 0.56. The average is not the only way to estimate the EV; there are many ways! We could just use the first element of the data set, 0.90, as the estimate. Another way is to find the maximum of the data and divide it by 2, since the maximum should be close to the top of the range, 1, and the EV is half of that. The code below finds the max of the data and divides it by 2. Notice that, for this data set, max/2 produces an estimate that is closer to the true value of 0.5 than the average.

max(x)/2
## [1] 0.495
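A short calculation suggests why max/2 lands near 0.5. It assumes the n samples are independent and uniform between 0 and 1; M (the sample maximum) and n (the sample size) are notation introduced here, not taken from the post.

```latex
P(M \le m) = m^n \ (0 \le m \le 1)
\;\Rightarrow\;
E[M] = \int_0^1 m \cdot n\, m^{\,n-1} \, dm = \frac{n}{n+1},
\qquad
E\!\left[\frac{M}{2}\right] = \frac{n}{2(n+1)} = \frac{20}{42} \approx 0.476 \ \text{ for } n = 20
```

So, on average, max/2 sits slightly below 0.5 for 20 samples, and the gap shrinks as n grows.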

There are many ways to estimate the EV, but some are better than others. Estimators of random data are also random. So, the only way to compare estimators is statistically. The case above where the max/2 produced a better estimate than the average could have been luck. The code below generates 20 data samples from the same uniform distribution 100 times. For each set of 20 samples, we estimate the EV in three ways: the average, the first sample, and the max(x)/2.

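trials <- 100                          # number of repeated experiments
# one slot per trial for each of the three estimators
avg     <- matrix(0,1,trials)
first   <- matrix(0,1,trials)
halfMax <- matrix(0,1,trials)
for (indx in 1:trials) {
  x <- runif(20)                       # fresh set of 20 uniform samples
  avg[indx]     <- sum(x)/20           # estimate 1: the average
  first[indx]   <- x[1]                # estimate 2: the first sample
  halfMax[indx] <- max(x)/2            # estimate 3: half the maximum
}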

The figure below contains the 100 averages plotted as circles, with the true EV drawn as a line at 0.5. Most of the averages are between 0.4 and 0.6, and many are closer. The average seems to be a good estimate of the EV, which shouldn't be surprising since the average is the most common statistic.

This figure shows the estimates based only on the first element of each data set. The estimates range from just over 0 to almost 1. A few estimates are close to 0.5, but most are not. This is not a good estimator for EV. This is probably not surprising, since this estimate is based on only one sample and the average is based on 20.

This figure is the plot of the estimates based on the max/2. Most of the estimates are very close to the true value. It looks even better than the average! This may seem surprising, since it looks like it is based on only one sample, the maximum sample, but it actually makes use of all the samples. This is because you need to look at all the samples to know which one is the maximum.
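The three plots described above could be reproduced with something like the sketch below. It is not the code used for the original figures: the seed, the axis labels, the panel titles and the stacked three-panel layout are all assumptions made for illustration.

```r
set.seed(1)        # hypothetical seed, only so the run is repeatable
trials <- 100
n      <- 20

avg     <- numeric(trials)
first   <- numeric(trials)
halfMax <- numeric(trials)
for (indx in 1:trials) {
  x <- runif(n)
  avg[indx]     <- mean(x)
  first[indx]   <- x[1]
  halfMax[indx] <- max(x) / 2
}

# One panel per estimator, all on the same 0 to 1 vertical scale,
# with a horizontal line at the true EV of 0.5
par(mfrow = c(3, 1))
plot(avg,     ylim = c(0, 1), xlab = "trial", ylab = "estimate", main = "Average")
abline(h = 0.5)
plot(first,   ylim = c(0, 1), xlab = "trial", ylab = "estimate", main = "First sample")
abline(h = 0.5)
plot(halfMax, ylim = c(0, 1), xlab = "trial", ylab = "estimate", main = "max(x)/2")
abline(h = 0.5)
```

Using the same vertical scale in every panel makes the difference in spread between the three estimators easy to see at a glance.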

The average is a good choice (and often the best choice) for estimating the EV, but it is not always the best estimator for EV. In the case of this particular distribution, the max/2 is a better estimator of EV than the average.
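The visual comparison can also be backed with a number. The sketch below scores each estimator by its mean squared error (MSE) against the known EV of 0.5. MSE is a yardstick chosen here for illustration, since the post compares the estimators by eye; the seed and the larger trial count are likewise assumptions.

```r
# Assumption: MSE against the known EV of 0.5 is the measure of "better"
set.seed(1)          # hypothetical seed, only so the run is repeatable
trials <- 10000      # more trials than the post's 100, for a steadier summary
n      <- 20
trueEV <- 0.5

avg     <- numeric(trials)
first   <- numeric(trials)
halfMax <- numeric(trials)
for (indx in 1:trials) {
  x <- runif(n)
  avg[indx]     <- mean(x)
  first[indx]   <- x[1]
  halfMax[indx] <- max(x) / 2
}

# Average squared distance of each estimator from the true EV
mse <- c(avg     = mean((avg     - trueEV)^2),
         first   = mean((first   - trueEV)^2),
         halfMax = mean((halfMax - trueEV)^2))
print(mse)
```

A smaller MSE means the estimates cluster more tightly around 0.5. With this many trials, max(x)/2 should come out smallest, the average next, and the first sample far behind, matching what the figures suggest.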

So, we now know what a statistic is and we worked with a few. As a field of study, statistics uses visualization, probability and other math tools to find the best ways to get information from data.
