
Sunday, June 8, 2025

Interpolation in the Frequency Domain Improved!

Introduction

This post describes interpolation in the frequency domain (IFD) and an improvement to IFD. IFD is a method of interpolation that is easy to use and produces good results on most time series data (data samples equally spaced in time). It is easy to use because it does not require any prior knowledge of the data. Also, since IFD uses the Fast Fourier Transform (FFT), it is fast (even on large data sets). The drawback to IFD is that it sometimes produces an interpolation that is not good at the start and end of the sequence (referred to as “end effects”). The improved IFD reduces the end effects. While end effects are not always an issue, it is always better to not have them.

The first section briefly introduces interpolation. The next section defines the standard IFD algorithm and shows examples of its use. The final section shows an improved IFD that reduces end effects.

Interpolation Example

Let's say we check a rain gauge at 7:00 and then once an hour for the next four hours, resulting in the following five measurements: 0.01 cm, 0.19 cm, 0.31 cm, 0.40 cm, and 0.52 cm. We didn't measure the rain at 7:30, but we could estimate the rainfall at 7:30 using a linear spline to interpolate (connecting the points with straight lines).

The figure below shows the five rainfall values from above and an illustration of interpolating a rainfall value at 7:30. The figure shows the five measurements as black circles, the interpolating blue line connecting the points at 7:00 and 8:00, and the interpolated point at 7:30 shown as a red circle. The red circle shows that the rainfall at 7:30 is 0.1 cm.
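
As a quick check, base R's approx function performs the same linear interpolation (the hours vector and variable names below are just for illustration):

hours <- 7:11                              # measurement times, 7:00 to 11:00
rain  <- c(0.01, 0.19, 0.31, 0.40, 0.52)   # rainfall in cm
approx(hours, rain, xout = 7.5)$y          # linear interpolation at 7:30
## [1] 0.1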

Interpolation in the Frequency Domain

Linear spline interpolation is effective if the underlying data is well modeled by straight lines between samples. IFD works well on a much wider range of data without requiring a model.

The steps in the algorithm for IFD are enumerated below.

  1. Find the discrete Fourier transform (DFT) of the data using the built-in function fft, which is R’s implementation of the FFT. The data started in the time domain (data indexed by time). After the transform, the data is in the frequency domain (data indexed by frequency).
  2. Add zeros to the data. This is called zero padding, and it forces the inverse DFT (IDFT) in the next step to evaluate the data at more points. When adding zeros to the frequency domain data, the zeros must go in the middle. This step is slightly different for even and odd length sequences.
  3. Perform the IDFT on the zero padded DFT sequence.

These steps are explained below.

Find the DFT of the Sequence

To illustrate this step, we use the rainfall values from above.

First, we use the fft function to find the DFT of the sequence. Below, I’m going to stop saying, “Use the fft function to find the DFT.” I’ll just say find the FFT. Note: to take up less space on the screen, I’ll round the output to two decimal places.
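
The five rainfall measurements are stored in the vector x (this definition is implied by the example above and is consistent with the FFT output below).

x <- c(0.01, 0.19, 0.31, 0.40, 0.52)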

N <- length(x)
X <- fft(x)
round(X,digits = 2)
## [1]  1.43+0.00i -0.34+0.37i -0.34+0.11i -0.34-0.11i -0.34-0.37i

Zero Pad the Data

In this case, we want to double the number of points, so we add five zeros to X. The zeros have to be added in the “middle” of the sequence. If you are familiar with Fourier transform theory, you’ll note that this is between the positive and negative frequencies.

Xzp <- c(X[1],X[2:3],rep(0,5),X[4:5])
round(Xzp,digits = 2)
##  [1]  1.43+0.00i -0.34+0.37i -0.34+0.11i  0.00+0.00i  0.00+0.00i  0.00+0.00i
##  [7]  0.00+0.00i  0.00+0.00i -0.34-0.11i -0.34-0.37i

Note that if the DFT X had an even number of samples, for instance 6, then the zero-padded DFT Xzp would be c(X[1], X[2:3], X[4]/2, rep(0, 5), X[4]/2, X[5:6]). The X[N/2+1] sample (in this case, X[4]) is used twice but cut in half to keep the energy the same. I show an example of this in the IFD Function section.

Inverse DFT

Now, we find the inverse FFT. The inverse FFT is scaled by one over the number of samples. Notice we divide by the number of actual data points (five) and not the total number of points in the padded sequence (ten). If we divided by ten, we would bias the result low, because the zeros don't add any energy to the sequence.

xinterp <- fft(Xzp,inverse = TRUE)/N
round(xinterp,digits=2)
##  [1] 0.01+0i 0.00+0i 0.19+0i 0.33+0i 0.31+0i 0.29+0i 0.40+0i 0.55+0i 0.52+0i
## [10] 0.26+0i

Since we started with a real sequence, we should end with a real sequence. However, the fft output is complex, so we convert the sequence to real using the Re function.

xinterp <- Re(xinterp)
round(xinterp,digits = 2)
##  [1] 0.01 0.00 0.19 0.33 0.31 0.29 0.40 0.55 0.52 0.26

The figure below contains two plots: the interpolated sequence shown as blue x's and the original sequence shown as black circles (the circles are hard to see because they have x's on them). Note the interpolation (especially the points at 7:30 and 11:30) is not great. This method suffers from end effects and performs poorly on short sequences. We show a method that mitigates the end effects in the IFD Function Improved section.

Example IFD with a Longer Sequence

Now, let’s try IFD with a longer sequence. Below is a 49 sample ramp.

N <- 49
x <-1:N

Since we now know the algorithm, I’ll go through the steps with less explanation. First, find the FFT.

X <-fft(x)

Second, zero pad by 2.

Xzp <- c(X[1:ceiling(N/2)],rep(0,N),X[(ceiling(N/2)+1):N])

Finally, the IDFT.

xInterp <- Re(fft(Xzp,inverse = TRUE)/N)

The figure below contains two plots. The interpolated sequence is shown as a blue line, and the ramp sequence is shown as black circles. You can see the end effects at the start and end of the plot. After the end effects, the interpolated points are close to the ramp.

The figure below contains a zoomed-in view of the one above. A blue line shows the interpolated sequence, and black circles show the original ramp samples. You can see that after the end effects at the beginning, the interpolation is very good.

IFD Function

Here, we put the IFD steps into the function findIfd, which takes a data sequence dataSet and an interpolation factor M as inputs. To keep the function simple, M is restricted to integer values. The function first checks that the interpolation factor is an integer not less than 1. Next, it finds the FFT of dataSet and sets N to the length of dataSet. The function determines whether dataSet has an even or odd number of samples and places the zeros in the middle accordingly. After zero padding, it finds the inverse FFT of the zero-padded sequence. Finally, it returns the interpolated sequence, converting it to real if the input dataSet was real. Note the R function is.numeric returns TRUE if the sequence is real.

findIfd <- function(dataSet,M=2) {
  
  if (round(M)!=M || M<1) {
    stop("M must be an integer larger than zero.")
  }
  
  X <- fft(dataSet)
  N <- length(X)
  
  if(floor(N/2)==N/2){
    Xzp <- c(X[1:(N/2)],X[N/2+1]/2,rep(0,(M-1)*N-1),X[N/2+1]/2,X[(N/2+2):N])
    
  } else {
    Xzp <- c(X[1:ceiling(N/2)],rep(0,((M-1)*N)),X[(ceiling(N/2)+1):N])
  }
  xint <- fft(Xzp,inverse = TRUE)/N
  
  if (is.numeric(dataSet))
      Re(xint[1:(M*N)])
  else
      xint[1:(M*N)]
}

Example: IFD Function and the Ramp

Below, we use findIfd to interpolate the ramp. We start with a 50 sample ramp and use an interpolation factor of 8, so the length of the interpolated sequence is 400.

N <- 50
x <-1:N
xInterp <- findIfd(x,M=8)

The plot below shows the interpolated sequence in blue (400 points make the plot hard to read), and the ramp sequence is shown as black points. As above, we can see that the interpolation is good away from the start and end.

Example: Interpolating a Sine Wave

IFD works better when the function is periodic and smooth. In this example, we interpolate a sine sequence, and since it is periodic and smooth, we should get a better result.

N <- 50
n <- 0:(N-1)
f <- 0.02
x <- sin(2*pi*f*n)
M <- 4
xInterp <- findIfd(x,M)
Ndx <- seq(0,N-1/M,1/M)

The figure below contains two plots. The black circles are the original sine sequence, and the blue line shows the interpolation. You can see this is a very good interpolation with no end effects.

IFD Function Improved

The improved version of the IFD function “mirrors” the sequence before performing the interpolation. By mirror, I mean appending a reversed version of the sequence to the end of the original sequence. See below for an example of mirroring.

x <- 1:5
c(x,rev(x))
##  [1] 1 2 3 4 5 5 4 3 2 1

The function findIfdImproved takes a sequence x and the interpolation factor M as inputs and outputs an interpolated sequence. The difference between this function and findIfd is that this function mirrors the sequence before applying the IFD algorithm. The extra samples at the end of the interpolated sequence (caused by the mirroring) are removed before the interpolated sequence is returned. Note that mirroring doubles the length of the sequence and makes it even, so we only need the zero-padding method for even-length sequences. Mirroring means the FFTs operate on twice as many points, but because the FFT is fast, the extra cost is small in practice.

findIfdImproved <- function(x,M) {
  
  if (round(M)!=M || M<1 ) {
    stop("M must be an integer larger than zero.")
  }
  
  X <- fft(c(x,rev(x)))
  N <- length(X)
  
  Xzp <- c(X[1:(N/2)],X[N/2+1]/2,rep(0,(M-1)*N-1),X[N/2+1]/2,X[(N/2+2):N])

  xint <- fft(Xzp,inverse = TRUE)/N
  
  if (is.numeric(x)) {
    Re(xint[1:(M*length(x))])
  } else {
    xint[1:(M*length(x))]
  }
}

Below, we use findIfdImproved to interpolate the 50 point ramp with an interpolation factor of 8.

N <- 50
x <-1:N
xInterp <- findIfdImproved(x,M=8)

The plot below shows the interpolated sequence as a blue line and the ramp sequence as black points. We see the findIfdImproved produces an improved interpolation with almost no end effects (you can see slight end effects at the end of the sequence).

Conclusion

IFD is an easy-to-use method of interpolation that produces good results, except for the end effects. The improved method significantly reduces the end effects, at the cost of additional processing.

Sunday, July 16, 2023

Autoregressive Estimation and the AIC

1 Model Order Selection

This section provides a brief introduction to autoregressive (AR) model order estimation. An AR model of order p (AR[p]) can model up to p peaks in a PSD. So, if we wanted to estimate a PSD with 4 peaks, we would use an AR[4] model. Unfortunately, we don't always know the number of peaks. However we choose the model order, we don't want to overfit the model; we want the minimum model order that can capture the important features. Overfitting requires estimating more parameters and adds variance to our estimate.

Model order estimators are mathematical tools that estimate the order of the process that produced the data. The Akaike Information Criterion (AIC) is a model order criterion reported by the arima function. The AIC measures how well a model fits the data while penalizing the number of parameters (lower is better).

The arima function takes the data and the model order as inputs. The function produces a number of outputs, but we only use the AIC: accessing the $aic element of the arima output gives the AIC associated with the input model order. Below is an example of using the arima function to find the model order.

2 Example

In this example, we generate data from an AR model and then use the AIC to estimate the order of the AR model that produced the data. Note the example is implemented in the R programming language. First, we create data using an AR[1] model.

set.seed(1) 
N <- 512
y_ar1 <- rep(0,N)
x <- rnorm(N)
a <- 0.9 

for (n in 2:N) 
  y_ar1[n] <- -a*y_ar1[n-1] + x[n] 

Next, we use the data to estimate the model order. We calculate the AIC for AR models AR[1] to AR[10]. The model with the lowest AIC is the model that best fits the data. If two models have similar AICs, we should use the lower-order model.

aicAR <- rep(0,10)
for (modelOrder in 1:10) {
  aicAR[modelOrder] <- arima(y_ar1, order = c(AR=modelOrder, 0, MA=0))$aic
}

Figure 1 shows the AIC values for AR model orders from 1 to 10.

Figure 1: This figure contains a plot of the AIC for model orders 1 to 10. The lowest AIC value is from the AR[1] model, which is the true model order. The AIC is not perfect and sometimes will not give exactly the right answer.

Note a similar method could be used to find the model order of moving average (MA) and autoregressive moving average (ARMA) models.
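
As a sketch of that idea (reusing the y_ar1 data purely to illustrate the mechanics), the loop above can be adapted by moving the order index to the MA position of the order argument:

aicMA <- rep(0,10)
for (maOrder in 1:10) {
  aicMA[maOrder] <- arima(y_ar1, order = c(AR=0, 0, MA=maOrder))$aic
}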

Wednesday, June 24, 2020

Dungeons and Dragons: Advantage

D20 and Random Events

In the game Dungeons and Dragons, the success or failure of an event is determined by rolling a 20-sided die (D20): higher is better. If you need to roll 11 or higher, you have a 50% chance of success. If another event requires 10 or better, you have a 55% chance of success. Each point the required roll goes up or down is worth 5%. Often, rolling a 20 is a critical success and a 1 is a critical failure; critical means it's extra good/bad.

Advantage

Sometimes things are really going your way and you roll with advantage. Sometimes things are not in your favor and you roll with disadvantage. When rolling with advantage, roll two dice and pick the higher. When rolling with disadvantage, roll two dice and pick the lower. How do advantage and disadvantage affect the chance of getting a 20 or 1?

When rolling without advantage or disadvantage, the probability of getting a 20 or a 1 is 1/20 = 0.05, or 5%. The probability of not getting a 20 is 1 - 1/20 = 19/20.

P20    <- 0.05
Pnot20 <- 1-P20

When you have advantage, to not get a 20 you have to not roll a 20 twice. The probability of getting a 20 is therefore 1 minus the probability of not getting a 20 twice, which, as you can see below, is almost 10%. Advantage roughly doubles your chance of getting a 20. But you probably guessed that, since you're rolling twice. :) The chance of getting a 1 is 0.05^2 = 0.0025, almost 0!

Pnot20*Pnot20
## [1] 0.9025
1 - Pnot20*Pnot20
## [1] 0.0975

How will advantage and disadvantage affect the average roll? This time let's estimate the answer using a simulation. 10,000 trials will give a good estimate.

Trials <-10000

Below we simulate rolling two D20s.

set.seed(1)
x <- sample(1:20,Trials,replace = TRUE)
y <- sample(1:20,Trials,replace = TRUE)

When we have advantage, we roll two D20s and pick the max. With disadvantage, we roll two D20s and pick the min.

RollsWithAdvantage <- apply(cbind(x,y), 1,max)
Advantage    <-mean(RollsWithAdvantage)
RollsWithDisadvantage <- apply(cbind(x,y), 1,min)
Disadvantage <-mean(RollsWithDisadvantage)

Below is the simulated advantage mean, the calculated mean of a regular D20 roll, and the simulated disadvantage mean. Advantage adds about 3.4 points and disadvantage subtracts about 3.4, or roughly +/- 17%.

Advantage
## [1] 13.8863
mean(1:20)
## [1] 10.5
Disadvantage
## [1] 7.1287
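
As a check on the simulation, we can compute the exact means directly: the probability that the larger of two D20s equals k is (2k-1)/400, and the probability that the smaller equals k is (41-2k)/400.

k <- 1:20
sum(k*(2*k-1)/400)   # exact mean with advantage: 13.825
sum(k*(41-2*k)/400)  # exact mean with disadvantage: 7.175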

Below are three figures with histograms. The first histogram is made with 10,000 simulated D20 advantage rolls. Advantage moves a lot of probability to the right. The second histogram is made with 10,000 simulated regular D20 rolls. This histogram is approximately flat, with each number occurring about 5% of the time, which is what we calculated. The last histogram is made with 10,000 simulated D20 disadvantage rolls. Disadvantage moves an equal amount of probability to the left.

The probability of getting 11 or better with advantage is 1 minus the probability of getting 10 or less twice. Rolling with advantage moves a 50% chance to a 75% chance!

p10orLess <- 0.5
1 - p10orLess^2
## [1] 0.75
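
For comparison, rolling with disadvantage requires both dice to come up 11 or better, so the same 50% chance drops to 25%.

p11orMore <- 1 - p10orLess
p11orMore^2
## [1] 0.25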

Saturday, June 20, 2020

Complex Normal Samples In R

Normal Samples

If we want 10 samples from a Gaussian or normal random process with variance 4, we can use rnorm(10,sd=2). Remember the standard deviation (sd) is the square root of the variance.

set.seed(2)
x <- rnorm(10,sd=2)
x
##  [1] -1.7938291  0.3696984  3.1756907 -2.2607513 -0.1605035  0.2648406
##  [7]  1.4159095 -0.4793960  3.9689479 -0.2775740
var(x)
## [1] 3.880803

The var() function produces an estimate of the variance; if we want a better estimate, we need more samples.

var(rnorm(1000,sd=2))
## [1] 4.105966

Complex Normal Samples

If we are using base R and want complex normal (CN) samples, we need to write our own function. When the signal processing literature refers to CN, it usually means circularly-symmetric CN. Circularly-symmetric means the real and imaginary parts are independent, have equal variance, and have mean 0.

The function below produces N CN samples with variance v. The real and imaginary parts are independent because they are produced by different calls to rnorm(). Let x and y be independent. Then var(ax) = a^2 var(x) and var(x+y) = var(x) + var(y). So, if we want a total variance of v, the real and imaginary parts each need variance v/2, which means scaling the rnorm() samples by sqrt(v/2).

makeCN <- function(N,v=1) {(sqrt(v/2))*rnorm(N) + (sqrt(v/2))*1i*rnorm(N)}
makeCN(10)
##  [1]  0.0023376-0.2079938i  0.7613032+0.6053620i  0.3946671-0.4049715i
##  [4]  0.4892950-0.1207824i  0.4651165-0.2871364i -0.2312504+0.9408834i
##  [7] -0.2153405-0.9648887i -1.0994866+1.0119199i  1.0396552+0.7824796i
## [10]  0.1147878+0.9059002i

If we want to check the variance, we can't use var() directly.

var(makeCN(10))
## Warning in var(makeCN(10)): imaginary parts discarded in coercion
## [1] 0.6039204

But the real and imaginary parts are independent, so we can calculate their variances separately and add them.

z <- makeCN(10)
var(Re(z)) + var(Im(z))
## [1] 0.5623849

To make this easier, we can create a function to find the variance.

varComplex <- function(z) var(Re(z)) + var(Im(z))

To get a good estimate we need a lot of samples.

varComplex(makeCN(1000))
## [1] 1.016615

Let's set the variance to 2 and then estimate the variance of the samples.

varComplex(makeCN(1000,v=2))
## [1] 1.925119

Success!

Tuesday, May 12, 2020

Book Review: Introductory Time Series with R

I'm a big fan of R and time series analysis, so I was excited to read the book "Introductory Time Series with R." I've been using the book for about 9 years, so I thought it was about time for a review! In this review, I'm going to cover the following topics: the amount of R content, the subject content, who the book is for, and my overall recommendation.

R Content
The R content is high. All the ideas in the book are heavily illustrated with R code. At the beginning of the book, the authors point out that they use Sweave to embed the code and plots. They also make use of online data sets, so you can type in examples from the book and reproduce the calculations and figures.

Time Series Content
The book covers all the time series topics you'd want in an introduction, plus a few specialty topics like multivariate models. Each chapter is a solid introduction to a topic in time series analysis.

Who is the Book For
1) People who want to learn time series analysis. It covers the theory and application and has plenty of opportunities for hands-on learning. 2) People who want to learn to do time series analysis using R. That's why I bought the book! I'm familiar with time series analysis, and I bought the book to learn how to do time series work with R. The book definitely delivered on that account. 3) People who want to teach a course on time series analysis. The book has plenty of examples and exercises (I'm not sure if there is a solution manual).

Overall Recommendation
The book is well written. The R code is clear and well presented. The figures are numerous and informative. The book is wonderful and I highly recommend it!

https://www.amazon.com/gp/product/0387886974/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&psc=1

Wednesday, June 26, 2019

Blackman-Tukey Spectral Estimator in R

Blackman-Tukey Spectral Estimator in R!

There are two definitions of the power spectral density (PSD). The two definitions are essentially equivalent, and both define a function that describes the distribution of power over the frequency components in our data set. The periodogram PSD estimator is based on the first definition of the PSD (see the periodogram post). The Blackman-Tukey spectral estimator (BTSE) is based on the second definition, which says: find the PSD by calculating the Fourier transform of the autocorrelation sequence (ACS). In R this definition is written as

PSD <- function(rxx) {
  fft(rxx)
}

where fft is the R implementation of the fast Fourier transform, rxx is the autocorrelation sequence (ACS), the k'th element of the ACS is rxx[k] = E[x[0]x[k]] for k from -infinity to +infinity, and E is the expectation operator. The xx in rxx[k] is a reminder that r is the correlation of x with itself. The rxx[k]s are sometimes called lags. The ACS has the property that rxx[-k] = rxx[k]*, where * is the complex conjugate. In this post we will only use real numbers, so I'll drop the * from here forward.

So, to find the PSD we just calculate rxx and take its FFT! Unfortunately, in practice we cannot do this. Calculating the expected value requires the probability density function (PDF) of x, which we don't know, and it requires an infinite amount of data, which we don't have. So, we can't calculate the PSD: we're doomed!

No, we are not doomed. We can't calculate the PSD, but we can estimate it! We can derive an estimator for the PSD from its definition. First, we replace rxx with an estimate of rxx: the expected value, which gives the exact rxx, is replaced with an average, which gives an estimate of rxx. The E[x[0]x[k]] is replaced with (1/N)(x[1]x[1+k]+x[2]x[2+k]+...+x[N-1-k]x[N-1]+x[N-k]x[N]), where N is the number of data samples. For example, if k=0, then rxx[0]=(1/N)*sum(x*x). In R code the estimate is written as

lagEstimate <-function(x,k,N=length(x)){
   (1/N)*sum(x[1:(N-k)]*x[(k+1):N])
}

If we had an infinite amount of data, N = infinity, we could use lagEstimate to estimate the entire infinite ACS. Unfortunately we don't have an infinite amount of data, and even if we did, it wouldn't fit into a computer. So, we can only estimate a finite number of ACS elements. The function below calculates lags 0 to kMax.

Lags <-function(x,kMax) {
  sapply(0:kMax,lagEstimate,x=x)
}

Before we can try these functions out, we need data. In this case the data came from a random process with the PSD plotted in the figure below. The x-axis is normalized frequency (frequency divided by the sampling rate). So, if the sampling rate were 1000 Hz, you could multiply the normalized frequency by 1000 Hz and the frequency axis would read 0 Hz to 1000 Hz. The y-axis is in dB (10log10(amplitude)). You can see six large sharp peaks in the plot and a gradual dip towards 0 Hz and then back up. Some of the peaks are close together and will be hard to resolve.

The data produced by the random process is plotted below. This is the data we will use throughout this post. The data is posted here: x.RData
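
Assuming the file has been downloaded to the working directory, it can be loaded with R's load function, which should create the variable x used below.

load("x.RData")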

Let's calculate the ACS up to the 5th lag using the data.

Lags(x,kMax=5)
## [1]  6.095786 -1.368336  3.341608  1.738122 -1.737459  3.651765

A kMax of 5 gives us 6 lags: {r[0], r[1], r[2], r[3], r[4], r[5]}. These 6 lags are not an ACS, but are part of an ACS.

We used Lags to estimate the positive lags up to kMax, but the ACS is an even sequence: r[-k] = r[k] for all k. So, let's write a function to make a sequence consisting of lags from r[-kMax] to r[kMax]. This is a windowed ACS; values outside of +/- kMax are replaced with 0. Where it won't cause confusion, I'll refer to the windowed ACS as the ACS.

acsWindowed <- function(x,kMax,Nzero=0){
  rHalf <- c(Lags(x,kMax),rep(0,Nzero))
  c(rev(rHalf[2:length(rHalf)]),rHalf)
}

Let's try this function out.

acsW <- acsWindowed(x,9)

In the figure below you can see the r[0] lag, the maximum, is plotted in the middle of the plot.

The ACS in the figure above is arranged the way the ACS is usually plotted in textbooks. In textbooks, the sum in the Fourier transform ranges from -N/2 to (N-1)/2, so the r[0] lag sits in the center of the plot. In R, the sum in the Fourier transform ranges from 1 to N, so the 0'th lag has to be first. We could just build the sequence in R form, but it is often handy to start in textbook form and switch to R form. We can write a function to make switching from textbook form to R form easy.

Textbook2R <- function(x,N=length(x),foldN=ceiling(N/2)) {
  c(x[foldN:N],x[1:(foldN-1)])
}

Notice in the figure below that the maximum lag, r[0], is plotted at the beginning.

Let's imagine we had an infinite amount of data and used it to estimate an infinite number of ACS lags. Let's call that sequence rAll. We make a windowed ACS by setting rW = rAll*W, where W = 1 over the lags we estimated and 0 everywhere else. W is called the rectangular window because, as you can see in the plot below, its plot looks like a rectangle. By default, when we estimate a finite number of lags, we are using a rectangular window.

W <- c(rep(0,9),rep(1,9),rep(0,9))

The reason we cannot use a rectangular window is that its Fourier transform is not always positive. As you can see in the plot below, there are several values below zero, indicated with the dotted line. The Re() function removes some small imaginary parts due to numerical error, some imaginary dust we have to sweep up.

FFT_W <- Re(fft(Textbook2R(W)))

Even though the FFT of the full ACS rAll is positive, the FFT of the product of rAll and a rectangular window might not be positive! The Bartlett window is a simple window whose FFT is positive.

BartlettWindow <- function(N,n=seq(0, (N-1)))  {
  1 - abs( (n-(N-1)/2) / ( (N-1)/2) )
}
Wb <- BartlettWindow(19)

As you can see in the plot below the Fourier transform of the Bartlett window is positive.

WbFft <- Re(fft(Textbook2R(Wb)))

Calculating the BTSE with R

Now that we can estimate the ACS and window our estimate, we are ready to estimate the PSD of our data. The BTSE is written as

Btse <- function(rHat,Wb) {
  Re(fft(rHat*Wb))
}

Note the Re() is correcting for numerical error.

In the first example we use a 19 point ACS lag sequence.

rHat   <- Textbook2R(acsWindowed(x,kMax=9))
Wb     <- Textbook2R(BartlettWindow(length(rHat)))
Pbtse9 <- Btse(rHat,Wb)

The figure below shows the BTSE calculated with a maximum lag of 9. The dotted lines indicate the locations of the peaks in the PSD we are trying to estimate. A maximum lag of only 9 produces a poor estimate.

We calculate a new estimate with a maximum lag of 18.

rHat    <- Textbook2R(acsWindowed(x,kMax=18))
Wb      <- Textbook2R(BartlettWindow(length(rHat)))
Pbtse18 <- Btse(rHat,Wb)

This estimate, made with a maximum lag of 18, is better, but the peaks around 0.4 and 0.6 are still not resolved. We need to increase the maximum lag further.

Finally we increase the maximum lag to 65 and recalculate the estimate.

rHat    <- Textbook2R(acsWindowed(x,kMax=65))
Wb      <- Textbook2R(BartlettWindow(length(rHat)))
Pbtse65 <- Btse(rHat,Wb)

This final estimate is very good. All six peaks are resolved, and the locations of our estimated peaks are very close to the true peak locations.

Final Thoughts

Could we use 500 lags in the BTSE? In this case we could, since we have a lot of data, but the higher lags are estimated with less data and therefore have more variance. Using the high-variance lags produces a higher-variance estimate.

Are there other ways to improve the BTSE besides using more lags? Yes! There are a few. For instance, we could zero pad the lags, that is, add zeros to the end of our lag sequence. This makes the fft in the BTSE estimator evaluate the estimate at more frequencies, so we can see more detail in the estimated PSD.
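
Here is a minimal sketch of that idea, using the functions defined above; the windowed lags are padded with an arbitrary 200 zeros on each end, so the FFT evaluates the estimate on a denser frequency grid.

kMax      <- 65
rW        <- acsWindowed(x,kMax)*BartlettWindow(2*kMax+1)  # window the lags (textbook order)
rWzp      <- c(rep(0,200),rW,rep(0,200))                   # zero pad both ends
Pbtse65zp <- Re(fft(Textbook2R(rWzp)))                     # BTSE on a denser frequency grid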

Also keep in mind there are other PSD estimation methods that do better on other PSD features. For instance, if you were more interested in finding deep nulls rather than peaks, a moving average (MA) PSD estimate would be a better choice.

Thursday, June 13, 2019

Periodogram with R

Periodogram with R
The power spectral density (PSD) is a function that describes the distribution of power over the frequency components composing our data set. If we knew the process that generated the data, we could just calculate the PSD; we would not have to estimate it. Unfortunately, in practice we won't have access to the random process, only the samples (data) produced by the process. So, we can't get the true PSD; we can only get an estimate of it. In the example below, we made the data, so we know what the true PSD should look like.
The PSD is written as

P(f) = lim (N -> infinity) E[ (1/N) |sum_{n=0}^{N-1} x[n] e^(-i2*pi*f*n)|^2 ]

where -1/2 <= f < 1/2, x is the data, e^(-i2*pi*f*n) is the complex exponential, and the sum of the data x multiplied by the complex exponential is the Fourier transform. In English: calculate the PSD by finding the expected value of the magnitude squared of the Fourier transform as the amount of data increases to infinity. Well, that's a mouthful! The frequency f is expressed in normalized frequency. Normalized frequency is the frequency divided by the sampling rate and is often used for simplicity. To convert to frequency in Hz, multiply by the sampling rate.
The periodogram estimator is based on the definition in the equation above. The definition has a limit and an expected value. To use the definition directly we would need an infinite amount of data, which we don't have, and we would need to know the probability density function (PDF) of the data, which we don't know. To turn the definition into an estimator of the PSD, we must drop the limit and the expectation operator ($E$), and we can only evaluate a finite number of frequencies. After dropping the limit and the expectation we are left with the periodogram

Phat(f) = (1/N) |sum_{n=0}^{N-1} x[n] e^(-i2*pi*f*n)|^2

where 0 <= f < 1 and N is the number of data samples. Notice the range of f changes from (-1/2, 1/2) in the first equation to (0, 1) in the second. Textbooks usually use the former and R uses the latter.
The periodogram is very easy to implement in R, but before we do, we need to simulate some data. The code below first uses the set.seed command so R will produce the same "random" numbers each time. Then it creates 32 normally distributed numbers and 32 points of a sine wave with a normalized frequency of 0.4 and an amplitude of 2. The signal is the sine wave and the random points added together.
The figure below is a plot of the data generated above. To me it looks random and it is not obvious there is a sine wave in there.
set.seed(0)
N <-32
n<- 0:(N-1)
w <- rnorm(N)
f1 <- 0.4
A1 <- 2
s <- A1*sin(2*pi*f1*n)
x <- s + w
Now we are going to implement the periodogram in R. It is pretty much a direct translation from the math. We start with the 1/N and multiply that by the absolute value of the square of the Fourier transform of the data. Writing that in English was more work than implementing the periodogram in R!
xPer <- (1/N)*abs(fft(x)^2)
f    <- seq(0,1.0-1/N,by=1/N)
The figure below is a plot of the periodogram of the data. The dotted line marks the location of the frequency f1, the frequency of the sine wave in the data. Now the sine wave component really stands out! Notice two things. First, the peak of the periodogram seems to be a bit off the true value. Second, the plot looks jagged. Both problems have the same cause: the periodogram is only evaluated at 32 frequency bins (the frequencies we evaluate the periodogram at are called bins). We can fix both problems by evaluating the periodogram at more bins. One way to evaluate the periodogram at more points is to get more data! That would certainly fix both problems. Unfortunately, we are often stuck with the data we have. There is another way.
Since f is a continuous variable, we can evaluate the periodogram at as many points as we want. The way we do this is to pretend we have more data by sticking zeros at the end of our data: we zero pad it. The R code below adds 968 zeros to the end of x, zero padding it to a total of 1000 "data" points. As you can see in the figure below, this fixes both problems. The 32-point periodogram missed the peak because it was not evaluated at enough points; zero padding does not guarantee we hit the peak, but it gets us closer. It also fixes the jaggedness by evaluating the periodogram at many more bins.
xzp <- c(x,rep(0,1000-N))
Nzp <- length(xzp)
xPerZp <- (1/N)*abs(fft(xzp)^2)
fzp    <- seq(0,1.0-1/Nzp,by=1/Nzp)
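
One way to plot the zero-padded periodogram in dB, with a dotted line at the true sine frequency f1 (a quick sketch using base R graphics; the axis labels are my own choice):

plot(fzp, 10*log10(xPerZp), type = "l", xlab = "Normalized Frequency", ylab = "Power (dB)")
abline(v = f1, lty = 3)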

Sunday, June 2, 2019

What is a Statistic?

What is a Statistic?

A statistic is a mathematical operation on a data set, performed to get information from the data.

Below is the R code that generates 20 random samples. The samples are uniformly distributed between 0 and 1. Uniformly distributed means all values are equally likely, like when you flip a coin and heads and tails are equally likely.

x <- round(runif(20),2)
x
##  [1] 0.90 0.27 0.37 0.57 0.91 0.20 0.90 0.94 0.66 0.63 0.06 0.21 0.18 
## [14] 0.69 0.38 0.77 0.50 0.72 0.99 0.38

One thing we might want to know about this data set is the expected value (EV). The EV is the typical value of the data. Since we know the data is uniformly distributed between 0 and 1, the EV is just the middle of the range, 0.5. In practice, we have the data but don't know the distribution it came from. So, we can't calculate the EV, but we can use the average as an estimate of the EV. The R code below calculates the average.

sum(x)/20
## [1] 0.5615

The average of the data is about 0.56. The average is not the only way to estimate the EV; there are many ways! We could just use the first element of the data set, 0.90, as the estimate. Another way is to find the maximum of the data and divide it by 2. The code below finds the max of the data and divides it by 2. Notice the max/2 produces an estimate that is closer to the true value of 0.5 than the average.

max(x)/2
## [1] 0.495

There are many ways to estimate the EV, but some are better than others. Estimates made from random data are themselves random, so the only way to compare estimators is statistically. The case above, where the max/2 produced a better estimate than the average, could have been luck. The code below generates 20 data samples from the same uniform distribution 100 times. We then estimate the EV in three ways: the average, the first sample, and the max/2.

trials <- 100
avg     <- matrix(0,1,trials)
first   <- matrix(0,1,trials)
halfMax <- matrix(0,1,trials)
for (indx in 1:trials) {
  x <- runif(20)
  avg[indx]     <- sum(x)/20
  first[indx]   <- x[1]
  halfMax[indx] <- max(x)/2
}

The figure below contains 100 averages plotted as circles and the true EV plotted as a line at 0.5. Most of the averages are between 0.4 and 0.6, and many are closer. The average seems to be a good estimate of the EV, which shouldn't be surprising since the average is the most common statistic.

This figure shows the estimate based only on the first element of each data set. The estimates range from just over 0 to almost 1. A few estimates are close to 0.5, but most are not. This is not a good estimator of the EV, which is probably not surprising since it is based on only one sample while the average is based on 20.

This figure is the plot of the estimate based on the max/2. Most of the estimates are very close to the true value. It looks even better than the average! This may seem surprising, since it looks like it is based on only one sample, the maximum, but it actually makes use of all the samples. This is because you need to look at all the samples to know which one is the maximum.

The average is a good choice (and often the best choice) for estimating the EV, but it is not always the best estimator. In the case of this particular distribution, the max/2 is a better estimator of the EV than the average.
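
One way to make the comparison concrete is to compute each estimator's mean squared error (MSE) against the true EV of 0.5, using the simulated estimates from above (the mse helper below is just for illustration):

mse <- function(est, trueEV=0.5) mean((est - trueEV)^2)
c(average = mse(avg), first = mse(first), halfMax = mse(halfMax))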

So, we now know what a statistic is and we worked with a few. As a field of study, statistics uses visualization, probability and other math tools to find the best ways to get information from data.
