In Applied Statistics course, you learned couple of formulas for basic statistical inference: \[ \mbox{Assuming } X_1, \ldots, X_n \mbox{ are RS (i.i.d) from } N(\mu, \sigma) \mbox{distribution}\\ \hspace{10mm}\\ \hspace{10mm}\\ 95\% \mbox{ Confidence Interval for mean } \mu: \hspace{15mm} \bar X \pm 1.96 \frac{S}{\sqrt n}\\ \hspace{10mm}\\ 95\% \mbox{ Prediction Interval for the next observation} : \hspace{15mm} \bar X \pm 1.96 \, S \, \sqrt{\frac{1}{n} + 1} \]
Let’s simulate that using R. Random Sample of size 100 from the normal distribution with mean 2 and SD 2 looks like this:
set.seed(1453)
X <- rnorm(100, 0, 1) # generate RS of size 100 from N(mean=0,SD=1)
plot(X) # scatter plot with dots
Using the two formulas, we get the statistical inference for the simulated sample above.
## [1] -0.05789066
## [1] 0.9623882
#- compute CI
upper_CI = mean(X) + 1.96 * sd(X) * sqrt(1/100)
lower_CI = mean(X) - 1.96 * sd(X) * sqrt(1/100)
#- compute PI
upper_PI = mean(X) + 1.96 * sd(X) * sqrt(1/100 + 1)
lower_PI = mean(X) - 1.96 * sd(X) * sqrt(1/100 + 1)
plot(X,type='o', ylim=c(-3.5, 3.5), xlim=c(1,120)) # same plot as before
abline(h=upper_CI, col="red", lty=2) # add red dashed lines for CI
abline(h=lower_CI, col="red", lty=2)
abline(h=upper_PI, col="blue", lty=2) # add blue dashed lines for PI
abline(h=lower_PI, col="blue", lty=2)
This is all we can do to forecast if the data is random sample.
First, let’s load the package called TSA this is the package that comes with Cryer’s textbook.
# required only the first time you use the package
install.packages("TSA") # if asked to choose CRAN mirror, choose MI or OH site
# required every time you restart R
acf1 <- acf # Keep original acf() function
library(TSA) # load package TSA
acf <- acf1 # replace back the new acf with original
Now load the data that comes with TSA package. Here’s the data for annual rain falls in LA.
data(larain) # load dataset called larain that is in TSA package
is.ts(larain) # is larain TS object?
## [1] TRUE
## Time Series:
## Start = 1878
## End = 1992
## Frequency = 1
## [1] 20.86 17.41 18.65 5.53 10.74 14.14 40.29 10.53 16.72 16.02 20.82 33.26 12.69
## [14] 12.84 18.72 21.96 7.51 12.55 11.80 14.28 4.83 8.69 11.30 11.96 13.12 14.77
## [27] 11.88 19.19 21.46 15.30 13.74 23.92 4.89 17.85 9.78 17.17 23.21 16.67 23.29
## [40] 8.45 17.49 8.82 11.18 19.85 15.27 6.25 8.11 8.94 18.56 18.63 8.69 8.32
## [53] 13.02 18.93 10.72 18.76 14.67 14.49 18.24 17.97 27.16 12.06 20.26 31.28 7.40
## [66] 22.57 17.45 12.78 16.22 4.13 7.59 10.63 7.38 14.33 24.95 4.08 13.69 11.89
## [79] 13.62 13.24 17.49 6.23 9.57 5.83 15.37 12.31 7.98 26.81 12.91 23.66 7.58
## [92] 26.32 16.54 9.26 6.54 17.45 16.69 10.70 11.01 14.97 30.57 17.00 26.33 10.92
## [105] 14.41 34.04 8.90 8.92 18.00 9.11 11.57 4.56 6.49 15.07 22.65
Here’s color peoperty of a chemicals mixed in a plant.
data(color) # load dataset called color that is in TSA package
plot(color, ylab='Color Property',xlab='Batch',type='o')
Here’s annual population of rabbits in Canada
Sales for oil filters
## [1] TRUE
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1983 2385 3302 3958 3302 2441 3107
## 1984 5862 4536 4625 4492 4486 4005 3744 2546 1954 2285 1778 3222
## 1985 5472 5310 1965 3791 3622 3726 3370 2535 1572 2146 2249 1721
## 1986 5357 5811 2436 4608 2871 3349 2909 2324 1603 2148 2245 1586
## 1987 5332 5787 2886 5475 3843 2537
#-- Plot with month's initial letter
plot(oilfilters,type='l',ylab='Sales')
points(y=oilfilters,x=time(oilfilters), pch=as.vector(season(oilfilters)))
## [1] 31
## Time Series:
## Start = 1905
## End = 1935
## Frequency = 1
## [1] 50 20 20 22 27 50 55 78 70 59 28 20 15 15 25 35 65 78 82 65 26 15 10 1 2 3
## [27] 22 75 95 78 20
#- This year vs Last Year
plot(x=hare[2:31], y=hare[1:30],
ylab="Last Year's Abundance",
xlab="This Year's Abundance")
#- This year vs 2 Years ago ---
plot(x=hare[3:31], y=hare[1:29],
ylab="Abundance 2yrs Ago",
xlab="This Year's Abundance")
Here’s color peoperty of a chemicals mixed in a plant.
#- Plot of color vs previous color
plot(x=color[2:35], y=color[1:34], xlab='Color Property', ylab='Previous Batch')
#- Plot value of X agains previous value of X
plot(X[2:100], X[1:99], ylab='value of X', xlab='previous value of X')
Not so much pattern here.
To capture the pattern we saw between current vs previous values of hare data and color data, we use correlation.
#- Hare data This year vs Last Year
plot(x=hare[2:31], y=hare[1:30],
ylab="Last Year's Abundance",
xlab="This Year's Abundance")
## [1] 0.7025777
#- Hare data This year vs 2 Years ago
plot(x=hare[3:31], y=hare[1:29],
ylab="Abundance 2yrs Ago",
xlab="This Year's Abundance")
## [1] 0.2244224
#- Color data Plot of color vs previous color
plot(x=color[2:35], y=color[1:34], xlab='Color Property', ylab='Previous Batch')
## [1] 0.554917
This pattern doesn’t exist in random sample X.
#-- Random Sample X: Plot value of X agains previous value of X
plot(X[2:100], X[1:99], ylab='value of X', xlab='previous value of X')
## [1] -0.002031602
Autocorrelation Function is a collection of all the autocorrelaton at all the different lags.