Given data \(X_1, \ldots, X_n\), we want to predict \(X_{n+1}\). Let \(\hat X(1)\) be the 1-step predictor given \(X_1, \ldots, X_n\).
Let \(n=4\) for example. Then the \(h\)-step linear predictor is \[ \hat X(h) = a_0 + a_1 X_4 + a_2 X_3 + a_3 X_2 + a_4 X_1 \] We want to find the coefficients \(\{a_0,a_1, \ldots, a_n\}\) that minimize the prediction MSE, \[ E \Big[(\hat X(h) - X_{n+h})^2\Big] = E \Big[\big( a_0 + a_1 X_n + \cdots + a_n X_1 - X_{n+h}\big)^2\Big]. \] To do that, we take each partial derivative \(\frac{\partial }{ \partial a_i}\) and set it equal to 0.
First, we take \(\frac{\partial }{ \partial a_0}\). Moving the derivative inside the expectation, \[ \begin{align} \frac{\partial } { \partial a_0} E \Big[(\hat X(h) - X_{n+h})^2 \Big] & = \hspace3mm \frac{\partial } { \partial a_0} E \Big[ ( a_0 + a_1 X_4 + a_2 X_3 + a_3 X_2 + a_4 X_1 \hspace2mm - \hspace2mm X_{4+h})^2 \Big] \\\\ & = \hspace3mm E \Big[2 ( a_0 + a_1 X_4 + a_2 X_3 + a_3 X_2 + a_4 X_1 \hspace2mm - \hspace2mm X_{4+h}) \Big] \\\\ & = \hspace3mm 2 a_0 + 2 a_1 E\big[ X_4 \big] + 2 a_2 E\big[ X_3 \big] + 2 a_3 E\big[ X_2 \big] + 2 a_4 E\big[ X_1 \big] \hspace2mm - \hspace2mm 2 E\big[ X_{4+h}\big] \hspace3mm = \hspace3mm 0 \end{align} \]
That means, since \(E(X_t)=0\), \[ 2a_0 \hspace3mm = \hspace3mm 0 \hspace10mm \mbox{ or, } \hspace10mm a_0 = 0. \]
Second, plugging in \(a_0=0\), we now take \(\frac{\partial } { \partial a_1}\). We have \[ \begin{align} \frac{\partial } { \partial a_1} E \Big[(\hat X(h) - X_{n+h})^2\Big] & = \hspace3mm \frac{\partial } { \partial a_1} E \Big[ ( a_1 X_4 + a_2 X_3 + a_3 X_2 + a_4 X_1 \hspace2mm - \hspace2mm X_{4+h})^2 \Big] \\\\ & = \hspace3mm E \Big[2 ( a_1 X_4 + a_2 X_3 + a_3 X_2 + a_4 X_1 \hspace2mm - \hspace2mm X_{4+h}) \,\, X_4 \Big] \hspace3mm = \hspace3mm 0 \end{align} \]
Expanding the product and using \(E\big[X_i X_j\big] = \gamma(|i-j|)\) (the process has mean zero), we can rewrite this as \[ a_1 \gamma(0) + a_2 \gamma(1) + a_3 \gamma(2) + a_4 \gamma(3) - \gamma(h) = 0, \] or \[ a_1 \gamma(0) + a_2 \gamma(1) + a_3 \gamma(2) + a_4 \gamma(3) = \gamma(h) . \]
Third, we take \(\frac{\partial } { \partial a_2}\). We have \[ \frac{\partial } { \partial a_2} E \Big[(\hat X(h) - X_{n+h})^2\Big] \hspace3mm = \hspace3mm E \Big[2 ( a_1 X_4 + a_2 X_3 + a_3 X_2 + a_4 X_1 \hspace2mm - \hspace2mm X_{4+h}) \,\, X_{3} \Big] \hspace3mm = \hspace3mm 0. \] And we get \[ a_1 \gamma(1) + a_2 \gamma(0) + a_3 \gamma(1) + a_4 \gamma(2) \hspace3mm = \hspace3mm \gamma(h+1) . \]
Fourth, we take \(\frac{\partial } { \partial a_3}\). We have \[ \frac{\partial } { \partial a_3} E \Big[(\hat X(h) - X_{n+h})^2\Big] \hspace3mm = \hspace3mm E \Big[2 ( a_1 X_4 + a_2 X_3 + a_3 X_2 + a_4 X_1 \hspace2mm - \hspace2mm X_{4+h}) \,\, X_{2} \Big] \hspace3mm = \hspace3mm 0. \] We get \[ a_1 \gamma(2) + a_2 \gamma(1) + a_3 \gamma(0) + a_4 \gamma(1) = \gamma(h+2) . \]
Combining these (the equation from \(\frac{\partial }{ \partial a_4}\) follows the same pattern), we get the set of equations \[ a_1 \gamma(0) + a_2 \gamma(1) + a_3 \gamma(2) + a_4 \gamma(3) \hspace3mm = \hspace3mm \gamma(h) \\\\ a_1 \gamma(1) + a_2 \gamma(0) + a_3 \gamma(1) + a_4 \gamma(2) \hspace3mm = \hspace3mm \gamma(h+1) \\\\ a_1 \gamma(2) + a_2 \gamma(1) + a_3 \gamma(0) + a_4 \gamma(1) \hspace3mm = \hspace3mm \gamma(h+2) \\\\ a_1 \gamma(3) + a_2 \gamma(2) + a_3 \gamma(1) + a_4 \gamma(0) \hspace3mm = \hspace3mm \gamma(h+3) \]
This looks very similar to the Yule-Walker equations. \[ \left[ \begin{array}{cccc} \gamma(0) & \gamma(1) & \gamma(2) & \gamma(3) \\ \gamma(1) & \gamma(0) & \gamma(1) & \gamma(2) \\ \gamma(2) & \gamma(1) & \gamma(0) & \gamma(1) \\ \gamma(3) & \gamma(2) & \gamma(1) & \gamma(0) \\ \end{array} \right] \hspace2mm \left[ \begin{array}{c} a_1 \\ a_2 \\ a_3\\ a_4\\ \end{array}\right] = \left[ \begin{array}{l} \gamma(h) \\ \gamma(h+1) \\ \gamma(h+2) \\ \gamma(h+3) \\ \end{array} \right]. \] This is the general equation for finding the coefficients of the best linear \(h\)-step predictor.
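To see the algebra in action, here is a minimal numpy sketch (my own illustration, not part of the original derivation) that solves this \(4 \times 4\) system, assuming the autocovariances \(\gamma(0), \ldots, \gamma(h+3)\) are already known; the function name `h_step_coeffs` and the example autocovariances are made up.

```python
import numpy as np

def h_step_coeffs(gamma, h, n=4):
    # Toeplitz autocovariance matrix Gamma with entries gamma(|i - j|)
    Gamma = np.array([[gamma[abs(i - j)] for j in range(n)] for i in range(n)])
    # Right-hand side: gamma(h), gamma(h+1), ..., gamma(h+n-1)
    rhs = np.array([gamma[h + k] for k in range(n)])
    # Solve Gamma a = rhs for the predictor coefficients a_1, ..., a_n
    return np.linalg.solve(Gamma, rhs)

# Illustrative autocovariances gamma(k) = 0.6**k (an AR(1)-like shape)
gamma = 0.6 ** np.arange(10)
a = h_step_coeffs(gamma, h=1)   # for h = 1 these are the Yule-Walker coefficients
print(a)                        # approximately [0.6, 0, 0, 0]
```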
If \(h=1\), the above equation is the same as the Yule-Walker equation, \[ \left[ \begin{array}{c} a_1 \\ a_2 \\ a_3\\ a_4\\ \end{array}\right] \hspace2mm = \hspace2mm \left[ \begin{array}{cccc} \gamma(0) & \gamma(1) & \gamma(2) & \gamma(3) \\ \gamma(1) & \gamma(0) & \gamma(1) & \gamma(2) \\ \gamma(2) & \gamma(1) & \gamma(0) & \gamma(1) \\ \gamma(3) & \gamma(2) & \gamma(1) & \gamma(0) \\ \end{array} \right] ^{-1} \hspace2mm \left[ \begin{array}{l} \gamma(1) \\ \gamma(2) \\ \gamma(3) \\ \gamma(4) \\ \end{array} \right]. \] Therefore, \[ \left[ \begin{array}{c} a_1 \\ a_2 \\ a_3\\ a_4\\ \end{array}\right] \hspace3mm = \hspace3mm \left[ \begin{array}{c} \phi_1 \\ \phi_2 \\ \phi_3\\ \phi_4\\ \end{array}\right]. \]
Our \(1\)-step linear predictor, \[ \hat X(1) = a_0 + a_1 X_4 + a_2 X_3 + a_3 X_2 + a_4 X_1 \] should be \[ \hat X(1) = 0 + \phi_1 X_4 + \phi_2 X_3 + \phi_3 X_2 + \phi_4 X_1. \]
Note that some of the \(\phi_i\) can be 0. For example, if the true model is an AR(2), then \(\phi_3\) and \(\phi_4\) are \(0\).
So the best 1-step ahead predictor for \(X_{5}\) given \(X_1, \ldots, X_4\) is \[ \hat X(1) \hspace3mm = \hspace3mm \phi_1 X_4+ \phi_2 X_3 + \phi_3 X_2 + \phi_4 X_1 \]
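Continuing the illustrative numpy sketch above with \(h=1\) (so the coefficients `a` play the role of the \(\phi_i\)), the forecast itself is just a dot product with the observations in reverse time order; the data values below are hypothetical.

```python
# Hypothetical observations X_1, X_2, X_3, X_4
x = np.array([0.3, -0.1, 0.5, 0.2])
# hat X(1) = a_1*X_4 + a_2*X_3 + a_3*X_2 + a_4*X_1
x_hat_1 = a @ x[::-1]
```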
This predictor minimizes the prediction MSE: \[ E \Big[(\hat X(1) - X_{5})^2\Big] \hspace3mm = \hspace3mm E \Big[\big( a_0 + a_1 X_4 + \cdots + a_4 X_1 - X_{5}\big)^2\Big]. \]
Note that our original AR(4) equation says \[ X_5 \hspace3mm = \hspace3mm \phi_1 X_4+ \phi_2 X_3 + \phi_3 X_2 + \phi_4 X_1 + \epsilon_5 \]
In practice, we don’t know the actual \(\phi_1, \ldots, \phi_4\), so we must use the estimated version, \[ \hat X(1) \hspace3mm = \hspace3mm \hat \phi_1 X_4+ \hat \phi_2 X_3 + \hat \phi_3 X_2 + \hat \phi_4 X_1 \]
The minimized prediction MSE is \[ E \Big[(\hat X(h) - X_{4+h})^2\Big] = E \Big[\big( a_1 X_4 + a_2 X_3 + a_3 X_2 + a_4 X_1 \hspace2mm - \hspace2mm X_{4+h}\big)^2\Big]. \]
If \(h=1\), so that \(a_i = \phi_i\), \[ \begin{align} E \Big[(\hat X(1) - X_{5})^2\Big] & = \hspace3mm E \Big[\big( \phi_1 X_4 + \phi_2 X_3 + \phi_3 X_2 + \phi_4 X_1 \hspace2mm - \hspace2mm X_{5}\big)^2\Big] \\\\ & = \hspace3mm E \bigg[\Big\{\big( \phi_1 X_4 + \phi_2 X_3 + \phi_3 X_2 + \phi_4 X_1 \big) \hspace2mm - \hspace2mm \big( \phi_1 X_4 + \phi_2 X_3 + \phi_3 X_2 + \phi_4 X_1 + \epsilon_5 \big)\Big\}^2\bigg] \\\\ & = \hspace3mm E \Big[ \hspace2mm \epsilon_5^2 \hspace2mm \Big] \hspace5mm = \hspace5mm \sigma^2 . \end{align} \]
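A quick simulation can confirm this (entirely illustrative; the AR(4) coefficients below are made up): generate an AR(4) series, forecast each point one step ahead using the true \(\phi\)'s, and check that the empirical prediction MSE is close to \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(1)
phi = np.array([0.4, 0.2, -0.1, 0.05])      # hypothetical AR(4) coefficients
sigma = 1.0
n = 20_000
x = np.zeros(n)
eps = rng.normal(scale=sigma, size=n)
for t in range(4, n):
    # X_t = phi_1 X_{t-1} + phi_2 X_{t-2} + phi_3 X_{t-3} + phi_4 X_{t-4} + eps_t
    x[t] = phi @ x[t - 4:t][::-1] + eps[t]
# 1-step forecasts using the true phi's, and their empirical prediction MSE
pred = np.array([phi @ x[t - 4:t][::-1] for t in range(4, n)])
print(np.mean((x[4:] - pred) ** 2))          # close to sigma^2 = 1
```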
In practice, we don’t know the actual \(\sigma^2\), so we settle for \[ \hat{\mbox{pMSE}} = \hat \sigma^2. \]
Recall the general system of equations we found above: \[ \left[ \begin{array}{cccc} \gamma(0) & \gamma(1) & \gamma(2) & \gamma(3) \\ \gamma(1) & \gamma(0) & \gamma(1) & \gamma(2) \\ \gamma(2) & \gamma(1) & \gamma(0) & \gamma(1) \\ \gamma(3) & \gamma(2) & \gamma(1) & \gamma(0) \\ \end{array} \right] \hspace2mm \left[ \begin{array}{c} a_1 \\ a_2 \\ a_3\\ a_4\\ \end{array}\right] = \left[ \begin{array}{l} \gamma(h) \\ \gamma(h+1) \\ \gamma(h+2) \\ \gamma(h+3) \\ \end{array} \right]. \]
If \(h=2\), we have \[ \left[ \begin{array}{c} a_1 \\ a_2 \\ a_3\\ a_4\\ \end{array}\right] \hspace2mm = \hspace2mm \left[ \begin{array}{cccc} \gamma(0) & \gamma(1) & \gamma(2) & \gamma(3) \\ \gamma(1) & \gamma(0) & \gamma(1) & \gamma(2) \\ \gamma(2) & \gamma(1) & \gamma(0) & \gamma(1) \\ \gamma(3) & \gamma(2) & \gamma(1) & \gamma(0) \\ \end{array} \right] ^{-1} \hspace2mm \left[ \begin{array}{l} \gamma(2) \\ \gamma(3) \\ \gamma(4) \\ \gamma(5) \\ \end{array} \right]. \] Therefore, \[ \left[ \begin{array}{c} a_1 \\ a_2 \\ a_3\\ a_4\\ \end{array}\right] \hspace3mm = \hspace3mm \mathbf{\Gamma}^{-1} \, \mathbf{\gamma} \hspace3mm \approx \hspace3mm \mathbf{\hat\Gamma}^{-1} \, \mathbf{\hat \gamma}, \] where \(\mathbf{\Gamma}\) is the autocovariance matrix, \(\mathbf{\gamma}\) is the right-hand-side vector, and in practice both are replaced by their sample estimates.
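A minimal sketch of this plug-in step, reusing the hypothetical `h_step_coeffs` function from the earlier sketch; the series here is a simulated placeholder.

```python
def sample_acov(x, max_lag):
    # Sample autocovariance: gamma_hat(k) = (1/n) * sum (x_t - xbar)(x_{t+k} - xbar)
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    return np.array([np.sum(x[:n - k] * x[k:]) / n for k in range(max_lag + 1)])

x = np.random.default_rng(0).normal(size=200)   # placeholder series
g_hat = sample_acov(x, max_lag=5)               # need lags 0 through h + 3 = 5
a_hat = h_step_coeffs(g_hat, h=2)               # hat(Gamma)^{-1} hat(gamma) with h = 2
```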
The prediction mean error is zero, \[ E \Big[(\hat X(h) - X_{n+h}) \Big] = 0, \] and the prediction mean squared error is \[ E \Big[(\hat X(h) - X_{n+h})^2\Big]. \]
The 95% prediction interval is \[ \hat X(h) \pm 1.96 \sqrt{ \hat{\mbox{pMSE}} } \]
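As a small numerical illustration (the forecast and estimated pMSE values below are made up), the interval is just the point forecast plus or minus 1.96 estimated standard errors:

```python
import numpy as np

x_hat_h, pmse_hat = 1.2, 0.8            # hypothetical forecast and estimated pMSE
half_width = 1.96 * np.sqrt(pmse_hat)
interval = (x_hat_h - half_width, x_hat_h + half_width)
```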
The \(h\)-step ahead forecast error can be written as (see Cryer, p. 196) \[ X_{n+h} - \hat X(h) \hspace3mm = \hspace3mm \epsilon_{n+h} + \psi_1 \epsilon_{n+h-1} + \psi_2 \epsilon_{n+h-2} + \psi_3 \epsilon_{n+h-3} + \cdots + \psi_{h-1} \epsilon_{n+1}, \] where the \(\psi_j\) are the \(\psi\)-weights (MA(\(\infty\)) coefficients) of the process.
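Since the \(\epsilon\)'s are uncorrelated, each with variance \(\sigma^2\), squaring this decomposition and taking expectations gives \[ \mbox{pMSE}(h) \hspace3mm = \hspace3mm E \Big[(X_{n+h} - \hat X(h))^2\Big] \hspace3mm = \hspace3mm \sigma^2 \big( 1 + \psi_1^2 + \psi_2^2 + \cdots + \psi_{h-1}^2 \big). \]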
As \(h \to \infty\), this sum converges to the unconditional variance, so the prediction MSE for large \(h\) will be \[ \mbox{pMSE} \hspace3mm \to \hspace3mm \sigma^2 \sum_{j=0}^{\infty} \psi_j^2 \hspace3mm = \hspace3mm Var(X_t) \]
Consider an MA(4) model: \[ X_t \hspace3mm = \hspace3mm \epsilon_t - \theta_1 \epsilon_{t-1} - \theta_2 \epsilon_{t-2} - \theta_3 \epsilon_{t-3} - \theta_4 \epsilon_{t-4} \]
Tomorrow’s value will be generated as \[ X_{t+1} \hspace3mm = \hspace3mm \epsilon_{t+1} - \theta_1 \epsilon_{t} - \theta_2 \epsilon_{t-1} - \theta_3 \epsilon_{t-2} - \theta_4 \epsilon_{t-3} \]
It turns out that the best linear \(1\)-step predictor for the MA(4) model is \[ \hat X(1) \hspace3mm = \hspace6mm - \theta_1 \epsilon_{t} - \theta_2 \epsilon_{t-1} - \theta_3 \epsilon_{t-2} - \theta_4 \epsilon_{t-3} \]
In practice, we must use estimates of both the \(\theta\)'s and the \(\epsilon\)'s, \[ \hat X(1) \hspace3mm = \hspace6mm - \hat \theta_1 \hat \epsilon_{t} - \hat \theta_2 \hat \epsilon_{t-1} - \hat \theta_3 \hat \epsilon_{t-2} - \hat \theta_4 \hat \epsilon_{t-3} \] where the residuals \(\hat \epsilon_t\) are calculated recursively using the invertible representation.
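A minimal sketch of that recursion (illustrative code, not from the notes): the residuals are built forward in time from the invertible form \(\epsilon_t = X_t + \theta_1 \epsilon_{t-1} + \cdots + \theta_4 \epsilon_{t-4}\), with pre-sample residuals set to zero, and then plugged into the predictor; the data and \(\theta\) values in the example call are made up.

```python
import numpy as np

def ma_one_step_forecast(x, theta):
    # x: observed series X_1, ..., X_n;  theta: (theta_1, ..., theta_q)
    # Model convention: X_t = eps_t - theta_1*eps_{t-1} - ... - theta_q*eps_{t-q}
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    q = len(theta)
    eps = np.zeros(len(x))
    for t in range(len(x)):
        # Invertible form: eps_t = X_t + theta_1*eps_{t-1} + ... + theta_q*eps_{t-q},
        # with pre-sample residuals treated as 0
        past = np.array([eps[t - j] if t - j >= 0 else 0.0 for j in range(1, q + 1)])
        eps[t] = x[t] + theta @ past
    # hat X(1) = -theta_1*eps_n - theta_2*eps_{n-1} - ... - theta_q*eps_{n-q+1}
    recent = eps[-q:][::-1]
    return -(theta @ recent)

# Illustrative call with made-up data and made-up theta estimates
x_hat_1 = ma_one_step_forecast([0.4, -0.2, 0.1, 0.5, -0.3, 0.2],
                               [0.5, 0.2, 0.1, 0.05])
```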
The 1-step prediction MSE is the conditional variance \[ E[(\hat X(1) - X_{t+1})^2] = V(\epsilon_{t+1}) = \sigma^2 \]
The long-term \(h\)-step prediction (for \(h > 4\), beyond the MA order) is just the mean, \[ \hat X(h) = E(X_{t+h}) = 0 \]
The long-term \(h\)-step prediction MSE is the unconditional variance \[ E[(\hat X(h) - X_{t+h})^2] = V(X_{t+h}) \]
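For the MA(4) model above, this unconditional variance can be written out explicitly as \[ V(X_{t+h}) \hspace3mm = \hspace3mm \sigma^2 \big( 1 + \theta_1^2 + \theta_2^2 + \theta_3^2 + \theta_4^2 \big). \]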
The best 1-step linear predictor for an AR(\(p\)) model was simply the AR equation without the error term, \[ \hat X_{n+1} = \phi_1 X_{n} + \cdots + \phi_p X_{n-p+1} \]
The best 1-step linear predictor for an MA(\(q\)) model is \[ \hat X_{n+1} = - \theta_1 \epsilon_{n} - \cdots - \theta_q \epsilon_{n-q+1} \]
For both AR and MA, 1-step prediction’s MSE is \[ \mbox{pMSE} = \sigma^2 \] where \(\sigma^2\) is the variance of the error (innovation), \(\epsilon_t\).
This leads to 1-step 95% prediction interval of \[ \hat X(1) \pm 1.96 \hat \sigma \]
Looking \(h\) steps ahead, the best prediction will decay toward the process mean, \(E(X_t)=0\), as \(h\) increases.