Forecasting is a statistical technique used to make predictions about future value(s) of a quantity based on past known values. Common applications of forecasting can be found in weather forecasts, sales planning and production planning.
Definitions
In the context of forecasting, the quantity being forecast is usually referred to as the observable, the future values predicted by the process as the forecast, the past data used to generate the forecast as the observations (or the sample), and an individual observation value as a data point.
For the rest of this post, we will use the data points
[1125, 1177, 1224, 1264, 1326, 1367, 1409, 1456, 1500, 1570, 1636, 1710, 1440, 1493, 1553, 1611, 1674, 1742, 1798, 1876, 1955, 2033, 2115, 2190, 1955, 2022, 2117, 2216, 2295, 2403, 2498, 2602, 2723, 2837, 2948, 3066]
collected for some arbitrary quantity as an example.
The following image depicts these data points graphically.

Figure 1: Example observations
It is common for forecasting techniques to make predictions corresponding to at least some, if not all of the (past) observed data points (in addition to generating predictions for the future). This allows the forecast to be compared with the (actual) observations to understand forecast accuracy.
A common measure of forecast accuracy is the difference between a data point and its corresponding prediction, known as the error (that is, error = observation − prediction).
If a prediction is higher than the data point (prediction > observation), the error is negative; if both are the same (prediction = observation), the error is zero; and if the prediction is lower than the data point (prediction < observation), the error is positive. Given that the error can be negative, zero or positive, it is more common to use the square of the error, known as the squared error, which is always zero or positive. This avoids situations where negative errors cancel out positive errors exactly, thereby providing a false sense of forecast accuracy.
Forecasts for the same observations, generated using different forecasting techniques, can be compared by comparing the squared errors (or just the errors, although this is usually avoided) of the different forecasts. The lower the squared error, the higher the accuracy of the associated forecasting technique.
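The cancellation problem mentioned above is easy to demonstrate with a small Java sketch; the observation and prediction values here are made up for illustration. The raw errors sum to zero even though the forecast is clearly imperfect, while the squared errors do not:

```java
// Toy observations and predictions (made-up numbers) showing why
// squared errors are preferred over raw errors: the raw errors here
// cancel out exactly, even though the forecast is not perfect.
public class ForecastError {
    public static void main(String[] args) {
        double[] observations = {100, 110, 120};
        double[] predictions  = {105, 110, 115};
        double errorSum = 0, squaredErrorSum = 0;
        for (int i = 0; i < observations.length; i++) {
            double error = observations[i] - predictions[i]; // observation - prediction
            errorSum += error;
            squaredErrorSum += error * error;                // always >= 0
        }
        System.out.println("sum of errors: " + errorSum);                // 0.0
        System.out.println("sum of squared errors: " + squaredErrorSum); // 50.0
    }
}
```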
Forecasting basics
For most observations, there are three important points of interest. The first, known as the level, is just the simple average of all data points. The image below shows the observations introduced earlier, with their level.

Figure 2: Example observations, with level
The second, known as the trend, indicates whether the observations increase or decrease on an overall basis (from the first to the last data point). A visual inspection of the example observations above shows that they have a (significant) increasing trend.
The third and last point of interest is the seasonality, which is defined as the periodicity of any patterns embedded within the observations. The example observations above have a periodicity of 12, as there is an upward trend for 12 data points, followed by a downward movement for 1 data point, followed by a repetition of the pattern.
Simple Exponential Smoothing
It is generally seen that the more data points used to generate a forecast, the higher the forecast accuracy. This is because such a forecast has "insight" into a larger number of data points, and can therefore assess the level, trend and seasonality more accurately. However, it is reasonable to expect that more recent data points should contribute more towards a prediction than older ones, since more recent events are more likely to influence the future.
Exponential smoothing is a class of forecasting techniques that use the weighted average method for generating the forecast. These techniques use progressively decreasing weights for older observations to let more recent data points have higher influence on the forecast, and older ones to have lower influence.
The simplest of these techniques, aptly called Simple Exponential Smoothing (or Single Exponential Smoothing), uses a single parameter to assign "exponentially" lower weights to older data points. A smooth version is generated for each data point. If there are n observations, denoted by O_1, O_2, ..., O_n, their smooth versions are calculated as:
S_i = αO_{i−1} + (1 − α)S_{i−1}, or alternatively,
S_i = S_{i−1} + α(O_{i−1} − S_{i−1})
where i ranges from 2 to n, α is called the smoothing parameter, and S_i are the smooth versions of the observations O_i. There is no smooth version S_1 for the first observation O_1. The value of α ranges between 0 and 1.
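As a sketch, the smoothing recursion can be implemented directly in Java. The class and method names are illustrative, and S_2 is seeded with O_1 (the same initialization used later in this post when deriving the Jacobian):

```java
// Simple exponential smoothing. Using 0-based arrays, s[i] holds the
// smooth version of observation o[i], so array index i corresponds to
// subscript i + 1 in the post's notation. s[0] is unused since S_1
// does not exist, and s[1] is seeded with o[0] (that is, S_2 = O_1).
public class Smoothing {
    static double[] smooth(double[] o, double alpha) {
        double[] s = new double[o.length];
        s[1] = o[0];
        for (int i = 2; i < o.length; i++)
            s[i] = alpha * o[i - 1] + (1 - alpha) * s[i - 1];
        return s;
    }

    public static void main(String[] args) {
        double[] o = {10, 12, 14, 16};
        double[] s = smooth(o, 0.5);
        for (int i = 1; i < s.length; i++)
            System.out.println("S_" + (i + 1) + " = " + s[i]);
    }
}
```

For example, with observations {10, 12, 14, 16} and α = 0.5, this yields S_2 = 10, S_3 = 0.5 × 12 + 0.5 × 10 = 11, and S_4 = 0.5 × 14 + 0.5 × 11 = 12.5.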
a. Effect of the smoothing parameter
When α = 0, 1 − α = 1, and the formula for calculating the smooth versions becomes S_i = S_{i−1}, that is, each forecast value is the same and has no dependence on the observations. Consequently, the forecast has no predictive value at all (since each predicted value is the same).
Similarly, when α = 1, 1 − α = 0, and the formula becomes S_i = O_{i−1}, that is, each forecast value is the same as the previous observation. Consequently, this too has no predictive value at all.
For these reasons, neither 0 nor 1 is a good value for α.
The quantity mean squared error (MSE) is calculated as the simple average of the squared errors (of which there are n − 1), that is:
MSE = (1/(n − 1)) × [(O_2 − S_2)² + (O_3 − S_3)² + ... + (O_n − S_n)²]
MSE is a measure of the forecast accuracy. The lower the MSE, the more accurate the forecast, and vice versa. Therefore, in real-world scenarios, the value of α is chosen such that the MSE is minimized.
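The MSE computation can be sketched similarly; the class and method names are illustrative, and the divisor n − 1 (the number of error terms) is a choice made here:

```java
// Mean squared error of a simple exponential smoothing forecast.
// s[i] holds the smooth version of o[i]; s[1] is seeded with o[0].
public class MeanSquaredError {
    static double[] smooth(double[] o, double alpha) {
        double[] s = new double[o.length];
        s[1] = o[0];
        for (int i = 2; i < o.length; i++)
            s[i] = alpha * o[i - 1] + (1 - alpha) * s[i - 1];
        return s;
    }

    static double mse(double[] o, double alpha) {
        double[] s = smooth(o, alpha);
        double sum = 0;
        for (int i = 1; i < o.length; i++)
            sum += Math.pow(o[i] - s[i], 2); // squared error of each prediction
        return sum / (o.length - 1);         // average over the n - 1 error terms
    }

    public static void main(String[] args) {
        double[] o = {10, 12, 14};
        System.out.println("MSE at alpha = 0.5: " + mse(o, 0.5)); // 6.5
    }
}
```

With observations {10, 12, 14} and α = 0.5, the smooth versions are S_2 = 10 and S_3 = 11, the squared errors are 4 and 9, and the MSE is 13 / 2 = 6.5.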
b. Finding optimal α
The optimal value of α can be determined through trial and error: begin with some value of the smoothing parameter between 0 and 1 and calculate the MSE, then vary the smoothing parameter by small amounts, recalculating the MSE each time and comparing it with the previous value to see if it decreases or increases.
The trial-and-error method can be quite time-consuming if the observations fluctuate too much and if the value of α is required to a high degree of accuracy, say, up to a certain number of decimal places. It is therefore commonplace to use an optimization method to find the optimal value for α. One such method is the Levenberg-Marquardt (LM) method.
The LM method relies upon finding the Jacobian of a function at every point to determine if the search for the optimal parameter value is progressing in the right direction. The Jacobian of a function S of k parameters x_1, ..., x_k is a matrix whose elements are given by J_{ik} = ∂S_i/∂x_k, where S_i are the values of the function S at distinct points indexed by i.
In the case of simple exponential smoothing, there is only one parameter α that impacts the predicted value for a given observed value. Therefore, the Jacobian (J) depends on only this one parameter and reduces to J_i = ∂S_i/∂α, or
J = [J_2, J_3, ..., J_n] = [∂S_2/∂α, ∂S_3/∂α, ..., ∂S_n/∂α]
Given that the function S is defined for simple exponential smoothing as S_i = S_{i−1} + α(O_{i−1} − S_{i−1}) (see above), J_i = ∂S_i/∂α becomes
J_i = ∂(S_{i−1} + α(O_{i−1} − S_{i−1}))/∂α, or
(by the sum rule of differentiation)
J_i = ∂S_{i−1}/∂α + ∂(α(O_{i−1} − S_{i−1}))/∂α, or
(distributing α and applying the sum rule again)
J_i = ∂S_{i−1}/∂α + ∂(αO_{i−1})/∂α − ∂(αS_{i−1})/∂α, or
(by the product rule of differentiation)
J_i = ∂S_{i−1}/∂α + α ∂O_{i−1}/∂α + O_{i−1} ∂α/∂α − α ∂S_{i−1}/∂α − S_{i−1} ∂α/∂α, or
(upon simplification, since ∂O_{i−1}/∂α = 0, given that O_{i−1} does not depend on α, and ∂α/∂α = 1)
J_i = ∂S_{i−1}/∂α + O_{i−1} − α ∂S_{i−1}/∂α − S_{i−1}, or
(upon rearrangement of terms)
J_i = O_{i−1} − S_{i−1} + (1 − α) ∂S_{i−1}/∂α, or
(using the fact that ∂S_{i−1}/∂α = J_{i−1})
J_i = O_{i−1} − S_{i−1} + (1 − α)J_{i−1}
Since S_2 = O_1, J_2 = ∂S_2/∂α = 0. This gives an initial value for the Jacobian that can be used to derive the rest using the recursive formula derived above, for any given value of α, that is:
J_2 = 0
J_3 = O_2 − S_2 + (1 − α)J_2
J_4 = O_3 − S_3 + (1 − α)J_3
J_5 = O_4 − S_4 + (1 − α)J_4
... and so on.
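The recursion can be checked numerically, since each J_i should agree with the finite-difference estimate (S_i(α + h) − S_i(α − h)) / 2h. A sketch, with illustrative names:

```java
// Jacobian of the smooth values with respect to alpha, via the
// recursion J_i = O_(i-1) - S_(i-1) + (1 - alpha) * J_(i-1), J_2 = 0.
// As before, array index i corresponds to subscript i + 1.
public class Jacobian {
    static double[] smooth(double[] o, double alpha) {
        double[] s = new double[o.length];
        s[1] = o[0];
        for (int i = 2; i < o.length; i++)
            s[i] = alpha * o[i - 1] + (1 - alpha) * s[i - 1];
        return s;
    }

    static double[] jacobian(double[] o, double alpha) {
        double[] s = smooth(o, alpha);
        double[] j = new double[o.length];
        j[1] = 0; // J_2 = 0, since S_2 = O_1 does not depend on alpha
        for (int i = 2; i < o.length; i++)
            j[i] = o[i - 1] - s[i - 1] + (1 - alpha) * j[i - 1];
        return j;
    }

    public static void main(String[] args) {
        double[] o = {10, 12, 14, 16, 13};
        double alpha = 0.4, h = 1e-6;
        double[] j = jacobian(o, alpha);
        double[] up = smooth(o, alpha + h), down = smooth(o, alpha - h);
        for (int i = 2; i < o.length; i++) {
            double fd = (up[i] - down[i]) / (2 * h); // finite-difference check
            System.out.printf("J_%d = %.6f (finite difference: %.6f)%n", i + 1, j[i], fd);
        }
    }
}
```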
The Apache Commons Math library provides an excellent implementation of the LM method, should there be a need to write a software program for finding optimal α for simple exponential smoothing.
For the example observations above, an optimal α of 0.5522115480262714 is obtained. Using this value yields the following forecast:

Figure 3: Example observations, with level and forecast
The following Java code shows how to use the Apache Commons Math library (version 3.6.1 to be precise) to find the optimal value for α (Java 8 or higher required).
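The original listing is not reproduced here. As a library-free alternative, the Levenberg-Marquardt iteration for this one-parameter problem is small enough to sketch directly, using the Jacobian recursion derived above; the class and method names, starting point, and damping schedule below are illustrative choices, not the library's API:

```java
// A minimal one-parameter Levenberg-Marquardt loop for finding the
// alpha that minimizes the sum of squared errors of a simple
// exponential smoothing forecast. The damping schedule (halve lambda
// on an accepted step, multiply by 10 on a rejected one) is a simple
// choice for illustration.
public class OptimalAlpha {
    static double[] smooth(double[] o, double alpha) {
        double[] s = new double[o.length];
        s[1] = o[0];
        for (int i = 2; i < o.length; i++)
            s[i] = alpha * o[i - 1] + (1 - alpha) * s[i - 1];
        return s;
    }

    static double sse(double[] o, double alpha) {
        double[] s = smooth(o, alpha);
        double sum = 0;
        for (int i = 1; i < o.length; i++) sum += Math.pow(o[i] - s[i], 2);
        return sum;
    }

    static double optimalAlpha(double[] o) {
        double alpha = 0.5, lambda = 1e-3;
        for (int iter = 0; iter < 500; iter++) {
            double[] s = smooth(o, alpha);
            double jtj = 0, jtr = 0, j = 0; // j starts at J_2 = 0
            for (int i = 1; i < o.length; i++) {
                if (i >= 2) j = o[i - 1] - s[i - 1] + (1 - alpha) * j;
                double r = o[i] - s[i]; // residual; note dr/dalpha = -J_i
                jtj += j * j;
                jtr += j * r;
            }
            // Damped Gauss-Newton step, clamped to the valid range of alpha.
            double step = jtr / (jtj + lambda);
            double candidate = Math.max(0.0, Math.min(1.0, alpha + step));
            if (sse(o, candidate) < sse(o, alpha)) {
                alpha = candidate;
                lambda *= 0.5;
            } else {
                lambda *= 10;
            }
            if (Math.abs(step) < 1e-12) break;
        }
        return alpha;
    }

    public static void main(String[] args) {
        double[] o = {1125, 1177, 1224, 1264, 1326, 1367, 1409, 1456, 1500,
                      1570, 1636, 1710, 1440, 1493, 1553, 1611, 1674, 1742,
                      1798, 1876, 1955, 2033, 2115, 2190, 1955, 2022, 2117,
                      2216, 2295, 2403, 2498, 2602, 2723, 2837, 2948, 3066};
        System.out.println("optimal alpha = " + optimalAlpha(o));
    }
}
```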
Reference
An excellent overview of simple exponential smoothing can be found here.