Average Temperature Forecasting

In this notebook we go through the steps of loading, cleaning, and visualizing a time series dataset, then build a model that can predict daily temperatures. The data used in this notebook is from NOAA Climate Data Online.

The model we will use is Prophet by Facebook. Prophet fits its models with Stan and follows the scikit-learn model API (fit and predict), which lets us construct models from processed time series datasets with little effort.

Requirement

Tools used

Files

Loading Data and Preprocessing

Data Visualization

There are four components to a time series:

  1. Trend: Long-term increase or decrease in the series.
  2. Seasonality: Repeating cycles of fixed length. For example, temperatures rise every year during the summer months.
  3. Cyclical: Rises and falls in the series that do not follow a fixed period.
  4. Noise: Random variation.

We can get a visualization of these quantities using the statsmodels library.

Interesting

Stationary Assumption for Time Series Models

Most time series (TS) models require that the data be stationary to work well. When a dataset is stationary, its statistical quantities are constant over the time period of the data. For example, the mean, variance, and autocovariance are constant.

We want our data to be stationary so our model can accurately predict data at all points of interest.
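One informal way to see what "constant statistics" means is to compare the mean and variance of the first and second halves of a series. The sketch below does this on synthetic data; the helper name `half_stats` and both series are hypothetical and only serve as an illustration.

```python
import numpy as np

def half_stats(values):
    """Return the mean and variance of the first and second half of a series."""
    values = np.asarray(values, dtype=float)
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return {
        "mean": (first.mean(), second.mean()),
        "var": (first.var(), second.var()),
    }

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, 1000)            # stationary: no trend
trending = noise + np.linspace(0.0, 10.0, 1000)  # non-stationary: mean drifts upward

noise_stats = half_stats(noise)
trend_stats = half_stats(trending)

# For the stationary series the two means are close;
# for the trending series they differ by roughly half the total drift.
noise_gap = abs(noise_stats["mean"][1] - noise_stats["mean"][0])
trend_gap = abs(trend_stats["mean"][1] - trend_stats["mean"][0])
```

This is only a quick visual-style check. Formal tests such as ADF and KPSS give a more reliable answer.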

Let's take a look at our yearly average data again...

This plot shows trend and seasonal effects on the data. Statistics are not consistent over time. There is also a trend change at around 1960.

Our histogram shows a truncated normal distribution. This is another sign of non-stationary data. We can also run an Augmented Dickey-Fuller (ADF) test to determine whether our data has a unit root. The presence of a unit root indicates a stochastic (random) trend in the data, which means the data is non-stationary.

Our test statistic is greater (less negative) than the critical values, so we cannot reject the null hypothesis of a unit root. Based on the Dickey-Fuller test, this series is non-stationary.

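The KPSS test, whose null hypothesis is that the series is stationary, concludes that our series is stationary. Since the KPSS test indicates stationarity while the ADF test indicates a unit root, we have a trend-stationary series: it becomes stationary once the trend is removed.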

Model Training

Before we start to train our model, we need to partition our data so we can test the model's accuracy later. Here, we will set aside the 2018-2019 data as the test set. We have lots of data, so departing from the usual 80:20 split is fine here.
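A minimal sketch of this split and of fitting Prophet on the training portion is shown below. It assumes the data has already been reshaped into the two columns Prophet expects, `ds` (dates) and `y` (values). The synthetic data and the helper names `split_by_date` and `fit_and_forecast` are hypothetical. Prophet is imported inside the function so the split can be tried without Prophet installed.

```python
import numpy as np
import pandas as pd

def split_by_date(df, cutoff, date_col="ds"):
    """Rows before `cutoff` go to the training set, the rest to the test set."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff].reset_index(drop=True)
    test = df[df[date_col] >= cutoff].reset_index(drop=True)
    return train, test

def fit_and_forecast(train, horizon_days):
    """Fit Prophet on `train` and forecast `horizon_days` past its last date."""
    from prophet import Prophet  # older releases ship as `fbprophet`

    model = Prophet()
    model.fit(train)  # expects columns `ds` and `y`
    future = model.make_future_dataframe(periods=horizon_days)
    return model.predict(future)  # includes `yhat`, `yhat_lower`, `yhat_upper`

# Synthetic stand-in for the cleaned temperature data (assumption).
dates = pd.date_range("2010-01-01", "2019-12-31", freq="D")
day = dates.dayofyear.to_numpy()
df = pd.DataFrame({"ds": dates, "y": 15.0 + 10.0 * np.sin(2 * np.pi * day / 365.25)})

# Hold out 2018-2019 for testing.
train, test = split_by_date(df, "2018-01-01")
```

With Prophet installed, `forecast = fit_and_forecast(train, horizon_days=len(test))` would produce predictions covering the held-out period.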

Model Evaluation

Time to bring back our test dataset that we put aside.
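Two common ways to score the forecast against the held-out data are mean absolute error (MAE) and root mean squared error (RMSE). The sketch below is an assumption about how the comparison could be done, not necessarily the metric used in this notebook, and the sample values are made up.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error: average size of the errors, in the data's units."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def rmse(actual, predicted):
    """Root mean squared error: like MAE, but penalizes large errors more."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up observed and predicted temperatures for three days.
actual = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 5.0]

score_mae = mae(actual, predicted)
score_rmse = rmse(actual, predicted)
```

In practice, `actual` would be the `y` column of the test set and `predicted` the `yhat` column of the forecast, aligned on the same dates.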