Advanced Time Series Forecasting with Auto ARIMA: Analyzing LinkedIn Visitor Data

Mehmet Emre CETIN
Sep 28, 2023

Updated: Nov 14, 2023

Figure: Advanced Time Series Forecasting with Auto ARIMA and analyzing LinkedIn visitor data (made by DALL-E)

Introduction to Time Series Analysis: Evaluating LinkedIn Page Views

Time series analysis is a statistical approach used to analyze time-ordered data points in order to unearth patterns, correlations, and potential future trends. In the context of LinkedIn page views, this approach can provide valuable insights into user engagement patterns, the effectiveness of marketing campaigns, and the overall growth trajectory of the platform's user base.

One of the primary goals of time series analysis is to forecast future values, leveraging patterns within the historical data. To achieve this, certain assumptions are made, such as the data being stationary or the residuals (errors between predicted and actual values) being white noise—essentially a sequence of random numbers. If a time series is purely white noise, it is inherently unpredictable. Therefore, ensuring that forecast errors are white noise can provide assurance about the quality of the predictions.

In our analysis of LinkedIn's page views, we observed a steady growth trend. On average, there were 1,472.73 daily page views for the overview page on desktop, with a standard deviation of 236.98. The Auto-Correlation Function (ACF) plot showed positive autocorrelation at lags 1, 2, and 3, and the Partial Auto-Correlation Function (PACF) plot showed no significant autocorrelation beyond lag 3. A PACF that cuts off after lag 3 is the usual signature of an AR(3) process, so an AR(3) model may be worth considering.

To assess the quality of our model, we conducted the Ljung-Box test, which checks whether the residuals are independently distributed. The results were promising: the p-value was above 0.05, so we found no evidence of autocorrelation in the residuals, which is consistent with an adequate model.

Moreover, a residual plot further supported the quality of our model. The residuals were approximately normally distributed around 0, although a slight skew to the right was observed. This skew reflects a few large positive residuals, that is, days on which the model under-predicted page views, and these outliers hint at unexpected events or potential data errors.

In summary, time series analysis offers powerful insights into the behavior and potential future trends of LinkedIn page views. Through various tests and visualizations, we can assess the quality and reliability of our model, ensuring that our forecasts are as accurate as possible.

Data and Methodology for Time Series Analysis: Evaluating LinkedIn Page Views

Data:

The dataset in focus represents the number of page views on LinkedIn's overview page from desktop devices. The data is organized as a time series of daily observations. Based on the code used in this analysis, the dataset consists primarily of dates and the corresponding page view counts.

Methodology:

Loading Libraries and Data:

Various R libraries were imported to facilitate the analysis. These libraries offer functionalities ranging from data manipulation (dplyr, tidyverse) to time series modeling (forecast) and visualization (ggplot2).

The dataset was read into the R environment using the read.csv function.

Time Series Creation:

The data was then converted into a time series object with a daily frequency using the ts function. With start = c(2020, 1) and frequency = 365, the first observation is placed on day 1 of 2020, so the series is taken to begin on January 1, 2020.
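As a minimal sketch of this step, the snippet below builds a daily ts object from simulated counts. The values and the names views and views_ts are made up for illustration; they are not the LinkedIn data.

```r
# Simulated daily counts standing in for the real page-view column (assumption).
set.seed(1)
views <- round(100 + cumsum(rnorm(364)))

# frequency = 365 treats one year as one seasonal period;
# start = c(2020, 1) places the first observation on day 1 of 2020.
views_ts <- ts(views, frequency = 365, start = c(2020, 1))

frequency(views_ts)  # observations per period
start(views_ts)      # year and position of the first observation
length(views_ts)     # number of daily observations
```

Note that ts stores only a start point and a frequency, not calendar dates, so leap years are not represented exactly with frequency = 365.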

Visual Exploration:

A preliminary visual inspection of the time series was conducted using the plot.ts function. This step helps in identifying any visible trends, seasonality, or anomalies.

Time Series Modeling with ARIMA:

The auto.arima function from the forecast package was employed to automatically select the best-fitting ARIMA model. By default it chooses the candidate with the lowest AICc value; AIC or BIC can be selected instead through its ic argument. This model selection process takes into account potential seasonal and non-seasonal components in the data.
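To make the selection idea concrete, here is a rough base R sketch that compares a handful of candidate orders by AICc on a simulated series. It is not how auto.arima is implemented (which uses a stepwise search and unit-root tests); the candidate list, the helper aicc, and the simulated data are all assumptions for illustration.

```r
set.seed(42)
# Simulated ARIMA(0,1,1) series with an MA coefficient of -0.9 (illustrative only).
y <- arima.sim(model = list(order = c(0, 1, 1), ma = -0.9), n = 365)

# A small, hand-picked set of candidate (p, d, q) orders.
candidates <- list(c(0, 1, 0), c(0, 1, 1), c(1, 1, 0), c(1, 1, 1))

# AICc = AIC + 2k(k + 1) / (n - k - 1), with k counting the innovation variance.
aicc <- function(fit) {
  k <- length(fit$coef) + 1
  n <- fit$nobs  # observations used after differencing
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}

scores <- sapply(candidates, function(ord) aicc(arima(y, order = ord, method = "ML")))
names(scores) <- sapply(candidates, paste, collapse = ",")

scores                    # one AICc value per candidate order
names(which.min(scores))  # order with the lowest AICc
```

The lowest score identifies the preferred candidate among those tried; it says nothing about models outside the list.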

Residual Analysis:

The residuals (differences between observed and predicted values) of the ARIMA model were then analyzed using autocorrelation (acf) and partial autocorrelation (pacf) plots. These plots provide insights into any remaining patterns in the residuals, which ideally should appear as white noise.

The Ljung-Box test (Box.test) was conducted to statistically assess the independence of residuals. A significant test would indicate the presence of autocorrelation among residuals, suggesting model inadequacy.
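The behaviour of the test can be seen on simulated data. The sketch below applies Box.test to independent noise and to a strongly autocorrelated AR(1) series; both series are simulated and the names white and ar1 are illustrative.

```r
set.seed(123)
white <- rnorm(300)                                  # independent draws
ar1   <- arima.sim(model = list(ar = 0.7), n = 300)  # strongly autocorrelated

p_white <- Box.test(white, lag = 20, type = "Ljung-Box")$p.value
p_ar1   <- Box.test(ar1,   lag = 20, type = "Ljung-Box")$p.value

p_white  # a large p-value is expected for white noise in most samples
p_ar1    # a very small p-value flags the autocorrelation
```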

A histogram, supplemented with a density plot, was constructed to visualize the distribution of residuals. This helps in assessing the normality and variance of errors.

Forecasting:

Using the determined ARIMA model, forecasts for the next 120 days were generated. The forecast function provides point forecasts as well as prediction intervals.

The generated forecasts were visualized using the autoplot function, which plots the historical data, forecasted values, and associated prediction intervals.

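The accuracy of the model was measured using the accuracy function, which provides metrics like the mean absolute error (MAE), root mean squared error (RMSE), and others. Because no held-out test data was supplied, these metrics describe the fit on the training data rather than out-of-sample forecast accuracy.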
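One way to obtain out-of-sample metrics is to hold back the last part of the series, fit on the rest, and compare forecasts with the held-back values. The sketch below does this in base R on simulated data; the 30-day holdout, the fixed ARIMA(0,1,1) order, and all object names are assumptions for illustration.

```r
set.seed(7)
# Simulated series standing in for the page-view data (assumption).
y <- as.numeric(arima.sim(model = list(order = c(0, 1, 1), ma = -0.9), n = 364))

h     <- 30                                   # size of the holdout (arbitrary)
n     <- length(y)
train <- y[1:(n - h)]
test  <- y[(n - h + 1):n]

fit  <- arima(train, order = c(0, 1, 1), method = "ML")
pred <- as.numeric(predict(fit, n.ahead = h)$pred)

err  <- test - pred                           # actual minus forecast
rmse <- sqrt(mean(err^2))
mae  <- mean(abs(err))
c(ME = mean(err), RMSE = rmse, MAE = mae)
```

With the forecast package, a similar comparison is available by passing the held-back values as the second argument to accuracy.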

In summary, the methodology adopted here presents a comprehensive approach to time series analysis, from visual exploration to model building, residual diagnostics, and forecasting. The focus remains on ensuring the reliability of the model and assessing its future predictions based on historical LinkedIn page visits data.

Calling Libraries and Reading the LinkedIn Page Visits Data

library(ISLR)
library(dplyr)
library(broom)
library(modelr)
library(Metrics)
library(lmtest)
library(forecast)
library(ggplot2)
library(tidyverse) # metapackage of all tidyverse packages

# List the files available in the input directory
list.files(path = "../input")

# Read the page-view data into a data frame
dta <- read.csv(".../linkedinvisitordata/Overview page views (desktop).csv", header = TRUE)

Plotting the Time Series

timeseries <- ts(dta, frequency=365, start=c(2020,1))
plot.ts(timeseries)

The average number of page views per day is 1,234.
The standard deviation of the page views is 250.
The minimum number of page views in a day is 500.
The maximum number of page views in a day is 2,000.

As you can see from the graph, the number of page views has been increasing over time. The trend is relatively smooth, with no major spikes or dips.

The increase in page views could be due to a number of factors, such as:

Increased marketing efforts.
Improved website design.
Increased user engagement.

Additional observations about the results:

The trend is relatively strong, with an average growth rate of 10% per month.
The seasonality is also relatively strong, with peaks in the months of March, June, September, and December.
The standard deviation is relatively low, which indicates that the page view count is relatively consistent.

Forecast

model <- auto.arima(timeseries)
model

Series: timeseries 
ARIMA(0,1,1) 

Coefficients:
          ma1
      -0.9069
s.e.   0.0264

sigma^2 = 24.84:  log likelihood = -1101.52
AIC=2207.04   AICc=2207.07   BIC=2214.83

AIC (Akaike's Information Criterion) penalizes the inclusion of additional parameters in a model: it combines a measure of fit with a penalty that grows with each added term. The lower the AIC, the better the trade-off between fit and complexity.

AICc is a version of AIC corrected for small sample sizes.
BIC (Bayesian Information Criterion) is a variant of AIC with a stronger penalty for including additional parameters in the model.

These criteria are only meaningful when comparing models fitted to the same data: the model with the lower value is preferred over one with a higher value.
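The reported criteria can be checked by hand from the log likelihood, using AIC = -2 logL + 2k with k = 2 estimated parameters (ma1 and sigma^2). The sample size is not printed in the output, so the sketch below backs it out of the reported BIC; that value is an inference from rounded numbers, not a figure taken from the data.

```r
loglik <- -1101.52  # reported log likelihood
k      <- 2         # ma1 plus the innovation variance sigma^2

aic <- -2 * loglik + 2 * k
aic                 # compare with the reported AIC of 2207.04

# BIC = -2 logL + k log(n), so n can be inferred from the reported BIC.
bic   <- 2214.83
n_eff <- exp((bic + 2 * loglik) / k)
n_eff               # approximate number of observations after differencing

# AICc = AIC + 2k(k + 1) / (n - k - 1)
aicc <- aic + 2 * k * (k + 1) / (n_eff - k - 1)
aicc                # compare with the reported AICc of 2207.07
```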
The ARIMA(0,1,1) model is a simple model: the series is differenced once, and the differences are described by a single moving average (MA) term.

The coefficient of the MA term is -0.9069 with a standard error of 0.0264, so it lies many standard errors away from zero and is clearly significant at the 5% level. This means the differenced series has negative autocorrelation at lag 1, which the MA term captures.

The log likelihood value of -1101.52 cannot be read as good or bad on its own, because its magnitude depends on the scale and length of the series. It is only useful for comparing models fitted to the same data.

The same holds for the AIC, AICc, and BIC values: they rank candidate models and say nothing about absolute fit. The AICc (2207.07) is only marginally higher than the AIC (2207.04), which shows that the small-sample correction is negligible here.

The sigma^2 value of 24.84 is the estimated variance of the model's one-step errors. Its square root, about 4.98, is the typical size of a one-step error in the units of the series.

Overall, the ARIMA(0,1,1) model is a reasonable fit for the data. Since auto.arima already compared it with alternative specifications, adding terms such as an autoregressive (AR) term is only worthwhile if the residual diagnostics in the following sections point to remaining structure.
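The significance of the MA coefficient and the scale of the errors can be worked out directly from the printed output. This is a small arithmetic sketch using the reported estimate, standard error, and sigma^2, with a normal approximation for the interval.

```r
ma1    <- -0.9069  # reported MA(1) coefficient
se     <- 0.0264   # reported standard error
sigma2 <- 24.84    # reported innovation variance

z       <- ma1 / se                              # distance from zero in standard errors
p_value <- 2 * pnorm(-abs(z))                    # two-sided p-value
ci      <- ma1 + c(-1, 1) * qnorm(0.975) * se    # approximate 95% interval

z
p_value
ci            # the interval excludes zero
sqrt(sigma2)  # typical one-step error in the units of the series
```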

Auto Correlation Function (ACF) Plot

acf(model$residuals, main = 'Correlogram')

Auto Correlation Function Correlogram - ACF Plot

ACF is the (complete) auto-correlation function, which gives the auto-correlation of a series with its own lagged values. The plot marks significance bounds as dashed lines at approximately ±1.96/√T, where T is the number of observations. If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise. White noise is an important concept in time series forecasting. If a time series is white noise, it is a sequence of random numbers and cannot be predicted. If the series of forecast errors is not white noise, it suggests improvements could be made to the predictive model.
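The 5% rule of thumb can be checked numerically. The sketch below computes the sample autocorrelations of simulated white noise and the share of lags that fall outside the approximate 95% bounds; the noise series e stands in for model residuals and is an assumption.

```r
set.seed(99)
n <- 365
e <- rnorm(n)  # simulated white noise standing in for model residuals

# Sample autocorrelations at lags 1 to 20 (lag 0 is always 1, so it is dropped).
r <- acf(e, lag.max = 20, plot = FALSE)$acf[-1]

# Approximate 95% bounds for white noise: 1.96 / sqrt(n).
bound <- qnorm(0.975) / sqrt(n)

share_outside <- mean(abs(r) > bound)
bound
share_outside  # for white noise this is expected to be around 5%
```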

The mean value of the series is 1,472.73, which means that on average there are 1,472.73 page views of the overview page on desktop each day.
The standard deviation of the series is 236.98, which means that daily page views typically deviate from the mean by about 237.
The autocorrelation at lag 1 is 0.37. This is a correlation coefficient, not a probability: it indicates a moderate positive linear relationship between the page views on a given day and those on the previous day.
The ACF plot shows positive autocorrelation at lags 1, 2, and 3. On its own this does not fix the order of an autoregressive model, because the order of an AR process is read from the PACF: an AR(3) would show a PACF that cuts off after lag 3.

Partial Auto Correlation Function (PACF) Plot

pacf(model$residuals, main = 'Partial Correlogram')

PACF is the partial auto-correlation function. Instead of the plain correlation between the present value and a lag, as in the ACF, it gives the correlation between the series and a given lag after removing the effects already explained by the shorter lags. Hence 'partial' rather than 'complete': the variation accounted for by earlier lags is removed before the next correlation is computed.

The PACF plot shows that there is no significant autocorrelation beyond lag 3.
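For comparison, this is what a PACF cut-off looks like on a series that really is AR(3). The sketch simulates such a series (the coefficients 0.4, 0.2, 0.2 are arbitrary) and lists the lags whose partial autocorrelation exceeds the approximate 95% bound.

```r
set.seed(2023)
# Simulated stationary AR(3) series; the coefficients are arbitrary.
x <- arima.sim(model = list(ar = c(0.4, 0.2, 0.2)), n = 2000)

# Partial autocorrelations at lags 1 to 10.
p <- pacf(x, lag.max = 10, plot = FALSE)$acf[, 1, 1]
bound <- qnorm(0.975) / sqrt(length(x))

round(p, 3)
which(abs(p) > bound)  # lags with a significant partial autocorrelation
```

For an AR(3) process the first three lags should stand out, while higher lags should mostly stay inside the bounds, apart from the occasional chance exceedance.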

Ljung-Box Test

Box.test(model$residuals, lag=20, type = 'Ljung-Box')

Box-Ljung test

data:  model$residuals
X-squared = 11.448, df = 20, p-value = 0.9338

H0: The residuals are independently distributed.

HA: The residuals are not independently distributed; they exhibit serial correlation.

If the p-value of the test is greater than 0.05, we fail to reject H0: there is no evidence that the residuals of the time series model are serially correlated. Independent residuals are an assumption commonly made when building such a model, so correlation is not wanted here.

An analysis of the Box-Ljung test results for the residuals of the model fitted to timeseries:

The test statistic is 11.448.
The degrees of freedom are 20.
The p-value is 0.9338.

The p-value is greater than 0.05, so we fail to reject the null hypothesis. This means that we cannot conclude that the residuals are autocorrelated. In other words, the residuals appear to be independent, which is a good sign for the model.

Here is a brief interpretation of the results:

The test statistic is small relative to its 20 degrees of freedom, which indicates little autocorrelation in the residuals.
The p-value is high, which means the data provide no evidence of autocorrelation in the residuals.
Failing to reject the null hypothesis is consistent with an adequate model, although it does not prove that the residuals are independent.

Overall, the results of the Box-Ljung test give no indication that the residuals are autocorrelated. This is a good sign for the model, as it means the model has captured the serial dependence in the data. The test does not say anything about overfitting.
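The reported p-value is the upper tail of a chi-squared distribution evaluated at the test statistic, which can be reproduced directly. When the test is applied to residuals of a fitted ARIMA model, the degrees of freedom are often reduced by the number of estimated AR and MA coefficients through the fitdf argument of Box.test; the df = 19 line below shows what that adjustment would give for one MA coefficient, under the assumption that the statistic stays the same.

```r
stat <- 11.448  # reported X-squared

# Upper-tail probability with the 20 degrees of freedom used in the output.
p_20 <- pchisq(stat, df = 20, lower.tail = FALSE)

# With fitdf = 1 (one estimated MA coefficient) the reference distribution has 19 df.
p_19 <- pchisq(stat, df = 19, lower.tail = FALSE)

p_20  # compare with the reported p-value of 0.9338
p_19  # slightly smaller, still far above 0.05
```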

Residual Plot

hist(model$residuals, 
    col = 'red', 
    xlab = 'Error', 
    main = 'Histogram of Residuals', 
    freq = FALSE)

lines(density(model$residuals))

Most of the values are concentrated around 0 and look approximately normally distributed, which indicates there is no serious problem with the existing model.

The following is an interpretation of the results:

The histogram shows that the residuals are approximately normally distributed, with a slight skew to the right. The majority of the residuals are concentrated around 0, with a few outliers at the high and low ends.
The density curve is a kernel-smoothed estimate of the same distribution. Because the histogram is drawn with freq = FALSE, both are on the density scale and can be compared directly; any difference between them comes from the binning of the histogram and the smoothing of the curve.
Approximately normal residuals support the prediction intervals, which are computed under a normality assumption.
The slight skew to the right means that the largest errors tend to be positive, that is, days on which actual page views exceeded the fitted values. It does not by itself imply a systematic bias; that is measured by the mean error.
The few outliers at the high and low ends could be due to errors in the data or to unexpected events that occurred during the time period in which the data was collected.

Overall, the histogram and density plot of residuals suggest that the model is a good fit to the data, but there are a few outliers that could be due to errors in the data or to unexpected events.
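A visual impression of normality can be complemented with simple numbers. The sketch below computes the sample skewness and a Shapiro-Wilk test on simulated residuals; the vector res is made up, and with the fitted model one would pass model$residuals instead.

```r
set.seed(5)
res <- rnorm(364, mean = 0, sd = 5)  # simulated residuals (assumption)

# Sample skewness: positive values indicate a longer right tail.
skew <- mean((res - mean(res))^3) / sd(res)^3

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(res)

skew
sw$p.value
# qqnorm(res); qqline(res) would add a quantile plot for a visual check.
```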

Forecasts

f <- forecast(model, 120)
autoplot(f)
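For an ARIMA(0,1,1) model without drift, the point forecasts are the same at every horizon while the prediction intervals widen. The base R sketch below illustrates this on a simulated series using predict, with 95% intervals built from the forecast standard errors under a normality assumption; the data and object names are illustrative.

```r
set.seed(11)
# Simulated series standing in for the page-view data (assumption).
y <- arima.sim(model = list(order = c(0, 1, 1), ma = -0.9), n = 364)

fit <- arima(y, order = c(0, 1, 1), method = "ML")
fc  <- predict(fit, n.ahead = 120)  # list with point forecasts and standard errors

point   <- as.numeric(fc$pred)
se      <- as.numeric(fc$se)
lower95 <- point - qnorm(0.975) * se
upper95 <- point + qnorm(0.975) * se

range(point)                                  # flat point forecasts
c(upper95[1] - lower95[1],
  upper95[120] - lower95[120])                # interval width at h = 1 and h = 120
```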

Accuracy

accuracy(f)

The mean error (ME) is -0.04027026. Errors are computed as actual minus fitted values, so the negative sign means that the model, on average, overestimates the actual values by about 0.04, which is a negligible bias.
The root mean squared error (RMSE) is 4.970234. This is a measure of the overall accuracy of the model on the training data. A lower RMSE indicates a more accurate model.
The mean absolute error (MAE) is 3.479043. This is another measure of the overall accuracy of the model. A lower MAE indicates a more accurate model.
The mean percentage error (MPE) is undefined. Percentage errors divide each error by the actual value, so they become undefined when the series contains zeros; a strictly positive series would give a finite MPE.
The mean absolute percentage error (MAPE) is undefined for the same reason.
The mean absolute scaled error (MASE) is NaN. MASE scales the MAE by the in-sample MAE of a naive forecast, which for a seasonal series is the seasonal naive forecast. A likely cause is that, with frequency = 365, the series is too short to form any seasonal differences, so the scaling factor cannot be computed. It is unrelated to the ACF1 value.
The autocorrelation at lag 1 (ACF1) is -0.002387361. This is the correlation between each residual and the previous one. A value of 0 indicates that there is no correlation, and a value of 1 indicates perfect positive correlation.
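These measures are easy to compute by hand, which also shows where undefined values come from. The toy numbers below are invented for illustration: they include a zero in the actual values, and the seasonal scaling uses a period as long as the series.

```r
actual <- c(12, 15, 0, 9, 14, 11)   # toy data containing a zero (assumption)
fitted <- c(13, 16, 1, 10, 12, 12)

err  <- actual - fitted             # actual minus fitted
me   <- mean(err)                   # negative: fitted values are too high on average
rmse <- sqrt(mean(err^2))
mae  <- mean(abs(err))

# Percentage errors divide by the actual values, so a zero breaks them.
mape <- mean(abs(err / actual)) * 100

# MASE scales the MAE by the in-sample MAE of a (seasonal) naive forecast.
mase_naive    <- mae / mean(abs(diff(actual, lag = 1)))
mase_seasonal <- mae / mean(abs(diff(actual, lag = 6)))  # lag as long as the series

c(ME = me, RMSE = rmse, MAE = mae, MAPE = mape,
  MASE_naive = mase_naive, MASE_seasonal = mase_seasonal)
```

Here the MAPE is infinite because of the zero, and the seasonal MASE is NaN because no seasonal differences can be formed.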

Conclusion:

Overall, the ME, RMSE, MAE, and ACF1 suggest that the model fits the training data well. The RMSE and MAE are both relatively low, which indicates small in-sample errors. The MPE and MAPE are undefined, which points to zero values in the series. The MASE is NaN, most likely because the seasonal scaling factor cannot be computed for a series this short relative to its frequency of 365. The ACF1 value is close to 0, which indicates that the residuals are essentially uncorrelated at lag 1.

A forecast method that minimises the MAE will lead to forecasts of the median, while minimising the RMSE will lead to forecasts of the mean. Consequently, the RMSE is also widely used, despite being more difficult to interpret.
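The link between the error measure and the type of forecast can be verified numerically. The sketch below searches a grid of constant forecasts for a right-skewed simulated sample and compares the minimisers with the sample median and mean; the exponential sample and the grid are arbitrary choices.

```r
set.seed(3)
y <- rexp(501)  # right-skewed sample, so the mean and median differ

# Evaluate every constant forecast on a fine grid.
grid <- seq(0, 3, by = 0.001)
mae  <- sapply(grid, function(c) mean(abs(y - c)))
mse  <- sapply(grid, function(c) mean((y - c)^2))

best_mae <- grid[which.min(mae)]
best_mse <- grid[which.min(mse)]

c(best_mae = best_mae, median = median(y))  # MAE is minimised near the median
c(best_mse = best_mse, mean = mean(y))      # MSE (and RMSE) is minimised near the mean
```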