ASHRAE GREAT ENERGY PREDICTOR III

vishal sharma
10 min read · Mar 3, 2021

TABLE OF CONTENTS

  1. Introduction
  2. Business Problem
  3. ML Formulation
  4. Performance-Metric
  5. Dataset Analysis
  6. Exploratory Data Analysis(EDA)
  7. Imputation Techniques
  8. Feature Engineering
  9. Feature Selection
  10. Models with Hyperparameter Tuning
  11. Test Results Comparison
  12. Deployment of ML Model using Flask on Google Colab
  13. Future Work
  14. References

1) Introduction

This is the third energy prediction competition held by ASHRAE, launched in 2019.

This competition is different from the two competitions held in the past because the dataset provided here spans a three-year timeframe (Jan 2016 - Dec 2018), whereas only a few months of data were available for the other two. The larger dataset helps the ML models generalize better. Here we have to predict the energy consumption for four different types of meters (electricity, chilled water, steam, hot water).

The training dataset contains readings from 1,448 buildings taken from 16 different sites across North America and Europe. The weather conditions and information about the buildings were also provided.

2) Business Problem

In today's world, industries in the energy sector keep coming up with new ways to reduce greenhouse-gas emissions and are deploying new ways to cut the cost of energy consumption. Because of this, the renewable and smart-grid sectors are expanding, but not at the rate they should be considering the present environmental conditions.

So the question is: why are they not able to expand at a faster rate, even though efficient methods of lowering energy consumption have been developed?

This is because the power-sector industries do not have an exact methodology for calculating the savings in the post-retrofit period.

Now, if the power-sector industries can get a model which accurately predicts the energy consumption of buildings based on historical usage, they can compare it with the actual consumption in the post-retrofit period, which helps them calculate the energy savings effectively. This also leads to better cost optimization and incentives.

Who is going to build these models? This is where data scientists and machine learning practitioners come in. They build the counterfactual models which predict the meter readings based on historical energy usage.

3) ML Formulation

Since we have to predict meter readings, which are real-valued numbers, the given problem can be posed as a regression-based ML problem.

The first year of hourly meter data was provided as the training set, along with the weather conditions and the building metadata.

The next two years of data were provided as the test set on which we have to make our predictions.

4) Performance-Metric

The performance metric used here is the Root Mean Squared Logarithmic Error(RMSLE).

Now I will explain why we are using this metric for our regression problem instead of RMSE.

RMSLE incurs a larger penalty for underestimating the actual value than RMSE does. This is very useful for our business case, as underestimation of the target variable is not acceptable. For example, if we underestimate electricity consumption, it can lead to buildings shutting down during business hours, which causes monetary losses. This is indeed the most important point for our use case.

In regression models, RMSE is heavily affected by outliers. RMSLE is much more robust to outliers because it takes log1p of the values, which compresses their scale.

RMSLE = sqrt( (1/N) * Σᵢ ( log(ŷᵢ + 1) − log(yᵢ + 1) )² )

where:

N is the total number of observations in the public/private dataset

yᵢ is the actual target value

ŷᵢ is the predicted target value

log(x) is the natural logarithm of x

Source → https://www.kaggle.com/c/ashrae-energy-prediction/data

5) Dataset Analysis

The training dataset consists of about 20 million data points with hourly readings for each meter. The weather conditions and building metadata were also provided. This dataset spans a one-year timeframe (Jan 2016 - Dec 2016).

The test dataset consists of about 41 million data points and spans a two-year period from Jan 2017 to Dec 2018.

train.csv → building_id, meter (0: electricity, 1: chilled water, 2: steam, 3: hot water), timestamp, meter_reading

weather_train.csv → site_id, timestamp, air_temperature, cloud_coverage, dew_temperature, precip_depth_1_hr, sea_level_pressure, wind_direction, wind_speed

building_metadata.csv → site_id, building_id, primary_use, square_feet, year_built, floor_count

site_id → unique id of the site; it ranges from 0–15

building_id → unique id of the building; it ranges from 0–1448

Source → https://www.kaggle.com/c/ashrae-energy-prediction/data

6) Exploratory Data Analysis

First, I merged the train data with the building metadata and the weather data. After getting the final training dataframe, I started analyzing each building site by site.
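Here is a minimal sketch of that merge (the file paths are assumed to point to the competition CSVs):

import pandas as pd

# load the three competition files
train = pd.read_csv("train.csv", parse_dates=["timestamp"])
building_metadata = pd.read_csv("building_metadata.csv")
weather_train = pd.read_csv("weather_train.csv", parse_dates=["timestamp"])

# attach building information via building_id, then weather via (site_id, timestamp)
train = train.merge(building_metadata, on="building_id", how="left")
train = train.merge(weather_train, on=["site_id", "timestamp"], how="left")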

  1. Interesting Findings

As I started analyzing the meter readings of each building for every site, I found that the readings for site 0, meter 0 were mostly zero until May 20, 2016. Below is a code snippet illustrating this check.
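The original snippet and plot are not embedded in this version of the post, so here is a small sketch of the kind of check I mean, using the merged dataframe from above:

# share of zero readings per day for site 0, electricity meter (meter 0)
site0_elec = train[(train["site_id"] == 0) & (train["meter"] == 0)].copy()
site0_elec["date"] = site0_elec["timestamp"].dt.floor("D")
site0_elec["is_zero"] = site0_elec["meter_reading"].eq(0)
daily_zero_share = site0_elec.groupby("date")["is_zero"].mean()

# close to 1.0 before ~2016-05-20, i.e. the readings are almost entirely zero
print(daily_zero_share[daily_zero_share.index < "2016-05-20"].mean())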

I have also checked the energy consumption of the different meters across the hours of the day to see the usage patterns. Here we can see that the electricity meter readings start increasing from 6:00 in the morning and start decreasing after 18:00.

Chilled water readings generally peak during the afternoon hours and start decreasing in the evening.

Hot water and steam do not show any considerable pattern over the hours of the day.

Let's further investigate the energy pattern over the days of the week.

The first plot shown here is the electricity consumption over the days of the week. We can observe that consumption is lower on weekends than on weekdays. This pattern is consistent across almost all sites for meter 0, and it is more or less the same for the remaining three meters at most sites.

Next, I will show the energy consumption of the meters according to the seasons. The most dominant effect is on the chilled water and hot water consumption.

From the above two plots, we can see that chilled water and hot water consumption show strong seasonal variations.
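The plots themselves are not reproduced here; the aggregations behind them look roughly like this (hour of day, day of week and month of the merged dataframe):

import matplotlib.pyplot as plt

meter_names = {0: "electricity", 1: "chilledwater", 2: "steam", 3: "hotwater"}

for unit, label in [("hour", "hour of day"), ("dayofweek", "day of week"), ("month", "month")]:
    grouped = (train
               .assign(unit=getattr(train["timestamp"].dt, unit))
               .groupby(["meter", "unit"])["meter_reading"]
               .mean()
               .unstack(level="meter")        # one column per meter type
               .rename(columns=meter_names))
    grouped.plot(title="Mean meter reading by " + label)
    plt.show()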

7) Imputation Techniques

First, I will check the percentage of missing values for each feature. Below is the code along with the resulting percentage of null values.
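A one-liner along these lines produces the percentages shown below:

# percentage of missing values per column, sorted from most to least missing
null_percentage = train.isnull().mean().sort_values(ascending=False) * 100
print(null_percentage.round(2))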

Percentage of Null Values

Now coming to the imputation part: the magnitude of energy consumption depends on the individual site, and it also depends on the day of the week and the month. Therefore we group by all three of these features when filling the missing values.

First, I dropped floor_count, as it was missing for more than 80% of the rows.

Coming to the other features: to fill their missing values, I grouped by the particular site_id and took the median for the particular day of the week within each month. The code I used for all features except air and dew temperature is shown below.
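The original gist is not embedded in this version, so here is a sketch of the grouped median fill, assuming the merged dataframe and the standard weather column names:

train["dayofweek"] = train["timestamp"].dt.dayofweek
train["month"] = train["timestamp"].dt.month

weather_cols = ["cloud_coverage", "precip_depth_1_hr", "sea_level_pressure",
                "wind_direction", "wind_speed"]

for col in weather_cols:
    # the median of the same site / day of week / month fills the gaps for that feature
    train[col] = train[col].fillna(
        train.groupby(["site_id", "dayofweek", "month"])[col].transform("median"))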

Air and dew temperature show a strong correlation with the previous hour's values, so I used linear interpolation to fill their missing values.
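A sketch of that interpolation, done per site so that values from one site do not bleed into another (rows are assumed to be sorted by timestamp within each site):

for col in ["air_temperature", "dew_temperature"]:
    # linear interpolation between the neighbouring hourly values
    train[col] = (train.groupby("site_id")[col]
                       .transform(lambda s: s.interpolate(method="linear", limit_direction="both")))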

8) Feature Engineering

First, I aligned the air temperature readings with the local timestamp of the meter readings.

For site 0 the air temperature had to be shifted by 5 hours; the offset is different for different sites, but the code is exactly the same. Below is the code for the timestamp alignment.
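The alignment gist is not embedded here, so a minimal sketch of the idea is shown instead. Only the 5-hour shift for site 0 comes from the text; the other offsets are placeholders:

# hours by which the raw weather timestamp is ahead of the local meter timestamp (illustrative values)
site_offsets = {0: 5, 1: 0, 2: 7}   # ... one entry per site_id

weather_train["offset"] = weather_train["site_id"].map(site_offsets).fillna(0).astype(int)
weather_train["timestamp"] = (weather_train["timestamp"]
                              - pd.to_timedelta(weather_train["offset"], unit="h"))
weather_train = weather_train.drop(columns="offset")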

  1. Relative Humidity - Humidity affects the energy requirements of buildings: as humidity increases, the load on the HVAC system increases, which affects the meter readings. This feature helps the model learn the energy requirements according to the humidity level (a consolidated sketch of all the features in this list is shown after item 6).

2) Is Summer Month and Is Winter Month - This feature helps the model learn patterns over the heating and cooling seasons. Steam and hot water show considerable variation between the summer and winter months.

3) Holidays - Energy requirements on holidays differ from those on working days, so this feature helps the model learn those patterns.

holidays = ["2016-01-01", "2016-01-18", "2016-02-15", "2016-05-30", "2016-07-04",
"2016-09-05", "2016-10-10", "2016-11-11", "2016-11-24", "2016-12-25"]

4) Is Weekday - This feature helps the model learn the energy pattern on weekdays compared to weekends.

5) Busy Hours - Energy requirements during the day differ from those at night, so this feature is added to help the model learn that. Busy hours are considered to be from 6:00 to 18:00.

6) Added basic timestamp features such as Year, Month, Hour and Day.
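The individual gists for these features are not embedded above, so here is a consolidated sketch of all six. It assumes the merged dataframe, the standard column names, and the Magnus approximation for relative humidity (the exact formula used originally is not stated); the summer/winter month ranges are also my own assumption:

import numpy as np

# 1) relative humidity from air and dew temperature (Magnus approximation)
def saturation_vapour_pressure(t_celsius):
    return 6.112 * np.exp(17.67 * t_celsius / (t_celsius + 243.5))

train["relative_humidity"] = 100 * (saturation_vapour_pressure(train["dew_temperature"])
                                    / saturation_vapour_pressure(train["air_temperature"]))

# 6) basic timestamp features
train["year"] = train["timestamp"].dt.year
train["month"] = train["timestamp"].dt.month
train["day"] = train["timestamp"].dt.day
train["hour"] = train["timestamp"].dt.hour

# 2) heating / cooling season flags
train["is_summer_month"] = train["month"].isin([6, 7, 8, 9]).astype(int)
train["is_winter_month"] = train["month"].isin([12, 1, 2, 3]).astype(int)

# 3) the 2016 holidays listed above
train["is_holiday"] = train["timestamp"].dt.strftime("%Y-%m-%d").isin(holidays).astype(int)

# 4) weekday vs weekend
train["is_weekday"] = (train["timestamp"].dt.dayofweek < 5).astype(int)

# 5) busy hours between 6:00 and 18:00
train["is_busy_hours"] = train["hour"].between(6, 18).astype(int)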

9) Feature Selection

I performed feature selection before fitting my models because the dataset is large and more features would increase the chance of overfitting. I used an XGBRegressor as the base model for computing feature importances.

As we can see from the above plot, a total of 17 features are important and the remaining 6 features are discarded.
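The importance plot and the selection code are not shown in this version; the selection looks roughly like this, with an XGBRegressor fit on the log1p-transformed target:

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

# label-encode the one categorical column so XGBoost can consume it
train["primary_use"] = train["primary_use"].astype("category").cat.codes

feature_cols = [c for c in train.columns if c not in ("meter_reading", "timestamp")]

xgb = XGBRegressor(n_estimators=200, max_depth=8, tree_method="hist", n_jobs=-1)
xgb.fit(train[feature_cols], np.log1p(train["meter_reading"]))

importances = pd.Series(xgb.feature_importances_, index=feature_cols).sort_values(ascending=False)
selected_features = importances.head(17).index.tolist()   # keep the 17 most important features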

10) Model and Hyperparameter Tuning

Here I tried different models such as Random Forest, XGBoost, LightGBM, CatBoost, stacked ensembling, and a deep learning model, and performed hyperparameter tuning for each of them. I applied a target transformation (log1p) to the target variable, so using RMSE as the evaluation metric effectively makes it RMSLE.

  1. XGBoost Model

2. LightGBM Model
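The training gists are not embedded in this version; here is a minimal LightGBM sketch with the log1p target transformation and RMSE as the evaluation metric (the hyperparameter values are illustrative, not the tuned ones):

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

X = train[selected_features]
y = np.log1p(train["meter_reading"])
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

params = {
    "objective": "regression",
    "metric": "rmse",            # RMSE on log1p targets corresponds to RMSLE on the original scale
    "learning_rate": 0.05,
    "num_leaves": 1024,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
}

model = lgb.train(params,
                  lgb.Dataset(X_tr, label=y_tr),
                  num_boost_round=1000,
                  valid_sets=[lgb.Dataset(X_val, label=y_val)],
                  callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)])

val_preds = np.expm1(model.predict(X_val))   # back-transform to the original meter_reading scale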

3. CatBoost Model

4. Random Forest Regressor

5) Custom Ensembling

Here I split the training set 80-20. The 20% is kept as a test set, and the 80% is further divided 50-50 into df1 and df2. The base models are hyperparameter-tuned and fit on df1 and then used to predict on df2. These df2 predictions are combined and used as the training set for my meta model. Since we also have the target variable for the remaining 20%, it is used along with the base-model predictions for hyperparameter tuning of the meta model.

Then we make predictions with the base models on the final test set, combine these predictions again, and feed them as input to the meta model. The final predictions come from the meta model.

The base models used are CatBoost, LightGBM and XGBoost; the meta model is LightGBM. Here is how the dataset is created for ensembling.
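The gist that builds this dataset is not embedded; a sketch of the split and the meta-feature construction follows (xgb_model, lgb_model and cat_model stand in for the base models tuned on df1):

from sklearn.model_selection import train_test_split
import pandas as pd

# 80-20 split, then the 80% split 50-50 into df1 / df2
X_80, X_20, y_80, y_20 = train_test_split(X, y, test_size=0.2, random_state=42)
X_df1, X_df2, y_df1, y_df2 = train_test_split(X_80, y_80, test_size=0.5, random_state=42)

base_models = {"xgb": xgb_model, "lgb": lgb_model, "cat": cat_model}   # tuned and fit on df1

# base-model predictions on df2 become the training set of the meta model
meta_train = pd.DataFrame({name: m.predict(X_df2) for name, m in base_models.items()})
meta_target = y_df2

# the same construction on the held-out 20% (and later on the final test set)
meta_holdout = pd.DataFrame({name: m.predict(X_20) for name, m in base_models.items()})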

Here is the code for hyperparameter tuning of my base models

Here is the code for hyperparameter tuning for my meta model
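Neither gist is embedded in this version, so here is a sketch of the style of tuning, shown for a LightGBM regressor with RandomizedSearchCV; the same pattern applies to the base models on df1 and to the meta model on the stacked predictions (the search space here is illustrative):

from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMRegressor

param_distributions = {
    "num_leaves": [255, 511, 1023],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [300, 600, 1000],
    "colsample_bytree": [0.7, 0.8, 0.9],
}

search = RandomizedSearchCV(LGBMRegressor(objective="regression"),
                            param_distributions=param_distributions,
                            n_iter=10,
                            scoring="neg_root_mean_squared_error",
                            cv=3,
                            random_state=42)
search.fit(meta_train, meta_target)
print(search.best_params_)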

6) Simple MLP Model

Here I build a simple multilayer perceptron (MLP) as my first neural network model.
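The original architecture is not shown here; a minimal Keras MLP sketch on the log1p target, with an illustrative layer layout:

import tensorflow as tf

mlp = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(len(selected_features),)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),                       # predicts log1p(meter_reading)
])

mlp.compile(optimizer="adam", loss="mse",
            metrics=[tf.keras.metrics.RootMeanSquaredError()])

mlp.fit(X_tr, y_tr, validation_data=(X_val, y_val), epochs=10, batch_size=1024)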

11) Test Result Comparison Of All Models

Here we can see that the LightGBM model performs the best out of all the models we experimented with.

12) Deployment of ML Models using Flask on Google Colab

As the final step, I deployed my model using Flask on Google Colab. One important thing to note while deploying the model through Colab is to use ngrok, which exposes a public URL, since Colab is a virtual machine without a public IP. Here I am attaching the video link of my deployed model.
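The deployed app itself is shown in the video; a minimal sketch of the Flask + flask-ngrok setup on Colab looks like this (the predict logic is reduced to a placeholder around the trained model from above):

# pip install flask flask-ngrok
import numpy as np
import pandas as pd
from flask import Flask, request, jsonify
from flask_ngrok import run_with_ngrok

app = Flask(__name__)
run_with_ngrok(app)          # exposes the Colab-hosted app through a public ngrok URL

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()                             # one row of engineered features
    pred = float(np.expm1(model.predict(pd.DataFrame([features]))[0]))
    return jsonify({"meter_reading": pred})

app.run()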

13) Future Work

Different ideas can be tried here, such as building 16 different models (one per site) or 4 different models (one per meter). I also have not used the leaked data, which could be used for cross-validation and could improve the metric score. Ensembling different models per meter or per site can also be tried.

14) REFERENCES

Predicting energy demand with neural networks | Towards Data Science

Estimating Counterfactual Energy Usage of Buildings with Machine Learning | by Steven Smiley | Towards Data Science

ASHRAE — Missing Weather Data Handling | Kaggle

ASHRAE — EDA and Preprocessing | Kaggle

M5-Forecasting-Accuracy/Ensemble_Model.ipynb at main · vence-andersen/M5-Forecasting-Accuracy · GitHub

Locate cities according weather temperature🌇 | Kaggle
