ASHRAE GREAT ENERGY PREDICTOR III

TABLE OF CONTENTS

  1. Introduction
  2. Business Problem
  3. ML Formulation
  4. Performance-Metric
  5. Dataset Analysis
  6. Exploratory Data Analysis(EDA)
  7. Imputation Techniques
  8. Feature Engineering
  9. Feature Selection
  10. Models with Hyperparameter Tuning
  11. Test Results Comparison
  12. Deployment of ML Model using Flask on Google Colab
  13. Future Work
  14. References

1) Introduction

This is the third energy prediction competition held by ASHRAE, launched in 2019.

2) Business Problem

In today's energy sector, industries are constantly coming up with new ways to reduce greenhouse-gas emissions and to cut the cost of energy consumption. Because of this, the renewable and smart-grid sectors are expanding, but not at the rate they should be given the present environmental conditions.

3) ML Formulation

Since we have to predict meter readings, which are real-valued numbers, the problem can be posed as a regression problem.

4) Performance-Metric

The performance metric used here is the Root Mean Squared Logarithmic Error (RMSLE).
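
For reference, RMSLE over n predictions is

RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}

where p_i is the predicted meter reading and a_i is the actual meter reading.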

5) Dataset Analysis

The training dataset consists of about 20 million data points with hourly meter readings for each meter. Weather conditions and building metadata are also provided. The dataset spans a one-year timeframe (Jan 2016 - Dec 2016).

6) Exploratory Data Analysis

The first thing I did was merge the train data with the building metadata and weather data. After getting the final train dataset, I started analyzing the buildings site by site.
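
A minimal sketch of that merge, assuming the competition's CSV file names (train.csv, building_metadata.csv, weather_train.csv):

import pandas as pd

# Load the three competition files
df_train=pd.read_csv('train.csv',parse_dates=['timestamp'])
df_building=pd.read_csv('building_metadata.csv')
df_weather=pd.read_csv('weather_train.csv',parse_dates=['timestamp'])

# Attach building metadata via building_id, then weather via (site_id, timestamp)
df=df_train.merge(df_building,how='left',on='building_id')
df=df.merge(df_weather,how='left',on=['site_id','timestamp'])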

df.drop(index=df[(df['building_id']<=104) & (df['meter']==0) & (df['timestamp']<'2016-05-21')].index,inplace=True)#Removing anomalous electricity readings for buildings 0-104 (site 0) before 2016-05-21
df.drop(index=df[(df['building_id']==53) & (df['meter']==0)].index,inplace=True)#Removing anomalous building
df.drop(index=df[(df['building_id']==1099) & (df['meter']==2)].index,inplace=True)#Removing anomalous building
df.drop(index=df[(df['building_id']==1250) & (df['meter']==2)].index,inplace=True)#Removing anomalous building
df.drop(index=df[(df['building_id']==1227) & (df['meter']==0)].index,inplace=True)#Removing anomalous building
df.drop(index=df[(df['building_id']==1314) & (df['meter']==0)].index,inplace=True)#Removing anomalous building
df.drop(index=df[(df['building_id']==1281) & (df['meter']==0)].index,inplace=True)#Removing anomalous building
df.drop(index=df[(df['building_id']==279) & (df['meter']==3)].index,inplace=True)#Removing anomalous building
df.drop(index=df[(df['building_id']==263) & (df['meter']==3)].index,inplace=True)#Removing anomalous building
df.drop(index=df[(df['building_id']==287) & (df['meter']==3)].index,inplace=True)#Removing anomalous building
df.drop(index=df[(df['building_id']==1018) & (df['meter']==1)].index,inplace=True)#Removing anomalous building
df.drop(index=df[(df['building_id']==1022) & (df['meter']==1)].index,inplace=True)#Removing anomalous building
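
The same per-building drops can also be written as one pass; a small sketch, assuming df is the same merged training frame:

import pandas as pd

# (building_id, meter) pairs flagged as anomalous above
anomalous_pairs=[(53,0),(1099,2),(1250,2),(1227,0),(1314,0),(1281,0),
                 (279,3),(263,3),(287,3),(1018,1),(1022,1)]
mask=pd.Series(False,index=df.index)
for bld,mtr in anomalous_pairs:
    mask|=(df['building_id']==bld)&(df['meter']==mtr)
df=df[~mask]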

7) Imputation Techniques

First, I checked the percentage of missing values for each feature. Below is the code, along with the imputation of cloud_coverage.

# Percentage of null values per column
(df_train_merge_cleaned.isnull().sum()/df_train_merge_cleaned.shape[0])*100
# floor_count has a very high share of missing values, so drop it
df_train_merge_cleaned.drop('floor_count',axis=1,inplace=True)
df_train_merge_cleaned.reset_index(inplace=True)
df_train_merge_cleaned['day']=df_train_merge_cleaned['timestamp'].dt.day
df_train_merge_cleaned['month']=df_train_merge_cleaned['timestamp'].dt.month
# Impute cloud_coverage with the per-(site_id, day, month) median, forward-filling groups that are entirely missing
cc_fill=df_train_merge_cleaned.groupby(['site_id','day','month'])['cloud_coverage'].median().reset_index()
cc_fill.rename(columns={'cloud_coverage':'cc_filler'},inplace=True)
cc_fill['cc_filler'].fillna(method='ffill',inplace=True)
df_train_merge_cleaned=df_train_merge_cleaned.merge(cc_fill,how='left',on=['site_id','day','month'])
df_train_merge_cleaned['cloud_coverage'].fillna(df_train_merge_cleaned['cc_filler'],inplace=True)
df_train_merge_cleaned.drop(labels=['cc_filler'],axis=1,inplace=True)
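
The same median-by-(site_id, day, month) pattern can be reused for the other weather columns. Below is a small helper sketch; the function name and the sea_level_pressure example are mine, not from the original code:

def fill_by_site_day_month(frame,col):
    # Per-(site_id, day, month) median of the given weather column
    filler=frame.groupby(['site_id','day','month'])[col].median().reset_index()
    filler=filler.rename(columns={col:'filler'})
    filler['filler']=filler['filler'].fillna(method='ffill')
    frame=frame.merge(filler,how='left',on=['site_id','day','month'])
    frame[col]=frame[col].fillna(frame['filler'])
    return frame.drop(columns=['filler'])

# e.g. df_train_merge_cleaned=fill_by_site_day_month(df_train_merge_cleaned,'sea_level_pressure')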

8) Feature Engineering

First, I aligned the air temperature with the local timestamps of the meter readings; for site 0 the weather timestamps are shifted back by 5 hours.

# Site 0: shift the timestamps back by 5 hours so air temperature lines up with the local meter-reading timestamps
df_train_site_0=df_train_merge_cleaned_imputed[df_train_merge_cleaned_imputed['site_id']==0]
df_train_site_0.reset_index(inplace=True)
df_train_site_0['timestamp_aligned']=df_train_site_0['timestamp']-timedelta(hours=5,minutes=0)
df_air_temp_timestamp=df_train_site_0[['timestamp_aligned','building_id','meter','air_temperature']].copy()
df_air_temp_timestamp.rename(columns={'timestamp_aligned':'timestamp'},inplace=True)
df_train_site_0.drop(['air_temperature','timestamp_aligned'],axis=1,inplace=True)
# Pull the shifted air temperature back onto the original timestamps and interpolate the gaps the shift leaves behind
df_train_site_0['air_temperature_aligned']=df_air_temp_timestamp[df_air_temp_timestamp['timestamp'].isin(df_train_site_0['timestamp'])].reset_index(drop=True)['air_temperature']
df_train_site_0['air_temperature_aligned']=df_train_site_0['air_temperature_aligned'].interpolate()
df_train_site_0.rename(columns={'air_temperature_aligned':'air_temperature'},inplace=True)
df_train_site_0.drop(['level_0','index'],axis=1,inplace=True)
# Relative humidity from air and dew-point temperature (Magnus approximation, vapor pressures in hPa)
saturated_vapor_pressure = 6.11 * (10**(7.5*df_train_merged_final['air_temperature']/(237.3+df_train_merged_final['air_temperature'])))
actual_vapor_pressure = 6.11 * (10**(7.5*df_train_merged_final['dew_temperature']/(237.3+df_train_merged_final['dew_temperature'])))
df_train_merged_final['relative_humidity']=(actual_vapor_pressure/saturated_vapor_pressure)*100
# is_weekday: Monday-Friday and not a holiday
df_train_merged_final['is_weekday']=((~df_train_merged_final['timestamp'].dt.date.isin(holiday_datetime.date))&(df_train_merged_final['weekday'].isin([0,1,2,3,4]))).astype(int)
# busy_hours: 06:00-18:00 on non-holiday days
z_busy_hours=df_train_merged_final.set_index(['timestamp']).between_time('06:00:00','18:00:00').reset_index()
z_busy_hours_timestamp=[i for i in z_busy_hours['timestamp']]
df_train_merged_final['busy_hours']=((~df_train_merged_final['timestamp'].dt.date.isin(holiday_datetime.date))&(df_train_merged_final['timestamp'].isin(z_busy_hours_timestamp))).astype(int)
# Calendar features from the timestamp
df_train_merged_final['hour']=df_train_merged_final['timestamp'].dt.hour
df_train_merged_final['weekday']=df_train_merged_final['timestamp'].dt.weekday
df_train_merge_cleaned['day']=df_train_merge_cleaned['timestamp'].dt.day
df_train_merge_cleaned['month']=df_train_merge_cleaned['timestamp'].dt.month
# Label-encode the categorical primary_use column
label_encoder=LabelEncoder()
df_train_merged_final_red['primary_use']=label_encoder.fit_transform(df_train_merged_final_red['primary_use'])

9) Feature Selection

I performed feature selection before fitting the models: the dataset is large, and more features would increase the chance of overfitting. I used an XGB Regressor as the base model for the selection.
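
A rough sketch of what selection with an XGB base model can look like, assuming X and y are the prepared feature matrix and log-transformed target (the threshold and hyperparameters here are illustrative, not the ones used in the project):

from xgboost import XGBRegressor
from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance in the base XGB model exceeds the median importance
base_model=XGBRegressor(n_estimators=300,max_depth=7,tree_method='gpu_hist')
selector=SelectFromModel(base_model,threshold='median')
selector.fit(X,y)
selected_columns=X.columns[selector.get_support()]
X_selected=X[selected_columns]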

10) Model and Hyperparameter Tuning

Here I tried different models: RF, XGBoost, LGBM, CatBoost, a custom ensemble, and a deep learning model, and performed hyperparameter tuning for each. At the very beginning I applied a target transformation (log1p) to the target variable, so RMSE on the transformed target effectively becomes RMSLE and is used as the evaluation metric.
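
A small sketch of that target transformation, assuming meter_reading is the raw target column:

import numpy as np

# Train on log1p(meter_reading); RMSE on this scale corresponds to RMSLE on the original scale
y_tr=np.log1p(df['meter_reading'])
# ...fit a model on (X, y_tr), then invert at prediction time:
# predictions=np.expm1(model.predict(X_test))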

Best hyperparameter combinations found for two of the models:

{'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 9, 'n_estimators': 500}
{'colsample_bytree': 0.9, 'learning_rate': 0.1, 'max_depth': 7, 'min_child_samples': 100, 'n_estimators': 1200}
# Random search over CatBoost depth and number of estimators
params=[]
err_score=[]
for i in range(10):
    max_depth=np.random.randint(3,15)
    estimators=np.random.randint(300,1500)
    cat_reg=CatBoostRegressor(task_type='GPU',loss_function='RMSE',max_depth=max_depth,n_estimators=estimators,learning_rate=0.1)
    cat_reg.fit(X_train,y_train)
    test_pred=cat_reg.predict(X_test)
    err_test=np.sqrt(mean_squared_error(y_test,test_pred))
    err_score.append(err_test)
    params.append((max_depth,estimators))
[0.8014502674080963,
0.969339306037929,
1.1105379937315227,
0.8017482632418587,
1.083261361679219,
0.7968043441249344,
0.88638072298227,
1.2696015010888373,
1.0057081335900901,
1.3935640553815056]
{'max_depth': 9, 'n_estimators': 100}
# Custom ensembling: split the training data into two halves (D1, D2),
# then draw three bootstrap samples of D1 to train the base models
X_train,X_test,y_train,y_test=train_test_split(df_tr_red_final,y_tr,test_size=0.2,random_state=0)
X_train_d1,X_train_d2,y_train_d1,y_train_d2=train_test_split(X_train,y_train,test_size=0.5,random_state=0)
s1_d1=X_train_d1.sample(frac=0.8,replace=True,random_state=0)
y1_d1=y_train_d1.sample(frac=0.8,replace=True,random_state=0)
s2_d1=X_train_d1.sample(frac=0.8,replace=True,random_state=1)
y2_d1=y_train_d1.sample(frac=0.8,replace=True,random_state=1)
s3_d1=X_train_d1.sample(frac=0.8,replace=True,random_state=2)
y3_d1=y_train_d1.sample(frac=0.8,replace=True,random_state=2)
# Base model 1: XGBoost tuned on the first bootstrap sample of D1
x_cfl=XGBRegressor(tree_method='gpu_hist')
params={'n_estimators':[300,500,1000,1500,2000],
'learning_rate':[0.01,0.03,0.05,0.1],
'max_depth':[3,5,7,9],
'colsample_bytree':[0.5,0.8,0.9,1]}
random_xgb=RandomizedSearchCV(x_cfl,params,scoring='neg_root_mean_squared_error',n_jobs=-1,cv=3,verbose=10,random_state=1,n_iter=10)
random_xgb.fit(s1_d1,y1_d1)
random_xgb.best_params_
{'colsample_bytree': 0.8,
 'learning_rate': 0.1,
 'max_depth': 7,
 'n_estimators': 2000}
# Base model 2: CatBoost tuned on the second bootstrap sample of D1
params={'max_depth':[3,5,7,9,11,13,15],
'n_estimators':[300,500,800,1000,1200,1500],
'learning_rate':[0.1,0.01,0.03,0.05]}
cat_reg=CatBoostRegressor()
random_cat=RandomizedSearchCV(cat_reg,params,scoring='neg_root_mean_squared_error',n_jobs=-1,cv=3,verbose=1,random_state=1,n_iter=8)
random_cat.fit(s2_d1,y2_d1)
random_cat.best_params_
{'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 1200}

# Base model 3: LightGBM tuned on the third bootstrap sample of D1
params={'max_depth':[3,5,7,9,11],
'learning_rate':[0.1,0.01,0.03,0.05],
'colsample_bytree':[0.7,0.8,0.9,1.0],
'n_estimators':[300,500,800,1200],
'min_child_samples':[50,100,200,300,500]}


lgb_reg=LGBMRegressor()
random_lgb=RandomizedSearchCV(lgb_reg,params,n_iter=8,scoring='neg_root_mean_squared_error',cv=3,verbose=1,random_state=42,n_jobs=-1)
random_lgb.fit(s3_d1,y3_d1)
random_lgb.best_params_
{'colsample_bytree': 1.0,
 'learning_rate': 0.1,
 'max_depth': 11,
 'min_child_samples': 300,
 'n_estimators': 800}
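
The meta-model below is fit on X_train_d2_pred, which is not defined in the snippets shown here; presumably it stacks the tuned base models' predictions on the D2 half. A minimal sketch under that assumption:

import numpy as np

# Hypothetical construction of the second-level features: each tuned base model
# predicts on the held-out D2 half, and the predictions are stacked column-wise
pred_xgb=random_xgb.best_estimator_.predict(X_train_d2)
pred_cat=random_cat.best_estimator_.predict(X_train_d2)
pred_lgb=random_lgb.best_estimator_.predict(X_train_d2)
X_train_d2_pred=np.column_stack([pred_xgb,pred_cat,pred_lgb])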
# Meta-model: LightGBM tuned on X_train_d2_pred (second-level features for the D2 half)
params={'max_depth':[3,5,7,9,11],
'learning_rate':[0.1,0.01,0.03,0.05],
'colsample_bytree':[0.7,0.8,0.9,1.0],
'n_estimators':[300,500,800,1200],
'min_child_samples':[50,100,200,300,500]}
lgb_reg=LGBMRegressor()
random_lgb=RandomizedSearchCV(lgb_reg,params,n_iter=8,scoring='neg_root_mean_squared_error',cv=3,verbose=24,random_state=5,n_jobs=-1)
random_lgb.fit(X_train_d2_pred,y_train_d2)
random_lgb.best_params_
{'colsample_bytree': 0.9,
 'learning_rate': 0.05,
 'max_depth': 7,
 'min_child_samples': 50,
 'n_estimators': 800}
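
The compile call below references a custom rmse loss defined elsewhere in the notebook; a minimal sketch of such a loss using Keras backend ops (the original definition may differ):

from tensorflow.keras import backend as K

def rmse(y_true,y_pred):
    # Root-mean-squared-error loss, computed batch-wise
    return K.sqrt(K.mean(K.square(y_pred-y_true)))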
# Simple feed-forward network on the same features
model_3=Sequential()
model_3.add(Dense(256,activation='relu',input_shape=(X_train.shape[1],)))
model_3.add(Dense(128,activation='relu'))
model_3.add(Dense(64,activation='relu'))
model_3.add(Dense(32,activation='relu'))
model_3.add(Dense(1))
model_3.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),loss=rmse)
model_3.fit(X_train,y_train,epochs=10,validation_data=(X_test,y_test),batch_size=int(X_train.shape[0]/10))

Epoch 1/10
11/11 [==============================] - 57s 5s/step - loss: 1200632.2554 - val_loss: 3.3051
Epoch 2/10
11/11 [==============================] - 55s 5s/step - loss: 2.6724 - val_loss: 2.3621
Epoch 3/10
11/11 [==============================] - 55s 5s/step - loss: 2.3916 - val_loss: 2.1868
Epoch 4/10
11/11 [==============================] - 55s 5s/step - loss: 2.1294 - val_loss: 2.0849
Epoch 5/10
11/11 [==============================] - 55s 5s/step - loss: 2.0988 - val_loss: 2.0801
Epoch 6/10
11/11 [==============================] - 55s 5s/step - loss: 2.0969 - val_loss: 2.1468
Epoch 7/10
11/11 [==============================] - 55s 5s/step - loss: 2.1280 - val_loss: 2.0903
Epoch 8/10
11/11 [==============================] - 55s 5s/step - loss: 2.1155 - val_loss: 2.1396
Epoch 9/10
11/11 [==============================] - 55s 5s/step - loss: 2.1278 - val_loss: 2.0815
Epoch 10/10
11/11 [==============================] - 55s 5s/step - loss: 2.0806 - val_loss: 2.0793

11) Test Result Comparison Of All Models

12) Deployment of ML Models using Flask on Google Colab

As the final step, I deployed my model using Flask on Google Colab. One important thing to note when deploying from Colab is to use ngrok, which exposes a public URL, since Colab runs on a virtual machine without a public IP. Here I am attaching the video link of my deployed model.
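
A minimal sketch of that Flask + ngrok setup on Colab, assuming pyngrok is installed and model is the trained regressor already loaded in memory (the /predict route and payload format are illustrative):

from flask import Flask, request, jsonify
from pyngrok import ngrok
import numpy as np

app=Flask(__name__)

@app.route('/predict',methods=['POST'])
def predict():
    # Expects a JSON payload with the feature values in the order the model was trained on
    features=np.array(request.json['features']).reshape(1,-1)
    pred_log=model.predict(features)  # model was trained on log1p(meter_reading)
    return jsonify({'meter_reading':float(np.expm1(pred_log)[0])})

# ngrok exposes the local Flask port through a public URL, since the Colab VM has no public IP
public_url=ngrok.connect(5000)
print(public_url)
app.run(port=5000)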
