Getting Started: Predicting Forest Fire Size Using Machine Learning

The dataset used contains 517 fires from the Montesinho natural park in Portugal. For each incident, the weekday, month, coordinates, and burnt area are recorded, as well as several meteorological variables such as rain, temperature, humidity, and wind.

Data Information

X — x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y — y-axis spatial coordinate within the Montesinho park map: 2 to 9
month — month of the year: “jan” to “dec”
day — day of the week: “mon” to “sun”
FFMC — Fine Fuel Moisture Code (FFMC) index from the FWI system: 18.7 to 96.20
DMC — Duff Moisture Code (DMC) index from the FWI system: 1.1 to 291.3
DC — Drought Code (DC) index from the FWI system: 7.9 to 860.6
ISI — Initial Spread Index (ISI) from the FWI system: 0.0 to 56.10
temp — the temperature in Celsius degrees: 2.2 to 33.30
RH — relative humidity in %: 15.0 to 100
wind — wind speed in km/h: 0.40 to 9.40
rain — outside rain in mm/m2 : 0.0 to 6.4
area — the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is very skewed towards 0.0, thus it may make sense to model with the logarithm transform).

Missing Attribute Values: None
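Because area is so heavily skewed towards zero, the logarithm transform mentioned above is worth keeping in mind from the start. Here is a minimal sketch of that transform (purely illustrative; the log_area column name and the local file path are assumptions, not part of the original analysis):

import numpy as np
import pandas as pd
# assumed local copy of the dataset
df = pd.read_csv('forestfires.csv')
# log(1 + area) keeps the many zero-area records valid and compresses the long right tail
df['log_area'] = np.log1p(df['area'])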

Definition of Features

For a regression task this difficult, it would be a mistake to dive straight into predicting burnt areas. Why? Because we first need to understand the data we are working with.

Let’s import the libraries and take a look at our data:

# Import Libraries
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
# load in the data
df = pd.read_csv('C:/Forest_Fire/forestfires.csv')
df.head()

Done! We will follow the “Data Science Process”.

We have already obtained the data, and it is in a clean format. The next step is to “Explore the data”. Exploring the data helps us discover meaningful insights, and those insights come from asking good questions of the data. Let’s get started!

  • In which month did the largest forest fires occur?
sns.set(rc={'figure.figsize':(15.7,7.27)})
plt.scatter(x=df["month"], y=df["area"], color = 'red')
plt.xlabel("Month", size = 15)
plt.ylabel("Area", size = 15)
plt.title("Month With the highest forest burnt area", size = 15)

Answer: September had the largest forest fires in terms of burnt area, followed by August and then July. What does this imply? Weather is clearly one of the factors that affect forest fire size. We could also create useful features from the month, for example the fire-season indicator sketched below.
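As a hypothetical example of such a feature (illustrative only, and not used later in this post), the months with the largest burnt areas could be flagged as a fire-season indicator:

# flag the months that showed the largest burnt areas above (assumed feature, for illustration)
fire_season_months = ['jul', 'aug', 'sep']
df['fire_season'] = df['month'].isin(fire_season_months).astype(int)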

  • Do most forest fires occur during weekdays or weekends?
days_of_fire = df['day'].value_counts()
days_of_fire = days_of_fire.to_frame()
days_of_fire = days_of_fire.reset_index()
days_of_fire.columns = ['day', 'fire_count']
days_of_fire.plot.bar(x = 'day', y= 'fire_count', color = 'orange')
plt.xlabel("Day", size = 15)
plt.ylabel("Fire Count", size = 15)
plt.title("Fire Count for each Day in the Montesinho park", size = 15)

Answer: The majority of the forest fires occurred on days with less working activity (Sunday, Friday, and Saturday). This suggests that forest fires are more likely to happen on weekends than on weekdays. We could also engineer some features from the day of the week.

  • Do high temperatures imply larger forest fires in terms of area?
temp_by_area = df.groupby(['temp'], sort=True)['area'].max()
temp_by_area = temp_by_area.to_frame()
temp_by_area = temp_by_area.reset_index()
temp_by_area.columns = ['temp', 'area']
temp_by_area.plot(y ='area', x = 'temp')

Answer: Larger forest fires were associated with higher temperatures.

Note: Relationships between the other variables can still be examined. A quick way to do this is to plot the pairwise relationships between the features/variables present in the data.

sns.pairplot(data=df, diag_kind='kde', vars=['DC', 'FFMC', 'DMC', 'temp', 'X', 'Y', 'area',
                                             'ISI', 'RH', 'wind', 'rain'])
plt.show()

I can’t see any clearly visible relationships straight away, though. An approach I would strongly recommend for determining relationships between features/variables is to compute their “correlation”. Correlation coefficients are easy to interpret: the correlation coefficient measures the strength and direction of a linear relationship between two variables.

df.corr(method ='pearson')
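To make the correlation matrix easier to scan, one option (not shown in the original write-up) is to plot it as a heatmap. A minimal sketch, reusing the df and seaborn import from above and restricting to the numeric columns:

# correlation on numeric columns only, so the string columns (month, day) are skipped
corr = df.select_dtypes(include='number').corr(method='pearson')
# annotated heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Pearson correlation between numerical features')
plt.show()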

According to this publication, arson is the cause of about 42% of human-caused wildfires in Portugal. This could help explain why some outliers are present in the data.

Feature Engineering

One-hot encoding is a form of feature engineering that transforms categorical features into a format that works better with classification and regression algorithms. It is very useful for models that need a numerical representation of categorical data.

The features to be one-hot encoded are all the categorical variables in the data: X, Y, month, and day.

# one hot encoding coordinates X and Y
df_X = pd.get_dummies(df.X)
df_Y = pd.get_dummies(df.Y)
# one hot encode the month variable
df_month = pd.get_dummies(df.month)
# one hot encode the day variable
df_day = pd.get_dummies(df.day)
# rename the columns in both dataframes
df_X.columns = ['1_X', '2_X', '3_X', '4_X', '5_X', '6_X', '7_X', '8_X', '9_X']
df_Y.columns = ['2_Y', '3_Y', '4_Y', '5_Y', '6_Y', '8_Y', '9_Y']
# concatenate the two dataframes together
df_xy = pd.concat([df_X, df_Y], axis=1)
# concat them with the main dataframe
df = pd.concat([df_xy, df_month, df_day, df], axis =1)

Conversion to another unit is a form of feature engineering I personally use when dealing with numerical variables. It often tends to improve model accuracy, since the model sees the feature in a different form. The downside of this technique is that redundancy can easily be introduced into the model.

The numerical features to be converted to other units are temperature (Celsius) and wind speed (km/h).

Note: Fahrenheit = Celsius × 9/5 + 32 (so 1 degree Celsius corresponds to 33.8 degrees Fahrenheit), and 1 km/h is 0.277777778 m/s.

# convert temperature in degrees Celsius to Fahrenheit
df['temp_F'] = df['temp'] * 9 / 5 + 32
# convert wind speed in km/hr to m/sec
df['wind_m/s'] = df['wind'] * 0.277777778

Binary encoding is a memory-efficient encoding scheme, as it uses fewer features than one-hot encoding. Furthermore, it reduces the curse of dimensionality for data with high cardinality.

The day feature will be binarized into a weekday/weekend indicator.

df['weekends'] = df['day'].apply(lambda x: 1 if x == 'sun' or x == 'sat' else 0)

Predicting the Fire Spread

Note: This dataset is so small, and so noisy, that fitting the test set is really, really easy. Reshuffling the data or tweaking parameters can make a huge difference to the test-set fit just by random chance. Implicitly fitting the test set can give misleading estimates of the true generalization performance. Therefore, it’s really important to fit the training set “blind” to the test set.

The hyperparameter settings below are based on a few iterations of training, with some guided attempts driven by the documentation on the LightGBM website. They are far from the optimum. In a real application we would adjust them over numerous iterations and observe the impact on the loss, for instance with a search like the one sketched below.
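As an illustration of how such an adjustment could be automated (a sketch under my own assumptions, not the tuning procedure used in this post; the search space and n_iter are arbitrary), scikit-learn’s RandomizedSearchCV can wrap an LGBMRegressor and try random hyperparameter settings on the training split defined in the next block:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
import lightgbm as lgb
# hypothetical search space, loosely guided by the LightGBM documentation
param_distributions = {
    'learning_rate': uniform(0.01, 0.5),
    'num_leaves': randint(16, 256),
    'max_depth': randint(3, 12),
    'colsample_bytree': uniform(0.5, 0.5),
}
search = RandomizedSearchCV(
    lgb.LGBMRegressor(n_estimators=500),
    param_distributions=param_distributions,
    n_iter=20,                        # number of random settings to try
    scoring='neg_mean_squared_error',
    cv=3,
    random_state=42,
)
# after the train/test split below: search.fit(X_train, y_train), then inspect search.best_params_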

import sklearn 
import lightgbm as lgb
import random
random.seed(1)
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split
# split the dataset into train and test sets
# drop the target, plus the original string columns (month, day) that were one-hot encoded earlier
X = df.drop(columns=['area', 'month', 'day'])
y = df['area']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# define hyperparameters for training
hyper_params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'regression',
'metric': 'l2',
'learning_rate': 0.5,
'feature_fraction': 0.9,
'bagging_fraction': 0.7,
'bagging_freq': 10,
'verbose': 0,
"max_depth": 8,
"num_leaves": 128,
"max_bin": 512,
"num_iterations": 100000,
"n_estimators": 1000
}
# train Model
gbm = lgb.LGBMRegressor(**hyper_params)
gbm.fit(X_train, y_train,
eval_set=[(X_test, y_test)],
eval_metric='l1',
early_stopping_rounds=1000)
# make predictions on the training data and evaluate with the root mean squared log error (RMSLE)
y_pred = gbm.predict(X_train, num_iteration=gbm.best_iteration_)
print('The RMSLE of the prediction is:', round(mean_squared_log_error(y_train, y_pred) ** 0.5, 5))
The RMSLE of the prediction is: 1.79094
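The score above is computed on the training data. Given the earlier note about keeping the fit blind to the test set, a natural follow-up (a sketch I am adding, not part of the original write-up) is to score the held-out test set the same way; the clipping step is an assumption to keep the log error defined if any prediction comes out negative:

# evaluate on the held-out test set with the same RMSLE metric
y_test_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
y_test_pred = np.clip(y_test_pred, 0, None)   # mean_squared_log_error requires non-negative values
test_rmsle = mean_squared_log_error(y_test, y_test_pred) ** 0.5
print('The RMSLE on the test set is:', round(test_rmsle, 5))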

Feature Importance

Feature importance refers to techniques that assign a score to input features based on how useful they are for predicting a target variable. It plays an important role in a predictive modeling project: it provides insight into the data and into the model, and it forms the basis for dimensionality reduction and feature selection, which can improve the efficiency and effectiveness of the model on the problem.

The importance of each feature in predicting the burnt area can be determined with the function below:

# function to determine feature importance
def get_lgbm_varimp(model, train_columns, max_vars=50):
    # the raw Booster and the sklearn wrapper expose importances differently
    if "basic.Booster" in str(model.__class__):
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importance()]).T
    else:
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importances_]).T
    cv_varimp_df.columns = ['feature_name', 'varimp']
    cv_varimp_df.sort_values(by='varimp', ascending=False, inplace=True)
    cv_varimp_df = cv_varimp_df.iloc[0:max_vars]
    cv_varimp_df = cv_varimp_df.set_index('feature_name')
    return cv_varimp_df.plot.bar()

# view feature importance, passing the training feature names
get_lgbm_varimp(gbm, X_train.columns)

Features such as relative humidity, DMC, ISI, FFMC, weekends, wind speed, DC, and the day of the week (Saturday) contributed to the predictive power of the model. These features can be studied further, and additional features could be engineered from them.

Importance of this model?
