Getting Started Predicting Forest Fire Size using Machine Learning

9 min read · Dec 25, 2020

Wildfires are far more alarming than most people imagine. While companies and governments are busy moving services online, state-of-the-art technology could also be used to contain wildfires, which would save lives and property.
Over the years, data and AI have played a vital role in filling serious gaps, and their use in emergencies such as wildfires is growing. With new advances in the field and more data generated every year, these sought-after technologies could be used to forecast massive wildfires.

The dataset used here contains 517 fires from the Montesinho natural park in Portugal. For each incident, the weekday, month, coordinates, and burnt area are recorded, along with meteorological data such as rain, temperature, humidity, and wind.

Data Information

X — x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y — y-axis spatial coordinate within the Montesinho park map: 2 to 9
month — month of the year: “jan” to “dec”
day — day of the week: “mon” to “sun”
FFMC — FFMC (loose material moisture content) index from the FWI system: 18.7 to 96.20
DMC — DMC index from the FWI system: 1.1 to 291.3
DC — DC index from the FWI system: 7.9 to 860.6
ISI — ISI index from the FWI system: 0.0 to 56.10
temp — the temperature in Celsius degrees: 2.2 to 33.30
RH — relative humidity in %: 15.0 to 100
wind — wind speed in km/h: 0.40 to 9.40
rain — outside rain in mm/m2 : 0.0 to 6.4
area — the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is very skewed towards 0.0, thus it may make sense to model with the logarithm transform).

Missing Attribute Values: None
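
The note on the area variable suggests a logarithm transform. A minimal sketch of what that could look like is shown here, assuming log(1 + x) is used so that zero-area fires stay defined. The values are hypothetical and only mimic the skew toward 0.0; they are not read from forestfires.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical burned-area values (ha), chosen to mimic the skew toward 0.0
area = pd.Series([0.0, 0.0, 0.0, 0.36, 2.14, 6.43, 49.37, 1090.84], name="area")

# log1p computes log(1 + x), so zero-area fires map to 0 instead of -inf
log_area = np.log1p(area)

# expm1 is the inverse of log1p, used to turn values back into hectares
restored = np.expm1(log_area)

print("skew before:", round(area.skew(), 2))
print("skew after: ", round(log_area.skew(), 2))
```

If a model is trained on the transformed target, its predictions would need to be passed through np.expm1 before being read as hectares.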

Definition of Features

The Duff Moisture Code (DMC) represents fuel moisture of decomposed organic material underneath the litter. It may provide insight into live fuel moisture stress.
The Drought Code (DC) represents drying deep into the soil. It approximates moisture conditions for the equivalent of 53-day (1272 hour) timelag fuels. It is unitless, with a maximum value of 1000.
The Fine Fuel Moisture Code (FFMC) represents fuel moisture of forest litter fuels under the shade of a forest canopy. It is intended to represent moisture conditions for shaded litter fuels, the equivalent of 16-hour timelag. It ranges from 0–101. Subtracting the FFMC value from 100 can provide an estimate for the equivalent (approximately 10h) fuel moisture content, most accurate when FFMC values are roughly above 80.
The Initial Spread Index (ISI) is analogous to the NFDRS Spread Component (SC). It integrates fuel moisture for fine dead fuels and surface windspeed to estimate a spread potential. ISI is a key input for fire behavior predictions in the FBP system. It is unitless and open ended.
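
The FFMC rule of thumb above (subtract the FFMC value from 100) can be sketched as a small helper. The function name ffmc_to_moisture is hypothetical, and the result is only a rough estimate that, per the definition above, is most accurate when FFMC is roughly above 80.

```python
def ffmc_to_moisture(ffmc):
    """Rough fuel moisture estimate (%) using the 100 - FFMC rule of thumb."""
    if not 0 <= ffmc <= 101:
        raise ValueError("FFMC is expected to lie in the range 0-101")
    # Clamp at zero because FFMC can reach 101
    return max(0.0, 100.0 - ffmc)

for ffmc in (96.2, 90.0, 85.0):
    print(ffmc, "->", round(ffmc_to_moisture(ffmc), 1), "% moisture (approx.)")
```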

For a regression task this difficult, diving straight into predicting burnt areas would be a mistake. Why? Because we first need to understand what the data can and cannot tell us.

Let's import the libraries and view our data:

# Import libraries
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

# Load the data (named df because the rest of the code refers to df)
df = pd.read_csv('C:/Forest_Fire/forestfires.csv')
df.head()

Done! We will follow the “Data Science Process”.

We have already obtained the data, and it is in a clean format. The next step is to explore the data. Exploration helps uncover meaningful insights, which come from asking meaningful questions of the data. Let’s get started!

In which month did the largest forest fires occur?

sns.set(rc={'figure.figsize':(15.7,7.27)})
plt.scatter(x=df["month"], y=df["area"], color = 'red')
plt.xlabel("Month", size = 15)
plt.ylabel("Area", size = 15)
plt.title("Month With the highest forest burnt area", size = 15)

Answer: September had the largest forest fires in terms of burnt area, followed by August and then July. What does this imply? Weather is evidently one of the factors that affect forest fire size. We could also create useful features from the month.
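
One feature that could be built from the month is a cyclical encoding, which places December next to January on a circle. This is only a sketch of one option, not something the analysis here relies on. The column names month_sin and month_cos and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
month_to_num = {m: i for i, m in enumerate(MONTHS)}  # jan -> 0, ..., dec -> 11

# Hypothetical sample of the month column
sample = pd.DataFrame({"month": ["jan", "jul", "aug", "sep", "dec"]})

# Map each month to an angle on the unit circle, then take sine and cosine
angle = 2 * np.pi * sample["month"].map(month_to_num) / 12
sample["month_sin"] = np.sin(angle)
sample["month_cos"] = np.cos(angle)
print(sample)
```

With this encoding, the distance between December and January is the same as the distance between any other pair of adjacent months, which a plain 1 to 12 numbering does not give.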

Do most forest fires occur during weekdays or weekends?

days_of_fire = df['day'].value_counts()
days_of_fire = days_of_fire.to_frame()
days_of_fire = days_of_fire.reset_index()
days_of_fire.columns = ['day', 'fire_count']
days_of_fire.plot.bar(x = 'day', y= 'fire_count', color = 'orange')
plt.xlabel("Day", size = 15)
plt.ylabel("Fire Count", size = 15)
plt.title("Fire Count for each Day in the Montesinho park", size = 15)

Answer: The majority of the forest fires occurred on days with less working activity (Sunday, Friday and Saturday). This could imply that forest fires are more likely to happen on weekends than on weekdays. We could also engineer some features from the day of the week.

Do high temperatures imply larger forest fires in terms of area?

temp_by_area = df.groupby(['temp'], sort=True)['area'].max()
temp_by_area = temp_by_area.to_frame()
temp_by_area = temp_by_area.reset_index()
temp_by_area.columns = ['temp', 'area']
temp_by_area.plot(y ='area', x = 'temp')

Answer: Larger forest fires were associated with higher temperatures.

Note: Relationships between other variables can still be explored. A quick way to do that is to plot the pairwise relationships between the features/variables in the data.

sns.pairplot(data=df, diag_kind='kde', vars=['DC', 'FFMC', 'DMC', 'temp', 'X', 'Y', 'area',
                                             'ISI', 'RH', 'wind', 'rain'])
plt.show()

No relationships are immediately visible in the pair plot, though. An approach I would strongly recommend for determining relationships between features/variables is to compute their correlation. The correlation coefficient is easy to interpret: it measures the strength and direction of a linear relationship between two variables.

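# month and day are strings, so restrict the correlation to numeric columns
df.select_dtypes(include='number').corr(method='pearson')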
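
For reference, the Pearson coefficient between two variables is the sum of the products of their deviations from the mean, divided by the square root of the product of their summed squared deviations. The sketch below computes it by hand and compares it with pandas. The toy values are hypothetical and are not taken from forestfires.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from forestfires.csv
toy = pd.DataFrame({
    "temp": [8.2, 18.0, 14.6, 22.8, 27.4, 30.2],
    "RH":   [51, 33, 40, 27, 22, 25],
})

# Deviations from the mean for each variable
x = toy["temp"] - toy["temp"].mean()
y = toy["RH"] - toy["RH"].mean()

# Pearson r computed directly from its definition
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same coefficient as computed by pandas
r_pandas = toy["temp"].corr(toy["RH"], method="pearson")
print(round(r_manual, 4), round(r_pandas, 4))
```

A value near -1 or +1 indicates a strong linear relationship, while a value near 0 only rules out a linear one; a nonlinear relationship can still be present.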

According to this publication, arson is the cause of about 42% of human-caused wildfires in Portugal. This could help explain why some outliers are present in the data.

Feature Engineering

One-hot encoding is a form of feature engineering that transforms categorical features into a format that works better with classification and regression algorithms. It is very useful in methods where multiple types of data representation are necessary.

The features to be one-hot encoded are all the categorical variables in the data: X, Y, month, and day.

# one-hot encode coordinates X and Y
df_X = pd.get_dummies(df.X)
df_Y = pd.get_dummies(df.Y)
# one-hot encode the month variable
df_month = pd.get_dummies(df.month)
# one-hot encode the day variable
df_day = pd.get_dummies(df.day)
# rename the columns in both dataframes
df_X.columns = ['1_X', '2_X', '3_X', '4_X', '5_X', '6_X', '7_X', '8_X', '9_X']
df_Y.columns = ['2_Y', '3_Y', '4_Y', '5_Y', '6_Y', '8_Y', '9_Y']
# concatenate the two dataframes together
df_xy = pd.concat([df_X, df_Y], axis=1)
# concatenate them with the main dataframe
df = pd.concat([df_xy, df_month, df_day, df], axis=1)

Converting to another unit is a form of feature engineering I personally use when dealing with numerical variables. This technique often tends to improve model accuracy, as the model sees the feature in a different form. The downside is that redundancy can easily be introduced into the model.

The numerical features to be converted to other units are temperature (Celsius) and wind speed (km/h).

Note: °F = °C × 9/5 + 32 (so a temperature of 1 degree Celsius is 33.8 degrees Fahrenheit), and 1 km/h = 0.277777778 m/s.

# convert temperature from degrees Celsius to Fahrenheit
df['temp_F'] = df['temp'] * 9 / 5 + 32
# convert wind speed from km/h to m/s
df['wind_m/s'] = df['wind'] * 0.277777778

Binary encoding is a memory-efficient encoding scheme, as it uses fewer features than one-hot encoding. Furthermore, it reduces the curse of dimensionality for data with high cardinality.

The day feature will be binarized into weekdays and weekends.

df['weekends'] = df['day'].apply(lambda x: 1 if x == 'sun' or x == 'sat' else 0)

Predicting the Fire Spread

Note: This dataset is so small and so noisy that overfitting the test set is really, really easy. Reshuffling the data or tweaking parameters can make a huge difference to the test-set fit just by random chance. Implicitly fitting the test set can give misleading estimates of true generalization performance. Therefore, it is really important to fit the training set “blind” to the test set.
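
One common way to keep tuning “blind” to the test set is to hold the test set out once and compare settings with k-fold cross-validation on the training portion only. The sketch below illustrates the idea under clear assumptions: the data is synthetic, and a Ridge regressor from scikit-learn stands in for LightGBM so the example stays small and self-contained.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the real features and target (assumption)
X = rng.normal(size=(200, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=200)

# Hold out the test set once and do not look at it while tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Score the model on 5 folds of the training data only
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    Ridge(alpha=1.0), X_train, y_train,
    cv=cv, scoring="neg_root_mean_squared_error",
)
cv_rmse = -scores  # scikit-learn reports the negated RMSE
print("RMSE per fold:", np.round(cv_rmse, 3))
print("mean +/- std :", round(cv_rmse.mean(), 3), "+/-", round(cv_rmse.std(), 3))
```

The spread across folds gives a feel for how much of any apparent improvement is just noise, which matters a lot with only 517 rows.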

The hyperparameter settings below are based on a few iterations of training, with some guided attempts driven by the documentation on the LightGBM website. They are far from optimal. In a real application we would adjust these over numerous iterations to see the impact on the loss.

import sklearn
import lightgbm as lgb
import random
random.seed(1)
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

# split dataset into train and test; the raw string columns (month, day)
# are dropped because their one-hot encoded versions are already in df
X = df.drop(columns=['area', 'month', 'day'])
y = df['area']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# define hyperparameters for training
hyper_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': ['l2', 'l1'],  # 'auc' is a classification metric, so it is not used here
    'learning_rate': 0.5,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.7,
    'bagging_freq': 10,
    'verbose': 0,
    "max_depth": 8,
    "num_leaves": 128,
    "max_bin": 512,
    "n_estimators": 1000  # 'num_iterations' is an alias of this, so only one is set
}

# train model (early stopping is passed as a callback)
gbm = lgb.LGBMRegressor(**hyper_params)
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='l1',
        callbacks=[lgb.early_stopping(stopping_rounds=1000)])

# make predictions and measure the error; the square root of
# mean_squared_log_error is the RMSLE rather than the plain RMSE
y_pred = gbm.predict(X_train, num_iteration=gbm.best_iteration_)
y_pred = np.clip(y_pred, 0, None)  # the log error is undefined for negative predictions
print('The RMSLE of prediction is:', round(mean_squared_log_error(y_train, y_pred) ** 0.5, 5))

The RMSLE of prediction is: 1.79094
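
Because the score above is the square root of the mean squared log error, it is worth seeing how that differs from a plain RMSE. The sketch below implements both with NumPy on hypothetical areas; the helper names rmse and rmsle are illustrative and not part of any library.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error in the units of the target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared error of log(1 + value), a relative-style error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE is undefined for negative values")
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))

# Hypothetical burned areas in hectares
y_true = [0.0, 1.0, 10.0, 1000.0]
y_pred = [0.5, 2.0, 8.0, 400.0]
print("RMSE :", round(rmse(y_true, y_pred), 3))
print("RMSLE:", round(rmsle(y_true, y_pred), 3))
```

The single large fire dominates the RMSE, while the log-based score weighs relative errors more evenly, which suits a target as skewed as the burnt area.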

Feature Importance

Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable. Feature importance plays an important role in a predictive modeling project, including providing insight into the data, insight into the model, and the basis for dimensionality reduction and feature selection that can improve the efficiency and effectiveness of a predictive model on the problem.
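
As a model-agnostic illustration of the idea, permutation importance shuffles one feature at a time and measures how much the model's score drops. This is a different technique from the split-based importance that LightGBM reports. The sketch uses scikit-learn on synthetic data with a random forest, and the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Only column 0 drives the target; columns 1 and 2 are pure noise
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each column 5 times and record the average drop in R^2
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for name, score in zip(["signal", "noise_1", "noise_2"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```

On real data it is usually more informative to compute this on held-out rows than on the rows the model was trained on.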

The importance of each feature in predicting the burnt area can be determined using the function below:

# function to determine feature importance
def get_lgbm_varimp(model, train_columns, max_vars=50):
    # a raw Booster exposes feature_importance(); the sklearn wrapper
    # exposes the feature_importances_ attribute instead
    if "basic.Booster" in str(model.__class__):
        importances = model.feature_importance()
    else:
        importances = model.feature_importances_
    cv_varimp_df = pd.DataFrame({'feature_name': list(train_columns), 'varimp': importances})
    cv_varimp_df = cv_varimp_df.sort_values(by='varimp', ascending=False)
    cv_varimp_df = cv_varimp_df.iloc[0:max_vars]
    cv_varimp_df = cv_varimp_df.set_index('feature_name')
    return cv_varimp_df.plot.bar()

# view feature importance
get_lgbm_varimp(gbm, X_train.columns)

Features such as relative humidity, DMC, ISI, FFMC, weekends, wind speed, DC, and the Saturday day-of-week indicator contributed to the predictive power of the model. These features can be studied further, and additional features can be engineered from them.

Importance of this model?

The chance of providing an accurate estimate of the area of a forest fire is quite low. However, it is possible to get decent estimates of how far a particular fire will spread, even with the meager data that we have.
We know from the data that the majority of fires start during late summer. This analysis builds on that by showing that a fire starting during hot, dry weather is more likely to spread than one starting in colder, wetter weather. Not exactly the most groundbreaking discovery.
However, forest managers could gather more data that accounts for hill slope and aspect, vegetation type, extent of deforestation, land use, detailed weather, and other factors. Such robust data would support more accurate predictions than the scanty data we worked with.
More precise fire-size predictions could also help determine whether a particular incipient fire needs to be contained or not, which would lead to significant cost savings.