DECISION TREES: UNBOXING THE BLACK BOX

Being a civil engineer myself, and thus coming from a non-coding background, I understand how difficult it can sometimes be for people like me who are trying to learn the concepts of machine learning.

So here I will try to explain decision trees, a very simple but powerful model in machine learning, so that everyone who reads this can understand the basic concepts and how to apply them.

1. A brief description of decision trees & their use

In very simple language, a decision tree can be described as a network of decisions, where we make a decision at each step and that decision leads us to another, until a final decision is reached. Equivalently, it can be described as a set of nested if-else statements.

Don’t know what an if-else statement is?

Well, some of you might be scared that this is some very high-level concept used by programmers and beyond our level of understanding. Let me tell you that you are wrong, my friend. It is as simple as making your daily decisions in regular life; you have been using the logic of if-else statements in your day-to-day life for a long time. For example, when you decided not to go to school, your mind started reasoning with if-else statements: if I don't go to school, my parents might scold me and my teacher might punish me the next day; but if I do go (this is the else statement), I may get bored the whole day.

Jokes apart, we will put this logic to use in a real-world application with real data in the next section. Before that, let's see what we use decision trees for.

Uses: Decision trees are used to classify data sets based on some given features or characteristics. For example:

Given the length, width and height of a t-shirt, classify it into a size such as Small, Medium or Large.

They can also be used to predict real values: for example, given the age, weight and gender of a person, predict his/her height.

Technically, decision trees are a supervised machine learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable, or the class of the data, by learning simple decision rules inferred from the features of the data set.
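The "nested if-else" view can be sketched for the t-shirt example above. Note that the threshold values below are entirely made up for illustration; a real decision tree would learn them from labelled data:

```python
def tshirt_size(length_cm: float, width_cm: float) -> str:
    """Classify a t-shirt into a size with nested if-else rules.

    The thresholds (65 cm, 45 cm) are hypothetical, purely to show
    how a decision tree is just a chain of if-else decisions.
    """
    if length_cm < 65:
        if width_cm < 45:
            return "Small"
        else:
            return "Medium"
    else:
        return "Large"

print(tshirt_size(60, 40))  # Small
print(tshirt_size(60, 50))  # Medium
print(tshirt_size(70, 50))  # Large
```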

2. Understanding the data set

Here I will use a very simple and easy-to-understand data set, created by me for ease of computation and interpretation.

So the data set given below consists of 3 criteria (the first three columns) based on which a company decides whether to shortlist a candidate for an interview or not.

If we create an automated model and provide it to the company, they just have to feed the candidates' data for these 3 criteria into the model, and it will classify which candidates to call for an interview and which to reject.

Isn’t it fascinating? Of course it is!

As said above, the data set has 3 criteria (the first 3 columns, excluding the serial no.): 'Matriculation in science', 'Graduate student' and 'Graduation CGPA ≥ 5'. The last column, 'call for job interview', is the decision based on these three criteria: whether to call the candidate or not.

Still intimidated by the number of yeses and nos in the table? Let me explain with a few data points from the table.

Look at the data in sr. no. 1 of the data set. Does the candidate have matriculation in science? Yes. Is the candidate a graduate? Yes. Is the candidate's graduation CGPA ≥ 5? Yes. Since all three criteria are satisfied, will the candidate be called for an interview? Yes.

Now look at sr. no. 4. Is the candidate a graduate? Yes. Graduation CGPA ≥ 5? Yes. But does the candidate have matriculation in science? No. Since this criterion is not satisfied, will the candidate be called for an interview? No.

In sr. no. 5 and sr. no. 12, the other 2 criteria are not satisfied, so those candidates are not called for an interview.

So by studying the data set, we can say that a candidate will be called for an interview only when all three criteria are satisfied.

The above decision making can be visualized using a branched-out flow chart, as below:

Isn’t this even simpler to understand? This is what a decision tree is.

While the above decision tree is what we created using our own logic and understanding of the data set, the machine/computer model does the same in a slightly different way. So now let's understand how the ML model builds a decision tree. It makes use of something called ENTROPY and INFORMATION GAIN, so let us first understand how to derive these two values.

3. Entropy and information gain

Entropy is a measure of how much a single class dominates a data set.

The more a single class dominates the data set, the lower the entropy.

The less a single class dominates, i.e. the data set has roughly equal proportions of all classes, the higher the entropy.

Let's consider our own data set: we have a total of 30 data points or samples and 2 classes, yes and no (call for interview: yes, call for interview: no). Out of the 30 samples we have 12 samples of class 'yes' and 18 samples of class 'no'.

Class 'yes' has only 6 fewer samples than class 'no', so the dominance of a single class is not significant, and hence the entropy will be high.

Entropy is calculated using the formula:

H(y) = -Σ (i = 1 to k) P(yi) · lg P(yi)

where:

H(y) is the entropy for the given classes y,

k is the number of classes (in our case 2, i.e. yes and no),

P(yi) is the probability of class yi, and

lg is log to the base 2.

The derivation of this formula is beyond the scope of this article. So now let's calculate the entropy for our data set.

The probability of class yes is P(yi = yes) = 12/30, i.e. 12 out of the total 30 samples belong to the yes class, and P(yi = no) = 18/30.

H(y) = -(12/30)·lg(12/30) - (18/30)·lg(18/30) ≈ 0.97

Entropy has its highest value, 1, in the case of a 2-class data set when both classes are in equal proportion, and its lowest value, 0, when the data set has data of only one class. So 0.97 is quite a high entropy value.
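As a quick sanity check, the calculation above can be reproduced in a few lines of Python:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    h = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return h + 0.0  # + 0.0 normalizes a possible -0.0 to 0.0

print(round(entropy([12, 18]), 2))  # 12 'yes' vs 18 'no' -> 0.97
print(entropy([15, 15]))            # equal proportions -> maximum entropy, 1.0
print(entropy([30, 0]))             # a single class -> minimum entropy, 0.0
```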

Now let's branch out our data set as per our decision tree in Figure-2.

The first branch-out is based on whether the candidate has matriculation in the field of science or not. From Figure-3 above, if yes we get a data set, say D1; if no, we get another data set, say D2.

Now for D1 and D2 we get entropies of 0.97 and 0.00 respectively: D1 (20 samples) contains all 12 'yes' samples along with 8 'no' samples, while D2 (10 samples) contains only 'no' samples.

Now we calculate the weighted entropies of these 2 data sets and sum them up. The weight of D1 is the no. of samples in D1 divided by the total no. of samples in the entire data set, i.e. 20/30; similarly, the weight of D2 is 10/30.

Weighted entropy = (20/30)*0.97 + (10/30)*0 = 0.64

So let's summarize: till now we calculated the entropy of the entire data set, H(y) = 0.97; then we did the first branch-out on whether the student has matriculated in the field of science or not, got 2 branched-out data sets, and calculated their weighted entropy = 0.64.

Now, the difference between the entropy of the entire data set and the weighted entropy of the branched-out data sets is called the information gain, IG.

In our case IG = 0.97 - 0.64 = 0.33.
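The last few steps can be put together in code. (One small caveat: carrying full precision gives IG ≈ 0.32 rather than the 0.33 obtained from the rounded intermediate values 0.97 and 0.64; the difference is purely rounding.)

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0) + 0.0

h_parent = entropy([12, 18])   # entire data set: 12 'yes', 18 'no'
h_d1 = entropy([12, 8])        # D1: matriculation in science = yes (20 samples)
h_d2 = entropy([0, 10])        # D2: matriculation in science = no (10 samples)

# Weighted entropy of the branched-out data sets, and the information gain
h_weighted = (20 / 30) * h_d1 + (10 / 30) * h_d2
ig = h_parent - h_weighted
print(round(h_parent, 2), round(h_weighted, 2), round(ig, 2))
```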

4. How the ML model builds decision trees

Now that we know how to calculate entropy and information gain let me tell you how the machine learning model uses these values to build a decision tree.

The ML model decides which criterion/feature to branch out on based on the IG value: the feature that gives the maximum IG value on branching out is chosen as the parent node, and the next level of branching out is done using the same concept. In our data set, the IG values using each of the three features as the parent node are as follows:

IG(Matriculation in science) = 0.33 (calculated in previous section)

IG(Graduate student) = 0.28 (calculate this as an exercise)

IG(Graduation CGPA >= 5) = 0.477 (calculate this as an exercise)

So we get the highest IG value, 0.477, when we branch out using 'Graduation CGPA ≥ 5' as the parent node, and that is exactly what the ML model will do, as we will see practically in the next section. So unlike our own logic, where the parent node for branching out was 'Matriculation in science', the ML model branches out using the feature that gives the highest IG value.

We keep continuing this until we get nodes with zero entropy; these final nodes are called leaf nodes. The number of times we have to branch out from the parent node to reach a leaf node is called the depth of the decision tree. The greater the depth, the more complex our decision tree; the smaller the depth, the less complex it is.

This depth is the hyperparameter of the decision tree model, which we have to tune using cross-validation in order to find the optimum or best depth.

Enough of theoretical chit-chat, right? So let's get into action and write the code for classifying data using decision trees.

Here I assume a few prerequisites: that you are familiar with Jupyter notebooks, basic scikit-learn, basic pandas and basic Python for ML.

5. Code for building a classifier using decision trees

a) We read the csv file of our data set, save it as a data frame 'df' and visualize its first 10 rows.
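The original post showed this step as a notebook screenshot. Here is a sketch of it; since the article's exact rows are not reproduced, the snippet first writes a hypothetical reconstruction of the data set (matching only the class counts described: 12 'yes', 18 'no' out of 30) to a csv file named `candidates.csv` (the filename is my assumption):

```python
import pandas as pd

# Hypothetical reconstruction of the data set (the article's exact rows are not
# reproduced here, only its class counts: 12 'yes' and 18 'no' out of 30).
rows = ([("yes", "yes", "yes", "yes")] * 12   # all three criteria met
        + [("yes", "yes", "no", "no")] * 8    # graduate, but CGPA < 5
        + [("no", "yes", "yes", "no")]        # graduate, no matriculation in science
        + [("no", "no", None, "no")] * 9)     # not a graduate, so CGPA is missing
cols = ["Matriculation in science", "Graduate student",
        "Graduation CGPA >= 5", "call for job interview"]
pd.DataFrame(rows, columns=cols).to_csv("candidates.csv", index=False)

# Step (a): read the csv into a data frame 'df' and look at the first 10 rows
df = pd.read_csv("candidates.csv")
print(df.head(10))
```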

b) Let's get the information of all the columns/features of our data set.
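A sketch of this step, again on the hypothetical reconstruction of the data set (only the class and null counts match the article):

```python
import pandas as pd

# Hypothetical reconstruction of the data set (exact rows not in the article)
rows = ([("yes", "yes", "yes", "yes")] * 12 + [("yes", "yes", "no", "no")] * 8
        + [("no", "yes", "yes", "no")] + [("no", "no", None, "no")] * 9)
df = pd.DataFrame(rows, columns=["Matriculation in science", "Graduate student",
                                 "Graduation CGPA >= 5", "call for job interview"])

# Step (b): column names, non-null counts and dtypes of the data frame
df.info()
```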

c) As we can see, the 3rd column, the feature 'Graduation CGPA ≥ 5', has only 21 non-null values; the 9 null values belong to candidates who have not graduated, and since they have no graduation, they have no graduation CGPA. We can fill these null values with 0.
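A sketch of the fill step (same hypothetical reconstruction as before):

```python
import pandas as pd

# Hypothetical reconstruction of the data set (exact rows not in the article)
rows = ([("yes", "yes", "yes", "yes")] * 12 + [("yes", "yes", "no", "no")] * 8
        + [("no", "yes", "yes", "no")] + [("no", "no", None, "no")] * 9)
df = pd.DataFrame(rows, columns=["Matriculation in science", "Graduate student",
                                 "Graduation CGPA >= 5", "call for job interview"])

# Step (c): non-graduates have no CGPA, so fill the missing values with 0
df["Graduation CGPA >= 5"] = df["Graduation CGPA >= 5"].fillna(0)
print(df["Graduation CGPA >= 5"].isna().sum())  # 0 nulls remain
```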

d) The scikit-learn decision tree model cannot classify based on categorical data, i.e. our type of data, which has 2 categories, yes and no. We have to convert this categorical data into numeric data so that the sklearn decision tree can interpret it.
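One simple way to do this conversion is a plain yes/no to 1/0 replacement (a sketch; the article may have used a different encoding):

```python
import pandas as pd

# Hypothetical reconstruction of the data set (exact rows not in the article)
rows = ([("yes", "yes", "yes", "yes")] * 12 + [("yes", "yes", "no", "no")] * 8
        + [("no", "yes", "yes", "no")] + [("no", "no", None, "no")] * 9)
df = pd.DataFrame(rows, columns=["Matriculation in science", "Graduate student",
                                 "Graduation CGPA >= 5", "call for job interview"])
df["Graduation CGPA >= 5"] = df["Graduation CGPA >= 5"].fillna(0)  # step (c)

# Step (d): encode the yes/no categories as 1/0 so sklearn can use them
df = df.replace({"yes": 1, "no": 0})
print(df.dtypes)
```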

e) Now we separate the independent variables/features (X) from the class label/dependent variable (y).
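A sketch of the split into features and label (same hypothetical reconstruction):

```python
import pandas as pd

# Hypothetical reconstruction of the data set (exact rows not in the article)
rows = ([("yes", "yes", "yes", "yes")] * 12 + [("yes", "yes", "no", "no")] * 8
        + [("no", "yes", "yes", "no")] + [("no", "no", None, "no")] * 9)
df = pd.DataFrame(rows, columns=["Matriculation in science", "Graduate student",
                                 "Graduation CGPA >= 5", "call for job interview"])
df["Graduation CGPA >= 5"] = df["Graduation CGPA >= 5"].fillna(0)  # step (c)
df = df.replace({"yes": 1, "no": 0})                               # step (d)

# Step (e): features X and class label y
X = df.drop(columns=["call for job interview"])
y = df["call for job interview"]
print(X.shape, y.shape)  # (30, 3) (30,)
```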

f) Now we split the data into a training set and a test set in an 80:20 proportion.
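A sketch of the 80:20 split; stratifying on y and the fixed random_state are my assumptions, added to keep the class balance and make the split reproducible:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical reconstruction of the data set (exact rows not in the article)
rows = ([("yes", "yes", "yes", "yes")] * 12 + [("yes", "yes", "no", "no")] * 8
        + [("no", "yes", "yes", "no")] + [("no", "no", None, "no")] * 9)
df = pd.DataFrame(rows, columns=["Matriculation in science", "Graduate student",
                                 "Graduation CGPA >= 5", "call for job interview"])
df["Graduation CGPA >= 5"] = df["Graduation CGPA >= 5"].fillna(0)  # step (c)
df = df.replace({"yes": 1, "no": 0})                               # step (d)
X = df.drop(columns=["call for job interview"])                    # step (e)
y = df["call for job interview"]

# Step (f): 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)
print(len(X_train), len(X_test))  # 24 6
```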

g) Now we perform five-fold cross-validation using randomized search CV on our training data, Xtrain and ytrain, in order to find the best hyperparameter, i.e. the depth of our decision tree. For this we pass a list of hyperparameter values [2, 3, 4, 5, 8, 10], and as we can see below, cross-validation yields the best max_depth (depth of the decision tree) as 2. Also observe that we have selected criterion = 'entropy' for the decision tree, so the tree will be built using entropy and IG. It can also be built using another criterion, gini impurity, by passing criterion = 'gini'. For accuracy we have selected the F1-score performance metric.
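A sketch of the search (on the hypothetical reconstruction, the selected depth may differ from the article's 2):

```python
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical reconstruction of the data set (exact rows not in the article)
rows = ([("yes", "yes", "yes", "yes")] * 12 + [("yes", "yes", "no", "no")] * 8
        + [("no", "yes", "yes", "no")] + [("no", "no", None, "no")] * 9)
df = pd.DataFrame(rows, columns=["Matriculation in science", "Graduate student",
                                 "Graduation CGPA >= 5", "call for job interview"])
df["Graduation CGPA >= 5"] = df["Graduation CGPA >= 5"].fillna(0)  # step (c)
df = df.replace({"yes": 1, "no": 0})                               # step (d)
X = df.drop(columns=["call for job interview"])                    # step (e)
y = df["call for job interview"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

# Step (g): 5-fold randomized search over max_depth, scored with F1
search = RandomizedSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_distributions={"max_depth": [2, 3, 4, 5, 8, 10]},
    n_iter=6, cv=5, scoring="f1", random_state=0)
search.fit(X_train, y_train)
print(search.best_params_)
```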

h) Now let us train the model using the optimum hyperparameter derived in step g) and get the predictions for Xtrain and Xtest.

Using the F1-score metric we will also get the accuracy of the test-data and train-data predictions.
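A sketch of training and scoring with the article's best depth of 2 (the exact F1 values will differ on the hypothetical reconstruction):

```python
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical reconstruction of the data set (exact rows not in the article)
rows = ([("yes", "yes", "yes", "yes")] * 12 + [("yes", "yes", "no", "no")] * 8
        + [("no", "yes", "yes", "no")] + [("no", "no", None, "no")] * 9)
df = pd.DataFrame(rows, columns=["Matriculation in science", "Graduate student",
                                 "Graduation CGPA >= 5", "call for job interview"])
df["Graduation CGPA >= 5"] = df["Graduation CGPA >= 5"].fillna(0)  # step (c)
df = df.replace({"yes": 1, "no": 0})                               # step (d)
X = df.drop(columns=["call for job interview"])                    # step (e)
y = df["call for job interview"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

# Step (h): train with the best depth found in step (g) (2, per the article)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(X_train, y_train)
print("train F1:", f1_score(y_train, clf.predict(X_train)))
print("test F1:", f1_score(y_test, clf.predict(X_test)))
```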

i) Now let us plot the decision tree for our Xtrain data set.
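A sketch of the plotting step using sklearn's plot_tree (the original post showed the resulting figure as an image):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Hypothetical reconstruction of the data set (exact rows not in the article)
rows = ([("yes", "yes", "yes", "yes")] * 12 + [("yes", "yes", "no", "no")] * 8
        + [("no", "yes", "yes", "no")] + [("no", "no", None, "no")] * 9)
df = pd.DataFrame(rows, columns=["Matriculation in science", "Graduate student",
                                 "Graduation CGPA >= 5", "call for job interview"])
df["Graduation CGPA >= 5"] = df["Graduation CGPA >= 5"].fillna(0)  # step (c)
df = df.replace({"yes": 1, "no": 0})                               # step (d)
X = df.drop(columns=["call for job interview"])                    # step (e)
y = df["call for job interview"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

# Step (i): fit the depth-2 tree on the training data and draw it
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(X_train, y_train)
plt.figure(figsize=(10, 6))
plot_tree(clf, feature_names=list(X.columns), class_names=["no", "yes"],
          filled=True)
plt.savefig("decision_tree_train.png")
```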

As we can see, the decision tree has a depth of 2, which was selected as optimum by cross-validation. Also, as discussed in our theory, the leaf nodes have an entropy of 0.

Observe another thing here: the decision tree has selected a threshold value of 0.5 for each feature. If a feature has a value < 0.5, then in our case it means the feature has value 0, or 'no'; if this condition is false, the feature has value 1, or 'yes'.

j) Now, finally, let's build the decision tree for the entire data set using the optimum decision tree model.
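A sketch of the final step; note that because the reconstruction fixes the class counts at 12 'yes' and 18 'no', the root entropy printed here matches the 0.97 we calculated by hand:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Hypothetical reconstruction of the data set (exact rows not in the article)
rows = ([("yes", "yes", "yes", "yes")] * 12 + [("yes", "yes", "no", "no")] * 8
        + [("no", "yes", "yes", "no")] + [("no", "no", None, "no")] * 9)
df = pd.DataFrame(rows, columns=["Matriculation in science", "Graduate student",
                                 "Graduation CGPA >= 5", "call for job interview"])
df["Graduation CGPA >= 5"] = df["Graduation CGPA >= 5"].fillna(0)  # step (c)
df = df.replace({"yes": 1, "no": 0})                               # step (d)
X = df.drop(columns=["call for job interview"])                    # step (e)
y = df["call for job interview"]

# Step (j): refit the depth-2 tree on the entire data set and plot it
clf_full = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf_full.fit(X, y)
print(round(clf_full.tree_.impurity[0], 2))  # root entropy for 12 yes / 18 no -> 0.97
plt.figure(figsize=(10, 6))
plot_tree(clf_full, feature_names=list(X.columns), class_names=["no", "yes"],
          filled=True)
plt.savefig("decision_tree_full.png")
```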

From the above decision tree we can now verify that the entropy value we calculated, 0.97, is correct. Also, the parent node uses the feature 'Graduation CGPA ≥ 5', as we stated in section 4: branching out with this feature as the parent node gives the highest IG value.

References:

1. sci-kit learn documentation

2. Lecture videos from Applied AI course of Applied Roots
