Decision Tree Algorithm Facts and Fundamentals

Decision Tree

What is a decision tree?

A decision tree is a machine learning algorithm that comes under supervised learning (it requires a predefined target variable) and is most often used in classification problems. A decision tree works for both categorical and continuous output variables. Here, we split the data into two or more homogeneous sets (sub-populations), where the split is based on the most significant splitter/differentiator among the input variables.

Important Terminology used for decision Tree: 

Let us explore the basic terms used with Decision Trees:

Root Node: It represents the entire population or sample, which further gets divided into two or more homogeneous sets.

Splitting: It is a way of dividing a node into two or more sub-nodes. 

Decision Node: When a sub-node splits into further sub-nodes, it is referred to as a decision node.

Leaf/Terminal Node: Nodes that do not split are termed leaf or terminal nodes.

Pruning: When we remove sub-nodes of a decision node, the method is termed pruning. You can think of it as the opposite of splitting.

Branch / Sub-Tree: A subsection of the whole tree is termed a branch or sub-tree.

Parent and Child Node: A node that splits into sub-nodes is termed the parent node of those sub-nodes, and the sub-nodes are the children of that parent node.

Types of Decision Trees:

Based on the type of target variable, decision trees are classified into two types:

Categorical Variable Decision Tree: If the decision tree has a categorical target variable, it is referred to as a categorical-variable decision tree.

Continuous Variable Decision Tree: If the decision tree has a continuous target variable, it is said to be a continuous-variable decision tree.
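In scikit-learn this distinction maps directly onto two estimators: `DecisionTreeClassifier` for a categorical target and `DecisionTreeRegressor` for a continuous one. A minimal sketch, with toy feature and target data invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# toy data invented for illustration: two numeric features per observation
X = [[25, 0], [35, 1], [45, 0], [20, 1], [55, 0], [30, 1]]
y_class = [0, 1, 1, 0, 1, 0]             # categorical target -> classification tree
y_cont = [1.2, 3.4, 4.1, 0.9, 5.0, 2.2]  # continuous target  -> regression tree

clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
reg = DecisionTreeRegressor(random_state=0).fit(X, y_cont)

print(clf.predict([[40, 0]]))  # predicts a class label
print(reg.predict([[40, 0]]))  # predicts a numeric value
```

The same splitting machinery runs underneath both; only the target type and the prediction rule at the leaves differ.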

Regression Trees vs Classification Trees:

Generally, the terminal nodes (or leaves) lie at the lower part of the decision tree. This means decision trees are usually drawn upside down, with the leaves at the bottom and the root at the top.

Both types of tree work in much the same way. Let’s look at the primary differences and similarities between classification and regression trees:

  1. Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.
  2. In a regression tree, the value assigned to a terminal node is the mean response of the training observations falling in that region. Thus, if an unseen observation falls in that region, its prediction is that mean value.
  3. In a classification tree, the value (class) assigned to a terminal node is the mode of the training observations falling in that region. Likewise, if an unseen observation falls in that region, its prediction is that mode (majority class).
  4. Both trees partition the predictor space (independent variables) into distinct, non-overlapping regions. For the sake of simplicity, you can think of these regions as high-dimensional boxes.
  5. Both trees follow a top-down greedy approach known as recursive binary splitting. We call it top-down because it begins at the top of the tree, when all observations lie in a single region, and successively splits the predictor space into two new branches down the tree. It is called greedy because the algorithm cares only about the current split (it looks for the best variable available now), not about future splits that could lead to a better tree.
  6. The splitting process continues until a user-defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node falls below 50.
  7. In both cases, the splitting process yields fully grown trees once the stopping criteria are reached. But a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This brings us to pruning, one of the strategies used to handle overfitting. We’ll get to know more about it in a later section.
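The leaf-prediction rules in points 2 and 3 can be sketched directly: given the training observations that landed in one terminal region (toy numbers below, invented for illustration), a regression tree predicts their mean and a classification tree predicts their mode.

```python
from statistics import mean, mode

# training observations that fell into one terminal node (toy values)
region_targets_regression = [4.0, 5.0, 6.0]           # continuous responses
region_targets_classification = ["yes", "yes", "no"]  # class labels

# a regression tree predicts the mean response of the region
pred_reg = mean(region_targets_regression)

# a classification tree predicts the mode (majority class) of the region
pred_clf = mode(region_targets_classification)

print(pred_reg)  # 5.0
print(pred_clf)  # "yes"
```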


Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X), and Height (5 to 6 ft). 15 of these 30 play cricket in their free time. Now, I would like to build a model to predict who will play cricket during free time. In this problem, we need to segregate the students who play cricket based on the most significant input variable among the three.

This is where the decision tree helps: it segregates the students based on all values of the three variables and identifies the variable that creates the most homogeneous sets of students (sets that are heterogeneous to each other). In this example, the variable Gender identifies the best homogeneous sets compared to the other two variables.
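We can make "most homogeneous" concrete with the Gini impurity. Suppose, hypothetically (the per-group counts below are invented for illustration), that splitting the 30 students by Gender gives 10 girls of whom 2 play cricket and 20 boys of whom 13 play. The weighted impurity after the split being lower than the root’s impurity is what makes Gender a good splitter:

```python
def gini(positives, total):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    p = positives / total
    return 1 - p**2 - (1 - p)**2

root_impurity = gini(15, 30)          # 15 of 30 play cricket -> 0.5 (max for binary)

# hypothetical split by Gender: 10 girls (2 play), 20 boys (13 play)
g_girls = gini(2, 10)
g_boys = gini(13, 20)
weighted = (10 / 30) * g_girls + (20 / 30) * g_boys

print(root_impurity, round(weighted, 3))  # lower weighted impurity = purer split
```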

How will a tree decide where to split?

The decision of creating strategic splits heavily affects a tree’s accuracy. The decision criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. Alternatively, we can say that the purity of a node increases with respect to the target variable. The decision tree considers splits on all available variables and then selects the split that results in the most homogeneous sub-nodes.

Let’s look at the four most commonly used splitting criteria in decision trees:

  1. Gini Index
  2. Information Gain (Entropy)
  3. Chi-Square
  4. Reduction in Variance
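The first two criteria are simple functions of the class proportions in a node, and a small sketch makes their behavior clear: both are zero for a pure node and maximal for an evenly mixed one.

```python
import math

def gini_index(probs):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    return 1 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy in bits: -sum(p_k * log2(p_k)); the basis of information gain."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# a pure node (a single class) has zero impurity under both measures
print(gini_index([1.0]), entropy([1.0]))

# an evenly mixed binary node is maximally impure: Gini 0.5, entropy 1 bit
print(gini_index([0.5, 0.5]), entropy([0.5, 0.5]))
```

Information gain is then the entropy of the parent node minus the weighted entropy of the child nodes; a split is chosen to maximize that gain (or, equivalently for the Gini criterion, to minimize the weighted impurity).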

Tree Pruning:

As mentioned earlier, the technique of setting constraints is a greedy approach. In other words, it checks for the best split immediately and moves forward until one of the specified stopping conditions is reached. Let’s consider the following case when you’re driving:

There are 2 lanes:

  1. A lane with cars moving at 80 km/h
  2. A lane with trucks moving at 30 km/h

At this instant, you are one of the cars and you have two choices:

  1. Take a left and overtake the other 2 cars quickly
  2. Keep moving in the present lane

Let’s analyze these choices. With the former choice, you’ll immediately overtake the car ahead, reach behind the truck, and start moving at 30 km/h while looking for a chance to move back right. All the cars originally behind you move ahead in the meantime. This would be the optimal choice if your objective were to maximize the distance covered in, say, the next 10 seconds. Greedy you! With the latter choice, you sail through at the same speed, cross the trucks, and then overtake depending on the situation ahead.

This is exactly the difference between a normal decision tree and pruning. A decision tree with constraints won’t see the truck ahead and will adopt a greedy approach by taking the left. On the other hand, if we use pruning, we in effect look a few steps ahead before making a choice.

So we know pruning is better. But how do we implement it in a decision tree? The concept is simple:

  1. We first build the decision tree model to a large depth.
  2. Then we start at the bottom and remove leaves that give negative returns compared with the top.
  3. Suppose one split gives us a gain of -10 (a loss of 10), and the following split gives a gain of 20. A simple decision tree would stop at the first split, but with pruning we see that the overall gain is +10 and keep both splits.

Note: older versions of sklearn’s decision tree classifier did not support pruning; since version 0.22, scikit-learn supports cost-complexity pruning via the ccp_alpha parameter. Advanced packages like xgboost have long embraced tree pruning in their implementations, and the rpart library in R provides a prune function.
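A minimal sketch of scikit-learn’s cost-complexity pruning on the bundled iris dataset: we grow a full-depth tree, then refit with a nonzero `ccp_alpha` (0.02 here is an arbitrary value chosen for illustration), which removes subtrees whose gain does not justify their complexity.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# fully grown tree: no pruning
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# cost-complexity pruning: larger ccp_alpha prunes away more low-gain subtrees
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print(full.tree_.node_count, pruned.tree_.node_count)  # pruned tree is smaller
```

In practice you would pick `ccp_alpha` by cross-validation, e.g. over the candidate values returned by `cost_complexity_pruning_path`.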


Advantages of Decision Trees:

Easy to Understand: Even someone from a non-analytical background can easily understand the output of a decision tree. Reading and interpreting a decision tree does not require any statistical knowledge. Its graphical representation is extremely intuitive, and users can easily relate it to their hypotheses.

Useful in Data Exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features that have a better ability to predict the target variable. It can also be used in the data exploration stage: for example, when working on a problem with data available in many variables, a decision tree can help detect the most significant ones.

Less Data Cleaning Required: It doesn’t need as much data cleaning as some other modeling techniques. It is not influenced by outliers and missing values to a fair degree.

Data Type Is Not a Constraint: It can handle both numerical and categorical variables.

Non-Parametric Method: The decision tree is considered a non-parametric method, meaning decision trees make no assumptions about the distribution of the data or the structure of the classifier.


Disadvantages of Decision Trees:

Overfitting: Overfitting is one of the most common problems for decision tree models. It is addressed by setting constraints on model parameters and by pruning.

Not Fit for Continuous Variables: While working with continuous numerical variables, the decision tree loses information when it discretizes variables into categories.