Machine Learning: Decision Tree With Python.
Lets take above plot. We have two classes one calss of black circle and one of blue sqaure . Is it possile to draw a single line?. I think NO
You will need more than one line to divide the above data plot in two classes.Some thing like below image.
We need two lines for dividing data into two classes, one according to the threshold of "X" and one accoring to the thrershold of "Y".
So basicaaly we can say that Decision Tree Classifier, repetitively divides the working area(plot) into sub part by identifying lines. (repetitively because there may be two distant regions of same class divided by other as shown in image below).
Now question is when this diving process will stop?.
- 1.Either it has divided into classes that are pure (only containing members of single class ).
- 2.Some criteria of classifier attributes are met.
Impurity is "when we have a traces of one class division into other. This can arise due to following reason:
- 1.We run out of available features to divide the class upon.
- 2.We tolerate some percentage of impurity (we stop further division) for faster performance. (There is always trade off between accuracy and performance)..
Entropy is the measure of impurity or in other words it is degree of randomness of elements. Mathematically, it can be calculated with the help of probability of the items as:
It is negative summation of probability times the log of probability of item x.
For example, if we have items as number of dice face occurrence in a throw event as 1123, the entropy is p(1) = 0.5 p(2) = 0.25 p(3) = 0.25 entropy = - (0.5 log(0.5)) - (0.25 log(0.25)) -(0.25 * log(0.25) = 0.45
3. Information Gain
In complex problems of clasiffication or regression we always have more than two features. What feature should we select for division? Perhaps one that gives us less impurity.Suppose we divide the classes into multiple branches as follows, the information gain at any node is defined as:
Information Gain (n) = Entropy(x) — ([weighted average] * entropy(children for feature))
Little more explanation!
Suppose we have following class to work with intially
Suppose we divide them based on property: divisible by 2
Entropy at root level : 0.66
Entropy of left child : 0.45 , weighted value = (4/9) * 0.45 = 0.2
Entropy of right child: 0.29 , weighted value = (5/9) * 0.29 = 0.16
Information Gain = 0.66 - [0.2 + 0.16] = 0.3
Check what information gain we get if we take decision as prime number instead of divide by 2. Which one is better for this case?
Decision tree at every stage selects the one that gives best information gain. When information gain is 0 means the feature does not divide the working set at all.
Lets solve an example¶
Suppose we have a following data for playing a golf on various conditions.
Now if the weather condition is given as :
Outlook : Rainy, Temperature: Cool, Humidity: High, Windy: False
Should we play golf?
We have outcomes at beginning as NNYYYNYN (Y = Yes and N = No) taken in given order. Entropy at this root node is 0.3
Now try to divide on various predictors outlook, temperature, humidity and Windy.
Calculate the information gain in each case. Which one has highest information gain?
For example, if we divide based on Outlook, we have divisions as Rainy : NNN (entropy = 0) Sunny : YYN (entropy = 0.041) Overcast : YY (entropy = 0)
So information gain = 0.3 - [0 + (3/8)*0.041 + 0] = 0.28
Try out for other cases.
The information gain is max when divided based on Outlook.
Now the impurity for Rainy and Overcast is 0. We stop for them here.
Next we need to separate Sunny,
If we divide by Windy, we get max information gain. Sunny YYN Windy? Yes : N No : YY
So decision tree look like something as shown in image below.
No the prediction data is
Outlook : Rainy, Temperature: Cool, Humidity: High, Windy: False
Flowing down from tree according to result, we first check Rainy?
So answer is No, we don't play golf.
Lets Stop Theory
Dividing efficiently based on maximum information gain is key to decision tree classifier. However, in real world with millions of data dividing into pure class in practically not feasible (it may take longer training time) and so we stop at points in nodes of tree when fulfilled with certain parameters (for example impurity percentage). We shall see this in coding exercise.
Numpy arrays and pandas dataframes will help us in manipulating data. As discussed above, sklearn is a machine learning library. The cross_validation’s train_test_split() method will help us by splitting data into train & test set.
The tree module will be used to build a Decision Tree Classifier. Accutacy_score module will be used to calculate accuracy metrics from the predicted class variables.
For importing the data and manipulating it, we are going to use pandas dataframes. First of all, we need to download the dataset. All the data values are separated by commas.
After downloading the data file, we will use Pandas read_csv() method to import data into pandas dataframe. Since our data is separated by commas “,” and there is no header in our data, so we will put header parameter’s value “None” and sep parameter’s value as “,”.
We are saving our data into “balance_data” dataframe.
For checking the length & dimensions of our dataframe, we can use len() method & “.shape”.
Data slicing is a step to split data into train and test set. Training data set can be used specifically for our model building. Test dataset should not be mixed up while building model. Even during standardization, we should not standardize our test set.
The above snippet divides data into feature set & target set. The “X ” set consists of predictor variables. It consists of data from 2nd column to 5th column. The “Y” set consists of the outcome variable. It consists of data in the 1st column. We are using “.values” of numpy converting our dataframes into numpy arrays.
Let’s split our data into training and test set. We will use sklearn’s train_test_split() method.
The above snippet will split data into training and test set. X_train, y_train are training data & X_test, y_test belongs to the test dataset.
The parameter test_size is given value 0.3; it means test sets will be 30% of whole dataset & training dataset’s size will be 70% of the entire dataset. random_state variable is a pseudo-random number generator state used for random sampling. If you want to replicate our results, then use the same value of random_state
Decision Tree Training
Now we fit Decision tree algorithm on training data, predicting labels for validation dataset and printing the accuracy of the model using various parameters.
DecisionTreeClassifier(): This is the classifier function for DecisionTree. It is the main function for implementing the algorithms. Some important parameters are:
1.Criterion: It defines the function to measure the quality of a split. Sklearn supports “gini” criteria for Gini Index & “entropy” for Information Gain. By default, it takes “gini” value.
2.Splitter: It defines the strategy to choose the split at each node. Supports “best” value to choose the best split & “random” to choose the best random split. By default, it takes “best” value.
3.max_features: It defines the no. of features to consider when looking for the best split. We can input integer, float, string & None value. a.If an integer is inputted then it considers that value as max features at each split.
4.max_depth: The max_depth parameter denotes maximum depth of the tree. It can take any integer value or None. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. By default, it takes “None” value.
5.min_samples_split: This tells above the minimum no. of samples reqd. to split an internal node. If an integer value is taken then consider min_samples_split as the minimum no. If float, then it shows percentage. By default, it takes “2” value.
6.min_samples_leaf: The minimum number of samples required to be at a leaf node. If an integer value is taken then consider min_samples_leaf as the minimum no. If float, then it shows percentage. By default, it takes “1” value.
7.max_leaf_nodes: It defines the maximum number of possible leaf nodes. If None then it takes an unlimited number of leaf nodes. By default, it takes “None” value.
8.min_impurity_split: It defines the threshold for early stopping tree growth. A node will split if its impurity is above the threshold otherwise it is a leaf.
Let’s build classifiers using criterion as gini index & information gain. We need to fit our classifier using fit(). We will plot our decision tree classifier’s visualization too.
Now, we have modeled 2 classifiers. One classifier with gini index & another one with information gain as the criterion. We are ready to predict classes for our test set. We can use predict() method. Let’s try to predict target variable for test set’s 1st record.
In this article, we have learned how to model the decision tree algorithm in Python using the Python machine learning library scikit-learn. In the process, we learned how to split the data into train and test dataset. To model decision tree classifier we used the information gain, and gini index split criteria. In the end, we calucalte the accuracy of these two decision tree models.