Explained
Decision trees are a form of supervised learning that can be used for both classification and regression purposes. In my experience, they are typically utilized for classification purposes. The model takes in an instance and then goes down the tree, testing significant features against a determined conditional statement. Depending on the result, it will go down to the left or right child branch and onward after that. Typically the most significant features in the process will fall closer to the root of the tree.
Decision trees are becoming increasingly popular and can serve as a strong learning algorithm for any data scientist to have in their repertoire, especially when coupled with techniques like random forests, boosting, and bagging. Once again, use the video below for a more in-depth look into the underlying functionality of decision trees.
Now that you know a little more about decision trees and how they work, let’s go ahead and implement one in Python.
Getting Started
from sklearn import tree
df = pd.read_csv(‘iris_df.csv’)
df.columns = [‘X1’, ‘X2’, ‘X3’, ‘X4’, ‘Y’]
df.head()
Implementation
from sklearn.cross_validation import train_test_split
decision = tree.DecisionTreeClassifier(criterion=’gini’)
X = df.values[:, 0:4]
Y = df.values[:, 4]
trainX, testX, trainY, testY = train_test_split( X, Y, test_size = 0.3)
decision.fit(trainX, trainY)
print(‘Accuracy: \n’, decision.score(testX, testY))
Visualization
from sklearn.externals.six import StringIO
from IPython.display import Image
import pydotplus as pydot
dot_data = StringIO()
tree.export_graphviz(decision, out_file=dot_data)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())