Short-Answer Problems

These concepts can appear on the short-answer part of the tests. As part of this homework, answer the following questions; a few sentences that include the relevant definition are usually sufficient.

  1. What is meant by the term homogeneous group in the context of classification?
    A homogeneous group consists of samples that all, or mostly all, belong to the same class, such as all Male or all Female body types.

  2. What is one statistic for assessing homogeneity? What is its range and how is it interpreted?
    The Gini index is a primary statistic for evaluating the homogeneity of a classification. Its value ranges from 0 for a perfectly homogeneous (pure) group up to a maximum, 0.5 with two evenly mixed classes and approaching 1 as the number of classes grows, at which point the classification provides no benefit. The sketch below illustrates the computation.
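    A minimal sketch of the computation in Python, assuming the usual definition of the Gini index as one minus the sum of squared class proportions:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure (homogeneous) group,
    0.5 for an even two-class mix."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["M", "M", "M", "M"]))  # 0.0: perfectly homogeneous
print(gini(["M", "M", "F", "F"]))  # 0.5: evenly mixed, no benefit
```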

  3. What is the root node of a decision tree? What does it represent?
    The root node is the beginning node, before any classification takes place. Its membership tallies the number of samples in each group as the analysis begins, such as the number of Men and Women in the analysis.

  4. What is a leaf from a decision tree? What does it represent?
    A leaf is a terminal node at the bottom of the decision tree, a final classification from which no more splits are taken.
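    To see both terms at once, a small sketch in Python with scikit-learn on made-up shoe-size data; the first line of the printout shows the split at the root node, and the terminal lines are the leaves:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[7.0], [7.5], [8.0], [9.0], [10.0], [11.0]]  # shoe size
y = ["F", "F", "F", "M", "M", "M"]                # gender

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Shoe"]))  # root split, then leaves
```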

  5. Consider the following scatterplot of body measurements as related to forecasting Gender.
    To forecast Gender, would a decision tree algorithm choose the Waist or Shoe feature to make the first split? Why? About where would the split occur (the decision boundary)? Why?

    The algorithm would choose the Shoe size feature because a split at about 8 ¼ does a good, though not perfect, job of separating Males from Females. There is no split on the Waist feature that attains any decent accuracy of differentiation.
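    A sketch of that first split in Python with scikit-learn, using made-up Waist and Shoe values that roughly mimic the scatterplot; a one-split tree (a stump) chooses the feature and threshold itself:

```python
from sklearn.tree import DecisionTreeClassifier

# columns: Waist, Shoe -- the Waist values overlap across genders, Shoe separates
X = [[34, 7.0], [30, 7.5], [36, 8.0],
     [32, 8.5], [38, 9.5], [33, 10.5]]
y = ["F", "F", "F", "M", "M", "M"]

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(["Waist", "Shoe"][stump.tree_.feature[0]])  # Shoe
print(stump.tree_.threshold[0])                   # 8.25, the decision boundary
```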

  6. How is it that given a decision tree with enough levels of depth, the model can recover the correct class value (e.g., Gender) with perfect accuracy?
    Increasing the complexity of the model ensures a better fit to the training data. With a decision tree analysis, the analyst can add enough depth (splits) to the model to correctly classify every training sample. Of course, such classification will not generalize beyond the training data, so the model is useless to deploy.
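    A minimal sketch with random made-up data: even when the labels are unrelated to the features, an unrestricted tree classifies every training sample correctly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to the features

deep = DecisionTreeClassifier().fit(X, y)  # default: no depth limit
print(deep.score(X, y))                    # 1.0 on the training data
```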

  7. When conducting a machine learning analysis, how can the analyst detect overfitting?
    The fit indices will look great on the training data and much worse on the testing data, the indicator of real-world performance.
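    Continuing the noise example above, a sketch of the diagnostic: hold out a test set and compare the two fit indices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)
print(model.score(X_train, y_train))  # near 1.0 on the training data
print(model.score(X_test, y_test))    # near 0.5 (chance level): overfitting
```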

  8. What is the distinction between a model parameter and a hyper-parameter? Give an example of each.
    A model parameter is a characteristic of a specific model estimated from the training data, such as the slope coefficients of a regression model. A hyper-parameter is a characteristic of the model set by the analyst before estimation, one that can vary across candidate models, such as the depth of a decision tree.

  9. How is hyper-parameter tuning related to fishing? When is it OK to do so?
    Hyper-parameter tuning is searching for the best parameter setting, such as the number of features in a model, without any real theoretical reason for choosing the value. Instead, the analyst uses modern computing power to grind away at a large range of possibilities and chooses the best. This procedure is OK as long as the searching is done on training data and the testing on completely different data.
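    A minimal grid-search sketch in Python with scikit-learn; the data and the candidate depth values are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8]},  # candidate values, no theory required
    cv=3,
)
grid.fit(X_train, y_train)         # the search grinds away on training data only
print(grid.best_params_)
print(grid.score(X_test, y_test))  # honest check on completely different data
```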

  10. A machine learning analyst investigates fit for a decision tree model with 2, 3, and 4 features at depths of 2 and 6, using 3-fold cross-validation.

  1. How many distinct models are analyzed?
    Three feature settings and two depth settings lead to the analysis of 3 × 2 = 6 models.
  2. How many analyses are performed?
    The 3-fold cross-validation subjects each model to three analyses, so 6 × 3 = 18 analyses. Each analysis estimates the parameter values of a model.
  3. How many hyper-parameters are investigated?
    A hyper-parameter is a general characteristic of a model. This grid search investigated two hyper-parameters: tree depth and the number of features.
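    A small check of these counts, assuming the grid from the question (feature settings {2, 3, 4} crossed with depths {2, 6}):

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({"max_features": [2, 3, 4], "max_depth": [2, 6]})
print(len(grid))      # 6 distinct models
print(len(grid) * 3)  # 18 analyses under 3-fold cross-validation
# two hyper-parameters investigated: max_features and max_depth
```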
  11. What is the relation of a random forest estimator/model to a decision tree?
    A random forest is an aggregation of many different decision trees. The algorithm constructs a series of decision trees where each tree is based on a different random sample of (a) the data, with replacement, and (b) the available features, with the final model combining the different trees, by averaging or majority vote.
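    A minimal sketch in Python with scikit-learn; the parameters shown map onto the description above: many trees, bootstrap resampling of the data, random subsets of the features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees grown
    bootstrap=True,       # each tree fits a random resample, with replacement
    max_features="sqrt",  # each split considers a random subset of the features
    random_state=0,
).fit(X, y)
print(forest.score(X, y))  # the trees vote to form the final classification
```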

  12. What is local optimization (such as regarding the decision tree solution)? What is its primary disadvantage?
    Given the initial model configuration, such as the first split in a decision tree analysis, a different final model likely emerges than if the first split had been taken on a different variable. The problem is that each split that moves the tree down one more layer of depth is chosen locally (greedily), without any “long view” of where the process is going. So an optimal first split may have been only barely better than an alternative variable and split, yet that alternative might ultimately have led to a series of splits that achieved a more successful level of classification.

  13. When classifying customers to identify those most likely to churn (leave as a customer), which of the three classification metrics is the most useful: accuracy, recall, or precision? Why?
    Errors cost money. There are two kinds of errors in the classification problem: false positives and false negatives. We cannot make a final assessment as to the appropriate fit index until we know the cost of each of these two errors to the business.
    The analysis of customer churn is primarily concerned with not losing existing customers. As long as the resources dedicated to customers predicted as likely to leave are not excessive, it is better to tolerate false positives. That is, it is better to have some customers predicted to churn who do not than to miss those who do churn.
    As such, the most relevant fit index is likely sensitivity (recall), but we cannot say for sure until the costs of the errors are known. The sketch below computes all three metrics.
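    A sketch of the three metrics on made-up churn predictions (1 = churned):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # actual: three churners, five stayers
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # one missed churner, one false alarm

print(accuracy_score(y_true, y_pred))   # 0.75: overall hit rate
print(precision_score(y_true, y_pred))  # 0.67: penalized by the false alarm
print(recall_score(y_true, y_pred))     # 0.67: share of actual churners caught
```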

  14. What makes machine learning a 2nd-decade 21st-century technology, as opposed to, say, the 1990s?
    Computer power. Having massively more computer power allows more intensive analyses with algorithms that have existed for decades, such as applying hyper-parameter tuning to multiple regression. Further, new estimation algorithms have been developed, such as the random forest, that are only feasible with much more computer power. (Cheap laptops today offer more raw number-crunching power than the fastest supercomputers of just 15 years ago.)