Introduction

Know what data mining is and what its relation to AI is, etc.

Data mining is the process of extracting patterns and insights from a data set. The algorithms used in data mining are an essential part of AI.

Know how AI and machine learning differ from traditional computer science

Traditional computer science focuses on the program, which is often deterministic in nature: a program encodes a set of rules to apply to some input data. AI and machine learning differ in that the data, not the program, is the most important component. Errors are part of the landscape, and the results are generally not deterministic.

Know why generalization is important

Generalization is important because it is the ability to predict an outcome on unseen, real-world data. In data mining, models are built on training data and then used to predict on data they have not seen. Simply “seeing” all of the possible data is not practical, and generalization is how we deal with this combinatorial explosion.

Random variables, measures of central tendency, and distributions

Know what a r.v. is and how we measure central tendencies

Know what makes a central tendency robust

Study the distributions and know their impact when given a vector

A random variable is a variable whose value is determined by the outcome of some random phenomenon.

Distribution functions of interest:

Quick run down of central tendencies:

Now for the measures of dispersion:

Different distributions of interest:
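
The lists above are left as headings in these notes; as a small R sketch (my own example, not from the notes), vectors drawn from a few common distributions can be summarized with the usual central tendency and dispersion measures:

```r
# Draw sample vectors from a few common distributions
set.seed(42)
x_norm  <- rnorm(1000, mean = 5, sd = 2)        # normal
x_unif  <- runif(1000, min = 0, max = 10)       # uniform
x_binom <- rbinom(1000, size = 10, prob = 0.3)  # binomial

# Central tendencies (the median is more robust to outliers than the mean)
mean(x_norm)
median(x_norm)

# Measures of dispersion
var(x_norm)    # variance
sd(x_norm)     # standard deviation
IQR(x_norm)    # interquartile range
range(x_norm)  # minimum and maximum
```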

Understand multivariate central tendencies, including CDF, variance, etc.

A bivariate dataset can be envisioned geometrically as two column vectors. The total variance of the bivariate is the sum of the individual variances of the two columns. There are also some other summary measures for multivariate data:

Why have both covariance and correlation though?
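
Covariance is expressed in the (product of the) units of the two variables, so its magnitude changes whenever a variable is rescaled; correlation normalizes covariance to the range \([-1, 1]\) and is unit-free. A quick R sketch (hypothetical vectors) showing the difference:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

cov(x, y)        # depends on the units/scale of x and y
cor(x, y)        # always between -1 and 1

cov(x, y * 100)  # rescaling y changes the covariance...
cor(x, y * 100)  # ...but leaves the correlation unchanged
```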

Linear regression

Understand the theoretic underpinnings of regression, the geometric interpretation of regression and how to use the regression equation for prediction

Linear Regression: a statistical process for estimating the relationships among variables.

What is linear regression used for? Prediction and forecasting are among its main uses, as a fitted linear model can be used to predict the response for data that has not been seen yet.

The simple linear regression equation is given by:

\[ Y \approx \beta_{0} + \beta_{1} X \]

The betas are known as weights (or parameters/coefficients). A dataset is used to estimate these weights, such that \(Y\) can be predicted for values of \(X\). As these betas are estimates, error between the actual value (what the dataset says the value should be) and the fitted value (what the model says the value would be) is expected. This error is known as the residual, and is formally defined as: \(\varepsilon_{i} = y_{i} - \hat{y}_{i}\).

The goal intuitively would be to minimize these residuals. \(RSS\), or the residual sum of squares (it is what it sounds like), is the metric that will be optimized. The optimization problem in other words is to minimize the residual sum of squares (least squares estimation technique).
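
Written out, the quantity being minimized is:

\[ RSS = \sum_{i=1}^{n} \varepsilon_{i}^{2} = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} = \sum_{i=1}^{n} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1} x_{i})^{2} \]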

Those weights can be determined analytically via calculus. Once the weights are determined, the linear regression equation can be used to find fitted values, or the predictions.
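
As a concrete sketch in R (assuming a data frame `df` with a predictor column `x` and a response column `y`; these names are my own, not from the notes):

```r
# Fit a simple linear regression: y ~ beta0 + beta1 * x
fit <- lm(y ~ x, data = df)

coef(fit)       # estimated weights: beta0 (intercept) and beta1 (slope)
fitted(fit)     # fitted values y-hat on the training data
residuals(fit)  # residuals y - y-hat

# Predict fitted values for new, unseen values of x
predict(fit, newdata = data.frame(x = c(1.5, 2.0, 2.5)))
```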

Understand how to evaluate a regression model using concepts such as residual analysis, \(R^{2}\), etc.

Residual analysis is a tool used to see how well a model fits the data.

The following properties of the residuals indicate a good model:

  • They are centered around zero
  • They show no systematic pattern when plotted against the fitted values
  • Their spread is roughly constant across the range of fitted values
  • They are approximately normally distributed
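
A minimal residual-analysis sketch in R, continuing the hypothetical `fit` model from the sketch above:

```r
# Residuals vs. fitted values: look for a patternless, evenly spread cloud around 0
plot(fitted(fit), resid(fit)); abline(h = 0, lty = 2)

# Rough checks that the residuals are approximately normal
hist(resid(fit))
qqnorm(resid(fit)); qqline(resid(fit))
```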

Understand how we determine how good a regression model is using the four questions we studied for evaluating the fit of the regression model

  1. Is at least one of the predictors useful in predicting the response?
    • An F-statistic well above 1 (with a correspondingly small p-value) tells us that at least one of the predictors is useful
  2. Do all the predictors help explain the response, or is only a subset of predictors useful?
    • Low p-values (below the significance level, typically \(\alpha = 0.05\)) indicate that there is some relationship between that predictor and the response
    • The goal here is feature selection, which is a hard problem.
      • Forward selection: Start with no predictors, and add them one at a time
      • Backward selection: Start with all predictors, and remove them one at a time (e.g. the ones with the highest p-values)
  3. How well does the model fit the data?
    • RSE, \(R^{2}\), and RMSE are good statistics for determining whether the model is a good fit
      • \(RSE\) is the residual standard error, \(RMSE\) is the root mean square error. A low error value corresponds to a good model fit
      • \(R^{2}\) is the coefficient of determination. A high value (close to 1) is an indicator of a good model
  4. Given a set of predictor values (the \(X\)’s), what response value should we predict and how accurate is our prediction?
    • To understand the accuracy of our predictions, confidence intervals are important. A confidence interval is a range of values expected to contain the population parameter of interest. The significance level \(\alpha\) is the probability of rejecting the null hypothesis when it is actually true, and the confidence level is \(1 - \alpha\): if the experiment were repeated many times, this is the proportion of the resulting intervals that would contain the true parameter. A higher confidence level produces a wider interval, since the interval must cover the true value with higher probability (see the R sketch after this list).
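
A short R sketch of the quantities used in these four questions (again using the hypothetical `fit` and `df` from above):

```r
summary(fit)  # F-statistic, per-coefficient p-values, RSE, and R^2

new_x <- data.frame(x = 2.0)
# Confidence interval: uncertainty about the average response at x = 2.0
predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
# Prediction interval: uncertainty about an individual response (wider)
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
```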

Components of learning:

Understand the theory behind how we induce learning (unknown target function, hypothesis set, learning algorithms, etc.)

The principal idea: a learning algorithm takes in training examples and induces, from the hypothesis set, an approximate function that is approximately equal to the unknown target function.
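
In symbols (standard notation, not taken verbatim from the notes): there is an unknown target function \(f: \mathcal{X} \rightarrow \mathcal{Y}\), a training set \(\{(x_{1}, y_{1}), \ldots, (x_{N}, y_{N})\}\), and a hypothesis set \(\mathcal{H}\); the learning algorithm selects a hypothesis \(g \in \mathcal{H}\) such that \(g \approx f\).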

Know the difference between supervised, unsupervised, and reinforcement learning

Understand what is the curse of dimensionality and be able to explain it

The curse of dimensionality refers to data having too many features. Why is this a problem? As the number of dimensions grows, the feature space grows exponentially and the data become increasingly sparse, so distances between points become less meaningful and far more data is needed to generalize.
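
A quick way to see this “distance concentration” effect in R (an illustrative sketch, not from the notes):

```r
set.seed(42)
concentration <- function(d, n = 500) {
  X <- matrix(runif(n * d), nrow = n)    # n random points in the d-dimensional unit cube
  dists <- dist(X)                       # all pairwise Euclidean distances
  (max(dists) - min(dists)) / min(dists) # relative spread of the distances
}

concentration(2)     # large: near and far points are easy to tell apart
concentration(1000)  # much smaller: all points look roughly equally far apart
```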

Understand basic data types used in R and concepts such as binarization and discretization

R data types:

Data transformations:
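
The transformation list is only a heading in these notes; as a small R sketch of the two transformations named above (the vector `age` and the cut points are my own example):

```r
age <- c(12, 25, 37, 48, 61, 70)

# Binarization: map a variable to 0/1 based on a condition
is_adult <- as.integer(age >= 18)

# Discretization: map a continuous variable into categorical bins
age_group <- cut(age,
                 breaks = c(0, 18, 40, 65, Inf),
                 labels = c("minor", "young adult", "middle-aged", "senior"))

is_adult   # 0 1 1 1 1 1
age_group  # minor, young adult, young adult, middle-aged, middle-aged, senior
```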

Decision trees

Understand how decision trees are induced, including how concepts such as “pure nodes”, information gain, entropy, and the Gini index are used to create a decision tree.

Decision trees are classifiers in which a prediction is made by following a path of decisions down a tree. Each decision partitions the data on one of its features. The question is: how do we choose the best partitioning scheme for the data?
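
The standard answer (summarizing the usual definitions; the notes list only the names) is to pick the split that most reduces node impurity. For a node whose records have class proportions \(p_{1}, \ldots, p_{k}\):

\[ \text{Entropy} = - \sum_{i=1}^{k} p_{i} \log_{2} p_{i} \qquad \text{Gini} = 1 - \sum_{i=1}^{k} p_{i}^{2} \]

A node is “pure” when all of its records belong to one class (entropy and Gini are both 0). Information gain is the impurity of the parent node minus the weighted average impurity of the child nodes produced by a split; the split with the highest gain is chosen.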

Know how to evaluate a decision tree through confusion matrices and ROC curves

The confusion matrix is a widely used measure to evaluate the performance of a model.

Confusion Matrix: a table of predicted versus actual classes whose cells count the true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).

From it we can evaluate statistics such as the true positive rate (sensitivity) and the true negative rate (specificity).
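
A minimal sketch in R (the vectors `predicted` and `actual` are hypothetical class labels):

```r
actual    <- factor(c(1, 1, 0, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0), levels = c(0, 1))

cm <- table(Predicted = predicted, Actual = actual)
cm

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

TP / (TP + FN)  # true positive rate (sensitivity)
TN / (TN + FP)  # true negative rate (specificity)
```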

The ROC (Receiver Operating Characteristic) curve is also a good option. The ROC curve plots the TPR against the FPR at different classification thresholds. A higher area under the curve (AUC) is generally an indicator of good model performance.

The focus is on the predictive capability of the model, not other factors like training time.

Understand how model selection and model checking work through cross validation, etc.

Model Checking – If we divide a dataset into a single training set and test set, the test points might not be representative of the population. How do we combat this? Cross-validation: split the data into \(k\) folds, train on \(k - 1\) folds and test on the held-out fold, rotate through all \(k\) folds, and average the error.
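
Cross-validation can be sketched by hand in R (assuming, as before, a hypothetical data frame `df` with a response column `y`; the fold count is arbitrary):

```r
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))  # randomly assign each row to a fold

cv_errors <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  model <- lm(y ~ ., data = train)          # fit on the other k-1 folds
  preds <- predict(model, newdata = test)   # predict on the held-out fold
  mean((test$y - preds)^2)                  # mean squared error on that fold
})

mean(cv_errors)  # cross-validated estimate of the test error
```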

Model Selection – Trade-off between model complexity and how well the model fits (and generalizes beyond) the training data

Understand concepts such as overfitting, underfitting, bias variance decomposition, class imbalance and how to potentially deal with it, how to prune trees, why we need to prune trees, etc.

Overfitting is when the model is fit too closely to the training data, such that it does not predict well on real-world data. Underfitting is the opposite: the model is too simple (or not trained enough) to capture the structure of the data, so it predicts poorly even on the training data.

Pruning is used to mitigate overfitting; nodes can be trimmed in a bottom-up fashion. There are two approaches: commonly these are pre-pruning (stop growing the tree early) and post-pruning (grow the full tree, then trim it back).
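
A concrete post-pruning sketch in R with the `rpart` package (using R's built-in `iris` data; the complexity-parameter values are illustrative only):

```r
library(rpart)

# Grow a deliberately deep (possibly overfit) classification tree
tree <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(cp = 0.001, minsplit = 2))

printcp(tree)                     # cross-validated error for each complexity parameter
pruned <- prune(tree, cp = 0.01)  # trim back to a simpler subtree
```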

Class imbalance is the norm, not the exception. Correctly classifying the minority class often has greater value than correctly classifying the majority class. Imbalanced classes, though, present a number of problems for classification algorithms:

Understand what are ensemble methods and how Random Forest works.

The idea behind ensemble methods is to create multiple classifiers and combine them.

When are these ensemble methods better than their individual counterparts? The individual classifiers in the ensemble should demonstrate some instability, and their errors should be (largely) independent of one another.
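
Random Forest is the standard example: it trains many trees on bootstrap samples of the data (bagging) and, at each split, considers only a random subset of the features, which decorrelates the trees; the final prediction is a majority vote (or an average for regression). A sketch with the `randomForest` package and R's built-in `iris` data:

```r
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,  # number of trees in the ensemble
                   mtry = 2)     # features considered at each split

rf                               # out-of-bag error estimate and confusion matrix
predict(rf, newdata = iris[1:5, ])
```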

Perceptrons and Neural Networks

Know the strengths and limitations of a perceptron (and neural networks)

Perceptrons are linear classifiers:

Know how a perceptron (and neural networks) performs computations using the weights and inputs

Perceptron has: a set of inputs \(x\), a weight vector \(w\) (often including a bias term), and a threshold (sign) activation applied to the weighted sum \(w^{T} x\).

Know how a perceptron makes a prediction

\(\text{sign}(w^{T} x)\)
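
In R (a minimal sketch with made-up weights; the bias is folded in as a constant input of 1):

```r
# Perceptron prediction: the sign of the weighted sum of the inputs
perceptron_predict <- function(w, x) sign(sum(w * x))

w <- c(-0.5, 1.0, 2.0)    # bias weight followed by two input weights
x <- c(1, 0.3, -0.1)      # constant 1 for the bias, then the two inputs
perceptron_predict(w, x)  # returns +1 or -1
```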

Understand the concept of an activation function in neural networks

Activation Functions
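
Common choices (standard definitions; the notes leave this as a heading):

\[ \text{sigmoid: } \sigma(z) = \frac{1}{1 + e^{-z}} \qquad \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \qquad \text{ReLU}(z) = \max(0, z) \]

The activation introduces the non-linearity that lets stacked layers represent non-linear decision boundaries.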

Understand the concept of a “deep neural network” (layers)

Hidden layers are needed if the data must be separated using a non-linear boundary; stacking layers with non-linear activations allows the network to learn such boundaries.

Understand how a neural network makes a prediction

Understand backpropagation and the forward pass, including how to compute the output of a neural network

For a neuron \(h\), the network input to the neuron is:

\[ z_{h} = w^{T} x = \sum_{i=1}^{n} w_{i} x_{i} = w_{1} x_{1} + \ldots + w_{n} x_{n} \]

The output is simply

\[ \sigma (z_{h}) \]

Where \(\sigma\) is the activation function (e.g. the sigmoid)

The first layer’s outputs are just the predictor inputs.

Calculating average error:

\[ \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_{i} - y_{i})^{2} \]

How to reduce error? Adjust the weights in the direction that decreases the error (gradient descent); backpropagation computes the required gradients layer by layer using the chain rule.
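
Putting the forward pass and the error calculation together in R (a minimal sketch with one hidden layer, made-up weights, and the sigmoid activation; none of the numbers come from the notes):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

x  <- c(0.5, -1.2)                       # predictor inputs (the "first layer" outputs)
W1 <- matrix(c(0.1, -0.3,
               0.8,  0.2),
             nrow = 2, byrow = TRUE)     # hidden-layer weights (2 neurons)
w2 <- c(0.7, -0.4)                       # output-layer weights

z_hidden <- W1 %*% x                     # net inputs z_h = w^T x for each hidden neuron
a_hidden <- sigmoid(z_hidden)            # hidden-layer outputs sigma(z_h)
y_hat    <- sum(w2 * a_hidden)           # network output (linear output neuron here)

y <- 0.3                                 # the actual value from the training data
mean((y_hat - y)^2)                      # average squared error (a single example here)
```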