Neural Networks (Gradient Descent and Backpropagation)
See midterm guide for notes on perceptrons and feed-forward nets
Gradient Descent
The gradient vector in gradient descent describes the weight deltas needed to be applied to the weights. To be precise, the gradient vector is a column vector composed of values which describe the partial derivative of the cost function with respect to each of the weights. It is accompanied with a learning rate (\(\eta\)) typically (to scale up/down the gradient vector):
Find new weights by subtracting the scaled gradient vector:
\[
w \rightarrow w' = w -\eta \nabla C.
\]
Backpropagation
See HW7 Part1.1, pretty decent review. Crude review:
Goal: Find gradient vector
To find weights that are immediately preceding the outputs (so not a hidden layer), just use chain rule. For example, to calculate \(w_{4}\) in the homework: \[
\frac{\partial E}{\partial w_{4}} = \frac{\partial E}{\partial \sigma (z_{o})} * \frac{\partial \sigma (z_{o})}{\partial z_{o}} * \frac{\partial z_{o}}{\partial w_{4}}
\]
\[
\frac{\partial E}{\partial w_{4}} = (2 * (y - \hat{y}) * -1) * (\sigma (z_{o}) (1 - \sigma (z_{o}) )) * z_{o}
\]
\[
\frac{\partial E}{\partial w_{4}} = (2 * (0 - 0.998) * -1) * (0.998) (1 - 0.998) )) * 4
\]
\[
\frac{\partial E}{\partial w_{4}} = 0.020
\]
To find weights in hidden layers, i.e. \(w_{1}\) in the homework:
\[
\frac{\partial E}{\partial w_{1}} = \frac{\partial E}{\partial \sigma (z_{o})} * \frac{\partial \sigma (z_{o})}{\partial z_{o}} * \frac{\partial z_{o}}{\partial \text{ReLU} ( z_{1})} * \frac{\partial \text{ReLU} (z_{1})}{\partial z_{1}} * \frac{\partial z_{1}}{\partial w_{1}}
\]
\[
\frac{\partial E}{\partial w_{1}} = (2 * (y - \hat{y}) * -1) * (\sigma (z_{o}) (1 - \sigma (z_{o}) )) * w_{4} * \text{ReLU}'(z_{1}) * x
\]
\[
\frac{\partial E}{\partial w_{1}} = (2 * (0 - 0.998) * -1) * (0.998) (1 - 0.998) )) * 0.5 * 1 * 4
\]
\[
\frac{\partial E}{\partial w_{1}} = 0.010
\]
Clustering
The big picture: Dividing data into meaningful groups
K-means Clustering
Algorithm (parameters: \(K\)):
- Select \(K\) centroids
Loop:
- Assign each point to its closest centroid
- Recompute centroids
- Break Loop if centroids did not change
Problem: Choosing any initial centroids can lead to bad results, choosing points in a natural cluster can lead to a good solution
Bisecting K-means Clustering
Start with cluster containing all points, then split into two clusters. Select one of the clusters (the one with higher SSE) to split until you have \(K\) clusters.
Cluster analysis
- Want to maximimize distance between clusters, but minimize distance within clusters.
- Within Cluster SS is the sum of squares of each point of a cluster to its centroid
- Total SS is the sum of squares of each point in the data set to the global cluster mean
- Between SS is (Total SS - the total within cluster SS)
- Total within cluster SS is just the sum of all Within Cluster SS
- Variance is the Between SS / Total SS
- Higher the ratio, the more variance is explained by the clusters.
Agglomerative Hierarchical Clustering
Algorithm:
- Compute proximity matrix if necessary
Loop:
- Merge the closest two clusters
- Update proximity matrix to reflect the proximity between the new cluster and the others
- Break if one cluster remains
How to define distance to find closest?
- Single link “min”, minimum distance between clusters
- Complete link “max”, maximum distance between clusters
Density-based clustering
It is what it sounds like, clustering based on data point density. Parameters:
- A radius, \(\varepsilon\) (“eps”) or a neighborhood
- Number of neighbouring points (MinPts, defined by user)
A \(\varepsilon\)-neighborhood is a cluster with points all within \(\varepsilon\) from the source. If the neighborhood contains at least MinPts other points, the source point is considered a high density area.
- Core point: synonymous to point that is high density
- Border point: a point that’s not a core point, but in a core point’s neighborhood
- Noise point: Neither border nor core
DBSCAN Algorithm:
- Lable all points as Core, Border, or Noise
- Eliminate Noise
- Put edge between all core points that are within Eps of each other
- Make each group of connected core points into its own cluster
- Assign each border point to one of the clusters of its associated core points
Cluster Validation
- Internal: Used to measure the goodness of a clustering structure without respect to external information.
- SSE for example
- Cohesion: how close are objects in the same cluster, measured by within cluster SSE
- Separation: how well separated are the clusters
- External: Match cluster with external results (class labels)
- Relative: Used to compare two different clustering algorithms or clusters
Silhouette value: measures the degree of confidence in the clustering assignment of a particular observation, with well-clustered observations having values near 1 and poorly clustered observations having values near -1. See slides for actual formula.
Dunn index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. This should be maximized.
Association Rule Mining
The big picture: observing which items co-occur across transactions.
Example of an association rule:
{Beer, Bread} -> {Milk}
That is to say, beer and bread co-occur with milk.
Question: What our are requirements to form strong association rules? What are the quantitative thresholds to determine which rules should present, and which should not?
Transaction Databases
Transactions can be represented in various database formats:
- Binary database: Transactions each contain an itemset, where any the presence of an item of the itemset is denoted as 1 or 0 (1 being present, 0 being not present)
- Transaction database: Transactions each contain an itemset, where present items are listed (any items not present are omitted).
- Vertical database: Items are listed as individual columns, with each column containing a set of transactions the item is present in.
Association Rule Mining
Itemset measures and preliminaries:
- Support of an itemset is defined as the number of transactions which contain the itemset (as an (improper) subset)
- Relative support is the fraction of transactions that contain the itemset (containing over total)
- Thus, this is a ratio/percentage
- \(F^{(k)}\) denotes the set of \(k\)-itemsets, and \(F\) denotes the set of all itemsets
The problem:
Given a minimum support threshold (minsup) find all itemsets \(X\), s.t. \(sup(X) >= minsup\).
In other words:
- Find the number of transactions (\(n\)) in which an itemset is present
- Accept itemset if \(n >= minsup\)
Naive algorithm to mine all frequent itemsets incurs a runtime complexity of \(O(2^{|I|} * |D| * |I|)\), where the exponential term is the number of candidate itemset subsets, and the remaining terms represent the support computation (must go through each itemset for each transaction).
See HW9 (Chapter 5 Exercise 9a) for an example of a itemset lattice, a hierarchical-like diagram which begins with the empty set \(k = 0\), and branches to \(k + 1\) if is considered a frequent itemset. Children which are omitted are considered “pruned”.
The next step is to find strong association rules. Steps:
- Find frequent itemsets – this is what was shown above, use support threshold to filter out candidates. This is computationally expensive if you consider the power set of \(I\), where \(I\) represents all items.
- Extract high-confidence rules from the frequent itemsets – Similar process, but using a confidence value. This measure will be explained later.
Apriori Approach
The Apriori Approach aims to optimize the frequent itemset generation step. As explained earlier, this step is (naively) computationally expensive. This is because we consider the power set. The optimization Apriori employs is based on whether subsets be prematurely filtered out. This property is demonstrated best in the itemset lattices. The Apriori principle says:
- If an itemset is frequent, its subsets (parents in the itemset lattice) must all be frequent.
- If an itemset is infrequent, its supersets (children) are infrequent (the optimization step, the pruning in the itemset lattice).
This reduces our search space, and reduces the runtime complexity. The theoretical complexity remains the same as bruteforce (as all itemsets may be frequent), but in practice the runtime is lower due to the pruning that occurs.
Strong Association Rules
A rule is considered strong if its confidence (\(conf\)) is greater than or equal to the minimum confidence (\(minconf\)), where \(minconf\) is a user-specified threshold.
What is confidence? Confidence states something about an association rule \(X \rightarrow Y\). Namely, it tells the ratio of the number of transactions (support) of the union of \(X\) and \(Y\) to the support of just \(X\):
\[
conf(X \rightarrow Y) = \frac{sup(X \cup Y)}{sup(X)}
\]
What does this value practically tell us? It tells us how often \(X\) and \(Y\) coexist in transactions when compared to transactions where only \(X\) is present. In other words, a high confidence means that the number of transactions where \(X\) and \(Y\) appear is close to the number of transactions where only \(X\) appears.
Confidence is neat, so how do we generate association rules using this measure? First, consider the frequent itemsets that were generated. Choose an arbitrary itemset as the antecedent (the \(X\) in the association rule), and choose \(Y\) as the missing elements (only choose elements which appear in the frequent itemsets for \(Y\)).
From there, calculate the rule’s confidence, and if it’s less than the minconf, reject the rule (it is not strong). Thus, if the rule is considered as weak, any subsets of the antecedent can also be removed from the frequent itemsets search space (the support of subsets will only be greater, which means a lower confidence).
Implications of Thresholds
- If \(minsup\) too high, interesting low-support items may be missed.
- If \(minsup\) too low, too much information..
- High-confidence can be misleading, the consequent’s support might be ignored
Lift
Lift is a metric which accounts for the consequent, it is defined as:
\[
lift(X \rightarrow Y) = \frac{conf(X \rightarrow Y)}{rsup(Y)}
\]
A lift close to 1 means the support of the rule is expected. We seek for values 1 (above expectation) or much lower than 1 (below expectation).
Recommender Systems
The goal behind recommender systems
- Consider the set of customers \(X\)
- Consider the set of items \(S\)
- Can we assign ratings for each (\(X\), \(S\)) pair (Cartesian product between customers and items)? In other words, can we assign a rating for each movie/item for each customer?
- As a rating \(R\) can be assigned for each customer-item pair, a utility matrix can be formed, where the columns represent items, and the rows represent customers. The values are the ratings.
Question: How to populate the utility matrix?
- Ask people to rate themselves? (Turns out this doesn’t work well)
- Learn from user actions
Goal: Can we recommend something you haven’t watched that you will probably like?
Problems and Approaches
Problems:
- The matrix is sparse
- People don’t watch everything
- Even when they do decide to watch something, they often won’t rate it themselves
- New customers = blank slate
- A new customer in the utility matrix is an empty row
Approaches:
- Content-based (the focus of the final)
- Recommend movies similar to those the customer watched
- Collaborative
- Person A has similar taste to Person B, recommend a movie Person B watched to Person A?
- Hybrid
Content-based
Need to build two profiles:
- User profile
- Item profile
Item Profiles
First, start with the item profile. What defines a movie? How can we say two movies are similar? A profile defines a set of features which we can use to do this comparison.
- For example, in movies, these features can be considered: director, actors, length, genre
How can we determine if a movie exhibits a feature? A text mining heuristic can be considered, such as TF-IDF (Term Frequency * Inverse Document Frequency).
Term frequency (simply the ratio of occurrences of the term over the total terms in the document):
\[
TF_{ij} = \frac{f_{ij}}{max_{k} f_{kj}}
\]
Where:
- \(max_{k} f_{kj}\) represents the number of terms in the document \(j\)
- \(f_{ij}\) is the frequency of a term (crudely, a feature) in the document (text container) \(j\)
- \(n_{i}\) is the number of documents which mention \(i\)
- \(N\) is total number of documents
- Inverse Document Frequency: \(IDF_{i} = \log \frac{N}{n_{i}}\)
- Normalizes common terms like articles, pronouns, etc.
The TF-IDF score is the product of the term frequency and the inverse document frequency:
\[
w_{ij} = TF_{ij} \times IDF_{i}
\]
A document’s profile is then a set of words with the highest TF-IDF scores, with the scores listed in the profile.
User Profiles
Using item profiles, user profiles can be built. Consider a user has rated items which have profiles \(i_{1}...i_{n}\). A simple (weighted) average of the set of item profiles can be taken to construct the user profile.
- Normalize user ratings by subtracting rating by user’s mean of a particular feature
- Take average of normalized ratings
Then, the resulting user profile vector can be compared to item profiles. Similar profiles are good. How to check for similarity though?
- Cosine similarity: Where \(x\) is the user profile and \(i\) is the item profile: \[
U(x, i) = cos(\theta) = \frac{(x \cdot i)}{(|x| |i|)}
\]
- 1 mean similar, 0 or -1 means not similar (in the unit circle, cosine is 1 when angle is small).
Review: Content-Based Recommendations
Pros:
- Don’t need data from other users
- Can recommend items that are unpopular/new
- Can explain why certain items were recommended
Cons:
- Feature set can be hard to determine
- Bubble effect
- Might just recommend stuff that is within the user’s bubble, doesn’t allow for diverse content recommendations
- New customer = blank slate
- Users have to start off with an average profile
Principle Component Analysis
The big picture: Reduce dimensions (decomplexify feature space) without loss of information
- PCA is a dimensionality reduction technique
- It is a transformation which results in a smaller number of components that maximizes the variation of the original data set, but in low dimension space
The idea is to choose dimensions which maximize variance. i.e. the example in the slideshow demonstrates a reduction from 3D to 2D, because the z dimension does not show any variance.
An example:
- Dataset of multiple physical traits of people
- Height, weight, arm length, leg length, hair color, waist circumference, hip circumference, chest circumference
The principal components could conceivably be:
Question: Which components are contributing the most variance?
See propotion of variance contributed by each principal component:
\[
\frac{PC_{i}}{\text{total variation}}
\]
Where the total variation is the sum of the eigenvalues. We can find the variance explained by a principal component by finding its eigenvalue.
Summary
- PCA can be used as a dimensionality reduction technique
- A subset of the principal components may explain a high variance, which means the other attributes can be filtered out.
- Observations much be transformed from natural represntation to PC scores before attempting to use in a model
Social Network Analysis
Characteristics of large graphs, centrality measures and their interpretation.
Power Law in Networks
Scale-free networks are networks whose degree distribution follows the power law
In contrast to lattice networks who have very similar number of degrees across nodes, scale-free networks have varying degrees (look up the distribution of a scale-free network’s degrees: very many nodes with only a few links, and a few hubs with large number of links).
Graph-theoretic Measures
- Eccentricity is the maximum distance between two arbitrary nodes
- Diameter is the maximum eccentricity of the entire graph
- Radius is the minimum eccentricity of the entire graph
- Consider \(v\) a central point if its eccentricity is the radius
These measures are compute-intensive
Centrality Measures
Normalized degree centrality:
\[
C^{*}_{D}(v) = \frac{1}{n - 1} * C_{D}(v)
\]
Where:
- \(C_{D}(v) = \text{deg}(v)\), the degree centrality
- Where deg is the degrees of the vertex
- \(n\) is the number of nodes
Pros: Easy to compute and interpret
Cons: Just counts nearest neighbors, not sophisticated
- Disregards importance of connections, the quality of connections is not considered
Closeness centrality:
\[
C_{C}(v) = \frac{1}{\sum_{j} d(v, j)}
\]
Where:
- \(d(v, j)\) is the shortest path from \(v\) to \(j\)
Normalized version: \(C^{*}_{c}(v) = (n - 1) * C_{C}(v)\)
Where
- \(n\) is the number of nodes
This tells us that a node with high closeness can quickly interact with others (does not have to hop much).
Betweenness centrality:
Finds nodes which act as bridges/connect social circles.
\[
C_{B}(v) = \sum_{s \neq t \neq v} \frac{\sigma_{st}(v)}{\sigma_{st}}
\]
Where:
- \(\sigma_{st}(v)\) repesent the number of shortest paths from \(s\) to \(t\) that must go through \(v\)
- \(\sigma_{st}\) repesent the number of shortest paths from \(s\) to \(t\)
To normalize, divide by the maximum possible relevant paths, aka \(2 * ((n - 1) \text{choose} 2)\) (the factor of two only applies for undirected graphs, since you can consider point S to point T and point T to point S in an undirected graph).
Eigenvector centrality considers the importance/quality of neighbors. It has a neat recursive definition, and a parameter \(\lambda\) which is a fixed constant.
Katz centrality penalizes based on higher distances. Namely, it has an attenuation parameter \(\beta\), which penalizes nodes which are further.
Social Network Analysis
Power Law in Networks
Scale-free networks are networks whose degree distribution follows the power law
In contrast to lattice networks who have very similar number of degrees across nodes, scale-free networks have varying degrees (look up the distribution of a scale-free network’s degrees: very many nodes with only a few links, and a few hubs with large number of links).
Graph-theoretic Measures
These measures are compute-intensive
Centrality Measures
Normalized degree centrality:
\[ C^{*}_{D}(v) = \frac{1}{n - 1} * C_{D}(v) \]
Where:
Pros: Easy to compute and interpret
Cons: Just counts nearest neighbors, not sophisticated
Closeness centrality:
\[ C_{C}(v) = \frac{1}{\sum_{j} d(v, j)} \]
Where:
Normalized version: \(C^{*}_{c}(v) = (n - 1) * C_{C}(v)\)
Where
This tells us that a node with high closeness can quickly interact with others (does not have to hop much).
Betweenness centrality:
Finds nodes which act as bridges/connect social circles.
\[ C_{B}(v) = \sum_{s \neq t \neq v} \frac{\sigma_{st}(v)}{\sigma_{st}} \]
Where:
To normalize, divide by the maximum possible relevant paths, aka \(2 * ((n - 1) \text{choose} 2)\) (the factor of two only applies for undirected graphs, since you can consider point S to point T and point T to point S in an undirected graph).
Eigenvector centrality considers the importance/quality of neighbors. It has a neat recursive definition, and a parameter \(\lambda\) which is a fixed constant.
Katz centrality penalizes based on higher distances. Namely, it has an attenuation parameter \(\beta\), which penalizes nodes which are further.
Centrality Measures