Data Mining Tutorial 1

1. Linear Algebra

Data is often represented as a table where the rows are the individual data elements and the columns are indicators or features which characterize the data. For example each student in a class is represented as a row of a table and the columns capture information like student id, age, gender, address, previous background etc. Mathematically a table of numbers is represented as a matrix. Google also uses the matrix, albeit a very large one, to represent the WWW. In fact all graphs can be represented as matrices. Mathematical techniques to manipulate matrices fall under the area known as Linear Algebra. The area of Linear Algebra plays an important role in data mining, statistics and machine learning.
1. What is the result of the following matrix vector multiplication?
\begin{equation} \left[ \begin{matrix} 1 & 2 & 3 & 4 \\ 2 & 3 & 4 & 1 \\ 3 & 4 & 1 & 2 \\ 4 & 1 & 2 & 3 \end{matrix} \right] \left[ \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix} \right] = \left[ \begin{matrix} 30 \\ 24 \\ 22 \\ 24 \end{matrix} \right] \end{equation}
A = [1 2 3 4; 2 3 4 1; 3 4 1 2; 4 1 2 3]
x = [1 2 3 4]'
y = A * x
2. Eigenvalues are an important property of matrices. Later in the unit we will see how eigenvalues can be used to extract some important properties of data. Consider the following matrix
\begin{equation} \left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right] \end{equation}
A = [1 2; 3 4]
D = eig(A)
D = -0.3723; 5.3723
3. Suppose a matrix A has size mxn. Suppose b is a column vector of length n. Use the O-notation to represent the space required to store the matrix A Use the O-notation to represent the number of operations required to carry out the multiplication of A with b
Space: $O(m \times n)$
Operations: $O(m \times n)$

2. Statistics

Statistics is the original language of data and information. Throughout the unit we will be using statistics to build models of data.
1. Consider the set of numbers $\{-3, -2, -1, 0, 1, 2, 3\}$. What is mean and standard deviation of the set.
S = [-3 -2 -1 0 1 2 3]
m = mean(S)
s = std(S)

m = 0, s = 2.1602
2. Assume the above set of numbers were generated from a Normal distribution $N(\mu, \sigma)$. What is the maximum likelihood estimators of $\mu$ and $\sigma$.
p = mle(S)
p = 0; 2
3. A coin is tossed 20 times resulting in 14 heads. What is the systematic, statistical and frequentist way of making the claim "this is a fair coin."
\begin{equation} f(k) = \left( \begin{matrix} 20 \\ k \end{matrix} \right) 0.5^{k}(1-0.5)^{20-k} \end{equation} \begin{equation} p = \sum_{k=14}^{20} f(k) = 0.0577 \ge 0.05 \end{equation} \begin{equation} p = \sum_{k=18}^{20} f(k) = 0.0002 \le 0.05 \end{equation}

3. Database Management

For large data we cannot ignore external memory ("hard drive") while designing data mining algorithms.
1. The hard drive is a mechanical device. The cost to access data from disk can be broken up into three components: seek time (s), rotational latency(r) and transfer time (t). What is the ordering between s, r and t.
s > r > t
2. The cost of sorting N items in main memory is typically $O(N \log_2 N)$. Assume each block of the disk can hold B items. What is the cost of external memory sorting?
$O(\frac{N}{B} \log_{\frac{N}{B}}N)$
3. What does the "B" mean in B-Trees. Why is the "B" important.
B stands for balanced, i.e. paths from root to leaf have the same length. So the time to access each leaf is the same.

4. Calculus

Calculus, as expected, is one of the main mathematical tools in Data Mining.
1. Why is the gradient of a function the direction of maximum ascent or descent?
2. Find the minimum of $f(x) = x_1^2 + x_2^2$ subject to the inequality constraint $h_1(x) = -2x_1 -x_2 + 10 \leq 0$ and $h_2(x) = -x_1 \leq 0$.
input = -10: 0.5: 10;
[X, Y] = meshgrid(input, input);
Z = X.^2 + Y.^2;
mesh(X,Y,Z);
surf(X,Y,Z);

% f(x) = x1^2 + x2^2
H = [2 0; 0 2];
f = [0; 0];
A = [-2 -1; -1 0];
b = [-10; 0];
[x, fval] = quadprog(H, f, A, b)

5. Algorithms

Algorithms are the bread and butter of all computing. While there are all sorts of algorithms, computer scientists have categorized algorithms into some common "Patterns" including Greedy Approach, Divide and Conquer, Dynamic Programming, Graph Flows. All these patterns appear frequently in Data Mining.
1. Consider a directed graph with five node $v_0, v_1, \dots, v_4$. The edge set is given by ${(v_0, v_1, 2), (v_0, v_4, 4), (v_1, v_2, 3), (v_2, v_4, 1), (v_2, v_3, 5), (v_4, v_3, 7), (v_3, v_0, 8)}$. Design a greedy algorithm to find the shortest path from $v_0$ to all other nodes.
greedy.m

function [dists, predecessor] = greedy(G)

count = size(G, 1);
dists = -1 * ones(count, 1);
predecessor = -1 * ones(count, 1);
done = -1 * ones(count, 1);

next = 1;
dists(1) = 0;
for i = 1:count
% Update distances
for j = 1:count
if j ~= i && G(next, j) ~= -1
if dists(j) == -1
dists(j) = G(next, j) + dists(next);
predecessor(j) = next;
elseif dists(j) > (dists(next) + G(next, j))
dists(j) = (dists(next) + G(next, j));
predecessor(j) = next;
end
end
end
done(next) = 1;

% Find index of next element to consider
for j = 1:count
if done(j) == -1
next = j;
break;
end
end
end

end


G = [0 2 -1 -1 4; -1 0 3 -1 -1; -1 -1 0 5 1; 8 -1 -1 0 -1; -1 -1 -1 7 0];
[dists, predecessor] = greedy(G)