Data Mining Tutorial 1
1. Linear Algebra
Data is often represented as a table where the rows are the individual data
elements and the columns are indicators or features which characterize the data.
For example each student in a class is represented as a row of a table and the
columns capture information like student id, age, gender, address, previous
background etc. Mathematically a table of numbers is represented as a matrix.
Google also uses the matrix, albeit a very large one, to represent the WWW. In
fact all graphs can be represented as matrices. Mathematical techniques to
manipulate matrices fall under the area known as Linear Algebra. The area of
Linear Algebra plays an important role in data mining, statistics and machine
learning.
1. What is the result of the following matrix vector multiplication?
\begin{equation}
\left[
\begin{matrix}
1 & 2 & 3 & 4 \\
2 & 3 & 4 & 1 \\
3 & 4 & 1 & 2 \\
4 & 1 & 2 & 3
\end{matrix}
\right]
\left[
\begin{matrix}
1 \\ 2 \\ 3 \\ 4
\end{matrix}
\right]
=
\left[
\begin{matrix}
30 \\ 24 \\ 22 \\ 24
\end{matrix}
\right]
\end{equation}
A = [1 2 3 4; 2 3 4 1; 3 4 1 2; 4 1 2 3]
x = [1 2 3 4]'
y = A * x
2. Eigenvalues are an important property of matrices. Later in the unit we
will see how eigenvalues can be used to extract some important properties
of data. Consider the following matrix
\begin{equation}
\left[
\begin{matrix}
1 & 2 \\ 3 & 4
\end{matrix}
\right]
\end{equation}
A = [1 2; 3 4]
D = eig(A)
D = -0.3723; 5.3723
3. Suppose a matrix A has size mxn. Suppose b is a column vector of length
n. Use the O-notation to represent the space required to store the
matrix A Use the O-notation to represent the number of operations
required to carry out the multiplication of A with b
Space: $O(m \times n)$
Operations: $O(m \times n)$
2. Statistics
Statistics is the original language of data and information. Throughout the
unit we will be using statistics to build models of data.
1. Consider the set of numbers $\{-3, -2, -1, 0, 1, 2, 3\}$. What is mean
and standard deviation of the set.
S = [-3 -2 -1 0 1 2 3]
m = mean(S)
s = std(S)
m = 0, s = 2.1602
2. Assume the above set of numbers were generated from a Normal
distribution $N(\mu, \sigma)$. What is the maximum likelihood estimators of $\mu$
and $\sigma$.
p = mle(S)
p = 0; 2
3. A coin is tossed 20 times resulting in 14 heads. What is the
systematic, statistical and frequentist way of making the claim "this
is a fair coin."
\begin{equation}
f(k) = \left( \begin{matrix} 20 \\ k \end{matrix} \right) 0.5^{k}(1-0.5)^{20-k}
\end{equation}
\begin{equation}
p = \sum_{k=14}^{20} f(k) = 0.0577 \ge 0.05
\end{equation}
\begin{equation}
p = \sum_{k=18}^{20} f(k) = 0.0002 \le 0.05
\end{equation}
3. Database Management
For large data we cannot ignore external memory ("hard drive") while
designing data mining algorithms.
1. The hard drive is a mechanical device. The cost to access data from
disk can be broken up into three components: seek time (s), rotational
latency(r) and transfer time (t). What is the ordering between s, r and
t.
s > r > t
2. The cost of sorting N items in main memory is typically $O(N \log_2 N)$.
Assume each block of the disk can hold B items. What is the cost of
external memory sorting?
$O(\frac{N}{B} \log_{\frac{N}{B}}N)$
3. What does the "B" mean in B-Trees. Why is the "B" important.
B stands for balanced, i.e. paths from root to leaf have the same length.
So the time to access each leaf is the same.
4. Calculus
Calculus, as expected, is one of the main mathematical tools in Data Mining.
1. Why is the gradient of a function the direction of maximum ascent or descent?
2. Find the minimum of $f(x) = x_1^2 + x_2^2$ subject to the inequality
constraint $h_1(x) = -2x_1 -x_2 + 10 \leq 0$ and $h_2(x) = -x_1 \leq 0$.
input = -10: 0.5: 10;
[X, Y] = meshgrid(input, input);
Z = X.^2 + Y.^2;
mesh(X,Y,Z);
surf(X,Y,Z);
% f(x) = x1^2 + x2^2
H = [2 0; 0 2];
f = [0; 0];
A = [-2 -1; -1 0];
b = [-10; 0];
[x, fval] = quadprog(H, f, A, b)
5. Algorithms
Algorithms are the bread and butter of all computing. While there are all
sorts of algorithms, computer scientists have categorized algorithms into
some common "Patterns" including Greedy Approach, Divide and Conquer,
Dynamic Programming, Graph Flows. All these patterns appear frequently in
Data Mining.
1. Consider a directed graph with five node $v_0, v_1, \dots, v_4$. The edge set
is given by ${(v_0, v_1, 2), (v_0, v_4, 4), (v_1, v_2, 3), (v_2, v_4, 1), (v_2,
v_3, 5), (v_4, v_3, 7), (v_3, v_0, 8)}$. Design a greedy algorithm to find the
shortest path from $v_0$ to all other nodes.
greedy.m
function [dists, predecessor] = greedy(G)
count = size(G, 1);
dists = -1 * ones(count, 1);
predecessor = -1 * ones(count, 1);
done = -1 * ones(count, 1);
next = 1;
dists(1) = 0;
for i = 1:count
% Update distances
for j = 1:count
if j ~= i && G(next, j) ~= -1
if dists(j) == -1
dists(j) = G(next, j) + dists(next);
predecessor(j) = next;
elseif dists(j) > (dists(next) + G(next, j))
dists(j) = (dists(next) + G(next, j));
predecessor(j) = next;
end
end
end
done(next) = 1;
% Find index of next element to consider
for j = 1:count
if done(j) == -1
next = j;
break;
end
end
end
end
G = [0 2 -1 -1 4; -1 0 3 -1 -1; -1 -1 0 5 1; 8 -1 -1 0 -1; -1 -1 -1 7 0];
[dists, predecessor] = greedy(G)