Algorithms & Data Mining

Theme leader: Professor Sanjay Chawla

Algorithms Sub-Theme

One of the theoretical foundations of cloud and distributed computing is the design of algorithms that solve problems arising in the cloud effectively and efficiently. Most practical problems in cloud and distributed computing can be reduced to well-defined algorithmic problems. In particular, we study algorithms in areas including:

  1. Scheduling Problems
  2. Bin-packing Problems
  3. Assignment Problems
  4. Network Flow Problems

Most flavors of the above problems are NP-hard and cannot be solved optimally in polynomial time (unless P = NP). Hence, we investigate approximation algorithms that come with a provable quality bound on the solution (e.g., a guarantee that the computed schedule is at most 50% longer than an optimal schedule). We will use techniques including linear programming, probabilistic algorithms, and convex optimization to get a handle on these algorithmic problems.
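
As a concrete illustration of this style of guarantee, below is a minimal sketch (a textbook example, not a contribution of this theme) of Graham's greedy list-scheduling heuristic for minimizing makespan on identical machines; its makespan is known to be within a factor of 2 - 1/m of optimal, where m is the number of machines. The job data at the end is purely illustrative.

    import heapq

    def list_schedule(jobs, m):
        """Graham's greedy list scheduling: place each job on the currently
        least-loaded of m identical machines. The resulting makespan is at
        most (2 - 1/m) times the optimal makespan."""
        machines = [(0, i) for i in range(m)]   # (current load, machine id)
        heapq.heapify(machines)
        assignment = [[] for _ in range(m)]
        for job in jobs:
            load, i = heapq.heappop(machines)   # pick the least-loaded machine
            assignment[i].append(job)
            heapq.heappush(machines, (load + job, i))
        makespan = max(load for load, _ in machines)
        return assignment, makespan

    # Illustrative processing times for 7 jobs on 3 machines.
    print(list_schedule([7, 5, 4, 4, 3, 2, 1], m=3))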

Data Mining Sub-Theme

Data Mining is a sub-discipline of Computer Science concerned with scaling statistical inference to large, high-dimensional data sets. Data Mining consists of four meta-tasks:

  1. Classification
  2. Clustering and Feature Selection (a minimal clustering sketch follows this list)
  3. Pattern and Rule Mining
  4. Outlier or Anomaly Detection
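
As a small illustration of the clustering meta-task, the sketch below runs a naive version of Lloyd's k-means algorithm on a handful of 2-D points; the data, the choice of k, and the fixed iteration count are purely illustrative and not tied to any particular project in this theme.

    import random

    def kmeans(points, k, iterations=20, seed=0):
        """Naive Lloyd's algorithm for k-means clustering of 2-D points."""
        random.seed(seed)
        centroids = random.sample(points, k)
        for _ in range(iterations):
            # Assignment step: attach each point to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for x, y in points:
                j = min(range(k),
                        key=lambda c: (x - centroids[c][0]) ** 2
                                      + (y - centroids[c][1]) ** 2)
                clusters[j].append((x, y))
            # Update step: move each centroid to the mean of its cluster.
            for j, cluster in enumerate(clusters):
                if cluster:
                    centroids[j] = (sum(x for x, _ in cluster) / len(cluster),
                                    sum(y for _, y in cluster) / len(cluster))
        return centroids, clusters

    # Two well-separated groups of illustrative points.
    data = [(1, 1), (1.5, 2), (0.5, 1.2), (8, 8), (9, 9), (8.5, 9.5)]
    print(kmeans(data, k=2))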

Data Mining tasks leverage techniques from high-performance and distributed computing to scale statistical inference to previously unimaginable levels. For example, consider the problem of network anomaly detection, where the objective is to identify the sources of unusual network traffic activity. This task involves several sub-tasks that must be carried out in a distributed fashion and executed in a high-performance setting:

  1. Data needs to be sampled and aggregated from distributed servers and periodically sent to a centralized monitoring server.
  2. To keep up with the pace of network traffic, the centralized monitoring server must be able to carry out high-throughput statistical analysis (a minimal sketch of such an analysis follows this list).
  3. The centralized server may itself be a high-performance computing system and may involve its own distributed setup.
  4. The analysis and inference from the centralized monitoring server need to be communicated back to the distributed nodes as statistical models for anomaly alert generation.
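
To make step 2 concrete, here is a minimal sketch of one possible analysis the monitoring server could run: a per-source z-score test that flags sources whose current traffic count deviates sharply from a historical baseline. The function names, the 3-sigma threshold, and the sample data are all illustrative assumptions, not a description of a deployed system.

    from statistics import mean, stdev

    def build_baseline(history):
        """Fit a per-source baseline (mean and standard deviation of past
        per-interval request counts) from aggregated samples."""
        return {src: (mean(counts), stdev(counts))
                for src, counts in history.items() if len(counts) > 1}

    def flag_anomalies(current, baseline, z_threshold=3.0):
        """Flag sources whose current count deviates from the baseline by
        more than z_threshold standard deviations."""
        alerts = {}
        for src, count in current.items():
            mu, sigma = baseline.get(src, (None, None))
            if sigma and abs(count - mu) / sigma > z_threshold:
                alerts[src] = round((count - mu) / sigma, 1)
        return alerts

    # Illustrative per-source request counts aggregated from edge servers.
    history = {"10.0.0.1": [100, 110, 95, 105], "10.0.0.2": [20, 22, 19, 21]}
    current = {"10.0.0.1": 104, "10.0.0.2": 400}
    print(flag_anomalies(current, build_baseline(history)))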

Map-Reduce is another computational paradigm that has arisen to deal with large-scale data computation in a distributed setting. The essence of Map-Reduce is to take “computation to data” and propagate aggregated data analysis up a hierarchy of computation nodes. Map-Reduce is destined to be the computational model for large-scale statistical analysis in a distributed setting and the bedrock of cloud and service computing.
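
The sketch below simulates the map, shuffle, and reduce phases of this paradigm in a single Python process, counting records per key in the style of the classic word-count example; a real deployment would run the mapper and reducer on distributed workers, and the log lines shown are invented for illustration.

    from collections import defaultdict
    from itertools import chain

    def map_phase(records, mapper):
        """Apply the user-defined mapper to every record, yielding (key, value) pairs."""
        return chain.from_iterable(mapper(r) for r in records)

    def shuffle(pairs):
        """Group intermediate values by key (the step between map and reduce)."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups, reducer):
        """Apply the user-defined reducer to each key's list of values."""
        return {key: reducer(key, values) for key, values in groups.items()}

    # Example: count log lines per HTTP status code (the word-count pattern).
    records = ["200 /index", "404 /missing", "200 /home", "500 /api", "200 /api"]
    mapper = lambda line: [(line.split()[0], 1)]
    reducer = lambda key, values: sum(values)
    print(reduce_phase(shuffle(map_phase(records, mapper)), reducer))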