Issues in processing large data sets

Instructor: Prof. Dan Stefanescu

Affiliation: Department of Mathematics and Computer Science, Suffolk University, Boston, Massachusetts

Duration: 16 hours

Period: November 17 - 25, 2005

Place: Dipartimento di Ingegneria dell'Informazione: Elettronica, Informatica, Telecomunicazioni, via Diotisalvi, meeting room

Credits: 4

Contacts: Prof. Beatrice Lazzerini


Aims

Fueled by advances in computing and storage hardware, data is acquired and retained at unprecedented rates. We shall examine parallel and sequential techniques for computing with very large data sets. Parallel methods offer the advantage of incremental computational power, but also the challenges of new computational models and algorithms. A number of issues in parallel algorithms, hardware and software, will be discussed with an emphasis on the Bulk Synchronous Processing (BSP) computational model. Sequential methods, while working on familiar territories, need to reorganize established algorithms in order preserve scalability on modern processors. Data mining and, in particular, market basket analysis provide a fertile ground to evaluate these issues. A number of methods for computing frequent itemsets will be discussed and evaluated with respect to efficiency and scalability.

Syllabus

  • Processing large collections of data - introduction
  • Parallel processing - introduction, computation models
  • Parallel algorithm design
  • Parallel hardware, software
  • BSP model
  • BSP programming, BSP library
  • Designing BSP algorithms, matrix multiply, FFT
  • Implementing communication subsystems for parallel computation
  • Shared memory model
  • Synchronization algorithms
  • Data mining - introduction
  • Mining association rules in large databases - Apriori algorithm for frequent itemsets
  • Frequent pattern (FP) growth algorithm for frequent itemsets
  • Efficiency and scalability approaches for frequent itemset computation
  • Multilevel, multidimensional, correlation rules, conclusions