The course encompasses two main learning outcomes:
- Students will identify the core concepts of distributed systems; that is, the way in which several machines can be orchestrated to correctly solve complex problems in an efficient, reliable and scalable manner.
- Students will examine how existing systems have applied the core concepts of distributed systems, and will additionally apply such concepts in developing sample systems.
Understanding the Core Concepts of Distributed Systems
Students will learn the core concepts that comprise any distributed system. They will recognize the system constraints, trade-offs and appropriate techniques for building distributed systems that best serve the computing needs of different classes of applications. In particular, students will learn the following concepts:
- Access & Location Transparency
- Task Parallelization
- Fault-tolerance
Access & Location Transparency
Exposing the capabilities of machines while hiding their details is one of the first steps in designing distributed systems. Such systems reach broad economies and populations, whose users leverage their power transparently. For instance, on the Internet, which is a successful distributed system, a simple browser interface allows you to explore information scattered over wide geographic areas. In this course, students will examine how to abstract data and machine locations (which may reside at different physical places) as well as data and machine replication.
Specifically, students will study the following topics:
- Processes and Communication: Students will explain and contrast the communication mechanisms between different processes and systems.
- Naming: Students will identify why entities and resources in distributed systems should be named, and examine naming conventions as well as some name resolution mechanisms.
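As a rough illustration of how naming supports location transparency, the sketch below keeps a table from logical names to addresses. The class `NameService`, its methods, and the addresses are hypothetical and chosen only for illustration; real name resolution mechanisms are distributed and considerably more involved.

```python
class NameService:
    """Hypothetical in-memory name service: logical name -> current address."""

    def __init__(self):
        self._table = {}

    def register(self, name, address):
        # Re-registering a name models the resource moving to a new location.
        self._table[name] = address

    def resolve(self, name):
        try:
            return self._table[name]
        except KeyError:
            raise LookupError(f"unknown name: {name}") from None


names = NameService()
names.register("course-files", "10.0.0.5:9000")   # made-up address
first = names.resolve("course-files")
names.register("course-files", "10.0.0.9:9000")   # the resource moves
second = names.resolve("course-files")
```

The client always asks for `"course-files"`; only the name service needs to know that the underlying location changed.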
Task Parallelization
Traditional algorithms that work on a single processor are often inefficient, or even fail to work, in a system where multiple machines are working in parallel. In distributed systems, problems/jobs can be solved using parallelization. Generally, a job is split into multiple tasks, and all tasks are executed in parallel on different machines. The tasks may access common resources, such as data contained in a shared file. Consequently, two main challenges emerge. First, we ought to ensure that the concurrently running tasks are coordinated and synchronized in a manner that correctly achieves the job's goal. Second, we need to decide how to replicate and place resources across multiple computers so that tasks can access them more effectively.
Specifically, students will study the following topics:
- Concurrency and Synchronization: Students will identify the issues involved in coordinating and synchronizing multiple tasks in a distributed system.
- Caching, Replication and Consistency: Students will understand how replication and caching of resources can optimize performance and scalability, as well as examine various models for maintaining the consistency of replicated and cached data.
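The idea of splitting a job into tasks that run in parallel and update a shared resource can be sketched as follows. This is a minimal single-machine sketch in which threads stand in for separate machines; the names `split_job`, `SharedTotal`, and `run_job` are illustrative assumptions, not part of the course material.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def split_job(numbers, num_tasks):
    """Split one job (a list of numbers) into roughly equal tasks."""
    chunk = max(1, (len(numbers) + num_tasks - 1) // num_tasks)
    return [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]


class SharedTotal:
    """A shared resource; the lock synchronizes concurrent updates."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:
            self.value += amount


def run_job(numbers, num_tasks=4):
    total = SharedTotal()
    tasks = split_job(numbers, num_tasks)
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        # list() waits for every task and re-raises any worker exception.
        list(pool.map(lambda task: total.add(sum(task)), tasks))
    return total.value


result = run_job(list(range(1, 101)))
```

Without the lock, two tasks could read and write `value` at the same time and lose an update, which is the kind of coordination problem the first challenge above refers to.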
Fault-tolerance
In distributed systems, the failure of a single computer, or of part of a computer (known as a partial failure), is very likely. If such a failure is not tolerated, the whole system might come to a grinding halt or behave in a random and unpredictable way. Students will learn how to avoid and recover from partial failures, a concept referred to as fault-tolerance.
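One simple way to tolerate a partial failure is to keep several replicas of a service and fail over to the next one when a replica does not respond. The sketch below assumes replicas are plain callables and that a failed replica raises the hypothetical exception `ReplicaDown`; it illustrates the idea only and is not a complete fault-tolerance scheme.

```python
class ReplicaDown(Exception):
    """Hypothetical signal that a replica has failed."""


def make_replica(name, alive):
    """Build a stand-in replica; a dead one always raises ReplicaDown."""
    def handle(request):
        if not alive:
            raise ReplicaDown(name)
        return f"{name} handled {request}"
    return handle


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful reply."""
    failures = []
    for replica in replicas:
        try:
            return replica(request)
        except ReplicaDown as exc:
            failures.append(str(exc))
    raise RuntimeError(f"all replicas failed: {failures}")


replicas = [make_replica("A", alive=False), make_replica("B", alive=True)]
reply = call_with_failover(replicas, "read x")
```

Here the failure of replica A is masked from the caller, who still receives a reply from replica B.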
Practical Application of State-of-the-Art Distributed Systems
Students will also learn how to apply principles of distributed systems in a real-world setting. In particular, they will learn the following topics:
- Distributed Frameworks: Students will learn about distributed frameworks such as MapReduce, GraphLab and Pregel. These frameworks allow developers to easily program distributed problems/algorithms, while ensuring correctness, fault-tolerance and efficiency.
- Distributed File Systems: Students will learn how a file can be striped and placed anywhere in a distributed system (in what is referred to as a distributed file system), yet be accessed transparently, as if it were a local file. They will examine how to apply distributed system principles to ensure transparency, consistency and fault-tolerance in distributed file systems.
- Virtualization: Students will learn the concept of system virtualization, where the state of a computer is abstracted from the underlying hardware. This allows masking the heterogeneity of the machines that comprise a distributed system, while also increasing overall system utilization and security.
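To give a flavor of the programming model behind the MapReduce framework listed under Distributed Frameworks above, here is a single-process word-count sketch. It imitates the map, shuffle, and reduce steps with ordinary Python functions; the function names are illustrative assumptions and do not correspond to any framework's actual API, and a real framework would run these steps across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["a b a", "b c"])
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a framework is free to run those calls in parallel on different machines.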