CIF: Small: Coding Techniques for Distributed Machine Learning

Overview

abstract

Modern machine learning models have achieved great success and have been widely deployed across many sectors. As the size of data used to train machine learning models keeps growing, it is now routine to use distributed computing infrastructures such as the cloud. This strategy allows the computation of training to be distributed among a large number of nodes hosted in the cloud, where each node processes a partition of the whole data set. However, the performance of nodes in the cloud is often unreliable, due to system failures, resource contention, load imbalance, etc., and that unreliability can significantly delay the training process. This project pursues a coding-based framework that not only tolerates the effects of faulty nodes, but also further enhances the performance of machine learning training by dynamically taking advantage of the resources available on all nodes, whether they are faulty or not. The outcomes of this project should lead to a significant performance boost for distributed training of machine learning models.
To enable the efficient use of distributed computing across unreliable infrastructure for training machine learning models from big data sets, the technical objectives of this project are divided into three levels. This project will first study coding theory for distributed matrix multiplication, a universal operation in various machine learning algorithms, and propose a coding framework with both fault tolerance and a significant performance boost. This framework will then be applied to parameter servers at the architecture level and to deep neural networks at the model level. Combining these three parts, this work will lead to a practical coding framework that can efficiently scale out computation on heterogeneous, unreliable nodes, where the coding schemes will be applied to distributed machine learning at different levels, including fundamental arithmetic, architectures, and models.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

date/time interval

October 1, 2019 - September 30, 2022

awarded by

National Science Foundation

administered by

Telecommunications and Information Technology Institute

sponsor award ID

1910447
