Scalable Deep Learning on Cloud Platforms: Challenges and Architectures
DOI: https://doi.org/10.21590/ijtmh.04.02.03

Keywords: Scalable deep learning, cloud computing, Kubernetes, TensorFlow on Kubernetes, Apache Spark MLlib, GPUs, TPUs, distributed training, elasticity, cost-efficiency.

Abstract
The rapid growth of deep learning workloads has placed unprecedented demands on scalable and efficient computational infrastructure. Cloud platforms have become the primary providers of large-scale distributed training, offering elastic resources, purpose-built accelerators, and managed machine learning services. This study investigates cloud-native architectures such as Kubernetes, TensorFlow on Kubernetes, and Apache Spark MLlib as means of deploying distributed deep learning applications that address the challenges of performance, elasticity, and cost-effectiveness. It discusses the role of GPUs and newer TPUs in accelerating training, analyzes the performance of auto-scaling and orchestration policies, and outlines the trade-offs among cloud providers. The paper also identifies bottlenecks such as data-transfer costs, scheduling inefficiencies, and vendor lock-in, and comments on emerging trends in serverless ML and hybrid deployments. The results show that cloud-based solutions are essential for bridging the gap between the computational requirements of deep learning and its real-world application at scale, making cloud infrastructure the foundation of the next generation of AI.