CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration
Conference
Jin, H., & Wu, Y. (2025). CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration. In 2025 IEEE International Conference on Web Services (ICWS), 316-323. DOI: 10.1109/ICWS67624.2025.00046
Large Language Models (LLMs) exhibit remarkable human-like predictive capabilities. However, it is challenging to deploy LLMs to provide efficient and adaptive inference services at the edge. This paper proposes a novel Cloud-Edge Collaboration framework for LLMs (CE-CoLLM) to tackle these challenges. First, we identify the transmission of LLM contextual data between the cloud and edge as a key performance bottleneck, which introduces substantial communication overhead that dominates overall inference latency and makes naïve cloud-edge collaboration for LLMs inefficient. Second, we introduce a suite of novel techniques, including a latency-aware early exit mechanism and efficient cloud context management, into CE-CoLLM, which collectively reduce communication overhead and preserve LLM inference accuracy. Third, we design two adaptive inference modes to accommodate diverse edge environments: (1) a low-latency standalone edge inference mode that enables reliable edge-side independent LLM inference even under unstable network conditions, and (2) a high-accuracy cloud-edge collaborative inference mode that adaptively leverages cloud resources to enhance prediction accuracy. Extensive experiments on multiple benchmark datasets demonstrate that CE-CoLLM reduces overall inference time by up to 13.81% and offloads over 84.53% of the computational workload from the cloud to the edge, compared to conventional cloud-based LLM deployment, without sacrificing prediction accuracy. The code is provided on GitHub at https://github.com/mlsysx/CE-CoLLM.
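To make the mode-selection idea concrete, below is a minimal, hypothetical Python sketch of how a latency-aware early-exit policy at the edge might choose between standalone edge inference and cloud-edge collaboration. All names, thresholds, and the decision logic here are illustrative assumptions for exposition only; they are not taken from the CE-CoLLM repository or paper implementation.

```python
import math
from dataclasses import dataclass

# Hypothetical sketch of a latency-aware early-exit decision at the edge.
# Names such as EdgeExitPolicy, confidence_threshold, and latency_budget_ms
# are illustrative assumptions, not identifiers from the CE-CoLLM codebase.

@dataclass
class EdgeExitPolicy:
    confidence_threshold: float = 0.9   # emit locally if the exit head is confident enough
    latency_budget_ms: float = 50.0     # avoid offloading if the cloud round trip is too slow

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def edge_generate_token(exit_logits, est_cloud_rtt_ms, policy, offload_to_cloud):
    """Decide whether to emit the edge prediction or offload to the cloud.

    exit_logits      -- logits from an intermediate exit head on the edge-side model
    est_cloud_rtt_ms -- current estimate of the cloud round-trip latency
    offload_to_cloud -- callable that sends only the new context to the cloud
                        (earlier context is assumed to be cached cloud-side)
                        and returns the cloud model's token prediction
    """
    probs = softmax(exit_logits)
    confidence = max(probs)

    # Case 1: the early-exit head is confident -> emit locally (low latency).
    if confidence >= policy.confidence_threshold:
        return probs.index(confidence), "edge-early-exit"

    # Case 2: the network is too slow or unavailable -> standalone edge mode.
    if est_cloud_rtt_ms > policy.latency_budget_ms:
        return probs.index(confidence), "edge-standalone"

    # Case 3: collaborate with the cloud for higher accuracy.
    return offload_to_cloud(), "cloud-edge-collaborative"

if __name__ == "__main__":
    policy = EdgeExitPolicy()
    fake_cloud = lambda: 42  # stand-in for a cloud-side prediction
    token, mode = edge_generate_token(
        exit_logits=[0.1, 2.5, 0.3],
        est_cloud_rtt_ms=30.0,
        policy=policy,
        offload_to_cloud=fake_cloud,
    )
    print(token, mode)
```

In this sketch, the edge device only pays the cloud round trip when its own exit head is uncertain and the network is fast enough, which mirrors the abstract's two adaptive modes: standalone edge inference under poor connectivity and cloud-edge collaboration when extra accuracy is worth the communication cost.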