A Scalable Framework for Heterogeneous Environmental Data Management Using Smart Data Pipeline (Proceedings Paper)

Poudel, P, Guan, B, Sanchez, N et al. (2025). A Scalable Framework for Heterogeneous Environmental Data Management Using Smart Data Pipeline. 10.1145/3708035.3736017

cited authors

  • Poudel, P; Guan, B; Sanchez, N; Bahreini, K; Cui, W; Lopez, A; Najafi, H; Fu, Z; Bobadilla, L; Liu, J

abstract

  • Environmental data originates from diverse sources, posing challenges in management, processing, and visualization. This paper introduces a scalable, AI-driven data pipeline framework for environmental data management and discovery. The framework integrates workflow orchestration, automated data ingestion and processing, federated storage, and seamless geospatial visualization. It employs a Ceph-based storage system to handle large, heterogeneous datasets, leveraging its fault-tolerant, distributed architecture for high-performance storage across object, block, and file interfaces. To enhance data discoverability and interoperability, the framework incorporates Generative AI (GenAI) for automated metadata generation, reducing manual annotation overhead while improving real-time processing and cross-platform integration. Additionally, the system enables interdisciplinary collaboration through standardized metadata structures and scalable data federation. A case study using buoy data validates the framework’s capabilities, including data processing, cleaning, and visualization. By addressing critical data integration and accessibility challenges, the system fosters a scalable, efficient, and intelligent research data-sharing ecosystem for environmental science studies.

publication date

  • July 18, 2025

Digital Object Identifier (DOI)