The Earth Data Analytic Services (EDAS) Framework

Thomas Maxwell

Abstract

Faced with unprecedented growth in Earth data volume and demand, NASA has developed the Earth Data Analytic Services (EDAS) framework, a high-performance big data analytics framework built on Apache Spark. The framework lets scientists execute data processing workflows that combine common analysis operations close to NASA's massive data stores. Data is read in standard formats (NetCDF, HDF, etc.) from a POSIX file system and processed using vetted Earth data analysis tools (ESMF, CDAT, NCO, etc.). EDAS uses a dynamic caching architecture, a custom distributed array framework, and a streaming parallel in-memory workflow to process huge datasets efficiently within limited memory while maintaining interactive response times. EDAS services are accessed via a Web Processing Service (WPS) API being developed in collaboration with the Earth System Grid Federation (ESGF) Compute Working Team to support server-side analytics for ESGF. New analytic operations can be developed in Python, Java, or Scala (with support for other languages planned). Client packages in Python, Java/Scala, or JavaScript contain everything needed to build, submit, manage, and visualize big data analysis workflows from the user’s desktop computer, or to develop web applications with embedded analytics.
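The idea of a streaming workflow that handles a large dataset within limited memory can be illustrated with a minimal Python sketch. This is not EDAS code and does not use the EDAS API; the function names (`read_chunks`, `partial_sum`, `merge`, `time_mean`), the synthetic data, and the time-by-cell layout are all assumptions made for illustration. It computes a per-cell time average by reading one slab of time steps at a time, reducing each slab to a small partial result, and merging the partial results, so only one slab is held in memory at once.

```python
from functools import reduce
from typing import Iterator, List, Tuple

# Hypothetical layout: a chunk is a list of time steps, each a list of grid-cell values.
Chunk = List[List[float]]
Partial = Tuple[List[float], int]


def read_chunks(n_steps: int, n_cells: int, chunk_size: int) -> Iterator[Chunk]:
    """Stand-in for reading a variable slab by slab from a file.

    Yields synthetic values (time index + cell index) instead of real data.
    """
    for start in range(0, n_steps, chunk_size):
        stop = min(start + chunk_size, n_steps)
        yield [[float(t + c) for c in range(n_cells)] for t in range(start, stop)]


def partial_sum(chunk: Chunk) -> Partial:
    """Map step: per-cell sum over the chunk, plus the number of time steps."""
    return [sum(column) for column in zip(*chunk)], len(chunk)


def merge(a: Partial, b: Partial) -> Partial:
    """Reduce step: combine two partial results into one."""
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]


def time_mean(chunks: Iterator[Chunk]) -> List[float]:
    """Per-cell time average, consuming chunks lazily one at a time."""
    sums, count = reduce(merge, map(partial_sum, chunks))
    return [s / count for s in sums]


if __name__ == "__main__":
    print(time_mean(read_chunks(n_steps=10, n_cells=3, chunk_size=4)))
```

Because the merge step is associative, the partial results could equally be computed on separate workers and combined afterwards, which is the general pattern a Spark-based framework can exploit; the result does not depend on the chunk size chosen.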

The EDAS architecture brings together the tools, data storage, and high-performance computing required for timely analysis of large-scale datasets at the location where the data resides. It is currently deployed at NASA and available for public use. Another NASA EDAS deployment supports the Collaborative REAnalysis Technical Environment (CREATE) project, which centralizes numerous global reanalysis datasets onto a single analytics platform. These services enable scientists and decision makers to access remote model and reanalysis data archives and to investigate trends, variability, anomalies, and other features of local and global Earth system dynamics.