Analyzing and Visualizing Earth Data at the Speed of Supercomputers

Laura Carriere

Overview

As the availability and volume of Earth data grow, researchers spend more time downloading and processing data than doing science. The NASA Center for Climate Simulation (NCCS) has developed the Earth Data Analytics Service (EDAS), a high-performance big data analytics framework built on Apache Spark. Through a web-based interface, EDAS lets researchers use NCCS compute power to analyze large datasets stored at the NCCS, eliminating the need to download the data. The NCCS has also processed data from five reanalysis centers into a common format to facilitate workflows as part of the Collaborative REAnalysis Technical Environment (CREATE), and has developed an online visualization tool, CREATE-V, that allows scientists to visualize and compare the data.

Project Details

EDAS provides access to a suite of “canonical operations” (min, max, sum, average, anomaly, and standard deviation) that researchers can string together into workflows. EDAS uses a dynamic caching architecture, a custom framework, and a streaming parallel in-memory workflow to process huge datasets efficiently within limited memory spaces and with interactive response times. EDAS is also a front end to analytics using ClimateSpark with the Spatiotemporal Indexing Approach (SIA). All of these capabilities can be accessed via a Web Processing Service (WPS) API from applications written by the user.
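To illustrate what stringing canonical operations together can look like, here is a minimal sketch in plain Python. It is not the EDAS API: the names `TRANSFORMS`, `REDUCTIONS`, and `run_workflow` are hypothetical, and the choice of population standard deviation is an assumption, since the source does not say which form EDAS uses.

```python
from statistics import fmean, pstdev


def anomaly(values):
    """Departure of each value from the mean of the series."""
    mean = fmean(values)
    return [v - mean for v in values]


# Hypothetical registries: series -> series transforms and series -> scalar reductions.
TRANSFORMS = {"anomaly": anomaly}
REDUCTIONS = {
    "min": min,
    "max": max,
    "sum": sum,
    "average": fmean,
    "std": pstdev,  # assumption: population standard deviation
}


def run_workflow(values, steps):
    """Apply zero or more transforms in order, then one final reduction."""
    *transform_names, reduction_name = steps
    data = list(values)
    for name in transform_names:
        data = TRANSFORMS[name](data)
    return REDUCTIONS[reduction_name](data)


temps = [14.0, 15.0, 16.0, 19.0]
# Largest positive departure from the series mean: anomaly followed by max.
largest_departure = run_workflow(temps, ["anomaly", "max"])
```

With the sample series above, the mean is 16.0, so the largest departure is 3.0. Keeping each operation small and composable is what lets a short list of step names describe a whole workflow.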

CREATE automates the monthly acquisition and processing of reanalysis data and makes it available through the Earth System Grid Federation (ESGF), Thematic Real-time Environmental Distributed Data Services (THREDDS), and CREATE-V, allowing users to get a first look at data of interest, retrieve values at any location, and compare different data models. Building on CREATE’s success, eight ocean reanalyses and NASA Global Modeling and Assimilation Office (GMAO) aerosol forecast data are also available.

Results and Impact

EDAS allows users to compute close to the data. The NCCS ran commonly used operations, such as generating a global temperature average and plotting diurnal cycles, using standard methodologies and then reproduced the same workflows in EDAS. The EDAS workflows completed 15 to 50 times faster than the same workflows run with standard tools.
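As a concrete example of one such operation, a global temperature average on a regular latitude-longitude grid is commonly computed by weighting each grid row by the cosine of its latitude, because cells shrink toward the poles. The sketch below assumes that weighting and a hypothetical `field[lat][lon]` layout; the source does not state which method the NCCS tests used.

```python
import math


def global_mean(field, lats_deg):
    """Area-weighted mean of field[lat][lon] on a regular lat-lon grid.

    Each latitude row is weighted by cos(latitude), a common approximation
    of relative grid-cell area (an assumption, not the documented EDAS method).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for row, lat in zip(field, lats_deg):
        weight = math.cos(math.radians(lat))
        weighted_sum += weight * sum(row)
        weight_total += weight * len(row)
    return weighted_sum / weight_total


# Toy grid: three latitude bands, two longitudes each, temperatures in kelvin.
lats = [-60.0, 0.0, 60.0]
temps_k = [
    [280.0, 280.0],
    [300.0, 300.0],
    [280.0, 280.0],
]
mean_k = global_mean(temps_k, lats)
```

For this toy grid the weighted mean is about 290 K, while a plain unweighted average would give roughly 286.7 K because it overcounts the high-latitude rows.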

CREATE users can write one scientific workflow and no longer need to preprocess the data provided by the centers. This can reduce the “time to science” from weeks to days and is particularly valuable for graduate students trying to meet deadlines and professors trying to teach science rather than wrangle data. CREATE data is available for analysis through EDAS.

CREATE-V provides the ability to examine past events such as droughts, heat waves, and hurricanes. It uses EDAS to display plots of monthly anomalies and yearly cycles. CREATE-V can display multiple reanalysis results in a side-by-side comparison view, allowing researchers to determine whether one reanalysis performs better than another. This supports the improvement of current reanalyses and informs the choice of which reanalysis to use for research projects.
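The monthly anomalies and yearly cycles mentioned above can be sketched as follows. This is an illustrative calculation, not CREATE-V or EDAS code: the function names and the `{(year, month): value}` input layout are assumptions, and the yearly cycle is taken here to mean the average value for each calendar month across all years.

```python
from collections import defaultdict
from statistics import fmean


def yearly_cycle(records):
    """Mean value for each calendar month across all years: {month: mean}."""
    by_month = defaultdict(list)
    for (_year, month), value in records.items():
        by_month[month].append(value)
    return {month: fmean(values) for month, values in sorted(by_month.items())}


def monthly_anomalies(records):
    """Each monthly value minus the long-term mean for that calendar month."""
    cycle = yearly_cycle(records)
    return {(year, month): value - cycle[month]
            for (year, month), value in records.items()}


# Toy monthly means keyed by (year, month).
records = {
    (2000, 1): 1.0, (2001, 1): 3.0,
    (2000, 7): 10.0, (2001, 7): 14.0,
}
cycle = yearly_cycle(records)
anomalies = monthly_anomalies(records)
```

Here the January mean is 2.0 and the July mean is 12.0, so July 2001 has an anomaly of 2.0. Computing the same quantities for several reanalyses would give the kind of side-by-side comparison the tool displays.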

Why HPC Matters

Underlying these software tools is the NCCS Data Analytics and Storage System (DASS), which moves compute close to the data. DASS provides access to Spark/MapReduce technologies for parallelizing data access and to efficient caching mechanisms that speed it up.
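The map/reduce pattern behind this kind of parallelization can be shown in a few lines of plain Python. This sketch does not use Spark and does not describe how DASS is implemented; `map_chunk`, `combine`, and `chunked_mean` are hypothetical names. The point is that each chunk is summarized independently, so the map step could run on separate workers, and only small partial results are combined.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: summarize one chunk as a (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(left, right):
    """Reduce step: merge two (sum, count) pairs."""
    return (left[0] + right[0], left[1] + right[1])


def chunked_mean(chunks):
    """Mean over all values without holding every chunk in memory at once."""
    total, count = reduce(combine, map(map_chunk, chunks), (0.0, 0))
    return total / count


# Toy data split into unevenly sized chunks, as might come from separate files.
chunks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean_value = chunked_mean(chunks)
```

Because `combine` is associative, the partial results can be merged in any grouping, which is what allows a framework to distribute the map step across nodes that sit next to the data.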

What’s Next

Future EDAS work includes adding new operations and workflows to increase its value to scientists. Plans also include processing more ensemble variables and publishing them through CREATE and CREATE-V.