
Earth Data Analytics Service (EDAS)

What can EDAS do?

As the availability and volume of Earth data grow, researchers spend more time downloading and processing their data than doing science. The NASA Center for Climate Simulation (NCCS) has developed the Earth Data Analytics Service (EDAS), a high-performance big data analytics framework built on Apache Spark, to allow researchers to leverage our compute power to analyze large datasets located at the NCCS through a web-based interface, thereby eliminating the need to download the data. 

EDAS provides access to a suite of “canonical operations” (min, max, sum, difference, average, root mean square, anomaly, and standard deviation) that researchers can combine to develop various workflows. EDAS uses a dynamic caching architecture, a custom framework, and a streaming parallel in-memory workflow to process huge datasets efficiently within limited memory spaces at interactive response times. These operations and datasets can be accessed via a Web Processing Service (WPS) API using applications written by the user.

EDAS allows users to compute close to the data. Performance tests of commonly used workflows produced results 15 to 50 times faster than standard tools in our environment.
 

EDAS Architecture - What are the components of EDAS?

There are three components to the EDAS architecture:

  1. Client-run software, either a Jupyter Notebook or Python scripts, that is installed on the user's system via conda.
  2. A WAR file that runs in Tomcat.  It listens for external requests, parses them, and submits them to the analytics server.
  3. Backend analytics code.  This uses Spark to partition the work across the available worker nodes.

The user invokes the client to access the Tomcat server, which forwards the request to the backend server.  Results are returned to the user's system as NetCDF files.

EDAS is a local NCCS implementation of the Earth System Grid Federation's (ESGF) Compute Working Team (CWT) project to expose ESGF distributed compute resources via an API and a set of analytical operations.

The EDAS client, esgf-compute-api, is installed via conda.
1) If you don't already have Anaconda installed on your local system, start by installing the correct version for your environment.
2) Run conda to install the EDAS client and the necessary dependencies.  Detailed instructions are available here.

conda create -n edas -c conda-forge -c acme -c uvcdat uvcdat pyzmq psutil lxml

For installs on macOS 10.12, you will also need to run:

conda install six requests urllib3

This should update your path so that you can access the commands, but some shells may need to be updated manually, for example:

export PATH=${HOME}/anaconda2/bin:${PATH} # for [ba]sh
setenv PATH ${HOME}/anaconda2/bin:${PATH} # for [t]csh

The conda environment is based on the one used for LLNL's UV-CDAT.  While these directions should be sufficient for setting up conda in your environment, additional guidance is available on their site.
3) Initialize the shell environment for EDAS and add the branch of the CWT API package that contains the modifications for the NCCS version of the API:

source activate edas
git clone https://github.com/ESGF/esgf-compute-api.git
cd esgf-compute-api
git checkout updates_for_EDAS
git pull
python setup.py install
4) Either start a Jupyter Notebook to access the API (the Jupyter Notebook software is installed during the above process).  From within the EDAS conda environment (i.e., after running source activate edas):

jupyter notebook

OR execute the EDAS commands from any Python script.
The EDAS API calls can be made to this address: https://edas.nccs.nasa.gov/wps/cwt

Essentially, the EDAS API accepts three parameters: variable, domain, and operation.  The client software is documented in a number of places.

The general ESGF CWT API is described in two documents:
API Description (describes Inputs (variable, domain, operation) and Outputs)
ESGF WPS EXTENSION API Summary (Original Definition Document) 
       
Module documentation:  Detailed documentation will be available Nov 20, 2017. 

The NCCS has made a subset of its Earth data holdings available through EDAS.  This list will grow substantially as the service matures and resources become available.

Run this WPS GetCapabilities call to get a dynamic list of collections: https://edas.nccs.nasa.gov/wps/cwt?request=GetCapabilities&identifier=coll
V1.0 Collections:

Collection Name       Description
cip_merra_mth         CREATE-IP NASA GMAO MERRA Monthly
cip_merra2_mth        CREATE-IP NASA GMAO MERRA2 Monthly
cip_eraint_mth        CREATE-IP ECMWF ERA-Interim Monthly
cip_cfsr_mth          CREATE-IP NOAA NCEP CFSR Monthly
cip_jra25_mth         CREATE-IP JMA JRA-25 Monthly
cip_jra55_mth         CREATE-IP JMA JRA-55 Monthly
cip_20crv2c_mth       CREATE-IP NOAA ESRL 20CRv2c Monthly
cip_merra_6hr         CREATE-IP NASA GMAO MERRA 6-hourly
cip_merra2_6hr        CREATE-IP NASA GMAO MERRA2 6-hourly
cip_eraint_6hr        CREATE-IP ECMWF ERA-Interim 6-hourly
cip_cfsr_6hr          CREATE-IP NOAA NCEP CFSR 6-hourly
cip_jra55_6hr         CREATE-IP JMA JRA-55 6-hourly
iap-ua_era40_tas1hr   IAP-UA Reprocessed ERA-40 1-hr Surface Temperature
iap-ua_eraint_tas1hr  IAP-UA Reprocessed ERA-Interim 1-hr Surface Temperature
iap-ua_merra_tas1hr   IAP-UA Reprocessed MERRA 1-hr Surface Temperature
iap-ua_nra_tas1hr     IAP-UA Reprocessed NRA 1-hr Surface Temperature


Additional information on the above datasets is available:
CREATE-IP
CREATE-IP Datasets
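
The GetCapabilities request for collections shown above can also be assembled in Python rather than typed by hand. The following is a minimal sketch using only the standard library; the helper name capabilities_url is hypothetical and is not part of the EDAS client. Omitting the identifier gives the plain GetCapabilities request.

```python
from urllib.parse import urlencode

# WPS endpoint quoted on this page.
EDAS_WPS_URL = "https://edas.nccs.nasa.gov/wps/cwt"


def capabilities_url(identifier=None, base_url=EDAS_WPS_URL):
    """Build a WPS GetCapabilities URL.

    Hypothetical helper for illustration only; it is not part of the
    EDAS client. It only builds the URL string and sends nothing.
    """
    params = {"request": "GetCapabilities"}
    if identifier is not None:
        # "coll" is the identifier this page uses for the collection list.
        params["identifier"] = identifier
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(capabilities_url("coll"))
    print(capabilities_url())
```

The resulting string can be pasted into a browser or passed to any HTTP client.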

The NCCS has made the following initial set of operations available through EDAS.  This list will grow substantially as the service matures and resources become available.

Run this WPS GetCapabilities call to get a dynamic list of operations: https://edas.nccs.nasa.gov/wps/cwt?request=GetCapabilities

Operation Type: Min
Description: Computes minimum element value from input variable data over specified axes and ROI
EDAS Kernel Name or Workflow: CDSpark.min

Operation Type: Max
Description: Computes maximum element value from input variable data over specified axes and ROI
EDAS Kernel Name or Workflow: CDSpark.max

Operation Type: Sum
Description: Computes the sum of the array elements along the given axes (∑n)
EDAS Kernel Name or Workflow: CDSpark.sum

Operation Type: Diff
Description: Computes element-wise diffs for a pair of input variables over specified ROI
EDAS Kernel Name or Workflow: CDSpark.eDiff

Operation Type: Average
Description: Computes (weighted) means of element values from input variable data over specified axes and ROI (∑n/n)
EDAS Kernel Name or Workflow: CDSpark.ave

Operation Type: RMS
Description: Computes root mean square of input variable over specified axes and ROI
EDAS Kernel Name or Workflow: CDSpark.rms

Operation Type: Anomaly
Description: Computes an anomaly of the input variable data
EDAS Kernel Name or Workflow: Workflow: CDSpark.ave + CDSpark.eDiff

Operation Type: Standard Deviation
Description: Quantification of the amount of variation of the data in the given range (√(∑(x - x_ave)²/n))
EDAS Kernel Name or Workflow: Workflow: CDSpark.ave + CDSpark.eDiff + CDSpark.rms

Workflows are simply a combination of kernels that are run in succession.  See the Example Code section below.
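
To make the idea of chaining kernels concrete, here is a small local sketch of the arithmetic behind the Anomaly and Standard Deviation workflows. It is plain Python run on the user's machine, not EDAS code: the function names are illustrative stand-ins for CDSpark.ave, CDSpark.eDiff, and CDSpark.rms, and the mean here is unweighted, whereas the EDAS average may be weighted.

```python
import math


def ave(values):
    """Unweighted mean of the values (stand-in for CDSpark.ave)."""
    return sum(values) / len(values)


def e_diff(first, second):
    """Element-wise difference of two equal-length sequences
    (stand-in for CDSpark.eDiff)."""
    return [a - b for a, b in zip(first, second)]


def rms(values):
    """Root mean square of the values (stand-in for CDSpark.rms)."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def anomaly(values):
    """Workflow: ave, then eDiff against the mean."""
    mean = ave(values)
    return e_diff(values, [mean] * len(values))


def standard_deviation(values):
    """Workflow: ave + eDiff + rms, i.e. sqrt(sum((x - x_ave)^2) / n)."""
    return rms(anomaly(values))


if __name__ == "__main__":
    sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(anomaly(sample))
    print(standard_deviation(sample))
```

Each workflow simply feeds the output of one step into the next, which is the same pattern the kernel combinations listed above describe.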

The EDAS client provides some rudimentary plotting and printing routines:
mpl_timeplot:  Plot a time series.
mpl_spaceplot:  Plot a lat/long image.
print_Mdata:  Print the resultant metadata.

The output files are downloaded to the client's system and placed in /tmp.

You will get an output message that gives you the name of the file that has been downloaded and the location of the same file on our THREDDS server.  Additional OPeNDAP functionality may be available in future releases.  The message will contain the word "HREFS".

Demo Scripts

Hints for easier usage:
Time ranges can be specified either by indices, wherein each increment represents a timestep:

'time': {'start': 0, 'end': 100, 'crs': 'indices'}

or by dates:

"time": {"start": "1980-01-01T00:00:00", "end": "1980-12-31T23:00:00", "crs": "timestamps"}

Start and end times must go from earlier to later.

Lat and long values must go from smaller to larger.
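
The ordering rules above are easy to check before a request is sent. The sketch below builds the two time-range forms shown earlier and rejects a reversed range. The helper make_axis is hypothetical and not part of the EDAS client; it assumes timestamps use the ISO-style format shown in the hint.

```python
from datetime import datetime


def make_axis(start, end, crs):
    """Return an axis range dict after checking that start <= end.

    Hypothetical helper for illustration; not part of the EDAS client.
    """
    if crs == "timestamps":
        # Parse the ISO-style strings so dates are compared as dates.
        ordered = datetime.fromisoformat(start) <= datetime.fromisoformat(end)
    else:
        # Index or coordinate ranges are compared as plain numbers.
        ordered = start <= end
    if not ordered:
        raise ValueError(f"start ({start!r}) must not be after end ({end!r})")
    return {"start": start, "end": end, "crs": crs}


# Time range by index: each increment is one timestep.
domain_by_index = {"time": make_axis(0, 100, "indices")}

# Time range by date.
domain_by_date = {
    "time": make_axis("1980-01-01T00:00:00", "1980-12-31T23:00:00", "timestamps")
}

if __name__ == "__main__":
    print(domain_by_index)
    print(domain_by_date)
```

The same numeric check applies to latitude and longitude ranges, which must also run from the smaller value to the larger one.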

Version 1.0 of the EDAS API is now available. Version 1.1 will have some additional features and data collections and will be available Dec 11, 2017.

EDAS is currently somewhat verbose.  Jupyter Notebooks have been known to repeat EDAS messages if cells aren't cleared and the kernel restarted, making the verbosity more noticeable.

Important Informational Messages:

[2017-10-26 18:28:43,031][wps.py[execute:476]] HREFS:
This message will provide you with the location of the output file on the NCCS THREDDS server.

[2017-10-12 13:41:36,866][wps.py[download_result:416]] STATUS: QUEUED
The EDAS server is busy running a previously submitted command.  Your command has been queued and will run shortly.  If your command is queued for an excessively long period of time (10 minutes or more), please contact the NCCS at support@nccs.nasa.gov.  Please include "EDAS" in the subject of your email.

[2017-10-12 13:41:36,866][wps.py[download_result:416]] STATUS: EXECUTING
The EDAS server is executing your command.

[2017-10-12 13:41:37,880][wps.py[download_result:416]] STATUS: COMPLETED
Your job has completed and your results are available.
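
Because EDAS output is verbose, it can help to pick out only the STATUS and HREFS messages. The following is a rough sketch that assumes log lines follow the layout of the samples above (timestamp, source file with function and line number, then the message). The pattern and the helper name parse_message are illustrative guesses based on those samples, not a documented log format.

```python
import re

# Pattern inferred from the sample messages on this page, for example:
# [2017-10-12 13:41:36,866][wps.py[download_result:416]] STATUS: QUEUED
LOG_LINE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]"
    r"\[(?P<source>[^\[\]]+)\[(?P<function>[^:\]]+):(?P<line>\d+)\]\]"
    r"\s+(?P<key>STATUS|HREFS):\s*(?P<value>.*)"
)


def parse_message(line):
    """Return (key, value) for a STATUS or HREFS line, otherwise None.

    Hypothetical helper for illustration; not part of the EDAS client.
    """
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return match.group("key"), match.group("value").strip()


if __name__ == "__main__":
    sample = "[2017-10-12 13:41:36,866][wps.py[download_result:416]] STATUS: QUEUED"
    print(parse_message(sample))
```

Lines that do not match the pattern return None, so the helper can be used to filter a longer stream of client output.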

Syntax error messages are also displayed as needed.

Support

Email us at support@nccs.nasa.gov with the subject line: EDAS.