Parallel Analog Ensemble

How to format observations for AnEn

2019-11-18T00:00:00+00:00

Introduction
Access

Introduction

This short tutorial walks you through the steps of converting observations stored in a CSV file to an R list that has the variables required by RAnEn.

We recommend using Binder; the .Rmd files will guide you through the script line by line.

You will learn the following:

Formatting observations for RAnEn
RAnEn::writeNetCDF
RAnEn::readObservations

Access

This tutorial can be accessed on binder. Please click here to start an interactive session and go over the tutorial under RAnalogs/examples. Or you can download the repository and use the R markdown file directly.

Running Large Scale Analog Ensemble on Cheyenne

2019-04-15T00:00:00+00:00

Background
A Brief Introduction to the Problem
Workflow

Background

This showcase was originally created for the 2019 Software Engineering Assembly (now Improving Scientific Software) conference. The conference presentation covered running large-scale Analog Ensemble (AnEn) on the NCAR Cheyenne supercomputer. The hands-on workshop worked through some basic examples of RAnEn locally on a desktop and then showcased the workflow of running large-scale AnEn on Cheyenne.

Some helpful links are provided below:

This post summarizes the second part of the workshop, Analog Ensemble at Scale.

A Brief Introduction to the Problem

The goal is to generate AnEn for wind speed for July 2018 using one year of search data from 2017. Because the North American Mesoscale (NAM) model is used, this involves about 838 GB of model data, including forecasts and analyses. In total, there are 262,792 grid points in the model domain. This large domain is decomposed row-wise into 50 chunks so that we can generate AnEn for each chunk of the domain in parallel.
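The row-wise decomposition can be sketched in bash as follows. This is only an illustration, not the script used in the workshop: the grid shape of 428 rows by 614 columns (which multiplies to 262,792) is an assumption, and the helper name chunk_rows is hypothetical. Rows that do not divide evenly are spread over the first chunks.

```shell
#!/bin/bash
# Hypothetical sketch: split a grid row-wise into nearly equal chunks.
# nrows and ncols are assumptions (428 x 614 = 262,792 grid points).

# Print "<first row> <number of rows>" for a 0-based chunk index.
chunk_rows() {
    local nrows=$1 nchunks=$2 chunk=$3
    local base=$(( nrows / nchunks ))
    local extra=$(( nrows % nchunks ))
    local start count
    if (( chunk < extra )); then
        # The first "extra" chunks get one additional row
        count=$(( base + 1 ))
        start=$(( chunk * count ))
    else
        count=$base
        start=$(( extra * (base + 1) + (chunk - extra) * base ))
    fi
    echo "$start $count"
}

nrows=428
ncols=614
nchunks=50

for (( i = 0; i < nchunks; i++ )); do
    read -r rowStart rowCount <<< "$(chunk_rows "$nrows" "$nchunks" "$i")"
    echo "chunk $i: first grid point $(( rowStart * ncols )), number of grid points $(( rowCount * ncols ))"
done
```

Each printed pair (first grid point, number of grid points) is the kind of information a per-chunk configuration file would need to tell a reader which part of the file to load.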

Please find the scripts used in this post here.

Please refer to the help page on building AnEn on Cheyenne if you would like to check out the tools used in this tutorial.

Workflow

In this workshop, once you have built or accessed the tools and collected the data, AnEn generation and visualization involve three steps:

Step 1: Generate AnEn for each domain chunk. Each domain chunk is associated with a job, and a set of configuration files. Each configuration file specifies to analogGenerator which part of the file should be read. A general configuration file is also used to specify some of the common parameters that are shared across all domain chunks like weights and observation id.
Step 2: Reshape AnEn results from all days per chunk to all chunks per day. Verification and visualization are generally more convenient when all grid points are included in the same file. This is achieved by reorganizing the data files from separate chunks into files that each cover the complete domain for one day.
Step 3: Visualize AnEn results. At this point, each NetCDF file should have a daily forecast for the entire model domain which is easy to visualize. An R script is prepared to generate the figures.

This video shows what the entire workflow looks like. Click here if the video is not showing up correctly.

Please contact Weiming Hu if you would like a copy of the script.

Thanks.

How to Automate Data Preprocessing for AnEn Computation on Cheyenne

2019-03-01T00:00:00+00:00

Introduction
Data Preparation
Scripts
Results
Appendix

Introduction

This tutorial shows how to use the data preprocessing tools (gribConverter, windFieldCalculator) in the AnEn package to reformat the data into the form that can be used directly by the computation tools (similarityCalculator, analogGenerator, RAnEn) to generate analog ensembles. In addition, this tutorial shows how to automate and parallelize the process on the Cheyenne supercomputer.

This tutorial assumes basic knowledge of bash scripting and that the AnEn package has already been installed successfully. More information on how to install AnEn on Cheyenne can be found here.

This tutorial also assumes that you have already built the AnEn tools. Instructions for building the tools can be found here.

Data Preparation and Goals

A large collection (~5.4 TB) of data from the North American Mesoscale (NAM) forecast model has been downloaded for the period from October 2008 to July 2018. The original files are .g2.tar files; an example file name is nam_218_2008102900.g2.tar. Files have already been arranged into one folder per YearMonth as follows:

> ls
200810  200907  201004  201101  201110  201207  201304  201401  201410  201507  201604  201701  201710  201807
200811  200908  201005  201102  201111  201208  201305  201402  201411  201508  201605  201702  201711  
200812  200909  201006  201103  201112  201209  201306  201403  201412  201509  201606  201703  201712  
200901  200910  201007  201104  201201  201210  201307  201404  201501  201510  201607  201704  201801  
200902  200911  201008  201105  201202  201211  201308  201405  201502  201511  201608  201705  201802
200903  200912  201009  201106  201203  201212  201309  201406  201503  201512  201609  201706  201803
200904  201001  201010  201107  201204  201301  201310  201407  201504  201601  201610  201707  201804
200905  201002  201011  201108  201205  201302  201311  201408  201505  201602  201611  201708  201805
200906  201003  201012  201109  201206  201303  201312  201409  201506  201603  201612  201709  201806

Original NAM forecast files are organized by day, cycle time, and lead time. Each file is a compilation of parameters at all available locations/grid points. However, AnEn requires the data in a different format, in which a single file includes the parameter, grid point, time, and lead time information. Our goal is to convert the model output to this format.

Since the total file size exceeds 5 TB, it is better practice to break the data into chunks than to create one huge file. Therefore, the files are grouped by month.

Scripts

I have prepared two scripts. The first script is the resource PBS script. This script does the following tasks:

Identifies which folders are currently being processed by searching for a lock file, and which folders have already been processed by searching for the expected output data file;
Selects only one folder that has not yet been processed;
Untars tarballs in a temporary folder;
Converts submessages to independent messages using grib_copy (reference: grib_copy example 4);
Converts grb2 files to NetCDF files;
Computes and adds wind direction and speed fields to the NetCDF file;
Exits normally.

#!/bin/bash

# The name of the task
#PBS -N process_each_month

# The project account
#PBS -A MY.PROJECT.ACCOUNT

# The time resources requested
#PBS -l walltime=10:00:00

# The queue type
#PBS -q regular

# Combine standard output and errors
#PBS -j oe                           

# The computing resources requested
#PBS -l select=1:ncpus=1:mem=109GB:ompthreads=1

# I would like to receive an email when tasks
# (a)bort, (b)egin, and (e)nd.
#
#PBS -m abe

# And this is the email
#PBS -M my.email@server.com

# These are the available folders. The folder names are also going to be the names of NetCDF files.
declare -a arr=("200810" "200811" "200812" "200901" "200902" "200903" "200904" "200905" "200906" "200907" "200908" "200909" "200910" "200911" "200912" "201001" "201002" "201003" "201004" "201005" "201006" "201007" "201008" "201009" "201010" "201011" "201012" "201101" "201102" "201103" "201104" "201105" "201106" "201107" "201108" "201109" "201110" "201111" "201112" "201201" "201202" "201203" "201204" "201205" "201206" "201207" "201208" "201209" "201210" "201211" "201212" "201301" "201302" "201303" "201304" "201305" "201306" "201307" "201308" "201309" "201310" "201311" "201312" "201401" "201402" "201403" "201404" "201405" "201406" "201407" "201408" "201409" "201410" "201411" "201412" "201501" "201502" "201503" "201504" "201505" "201506" "201507" "201508" "201509" "201510" "201511" "201512" "201601" "201602" "201603" "201604" "201605" "201606" "201607" "201608" "201609" "201610" "201611" "201612" "201701" "201702" "201703" "201704" "201705" "201706" "201707" "201708" "201709" "201710" "201711" "201712" "201801" "201802" "201803" "201804" "201805" "201806" "201807")

# Define the configuration file for gribConverter.
# The file can be found at 
# https://github.com/Weiming-Hu/AnalogsEnsemble/blob/master/apps/app_gribConverter/example/commonConfig.cfg
#
converterConfig=/glade/u/home/wuh20/scratch/data/forecasts/forecasts.cfg

# Define the output destination
destDir=/glade/u/home/wuh20/flash/forecasts_new/

# Define the lock file name
lockFile=.lock

for month in "${arr[@]}"; do
    # This is the data folder
    monthDir=/glade/u/home/wuh20/scratch/data/forecasts/$month
    
    # Whether this folder has already been processed
    if [ -f $destDir/$month\.nc  ]; then
        echo Month $month has been processed. Skip this month.
        continue
    fi
    
    # Check whether this directory exists
    if [ ! -d $monthDir  ]; then
        echo Directory not found: $monthDir
        exit 1
    fi
    
    cd $monthDir
    
    # Lock this directory
    if [ -f $lockFile  ]; then
        echo Directory $monthDir is in process. Skip this directory.
        continue
    else
        echo Lock directory $monthDir
        touch $lockFile
    fi
    
    # Create a folder to store the original files
    if [ ! -d original-extract-files ]; then
        echo Create folder for original extract files ...
        mkdir original-extract-files
    fi
    
    # Unpack tar files
    echo Extracting from tar files ...
    if [ -f log_extract  ]; then
        rm log_extract
    fi

    for tarFile in *.g2.tar; do
        tar --skip-old-files -xvf $tarFile -C original-extract-files >> log_extract
    done
    
    echo flattening messages with submessages ...
    for file in `ls original-extract-files`; do
        if [ ! -f $file ]; then
            /glade/u/home/wuh20/github/AnalogsEnsemble/dependency/install/bin/grib_copy original-extract-files/$file $file
        fi
    done
    
    # Convert grb2 files
    echo Converting grb2 files ...
    if [ ! -f $month-original.nc ]; then
        /glade/u/home/wuh20/github/AnalogsEnsemble/output/bin/gribConverter -c $converterConfig --folder ./ -o $month-original.nc -v 3 > log_converter
    fi

    # Add wind fields
    echo Adding wind fields ...
    if [ ! -f $month\.nc ]; then
        /glade/u/home/wuh20/github/AnalogsEnsemble/output/bin/windFieldCalculator --file-in $month-original.nc --file-type Forecasts --file-out $month\.nc -U 1000IsobaricInhPaU -V 1000IsobaricInhPaV --dir-name 1000IsobaricInhPaDir --speed-name 1000IsobaricInhPaSpeed -v 3 > log_wind
    fi

    # Move the data file elsewhere
    echo Moving data to $destDir
    mv $month\.nc $destDir
    
    # Cleaning
    echo Cleaning ...
    rm -rf original-extract-files
    rm $month-original.nc
    rm *.grb2
    
    echo Releasing the folder lock
    rm $lockFile
   
    # Each job only processes one month
    echo Finished processing month $month
    exit 0
done
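
For readers unfamiliar with the wind fields added in the script above, the arithmetic behind a speed and direction computed from the U and V components can be sketched as follows. This is a minimal illustration and not the windFieldCalculator implementation: the helper names are hypothetical, and the direction convention (meteorological, in degrees clockwise from north, giving the direction the wind blows from) is an assumption that may differ from the convention the tool uses.

```shell
#!/bin/bash
# Hypothetical sketch of deriving wind speed and direction from U and V.
# Assumption: meteorological convention (direction the wind blows FROM).

# Speed is the magnitude of the (u, v) vector.
wind_speed() {
    awk -v u="$1" -v v="$2" 'BEGIN { printf "%.2f\n", sqrt(u * u + v * v) }'
}

# Direction in degrees within [0, 360).
wind_direction() {
    awk -v u="$1" -v v="$2" 'BEGIN {
        pi = atan2(0, -1)
        d = 270 - atan2(v, u) * 180 / pi
        d = d % 360
        if (d < 0) d += 360
        # Avoid printing 360.0 because of rounding
        if (d >= 359.95) d = 0
        printf "%.1f\n", d
    }'
}

# Example values standing in for 1000IsobaricInhPaU and 1000IsobaricInhPaV
u=3
v=4
echo "speed: $(wind_speed "$u" "$v")"
echo "direction: $(wind_direction "$u" "$v")"
```

Direction is a circular parameter (0 and 360 degrees are the same), which is why circular parameters are tracked separately in the NetCDF files.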

The first PBS script defines essentially all the tasks that should be done. However, the tasks for each month are entirely independent of each other and can be fully parallelized. Therefore, I decided that each job processes only one folder, instead of continuing to the next available folder, to avoid confusion between jobs. The following script simply deals with submitting the jobs in batch to the Cheyenne scheduler.

There is another problem: if two jobs start simultaneously, there is a slight possibility that they will process the same folder, because the folder lock mechanism checks for the lock file and creates it in two separate steps and so might not work. A simple workaround is to submit a new job only when there are no queued jobs, meaning that all submitted jobs have already started.

#!/bin/bash

# Define the total number of jobs to create.
totalJobs=118

# Define the counter start.
submittedJobs=0

while true; do
    # Get the number of queued jobs by searching the queue status for jobs in state Q in the regular queue
    number=`qstat | grep "Q regular" | wc -l`
    
    echo The number of queued jobs: $number
    echo The number of submitted jobs: $submittedJobs
    if (( number == 0  )); then
        echo There are no queued jobs. Submit a new one.
        qsub batch_process.pbs
        submittedJobs=$((submittedJobs + 1))
        if (( submittedJobs == totalJobs  )); then
            echo $submittedJobs jobs submitted. Done!
            exit 0
        fi
    fi
    sleep 10
done
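
The race between two jobs that start at the same time could also be narrowed at the locking step itself. The sketch below is an alternative idea and not what the scripts above do: it uses mkdir, which creates the directory and fails if it already exists in a single operation, so the check and the lock are not two separate steps. The function names and the .lock.d directory name are hypothetical, and whether mkdir is reliably atomic on a given parallel file system is an assumption you should verify.

```shell
#!/bin/bash
# Hypothetical sketch: use mkdir as a test-and-set style lock.
# The original script uses "[ -f .lock ]" followed by "touch .lock".

# Succeeds only for the first caller; fails if the lock already exists.
acquire_lock() {
    mkdir "$1" 2>/dev/null
}

# Remove the lock directory when processing is finished.
release_lock() {
    rmdir "$1"
}

# Demonstration in a temporary folder
lockDir=$(mktemp -d)/.lock.d

if acquire_lock "$lockDir"; then
    echo "Lock acquired: $lockDir"
else
    echo "Directory is in process. Skip this directory."
fi

# A second attempt is refused while the lock is held
if ! acquire_lock "$lockDir"; then
    echo "Second attempt refused while the lock is held"
fi

release_lock "$lockDir"
```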

Results

Once the scripts have completed, the destination folder contains the following files:

> ls
200810.nc  200910.nc  201010.nc  201110.nc  201210.nc  201310.nc  201410.nc  201510.nc	201610.nc  201710.nc
200811.nc  200911.nc  201011.nc  201111.nc  201211.nc  201311.nc  201411.nc  201511.nc	201611.nc  201711.nc
200812.nc  200912.nc  201012.nc  201112.nc  201212.nc  201312.nc  201412.nc  201512.nc	201612.nc  201712.nc
200901.nc  201001.nc  201101.nc  201201.nc  201301.nc  201401.nc  201501.nc  201601.nc	201701.nc  201801.nc
200902.nc  201002.nc  201102.nc  201202.nc  201302.nc  201402.nc  201502.nc  201602.nc	201702.nc  201802.nc
200903.nc  201003.nc  201103.nc  201203.nc  201303.nc  201403.nc  201503.nc  201603.nc	201703.nc  201803.nc
200904.nc  201004.nc  201104.nc  201204.nc  201304.nc  201404.nc  201504.nc  201604.nc	201704.nc  201804.nc
200905.nc  201005.nc  201105.nc  201205.nc  201305.nc  201405.nc  201505.nc  201605.nc	201705.nc  201805.nc
200906.nc  201006.nc  201106.nc  201206.nc  201306.nc  201406.nc  201506.nc  201606.nc	201706.nc  201806.nc
200907.nc  201007.nc  201107.nc  201207.nc  201307.nc  201407.nc  201507.nc  201607.nc	201707.nc  201807.nc
200908.nc  201008.nc  201108.nc  201208.nc  201308.nc  201408.nc  201508.nc  201608.nc	201708.nc
200909.nc  201009.nc  201109.nc  201209.nc  201309.nc  201409.nc  201509.nc  201609.nc	201709.nc

And each file has the correct format for AnEn computation.

> ncdump -h 201801.nc 
netcdf \201801 {
dimensions:
	num_parameters = 17 ;
	num_chars = 50 ;
	num_stations = 262792 ;
	num_times = 31 ;
	num_flts = 53 ;
variables:
	char ParameterNames(num_parameters, num_chars) ;
	double ParameterWeights(num_parameters) ;
	char ParameterCirculars(num_parameters, num_chars) ;
	char StationNames(num_stations, num_chars) ;
	double Xs(num_stations) ;
	double Ys(num_stations) ;
	double Times(num_times) ;
	double FLTs(num_flts) ;
	double Data(num_flts, num_times, num_stations, num_parameters) ;
}

Building AnEn on NCAR Cheyenne

2019-02-17T00:00:00+00:00

Introduction
Building AnEn

Introduction

This short tutorial walks you through the steps of building the AnEn C++ program on NCAR Cheyenne Supercomputers.

Building AnEn

Several things to be noted before we carry on:

Most of the dependencies are already available on Cheyenne, so I’m going to load them directly. Boost, however, is not available, so I tell cmake to build it for me.
I will be installing PAnEn into a user space folder after a successful build. You can change the argument CMAKE_INSTALL_PREFIX to install it elsewhere.
Notice the argument CMAKE_INSTALL_RPATH. This is needed because the modules are not in the system path. When we install programs, cmake by default removes the build-time run path, so we need to specify the run-time path that tells the installed executable where to look for libraries.

# Download the source files
wget https://github.com/Weiming-Hu/AnalogsEnsemble/archive/master.zip

# Unzip the archive
unzip master.zip

# Go to the source folder (named after the repository, AnalogsEnsemble)
cd AnalogsEnsemble-master

# Clean modules
module purge

# Load required modules
module load gnu/9.1.0 netcdf/4.7.3 ncarenv/1.3 cmake/3.16.4 eccodes/2.12.5

# Carry out an out-of-tree build
mkdir build
cd build

# Generate build system
cmake -DCMAKE_INSTALL_PREFIX=../../release -DBUILD_BOOST=ON -DCMAKE_PREFIX_PATH="$NCAR_ROOT_ECCODES;$NETCDF" -DCMAKE_INSTALL_RPATH="$NCAR_ROOT_ECCODES/lib;$NETCDF/lib" ..

# Build
make -j 16

# Test
make test

# Install
make install

# Show help message
cd ../../release/bin
./anen

If you log out and log back in, you need to at least load the GNU module for anen to work.

module load gnu/9.1.0

If you encountered any problems, please open a ticket here.

Operational Search with RAnEn

2019-02-12T00:00:00+00:00

Introduction
Access

Introduction

Prediction accuracy of the Analog Ensemble depends on the quality of analogs. Presumably, better analogs will generate better predictions. In an operational model, it is likely that the historical forecasts from the recent past are the most similar to the current forecast. Therefore, in operational mode, each day's forecast is added to the historical repository as that day passes.

This article shows an example of how to use RAnEn with an operational search. It is strongly suggested that you go over demo 1 before this tutorial.

Access

NetCDF File Types and Variables for Analog Ensemble Applications

2019-01-16T00:00:00+00:00

Updates on 2021/12/16

I used R to generate the file format messages below. If you are using ncdump or Python, the dimension order is reversed. For example, Data would be [num_flts, num_times, num_stations, num_parameters].
For character-related variables, like ParameterNames and StationNames, there are two storage options. They can be stored as a character matrix, as shown below, or they can be stored as a string vector. In that case, the format would be string StationNames(num_stations).
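
To translate between the two conventions, it is enough to reverse the dimension list. The helper below is a small sketch with a hypothetical name, reverse_dims; it only manipulates the text of a dimension list and does not read any NetCDF file.

```shell
#!/bin/bash
# Hypothetical helper: reverse a comma-separated dimension list, for example
# from the order printed by R (ncdf4) to the order printed by ncdump.

reverse_dims() {
    local -a dims
    IFS=',' read -r -a dims <<< "$1"
    local out="" i
    for (( i = ${#dims[@]} - 1; i >= 0; i-- )); do
        # Add a comma only between entries
        out+="${out:+,}${dims[i]}"
    done
    echo "$out"
}

# Data as listed by R in the Forecasts file type
reverse_dims "num_parameters,num_stations,num_times,num_flts"
```

The reversed list matches the order that ncdump -h reports for the same variable, with num_flts first and num_parameters last.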

Under the apps directory, there are several C++ programs that implement different phases of generating analog ensembles, including calculating standard deviations, calculating similarity metrics, and selecting analog forecasts, as well as some other programs for data pre-processing. Currently, all input and output files are in NetCDF format. This article documents the variables and dimensions expected in each file type, for example, Forecasts, Observations, Similarity, and so on.

File Types

The defined file types include:

Forecasts
Observations
Analogs
Similarity
StandardDeviation
Matrix

Each file type is associated with a list of expected dimensions and a list of expected variables. Those variables and dimensions are required to ensure the correctness and performance of the C++ programs. Some variables can also be very helpful during visualization.

Forecasts

An example Forecasts file includes the following content:

9 variables (excluding dimension variables):
   char ParameterNames[num_chars,num_parameters]   (Contiguous storage)  
   double ParameterWeights[num_parameters]   (Contiguous storage)  
   char ParameterCirculars[num_chars,num_parameters]   (Contiguous storage)  
   char StationNames[num_chars,num_stations]   (Contiguous storage)  
   double Xs[num_stations]   (Contiguous storage)  
   double Ys[num_stations]   (Contiguous storage)  
   double Times[num_times]   (Contiguous storage)  
   double FLTs[num_flts]   (Contiguous storage)  
   double Data[num_parameters,num_stations,num_times,num_flts]   (Contiguous storage)  

5 dimensions:
   num_parameters  Size:17
   num_chars  Size:50
   num_stations  Size:262792
   num_times  Size:31
   num_flts  Size:53

ParameterNames are the names of the parameters in the forecasts.
ParameterWeights are the corresponding weights for each parameter in the forecasts, to be used when computing forecast similarity.
ParameterCirculars are the names of the circular parameters.
StationNames are the names of the forecast stations or grid points.
Xs are the x coordinates of the forecast stations or grid points.
Ys are the y coordinates of the forecast stations or grid points.
Times are the forecast times, expressed as the number of seconds since the origin, 1970-01-01 00:00:00 UTC by default.
FLTs are the forecast lead times, expressed as the number of seconds since the initialization of the forecast model.
Data is a 4-dimensional array that stores the actual forecast values.
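
As a quick illustration of these time representations, the sketch below converts a value from Times to a readable UTC string and adds a lead time from FLTs to get the valid time of a forecast. It assumes GNU date (the -d @SECONDS form) is available and that the default origin of 1970-01-01 00:00:00 UTC is in use; the function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for reading Times and FLTs values.
# Assumption: GNU date is available (it accepts -d @SECONDS).

# Convert seconds since 1970-01-01 00:00:00 UTC to a readable string.
to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

# Valid time = forecast initialization time + forecast lead time.
valid_time() {
    local init=$1 flt=$2
    to_utc $(( init + flt ))
}

# A forecast initialized at 1514764800 with a lead time of 10800 seconds (3 hours)
to_utc 1514764800
valid_time 1514764800 10800
```

The same arithmetic links a forecast time and lead time to the corresponding observation time, which is what the time mapping in the Matrix file type records.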

Observations

An example Observations file looks very similar to a Forecasts file, except that the variable Data is a 3-dimensional array without forecast lead times.

8 variables (excluding dimension variables):
   char ParameterNames[num_chars,num_parameters]   (Contiguous storage)  
   double ParameterWeights[num_parameters]   (Contiguous storage)  
   char ParameterCirculars[num_chars,num_parameters]   (Contiguous storage)  
   char StationNames[num_chars,num_stations]   (Contiguous storage)  
   double Xs[num_stations]   (Contiguous storage)  
   double Ys[num_stations]   (Contiguous storage)  
   double Times[num_times]   (Contiguous storage)  
   double Data[num_parameters,num_stations,num_times]   (Contiguous storage)  

4 dimensions:
   num_parameters  Size:15
   num_chars  Size:50
   num_stations  Size:262792
   num_times  Size:496

Analogs

An example Analogs file includes the following content:

10 variables (excluding dimension variables):
    double Analogs[num_stations,num_times,num_flts,num_members,num_cols]   (Contiguous storage)  
    char StationNames[num_chars,num_stations]   (Contiguous storage)  
    double Xs[num_stations]   (Contiguous storage)  
    double Ys[num_stations]   (Contiguous storage)  
    double Times[num_times]   (Contiguous storage)  
    double FLTs[num_flts]   (Contiguous storage)  
    char MemberStationNames[num_chars,member_num_stations]   (Contiguous storage)  
    double MemberXs[member_num_stations]   (Contiguous storage)  
    double MemberYs[member_num_stations]   (Contiguous storage)  
    double MemberTimes[member_num_times]   (Contiguous storage)  

8 dimensions:
    num_stations  Size:10
    num_times  Size:100
    num_flts  Size:10
    num_members  Size:5
    num_cols  Size:3
    num_chars  Size:50
    member_num_stations  Size:10
    member_num_times  Size:1000

Analogs is a 5-dimensional array that stores analog forecasts. More information about analogs can be found here.
FLTs is the time representation of the analog forecasts. It is the number of seconds since the initialization of the forecast model.
StationNames are the names of stations for analog forecasts.
Xs are the x coordinates of stations for analog forecasts.
Ys are the y coordinates of stations for analog forecasts.
Times is the time representation of the analog forecasts. It is the number of seconds since the origin, 1970-01-01 00:00:00 UTC by default.
MemberStationNames are the names of stations for analog members. This can be used together with the search station index in the fifth dimension to get the exact details of search station used.
MemberXs are the x coordinates of stations for analog members. This can be used together with the search station index in the fifth dimension to get the exact details of search station used.
MemberYs are the y coordinates of stations for analog members. This can be used together with the search station index in the fifth dimension to get the exact details of search station used.
MemberTimes is the time representation of the search times. This can be used together with the search time index in the fifth dimension to know what historical time this member belongs to.

Similarity

An example Similarity file includes the following content:

13 variables (excluding dimension variables):
    double SimilarityMatrices[num_cols,num_entries,num_flts,num_times,num_stations]   (Contiguous storage)  
    char ParameterNames[num_chars,num_parameters]   (Contiguous storage)  
    double ParameterWeights[num_parameters]   (Contiguous storage)  
    char ParameterCirculars[num_chars,num_parameters]   (Contiguous storage)  
    char StationNames[num_chars,num_stations]   (Contiguous storage)  
    double Xs[num_stations]   (Contiguous storage)  
    double Ys[num_stations]   (Contiguous storage)  
    double Times[num_times]   (Contiguous storage)  
    double FLTs[num_flts]   (Contiguous storage)  
    char SearchStationNames[num_chars,search_num_stations]   (Contiguous storage)  
    double SearchXs[search_num_stations]   (Contiguous storage)  
    double SearchYs[search_num_stations]   (Contiguous storage)  
    double SearchTimes[search_num_times]   (Contiguous storage)  

9 dimensions:
    num_stations  Size:10
    num_times  Size:100
    num_flts  Size:10
    num_entries  Size:100
    num_cols  Size:3
    num_parameters  Size:10
    num_chars  Size:50
    search_num_stations  Size:10
    search_num_times  Size:100

SimilarityMatrices is a 5-dimensional array that stores similarity metric values.
ParameterNames are names of parameters used to calculate the similarity.
ParameterWeights are weights of parameters used to calculate the similarity.
ParameterCirculars are names of circular parameters.
StationNames are names of stations or grid points for which similarity is generated.
Xs are x coordinates of stations or grid points for which similarity is generated.
Ys are y coordinates of stations or grid points for which similarity is generated.
Times is the time representation of the similarity. It is the number of seconds since the origin, 1970-01-01 00:00:00 UTC by default.
FLTs is the time representation of the similarity. It is the number of seconds since the initialization of the forecast model.
SearchTimes are times for the complete search period. This can be used together with the search time index in the fifth dimension to know what historical forecast this similarity is generated from.
SearchStationNames are stations names for the complete search data. This can be used together with the search station index in the fifth dimension to know what station/grid point is used to generate similarity.
SearchXs are x coordinates for the complete search stations. This can be used together with the search station index in the fifth dimension to know what station/grid point is used to generate similarity.
SearchYs are y coordinates for the complete search stations. This can be used together with the search station index in the fifth dimension to know what station/grid point is used to generate similarity.

StandardDeviation

An example StandardDeviation file includes the following content:

8 variables (excluding dimension variables):
    double StandardDeviation[num_parameters,num_stations,num_flts]   (Contiguous storage)  
    char ParameterNames[num_chars,num_parameters]   (Contiguous storage)  
    double ParameterWeights[num_parameters]   (Contiguous storage)  
    char ParameterCirculars[num_chars,num_parameters]   (Contiguous storage)  
    char StationNames[num_chars,num_stations]   (Contiguous storage)  
    double Xs[num_stations]   (Contiguous storage)  
    double Ys[num_stations]   (Contiguous storage)  
    double FLTs[num_flts]   (Contiguous storage)  

4 dimensions:
    num_parameters  Size:10
    num_stations  Size:10
    num_flts  Size:10
    num_chars  Size:50

StandardDeviation is a 3-dimensional array that stores standard deviation values.
ParameterNames are the names of parameters.
ParameterWeights are the weights of parameters.
ParameterCirculars are the names of circular parameters.
StationNames are the names of stations or grid points.
Xs are the x coordinates of stations or grid points.
Ys are the y coordinates of stations or grid points.
FLTs are the forecast lead times.

Matrix

File type Matrix is designed for the time mapping matrix between forecast times/forecast lead times and observation times. It is usually in text file format.

References

All the example output is generated using R package ncdf4.

Profile AnEn

2019-01-08T00:00:00+00:00

Introduction
Result Preview
Preparation and Clarification
Profiling with TAU
Profiling with gprof
Profiling with valgrind
Sequel on TAU Installation

Introduction

This file documents the process of profiling the Analog Ensemble weather forecasting technique.

Result Preview

These figures are generated using TAU profiler and the visualization tools paraprof.

The following figure is generated from gprof.

Preparation and Clarification

Please note a couple of placeholders in this tutorial. It is recommended to use the absolute full path to replace them.

[Allocation Name] is the project name you are attached to. It shows up every time when you log onto ICS.
[Analog Ensemble Source Dir] is the root directory of Analog Ensemble programs. You can download it from Github.
[TAU Source Dir] is the folder all TAU source files are extracted to. You can download TAU here.
[Profile Data Dir] is the folder with profile data and a configuration file. Please generate the profile data using the R script generateAnEnInput.R by running Rscript generateAnEnInput.R in a console. The R package ncdf4 is required. The configuration file is config.cfg.

Profiling with TAU

Build with `TAU`

As with gprof, we need to build the program with the TAU compilers. Please install TAU first; here, I assume that TAU is already available. If you are wondering how to install TAU, please jump to the last section.

# Build AnEn programs
cd [Analog Ensemble Source Dir]
mkdir build && cd build

# Generate the make system. We are installing to a specific location to avoid any program clashing
CC=taucc CXX=taucxx cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=../release_tau ..

# Sometimes, TAU might not be able to find some packages. So you might need to add -DCMAKE_PREFIX_PATH to guide tau compilers
CC=taucc CXX=taucxx cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=../release_tau -DCMAKE_PREFIX_PATH=/usr/lib/x86_64-linux-gnu .. 

# Build
make -j 2

# Test
make test

# Install
make install

Profiling

To collect profile data, run the program normally. Use exactly the same command as for gprof.

cd [Profile Data Dir]
OMP_NUM_THREADS=1 [Analog Ensemble Source Dir]/release_tau/bin/anen_grib -c config.cfg

Visualization

Profile files have names like profile.0.0.*. We can use the following tools to visualize the results.

# For text visualization
pprof

# For graphic visualization
paraprof

Profiling with `gprof`

Please note that gprof might have the highest sampling error among the three solutions here.

Build with `gprof`

To profile the program with gprof, we only need to build the program with the extra flag -pg.

# Go to our root directory and carry out an out-of-tree build
cd [Analog Ensemble Source Dir]
mkdir build && cd build

# Generate the make system. We are installing to a specific location to avoid any program clashing
cmake -DCMAKE_CXX_FLAGS='-pg' -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=../release_gprof ..

# Build
make -j 2

# Test
make test

# Install
make install

Let’s check that the program was built successfully.

# Change the directory to the installation folder
cd ../release_gprof/bin
./anen_grib

# The following file should be automatically generated.
file gmon.out

Profiling

Run the program normally.

cd [Profile Data Dir]
OMP_NUM_THREADS=1 [Analog Ensemble Source Dir]/release_gprof/bin/anen_grib -c config.cfg

This should generate a gmon.out file.

Visualization

To visualize the gprof output, we can convert its text report to a dot graph and then to an image. I’m using the gprof2dot program, which is written in Python.

# Install the graphviz if you do not have it
sudo apt install graphviz

virtualenv env -p python3
source env/bin/activate
pip install gprof2dot

# -w for wrapping function names
# -s for stripping detailed function information to reduce texts
#
gprof [Analog Ensemble Source Dir]/release_gprof/bin/anen_grib gmon.out | gprof2dot -w -s | dot -Tpng -Gdpi=500 -o profile-gprof.png

Profiling with `valgrind`

valgrind is very accurate because it runs your program in a virtual environment, but it does introduce a lot of overhead (10x to 80x slower).

Check if you have already installed the profiler tools. To install them, you can use sudo apt install kcachegrind valgrind.

Build

No extra configurations are needed. Just build the program as you normally would.

# Go to our root directory and carry out an out-of-tree build
cd [Analog Ensemble Source Dir]
mkdir build && cd build

# Generate the make system. We install to a dedicated location to avoid clashing with other builds
cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=../release_valgrind ..

# Build
make -j 2

# Test
make test

# Install
make install

Profiling

Run the executable with valgrind.

cd [Profile Data Dir]
export OMP_NUM_THREADS=1
time valgrind --tool=callgrind [Analog Ensemble Source Dir]/release_valgrind/bin/anen_grib -c config.cfg

Visualization

Profile data files with names like callgrind.out.* should have been generated. Use kcachegrind to visualize them. If there are several, which usually means the command has been run more than once, choose the latest one.

kcachegrind [callgrind.out.* profile data file]
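
When several callgrind.out.* files have accumulated, the newest one can be picked by modification time. The snippet below is a minimal sketch and not part of the AnEn tooling: latest_callgrind is a hypothetical helper name, and it assumes the profile files sit in a single directory and keep valgrind's default naming.

```shell
# Hypothetical helper: print the most recently modified callgrind.out.* file
# in the given directory (defaults to the current directory).
# The body runs in a subshell, so its variables do not leak into the caller.
latest_callgrind() (
  newest=
  for f in "${1:-.}"/callgrind.out.*; do
    # Skip the unexpanded pattern when no file matches
    [ -e "$f" ] || continue
    # -nt ("newer than") is supported by bash, dash and ksh
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest=$f
    fi
  done
  printf '%s\n' "$newest"
)

latest=$(latest_callgrind .)
if [ -n "$latest" ]; then
  echo "Latest profile: $latest"
  # Open it in the viewer (commented out so the sketch also runs without a display):
  # kcachegrind "$latest"
else
  echo "No callgrind.out.* files found"
fi
```

If no graphical session is available, callgrind_annotate, which ships with valgrind, prints a text summary of the same file.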

Sequel on TAU Installation

I found TAU profiler to be very powerful and convenient to use. It is a piece of software from the University of Oregon. The video walks you through the installation and I followed it. There might be typos so be careful when reading and watching.

For TAU_OPTIONS, you can find the references here. At this point, I have successfully built TAU with the visualizer paraprof.

The Analog Ensemble Technique Explained

2018-12-14T00:00:00+00:00

Schematic Diagram

The following schematic diagram shows the four steps to generate a four-member ensemble forecast.

Step 1: The process starts with a current deterministic multivariate prediction and a set of historical predictions from a deterministic weather model. The multivariate prediction includes surface temperature, humidity, wind speed, and so on. The observations corresponding to each historical forecast are also collected.
Step 2: A number of historical predictions are identified based on their similarity to the current multivariate prediction. This similarity is also time-dependent, meaning that, instead of point-to-point comparison, it also compares the trend of each weather variable within a short time range.
Step 3: The corresponding observations associated with the identified historical predictions are selected.
Step 4: These observations become ensemble members in the final forecast.
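
For reference, the similarity used in Step 2 is commonly written in the AnEn literature in the form below. The notation is introduced here and is not defined elsewhere in this post: F_t is the current forecast valid at time t, A_{t'} is a historical forecast valid at time t', N_v is the number of weather variables, w_i is the weight given to variable i, \sigma_{f_i} is the standard deviation of the historical forecasts of variable i, and \tilde{t} is the half-width of the time window over which the trend is compared.

```latex
\| F_t, A_{t'} \| = \sum_{i=1}^{N_v} \frac{w_i}{\sigma_{f_i}}
  \sqrt{ \sum_{j=-\tilde{t}}^{\tilde{t}} \left( F_{i,t+j} - A_{i,t'+j} \right)^2 }
```

The inner sum over j is what makes the comparison time-dependent, and the outer sum over i is what makes it multivariate. Smaller values mean more similar forecasts.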

Simplified Example for Temperature Forecasts

Please navigate through the following slides to see the example.


Animation credited to Laura Clemente-Harding and Guido Cervone

Step 1: A deterministic model has been running for a week and a new prediction is generated from the model. Red dots are temperature observations, and black dots are model predictions.
Step 2: By comparing current and historical model predictions (black dots), the most similar past forecasts are identified.
Step 3: The observations associated with the identified past predictions are selected.
Step 4: These observations become ensemble members in the final forecast.

Of course, in reality, the similarity metric is a time-dependent and multivariate metric.
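
The four steps of the simplified example can be sketched as a small shell pipeline. This is only an illustration under strong assumptions, not how the AnEn programs are implemented: the data are made up, there is a single variable, and similarity is just the absolute difference between two forecast values with no time window.

```shell
# Step 1: hypothetical history, one line per past day as "forecast,observation"
# (temperature in degrees C), plus the current deterministic forecast.
history='20.1,19.5
25.3,24.8
21.0,20.2
30.2,29.0
20.6,21.1
18.4,17.9'
current_forecast=20.5
members=3   # ensemble size

# Step 2: compute |historical forecast - current forecast| and rank by it.
# Step 3: keep the observations paired with the closest forecasts.
# Step 4: those observations are the ensemble members.
analogs=$(printf '%s\n' "$history" |
  LC_ALL=C awk -F, -v cur="$current_forecast" '
    { d = $1 - cur; if (d < 0) d = -d; printf "%.3f,%s\n", d, $2 }' |
  LC_ALL=C sort -t, -k1,1n |
  head -n "$members" |
  cut -d, -f2)

printf '%s\n' "$analogs"
```

With this toy data the three members are 21.1, 19.5 and 20.2, the observations paired with the past forecasts 20.6, 20.1 and 21.0.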

References

Search Space Extension with RAnEn

2018-11-24T00:00:00+00:00

Introduction
Access

Introduction

This article demonstrates how to use the search space functionality within the RAnEn package. If you haven’t done so, please read the instructions for basic usage of RAnEn first. This article skips the part that has been covered in the previous article.

The classic AnEn technique searches for the most similar historical forecasts at the current location only. Therefore, only forecasts from the current station/grid point are traversed and compared. This search style is referred to as Independent Search (IS). The alternative is an extended search, referred to as Search Space Extension (SSE), in which forecasts at nearby stations/grid points are also included in the search process. As a result, the search space is significantly larger when using the search space extension.

There are currently two ways to define which nearby locations are included in the search. Users can set the number of nearest neighbors to include and/or a distance threshold. The two constraints can be used together.
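
To make the two constraints concrete, here is a minimal sketch of how they combine. It does not use the RAnEn API: the station table, the variable names, and the planar distance are all assumptions made for illustration, and the target station is counted as part of its own search space.

```shell
# Hypothetical station table: "id,x,y" in arbitrary planar units
stations='A,0,0
B,1,0
C,0,2
D,3,4
E,1,1'

target_x=0; target_y=0   # location being forecast (station A)
max_neighbors=3          # hypothetical limit on the number of nearest stations
max_distance=2.5         # hypothetical distance threshold

# Drop stations beyond the distance threshold, rank the rest by distance,
# and keep at most max_neighbors of them.
search_space=$(printf '%s\n' "$stations" |
  LC_ALL=C awk -F, -v x="$target_x" -v y="$target_y" -v dmax="$max_distance" '
    { dx = $2 - x; dy = $3 - y; d = sqrt(dx * dx + dy * dy)
      if (d <= dmax) printf "%.3f,%s\n", d, $1 }' |
  LC_ALL=C sort -t, -k1,1n |
  head -n "$max_neighbors" |
  cut -d, -f2)

printf '%s\n' "$search_space"
```

Here station D is rejected by the distance threshold, and station C passes it but is cut by the neighbor limit, so the search space is A, B and E.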

You will learn how to use these functions:

generateAnalogs

Access

Basics of RAnEn

2018-11-04T00:00:00+00:00

Introduction
Access

Introduction

This article walks you through the basic usage of the RAnEn library, using short-term surface temperature forecasts as an example. We recommend using binder; the corresponding .Rmd file will guide you through the script line by line.

You will learn how to use these functions:

generateConfiguration
generateAnalogs
verify* functions
