The FAQ applies to Terrascope and Proba-V MEP. Please note that the FAQ is in English only.

FAQ

Developer Guide

How to launch interactive applications: QGIS?

Instead of launching applications in a virtual desktop environment, it is often more convenient to start an application directly. The X2Go program explained on the previous page supports this. Just follow the same steps for setting up a connection as described there, but select 'Published applications' under 'Session type' when you set up your session.

Now click on the circle icon after launching this session as shown here:

(Screenshot: published applications)

This shows a window where you can select an application. Clicking Start will launch QGIS directly, as if it were running on your own machine, but with access to the Terrascope EO data archive.

(Screenshot: launching QGIS)

 

How to use Spark for distributed processing?

To speed up processing, it is often desirable to distribute the work over a number of machines. The easiest way to do this is to use the Apache Spark processing framework. The fastest way to get started is to read the Spark documentation: https://spark.apache.org/docs/2.3.3/quick-start.html.

The Spark version installed on our cluster is 2.3.3, so it is recommended to stick with this version. It is, however, possible to use a newer version if really needed. Spark is also installed on your virtual machine, so you can run 'spark-submit' from the command line after setting the following two environment variables:

export SPARK_MAJOR_VERSION=2
export SPARK_HOME=/usr/hdp/current/spark2-client

To run jobs on the Hadoop cluster, the 'cluster' deploy-mode has to be used, and you need to authenticate with Kerberos. For the authentication, just run 'kinit' on the command line. You will be asked to provide your password. Two other useful commands are 'klist' to show whether you have been authenticated, and 'kdestroy' to clear all authentication information. After some time, your login will expire, so you'll need to run 'kinit' again.
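For example, a minimal submission in cluster mode could look like this (a sketch only; 'myscript.py' is a placeholder for your own application, and you will usually add more options such as the resource settings described below):

kinit
spark-submit --master yarn --deploy-mode cluster myscript.py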

A Python Spark example is available which should help you get started.

Resource management

Spark jobs run on a shared processing cluster. The cluster divides the available resources among all running jobs, based on the parameters described below.

Memory

To allocate memory to your executors, there are two relevant settings:

The amount of memory available for the Spark 'Java' process: --executor-memory 1G

The amount of memory for your Python or R script: --conf spark.yarn.executor.memoryOverhead=2048

If you need more detailed tuning of the memory management inside the Java process, you can use: --conf spark.memory.fraction=0.05

Number of parallel jobs

The number of tasks that are processed in parallel can be determined dynamically by Spark. To enable this, use these parameters:

--conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true

Optionally, you can set upper or lower bounds:
--conf spark.dynamicAllocation.maxExecutors=30 --conf spark.dynamicAllocation.minExecutors=10

If you want a fixed number of executors, use:
--num-executors 10

We don't recommend this, as it reduces the ability of the cluster manager to optimally allocate resources.
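Putting things together, a submission that combines the memory settings and dynamic allocation described above could look like this (a sketch; adjust the values to your own job, and 'myscript.py' is a placeholder):

spark-submit --master yarn --deploy-mode cluster \
  --executor-memory 1G \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=30 \
  myscript.py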

Dependencies

A lot of commonly used Python dependencies are preinstalled on the cluster, but in some cases you may want to provide your own.

The first thing you need is a package containing your dependency. PySpark supports zip, egg, or whl packages. The easiest way to get such a package is by using pip:

pip download Flask==1.0.2

This will download the package, and all of its dependencies. Pip will prefer to download a wheel if one is available, but may also return a ".tar.gz" file, which you will need to repackage as zip or wheel.

To repackage a tar.gz as wheel:

tar xzvf package.tar.gz

cd package

python setup.py bdist_wheel

Note that a wheel may contain files that depend on the Python version you are using, so make sure you run this command with the right Python version (2.7 or 3.5).

Once the wheel is available, you can include it in your spark-submit command:

--py-files mypackage.whl
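A complete submission could then look like this (a sketch; the wheel and script names are placeholders):

spark-submit --master yarn --deploy-mode cluster --py-files mypackage.whl myscript.py

The package can then be imported in your script as usual, on both the driver and the executors.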

Notifications

If you want to receive a notification (e.g. an email) when the job reaches a final state (succeeded or failed), you can add a SparkListener on the SparkContext for Java or Scala jobs:

SparkContext sc = ...

sc.addSparkListener(new SparkListener() {
    ...
    @Override
    public void onApplicationEnd(SparkListenerApplicationEnd applicationEnd) {
        // send email
    }
    ...
});

You can also implement a SparkListener and specify the classname when submitting the Spark job:

spark-submit --conf spark.extraListeners=path.to.MySparkListener ...

In PySpark, this is a bit more complicated as you will need to use Py4J:

class PythonSparkListener(object):

    def onApplicationEnd(self, applicationEnd):
        # send email
        pass

    # also implement the other onXXX methods

    class Java:
        implements = ["org.apache.spark.scheduler.SparkListener"]

sc = SparkContext()
sc._gateway.start_callback_server()

listener = PythonSparkListener()
sc._jsc.sc().addSparkListener(listener)

try:
    # your Spark logic goes here
    ...
finally:
    sc._gateway.shutdown_callback_server()
    sc.stop()

In a future release of the JobControl dashboard, we will add the possibility to send an email automatically when the job reaches a final state.

 

What if I don't see a background map in the Terrascope viewer?

We have noticed that sometimes the background map does not appear in the viewer. This is caused by tracking-blocker software installed by the user. Please read this document to find out how to resolve this issue.

How to write Python scripts?

Here are some important tips when working with Python.

Recommended versions

We preinstalled and configured Python 3.6 on all user VMs and on the processing cluster. This is currently the default version. Python 3.5 support is also available on the VMs and the cluster.

To switch to the Python 3.5 environment, you should run:

scl enable rh-python35 bash

Once inside this environment, all commands will default to using Python 3.5.

To use the Python 3.6 environment, run your scripts with the python3.6 binary on your VM.
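For example, to check which interpreter you are using (illustrative only; inside the rh-python35 environment, 'python' should report a 3.5.x version):

python3.6 --version
scl enable rh-python35 bash
python --version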

 

Installing packages

Even though we already provide a number of widely used Python packages by default, you probably need to install some new ones at some point. We recommend doing this using the pip package manager. For example:

pip install --user owslib

This installs the owslib library. The '--user' argument is required to avoid needing root permissions and to ensure that the correct Python version is used. Do not use the yum package manager to install packages!
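To verify that a package was installed for the intended interpreter, you can try importing it (illustrative; assumes the package exposes a __version__ attribute):

python3.6 -c "import owslib; print(owslib.__version__)"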

More advanced options are explained here: https://proba-v-mep.esa.int/documentation/manuals/python-package-management

On the processing cluster

To run your code on the cluster, you need to make sure that all dependencies are also available there. You cannot install packages on the cluster yourself, but feel free to request the installation of a specific package.

If more freedom is needed, it is also possible to submit your dependencies together with your application code. This is simplified if you are already using a Python virtual environment for your development.

A non-exhaustive list of currently installed standard Python packages is given below.

affine, catalogclient, Cython, dask, dataclient, docker, Fiona, GDAL, geojson, geopandas, h5py, matplotlib, netCDF4, numpy, pandas, pyproj, rasterio, scipy, seaborn, sentinelhub, sentinelsat, tensorflow, xarray.

The complete list of standard installed packages can be obtained using the command 'pip3.6 freeze'.
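For example, to check whether (and in which version) a specific package is installed (illustrative):

pip3.6 freeze | grep -i rasterio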

Sample project

Two sample Python/Spark projects are available on Bitbucket; they show how to use Spark (https://spark.apache.org/) for distributed processing on the Terrascope Platform.

The basic code sample implements an (intentionally) very simple computation: for each PROBA-V tile in a given bounding box and time range, a histogram is computed. The results are then summed and printed. The computation of the histograms runs in parallel.

The project's README file includes additional information on how to run it and inspect its output.

The advanced code sample implements a more complex computation: it calculates the mean of a time series of PROBA-V tiles and outputs it as a new set of tiles. The tiles are split up into sub-tiles for increased parallelism. A mean is just an example of an operation that can be applied to a time series.

Application specific files

Auxiliary data

When your script depends on specific files, there are a few options:

Use the --files and --archives options of the spark-submit command to put local files in the working directory of your program, as explained here.
Put the files on a network drive, as explained here.
Put the files on the HDFS shared filesystem, where they can be read directly from Spark, or can be placed into your working directory using --files or --archives.

The second option is mostly recommended for larger files. The other options are quite convenient for distributing various smaller files. Spark has a caching mechanism in place to avoid unneeded file transfers across multiple similar runs.
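As an illustration, auxiliary files can be shipped along with the job like this (a sketch; the file names are placeholders, and the part after '#' in --archives is the directory name under which the archive is unpacked):

spark-submit --master yarn --deploy-mode cluster \
  --files lookup_table.csv \
  --archives auxdata.zip#auxdata \
  myscript.py

Inside the job, 'lookup_table.csv' can then be opened directly from the working directory, and the contents of the archive are available under './auxdata'.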

Compiled binaries

When using compiled code such as C++, binaries compiled on your VM will also work on the cluster. Hence you can safely follow the compilation instructions of your tool to build a binary, and then distribute it in the same way as any other auxiliary data. In your script, you may need to configure environment variables such as PATH and LD_LIBRARY_PATH to ensure that your binaries can be found and are being used.
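One way to do this is to ship the binary together with the job and point the executor environment at the working directory (a sketch; 'mytool' and 'libmydep.so' are hypothetical files, and spark.executorEnv.<NAME> sets an environment variable on the executors):

spark-submit --master yarn --deploy-mode cluster \
  --files mytool,libmydep.so \
  --conf spark.executorEnv.LD_LIBRARY_PATH=. \
  myscript.py

In the job itself, the binary can then be invoked from the working directory (you may need to restore its executable flag first).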

 

How to get data to the Terrascope platform?

Introduction

Each Terrascope VM user has its own Public and Private folder, available under /data/users/Public/<username> and /data/users/Private/<username>. There is also a link to these folders in your Terrascope VM home folder (/home/<username>).

The Public folder can be used to share data with other Terrascope users as all Terrascope users can read the data, but it is only writeable by you. The Private folder is only readable and writeable by you. These folders are accessible from the Terrascope user VMs, the Hadoop cluster, and the notebooks.

Data can be uploaded by SFTP, as described below.

Importing small amounts of data on the VM

For data volumes of a couple of gigabytes or less, the easiest way to import data to the user VM is to use an SFTP client.
For Windows, WinSCP is a good choice.
Download WinSCP (or any other SFTP client) and start it up.

Connect WinSCP to our SFTP server: filetransfer.terrascope.be.
The login credentials are the same as your VM, meaning your Terrascope username and password.

On the SFTP server you will find the folders that are shared over the entire Terrascope cluster. Your Private folder will be under /data/users/Private/<username> and your Public folder will be under /data/users/Public/<username>.
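On Linux or macOS you can use the command-line sftp client in the same way (illustrative; 'mydata.zip' is a placeholder):

sftp <username>@filetransfer.terrascope.be
sftp> put mydata.zip /data/users/Private/<username>/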
