How to run the ASKAP pipelines
==============================

Loading the pipeline module
---------------------------

The pipeline scripts are now accessed through a specific module on galaxy -
``askappipeline``. This is separate from the main ``askapsoft`` module, to
allow more flexibility in updating the pipeline scripts. To use, simply run::

  module load askappipeline

This provides access to the main pipeline executable *processASKAP.sh*.

Some parts of the pipeline make use of other modules, which are loaded at the
appropriate time. The beam footprint information is obtained by using the
*schedblock* tool in the askappy module, while beam locations are set using
*footprint* from the same module. For the case of either BETA data or
observations made with a non-standard footprint, the *footprint* tool will
not have the correct information, and the ACES tool *footprint.py* is used.
This is located in the ACES subversion repository, and is accessed either via
the **acesops** module, or (should ``USE_ACES_OPS=false``) your own location
defined by the $ACES environment variable.

Once loaded, the askappipeline module will set an environment variable
**$PIPELINEDIR**, pointing to the directory containing the scripts. It also
defines **$PIPELINE_VERSION** to be the version number of the currently-used
module.

Pipeline configuration
----------------------

The pipeline is configured with a range of input parameters, all of which
have some default value.
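For reference, these variables can be inspected directly once the module is
loaded. A minimal sketch (the exported path and version below are
placeholders standing in for what the module would set, so that the snippet
is self-contained):

```shell
#!/bin/bash
# Placeholder values standing in for what 'module load askappipeline'
# would export; the real values come from the module itself.
export PIPELINEDIR=/software/askappipeline/scripts   # placeholder path
export PIPELINE_VERSION=1.0.0                        # placeholder version

# Confirm where the scripts live and which version is in use
echo "Pipeline scripts directory: $PIPELINEDIR"
echo "Pipeline version: $PIPELINE_VERSION"
```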
The default is changed by one of the following methods:

* a configuration file via::

    processASKAP.sh -c myInputs.sh

* a template configuration plus the scheduling-block IDs::

    processASKAP.sh -s SB_SCIENCE -b SB_1934 -p SB_PB -t template.sh

* or a template configuration *and* a configuration file specific to the
  observation being processed::

    processASKAP.sh -t template.sh -c myInputs.sh

Parameters are decided by applying, in order, the pipeline defaults, the
template (if given), then the configuration file (so that parameters given in
the configuration file have precedence over the corresponding value in the
template). Scheduling block IDs given on the command line take precedence
over any given in the configuration (or template) files.

The template and configuration files are shell scripts that define
environment variables. A configuration file could look something like this::

  #!/bin/bash -l
  #
  # Example user input file for ASKAP processing.
  # Define variables here that will control the processing.
  # Do not put spaces either side of the equals signs!

  # control flags
  SUBMIT_JOBS=true
  DO_SELFCAL=false

  # scheduling blocks for calibrator & data
  SB_1934=507
  SB_SCIENCE=514

  # base names for MS and image data products
  MS_BASE_SCIENCE=B1740_10hr.ms
  IMAGE_BASE_CONT=i.b1740m517.cont

  # other imaging parameters
  NUM_PIXELS_CONT=4096
  NUM_TAYLOR_TERMS=2
  CORES_PER_NODE_CONT_IMAGING=15

This file should define enough environment variables for the scripts to run
successfully. Mandatory ones, if you are starting from scratch, are the
locations of either the SBs for the observations or the specific MSs.

It is possible for processASKAP.sh to read the template from the SB parset,
where it looks for the parameter ``common.cp.processing_template``. If this
option is used, the SBID command-line options *must* be given::

  processASKAP.sh -s SB_SCIENCE -b SB_1934

Giving the template on the command-line (with ``-t``) will override the
template provided in the SB parset.
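The layering of defaults, template, and configuration file described above
can be sketched with plain shell sourcing: each file defines environment
variables, and a later definition overrides an earlier one. The file names
and values here are illustrative only, not part of the real pipeline:

```shell
#!/bin/bash
# Sketch of the parameter layering: the files are sourced in order, so a
# later definition overrides an earlier one. All names and values are
# illustrative, not the real pipeline defaults.
workdir=$(mktemp -d)

# pipeline defaults
printf 'NUM_PIXELS_CONT=2048\nDO_SELFCAL=true\n' > "$workdir/defaults.sh"

# template overrides one default
printf 'NUM_PIXELS_CONT=4096\n' > "$workdir/template.sh"

# user configuration file overrides the template
printf 'NUM_PIXELS_CONT=3072\n' > "$workdir/myInputs.sh"

# apply in order: defaults, then template, then configuration file
source "$workdir/defaults.sh"
source "$workdir/template.sh"
source "$workdir/myInputs.sh"

echo "NUM_PIXELS_CONT=$NUM_PIXELS_CONT"   # the configuration-file value wins
echo "DO_SELFCAL=$DO_SELFCAL"             # an untouched default survives

rm -rf "$workdir"
```

This is why a parameter set in *myInputs.sh* always beats the same parameter
in *template.sh*, while any parameter left unset falls through to the
pipeline default.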
If the SB parset does not specify a template, a warning will be given and
processing will continue without it (using the defaults and any configuration
file given).

The only other parameter that can be passed through the command line is
``QUEUE``, which is the name of the slurm partition on which the bulk of the
jobs will be run. While the template or configuration file can specify a
value for ``QUEUE``, this is overridden by using the ``-q`` option. For
example::

  processASKAP.sh -s SB_SCIENCE -b SB_1934 -q work

This will send the jobs to the *work* partition instead of the default
(*askaprt*). The exceptions to this are jobs that require access to the
/askapbuffer filesystem - pipeline launch/relaunch jobs, initial raw data
access jobs, or submission to CASDA. These are sent to the queue indicated by
``QUEUE_OPS``, which defaults to *askaprt*, and shouldn't in general be
altered.

When run, the pipeline configuration parameters will be archived in the
*slurmOutputs* directory (see below). It will be called *pipelineConfig__