User Parameters - Pipeline & job control

Here we detail the input parameters that cover overall process control, along with the switches that turn different parts of the pipeline on and off.

Values for parameters that act as flags (i.e. those that accept true/false values) should be given in lower case only, to ensure comparisons work properly.

ASKAPsoft versions

The default behaviour of the slurm jobs is to use the askapsoft module that is loaded in your ~/.bashrc file. However, it is possible to run the pipeline using a different askapsoft module, by setting ASKAPSOFT_VERSION to the module version (0.19.2, for instance). If the requested version is not available, the default version is used instead.

This behaviour is also robust against there not being an askapsoft module defined in the ~/.bashrc - in this case the default module is used, unless ASKAPSOFT_VERSION is given in the configuration file.

The pipeline looks in a standard place for the modules, given by the setup used at the Pawsey Centre. If the pipelines are being run on a different system, the location of the module files can be given by ASKAP_MODULE_DIR (this is passed to the module use command).
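As a sketch, a configuration file might pin the askapsoft version like this (the version string and module directory shown are illustrative examples only):

```shell
# Illustrative fragment of a pipeline configuration file.
ASKAPSOFT_VERSION="0.19.2"                  # falls back to the default module if unavailable
ASKAP_MODULE_DIR="/group/askap/modulefiles" # passed to "module use"; change off-Pawsey
```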

ACES software

A small number of tasks within the pipeline make use of tools or scripts developed by the ACES (ASKAP Commissioning & Early Science) team. These live in a subversion repository that can be checked out by users (should you have permission) and pointed to by $ACES. The preferred approach, however, is to use the acesops module, which provides a controlled snapshot of the subversion tree, allowing processing to be reproducible by recording the revision number.

Use of the acesops module is the default behaviour of the pipeline, and the user does not need to load it prior to running the pipeline. To use your own copy of the subversion tree, you need to set USE_ACES_OPS=false. A particular version of the acesops module can be chosen via the ACESOPS_VERSION config parameter.
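For instance, a configuration file could opt out of acesops like this (a sketch; the version number is an illustrative example):

```shell
# Sketch: use your own $ACES checkout instead of the acesops module.
USE_ACES_OPS=false
# Or keep acesops but pin a specific version (example value):
# ACESOPS_VERSION="1.2"
```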

Slurm control

These parameters affect how the slurm jobs are set up and where the output data products go. To run the jobs, you need to set SUBMIT_JOBS=true. Each job has a time request associated with it - see the Slurm time requests section below for details.
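A minimal slurm-control fragment might look like this (values and the email address are hypothetical examples):

```shell
# Minimal sketch of slurm-control settings in a pipeline config file.
SUBMIT_JOBS=true             # actually submit jobs (false = dry run, files only)
JOB_TIME_DEFAULT="12:00:00"  # override the 24-hour default time request
EMAIL="user@example.com"     # hypothetical address, passed to sbatch --mail-user
EMAIL_TYPE="FAIL"            # only notify on job failure
```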

Variable

Default

Description

SUBMIT_JOBS

false

The ultimate switch controlling whether anything is run on the galaxy queue or not. If false, the slurm files etc. will be created but nothing will run (useful for checking that things are to your liking).

SINGULARITY_VERSION

3.7.4 (galaxy) or 3.11.4-askap (setonix)

Specific version of the singularity module to use. Depends on the cluster being used.

ASKAPPY_VERSION

""

Specific version of the askappy module to use.

PYTHON_MODULE_VERSION

""

The version of the python module loaded at the start of slurm jobs. This is only necessary for use with legacy versions of askapsoft (version 1.0.14 or earlier). For later versions, the “askappy” module is used to provide python. When a legacy askapsoft is used, if this is given as blank or as a version that is not available, we fall back to the system default.

NUMPY_MODULE_VERSION

1.13.3

The version of the numpy module loaded at the start of slurm jobs, and in the execution of the pipeline scripts. If given as blank or as a version that is not available, we fall back to the system default. This is only used with legacy versions of askapsoft (1.0.14 or earlier), and only when PYTHON_MODULE_VERSION is provided.

ASKAPSOFT_VERSION

""

The version number of the askapsoft module to use for the processing. If not given, or if the requested version is not valid, the version defined in the ~/.bashrc file is used; if none is defined there, the default version is used.

BPTOOL_VERSION

""

The version of the bptool module to use for the processing. Giving "" will result in the default bptool module being used.

ASKAP_MODULE_DIR

/group/askap/modulefiles (galaxy) or /software/projects/askaprt/modulefiles (setonix)

The location for the modules loaded by the pipeline and its slurm jobs. Change this to reflect the setup on the system you are running this on, but it should not be changed if running at Pawsey.

QUEUE

work

Slurm partition (“queue”) used for the bulk of the processing jobs. Can be specified via the -q option for processASKAP.sh.

QUEUE_OPS

askaprt

The queue used to run tasks that need to access the /askapbuffer filesystem - operationally important jobs such as pipeline startup, accessing raw data, or setting up CASDA archiving. Leave as is unless you know better.

CONSTRAINT

""

This allows one to provide slurm with additional constraints. While not needed for galaxy, this can be of use on other clusters (particularly those that have a mix of technologies).

ACCOUNT

""

This is the account that the jobs should be charged to. If left blank, then the user’s default account will be used.

RESERVATION

""

If there is a reservation, specify its name here. If you don’t have a reservation, leave this alone and the jobs will be submitted as regular jobs.

JOB_TIME_DEFAULT

24:00:00

The default time request for the slurm jobs. It is possible to specify a different time for individual jobs - see the list below and on the individual pages describing the jobs. If those parameters are not given, the time requested is the value of JOB_TIME_DEFAULT.

OUTPUT

.

The sub-directory in which to put the images, tables, catalogues, MSs etc. The name should be relative to the directory in which the script was run, with the default being that directory.

EMAIL

""

An email address to which you want slurm notifications sent (this will be passed to the --mail-user option of sbatch). Leaving it blank will mean no notifications are sent.

EMAIL_TYPE

ALL

The types of notifications that are sent (this is passed to the --mail-type option of sbatch, and only if EMAIL is set to something). Options include: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, & TIME_LIMIT_50 (taken from the sbatch man page on galaxy).

SLEEP_PERIOD

30

The length (in seconds) of a sleep call that is inserted prior to second and subsequent srun calls within the slurm jobs. This adds extra waiting time before the srun task, with the aim of reducing the likelihood of compute-node errors.

USE_ACES_OPS

true

Whether to use the acesops module to access ACES tools within the pipeline. Setting to false will force the pipeline to look in the $ACES directory defined by your environment. If $ACES is not set, then USE_ACES_OPS will be set back to true.

ACESOPS_VERSION

""

The version of the acesops module used by the pipeline. Leaving it blank will use the current default version.

Filesystem control

There are a couple of parameters that affect how files interact with the Lustre filesystem. We set the striping of the directory at the start to a value configurable by the user. This only affects the directories where MSs, images & tables go - parsets, logs, metadata and the rest are given a stripe count of 1.

There is also a parameter to control the I/O bucketsize of the measurement sets created by mssplit. This is particularly important in governing the I/O performance and the splitting run-time. The default, 1MB, matches the stripe size on /group, and has been found to work well.

A new parameter stman.files has been introduced in mssplit that makes use of the casacore storage manager functionality to control the number/type of files written to a measurement set. This should be useful in achieving better overall I/O performance. The pipeline now allows users to specify this parameter using FILETYPE_MSSPLIT.

There is also a parameter PURGE_FULL_MS that allows the deletion of the full-spectral-resolution measurement set once the averaging to continuum channels has been done. The idea here is that such a dataset is not needed for some types of processing (continuum & continuum cube imaging in particular), and so rather than leave a large MS lying around on the disk, we delete it. This parameter defaults to true, but is turned off if any of the spectral-line processing tasks are turned on (DO_COPY_SL, DO_APPLY_CAL_SL, DO_CONT_SUB_SL or DO_SPECTRAL_IMAGING). The deletion is done in the averaging job, once the averaging has completed successfully; if the averaging fails, the MS is not removed. Similarly, if DO_COPY_SL=true (so that a channel range is copied out of the spectral dataset), there is an option to remove the full MS after the copying has completed successfully - set PURGE_FULL_MS_AFTER_COPY=true to enable this.
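Putting these together, a filesystem-control fragment could look like the following sketch (values are the documented defaults, shown with illustrative overrides in the comments):

```shell
# Illustrative filesystem-control settings.
LUSTRE_STRIPING=4            # stripe count for the MS/image/table directories
BUCKET_SIZE=1048576          # 1 MB, passed to mssplit as stman.bucketsize
FILETYPE_MSSPLIT="combined"  # stman.files: fewer files per MS, better I/O
PURGE_FULL_MS=false          # keep the full-resolution MS (e.g. for later spectral work)
```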

Variable

Default

Description

LUSTRE_STRIPING

4

The stripe count to assign to the data directories

BUCKET_SIZE

1048576

The bucketsize passed to mssplit (as “stman.bucketsize”) in units of bytes.

FILETYPE_MSSPLIT

combined

Can be one of separate, combined or hdf5. This controls the number and/or type of files written to the measurement sets. The default, combined, reduces the number of files in the measurement set and should help improve I/O performance on the file system.

TILE_NCHAN_SCIENCE

1

The number of channels in the measurement set tile for the science data, once the local version is created.

TILE_NCHAN_1934

1

The number of channels in the measurement set tile for the bandpass calibrator data, once the local version is created.

PURGE_INTERIM_MS_SCI

true

Whether to remove the interim science MSs created when splitting and merging is required.

PURGE_INTERIM_MS_1934

true

Whether to remove the interim bandpass calibrator MSs created when splitting and merging is required.

PURGE_FULL_MS

true

Whether to remove the full-spectral-resolution measurement set once the averaging has been done. See notes above.

PURGE_FULL_MS_AFTER_COPY

false

Whether to remove the full-spectral-resolution measurement set once the copying of a channel range of the spectral data has completed.

Important locations

FAILURE_DIRECTORY

/askapbuffer/processing/pipeline-errors

Directory in which information about job failures is archived.

TEMPLATE_DIR

/askapbuffer/payne/askapops/SST-templates

Directory where the standard templates are found

PB_DIR

/askapbuffer/payne/askap-beams

Standard location of primary beam (holography) images

Control of Online Services

The pipeline makes use of two online databases: the scheduling block service, which provides information about individual scheduling blocks and their parsets; and the footprint service, which translates descriptive names of beam footprints into celestial positions.

These are hosted at the MRO, and it may be that the MRO is offline while Pawsey is still available. If that is the case, use of these services can be turned off via the USE_CLI parameter (CLI = “command-line interface”). If you have previously created the relevant metadata files, the pipeline will be able to proceed as usual. If the footprint information is not available, but you know what the footprint name was, you can use the IS_BETA option. See User Parameters - Mosaicking for more information and related parameters.
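A configuration for processing while the MRO services are offline might look like this sketch:

```shell
# Sketch: run without the online schedblock/footprint services.
# Requires that the relevant metadata files already exist locally.
USE_CLI=false
IS_BETA=false   # set true only for BETA-era Scheduling Blocks
```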

Variable

Default

Description

USE_CLI

true

A parameter that determines whether to use the command-line interfaces to the online services, specifically schedblock and footprint.

IS_BETA

false

A special parameter that, if true, indicates the dataset was taken with BETA, and so needs to be treated differently (many of the online services will not work with BETA Scheduling Blocks, and the raw data is in a different place).

Reprocessing without raw data present

It is possible that re-processing needs to be done (through re-running processASKAP.sh), but the raw data has been removed or is not available. Normally, various bits of metadata are obtained from the raw MSs, which are used to set up the processing. Some of this is determined through the mslist (Measurement summary and data inspection utility) tool, producing a metadata file that is later parsed. Also required is the total number of channels in the input data (summed over all MSs for a given beam).

If the raw data is not available, these metadata should be provided through the configuration file using the parameters listed below. For the pipeline run to work, the local versions of the MS data must exist - the splitting/merging will not be run again (since there is no raw data to obtain it from).
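An illustrative reprocessing fragment follows; the channel counts and file paths here are hypothetical placeholders, not values to copy:

```shell
# Sketch: reprocessing without the raw data present.
# Channel counts and paths are hypothetical examples.
NUM_CHAN_SCIENCE=10368
NUM_CHAN_1934=10368
MS_METADATA="/path/to/metadata/mslist-science.txt"       # full path required
MS_METADATA_CAL="/path/to/metadata/mslist-cal.txt"       # full path required
```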

Variable

Default

Description

NUM_CHAN_1934

""

The total number of channels in the dataset. If a beam’s data is spread over more than one MS, this is the sum of all channels for that beam. If CHAN_RANGE_1934 is used, NUM_CHAN_1934 should be the number of channels in that range.

NUM_CHAN_SCIENCE

""

The total number of channels in the dataset. If a beam’s data is spread over more than one MS, this is the sum of all channels for that beam. If CHAN_RANGE_SCIENCE is used, NUM_CHAN_SCIENCE should be the number of channels in that range.

MS_METADATA_CAL

""

The file produced by mslist (typically from a previous pipeline run) that shows the MS metadata for a calibration observation. The full path to the file must be given.

MS_METADATA

""

The file produced by mslist (typically from a previous pipeline run) that shows the MS metadata for a science observation. The full path to the file must be given.

Calibrator switches

These parameters control the different types of processing done on the calibrator observation. The three aspects are splitting by beam/scan, flagging, and finding the bandpass. The DO_1934_CAL parameter acts as the “master switch” for the calibrator processing.
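As one possible sketch, a re-run that reuses an existing bandpass solution could disable the individual calibrator steps while leaving the master switch on (an illustrative combination, not a prescribed recipe):

```shell
# Sketch: skip calibrator splitting/flagging/bandpass on a re-run.
DO_1934_CAL=true
DO_SPLIT_1934=false
DO_FLAG_1934=false
DO_FIND_BANDPASS=false
```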

Variable

Default

Description

DO_1934_CAL

true

Whether to process the 1934-638 calibrator observations. If set to false then all the following switches will be set to false.

DO_SPLIT_1934

true

Whether to split a given beam/scan from the input 1934 MS. From rev10559 onwards, users can additionally split out bandpass MS data from a specified time range (see below)

DO_FLAG_1934

true

Whether to flag the split-out 1934 MS

DO_FIND_BANDPASS

true

Whether to fit for the bandpass using all 1934-638 MSs

Science field switches

These parameters control the different types of processing done on the science field, with DO_SCIENCE_FIELD acting as a master switch for the science field processing.

The pipeline now allows users to convolve continuum images and cubes to a common resolution. The common resolution is defined by the beam having the coarsest PSF. For the coarse cubes, each frequency plane is convolved independently to a resolution common across the mosaic.
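For example, a continuum-only run with common-resolution mosaics might be sketched as follows (an illustrative combination of the switches described below):

```shell
# Sketch: continuum-only processing with common-resolution convolution.
DO_SCIENCE_FIELD=true
DO_CONT_IMAGING=true
DO_SELFCAL=true
DO_CONVOLVE_CONT=true          # convolve per-beam images before mosaicking
DO_SPECTRAL_PROCESSING=false   # skip all spectral-line work
```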

Variable

Default

Description

DO_SCIENCE_FIELD

true

Whether to process the science field observations. If set to false then all the following switches will be set to false.

DO_RAPID_SURVEY

false

Whether to use the rapid survey mode of the pipeline - suitable for continuum observations of many fields within a single scheduling block. See How to run the ASKAP pipelines for details.

DO_SPLIT_SCIENCE

true

Whether to split out the given beam from the science MS

DO_FLAG_SCIENCE

true

Whether to flag the (split) science MS

DO_APPLY_BANDPASS

true

Whether to apply the bandpass calibration to the science observation

DO_AVERAGE_CHANNELS

true

Whether to average the science MS to continuum resolution

SINGLE_JOB_PREIMAGING

true

Whether to run the preimaging tasks (BP application, flagging, averaging) in a single slurm job. Likewise the pre-spectral-imaging jobs (application of gains/leakages, splitting of channels and/or averaging).

DO_CONT_IMAGING

true

Whether to image the science MS

DO_SELFCAL

true

Whether to self-calibrate the science data when imaging

DO_SOURCE_FINDING_CONT

""

Whether to do the continuum source-finding with Selavy. If not given, the default value is that of DO_CONT_IMAGING. Source finding on the individual beam images is done by setting the parameter DO_SOURCE_FINDING_BEAMWISE to true (the default is false).

DO_CONTINUUM_VALIDATION

true

Whether to run the continuum validation script upon completion of the source-finding.

DO_CONTCUBE_IMAGING

false

Whether to image the continuum cube(s), optionally in multiple polarisations.

DO_APPLY_CAL_CONT

true

Whether to apply the gains calibration determined from the continuum self-calibration to the averaged MS.

DO_SPECTRAL_PROCESSING

false

Whether to do the spectral-line processing. Acts as a master switch that can turn off the following parameters.

DO_COPY_SL

false

Whether to copy a channel range of the original full-spectral-resolution measurement set into a new MS.

DO_APPLY_CAL_SL

true

Whether to apply the gains calibration determined from the continuum self-calibration to the full-spectral-resolution MS.

DO_CONT_SUB_SL

true

Whether to subtract a continuum model from the spectral-line dataset.

DO_SPECTRAL_IMAGING

true

Whether to do the spectral-line imaging

DO_JOINT_IMAGING_SPECTRAL

false

Whether to image the spectral-line data with joint imaging.

DO_SPECTRAL_IMSUB

true

Whether to do the image-based continuum subtraction.

DO_SOURCE_FINDING_SPEC

""

Whether to do the spectral-line source-finding with Selavy. If not given the default value is that of DO_SPECTRAL_IMAGING. Source finding on the individual beam cubes is done by setting the parameter DO_SOURCE_FINDING_BEAMWISE to true (default is false).

DO_MOSAIC

true

Whether to mosaic the individual beam images, forming a single, primary-beam-corrected image. Mosaics of each field can be done via the DO_MOSAIC_FIELDS parameter (default is true).

DO_ALT_IMAGER

true

Whether to use the new imager (imager) for all imaging. Its use for specific modes can be selected by the parameters DO_ALT_IMAGER_CONT, DO_ALT_IMAGER_CONTCUBE, and DO_ALT_IMAGER_SPECTRAL (which, if not given, default to the value of DO_ALT_IMAGER).

DO_CONVOLVE_CONT

true

Whether to convolve the single-beam continuum images and cubes to a common resolution before mosaicking. For the MFS images, the final resolution is dictated by the beam having the coarsest PSF. For coarse cubes, each frequency channel is convolved independently to a resolution defined by the beam having the coarsest PSF for that channel. If DO_CONVOLVE_CONT=true, the source finding is done on the convolved mosaics.

Post-processing switches

After the calibration, imaging and source-finding, there are several tasks that can be done to prepare the data for archiving in CASDA, and these tasks are controlled by the following parameters.
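A typical post-processing sketch prepares the products but defers the final archiving step (illustrative values):

```shell
# Sketch: prepare products for CASDA but do not stage them yet.
DO_CONVERT_TO_FITS=true
DO_MAKE_THUMBNAILS=true
DO_STAGE_FOR_CASDA=false   # flip to true when ready to archive
```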

Variable

Default

Description

DO_DIAGNOSTICS

true

Whether to run the diagnostic script upon completion of imaging and source-finding. (This is not the continuum validation, but rather other diagnostic tasks).

DO_FLAG_SUMMARY_1934

true

Whether to produce a summary file showing the fraction of data flagged as a function of integration, baseline and channel for each of the Cal MSs. See Validation and Diagnostics for further description.

DO_FLAG_SUMMARY_AVERAGED

true

Whether to produce a summary file showing the fraction of data flagged as a function of integration, baseline and channel for each of the continuum-averaged MSs. See Validation and Diagnostics for further description.

DO_FLAG_SUMMARY_SPECTRAL

true

Whether to produce a summary file showing the fraction of data flagged as a function of integration and baseline for each of the spectral-line MSs. See Validation and Diagnostics for further description.

DO_VALIDATION_SCIENCE

true

Run specific science validation tasks, such as plotting the cube statistics.

DO_CONVERT_TO_FITS

true

Whether to convert remaining CASA images and image cubes to FITS format (some will have been converted by the source-finding tasks).

DO_MAKE_THUMBNAILS

true

Whether to make the PNG thumbnail images that are used within CASDA to provide previews of the image data products.

DO_STAGE_FOR_CASDA

false

Whether to run the casda upload script to copy the data to the staging directory for ingest into the archive.

Slurm time and memory requests

Each slurm job has a time request associated with it. These default to 24 hours (24:00:00), given by the user parameter JOB_TIME_DEFAULT. You can use this parameter to set a different default. Additionally, you can set a different time to the default for individual jobs, by using the following set of parameters. Acceptable time formats include (taken from the sbatch man page): “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”

Each slurm job may also have a memory request. This has not been necessary on galaxy, since jobs are allocated an entire node, but setonix allows multiple jobs per node, with the allocation size controlled by both the requested number of tasks and the memory. There is a JOB_MEMORY_ equivalent to each JOB_TIME_ parameter. For now, the defaults for all parameters are “0”, which implies use of the entire node, but this will change as we develop appropriate values for each job. If a specific value is given, it can be a number followed by a unit indicator (one of [K|M|G|T], where the default is M for megabytes).

The following table lists the variable suffixes, which can be used to form either the JOB_TIME_ or JOB_MEMORY_ variable (e.g. there can be a JOB_TIME_SPLIT_1934 and a JOB_MEMORY_SPLIT_1934 parameter).
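For instance, overriding the request for the calibrator-splitting and spectral-imaging jobs could look like this (the times and memory value are illustrative examples only):

```shell
# Illustrative per-job overrides built from the suffixes below.
JOB_TIME_SPLIT_1934="02:00:00"        # 2 hours for splitting the calibrator MS
JOB_MEMORY_SPLIT_1934="16G"           # 16 GB instead of a whole node
JOB_TIME_SPECTRAL_IMAGE="1-00:00:00"  # 1 day for spectral-line imaging
```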

Variable Suffix

Description

_SPLIT_1934

Request for splitting the calibrator MS

_FLAG_1934

Request for flagging the calibrator data

_PLOT_FLAG_STATS_1934

Request for plotting the flagging statistics (for bandpass data)

_FLAG_SUMMARY_CAL

Request for finding the flagging summary (bandpass data)

_FIND_BANDPASS

Request for finding the bandpass solution

_FIND_LEAKAGE

Request for finding the bandpass leakage solutions

_APPLY_BANDPASS_1934

Request for applying the bandpass to the bandpass data

_AVERAGE_MS_1934

Request for averaging the channels of the bandpass data

_SPLIT_SCIENCE

Request for splitting the science MS

_FLAG_SCIENCE

Request for flagging the science data

_FLAG_BADCHAN

Request for subsequent flagging of bad (nearly-totally-flagged) channels

_COMBINE_FLAG_STATS

Request for combining the flag statistics

_PLOT_SELFCAL_GAINS

Request for plotting the selfcal gains following self-calibration

_COMBINE_PLOTS

Request for combining plots made per beam

_PLOT_FLAG_STATS

Request for plotting the flagging statistics

_FLAG_SUMMARY_AV

Request for making the flagging summary for the averaged data

_FLAG_SUMMARY_SPECTRAL

Request for making the flagging summary for the spectral data

_APPLY_BANDPASS

Request for applying the bandpass to the science data

_APPLY_LEAKAGE

Request for applying the leakage solutions to the science data

_AVERAGE_MS

Request for averaging the channels of the science data

_MSCONCAT_SCI_AV

Request for the concatenation of time window MSs (averaged data)

_MSCONCAT_SCI_SPECTRAL

Request for the concatenation of time window MSs (spectral data)

_CONT_IMAGE

Request for imaging the continuum (both types - with and without self-calibration)

_CONT_SELFCAL

Request for the selfcal job when done as a separate job (when MULTI_JOB_SELFCAL=true)

_CONT_APPLYCAL

Request for the application of calibration when done as a separate job (when MULTI_JOB_SELFCAL=true)

_CONTCUBE_IMAGE

Request for imaging the continuum cubes

_SPECTRAL_SPLIT

Request for splitting out a subset of the spectral data

_SPECTRAL_APPLYCAL

Request for applying the gains calibration to the spectral data

_SPECTRAL_CONTSUB

Request for subtracting the continuum from the spectral data

_SPECTRAL_IMAGE

Request for imaging the spectral-line data

_JOINT_IMAGING_SPECTRAL

Request for the joint-imaging of spectral-line data

_SPECTRAL_IMCONTSUB

Request for performing the image-based continuum subtraction

_MS_CUT

Request for making a cut of the MS around a known source

_CONVOLVE

Request for convolving the per-beam images to common resolution

_LINMOS

Request for mosaicking

_SOURCEFINDING_CONT

Request for continuum source-finding jobs

_SOURCEFINDING_SPEC

Request for spectral-line source-finding jobs

_DIAGNOSTICS

Request for the diagnostics job

_FITS_CONVERT

Request for converting to and/or fixing up the FITS files

_THUMBNAILS

Request for making the thumbnail images

_VALIDATE

Request for the various validation jobs

_CASDA_UPLOAD

Request for the casdaupload job

_GATHER_STATS

Request for the final wrap-up job that gathers job statistics

_LAUNCH

Request for the relaunch job to restart the pipeline

Speed-up switches

The ASKAPsoft release 1.1.0 allows some of the processes in the imaging tasks run by the master to exploit thread-level parallelisation across spare CPUs using OpenMP. This can significantly speed up the preconditioning and the CLEANing stages.
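The thread counts can be set in the configuration file as in this sketch (20 matches a galaxy node; tune to the cores available per node on other clusters):

```shell
# Sketch: OpenMP thread counts for the imaging and mosaicking masters.
NUM_OMP_THREADS_CONT_IMAGING=20
NUM_OMP_THREADS_CONT_LINMOS=20
```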

Variable

Default

Description

NUM_OMP_THREADS_CONT_IMAGING

20

This number is passed to the OMP_NUM_THREADS environment variable defined locally for the continuum imaging jobs. The imagers, compiled with the OMP option ON, allow the preconditioning and CLEAN stages to parallelise tasks across the spare CPUs of the master.

The default (20) is set for the galaxy cluster.

NUM_OMP_THREADS_CONT_LINMOS

20

This number is passed to the OMP_NUM_THREADS environment variable defined locally for continuum mosaic jobs. If compiled with the OMP option ON, continuum mosaic jobs would parallelise the correction and re-gridding stages across the spare CPUs of the node.

The default (20) is set for the galaxy cluster.