Diagnostics and job management
==============================

Monitoring and managing jobs
----------------------------

Running the processASKAP.sh script will also create two scripts that are available for the user to run. These reside in the tools/ subdirectory, and are tagged with the date and time that processASKAP.sh was called. Symbolic links to them are put in the top level directory. These scripts are:

* *reportProgress.sh* – (links to *tools/reportProgress-YYYY-MM-DD-HHMMSS.sh*) This is a front-end to squeue, showing only those jobs started by the most recent call of *processASKAP.sh*. It can take a few options:

  - -v : provides a list of the jobs along with a brief description of what each job is doing.
  - -r : shows only the jobs that are running (useful if a lot of jobs are waiting in the queue).
  - -c : runs with the -M flag to show jobs for a different cluster to the one on which the script is run.

* *killAll.sh* – (links to *tools/killAll-YYYY-MM-DD-HHMMSS.sh*) This is a front-end to *scancel*, providing a simple way of cancelling all jobs started by the most recent call of *processASKAP.sh*. It will not kill the launch script that runs at the end of a pipeline run to gather stats and re-run processASKAP.sh (although note the -x option below). This also has a few options:

  - -x : prevents the launch script from re-running the pipeline next time round.
  - -p : kills only the pending jobs (with the exception of the launch script) - those jobs that are running are left alone to complete.
  - -c : runs with the -M flag to show jobs for a different cluster to the one on which the script is run.

In each of these cases, the date-stamp (*YYYY-MM-DD-HHMMSS*) is the time at which *processASKAP.sh* was run, so you can tie the results and the jobs down to a particular call of the pipeline. If you have jobs from more than one call of *processASKAP.sh* running at once (generally *not* advised!), you can run the individual script in the tools directory, rather than the symbolic link (which will always point to the most recent one).

The list of jobs and their descriptions (as used by *reportProgress.sh*) is written to a file *jobList.txt* (which is a symbolic link, pointing to *slurmOutput/jobList-YYYY-MM-DD-HHMMSS.txt*). These symbolic links are created in both the top level directory (where the script is run) and the output directory.

For future reference, the configuration file is also archived in the *slurmOutput* directory, with the timestamp added to the filename. This allows you to go back and see exactly what you submitted each time.
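For example, a typical interaction with a running pipeline, using the symbolic links in the top-level directory and the options described above, might look like::

  # Verbose listing of all jobs from the most recent call of processASKAP.sh
  ./reportProgress.sh -v

  # Show only the jobs that are currently running
  ./reportProgress.sh -r

  # Cancel everything from the most recent call, and stop the launch script
  # from re-submitting the pipeline
  ./killAll.sh -x

  # Cancel only the pending jobs, leaving the running ones to complete
  ./killAll.sh -p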
Resource diagnostics
--------------------

Each of the individual askapsoft tasks writes information about the time taken and the memory used to a "stats" file. The information includes the job ID, the number of cores used, and a description of the task, along with an indication of whether the job succeeded ('OK') or failed ('FAIL'). For distributed jobs, a distinction is made between the master process (rank 0) and the worker processes, and for the workers both the peak and average memory usage are reported. This is important for tasks like **cimager**, where there are clear differences between the memory usage on the master & worker nodes.

These files are progressively updated as jobs complete, and so can be examined as the pipeline run progresses. Both ASCII (formatted into columns) and CSV files are created, placed in the top-level directory and labelled with the same time-stamp as above - stats-all-YYYY-MM-DD-HHMMSS.txt (or .csv).

Here is an excerpt of a typical stats file::

  JobID    nCores  Description                  Result  Real     User     System  PeakVM  PeakRSS  StartTime
  4460004       1  split_BPCALB00               OK      8.9      0.73     1.08    487     136      2018-06-08T20:21:24,630
  4460005       1  flag_BPCALB00_Dyn            OK      10.33    9.34     0.44    470     120      2018-06-08T20:21:39,974
  4460077       1  applyBP1934_BPCALB00         OK      21.91    21.27    0.53    483     146      2018-06-08T20:47:30,309
  4460113       1  split_F00B00                 OK      544.95   64.02    149.75  531     181      2018-06-08T20:21:42,092
  4460114       1  applyBP_F00B00               OK      3119.34  3060.18  51.39   483     146      2018-06-08T20:47:30,310
  4460115       1  flag_F00B00_Dyn              OK      1469.36  1375.93  64.4    471     121      2018-06-08T21:39:42,799
  4460116       1  avg_F00B00                   OK      143.78   85.39    41.1    540     175      2018-06-08T22:04:22,974
  4460117       1  flagAv_F00B00_Dyn            OK      42.18    37.79    1.64    406     56       2018-06-08T22:06:57,483
  4460118     145  contSC_F00B00_L0_master      OK      308.06   289.86   17.97   6379    4002     2018-06-08T22:07:56,605
  4460118     145  contSC_F00B00_L0_workerPeak  OK      308.06   289.86   17.97   4561    2466     2018-06-08T22:07:56,605
  4460118     145  contSC_F00B00_L0_workerAve   OK      308.06   289.86   17.97   2158.8  1541.2   2018-06-08T22:07:56,605
  4460118     145  contSC_F00B00_L1_master      OK      736.05   605.89   129.9   7160    4002     2018-06-08T22:18:44,012
  4460118     145  contSC_F00B00_L1_workerPeak  OK      736.05   605.89   129.9   5577    2466     2018-06-08T22:18:44,012
  4460118     145  contSC_F00B00_L1_workerAve   OK      736.05   605.89   129.9   2222.0  1541.2   2018-06-08T22:18:44,012

Upon completion of the pipeline, this information is also presented graphically in a PNG image (statsPlot-YYYY-MM-DD-HHMMSS.png), which is put in both the top-level directory and the *diagnostics* directory. Here is an example:

.. image:: exampleStatsPlot.png
   :width: 90%
   :align: center

Each row in the plot corresponds to a single beam/field combination, with mosaicking jobs for a given field getting their own row, and mosaicking and source-finding jobs for the overall observation getting a further row (at the top). Each type of job has its own colour, and the width of the line for each job indicates the number of cores used. Jobs that fail are represented by crosses at the start time.
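Because both forms are plain text, the stats files can also be interrogated with standard command-line tools while the pipeline is still running. For instance, assuming the column layout shown in the excerpt above, something like the following will list any failed tasks and the entries with the largest peak memory::

  # List any tasks that have failed so far
  grep FAIL stats-all-*.txt

  # Show the ten entries with the largest peak resident memory (PeakRSS, column 9)
  sort -k9,9 -n -r stats-all-*.txt | head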
Overall pipeline control
------------------------

A further key set of files are the pipeline control parsets. These are ParameterSet files that are used to record the state of jobs launched by the pipeline, their slurm job IDs, and the progress of particular stages (replacing the former use of checkfiles). These are kept in the pipelineControl directory, and there is one for each field/beam combination, one for each field alone, and one for the 'base' level jobs. It is not intended that these files be accessed outside of the pipeline processes, although the shell script *pipelineControl.sh* does allow interactions (and is used by the pipeline scripts to do so).

There are several types of parameters present:

* the job name (similar to that used in the stats files) - this has one of four states:

  - todo: job is pending
  - running: job is currently running (or has timed-out or been cancelled)
  - ok: job completed successfully
  - fail: job completed with errors

* check. - a checkpointing parameter, used to indicate that a particular step in the pipeline workflow has completed successfully.

* jobid. - the slurm job ID corresponding to the job. When the job completes this should be removed from the parset.

* fieldXX.beamXX (where XX indicates the numerical count for the field or beam) - a special parameter intended to be used to stop processing on a particular beam should a QA check fail. This is not yet fully utilised, but the flags are present. Relevant states are either 'todo' or 'stop'.

The states of these values, along with the stats file content, are used to determine whether a pipeline needs to be restarted (see :doc:`pipelineUsage`). When a pipeline run completes, the pipelineControl directory is copied to a directory named with the timestamp, then tarred up. When the final pipeline run finishes (i.e. no more processing is required), the pipelineControl/ directory is removed.
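To give a sense of what these parsets contain, here is a purely illustrative fragment for a single beam, using a job name from the stats excerpt above; the actual parameter names and values are internal to the pipeline and may differ from those shown::

  # Hypothetical fragment only - the real parameter names are set by the
  # pipeline scripts and may differ from those shown here.
  avg_F00B00 = ok                # job state: todo / running / ok / fail
  jobid.avg_F00B00 = 4460116     # slurm job ID, removed once the job completes
  check.avg_F00B00 = true        # checkpoint flag for a completed workflow step
  field00.beam00 = todo          # per-beam QA flag: 'todo' or 'stop'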