Archiving LBA or AuScope data products from the correlator

Archiving the Correlator Output Data

The correlator archive now lives on Pawsey's new data store. (You'll need your own Pawsey account for access.)

If you log in to the data store via your web browser, some data management scripts for command-line use are available under “Tools”: ashell.py and aterm.jar.

There is a wrapper script for ashell.py that automatically tars up and transfers the data required for the archive:

archivec.py $CORR_DATA/<expname> /projects/VLBI/Archive/LBA/<exp_parent>

where <exp_parent> is the overall project code, not the <expname> for this session (e.g. the 'v534' in 'v534a'). The script will prompt for a username and password if required, but you can avoid the prompt by first setting up a session delegate. E.g.

ashell.py "login + delegate 7"

will prompt for your username and password then set up passwordless delegate access for the next 7 days.

N.B. it is advisable to run transfers in screen, as they can take a while.

To archive data by hand (deprecated):

corr@cuppa03:~$ ashell.py 
Welcome to ashell v0.8, type 'help' for a list of commands
ivec:offline>login
Username: hbignall # insert your Pawsey user name here
Password:
ivec:online>cf VLBI/Archive/LBA/{}           # changes to this directory on the data store 
ivec:online>cd /path/to/local/data           # changes the local working directory
ivec:online>put ./                           # will transfer the whole <expname> directory to the data store
ivec:online>logout                           # when finished, to prevent someone else using your login
ivec:offline>exit

Archiving LBA FITS files to ATOA

The FITS files (each with an md5sum checksum file) need to be uploaded to ATOA. There is a script to generate the md5sum file and upload it and the FITS file to the appropriate place:

archatoa.py *FITS*
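For illustration, the checksum side of this is assumed to amount to writing an md5sum file alongside each FITS file (the file name below is a placeholder, created empty for demonstration; the upload step itself is omitted):

```shell
# Sketch of the assumed checksum step only; V534A.FITS.1 is a placeholder.
cd "$(mktemp -d)"
touch V534A.FITS.1                      # stand-in for a real FITS file
md5sum V534A.FITS.1 > V534A.FITS.1.md5  # checksum file to upload alongside it
cat V534A.FITS.1.md5                    # "<checksum>  V534A.FITS.1"
```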

Be careful to upload only the production FITS files. From Pawsey, archatoa.py requires an ssh tunnel to reach the ATNF web site; this can be set up to happen automatically. Removing errant files from ATOA is problematic, so it is recommended to leave a cooling-off period before running archatoa.py.
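One way to make the tunnel automatic is an entry in ~/.ssh/config that forwards a local port through a host with external access. This is only a sketch: the host names, port numbers, and endpoint below are all placeholders, not the actual ATNF upload address.

```
# Hypothetical ~/.ssh/config fragment -- every name and port is a placeholder.
Host atnf-tunnel
    HostName gateway.example.org
    LocalForward 8080 atnf-upload.example:443
```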

Archiving the pipeline outputs

The pipeline outputs are distributed to PIs via the wiki. lba_feedback.py will create the wiki page, and archpipe will automatically send the archive plots and the wiki page to the wiki. The wiki pages are linked from the correlator records spreadsheet (LBA or AuScope).

cd $PIPE/<expname>/out
lba_feedback.py <expname>.wikilog > <expname>.txt
archpipe <expname>

From Pawsey, archpipe requires ssh tunneling to access the ATNF web site. You can set that up to happen automatically.

Distributing the Data

Once archived, log in to My Data at https://data.pawsey.org.au, locate the FITS files (under VLBI/Archive/LBA/), and make them public. Email the link to the PI.

Deleting the baseband data

Once the correlated data are verified and released, the baseband data should be deleted. On Pawsey systems, don't use the normal 'rm' command, as the overheads are likely to cause file system problems, as described here: https://support.pawsey.org.au/documentation/display/US/Tips+and+Best+Practices

There is a convenient alias (rml) defined in ~cormac/.bash_aliases. To use, e.g.:

. ~cormac/.bash_aliases
cd ~/scratch/corr/baseband
rml v454b-??

will efficiently delete the baseband data directories for v454b.
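If that alias is not available, a comparable approach is to delete files individually with find and then remove the empty directory tree. This is only a sketch, demonstrated on a throwaway directory; on Lustre file systems (such as Pawsey's), 'munlink' is often recommended in place of 'rm -f' in the first find below (see the Pawsey tips page linked above).

```shell
# Create a throwaway stand-in for a baseband data area.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/v454b-01" "$demo_root/v454b-02"
touch "$demo_root"/v454b-01/disk0.lba "$demo_root"/v454b-02/disk0.lba
# Delete files one at a time (substitute 'munlink' for 'rm -f' on Lustre),
# then remove the now-empty directories, deepest first.
find "$demo_root"/v454b-?? -type f -exec rm -f {} +
find "$demo_root"/v454b-?? -depth -type d -exec rmdir {} +
```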

Notes on what needs to go into the archive (for correlation with DiFX versions 1.5 and higher)

If you are using espresso.py for correlation and archivec.py for archiving, the following is all done automatically.

Associated with each job

  • .joblist (not critical)
  • .v2d
  • .vex
  • .difxlog (DiFX messages from errormon2)

Unique to each job

  • .input
  • .calc
  • .im
  • .uvw [DiFX 1.5]
  • .delay [DiFX 1.5]
  • .rate [DiFX 1.5]
  • .flag
  • .difx (output directory - may contain multiple files)

Additional files for pulsar modes only:

  • .polyco
  • .binconfig

Final output

  • FITS files - may be associated with multiple jobs, arbitrarily named. Include README (as per description of output files on wiki page).

Notes

Jobs may live in subdirectories, and files from different jobs may share identical filenames, distinguished only by their subdirectory.

For accountability it's important to keep the directory structure.

Ideally we want to keep all relevant files for all production jobs.

NB: the following is mainly relevant for old versions of DiFX (pre DiFX-2). In some cases the SWIN-format output data may be impractically large for online storage, and it may not be desirable to keep this intermediate-stage data. For example, DiFX 1.5 output when a full band is correlated at high spectral resolution but the user only wants the subset of the band containing the spectral lines: the output FITS data will generally be a manageable size, but the SWIN data in the .difx directories will typically be at least several times larger (e.g. they cover 16 MHz while the region of interest is only 2 MHz wide).

It may be useful to keep all jobs, clock search and test as well (e.g. to have some data at higher spectral resolution for checking). Clock search jobs are usually in a “clocks” subdirectory, but it won't necessarily be obvious which jobs are test or final production. Test jobs can be moved manually to a subdirectory (e.g. “test”). Other (dated) subdirectories may exist for production jobs, especially where multiple runs were needed. Note: running with espresso now creates a comment.txt file containing a description of each job (the operator is prompted to edit the file at completion of correlation), and espresso also allows test jobs to be flagged when run (these are written to a “test” subdirectory in the output data area).

Not useful to keep:

  • jobs with zero-length (empty) output files
  • MPI related files (threads, machines, run)
  • station file lists (already in .input)
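To spot the zero-length output files mentioned above before archiving, something like the following works; it is demonstrated here on a temporary directory standing in for a correlator output area (the job names are placeholders):

```shell
# Stand-in output area with one empty and one non-empty output file.
data=$(mktemp -d)
mkdir -p "$data/job1.difx" "$data/job2.difx"
touch "$data/job1.difx/DIFX_0"                    # zero-length output
echo "visibilities" > "$data/job2.difx/DIFX_0"    # non-empty output
# List only the empty files; the corresponding jobs need not be archived.
find "$data" -type f -empty
```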
correlator/archiving.txt · Last modified: 2017/09/26 17:02 by cormac