Sharing Data on the WWW
You are not alone
When Ray first asked me to give a talk at this meeting, the title "Sharing Data on the WWW" was proposed. I figured that it would be something to start with, but certainly would have to be changed in final form because the image that came to my mind was that of playground children sharing (or not sharing) their toys. As I pieced together the various parts I realized that what I wanted to talk about is exactly that - astronomers sharing their data and ideas with others and to remind people that you are not alone.
I'll try to focus on the positive aspect of data sharing and not dwell on ways some researchers will try to "hide" some aspects of their research so they don't get scooped.
What I'll talk about are avenues of getting your discoveries out into the world, a brief listing of commonly used formats, and methods of obtaining other people's results so you can use them. I will use my personal experience of working on NED for the past 12 years to indicate how this has been made possible.
1. Putting it out there
2. What formats are commonly used?
3. Getting other people's Data
4. Combined data for objects
5. How can it work?
Putting it out there
The most respected method of informing the community of what you've done is through the refereed journals. These articles are archival in nature and are often considered to be the official word in research. However, these articles are harder to write because they need to be in particular formats and need to go through an elaborate array of referees, editors, copy-editors, and publishers before the results actually "hit the streets". The process may take up to a year from the time an article is submitted until it is actually in the hands of the intended audience, although some articles may become available in just 2-3 months. This could be a daunting process, but you are not alone. Your co-workers probably have experience with publishing articles and will help you get it right. The journal editors also take an active role in following guidelines and keeping standards.
Unfortunately (or fortunately), a LOT of data in the form of tables, spectra, images, etc. are being made available in this manner. The data centers (ADC/CDS) and data bases (NED/SIMBAD) are the de facto repositories for these products after publication.
However, it does happen that errors are generated and not caught until after publication. Undetected errors could mislead future researchers if they take the data at face value. With so much information out there it is really tough for researchers to fully understand every observation, and they may take the option of just quoting a published value. DANGEROUS!
One solution to the time-lag problem is the use of pre-print servers. Often (but not always) authors will post their articles on a service such as astro-ph after the article has been accepted for publication. That way what is retrieved should be the same as what gets published. However, there is limited quality control of what gets submitted and versions may change from what gets archived with the journal. Some articles are submitted to astro-ph before they are refereed and authors may or may not revise the on-line version prior to or after publication. Some "pre-prints" may never even get published. BE CAUTIOUS!
Other venues for disseminating your knowledge are newsletters, such as the Galactic Center News or Dwarf Tales, and privately published Observatory Reports and Monographs. Private lists tend to be distributed upon request and can sometimes contain very rich samples, while personal lists are often kept within a small group with occasional objects published as deemed ready. Chat groups may get information out into the world, but that information is often highly suspect.
What formats are commonly used?
Many journals today encourage LaTeX files for article submission and they even provide style files to help authors write in the style the journals expect. This is fine for the publication process, but may not be the best for data distribution. If a LaTeX file is to be used outside of the journal environment, one needs the correct style file to read it. Different journals may even have conflicting commands for some features.
Basically, any of the popular word processing packages may be used to generate the text. Data tables also may be generated by various spreadsheet packages such as Excel or Lotus 1-2-3. However, I would like to make a plea that FLAT ASCII files also be generated by the authors and included with the submitted article. There have been cases where data tables created in LaTeX have been incorrectly converted to ASCII by researchers and even publishers. Having the author supply BOTH formats helps to minimize this problem.
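As a rough illustration of what a flat ASCII table can look like, here is a minimal Python sketch that writes rows as whitespace-separated, fixed-width columns. The function name, the column names, and the two sample rows are all made up for the example; they are not real objects or measurements, and no particular data center requires this exact layout.

```python
def to_flat_ascii(columns, rows):
    """Render rows as a fixed-width, plain-ASCII table.

    columns: list of (name, format_spec) pairs, e.g. ("RA_deg", ".4f")
    rows:    list of tuples, one value per column
    """
    # Format every value first so column widths can be measured.
    formatted = [
        [format(value, spec) for value, (_, spec) in zip(row, columns)]
        for row in rows
    ]
    widths = [
        max([len(name)] + [len(cells[i]) for cells in formatted])
        for i, (name, _) in enumerate(columns)
    ]
    # Header line, then one right-justified line per row.
    lines = [" ".join(name.rjust(w) for (name, _), w in zip(columns, widths))]
    for cells in formatted:
        lines.append(" ".join(cell.rjust(w) for cell, w in zip(cells, widths)))
    return "\n".join(lines) + "\n"


# Hypothetical sample: placeholder names and values, for illustration only.
columns = [("Name", "s"), ("RA_deg", ".4f"), ("Dec_deg", "+.4f"), ("V_mag", ".2f")]
rows = [
    ("OBJ-A", 10.5, -3.25, 12.1),
    ("OBJ-B", 123.456789, 45.0, 9.87),
]
table = to_flat_ascii(columns, rows)
print(table, end="")
```

Because every line has the same width and contains only plain characters, such a file can be read by eye, by a text editor, or by almost any program without a style file. Names that contain spaces would need fixed column positions documented alongside the table rather than simple whitespace splitting.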
Spectra and images are most scientifically useful when provided in FITS format. PostScript, JPEGs, GIFs, etc. may make for smaller files, but one needs more information (such as that given in the headers) to actually use the data. The future may see a change in the format for distributing reports, as we shall soon hear concerning XML.
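To make the point about headers concrete, here is a simplified Python sketch of the 80-character "cards" that make up a FITS header: a keyword in the first 8 columns, "= " in columns 9-10, then a value and an optional comment after a slash. The helper names make_card and parse_card are invented for this example, the sample values are placeholders, and the sketch ignores many details of the real standard (long strings, Fortran-style exponents, and so on). For actual work, an established FITS library is the safer choice.

```python
CARD_LENGTH = 80  # every FITS header card is exactly 80 characters


def make_card(keyword, value, comment=""):
    """Build one simplified fixed-format card: KEYWORD = value / comment."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        text = ("T" if value else "F").rjust(20)
    elif isinstance(value, (int, float)):
        text = repr(value).rjust(20)
    else:
        # Strings are quoted; an embedded quote is written as two quotes.
        quoted = "'" + str(value).replace("'", "''").ljust(8) + "'"
        text = quoted.ljust(20)
    card = keyword.upper().ljust(8) + "= " + text
    if comment:
        card += " / " + comment
    return card[:CARD_LENGTH].ljust(CARD_LENGTH)


def parse_card(card):
    """Split a card into (keyword, value, comment)."""
    keyword = card[:8].strip()
    if card[8:10] != "= ":
        # Commentary-style card (COMMENT, HISTORY, END): no value field.
        return keyword, None, card[8:].strip()
    body = card[10:]
    if body.lstrip().startswith("'"):
        chars = []
        i = body.index("'") + 1
        while i < len(body):
            if body[i] == "'":
                if body[i + 1:i + 2] == "'":  # doubled quote = literal quote
                    chars.append("'")
                    i += 2
                    continue
                break  # closing quote
            chars.append(body[i])
            i += 1
        value = "".join(chars).rstrip()
        comment = body[i + 1:].partition("/")[2].strip()
        return keyword, value, comment
    raw, _, comment = body.partition("/")
    raw = raw.strip()
    if raw in ("T", "F"):
        value = raw == "T"
    else:
        try:
            value = int(raw)
        except ValueError:
            value = float(raw)
    return keyword, value, comment.strip()


# Placeholder header values, for illustration only.
header = [
    make_card("SIMPLE", True, "conforms to FITS standard"),
    make_card("NAXIS", 2, "number of axes"),
    make_card("OBJECT", "NGC 1068", "target name"),
    make_card("EXPTIME", 300.0, "exposure time in seconds"),
]
for card in header:
    print(parse_card(card))
```

A JPEG or GIF of the same image carries none of these keyword/value pairs, which is why the picture alone is rarely enough to do science with.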
Getting the Data
There are several methods of getting the data off the WWW. The first (and best) place you should go is to your local World Data Center. They are going through quite a bit of effort to create ASCII files of data tables and to archive them with standardized README files which are written by the authors or based directly on the data as published. In the event that a published table needs revision, these files are kept current with corresponding updates made to the README file describing the history of the changes. Data tables, spectra, and images are made available whenever practical. Although these are the most trustworthy, they should also be treated with care.
Some data are better maintained at a mission center such as IPAC, HEASARC, or STScI for various reasons such as size, quantity, availability dates, works in progress, etc. These locations tend to have their own user interface to the data, have the most current and correct versions, and provide detailed documentation on how to use the data.
Of course, published data are now directly available via the electronic journals which published the data to begin with. However, these sites require a subscription/password which may not always be available, the data tables may not always be obtainable in ASCII, the images/spectra might not be in FITS format, and the data files may not be as well maintained as they are at the Data Centers.
Retrieving data from the preprint servers raises the same concerns (except for the subscription requirement), with the added possibility that the data differ from what was published or were never published at all.
Requesting data directly from the authors can be quite successful and it shows them that there are people interested in their data. However, the documentation may not be as complete as what might be available from the Data Centers, and the author, if still upgrading the data set, may send you a working version which might be different from the published version, thus losing the paper trail.
As a last resort, you can scan or hand-type the data tables yourself. This can be very time consuming and may contain errors depending on your proofreading ability. These tables are also the data as published without known errors corrected.
Combined data for Objects
A different method of getting data from the WWW is to use centralized databases such as NED, SIMBAD, and LEDA. These data bases extract information from the literature and attach it to unique objects. By their nature, NED and SIMBAD are inhomogeneous, but quite helpful in providing pointers to the literature and detailed data on individual objects. LEDA has attempted to provide homogeneous information for various parameters.
The documented data files at the data centers are the ones accessed by VizieR, which lets you search each of them for detailed data table entries. Compilation catalogs such as the Catalog of Infrared Observations and the Russian CATS compilation for radio sources also let you query specific data points from catalogs or journal articles.
How can it work?
The key to sharing data is the conscious effort to WANT to. There are ways of hiding data, but the best solution for astronomy is to make your data easily accessible. This includes being clear in what objects you have in your sample. If they already have existing names, use them. If they are new objects, consult the IAU specifications for nomenclature for procedures on how to name them. If you're not sure, check NED and SIMBAD to see if your object is in either data base.
When using acronyms and catch phrases, make sure you define them sufficiently. Keep in mind that your articles will still be around in 20 years and concepts will certainly have changed.
Remember that the journals basically have a logical flow to how an article should be written. Standards have been established and can only work when they are adhered to. Flat ASCII tables and FITS files are the most general forms of distributing data at this point.
And, lastly, the true glue that allows NED, SIMBAD, ADS, CDS, and the ADC to work is the 19-character reference code (also called the bibliographic code). This powerful little code allows the archives to interconnect. For journal articles it is easily understandable by humans without the need for extensive lookup tables since it includes the year, journal, volume, and starting page number of the article.
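To show how readable the code is, here is a small Python sketch that splits a reference code into its parts. It assumes the usual fixed layout of 19 positions (4 for the year, 5 for the journal, 4 for the volume, 1 qualifier, 4 for the page, 1 for the first author's initial, with periods as padding). The function name is invented, and the sample code is constructed for illustration rather than taken from a real article.

```python
def parse_bibcode(code):
    """Split a 19-character reference code into its fields.

    Assumed layout: YYYYJJJJJVVVVMPPPPA, padded with periods.
    """
    if len(code) != 19:
        raise ValueError("expected 19 characters, got %d" % len(code))
    return {
        "year": int(code[0:4]),
        "journal": code[4:9].rstrip("."),    # left-justified, padded on the right
        "volume": code[9:13].lstrip("."),    # right-justified, padded on the left
        "qualifier": code[13].strip("."),    # e.g. "L" for a Letters page
        "page": code[14:18].lstrip("."),     # right-justified, padded on the left
        "author_initial": code[18],
    }


# Constructed example, not a citation to a specific paper.
sample = "2000AJ....120.1234M"
fields = parse_bibcode(sample)
print(fields)
```

Since every field sits at a fixed position, two archives that have never exchanged lookup tables can still recognize that they are pointing at the same article.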
You are responsible for your data!!!
Dictionary of nomenclature: