If you run into problems while following this guide, please contact the consultant for the relevant area.

For problems related to access and architectures: Mattia D’Antonio (m.dantonio@cineca.it)
For problems related to exome data analysis: Alessandro Bruselles (a.bruselles@gmail.com)
For problems related to SNP-array analysis: Giovanni Birolo (giovanni.birolo@gmail.com)

Getting started

  1. CINECA registration. Make sure you are registered on the UserDB portal (https://userdb.hpc.cineca.it/). For more information, refer to: http://www.hpc.cineca.it/content/users
    Note: an ID card (or another valid identity document) is required to complete the registration on the portal.
    Note also that to analyze your data you will need an HPC account. This is a separate, optional step that you must request explicitly.
  2. Project creation. Write to superc@cineca.it to ask for a child project of the main NIG project to be created for your group. Remember to also ask for access to the NIG repository data (DRES_Genomi). The project will be assigned a certain amount of computing hours for future analyses. A Principal Investigator (PI), usually the head of your research group, must also be assigned to the project.
  3. NIG portal registration. Register on the NIG portal by sending a request to m.dantonio@cineca.it

Procedure

The procedure depends on the type of data to analyze.

Exomic data:

  1. Filtering, mapping, variant calling [fastq] -> [g.vcf]
  2. annotation [unannotated vcf] -> [annotated vcf]
  3. data split into samples [annotated vcf] -> [several vcf]
  4. loading fastq into dataset
  5. loading g.vcf into dataset
  6. loading vcf into corresponding datasets
  7. (optional) phasing [several g.vcf] -> [unannotated vcf]

SNP-array data:

  1. annotation [unannotated vcf.gz] -> [annotated vcf.gz]
  2. loading of un-annotated vcf.gz into dataset. BEWARE: it is *extremely* important that the data is compressed, otherwise the web interface will try to parse your vcf.
  3. loading of annotated vcf.gz file into the study. BEWARE: it is *mandatory* that the names of both phenotypes and datasets match.
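
Since the web interface expects compressed files, it can help to check the compression before loading. The following is a minimal sketch, not part of the NIG tools: ensure_gzipped is a hypothetical helper name, and it assumes plain gzip compression is acceptable (the guide only says the file must be a compressed vcf.gz).

```shell
# Sketch: make sure a VCF is gzip-compressed before loading it.
# ensure_gzipped is a hypothetical helper, not part of the NIG module.
ensure_gzipped() {
    vcf="$1"
    # gzip -t succeeds only if its input is a valid gzip stream.
    if gzip -t < "$vcf" 2>/dev/null; then
        echo "$vcf"
    else
        # Compress into a new .gz file and leave the original untouched.
        gzip -c "$vcf" > "$vcf.gz" && echo "$vcf.gz"
    fi
}
```

The function prints the path of the compressed file, so it can be used as `to_upload=$(ensure_gzipped my_chip.vcf)`, where my_chip.vcf is a placeholder name.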

1. Loading the data

This section focuses on importing your data into CINECA. Since you are going to analyze your data with the pipeline that the NIG consortium has developed for you, you need to load your raw data into your own data directory.

  1. Loading data into your area. You can copy your data to the CINECA infrastructure using the FTP protocol (we recommend FileZilla for this). If your data is particularly large, consider using GridFTP or GlobusOnline. In either case, we recommend depositing your data in your $CINECA_SCRATCH area.

    These are the parameters to use when connecting:

    
    host: login.pico.cineca.it
    username: your username
    password: your password
    
    

    For further information, please refer to https://wiki.u-gov.it/confluence/display/SCAIUS/Data+Transfer.

  2. Import data into the genomic repository. To make your data visible from the NIG portal, you also need to import it into the repository. You can do that by uploading files from the web interface or by accessing iRODS directly. Since the first option is only suitable for a limited amount of data, the second is typically adopted. Before you can access iRODS you need an x509 personal certificate (if you don’t have one, ask CINECA to provide you with one), and you must add your “CN” to your UserDB profile settings.
    For further information about iRODS setup, please refer to: https://wiki.u-gov.it/confluence/display/SCAIUS/iRODS-based+REPO.
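
For large transfers such as those in step 1, it may be worth recording checksums before copying and re-checking them on the CINECA side. This is only a sketch and not part of the NIG procedure: make_manifest and verify_manifest are hypothetical helper names, and it assumes the md5sum tool is available on both machines.

```shell
# Sketch: record and verify MD5 checksums around a data transfer.
# Both helpers are hypothetical and assume md5sum is installed.

# make_manifest DIR: write checksums of every file in DIR to DIR/manifest.md5
make_manifest() {
    ( cd "$1" && find . -type f ! -name manifest.md5 -exec md5sum {} + > manifest.md5 )
}

# verify_manifest DIR: exit non-zero if any file differs from the manifest
verify_manifest() {
    ( cd "$1" && md5sum -c manifest.md5 > /dev/null )
}
```

Run make_manifest on the source directory before the transfer, copy manifest.md5 along with the data, then run verify_manifest on the destination directory.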

Import data through the web portal. From the web platform, proceed as follows:

  1. Log in to the NIG web portal (nig.cineca.it)
  2. Create a study (a container used to organize your data; it could correspond to a family)
  3. Within the study, create a dataset (a homogeneous set of files defined by the same set of metadata; it typically corresponds to a sample)
  4. Upload your files into the dataset

Import data through iRODS. You can copy your files directly to iRODS and then ingest them into the web repository. You can ingest single files into an already defined study/dataset organization (see the paragraph above), or you can create folders and subfolders on iRODS to also ingest the study/dataset structure.

Ingest single files:

  1. Log in to the Pico cluster (login.pico.cineca.it)
  2. Initialize the i-commands with iinit
  3. Copy your files to your iRODS home

user@pico:$ iput file.fq.gz /CINECA01/home/DRES_Genomi/YOUR_IRODS_USERNAME/

  4. Log in to the NIG web portal (nig.cineca.it)
  5. Open the study containing the dataset into which you want to ingest your files and open the Stage tab
  6. Import the file into the desired dataset
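
When there are many files, the iput step above can be wrapped in a loop. The following is a minimal sketch: upload_files and the DRY_RUN switch are hypothetical additions for illustration, and only the iput invocation itself comes from this guide.

```shell
# Sketch: upload several files to an iRODS collection, one iput call each.
# upload_files and DRY_RUN are hypothetical, not part of the i-commands.
# Usage: upload_files DEST FILE...
upload_files() {
    dest="$1"; shift
    for f in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only print what would be executed.
            echo "iput $f $dest"
        else
            iput "$f" "$dest" || { echo "upload failed: $f" >&2; return 1; }
        fi
    done
}
```

For example, `DRY_RUN=1 upload_files /CINECA01/home/DRES_Genomi/YOUR_IRODS_USERNAME/ *.fq.gz` prints the commands without contacting iRODS, which is a cheap way to check the file list first.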

Ingest structured files:

  1. Log in to the Pico cluster (login.pico.cineca.it)
  2. Organize your files into a folder (i.e., a study) containing subfolders (i.e., datasets)
  3. Initialize the i-commands with iinit
  4. Copy your folder to your iRODS home (the -r option is needed to upload a folder recursively)
    
    user@pico:$ iput -r your_study_folder /CINECA01/home/DRES_Genomi/YOUR_IRODS_USERNAME/
  5. Log in to the NIG web portal (nig.cineca.it)
  6. Open the study list and import the folder as a new study
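
The folder layout described in step 2 can be prepared with a small script. This is a sketch under an assumed naming scheme: it supposes files are called SAMPLE_something.fastq.gz, so that the text before the first underscore identifies the dataset. build_study is a hypothetical helper name, and your own file names may need a different rule.

```shell
# Sketch: arrange fastq files as STUDY/<sample>/<files>.
# Assumes names like S1_R1.fastq.gz; build_study is a hypothetical helper.
# Usage: build_study SRC_DIR STUDY_DIR
build_study() {
    src="$1"; study="$2"
    for f in "$src"/*.fastq.gz; do
        [ -e "$f" ] || continue        # nothing matched: skip the literal pattern
        name=$(basename "$f")
        sample=${name%%_*}             # text before the first underscore
        mkdir -p "$study/$sample"
        cp "$f" "$study/$sample/"
    done
}
```

The resulting study folder is the one to pass to iput in step 4.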

Ingest semi-structured files. You can also define your studies from the web portal and ingest only datasets:

  1. Log in to the Pico cluster (login.pico.cineca.it)
  2. Put your files into a folder
  3. Initialize the i-commands with iinit
  4. Copy your folder to your iRODS home (the -r option is needed to upload a folder recursively)
    
    user@pico:$ iput -r your_dataset_folder /CINECA01/home/DRES_Genomi/YOUR_IRODS_USERNAME/
  5. Log in to the NIG web portal (nig.cineca.it)
  6. Open the study into which you want to ingest your dataset and open the Stage tab
  7. Import the folder as a new dataset

NOTE: Consider copying into your study any other files containing metadata (e.g., pedigree file) to automate the metadata definition (see paragraphs below).

2. Analyzing your data

This section explains how to analyze your data with the pipeline provided by the NIG consortium.

  1. Analyze the data with the NIG’s pipeline
    1. Login. To perform this step you need to log in to Pico using SSH (for further information: https://wiki.u-gov.it/confluence/display/SCAIUS/Data+management):
      For Windows users: install PuTTY, launch it and fill in the username and host (login.pico.cineca.it)
      For Unix users: open a terminal and issue the following command:

      
      ssh USER@login.pico.cineca.it
      
      
    2. Load the NIG module. CINECA provides a nig module that makes available all the tools needed for the following steps.
      
      user@pico:$ module load nig
      
      
    3. Run the pipeline. The NIG module provides you with 2 types of pipelines:
      1. If you’re analyzing exome-sequencing data, issue the following command:
        
        pipeline -i $INPUT_DIR -o $OUTPUT_DIR -p $TEMP_DIR -c "s" -d $STUDY_ID -T $KIT_NAME -a $PROJECT_NAME -n $SAMPLES -w $WALLTIME
        
        
      2. If you’re analyzing SNP-array data, you have to provide the vcf file yourself.
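
Because the exome command above takes many variables, a small wrapper that checks they are all set can prevent a failed submission. The following is a sketch only: run_exome_pipeline and DRY_RUN are hypothetical names, the flags are copied verbatim from the command above, and this guide does not document the expected format of each value.

```shell
# Sketch: check the variables used by the exome pipeline command, then run it.
# run_exome_pipeline and DRY_RUN are hypothetical; flags come from the guide.
run_exome_pipeline() {
    for v in INPUT_DIR OUTPUT_DIR TEMP_DIR STUDY_ID KIT_NAME PROJECT_NAME SAMPLES WALLTIME; do
        eval "val=\${$v:-}"
        if [ -z "$val" ]; then
            echo "missing variable: $v" >&2
            return 1
        fi
    done
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty,
    # so the command is printed instead of executed.
    ${DRY_RUN:+echo} pipeline -i "$INPUT_DIR" -o "$OUTPUT_DIR" -p "$TEMP_DIR" -c "s" \
        -d "$STUDY_ID" -T "$KIT_NAME" -a "$PROJECT_NAME" -n "$SAMPLES" -w "$WALLTIME"
}
```

Set the variables in your shell, run `DRY_RUN=1 run_exome_pipeline` to review the final command line, then run it again without DRY_RUN after loading the nig module.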

(Copying results into the REPO: after the pipeline has completed, remember to copy the g.vcf file it produced into your REPO space using the i-commands. You can put this file into the same dataset that already contains the input fastq files.)

3. Annotation

The NIG pipeline annotates the g.vcf produced by the first step with a large number of annotation sources, including CADDv1.3, DANN, phastCons, PhyloP, spidex, 1000Gp3, GERP, fitConsv1.01, UK10K, fathmmMKL, ExACr0.3, ESP6500, and many more.

  1. If you’re annotating exome-sequencing data:
    pipeline_joint_variant_calling_and_annotation -a $PROJECT_NAME -i $INPUT_DIR -p $TEMP_DIR -o $OUTPUT_DIR -w $WALLTIME -m $MEMORY
    
    
  2. If you’re annotating SNP-array data:
    pipeline_chip_annotation -a $PROJECT_NAME -i $INPUT_DIR -p $TEMP_DIR -o $OUTPUT_DIR -w $WALLTIME -m $MEMORY
    
    

4. Phasing

Although not required, you can run a third step to obtain phasing-related information.

  1. If you wish to obtain phasing-related information, run the pipeline for phasing. This step produces one file for each member of your family.
  2. Wait for the nig-administrator to complete the second phase of the analysis.

5. Splitting the vcf produced by phasing

TODO

6. Adding metadata

Depending on the size of your data, this procedure can be performed either through the web interface or by importing files.

Define metadata on the Web

  1. Log in to the NIG web portal (nig.cineca.it) and open a Study
  2. Fill in the additional information sections with all the available metadata (e.g., phenotypical information, technical information, etc.)
  3. Link defined metadata to related genomic datasets

Import metadata from annotation files (e.g., .ped)

  1. Upload your .ped file into your Study as a Resource
  2. Wait while the system completes the import procedure
  3. Link defined metadata to related genomic datasets
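
Before uploading a .ped file, a quick structural check can catch truncated lines. This is a minimal sketch assuming the common PLINK-style layout, where each line starts with six columns (family ID, individual ID, father ID, mother ID, sex, phenotype). The exact format expected by the NIG portal is not documented in this guide, and check_ped is a hypothetical helper name.

```shell
# Sketch: verify that every non-empty line of a .ped file has at least 6 columns.
# Assumes PLINK-style columns: family, individual, father, mother, sex, phenotype.
# check_ped is a hypothetical helper, not part of the NIG module.
check_ped() {
    awk 'NF > 0 && NF < 6 {
             printf "line %d: expected at least 6 columns, found %d\n", NR, NF
             bad = 1
         }
         END { exit bad + 0 }' "$1"
}
```

The function exits with status 0 when the file passes and prints the offending line numbers otherwise, for example `check_ped family01.ped && echo "ped looks well-formed"`, where family01.ped is a placeholder name.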

Upload annotation files on iRODS

  1. Upload your annotation files to iRODS (see paragraph 1. Loading the data for details)
  2. Log in to the NIG web portal (nig.cineca.it)
  3. Open the study into which you want to ingest your annotation file and open the Stage tab
  4. Import the annotation files as a new Resource
  5. Link defined metadata to related genomic datasets

Please note that you can also follow the instructions in paragraph 1.2, section Ingest structured files, and put the annotation files directly in the study folder. In that case, you can skip directly to the last step here (Link defined metadata to related genomic datasets).