Intro

aDiff is an annotation tool for differential gene expression results generated by cuffdiff (Trapnell C., Nature Biotechnology, 2012).

It annotates cuffdiff outputs with Ensembl gene IDs, Gene Ontology terms and KEGG IDs.

Additionally, it uses the DAVID API (Huang DW, Nature Protoc., 2009; Huang DW, Nucleic Acids Res., 2009; Xiaoli J, Bioinformatics, 2012) to perform enrichment analysis.

A running Cytoscape (Shannon P, Genome Research, 2003) instance with the String (Szklarczyk D, Nucleic Acids Res., 2017) App installed can also be plugged in to generate expanded protein-protein interaction networks.

For a full RNAseq pipeline that includes aDiff, see: http://bioinformatics.age.mpg.de/presentations-tutorials/presentations/modules/rnaseq-tuxedo-update/#/intro

Examples

Example of an aDiff call on a C. elegans dataset:

$ aDiff -D -i cuffdiff_output -o adiff_output \
-G references/cel.latest.ensembl.gtf \
-C cuffmerge_output/merged.gtf \
--DAVIDuser "<Registered.Email@david.com>" \
--organismtag CEL \
--cytoscape_host 'localhost' \
--cytoscape_port 1234

Example of an aDiff call on a D. melanogaster dataset:

$ aDiff -D -i cuffdiff_output -o adiff_output \
-G references/Drosophila_melanogaster.BDGP6.90.gtf \
-C cuffmerge_output/merged.gtf \
--dataset dmelanogaster_gene_ensembl \
--filter flybase_gene_id \
--outputBiotypes 'flybase_gene_id gene_biotype' \
--outputGoterms 'flybase_gene_id go_id name_1006' \
--DAVIDid FLYBASE_GENE_ID \
--DAVIDuser "<Registered.Email@david.com>" \
--organismtag DMEL \
--species 'drosophila melanogaster' \
--cytoscape_host 'localhost' \
--cytoscape_port 1234

Example of an aDiff call on a Mus musculus dataset:

$ aDiff -i cuffdiff_output -o adiff_output \
-G ensembl.mus_musculus.83.original.gtf \
-C cuffmerge_output/merged.gtf \
--TSV \
--dataset mmusculus_gene_ensembl \
-u "<Registered.Email@david.com>" \
--DAVIDid ENSEMBL_GENE_ID \
--host http://dec2015.archive.ensembl.org/biomart \
--organismtag MUS \
--species 'mus musculus' \
--cytoscape_host 'localhost' \
--cytoscape_port 1234

Example of an aDiff call on a H. sapiens dataset:

$ aDiff -i cuffdiff_output -o adiff_output \
-G ensembl.homo_sapiens.83.original.gtf \
-C cuffmerge_output/merged.gtf \
--TSV \
--dataset hsapiens_gene_ensembl \
-u "<Registered.Email@david.com>" \
--DAVIDid ENSEMBL_GENE_ID \
--host http://dec2015.archive.ensembl.org/biomart \
--organismtag HSA \
--species 'homo sapiens' \
--cytoscape_host 'localhost' \
--cytoscape_port 1234
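
When aDiff is run over several datasets, it can be convenient to assemble calls like the ones above from a script. The following is a minimal Python sketch, not part of aDiff itself: the helper name build_adiff_command and its parameters are hypothetical, and it only uses flags that appear in the examples and in the help text.

```python
import shlex


def build_adiff_command(input_folder, output_folder, original_gtf, merged_gtf,
                        david_user=None, extra_options=None):
    """Assemble an aDiff argument list (hypothetical helper, not part of aDiff)."""
    cmd = ["aDiff", "-i", input_folder, "-o", output_folder,
           "-G", original_gtf, "-C", merged_gtf]
    if david_user is not None:
        # -D switches on DAVID enrichment; --DAVIDuser is the registered e-mail.
        cmd += ["-D", "--DAVIDuser", david_user]
    for flag, value in (extra_options or {}).items():
        cmd.append(flag)
        if value is not None:  # a value of None marks a boolean switch such as --TSV
            cmd.append(str(value))
    return cmd


cmd = build_adiff_command(
    "cuffdiff_output", "adiff_output",
    "references/cel.latest.ensembl.gtf", "cuffmerge_output/merged.gtf",
    david_user="<Registered.Email@david.com>",
    extra_options={"--organismtag": "CEL", "--TSV": None},
)
# Print the command in a shell-safe form for inspection.
print(shlex.join(cmd))
# To actually run it (requires aDiff on PATH): subprocess.run(cmd, check=True)
```

Building the call as a list rather than a single string avoids quoting problems with values such as 'drosophila melanogaster'.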

Output files

Example of the output for the H. sapiens call above.

  • diff_sig_geneexp.xlsx: reports significant differential gene expression. It is based on the gene_exp.diff output file of cuffdiff, with annotation columns added. It contains one sheet per pairwise comparison, filtered to significant values (as defined by cuffdiff).

  • diff_sig_iso.xlsx: reports significant differential isoform expression. It is based on the isoform_exp.diff output file of cuffdiff, with annotation columns added. It contains one sheet per pairwise comparison, filtered to significant values (as defined by cuffdiff).

  • diff_sig_prom.xlsx: reports significant differential promoter usage. It is based on the promoters.diff output file of cuffdiff, with annotation columns added. It contains one sheet per pairwise comparison, filtered to significant values (as defined by cuffdiff).

  • diff_sig_splic.xlsx: reports significant differential splicing. It is based on the splicing.diff output file of cuffdiff, with annotation columns added. It contains one sheet per pairwise comparison, filtered to significant values (as defined by cuffdiff).

  • diff_sig_cds.xlsx: reports significant differential CDS usage. It is based on the cds.diff output file of cuffdiff, with annotation columns added. It contains one sheet per pairwise comparison, filtered to significant values (as defined by cuffdiff).

  • geneexp_ALL.tsv: the gene_exp.diff output file of cuffdiff with annotation columns added.

  • iso_ALL.tsv: the isoform_exp.diff output file of cuffdiff with annotation columns added.

  • prom_ALL.tsv: the promoters.diff output file of cuffdiff with annotation columns added.

  • splic_ALL.tsv: the splicing.diff output file of cuffdiff with annotation columns added.

  • cds_ALL.tsv: the cds.diff output file of cuffdiff with annotation columns added.

  • diff_p.05.xlsx: contains one sheet for each of the files above (i.e. geneexp_ALL.tsv, iso_ALL.tsv, prom_ALL.tsv, splic_ALL.tsv, cds_ALL.tsv), subset to p values below 0.05.

  • KEGG_PATHWAY_diff_sig_geneexp.xlsx: based on the gene_exp.diff output file of cuffdiff. It contains a result sheet for each pairwise comparison and reports DAVID enrichment results for KEGG using the genes labeled as significant by cuffdiff.

  • GOTERM_BP_FAT_diff_sig_splic.xlsx: based on the splicing.diff output file of cuffdiff. It contains a result sheet for each pairwise comparison and reports DAVID enrichment results for Gene Ontology Biological Process (GOTERM BP) using the genes labeled as significant by cuffdiff.

  • OMIM_DISEASE_diff_sig_geneexp.xlsx: based on the gene_exp.diff output file of cuffdiff. It contains a result sheet for each pairwise comparison and reports DAVID enrichment results for OMIM DISEASE using the genes labeled as significant by cuffdiff.
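
The two kinds of subsetting described in the list above (rows labeled significant by cuffdiff, and rows with p values below 0.05) can be reproduced on any of the *_ALL.tsv tables. The sketch below is only an illustration: it assumes the annotated tables keep cuffdiff's standard p_value and significant columns, and the function name and the three example rows are made up.

```python
import csv
import io


def subset_rows(tsv_text, max_p=0.05):
    """Return (rows cuffdiff labeled significant, rows with p_value < max_p)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # cuffdiff marks each test with "yes" or "no" in the 'significant' column.
    significant = [r for r in rows if r["significant"] == "yes"]
    # The p value subset ignores the multiple-testing corrected q_value.
    below_p = [r for r in rows if float(r["p_value"]) < max_p]
    return significant, below_p


# Made-up rows standing in for a table such as geneexp_ALL.tsv.
example = (
    "gene\tp_value\tq_value\tsignificant\n"
    "geneA\t0.0001\t0.004\tyes\n"
    "geneB\t0.03\t0.21\tno\n"
    "geneC\t0.6\t0.9\tno\n"
)
significant, below_p = subset_rows(example)
print([r["gene"] for r in significant])  # ['geneA']
print([r["gene"] for r in below_p])      # ['geneA', 'geneB']
```

The example shows why the two subsets differ: geneB has a raw p value below 0.05 but is not significant after cuffdiff's multiple-testing correction.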

DAVID output columns:

  • categoryName: Category name. eg.: GOTERM_BP_FAT.

  • termName: Term name. eg.: GO:0048468~cell development.

  • listHits: Number of items in the query list matching this term.

  • percent: Percentage of items in the query list matching this term.

  • ease: EASE score, DAVID's modified Fisher exact test p value.

  • geneIds: gene ids.

  • Gene_name: gene name.

  • listTotals: number of genes in query list.

  • popHits: number of genes in background population list matching this term.

  • popTotals: number of genes in background population list.

  • foldEnrichment: Fold enrichment.

  • bonferroni: Bonferroni corrected p values.

  • benjamini: Benjamini-Hochberg corrected p values.

  • afdr: False discovery rate.
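
To make the relation between the count columns and the derived columns concrete, here is a small worked sketch. The fold enrichment ratio is the usual (listHits/listTotals)/(popHits/popTotals); the Benjamini-Hochberg function is the textbook step-up procedure and is only assumed to approximate what DAVID reports. The function names and the numbers are illustrative.

```python
def fold_enrichment(list_hits, list_totals, pop_hits, pop_totals):
    """Fraction of the query list in a term divided by the background fraction."""
    return (list_hits / list_totals) / (pop_hits / pop_totals)


def benjamini_hochberg(p_values):
    """Textbook Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so each value is capped by later ones.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# 3 of 300 query genes match a term that covers 40 of 30000 background genes.
print(round(fold_enrichment(3, 300, 40, 30000), 2))  # 7.5

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(p, 4) for p in adjusted])  # [0.02, 0.04, 0.04, 0.02]
```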

More information on the standard output columns of cuffdiff can be found in the cuffdiff documentation.

The cytoscape folder contains Cytoscape session files (cys) as well as PDFs and PNGs of the generated networks. Networks are generated by String PPI queries allowing a 25% size expansion and a confidence cutoff of 0.4 (see the --limit and --cuttoff options). aDiff also generates a subnetwork by ranking the genes by abs(log2(fold change)) and selecting the top 10% of nodes with edges together with their first neighbours, as well as a second subnetwork from the same 10% selection but using diffusion. Node color maps log2(fold change) (blue down, red up), while node border color and size map normalized expression.
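
The selection logic described above can be sketched in a few lines of Python. This is an illustration, not aDiff's actual code: the function names, the rounding (floor, with at least one node kept) and the toy network are assumptions, and the diffusion-based variant is not covered.

```python
def default_limit(n_query_genes):
    """Default number of extra genes requested: N(query_genes) * 0.25."""
    return int(n_query_genes * 0.25)


def top_nodes_with_neighbours(log2fc, edges, percent=10):
    """Pick the top `percent` of connected nodes by abs(log2 fold change),
    then add their first neighbours."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    # Only nodes that have edges and a fold change value are ranked.
    connected = [n for n in neighbours if n in log2fc]
    ranked = sorted(connected, key=lambda n: abs(log2fc[n]), reverse=True)
    # Assumption: round down, but keep at least one node if any are ranked.
    n_top = max(1, len(ranked) * percent // 100) if ranked else 0
    top = set(ranked[:n_top])
    selected = set(top)
    for n in top:
        selected |= neighbours[n]
    return top, selected


# Toy chain network g0-g1-...-g9; g3 has the strongest change.
log2fc = {f"g{i}": 0.1 * i for i in range(10)}
log2fc["g3"] = -4.0
edges = [(f"g{i}", f"g{i + 1}") for i in range(9)]

top, selected = top_nodes_with_neighbours(log2fc, edges)
print(sorted(top))         # ['g3']
print(sorted(selected))    # ['g2', 'g3', 'g4']
print(default_limit(200))  # 50
```

Ranking on the absolute value means strongly down-regulated genes such as g3 are selected alongside strongly up-regulated ones.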

Help

$ aDiff --help

aDiff is an annotation tool for differential gene expression results generated
by cuffdiff (Trapnell C., Nature Biotechnology, 2012).

usage: aDiff [-h] [-D] [-i INPUTFOLDER] [-o OUTPUTFOLDER] [-G ORIGINALGTF]
             [-C CUFFCOMPAREGTF] [-f INPUTFILES] [-s SHORTOUTPUTNAME]
             [--sigOnly] [--TSV] [--TSVall] [--description] [--listMarts]
             [--mart MART] [--listDatasets] [--dataset DATASET]
             [--listFilters] [--filter FILTER] [--listAttributes]
             [--outputBiotypes OUTPUTBIOTYPES] [--outputGoterms OUTPUTGOTERMS]
             [--KEGG] [--listKEGGorganisms] [--KEGGorg KEGGORG] [--findKEGGdb]
             [--KEGGdb KEGGDB] [--DAVIDid DAVIDID] [--DAVIDcat DAVIDCAT]
             [-u DAVIDUSER] [--host HOST] [--organismtag {DMEL,CEL,MUS,HSA}]
             [--species SPECIES] [--limit LIMIT] [--cuttoff CUTTOFF]
             [--taxon TAXON] [--cytoscape_host CYTOSCAPE_HOST]
             [--cytoscape_port CYTOSCAPE_PORT]

optional arguments:
  -h, --help            show this help message and exit
  -D, --DAVID           Use this flag to perform DAVID GO enrichment analysis
                        (default: False)
  -i INPUTFOLDER, --inputFolder INPUTFOLDER
                        Cuffdiff output folder (default: None)
  -o OUTPUTFOLDER, --outputFolder OUTPUTFOLDER
                        Output folder (default: None)
  -G ORIGINALGTF, --originalGTF ORIGINALGTF
                        Original/downloaded GTF (default: None)
  -C CUFFCOMPAREGTF, --cuffcompareGTF CUFFCOMPAREGTF
                        Merged cuffcompared GTF (default: None)
  -f INPUTFILES, --inputFiles INPUTFILES
                        Implies -s. Use this option to select which *.diff
                        files you wish to analyse.'. (default: gene_exp.diff
                        promoters.diff splicing.diff cds.diff
                        isoform_exp.diff)
  -s SHORTOUTPUTNAME, --shortOutputName SHORTOUTPUTNAME
                        Use this option to select a short outpput name for
                        each *.diff file used in '-f'. No '.' (dots) allowed.
                        (default: geneexp prom splic cds iso)
  --sigOnly             Only create report tables for cuffdiff-labeled
                        significantly changed genes (default: False)
  --TSV                 For p values > = 0.05 write tables as tab separated
                        values (default: False)
  --TSVall              Save p < 0.05 save tables as tab separated values in a
                        folder called TSV (default: False)
  --description         Get a description of what this script does. (default:
                        False)
  --listMarts           List biomaRt Marts (default: False)
  --mart MART           Your mart of choice. (default: ENSEMBL_MART_ENSEMBL)
  --listDatasets        List datasets for your mart (default: False)
  --dataset DATASET     Dataset of your choice. (default:
                        celegans_gene_ensembl)
  --listFilters         List available filters (default: False)
  --filter FILTER       Filter to use to identify your genes. (default:
                        ensembl_gene_id)
  --listAttributes      List available attributes for your dataset. (default:
                        False)
  --outputBiotypes OUTPUTBIOTYPES
                        Outputs/attributes for your biotypes data. Order has
                        to be kept, ie. first IDs then biotype. (default:
                        ensembl_gene_id gene_biotype)
  --outputGoterms OUTPUTGOTERMS
                        Outputs/attributes for your goterms data. Order has to
                        be kept, ie. 1st gene_id, then go_id, then
                        go_term_name (default: ensembl_gene_id go_id
                        name_1006)
  --KEGG                Add KEGG annotations (default: False)
  --listKEGGorganisms   List KEGG organisms. (default: False)
  --KEGGorg KEGGORG     KEGG organism. (default: cel)
  --findKEGGdb          KEGG has DB identifier for each linked DB. Use this
                        function to find the label of your DB, eg: 'ensembl-
                        hsa', 'FlyBase'. This option requires --originalGTF
                        and --KEGGorg (default: False)
  --KEGGdb KEGGDB       KEGG database linked to your ensembl organism.
                        (default: EnsemblGenomes-Gn)
  --DAVIDid DAVIDID     DAVID's id for your dataset. List of ids available in
                        http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_A
                        PI.html#input_list (default: WORMBASE_GENE_ID)
  --DAVIDcat DAVIDCAT   DAVID's categories you wish to analyse. List of
                        available categories in https://david.ncifcrf.gov/cont
                        ent.jsp?file=DAVID_API.html#approved_list. (default: G
                        OTERM_BP_FAT,GOTERM_CC_FAT,GOTERM_MF_FAT,KEGG_PATHWAY,
                        PFAM,PROSITE,GENETIC_ASSOCIATION_DB_DISEASE,OMIM_DISEA
                        SE)
  -u DAVIDUSER, --DAVIDuser DAVIDUSER
                        Your DAVID's user id. example: 'John.Doe@age.mpg.de'
                        (default: None)
  --host HOST           Ensembl host. Check http://www.ensembl.org/info/websit
                        e/archives/index.html for older releases. (default:
                        http://www.ensembl.org/biomart)
  --organismtag {DMEL,CEL,MUS,HSA}
                        Organism tag. (default: None)
  --species SPECIES     Species for string app query. eg. 'caenorhabditis
                        elegans', 'drosophila melanogaster', 'mus musculus',
                        'homo sapiens'. Default='caenorhabditis elegans'
                        (default: caenorhabditis elegans)
  --limit LIMIT         Limit for string app query. Number of extra genes to
                        recover. If None, limit=N(query_genes)*.25 (default:
                        None)
  --cuttoff CUTTOFF     Confidence cuttoff for sting app query. Default=0.4
                        (default: 0.4)
  --taxon TAXON         Taxon id for string app query. For the species shown
                        above, taxon id will be automatically identified.
                        (default: None)
  --cytoscape_host CYTOSCAPE_HOST
                        Host address for cytoscape. (default: None)
  --cytoscape_port CYTOSCAPE_PORT
                        Cytoscape port. (default: None)