...

Because the zebrafish genome has no UCSC Known Genes set, we used the following major transcript and protein source databases to obtain a comprehensive, high-quality gene set:

  • RefSeq transcripts
  • Ensembl coding genes
  • RefSeq proteins
  • UniProt proteins

...

We included only transcripts or proteins that belong to ZFIN genes, because almost all ontologies map annotations to ZFIN genes.

...

From all hits of transcripts/proteins in each of the three steps above, we retained only the best hit per locus, which effectively handles matches of paralogs. As a substantial number of bona fide genes (such as Ctnnbl1 or Wnt9a) map to scaffolds, we include all gene-containing scaffolds in zebrafish GREAT. In contrast to the human and mouse gene sets, we also keep genes that currently do not possess a meaningful GO annotation, because a manual inspection found that the human ortholog often has annotations. Some of these genes have annotations in other ontologies.
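The "best hit per locus" filter can be illustrated with a minimal sketch. It assumes that a locus is a run of overlapping hits on the same chromosome and strand, and that each hit carries a numeric alignment score where higher is better. Neither assumption is stated above, and the `Hit` type and `best_hit_per_locus` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    name: str     # transcript or protein identifier
    chrom: str    # chromosome or scaffold name
    start: int    # genomic start of the hit
    end: int      # genomic end of the hit (start < end)
    strand: str   # "+" or "-"
    score: float  # ASSUMPTION: alignment score, higher is better


def best_hit_per_locus(hits: List[Hit]) -> List[Hit]:
    """Group overlapping same-strand hits into loci; keep the top-scoring hit of each."""
    ordered = sorted(hits, key=lambda h: (h.chrom, h.strand, h.start))
    best: List[Hit] = []
    locus: List[Hit] = []
    locus_end = -1
    for h in ordered:
        same_locus = (
            bool(locus)
            and h.chrom == locus[0].chrom
            and h.strand == locus[0].strand
            and h.start < locus_end
        )
        if same_locus:
            # Extend the current locus with this overlapping hit.
            locus.append(h)
            locus_end = max(locus_end, h.end)
        else:
            # Close the previous locus (if any) and start a new one.
            if locus:
                best.append(max(locus, key=lambda x: x.score))
            locus = [h]
            locus_end = h.end
    if locus:
        best.append(max(locus, key=lambda x: x.score))
    return best
```

Under these assumptions, a transcript that also aligns to a paralogous locus is dropped there whenever the paralog's own transcript scores higher at that locus.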

...

Of the RefSeq transcripts that are associated with a ZFIN gene ID, our gene set contains genes corresponding to 14,720 (95%). Of the 13,104 Ensembl genes that are associated with a ZFIN gene ID, our gene set contains genes corresponding to 12,378 (94.5%).

Finally, the combined use of RefSeq, Ensembl, and UniProt substantially increased the number of genes that have annotations in our ontologies. If our gene set were based on RefSeq transcripts alone, we would miss 1,912 genes with annotations. Similarly, using only Ensembl transcripts, we would miss 1,218 genes with annotations.

...

Set of Genes for GREAT 4.0

...

File Format

  • Human/Mouse

     <ucscClusterId or Ensembl/MGI gene ID> <TAB> <tssChrom> <TAB> <tssCoord> <TAB> <tssStrand> <TAB> <geneSymbol>

  • Zebrafish

     <rowNumber> <TAB> <tssChrom> <TAB> <tssCoord> <TAB> <tssStrand> <TAB> <geneSymbol>
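
Both layouts have five tab-separated fields, so one reader can handle either file. The following is a minimal sketch, assuming one record per line and an integer TSS coordinate. The `TssRecord` type, the `read_tss_records` function, and the example line are hypothetical and not part of GREAT.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class TssRecord(NamedTuple):
    gene_id: str  # ucscClusterId, Ensembl/MGI gene ID, or zebrafish row number
    chrom: str    # tssChrom
    coord: int    # tssCoord
    strand: str   # tssStrand, kept exactly as written in the file
    symbol: str   # geneSymbol


def read_tss_records(handle: TextIO) -> Iterator[TssRecord]:
    """Yield one TssRecord per non-empty line of a five-column, tab-separated file."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"line {line_number}: expected 5 tab-separated fields, got {len(fields)}"
            )
        gene_id, chrom, coord, strand, symbol = fields
        yield TssRecord(gene_id, chrom, int(coord), strand, symbol)


# Hypothetical example line in the zebrafish layout (values are made up).
example = "1\tchr1\t12345\t+\tgeneA\n"
records = list(read_tss_records(io.StringIO(example)))
```

The identifier column is kept as a string so the same record type covers cluster IDs, gene IDs, and row numbers.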
    


...

Many genes have multiple splice variants; however, the vast majority of annotations available for these genes do not (and often cannot) distinguish between the different isoforms. Motivated by this observation, GREAT uses a single transcription start site to represent each gene when calculating gene regulatory domains. For the human and mouse genomes, GREAT uses the transcription start site of the canonical isoform of a gene. The definition of the canonical isoform is taken from the knownCanonical table of the UCSC Known Genes track (1). For zebrafish, we take the most upstream transcription start site.
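
Choosing the most upstream transcription start site depends on the strand: on the plus strand it is the smallest transcript start, and on the minus strand it is the largest transcript end. The sketch below assumes each transcript is given as a (start, end) pair with start < end, both ends inclusive, and that all transcripts of a gene share one strand. The function name `most_upstream_tss` is hypothetical.

```python
from typing import Iterable, Tuple


def most_upstream_tss(transcripts: Iterable[Tuple[int, int]], strand: str) -> int:
    """Return the most upstream TSS among a gene's transcripts.

    transcripts: (start, end) genomic coordinates with start < end.
    strand: "+" or "-" (ASSUMPTION: shared by all transcripts of the gene).
    """
    spans = list(transcripts)
    if not spans:
        raise ValueError("a gene needs at least one transcript")
    if strand == "+":
        # Transcription runs left to right, so the TSS is the start coordinate.
        return min(start for start, _ in spans)
    if strand == "-":
        # Transcription runs right to left, so the TSS is the end coordinate.
        return max(end for _, end in spans)
    raise ValueError(f"unknown strand: {strand!r}")
```

For example, with isoforms spanning (100, 500), (150, 500), and (80, 400), the function picks 80 on the plus strand and 500 on the minus strand.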

...