Data Integrity

Invalid input BED regions

GREAT requires valid input BED data that adheres to the UCSC BED standard. In addition to adhering to the UCSC BED standard, GREAT confirms that the input data meets certain "sanity" requirements. If any lines in the input data do not conform to the sanity checks listed below, GREAT will abort and notify you of the error.

  • One or both of chromStart, chromEnd is a negative number
  • chromStart > chromEnd
  • One or both of chromStart, chromEnd is larger than the chromosome's actual size
  • chrom is not a valid chromosome
    • BED format requires chromosome names (e.g. "chr12") and not just chromosome numbers (e.g. "12").
  • The score, if provided, is not an integer

BED regions over assembly gaps

GREAT uses association rules to assign a regulatory domain to every gene in the genome. Such regulatory domains can extend through assembly gaps. The weight assigned to a particular ontology term in the binomial test is calculated as the union of the regulatory domains for all genes annotated with the term, minus the assembly gaps. Yet, input data that maps to assembly gaps is assigned to nearby genes even though those assembly gaps are not used in the calculation of the gene regulatory domain size. If your input regions map to assembly gaps, the regions will be assigned to the neighboring genes and included in the statistical tests.

Practically, this somewhat quirky behavior does not affect calculations for any real human or mouse data, as these genome assemblies are extremely high quality and thus no real data should map to assembly gaps. In the future, as GREAT supports more species with less-finished genomes, it is conceivable that an input region could uniquely map to a portion of the genome with its midpoint in a small assembly gap. In our judgment, the current implementation, which includes (rather than ignores) such regions is the appropriate approach to incorporate input mapping to assembly gaps.