REDfly seeks to include all experimentally verified fly cis-regulatory modules (CRMs) and transcription factor binding sites (TFBSs), along with their DNA sequence, their associated genes, and the expression patterns they direct. At this time, REDfly only contains CRMs and TFBSs from D. melanogaster, despite the growing number of functionally tested sequences from other fly species. Currently we focus on sequences that have been tested by reporter gene assays in transgenic animals or cultured cells, and on binding sites discovered by DNaseI footprinting and electrophoretic mobility shift (gel shift) assays. These sequences are stored in REDfly as two main data classes: Reporter Constructs and TFBSs.
Any sequence tested by reporter gene assay is included in REDfly as a “Reporter Construct” (RC) and has three associated attributes: (1) expression; (2) CRM; and (3) minimization.
Expression has value “positive” or “negative” and describes whether or not the sequence was reported to drive gene expression in the reporter gene assay. RCs with positive expression have their expression patterns annotated (see Expression Pattern Annotations).
Note that RCs recorded as “negative” for regulatory activity should be treated with caution by the user; as with any negative data, the failure to observe reporter gene activity could simply reflect a failure of the assay rather than a biological result. The sequence might still mediate gene regulation in a tissue not examined by the reporting researcher, require a promoter different than the one used in the reporter constuct in order to function, or be a silencer or other form of negative regulatory element not detectable in the assay.
If only a single sequence covering a given set of genomic coordinates is annotated, this sequence is considered a CRM. Where multiple nested sequences with identical activity are present, the shortest such sequence is designated as a CRM. In other words, we define CRM as the minimal-length reporter construct in a set of one or more nested reporter constructs that produce the same gene expression pattern (see Figure 1).
Note that on occasion it will appear as though several nested RCs have the same activity (i.e., are associated with identical expression terms) yet are all designated as CRMs. This situation arises when the constructs actually drive different gene expression patterns, but at a level that is not easily captured by the anatomy ontology used to annotate the expression (e.g., two different subsets of motor neuron, both annotated simply as “motor neuron”). These differences will usually be clarified in the free text notes accompanying the record.
When a CRM is part of a set of nested sequences, rather than a single tested sequence at a particular locus, we say that the CRM and associated RCs have undergone “minimization.” (see Figure 1)
Additional information recorded for Reporter Constructs is detailed below.
TFBSs in REDfly derive mainly from two sources of evidence: DNAse I footprinting experiments and electrophoretic mobility shift assays (EMSA, “gel shift”).
For footprinting experiments, when a binding factor purified from nuclear extract has been shown to be the derivative of a specific gene, footprints were attributed to the gene encoding that factor, otherwise the binding factor for nuclear extract footprints has been left as "unspecified." Where possible we followed the rule of precedence in attributing footprint data to a particular reference, unless members of the same research group reported refined coordinates in a subsequent publication. When two or more overlapping motifs for the same transcription factor were reported for a single footprinted region, they were merged and annotated as one footprint. References that used non-D. melanogaster proteins or non-D. melanogaster target DNA have been excluded, since these experiments do not represent biological meaningful regulatory interactions in vivo. The majority of footprinted sites were assembled initially from the FlyReg database.
Whereas DNAse I footprinting provides an exact sequence for the binding site, TFBSs obtained from EMSA experiments formally can be said only to bind somewhere within the sequence of the probe used in the assay (typically 20-50 bp in length). In most cases, the authors have provided a presumed binding sequence within the probe, and we have used this to represent the binding site.
Yeast one-hybrid (Y1H) data are derived from high-throughput Y1H studies such as those described by Hens et al. (2011) Nature Methods, 8(12), 1065–1070. Unlike footprinting assays, which provide a defined binding site, and EMSAs, in which the binding sequence is often inferred by authors, YIH data, if derived from large bait sequences, can be of much lower sequence resolution. To prevent such sequences from showing up in TFBS search results, restrict the “evidence types” to exclude Y1H data, or use the “maximum size” advanced search option to restrict results to short sequences.
Sequences suspected to be CRMs based on regions of overlap between reporter constructs with similar activity, but not experimentally demonstrated to be so, are designated as “inferred CRMs.” Note that unlike Reporter Constructs, inferred CRMs have no empirical evidence supporting their functionality. (see Figure 1) At present, sets of overlapping RCs that include an RC with “negative” expression activity are excluded from determination of iCRMs.
CRMs predicted based on computational or genomic assays, but not tested experimentally for regulatory function, are “predicted CRMs (pCRMs).” At present, pCRMs falling within the defined search locus are displayed in the Predicted CRM tab, but are not available for download and cannot be searched directly. More extensive pCRM search, display, and download capabilities will be included in future releases. In the meantime, you can contact us for a full list of REDfly pCRMs.