|Regulatory Element Database for Drosophila v3.2|
REDfly User Guide
REDfly seeks to include all experimentally verified fly cis-regulatory modules (CRMs) and transcription factor binding sites (TFBSs), along with their DNA sequence, their associated genes, and the expression patterns they direct. At this time, REDfly only contains CRMs and TFBSs from D. melanogaster, despite the growing number of functionally tested sequences from other fly species. Currently we focus on sequences that have been tested by reporter gene assays in transgenic animals, and on binding sites discovered by DNase I footprinting and electrophoretic mobility shift (gel shift) assays. These sequences are stored in REDfly as two main data classes: Reporter Constructs and TFBSs.
Any sequence tested by reporter gene assay is included in REDfly as a “Reporter Construct” (RC) and has three associated attributes: (1) expression; (2) CRM; and (3) minimization.
(1) Expression has value “positive” or “negative” and describes whether or not the sequence was reported to drive gene expression in the reporter gene assay. RCs with positive expression have their expression patterns annotated (see Expression Pattern Annotations).
Note that RCs recorded as “negative” for regulatory activity should be treated with caution by the user; as with any negative data, the failure to observe reporter gene activity could simply reflect a failure of the assay rather than a biological result. The sequence might still mediate gene regulation in a tissue not examined by the reporting researcher, require a promoter different from the one used in the reporter construct in order to function, or be a silencer or other form of negative regulatory element not detectable in the assay.
(2) If only a single sequence covering a given set of genomic coordinates is annotated, this sequence is considered a CRM. Where multiple nested sequences with identical activity are present, the shortest such sequence is designated as a CRM. In other words, we define CRM as the minimal-length reporter construct in a set of one or more nested reporter constructs that produce the same gene expression pattern (see Figure 1).
Note that on occasion it will appear as though several nested RCs have the same activity (i.e., are associated with identical expression terms) yet are all designated as CRMs. This situation arises when the constructs actually drive different gene expression patterns, but at a level that is not easily captured by the anatomy ontology used to annotate the expression (e.g., two different subsets of motor neuron, both annotated simply as “motor neuron”). These differences will usually be clarified in the free text notes accompanying the record.
(3) When a CRM is part of a set of nested sequences, rather than a single tested sequence at a particular locus, we say that the CRM and associated RCs have undergone “minimization” (see Figure 1).
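The CRM and minimization attributes described in (2) and (3) can be sketched in Python. This is an illustration only and not REDfly's own implementation: the Construct record, its field names, and the use of a set of expression terms as a stand-in for "identical activity" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Construct:
    name: str
    chrom: str
    start: int        # one-based, inclusive
    end: int          # one-based, inclusive
    terms: frozenset  # expression terms; assumed proxy for "identical activity"


def contains(outer, inner):
    """True if `inner` lies within `outer` and is strictly shorter."""
    return (outer.chrom == inner.chrom
            and outer.start <= inner.start
            and inner.end <= outer.end
            and (inner.end - inner.start) < (outer.end - outer.start))


def is_crm(rc, all_rcs):
    """A construct is a CRM if no shorter nested construct has the same activity."""
    return not any(contains(rc, other) and other.terms == rc.terms
                   for other in all_rcs if other is not rc)


def is_minimized(rc, all_rcs):
    """A construct is 'minimized' if it nests with at least one other construct."""
    return any(contains(rc, other) or contains(other, rc)
               for other in all_rcs if other is not rc)


mesoderm = frozenset({"mesoderm"})
rcs = [
    Construct("gene_A", "2R", 1000, 2000, mesoderm),  # isolated construct
    Construct("gene_B", "2R", 5000, 5750, mesoderm),  # contains gene_C
    Construct("gene_C", "2R", 5100, 5400, mesoderm),  # shortest in its nest
]
for rc in rcs:
    print(rc.name, is_crm(rc, rcs), is_minimized(rc, rcs))
```

In this toy data, gene_A is a CRM that has not been minimized, gene_B is an RC only, and gene_C is a minimized CRM.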
Additional information recorded for Reporter Constructs is detailed below.
TFBSs in REDfly derive mainly from two sources of evidence: DNase I footprinting experiments and electrophoretic mobility shift assays (EMSA, “gel shift”). Data are also included from high-throughput yeast one-hybrid and “MITOMI analysis of regulatory elements (MARE).”
For footprinting experiments, when a binding factor purified from nuclear extract has been shown to be the derivative of a specific gene, footprints were attributed to the gene encoding that factor; otherwise, the binding factor for nuclear extract footprints has been left as "unspecified." Where possible we followed the rule of precedence in attributing footprint data to a particular reference, unless members of the same research group reported refined coordinates in a subsequent publication. When two or more overlapping motifs for the same transcription factor were reported for a single footprinted region, they were merged and annotated as one footprint. References that used non-D. melanogaster proteins or non-D. melanogaster target DNA have been excluded, since these experiments do not represent biologically meaningful regulatory interactions in vivo. The majority of footprinted sites were assembled initially from the FlyReg database.
Whereas DNase I footprinting provides an exact sequence for the binding site, TFBSs obtained from EMSA experiments formally can be said only to bind somewhere within the sequence of the probe used in the assay (typically 20-50 bp in length). In most cases, the authors have provided a presumed binding sequence within the probe, and we have used this to represent the binding site.
Yeast one-hybrid (Y1H) data are derived from high-throughput Y1H studies such as those described by Hens et al. (2011) Nature Methods, 8(12), 1065–1070. Unlike footprinting assays, which provide a defined binding site, and EMSAs, in which the binding sequence is often inferred by authors, Y1H data, if derived from large bait sequences, can be of much lower sequence resolution. To prevent such sequences from showing up in TFBS search results, restrict the “evidence types” to exclude Y1H data, or use the “maximum size” Advanced Search option to restrict results to short sequences.
Sequences suspected to be CRMs based on regions of overlap between reporter constructs with similar activity, but not experimentally demonstrated to be so, are designated as “inferred CRMs” (iCRMs; see Figure 1). Note that unlike Reporter Constructs, inferred CRMs have no empirical evidence supporting their functionality. At present, RCs with “negative” expression activity are excluded from determination of iCRMs.
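As a rough sketch of the idea, an inferred CRM can be thought of as the region shared by a group of overlapping, expression-positive constructs. The function below is a simplified illustration, not REDfly's actual procedure; it assumes the constructs have already been grouped by similar activity and lie on the same chromosome, and the tuple layout is invented for the example.

```python
def inferred_crm(constructs):
    """Return the (start, end) region shared by all positive constructs, or None.

    `constructs` is a list of (start, end, is_positive) tuples using
    one-based, inclusive coordinates on the same chromosome.
    """
    # Constructs with negative expression are excluded from the calculation.
    positive = [(s, e) for s, e, is_positive in constructs if is_positive]
    if len(positive) < 2:
        return None  # an overlap needs at least two constructs
    start = max(s for s, _ in positive)
    end = min(e for _, e in positive)
    return (start, end) if start <= end else None


constructs = [
    (7000, 7500, True),
    (7200, 7750, True),
    (7250, 7400, True),
    (7600, 7700, False),  # negative expression: ignored
]
print(inferred_crm(constructs))  # (7250, 7400)
```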
All Reporter Constructs are named beginning with their associated gene symbol followed by an underscore (e.g., eve_). Where RCs are described with specific names in the literature, we have attempted to maintain those names. In those cases where no specific names were given, we have assigned names based either on spatial activity or on position with respect to the gene. TFBSs are named with the convention [name_of_TF]_[name_of_regulated_gene]:REDflyID. In some instances, the name of the associated gene for a CRM or the transcription factor for a TFBS is not known. In these cases, the gene name is given as “unspecified.”
For your convenience, we provide a table of identifier mappings that cross-references REDfly identifiers with their corresponding FlyBase, ORegAnno and FlyReg identifiers.
Sequence coordinates default to the most current release, version 5 (dm3). Coordinates from earlier releases are available through the dropdown at the top of the "basic info" tab and as options for download.
Sequence coordinates are represented in REDfly as one-based start, one-based end (for a discussion of genome coordinate representations, see http://genome.ucsc.edu/FAQ/FAQtracks.html#tracks1 and http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms).
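Users moving REDfly coordinates into tools that expect zero-based, half-open intervals (the BED convention, for example) need to shift the start position by one. A minimal sketch follows; the function names are hypothetical helpers, not part of REDfly.

```python
def one_based_to_zero_based(start, end):
    """Convert one-based inclusive (start, end) to zero-based half-open."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def zero_based_to_one_based(start0, end0):
    """Convert zero-based half-open (start0, end0) to one-based inclusive."""
    if start0 < 0 or end0 <= start0:
        raise ValueError("expected 0 <= start0 < end0")
    return start0 + 1, end0


def length_one_based(start, end):
    """Length in bp of a one-based inclusive interval."""
    return end - start + 1


print(one_based_to_zero_based(101, 200))  # (100, 200)
print(length_one_based(101, 200))         # 100
```

The length of the feature is the same in both systems: end - start + 1 in one-based inclusive coordinates, and end - start in zero-based half-open coordinates.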
Expression patterns have been annotated using terms from the Drosophila gross anatomy ontology. Both the annotation and the ontology itself are works in progress, so care should be taken when making use of these data. Because expression patterns as described in the literature are not reported using the ontology terms, and are given in varying levels of detail, providing an exact description in the database is not always straightforward. We have provided these descriptions as a way to facilitate searching and grouping the included CRMs. However, we strongly encourage users to consult the notes included with our annotations as well as the original references for more detailed descriptions of expression patterns. In particular, note that the anatomy ontology does not always provide a means to distinguish sub tissue- or organ-level cell populations. Thus, two entries annotated as "wing disc" may in fact refer to non-overlapping cell types within the disc, two entries annotated as "ventral nerve cord" may refer to separate neuronal lineages, etc. Expression patterns are reported based on the textual descriptions given by the authors; we did not attempt to refine these descriptions based on our own analysis of published photographs. See also Searching expression patterns.
In general, TFBS records are not annotated with expression pattern information. However, if a TFBS is contained within an annotated RC/CRM, the TFBS will inherit expression pattern information from the related RC/CRM record. Note that in most cases, it has not been demonstrated that this particular TFBS plays a functional role in mediating any or all of the tissue-specific expression ascribed to the CRM.
Basic search (see Figure 2A) allows for searching by gene name, FlyBase ID (FBgn), FlyBase FBtp number, element name, PubMed ID, or recent updates; the latter will return all records entered on the most recent date of data entry/update. Options to “browse all” records and to download all Reporter Constructs, all CRMs, or all TFBSs are also available.
Gene names should be searched as official FlyBase gene symbols (e.g., dpp, h, betaTub60D) or FlyBase IDs (e.g., FBgn0000490, FBgn0003888). Greek letters have been written out (e.g., alpha, delta). At present only valid primary gene symbols are accepted. If the name of a gene does not appear in the drop-down, it is likely that a synonym rather than the primary name is being used. In such cases, retrieving the proper name from FlyBase and searching again should resolve the problem. The gene name “unspecified” has been included to allow for searching for RCs or TFBSs where the associated gene or transcription factor, respectively, is not known.
A wild-card is automatically appended to the end of the search string for all element name and FBtp number searches (e.g., searching for “eve_stripe” will return “eve_stripe1”, “eve_stripe2”, “eve_stripe3+7”, etc.).
The “advanced search” pane (see Figure 2B) is divided into two tabs, one for Reporter Construct/CRM options and one for TFBS options.
RC/CRM options include searching for all records, for CRM records only, or for CRMs with associated TFBS data only. These can be further filtered for positive vs. negative expression [Negative Expression] and for whether or not an element has undergone minimization [Minimization].
TFBS-specific options allow for searching all TFBSs or only those with associated CRM data. Gene names can be used to search all TFBS records, only those where the named gene is the target, or only those where the named gene encodes the transcription factor.
Details on search options are as follows:
Position search will select any RCs/CRMs or TFBSs located in the specified position relative to their target gene. Options are 5’ to the gene, 3’ to the gene, within an intron, or within an exon. Options are non-exclusive, i.e., an RC that begins 5’ to the gene and extends through the first intron will be found by a search for any of 5’, intron, or exon.
To be considered as overlapping a genomic feature, a regulatory element must extend greater than five bp into that feature. Thus, a CRM in the proximal promoter region that begins 500 bp 5’ to the transcription start of its gene and extends two bp into the first exon is considered to be exclusively 5’ to the gene and will not be returned on a search for elements within exons.
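The five-bp rule can be expressed as a simple interval calculation. The sketch below is illustrative only; the function names and the coordinates in the example are invented, and one-based inclusive coordinates are assumed.

```python
MIN_OVERLAP_BP = 5  # an element must extend MORE than this many bp into a feature


def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two one-based inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def overlaps_feature(element, feature, threshold=MIN_OVERLAP_BP):
    """True if `element` extends more than `threshold` bp into `feature`."""
    return overlap_bp(*element, *feature) > threshold


# Hypothetical forward-strand gene whose first exon spans 10000-10200.
exon = (10000, 10200)
crm = (9500, 10001)  # starts 500 bp 5' of the start site, ends 2 bp into the exon

print(overlap_bp(*crm, *exon))      # 2
print(overlaps_feature(crm, exon))  # False
```

With these numbers, the CRM would be returned by a search for elements 5’ to the gene but not by a search for elements within exons.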
Selecting a maximum size will exclude any RCs/CRMs or TFBSs whose length is greater than the specified value, in basepairs.
The evidence type field allows the user to restrict a search to sequences supported by only certain types of evidence, e.g., TFBSs supported by DNase I footprinting only.
Location search will select any RCs/CRMs or TFBSs lying within the specified sequence range, using release 5/dm3 coordinates. Coordinates from older releases can be converted through FlyBase’s “Coordinates Converter” tool.
Placing a value in the date fields will restrict the search results to those records that have been added or updated, respectively, on or after the chosen date. Use the Last Updated feature to check for additions and corrections since your last search.
Two methods of expression pattern searching have been implemented in REDfly. Searching using the Ontology/Expression Term search field will select records containing the specified term or any of the descendant terms in the ontology hierarchy; checking the “exact term” box will restrict the search to only that term. The Ontology search function therefore provides a way to identify RCs/CRMs that potentially drive similar spatial patterns of expression despite that expression having been described at different levels of detail in the literature. Ontology searching can be conducted either by entering an ontology term in the search box, or by selecting a term from the pop-up Ontology Browser. The search box incorporates a search widget from the National Center for Biomedical Ontology that converts synonyms to preferred terms. For example, typing “wing disc” will automatically bring up “dorsal mesothoracic disc” as the top option in the drop-down.
See also Expression Pattern Annotations.
For example, searching for "mesoderm" using the Expression Term search will return annotations such as
FBbt:00000128, trunk mesoderm
FBbt:00000130, visceral mesoderm
Using "exact match," only FBbt:00000126, mesoderm would be returned.
A search for "mesoderm" using the Ontology Search would return records with the same terms as above, but also with terms such as
FBbt:00000128, trunk mesoderm
FBbt:00000130, visceral mesoderm
FBbt:00005073, somatic muscle
FBbt:00000466, oblique muscle
FBbt:00005247, hemocyte primordium
and so forth. With “exact term” checked, only records explicitly annotated as FBbt:00000126, mesoderm will be returned.
In practice, Exact Term searches will often be too restrictive, and the full Ontology search too permissive. In the future, we hope to improve the Ontology search function to allow greater control over the depth of the search. Users may find it helpful to examine the FlyBase Gene Expression section for aid in navigating the Ontology.
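The difference between the full Ontology search and an "exact term" search can be illustrated with a toy hierarchy. The parent-child links and record names below are invented for the example and do not reproduce the structure of the real anatomy ontology; only the term names are taken from the text above.

```python
# Toy hierarchy: term -> list of direct child terms (illustrative only).
CHILDREN = {
    "mesoderm": ["trunk mesoderm", "visceral mesoderm"],
    "trunk mesoderm": ["somatic muscle"],
    "somatic muscle": ["oblique muscle"],
}

# Hypothetical records: name -> annotated expression terms.
RECORDS = {
    "rc_1": ["mesoderm"],
    "rc_2": ["visceral mesoderm"],
    "rc_3": ["oblique muscle"],
    "rc_4": ["wing disc"],
}


def descendants(term, children):
    """Return `term` plus every term below it in the hierarchy."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found


def search(records, term, children, exact=False):
    """Names of records annotated with `term` (or, if not exact, any descendant)."""
    wanted = {term} if exact else descendants(term, children)
    return [name for name, terms in records.items() if wanted & set(terms)]


print(search(RECORDS, "mesoderm", CHILDREN))              # ['rc_1', 'rc_2', 'rc_3']
print(search(RECORDS, "mesoderm", CHILDREN, exact=True))  # ['rc_1']
```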
The Search Results pane (see Figure 2C) provides a summary view of all records of CRMs/RCs, TFBSs, and Inferred CRMs returned by a search. Each class is returned in a separate tab; numbers in the tab header indicate how many records were returned for each data type. Clicking in the header row will sort results by the selected column. Users can choose to download one or more records directly from this pane in a variety of formats using the “download” button at the bottom (see Downloads). Clicking on a row will open a detailed view window for the record. Alternatively, multiple records can be selected using the check boxes along the left-hand side of the pane and then clicking on the “view selected” button at the bottom. Multiple detailed view windows open in a stack; the “tile windows” button will tile these in the browser window. The “window tab selector” brings the selected tab (see below) to the foreground in each open detailed view window.
Results for each record are presented in a detailed view window composed of multiple tabs displaying different sections of the information for each entry.
The Basic Info tab (see Figure 2D) contains the genomic coordinates of the feature based on the current sequence release. Coordinates for older releases can be obtained using the “previous coordinates” button. For RCs, the RC attributes (has_expression, is_CRM, is_minimized) are listed. Other information contained in the Basic Info tab includes the species (currently only D. melanogaster); the name of the associated gene(s) with links to FlyBase and FlyMine, and, for TFBSs, FlyTF records; and links to the FlyBase, Gbrowse and UCSC genome browsers. Note that because TFBSs and CRMs are not strand-specific sequence features, no strand information is reflected in the graphical views. When accessing the FlyBase Gbrowse genome browser we have occasionally experienced a timeout error and are working to diagnose the cause. The REDfly ID of the record and date of the last update are also provided.
The Location tab (see Figure 2E) provides a snapshot of the genomic region taken from the FlyMine Gbrowse implementation and displays genes, transcripts, regulatory regions, and TFBSs. The current feature is highlighted in blue and is further marked by gray shading extending vertically throughout the image. In cases of inferred CRMs, or where the FlyMine annotation has not caught up with REDfly annotations, the feature will not be displayed, but the gray bar will indicate its proper position. Note that the feature tracks use data from FlyMine and may therefore contain additional features not present in the REDfly annotation. This is especially true with respect to TFBSs, which in the case of FlyMine include predicted binding sites as well as empirically verified ones. The position of the feature relative to transcripts of the associated gene is provided above the graphic. The coordinates of the feature can be found below the graphic.
The Images tab (RCs/CRMs only; see Figure 2F) shows the expression pattern of the reporter gene. These images are provided courtesy of FlyExpress and clicking on the image will bring the user to the FlyExpress website, from which a search can be initiated for other genes with a similar expression pattern. Images are currently available for only a subset of REDfly records.
The Citation/Evidence tab displays the reference and PubMed ID and links to the PubMed record for the current annotation. The name of the REDfly curator responsible for annotating this feature is also provided. This tab also provides the sequence source terms and the evidence for the feature.
Sequence Source Terms: Many older references do not provide exact sequence referents (e.g., genome coordinates, PCR primer sequences, GenBank IDs). Most often, sequence ranges are given as restriction maps. Because sequence polymorphisms between the clones used by researchers and the published genome sequence can lead to gain or loss of restriction sites and thus affect our determination of the reported sequence, we differentiate between those sequences unambiguously provided in the reference or through communication with the authors and those inferred from restriction maps. In those places where we were unable to locate a referenced restriction site or where sizes of the restriction fragments were not well matched with the reported sizes, we list the sequence end as "estimated/uncertain." In time, we hope to reconcile all ambiguities through communication with the authors.
Sequences reported as "inferred from restriction map" use as endpoints the first nucleotide of the restriction site for both the 5' and 3' ends of the sequence. Depending on the actual cut site of the enzyme, and on any modifications and/or sites used for subcloning, the exact CRM sequence tested by the authors may therefore differ from the reported sequence by several basepairs.
Orientation of CRMs is given as matching the orientation of the transcription unit, i.e. "5' end estimated" refers to the 5' end of the CRM when oriented in the same 5' to 3' direction as the gene.
TFBS sequences initially from the FlyReg database do not contain sequence source terms.
All RC and CRM records are linked to the REDfly annotations of any TFBSs that fall within them. These are listed in the TFBS tab (for RC/CRM records; see Figure 2G); clicking within a row will open a window with detailed results for that record. Similarly, if a TFBS falls within a known RC/CRM, the name of the RC/CRM and a link to its REDfly record is provided in the RC tab. Searches of REDfly can be restricted to just those TFBSs that map to known CRMs, and vice-versa, using the options in the Advanced Search pane.
The Sequence tab (see Figure 2H) displays the size (in basepairs) and sequence of the current feature. For TFBSs, the "sequence with flank" is also provided. This includes the TFBS sequence in capital letters, with approximately 20 bp of additional sequence extending on each end. This extended sequence allows for the usually short TFBSs to be mapped unambiguously to the genome.
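When working with a downloaded "sequence with flank," the binding site can be separated from its flanks by letter case. The sketch below assumes the flanking bases are written in lowercase (the text above states only that the TFBS itself is capitalized), and the example sequence is made up.

```python
import re

# Assumed layout: lowercase flank, uppercase TFBS, lowercase flank.
FLANKED = re.compile(r"([a-z]*)([A-Z]+)([a-z]*)")


def split_flanked(seq):
    """Split a flanked sequence into (left_flank, tfbs, right_flank)."""
    match = FLANKED.fullmatch(seq)
    if match is None:
        raise ValueError("expected lowercase flank + uppercase site + lowercase flank")
    return match.group(1), match.group(2), match.group(3)


left, site, right = split_flanked("acgtacgtGATTACAttgca")
print(site)              # GATTACA
print(len(left), len(right))  # 8 5
```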
The Expression tab (see Figure 2I) lists the expression terms associated with each record, using the anatomy ontology as described above. Although TFBSs do not of themselves have expression patterns, where a TFBS maps within an RC/CRM, it inherits the expression pattern information from that RC/CRM. Clicking on a column header will sort by that column. Clicking on a term will initiate a REDfly search in a new browser window for all records containing the specified term.
Both FlyBase and the BDGP in situ database use the anatomy ontology for reporting gene expression patterns. We have therefore provided links from each expression pattern in REDfly to each of these databases. Following these links will generate a list of genes annotated as having the selected expression pattern. As mappings between the anatomy ontologies of different organisms are developed, we hope to create links to similarly expressed genes in these organisms as well.
The Notes tab contains free-text notes that elaborate on the basic annotation of the feature. In particular, the notes can indicate details of expression patterns that cannot be adequately captured by the anatomy ontology. This tab is only available if there are notes associated with the record.
A number of terms used in the annotation of the BDGP in situ database have not yet been brought into line with the most current fly anatomy ontology terms and contain term IDs that are associated with different terms in the anatomy ontology. REDfly conforms to the anatomy ontology throughout, which could potentially lead to confusion when comparing entries in the two databases. These conflicting terms should be resolved once the BDGP terms are updated.
REDfly increments its version number only with the release of major new features; version numbers do not increase with the addition or update of records. To properly reflect the content of REDfly at any given time, please use the date of access as well as the version number.
REDfly tracks updates/corrections to individual records. The date of the last update to a record can be found in the “basic info” pane of the detailed results window and the number of times a record has been updated is recorded by the value of the third segment of the REDfly ID (e.g., RFRC:00000272.001 indicates the first entry of the record, RFRC:00000272.002 the second, etc.). To obtain data for previous versions of a specific record, please contact us.
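A REDfly ID of the form shown above can be split into its three segments programmatically. This is a minimal sketch based only on the RFRC example given here; the assumption that every ID consists of an uppercase prefix, a colon, a numeric record number, a period, and a numeric version has not been checked against other record types.

```python
import re
from typing import NamedTuple


class RedflyId(NamedTuple):
    prefix: str   # e.g. "RFRC"
    record: str   # zero-padded record number, kept as text
    version: int  # 1 for the first entry, 2 for the second, ...


# Assumed pattern, derived from the example "RFRC:00000272.001".
ID_PATTERN = re.compile(r"([A-Z]+):(\d+)\.(\d+)")


def parse_redfly_id(text):
    """Split a REDfly ID into prefix, record number, and version."""
    match = ID_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a recognized REDfly ID: {text!r}")
    return RedflyId(match.group(1), match.group(2), int(match.group(3)))


rid = parse_redfly_id("RFRC:00000272.002")
print(rid.prefix, rid.record, rid.version)  # RFRC 00000272 2
```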
The "Download" button will download the checked CRM records in one of a variety of formats. At present, REDfly supports the following options:
Sequences in multi-FASTA format. The FASTA header contains the following data:
Comma-separated list, one line per record. Fields are: "name", "species name", "gene_name", "flybase_id", "chromosome", "sequence"
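A downloaded file in this format can be read with Python's standard csv module. The sketch assumes the file has no header row and that fields appear in the order listed above; the sample record, including the gene name and FlyBase ID, is a placeholder rather than real REDfly data.

```python
import csv
import io

# Field order as listed in the download description.
FIELDS = ["name", "species name", "gene_name", "flybase_id", "chromosome", "sequence"]


def read_redfly_csv(handle):
    """Return one dict per record from a REDfly CSV download (no header assumed)."""
    return list(csv.DictReader(handle, fieldnames=FIELDS))


# Placeholder record for illustration only.
sample = io.StringIO(
    '"example_CRM","Drosophila melanogaster","exampleGene","FBgn0000000","2R","ACGTACGT"\n'
)
rows = read_redfly_csv(sample)
print(rows[0]["name"], len(rows[0]["sequence"]))  # example_CRM 8
```

To read a real download, pass an open file handle (opened with newline="") in place of the StringIO object.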
Data in GFF version 3 format. The "attributes" field holds the CRM name ("ID="); database identifiers ("dbxref=") for FlyBase, PubMed (PMID), and REDfly; and the expression terms ("Ontology_term="). Note that because TFBSs and CRMs are not strand-specific sequence features, no strand information is specified in the GFF file.
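The attributes column of a GFF3 line can be unpacked with a few lines of Python. The parser below follows the general GFF3 layout (nine tab-separated columns, semicolon-separated key=value attributes, comma-separated multiple values); the sample line, including its source and type columns and identifier values, is invented for illustration and may not match the exact content of a REDfly download.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict with an `attributes` mapping."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line has nine tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Multiple values are comma-separated; GFF3 percent-encodes reserved characters.
        attributes[key] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),   # one-based, inclusive
        "end": int(end),
        "strand": strand,      # "." when no strand is specified
        "attributes": attributes,
    }


# Invented example line; values are placeholders.
line = ("2R\tREDfly\tregulatory_region\t5000\t5750\t.\t.\t.\t"
        "ID=example_CRM;dbxref=PMID:0000000;"
        "Ontology_term=FBbt:00000126,FBbt:00000130")
feature = parse_gff3_line(line)
print(feature["attributes"]["ID"])             # ['example_CRM']
print(feature["attributes"]["Ontology_term"])  # ['FBbt:00000126', 'FBbt:00000130']
```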
The format used by Gbrowse to load local custom annotations. Newer versions of Gbrowse use a modified version of this format. Downloads in this format will be included in the future. However, note that most Gbrowse implementations can also accept custom annotations in GFFv3 format.
The REDfly database is built using MySQL. A diagram of the schema can be found here.
Reporter Constructs and their attributes in REDfly. The figure illustrates a hypothetical locus for which seven different reporter constructs (A-G) have been tested in vivo. Construct A is a 1 kb sequence fragment located roughly 2 kb upstream of the transcription start. Because it is an isolated construct, it is considered to be a CRM that has not been subject to minimization. If this construct showed reporter gene activity, it would be designated as “expression positive”; otherwise it would be labeled “expression negative.” Constructs B-G are part of an overlapping and partially nested series of sequences spanning 750 bp of DNA 7.25 kb upstream of the transcription start. In this example, each drives the identical pattern of reporter gene expression. Because each of these constructs overlaps at least one other, we consider this region and the six constructs to have undergone minimization. Constructs C and E are each the shortest of a respective set of nested sequences and are therefore considered to be CRMs (marked in red). The remaining constructs are designated as RCs (black). A 94 bp sequence marks the minimal region of overlap among all of the constructs and is thus registered in REDfly as an inferred CRM (iCRM, blue). If more than one iCRM is calculated with the same coordinates, but different expression terms, these will be merged into a single iCRM that includes the union of the expression terms.
The new REDfly user interface. See text for details. Search options (A, B), results overview (C), and detailed results (D-I) are all displayed within a single web browser window. (D-I) Detailed results are displayed as individual floating windows that can be stacked or tiled as desired; on a large monitor, a dozen or more individual records can be fully tiled for simultaneous viewing.