Database updated on 26-Jan-2017 with 423 changes

REDfly version 5.2.2 released!
REDfly now has the latest FlyBase anatomy ontology.
Regulatory Element Database for Drosophila v5.2.2

Table Of Contents

Data Types

Reporter Constructs

TFBSs

Inferred CRMs

Element Names

Sequence Coordinates

Expression Pattern Annotations

User Interface

Search Pane

Basic Search

Advanced Search

RC/CRM options

TFBS-specific options

Position

Location Search

Maximum Size

Restrict Evidence To

Last Updated After/Date Added

Searching Expression Patterns

Search Results Pane

Detailed View Window

Basic Info

Location

Images

Citation/Evidence

TFBS/RC

Sequence

Expression

Notes

Note on Annotations Used by the BDGP

Versioning and updated records

Downloads

FASTA

CSV

GFFv3

GBrowse annotation format

BED

Database Schema

Data Types

REDfly seeks to include all experimentally verified fly cis-regulatory modules (CRMs) and transcription factor binding sites (TFBSs), along with their DNA sequence, their associated genes, and the expression patterns they direct. At this time, REDfly only contains CRMs and TFBSs from D. melanogaster, despite the growing number of functionally tested sequences from other fly species. Currently we focus on sequences that have been tested by reporter gene assays in transgenic animals or cultured cells, and on binding sites discovered by DNase I footprinting and electrophoretic mobility shift (gel shift) assays. These sequences are stored in REDfly as two main data classes: Reporter Constructs and TFBSs.

Reporter Constructs

Any sequence tested by reporter gene assay is included in REDfly as a “Reporter Construct” (RC) and has three associated attributes: (1) expression; (2) CRM; and (3) minimization.

(1) Expression has value “positive” or “negative” and describes whether or not the sequence was reported to drive gene expression in the reporter gene assay. RCs with positive expression have their expression patterns annotated (see Expression Pattern Annotations).

Note that RCs recorded as “negative” for regulatory activity should be treated with caution by the user; as with any negative data, the failure to observe reporter gene activity could simply reflect a failure of the assay rather than a biological result. The sequence might still mediate gene regulation in a tissue not examined by the reporting researcher, require a promoter different from the one used in the reporter construct in order to function, or be a silencer or other form of negative regulatory element not detectable in the assay.

(2) If only a single sequence covering a given set of genomic coordinates is annotated, this sequence is considered a CRM. Where multiple nested sequences with identical activity are present, the shortest such sequence is designated as a CRM. In other words, we define CRM as the minimal-length reporter construct in a set of one or more nested reporter constructs that produce the same gene expression pattern (see Figure 1).

Note that on occasion it will appear as though several nested RCs have the same activity (i.e., are associated with identical expression terms) yet are all designated as CRMs. This situation arises when the constructs actually drive different gene expression patterns, but at a level that is not easily captured by the anatomy ontology used to annotate the expression (e.g., two different subsets of motor neuron, both annotated simply as “motor neuron”). These differences will usually be clarified in the free text notes accompanying the record.

(3) When a CRM is part of a set of nested sequences, rather than a single tested sequence at a particular locus, we say that the CRM and associated RCs have undergone “minimization.” (see Figure 1)
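The CRM rule in (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure REDfly itself runs: the `ReporterConstruct` class, the `designate_crms` function, and the choice to compare activity by exact equality of annotated terms are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReporterConstruct:
    """Hypothetical minimal record; coordinates are one-based, inclusive."""
    name: str
    start: int
    end: int
    expression: frozenset  # annotated expression terms


def contains(outer, inner):
    """True if `inner` is nested within `outer` and is strictly shorter."""
    nested = outer.start <= inner.start and inner.end <= outer.end
    shorter = (inner.end - inner.start) < (outer.end - outer.start)
    return nested and shorter


def designate_crms(constructs):
    """Return the names of constructs that would be called CRMs.

    Assumption: an RC is a CRM unless a shorter RC with an identical
    set of expression terms is nested inside it.
    """
    crms = set()
    for rc in constructs:
        has_shorter_nested = any(
            other is not rc
            and other.expression == rc.expression
            and contains(rc, other)
            for other in constructs
        )
        if not has_shorter_nested:
            crms.add(rc.name)
    return crms
```

With an isolated construct A and a nested series B-E that all share one expression term, the sketch reports A plus the shortest member of each nested set, which mirrors the logic described for Figure 1 (the coordinates used to try it out are up to the reader).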

Additional information recorded for Reporter Constructs is detailed below.

TFBSs

TFBSs in REDfly derive mainly from two sources of evidence: DNase I footprinting experiments and electrophoretic mobility shift assays (EMSA, “gel shift”).

For footprinting experiments, when a binding factor purified from nuclear extract has been shown to be the product of a specific gene, footprints were attributed to the gene encoding that factor; otherwise, the binding factor for nuclear extract footprints has been left as "unspecified." Where possible we followed the rule of precedence in attributing footprint data to a particular reference, unless members of the same research group reported refined coordinates in a subsequent publication. When two or more overlapping motifs for the same transcription factor were reported for a single footprinted region, they were merged and annotated as one footprint. References that used non-D. melanogaster proteins or non-D. melanogaster target DNA have been excluded, since these experiments do not represent biologically meaningful regulatory interactions in vivo. The majority of footprinted sites were assembled initially from the FlyReg database.

Whereas DNase I footprinting provides an exact sequence for the binding site, TFBSs obtained from EMSA experiments formally can be said only to bind somewhere within the sequence of the probe used in the assay (typically 20-50 bp in length). In most cases, the authors have provided a presumed binding sequence within the probe, and we have used this to represent the binding site.

Yeast one-hybrid (Y1H) data are derived from high-throughput Y1H studies such as those described by Hens et al. (2011) Nature Methods, 8(12), 1065–1070. Unlike footprinting assays, which provide a defined binding site, and EMSAs, in which the binding sequence is often inferred by authors, Y1H data, if derived from large bait sequences, can be of much lower sequence resolution. To prevent such sequences from showing up in TFBS search results, restrict the “evidence types” to exclude Y1H data, or use the “maximum size” Advanced Search option to restrict results to short sequences.

Inferred CRMs

Sequences suspected to be CRMs based on regions of overlap between reporter constructs with similar activity, but not experimentally demonstrated to be so, are designated as “inferred CRMs” (iCRMs; see Figure 1). Note that unlike Reporter Constructs, inferred CRMs have no empirical evidence supporting their functionality. At present, sets of overlapping RCs that include an RC with “negative” expression activity are excluded from determination of iCRMs.
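As a rough illustration of the idea, an inferred CRM can be thought of as the region shared by a set of overlapping RCs. The sketch below is a simplification under our own assumptions: the `infer_crm` function and its tuple layout are hypothetical, the RCs passed in are assumed to have already been grouped as having similar activity, and the actual rules REDfly uses to form those groups are not described here.

```python
def infer_crm(constructs):
    """Return the (start, end) region shared by all constructs, or None.

    `constructs` is a list of (start, end, expression_positive) tuples
    with one-based, inclusive coordinates (hypothetical layout).
    """
    if len(constructs) < 2:
        return None  # an overlap needs at least two constructs
    # Sets that include an expression-negative RC are excluded.
    if any(not positive for _, _, positive in constructs):
        return None
    start = max(s for s, _, _ in constructs)
    end = min(e for _, e, _ in constructs)
    # If start > end the constructs share no common region.
    return (start, end) if start <= end else None
```

Taking the maximum start and minimum end is the standard way to intersect intervals; it yields a region only when every construct in the set covers it.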

Element Names

All Reporter Constructs are named beginning with their associated gene symbol followed by an underscore (e.g., eve_). Where RCs are described with specific names in the literature, we have attempted to maintain those names. In those cases where no specific names were given, we have assigned names based either on spatial activity or on position with respect to the gene. TFBSs are named with the convention [name_of_TF]_[name_of_regulated_gene]:REDflyID. In some instances, the name of the associated gene for a CRM or the transcription factor for a TFBS is not known. In these cases, the gene name is given as “unspecified.”

For your convenience, we provide a table of identifier mappings that cross-references REDfly identifiers with their corresponding FlyBase, ORegAnno and FlyReg identifiers.

Sequence Coordinates

Sequence coordinates default to the most current release, Release 6 (dm6, Aug. 2014). Coordinates from release 5 are available through the dropdown at the top of the “basic info” tab and as options for download. Although we maintain the R3 and R4 coordinates for older records, they will no longer be available for download. If you require R3 or R4 coordinates please contact us.

Sequence coordinates are represented in REDfly as one-based start, one-based end (for a discussion of genome coordinate representations, see http://genome.ucsc.edu/FAQ/FAQtracks.html#tracks1 and http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms).
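For users moving between REDfly's one-based, fully closed coordinates and the zero-based, half-open convention used by formats such as BED, the arithmetic is sketched below. The function names are our own and are only meant as an illustration.

```python
def one_based_to_zero_based(start, end):
    """Convert one-based inclusive coordinates to zero-based half-open."""
    if start < 1 or end < start:
        raise ValueError(f"invalid one-based interval: {start}-{end}")
    return start - 1, end


def zero_based_to_one_based(start, end):
    """Convert zero-based half-open coordinates to one-based inclusive."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid zero-based interval: {start}-{end}")
    return start + 1, end


def length_one_based(start, end):
    """Length in bp of a one-based inclusive interval."""
    return end - start + 1
```

Only the start changes between the two conventions; the end value is numerically the same, and the length of a one-based interval is end - start + 1 rather than end - start.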

Expression Pattern Annotations

Expression patterns have been annotated using terms from the Drosophila gross anatomy ontology. Both the annotation and the ontology itself are works in progress, so care should be taken when making use of these data. Because expression patterns as described in the literature are not reported using the ontology terms, and are given in varying levels of detail, providing an exact description in the database is not always straightforward. We have provided these descriptions as a way to facilitate searching and grouping the included CRMs. However, we strongly encourage users to consult the notes included with our annotations as well as the original references for more detailed descriptions of expression patterns. In particular, note that the anatomy ontology does not always provide a means to distinguish sub tissue- or organ-level cell populations. Thus, two entries annotated as "wing disc" may in fact refer to non-overlapping cell types within the disc, two entries annotated as "ventral nerve cord" may refer to separate neuronal lineages, etc. Expression patterns are reported based on the textual descriptions given by the authors; we did not attempt to refine these descriptions based on our own analysis of published photographs. See also Searching expression patterns.

In general, TFBS records are not annotated with expression pattern information. However, if a TFBS is contained within an annotated RC/CRM, the TFBS will inherit expression pattern information from the related RC/CRM record. Note that in most cases, it has not been demonstrated that this particular TFBS plays a functional role in mediating any or all of the tissue-specific expression ascribed to the CRM.

User Interface

REDfly uses a Model View Controller (MVC) architecture.  This paradigm keeps the domain logic and the user interface separate, isolating the effect of future design changes and facilitating the sharing of REDfly data with collaborators.  The user interface uses the ExtJS JavaScript application framework while all domain logic is provided by a RESTful API. This API is available to collaborators and allows for direct access to REDfly data for inclusion or ingestion by other tools.  API documentation is provided at redfly.ccr.buffalo.edu/api/explorer.php.

Search Pane

Basic Search

Basic search (see Figure 2A) allows for searching by gene name, FlyBase ID (FBgn), FlyBase FBtp number, element name, PubMed ID, or recent updates; the latter will return all records entered on the most recent date of data entry/update. Options to “browse all” records and to download all Reporter Constructs, all CRMs, or all TFBSs are also available.

By default, searches will not return Reporter Construct/CRM records discovered exclusively through cell-culture based assays (this is to prevent the results from being dominated by RCs proven to function in only a single cell type). To include these RCs/CRMs, uncheck the “Exclude Cell Line Only” box to the right of the “search” button.

Gene names should be searched as official FlyBase gene symbols (e.g., dpp, h, betaTub60D) or FlyBase IDs (e.g., FBgn0000490, FBgn0003888). Greek letters have been written out (e.g., alpha, delta). At present only valid primary gene symbols are accepted. If the name of a gene does not appear in the drop-down, it is likely that a synonym rather than the primary name is being used. In such cases, retrieving the proper name from FlyBase and searching again should resolve the problem. The gene name “unspecified” has been included to allow for searching for RCs or TFBSs where the associated gene or transcription factor, respectively, is not known.

If “by name” is selected, the search will retrieve only those records explicitly annotated with the gene name being searched. The default behavior is “by locus,” which will retrieve all records with the current gene name, but also all other Reporter Constructs/TFBSs within the defined region (default 10,000 bp upstream and downstream of the named gene). This allows for retrieval of Reporter Constructs annotated as “unspecified” and those lying near a gene of interest but annotated as being associated with a different gene. The size of the genomic region to be searched can be modified using the “search range interval” box under Advanced Search. (Similar behavior can be achieved by using coordinates to specify a genomic region to search using the Location Search features in the Advanced Search area.)
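The “by locus” window can be pictured as follows. This is a sketch under assumptions: the function names are hypothetical, the gene is represented only by its start and end coordinates, and we treat “within the defined region” as meaning that an element lies entirely inside the window, which may not match the actual search behavior at the edges.

```python
DEFAULT_INTERVAL_BP = 10_000  # default search range interval


def locus_window(gene_start, gene_end, interval=DEFAULT_INTERVAL_BP):
    """Return the (start, end) region searched around a gene.

    Coordinates are one-based and inclusive; the window is clipped so
    that it never starts before position 1.
    """
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    return max(1, gene_start - interval), gene_end + interval


def in_window(element_start, element_end, window):
    """True if the element lies entirely inside the window (assumption)."""
    window_start, window_end = window
    return window_start <= element_start and element_end <= window_end
```

A larger `interval` corresponds to entering a larger value in the “search range interval” box.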

A wild-card is automatically appended to the end of the search string for all element name searches (e.g., searching for “eve_stripe” will return “eve_stripe1”, “eve_stripe2”, “eve_stripe3+7”, etc.).

Advanced Search

The Advanced Search pane (see Figure 2B) is divided into two tabs, one for Reporter Construct/CRM options and one for TFBS options.

RC/CRM options

These include searching for all records, for CRM records only, for CRMs with associated TFBS data only, or for “inferred CRMs.” These can be further filtered for positive vs. negative expression and for whether or not an element has undergone minimization.

TFBS-specific options

These allow for searching all TFBSs or only those with associated CRM data. Gene names can be used to search all TFBS records, only those where the named gene is the target, or only those where the named gene encodes the transcription factor.

Details on search options are as follows:

Position

Position search will select any RCs/CRMs or TFBSs located in the specified position relative to their target gene. Options are 5’ to the gene, 3’ to the gene, within an intron, or within an exon. Options are non-exclusive, i.e., an RC that begins 5’ to the gene and extends through the first intron will be found by a search for any of 5’, intron, or exon.

To be considered as overlapping a genomic feature, a regulatory element must extend more than five bp into that feature. Thus, a CRM in the proximal promoter region that begins 500 bp 5’ to the transcription start of its gene and extends two bp into the first exon is considered to be exclusively 5’ to the gene and will not be returned on a search for elements within exons.
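The five-bp rule can be expressed as a simple overlap test. The function names below are hypothetical, and the sketch assumes one-based, inclusive coordinates for both the element and the feature.

```python
MIN_OVERLAP_BP = 5  # an element must extend MORE than this many bp


def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two one-based inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def overlaps_feature(el_start, el_end, feat_start, feat_end):
    """True if the element extends more than five bp into the feature."""
    return overlap_bp(el_start, el_end, feat_start, feat_end) > MIN_OVERLAP_BP
```

For instance, with a made-up exon spanning 10000-10300 on the forward strand, a CRM spanning 9500-10001 shares only two bases with the exon and would not count as exonic, whereas one spanning 9500-10005 shares six bases and would.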

Positional information is reported in the detailed view windows in the Location tab.

Location Search

Location search will select any RCs/CRMs or TFBSs lying within the specified sequence range, using release 6/dm6 coordinates. Coordinates from older releases can be converted through FlyBase’s “Coordinates Converter” tool.

Search Range Interval

Search Range Interval sets the size of the genomic region to be searched when the Basic Search “by locus” option is selected. Default is 10,000 bp.

Maximum Size

Selecting a maximum size will exclude any RCs/CRMs or TFBSs whose length is greater than the specified value, in basepairs.

Restrict Evidence To

This field allows the user to restrict a search to sequences supported by only certain types of evidence, e.g., TFBSs supported by DNase I footprinting only.

Last Updated After/Date Added

Placing a value in these fields will restrict the search results to those records that have been updated or added on or after the chosen date, respectively. Use the Last Updated feature to check for additions and corrections since your last search.

Searching Expression Patterns

Two methods of expression pattern searching have been implemented in REDfly. Searching using the Ontology/Expression Term search field will select records containing the specified term or any of the descendant terms in the ontology hierarchy; checking the “exact term” box will restrict the search to only that term. The Ontology search function therefore provides a way to identify RCs/CRMs that potentially drive similar spatial patterns of expression despite that expression having been described at different levels of detail in the literature. Ontology searching can be conducted either by entering an ontology term in the search box, or by selecting a term from the pop-up Ontology Browser. The search box incorporates a search widget from the National Center for Biomedical Ontology that converts synonyms to preferred terms. For example, typing “wing disc” will automatically bring up “dorsal mesothoracic disc” as the top option in the drop-down.

See also Expression Pattern Annotations.

For example, searching for "mesoderm" using the Expression Term search will return annotations such as

FBbt:00000126, mesoderm

FBbt:00000128, trunk mesoderm

FBbt:00000130, visceral mesoderm

Using "exact match," only

FBbt:00000126, mesoderm

would be returned.

A search for "mesoderm" using the Ontology Search would return records with the same terms as above, but also with terms such as

FBbt:00000126, mesoderm

FBbt:00000128, trunk mesoderm

FBbt:00000130, visceral mesoderm

FBbt:00005073, somatic muscle

FBbt:00000466, oblique muscle

FBbt:00005247, hemocyte primordium

FBbt:00001666, cardioblast

and so forth. With “exact term” checked, only records explicitly annotated as FBbt:00000126, mesoderm will be returned.

In practice, Exact Term searches will often be too restrictive, and the full Ontology search too permissive. In the future, we hope to improve the Ontology search function to allow greater control over the depth of the search. Users may find it helpful to examine the FlyBase Gene Expression section for aid in navigating the Ontology.
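The difference between an exact-term search and a search that also follows descendant terms can be sketched with a toy hierarchy. The term IDs below are taken from the examples above, but the parent/child arrangement in `CHILDREN`, the record names, and both function names are invented for illustration; they do not reproduce the real anatomy ontology or REDfly's implementation.

```python
# Hypothetical parent -> children arrangement (illustration only).
CHILDREN = {
    "FBbt:00000126": ["FBbt:00000128", "FBbt:00000130", "FBbt:00005247"],
    "FBbt:00000128": ["FBbt:00005073", "FBbt:00001666"],
    "FBbt:00005073": ["FBbt:00000466"],
}


def descendants(term, children):
    """Collect every term reachable below `term` in the hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def search(records, term, children, exact=False):
    """Return names of records annotated with `term`.

    With exact=False, records annotated with any descendant also match.
    `records` maps a record name to a list of annotated term IDs.
    """
    wanted = {term} if exact else ({term} | descendants(term, children))
    return [name for name, terms in records.items() if wanted & set(terms)]
```

In this toy setup, a record annotated only with FBbt:00000466 (oblique muscle) is found by a descendant-following search for FBbt:00000126 (mesoderm) but not by an exact-term search.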

Search Results Pane

The Search Results pane (see Figure 2C) provides a summary view of all records of CRMs/RCs, TFBSs, and Inferred CRMs returned by a search. Each class is returned in a separate tab; numbers in the tab header indicate how many records were returned for each data type. Clicking in the header row will sort results by the selected column. Users can choose to download one or more records directly from this pane in a variety of formats using the “download” button at the bottom (see Downloads). Clicking on a row will open a detailed view window for the record. Alternatively, multiple records can be selected using the check boxes along the left-hand side of the pane and then clicking on the “view selected” button at the bottom. Multiple detailed view windows open in a stack; the “tile windows” button will tile these in the browser window. The “window tab selector” brings the selected tab (see below) to the foreground in each open detailed view window.

Detailed View Window

Results for each record are presented in a detailed view window composed of multiple tabs displaying different sections of the information for each entry.

Basic Info

The Basic Info tab (see Figure 2D) contains the genomic coordinates of the feature based on the current sequence release. Coordinates for older releases can be obtained using the “previous coordinates” button. For RCs, the RC attributes (has_expression, is_CRM, is_minimized) are listed. Other information contained in the Basic Info tab includes the species (currently only D. melanogaster); the name of the associated gene(s) with links to FlyBase and FlyMine, and, for TFBSs, FlyTF records; and links to the FlyBase GBrowse and UCSC genome browsers. Note that because TFBSs and CRMs are not strand-specific sequence features, no strand information is reflected in the graphical views. When accessing the FlyBase GBrowse genome browser we have occasionally experienced a timeout error and are working to diagnose the cause. The REDfly ID of the record and date of the last update are also provided.

Location

The Location tab (see Figure 2E) provides a snapshot of the genomic region taken from FlyBase GBrowse and displays genes, transcripts, and CRMs. TFBSs, inferred CRMs, or new CRM annotations not yet in FlyBase are not currently displayed. The position of the feature relative to transcripts of the associated gene is provided above the graphic.

Images

The Images tab (RCs/CRMs only; see Figure 2F) shows the expression pattern of the reporter gene. A subset of these images are provided courtesy of FlyExpress and clicking on these will bring the user to the FlyExpress website, from which a search can be initiated for other genes with a similar expression pattern. Images are currently available for only a subset of REDfly records. In many cases, if no image is available the figure number showing the RC in the published report is provided.

Citation/Evidence

The Citation/Evidence tab displays the reference and PubMed ID and links to the PubMed record for the current annotation. The name of the REDfly curator responsible for annotating this feature is also provided. This tab also provides the sequence source terms and the evidence for the feature.

Sequence Source Terms: Many older references do not provide exact sequence referents (e.g., genome coordinates, PCR primer sequences, GenBank IDs). Most often, sequence ranges are given as restriction maps. Because sequence polymorphisms between the clones used by researchers and the published genome sequence can lead to gain or loss of restriction sites and thus affect our determination of the reported sequence, we differentiate between those sequences unambiguously provided in the reference or through communication with the authors and those inferred from restriction maps. In those places where we were unable to locate a referenced restriction site or where sizes of the restriction fragments were not well matched with the reported sizes, we list the sequence end as "estimated/uncertain." In time, we hope to reconcile all ambiguities through communication with the authors.

Sequences reported as "inferred from restriction map" use as endpoints the first nucleotide of the restriction site for both the 5' and 3' ends of the sequence. Depending on the actual cut site of the enzyme, therefore, and modification and/or sites used for subcloning, the exact CRM sequence tested by the authors may differ from the reported site by several basepairs.

Orientation of CRMs is given as matching the orientation of the transcription unit, i.e. "5' end estimated" refers to the 5' end of the CRM when oriented in the same 5' to 3' direction as the gene.

TFBS sequences initially from the FlyReg database do not contain sequence source terms.

TFBS/RC

All RC and CRM records are linked to the REDfly annotations of any TFBSs that fall within them. These are listed in the TFBS tab (for RC/CRM records; see Figure 2G); clicking within a row will open a window with detailed results for that record. Similarly, if a TFBS falls within a known RC/CRM, the name of the RC/CRM and a link to its REDfly record is provided in the RC tab. Searches of REDfly can be restricted to just those TFBSs that map to known CRMs, and vice-versa, using the options in the Advanced Search pane.

Sequence

The Sequence tab (see Figure 2H) displays the size (in basepairs) and sequence of the current feature. For TFBSs, the "sequence with flank" is also provided. This includes the TFBS sequence in capital letters, with approximately 20 bp of additional sequence extending on each end. This extended sequence allows for the usually short TFBSs to be mapped unambiguously to the genome.
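Because the binding site is written in capitals and the flanking sequence in lower case, the two can be separated programmatically. The sketch below uses hypothetical function names and assumes the string contains only letters, with a single uppercase run for the TFBS.

```python
import re

_FLANKED = re.compile(r"([a-z]*)([A-Z]+)([a-z]*)")


def split_flanked(sequence):
    """Split a 'sequence with flank' into (left flank, TFBS, right flank)."""
    match = _FLANKED.fullmatch(sequence.strip())
    if match is None:
        raise ValueError("expected lowercase flank, uppercase site, lowercase flank")
    return match.group(1), match.group(2), match.group(3)


def add_flank(genome, start, end, flank=20):
    """Build a flanked string from a chromosome sequence.

    `start` and `end` are one-based inclusive coordinates of the site;
    the flanks are truncated at the ends of `genome`.
    """
    left = max(0, start - 1 - flank)
    return (
        genome[left:start - 1].lower()
        + genome[start - 1:end].upper()
        + genome[end:end + flank].lower()
    )
```

Searching a genome for the full flanked string (ignoring case) is what allows a short site, which may occur many times on its own, to be placed at a single location.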

Expression

See also Expression pattern annotations and Searching expression patterns.

The Expression tab (see Figure 2I) lists the expression terms associated with each record, using the anatomy ontology as described above. Although TFBSs do not of themselves have expression patterns, where a TFBS maps within an RC/CRM, it inherits the expression pattern information from that RC/CRM. Clicking on a column header will sort by that column. Clicking on a term will initiate a REDfly search in a new browser window for all records containing the specified term.

Both FlyBase and the BDGP in situ database use the anatomy ontology for reporting gene expression patterns. We have therefore provided links from each expression pattern in REDfly to each of these databases. Following these links will generate a list of genes annotated as having the selected expression pattern. As mappings between the anatomy ontologies of different organisms are developed, we hope to create links to similarly expressed genes in these organisms as well.

Notes

The Notes tab contains free-text notes that elaborate on the basic annotation of the feature. In particular, the notes can indicate details of expression patterns that cannot be adequately captured by the anatomy ontology.

Note on Annotations Used by the BDGP

A number of terms used in the annotation of the BDGP in situ database have not yet been brought into line with the most current fly anatomy ontology terms and contain term IDs that are associated with different terms in the anatomy ontology. REDfly conforms to the anatomy ontology throughout, which could potentially lead to confusion when comparing entries in the two databases. These conflicting terms should be resolved once the BDGP terms are updated.

Versioning and updated records

Prior to v5.0, REDfly incremented its version number only with the release of major new features; version numbers did not increase when records were added or updated. To properly reflect the content of REDfly at any given time, please use the date of access as well as the version number.

Starting with v5.0, REDfly moved from a continuous-release to a versioned-release cycle. From v5.0 forward, new records are only added/updated in conjunction with a numbered release.

REDfly tracks updates/corrections to individual records. The date of the last update to a record can be found in the “basic info” pane of the detailed results window and the number of times a record has been updated is recorded by the value of the third segment of the REDfly ID (e.g., RFRC:00000272.001 indicates the first entry of the record, RFRC:00000272.002 the second, etc.). To obtain data for previous versions of a specific record, please contact us.
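The three segments of a REDfly ID can be pulled apart as in the sketch below. The `RedflyId` type and `parse_redfly_id` function are hypothetical helpers, and the pattern assumes every ID follows the same shape as the RFRC example given here (uppercase prefix, colon, digits, period, digits).

```python
import re
from typing import NamedTuple


class RedflyId(NamedTuple):
    prefix: str   # e.g. "RFRC"
    number: int   # record number
    version: int  # 1 for the first entry, 2 for the second, and so on


_ID_PATTERN = re.compile(r"([A-Z]+):(\d+)\.(\d+)")


def parse_redfly_id(text):
    """Parse an ID such as 'RFRC:00000272.001' into its three segments."""
    match = _ID_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a recognizable REDfly ID: {text!r}")
    prefix, number, version = match.groups()
    return RedflyId(prefix, int(number), int(version))
```

Since the first entry of a record carries version 1, the number of updates a record has received is the parsed version minus one.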

Downloads

The "Download" button will download the checked CRM records in one of a variety of formats. At present, REDfly supports the following options:

FASTA

Sequences in multi-FASTA format. The FASTA header contains the following data:

>CRM_name|species|gene|FlyBase_ID|chromosome
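A header in this layout can be split into named fields as sketched below. The `FastaHeader` type and `parse_header` function are our own illustration, and the sketch assumes that none of the five fields contains a “|” character.

```python
from typing import NamedTuple


class FastaHeader(NamedTuple):
    name: str
    species: str
    gene: str
    flybase_id: str
    chromosome: str


def parse_header(line):
    """Split a '>name|species|gene|FlyBase_ID|chromosome' header line."""
    if not line.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    fields = line[1:].rstrip("\r\n").split("|")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, found {len(fields)}")
    return FastaHeader(*fields)
```

For example, a made-up header such as `>eve_stripe2|Drosophila melanogaster|eve|FBgn0000606|2R` would yield `gene == "eve"` and `chromosome == "2R"`; the exact field contents in a real download may differ.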

CSV

Comma-separated list, one line per record. Fields are: "name", "species name", "gene_name", "flybase_id", "chromosome", "sequence"
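The CSV download can be read with Python's standard `csv` module, as in this sketch. The `read_redfly_csv` function is hypothetical, and because it is not stated here whether the file begins with a header row, that is exposed as a `has_header` flag rather than assumed.

```python
import csv
import io

FIELDS = ["name", "species name", "gene_name", "flybase_id", "chromosome", "sequence"]


def read_redfly_csv(handle, has_header=False):
    """Read records from an open CSV file into a list of dictionaries.

    The column names are supplied explicitly in the documented order;
    set has_header=True to skip a leading header row if one is present.
    """
    rows = list(csv.DictReader(handle, fieldnames=FIELDS))
    return rows[1:] if has_header else rows


# Usage with an in-memory, made-up record:
example = io.StringIO("eve_stripe2,Drosophila melanogaster,eve,FBgn0000606,2R,ACGT\n")
records = read_redfly_csv(example)
```

Using the `csv` module rather than splitting on commas by hand keeps quoted fields intact if any field ever contains a comma.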

GFFv3

Data in GFF version 3 format. The "attributes" field holds the CRM name ("ID="); database identifiers ("dbxref=") for FlyBase, PubMed (PMID), and REDfly; and the expression terms ("Ontology_term="). Note that because TFBSs and CRMs are not strand-specific sequence features, no strand information is specified in the GFF file.
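The attributes column of a GFF3 line is a semicolon-separated list of key=value pairs, where a value may itself be a comma-separated list. A minimal parser is sketched below; the function name is hypothetical, and the way identifiers are written inside the values of a real REDfly file is not specified here.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Parse a GFF3 attributes string into {key: [value, ...]}.

    Keys and values are percent-decoded, since GFF3 escapes reserved
    characters such as ';', '=' and ',' in that way.
    """
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes
```

Applied to a made-up string such as `ID=eve_stripe2;Ontology_term=FBbt:00000126,FBbt:00000128`, this returns the name under `"ID"` and both expression terms as a list under `"Ontology_term"`.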

GBrowse annotation format

The format used by GBrowse to load local custom annotations. Newer versions of GBrowse (GBrowse2) use a modified version of this format. Downloads in this format will be included in the future. However, note that most GBrowse implementations can also accept custom annotations in GFFv3 format.

BED

Data in BED format. File type “BED simple” downloads a four-column BED file (chrom, start, end, name) with no headers, suitable for direct analysis or for use with a genome browser. File type “BED browser” produces an eight-column BED file with additional header information to enable richer functionality when used with the UCSC Genome Browser. Default track name of “CRMs” or “TFBS” and default track description of “CRMs (or TFBSs) selected from REDfly” are specified.

Database Schema

The REDfly database is built using MySQL. A diagram of the schema can be found here.


Figure 1

Fig1web.png

Reporter Constructs and their attributes in REDfly. The figure illustrates a hypothetical locus for which seven different reporter constructs (A-G) have been tested in vivo. Construct A is a 1 kb sequence fragment located roughly 2 kb upstream of the transcription start. Because it is an isolated construct, it is considered to be a CRM that has not been subject to minimization. If this construct showed reporter gene activity, it would be designated as “expression positive”; otherwise it would be labeled “expression negative.” Constructs B-G are part of an overlapping and partially nested series of sequences spanning 750 bp of DNA 7.25 kb upstream of the transcription start. In this example, each drives the identical pattern of reporter gene expression. Because each of these constructs overlaps at least one other, we consider this region and the six constructs to have undergone minimization. Constructs C and E are each the shortest of a respective set of nested sequences and are therefore considered to be CRMs (marked in red). The remaining constructs are designated as RCs (black). A 94 bp sequence marks the minimal region of overlap among all of the constructs and is thus registered in REDfly as an inferred CRM (iCRM, blue). If more than one iCRM is calculated with the same coordinates, but different expression terms, these will be merged into a single iCRM that includes the union of the expression terms.

Figure 2

Fig2web.png

The REDfly user interface. See text for details. Search options (A, B), results overview (C), and detailed results (D-I) are all displayed within a single web browser window. (D-I) Detailed results are displayed as individual floating windows that can be stacked or tiled as desired; on a large monitor, a dozen or more individual records can be fully tiled for simultaneous viewing.



Funded in part by the NSF and the NIGMS