REDfly User Guide
REDfly seeks to include all experimentally verified fly cis-regulatory modules (CRMs) and transcription factor binding sites (TFBSs), along with their DNA sequence, their associated genes, and the expression patterns they direct. For the initial release we have focused on sequences that have been unambiguously demonstrated to be sufficient to regulate gene expression, primarily through reporter gene assays in transgenic animals, and on binding sites discovered by DNase I footprinting assays. REDfly therefore does not currently contain certain sequences annotated in FlyBase as "regulatory_region" or "enhancer" such as individual transcription factor binding site motifs (e.g., "sca-enhancer-1" or "Dpt-reg_element-2") or predicted but not tested elements (e.g., "su(s)-su(s)-reg_element-1"). Also not currently included are CRMs inferred, but not demonstrated, to have specific activities based on deletion analysis (either from reporter gene assays or from genomic deletions), as well as silencer and boundary elements. At this time, REDfly contains only CRMs and TFBSs from D. melanogaster, despite the growing number of functionally tested sequences from other fly species. Future updates of REDfly will include many of these additional sequences along with a description of the evidence used to support their annotation.

For the most part, CRMs are included directly as reported in the literature. Where multiple nested sequences with identical activity were reported, the shortest such sequence was selected. Sequences with identical activity that are distinct but substantially overlapping are mostly reported separately, although in some instances of minimal overlap, one or more sequences were omitted.

At present, TFBSs derive primarily from DNase I (but not hydroxyl radical or copper nuclease) footprinting experiments that used protein obtained from nuclear extract (either crude or purified) or recombinant expression (either partial or full-length).
When a binding factor purified from nuclear extract has been shown to be the product of a specific gene, footprints were attributed to the gene encoding that factor; otherwise, the binding factor for nuclear extract footprints has been left as "unspecified." Where possible we followed the rule of precedence in attributing footprint data to a particular reference, unless members of the same research group reported refined coordinates in a subsequent publication. When two or more overlapping motifs for the same transcription factor were reported for a single footprinted region, they were merged and annotated as one footprint. Results from most electromobility shift assay (EMSA) experiments are not currently included in REDfly but will be included in the future. Also excluded are any references that used non-D. melanogaster proteins or non-D. melanogaster target DNA, since these experiments do not represent biologically meaningful regulatory interactions in vivo.

All CRMs are named beginning with their associated gene symbol followed by an underscore (e.g., eve_). Where CRMs are described with specific names in the literature, we have attempted to maintain those names. In those cases where no specific names were given, we have assigned names based either on spatial activity or on position with respect to the gene. TFBSs are named with the convention [name_of_TF]_[name_of_regulated_gene]:REDflyID. For your convenience, we provide a table of identifier mappings that cross-references REDfly identifiers with their corresponding ORegAnno and FlyReg identifiers.

Sequence coordinates are given in the most current genome release, version 5, and it is these coordinates that are used for the Detailed View page.

Expression patterns have been annotated using terms from the Drosophila gross anatomy ontology v1.5. Both the annotation and the ontology itself are works in progress, so care should be taken when making use of these data.
Because expression patterns as described in the literature are not reported using the ontology terms, and are given in varying levels of detail, providing an exact description in the database is not always straightforward. We have provided these descriptions as a way to facilitate searching and grouping the included CRMs. However, we strongly encourage users to consult the original references for more detailed descriptions of expression patterns. In particular, note that the anatomy ontology does not at this time always provide a means to distinguish cell populations below the tissue or organ level. Thus, two entries annotated as "wing disc" may in fact refer to non-overlapping cell types within the disc, two entries annotated as "ventral nerve cord" may refer to separate neuronal lineages, etc. Expression patterns are reported based on the textual descriptions given by the authors; we did not attempt to refine these descriptions based on our own analysis of published photographs. See also Searching expression patterns.

In general, TFBS records are not annotated with expression pattern information. However, if a TFBS is contained within an annotated CRM, the TFBS will inherit expression pattern information from the related CRM record. Note that in most cases, it has not been demonstrated that this particular TFBS plays a functional role in mediating any or all of the tissue-specific expression ascribed to the CRM.

While all search parameters can be used for CRMs, only relevant search parameters from the Search page are used for searching TFBSs. Searches can be conducted for CRMs, for only those CRMs that have associated TFBS data, for TFBSs, or for only those TFBSs contained within an annotated CRM, using the appropriate checkboxes. The default state will return matches from both CRMs and TFBSs.

Gene names should be searched as official FlyBase gene symbols (e.g., dpp, h, betaTub60D) or FlyBase IDs (e.g., FBgn0000490, FBgn0001168, FBgn0003888).
A wild-card (*) is automatically appended at the beginning and end of the name unless "exact match" is selected. Greek letters have been written out (e.g., alpha, delta).

Location search will select any CRMs or TFBSs lying within the specified sequence range.

Selecting a maximum size will exclude any CRMs or TFBSs whose length is greater than the specified value, in basepairs.

Placing a value in the date field will restrict the search results to those records that have been added or updated after the chosen date. Use this feature to check for additions and corrections since your last search.

Two methods of expression pattern searching have been implemented in REDfly. Searching using the Expression Term search field will select records containing the specified string. A wild-card (*) is automatically appended at the beginning and end of the name unless "exact match" is selected. Searching using the Ontology Search will select records containing the specified string or any of the descendant terms in the ontology hierarchy. The Ontology search function therefore provides a way to identify CRMs that potentially drive similar spatial patterns of expression despite that expression having been described at different levels of detail in the literature. Ontology searching can be conducted either by entering an ontology term (or term ID) in the search box, or by selecting a term from the pop-up Ontology Browser. See also Expression Pattern Annotations.

For example, searching for "mesoderm" using the Expression Term search will return annotations such as

FBbt:00000126, mesoderm
FBbt:00000128, trunk mesoderm
FBbt:00000130, visceral mesoderm

Using "exact match," only

FBbt:00000126, mesoderm

would be returned. A search for "mesoderm" using the Ontology Search would return records with the same terms as above, but also with terms such as

FBbt:00005073, somatic muscle
FBbt:00000466, oblique muscle
FBbt:00005247, hemocyte primordium
FBbt:00001666, cardioblast

and so forth.
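The difference between the two search modes can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not REDfly's implementation. The term names are taken from the example, but the parent/child edges in the toy hierarchy are assumptions made for the sketch and do not reproduce the actual anatomy ontology.

```python
# Toy hierarchy: term -> parent term (None for a root).
# ASSUMPTION: these edges are invented for illustration only.
PARENT = {
    "mesoderm": None,
    "trunk mesoderm": "mesoderm",
    "visceral mesoderm": "trunk mesoderm",
    "somatic muscle": "mesoderm",
    "oblique muscle": "somatic muscle",
    "hemocyte primordium": "mesoderm",
    "cardioblast": "mesoderm",
    "wing disc": None,  # unrelated term, used as a negative control
}


def expression_term_search(query, terms, exact=False):
    """String search: implicit wildcards on both ends unless exact=True."""
    q = query.lower()
    if exact:
        return [t for t in terms if t.lower() == q]
    return [t for t in terms if q in t.lower()]


def lineage(term):
    """Yield the term itself followed by each of its ancestors."""
    while term is not None:
        yield term
        term = PARENT.get(term)


def ontology_search(query, terms):
    """Return terms that match the query or descend from a matching term."""
    q = query.lower()
    return [t for t in terms if any(q in a.lower() for a in lineage(t))]


terms = list(PARENT)
print(expression_term_search("mesoderm", terms))
print(expression_term_search("mesoderm", terms, exact=True))
print(ontology_search("mesoderm", terms))
```

In this sketch the string search finds only terms whose own name contains "mesoderm", whereas the ontology search also returns terms such as "cardioblast" because one of their ancestors matches.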
In practice, the Expression Term search will often be too restrictive, and the Ontology search too permissive. In the future, we hope to improve the Ontology search function to allow greater control over the depth of the search. Users may find it helpful to examine the FlyBase Gene Expression section for aid in navigating the Ontology.

The Search Results page provides a summary view of all records, both CRMs and TFBSs, returned by a search. CRMs are displayed first, followed by TFBSs. Users can select to download one or more sequences directly from this page (see Downloads). Links are provided to a Detailed View page for each returned record.

Hyperlinks are provided from the Detailed View page to the FlyBase, FlyMine, and, for TFBSs, FlyTF records of the associated gene, to the UCSC genome browser, and to the FlyBase Gbrowse genome browser. Links are also provided to the PubMed citation of the primary reference. Expression pattern link-outs are described below. Note that because TFBSs and CRMs are not strand-specific sequence features, no strand information is reflected in the graphical views. When accessing the FlyBase Gbrowse genome browser we have occasionally experienced a timeout error and are working to diagnose the cause.

The Sequence link offers options for viewing the CRM or TFBS sequence in the browser window or for downloading the sequence and associated data in a variety of formats. For TFBSs, a "sequence with flank" option is also available. This option displays the TFBS sequence in capital letters, with approximately 20 bp of additional sequence extending on each end. This extended sequence allows the usually short TFBSs to be mapped unambiguously to the genome.

Many older references do not provide exact sequence referents (e.g., genome coordinates, PCR primer sequences, GenBank IDs). Most often, sequence ranges are given as restriction maps.
Because sequence polymorphisms between the clones used by researchers and the published genome sequence can lead to gain or loss of restriction sites and thus affect our determination of the reported sequence, we differentiate between those sequences unambiguously provided in the reference or through communication with the authors and those inferred from restriction maps. In those places where we were unable to locate a referenced restriction site or where sizes of the restriction fragments were not well matched with the reported sizes, we list the sequence end as "estimated/uncertain." In time, we hope to reconcile all ambiguities through communication with the authors. Sequences reported as "inferred from restriction map" use as endpoints the first nucleotide of the restriction site for both the 5' and 3' ends of the sequence. Depending on the actual cut site of the enzyme, and on any modifications and/or sites used for subcloning, the exact CRM sequence tested by the authors may therefore differ from the reported sequence by several basepairs. Orientation of CRMs is given as matching the orientation of the transcription unit, i.e., "5' end estimated" refers to the 5' end of the CRM when oriented in the same 5' to 3' direction as the gene. TFBS sequences initially from the FlyReg database do not contain sequence source terms.

The "Associated CRM" and "Associated TFBS" fields will display links to the appropriate REDfly records, where available.

See also Expression pattern annotations and Searching expression patterns. Both FlyBase and the BDGP in situ database use the anatomy ontology for reporting gene expression patterns. We have therefore provided links from each expression pattern in REDfly to each of these databases. Following these links will generate a list of genes annotated as having the selected expression pattern.
As mappings between the anatomy ontologies of different organisms are developed, we hope to create links to similarly expressed genes in these organisms as well. Following the "REDfly" link will initiate a fresh REDfly search for all records including the specified term (an "Expression Term" search).

A number of terms used in the annotation of the BDGP in situ database have not yet been brought into line with the most current fly anatomy ontology terms and contain term IDs that are associated with different terms in the anatomy ontology. REDfly conforms to the anatomy ontology throughout, which could potentially lead to confusion when comparing entries in the two databases. These conflicting terms should be resolved once the BDGP terms are updated.

The "Download" button will download the checked CRM records in one of a variety of formats. At present, REDfly supports the following options:

FASTA
Sequences in multi-FASTA format. The FASTA header contains the following data:
>CRM_name|species|gene|FlyBase_ID|chromosome

CSV
Comma-separated list, one line per record. Fields are: "name", "species name", "gene_name", "flybase_id", "chromosome", "sequence"

GFFv3
Data in GFF version 3 format. The "attributes" field holds the CRM name ("ID="); database identifiers ("dbxref=") for FlyBase, PubMed (PMID), and REDfly; and the expression terms ("Ontology_term="). Note that because TFBSs and CRMs are not strand-specific sequence features, no strand information is specified in the GFF file.

Gbrowse annotation format
The format used by Gbrowse to load local custom annotations.

REDfly supports two XML formats, one for CRMs and the other for TFBSs. This is the most comprehensive format available at this time and serves as the data interchange format. The CRM XML (a list of CRMs) contains, for each CRM: the element name, sequence, source term, evidence term, chromosome, last update date, gene name, PubMed ID, start and end coordinates (versions 3-5), a list of citations, a list of associated expression terms, a list of external references, and a list of associated TFBS names. The TFBS XML (a list of TFBSs) contains, for each TFBS: the TFBS name, sequence, flank sequence, transcription factor, gene name, chromosome, site start and end coordinates (versions 3-5), a list of citations, a list of associated expression terms, a list of external references, and a list of associated CRM names.
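As an illustration of the FASTA header layout listed above (>CRM_name|species|gene|FlyBase_ID|chromosome), the following Python sketch splits a multi-FASTA download into header fields and sequences. It assumes that no field contains a "|" character. The record names and sequences in the example text are hypothetical placeholders, not actual REDfly entries. The gene symbols and FlyBase IDs are the ones used as search examples earlier in this guide.

```python
from typing import Dict, Iterator, Tuple

# Field order as given for the FASTA header in this guide.
HEADER_FIELDS = ("name", "species", "gene", "flybase_id", "chromosome")


def parse_header(line: str) -> Dict[str, str]:
    """Split a '>a|b|c|d|e' header line into a field dictionary."""
    if not line.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    parts = line[1:].strip().split("|")
    if len(parts) != len(HEADER_FIELDS):
        raise ValueError(f"expected {len(HEADER_FIELDS)} fields, got {len(parts)}")
    return dict(zip(HEADER_FIELDS, parts))


def read_multi_fasta(text: str) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (header_fields, sequence) for each record in the text."""
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = parse_header(line), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical download content (names and sequences are invented).
EXAMPLE = """>dpp_example|Drosophila melanogaster|dpp|FBgn0000490|2L
ACGTAC
GGTTAA
>h_example|Drosophila melanogaster|h|FBgn0001168|3L
TTGACA
"""

records = list(read_multi_fasta(EXAMPLE))
for fields, seq in records:
    print(fields["name"], fields["flybase_id"], len(seq))
```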
Download a CRM XML template or a TFBS XML template
For proper validation, CRM or TFBS XML must maintain the ordering of tags as in the templates/schemas.
Any list tag such as expression_term_list (or citation_list), when used, must contain at least one expression term (or citation, as the case may be), without duplicates.
redfly_id : String
name : String, not empty
pubmed_id : Integer, not empty
flybase_id : String
Notes : String
promoter : 0 represents 'no' and 1 represents 'yes'
last_update : date and time in format YYYY-MM-DD HH:MM:SS
gene_name : String, not empty
transcription_factor : String, not empty
chromosome : String, not over 8 characters in length
species_name : String
evidence_term : String
sequence_source_term : String
sequence coordinates : Integer
citation
    id : Integer, not empty
    type : String, not empty
expression_term
    id : Integer, not empty
    name : String, not empty
external_reference
    reference_info : String, not empty
    url : String, not empty
associated_crm or associated_tfbs
    crm_name or tfbs_name : String, not empty
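Before submitting XML, it may help to check a record against a few of the constraints listed above. The Python sketch below is not the REDfly validator and does not replace validation against the templates/schemas. It assumes each record has already been loaded into a plain dictionary keyed by the field names above, and it checks only a subset of the rules. The sample values are hypothetical.

```python
from datetime import datetime

# Fields listed above as "not empty" for both record types (subset only).
REQUIRED_NONEMPTY = ("name", "pubmed_id", "gene_name")


def validate_record(rec: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    errors = []
    for field in REQUIRED_NONEMPTY:
        if str(rec.get(field, "")).strip() == "":
            errors.append(f"{field} must not be empty")
    pubmed = str(rec.get("pubmed_id", "")).strip()
    if pubmed and not pubmed.isdigit():
        errors.append("pubmed_id must be an integer")
    if len(str(rec.get("chromosome", ""))) > 8:
        errors.append("chromosome must not exceed 8 characters")
    if "promoter" in rec and str(rec["promoter"]) not in ("0", "1"):
        errors.append("promoter must be 0 (no) or 1 (yes)")
    if "last_update" in rec:
        try:
            # Rejects values that do not parse as YYYY-MM-DD HH:MM:SS.
            datetime.strptime(str(rec["last_update"]), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            errors.append("last_update must be YYYY-MM-DD HH:MM:SS")
    return errors


def validate_list(tag: str, items: list) -> list:
    """A list tag, when used, needs at least one entry and no duplicates."""
    errors = []
    if len(items) == 0:
        errors.append(f"{tag} must contain at least one entry")
    if len(set(items)) != len(items):
        errors.append(f"{tag} must not contain duplicates")
    return errors


# Hypothetical record values, for illustration only.
good = {"name": "dpp_example", "pubmed_id": "12345", "gene_name": "dpp",
        "chromosome": "2L", "promoter": 0,
        "last_update": "2008-01-15 10:30:00"}
bad = {"name": "", "pubmed_id": "abc", "gene_name": "dpp",
       "chromosome": "chromosome_2L", "promoter": 2,
       "last_update": "15/01/2008"}

print(validate_record(good))
print(validate_record(bad))
print(validate_list("expression_term_list", ["mesoderm", "mesoderm"]))
```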