1 Input
1.1 Sequence files
One or more GenBank, FASTA, text or multiple FASTA/text files can be loaded into DiProGB using the menu item File->Open. All selected files are stored in the Sequence list. By default the sequence graph is shown for the first sequence of the first list entry. Sequence files can also be loaded in other ways: they can be added via the Sequence list (2.2), or dragged and dropped into the main window. In the latter case the file name(s) are added to the Sequence list and the first file is opened automatically. Sequence information can also be entered manually (File->Open->Manual input) or downloaded directly from NCBI (File->Download->NCBI Sequence files).
1.1.1 GenBank files
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research 2007 Jan; 35(Database issue): D21-5). A GenBank flat file contains a large amount of information. Normally it consists of the sequence information itself and several annotated features and qualifiers. DiProGB searches for the following keywords and extracts the corresponding information:
- DEFINITION - to get the title of the sequence
- LOCUS - to determine if the sequence is circular or not
- FEATURES - to extract all following annotated features and their location/qualifiers
- ORIGIN - to extract the following nucleotide sequence
The sequence information after ORIGIN is parsed according to the rules of 1.1 General information.
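The keyword search described above can be sketched as follows. This is only a minimal illustration, not DiProGB's actual parser: the function name parse_genbank, the returned dictionary layout and the sample record are invented for this example, and feature lines are collected unparsed.

```python
def parse_genbank(text):
    """Collect title, topology, raw feature lines and sequence from a GenBank record."""
    record = {"title": "", "circular": False, "feature_lines": [], "sequence": ""}
    section = None
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if not line.strip():                 # skip blank lines
            continue
        if not line[0].isspace():            # top-level keywords start in column 1
            section = line.split()[0]
            rest = line[len(section):].strip()
            if section == "LOCUS":
                record["circular"] = "circular" in rest.split()
            elif section == "DEFINITION":
                record["title"] = rest
            continue
        # Indented lines belong to the most recent keyword.
        if section == "DEFINITION":
            record["title"] += " " + line.strip()
        elif section == "FEATURES":
            record["feature_lines"].append(line.rstrip())
        elif section == "ORIGIN":
            # Drop position numbers and blanks, keep only the letters.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
    return record


# Invented sample record, for illustration only.
sample = '''\
LOCUS       TEST0001       12 bp    DNA     circular BCT 01-JAN-2000
DEFINITION  Hypothetical test plasmid,
            complete sequence.
FEATURES             Location/Qualifiers
     gene            1..12
                     /gene="abc"
ORIGIN
        1 acgtacgtac gt
//
'''
record = parse_genbank(sample)
print(record["title"], record["circular"], record["sequence"])
```

The extracted sequence is left as found in the file; replacing lower-case or non-unique letters would be a separate step.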
1.1.2 (multiple) FASTA files
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. DiProGB uses the first line starting with ">" as the name of the sequence. All other lines starting with ">" are ignored. The rest is extracted as sequence information and parsed according to the rules of 1.1 General information. Multiple files can also be opened via File->Open->FASTA/Text file(s). A multiple FASTA file contains more than one nucleotide sequence in one file. The sequences are separated by comment lines starting with ">". For multiple FASTA files the genome browser takes each line starting with ">" as a sequence title and the following lines up to the next ">" as the nucleotide sequence. The sequences are named as follows: ~ + number in the multiple file + ~ + path and name of the multiple file (e.g. "~1~C:\multiplesequence.fasta").
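The splitting and naming of a multiple FASTA file can be illustrated with a short sketch. The function name read_multi_fasta and the tuple layout are assumptions made for this example; only the naming scheme (~ + number + ~ + path) is taken from the description above.

```python
def read_multi_fasta(text, path):
    """Split multiple-FASTA text into (name, title, sequence) tuples."""
    entries = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:            # close the previous sequence
                entries.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:              # sequence data of the current entry
            chunks.append(line)
    if title is not None:
        entries.append((title, "".join(chunks)))
    # Name each sequence "~<number>~<path>", numbering from 1.
    return [(f"~{i}~{path}", t, s) for i, (t, s) in enumerate(entries, start=1)]


text = ">seq one\nACGT\nAC\n>seq two\nGGTT\n"
result = read_multi_fasta(text, r"C:\multiplesequence.fasta")
for name, title, sequence in result:
    print(name, title, sequence)
```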
1.1.3 Text files
A pure text file containing only sequence information can also be uploaded. It is handled like a FASTA file without expecting ">". The advantage is that one can copy an unformatted nucleotide sequence into a plain text file and open it with DiProGB. The path and filename of the text file are used as the sequence name (e.g. in the statistics table).
1.1.4 Downloading NCBI sequence files or further dinucleotide properties
The option File->Download->NCBI sequence files offers the possibility to download FASTA or GenBank files directly from the NCBI homepage. After clicking the menu entry the user can insert (e.g. paste) a list of accession numbers separated by returns, spaces, tabs or commas. To search for accession numbers, the user is directed to the NCBI homepage when pressing the Search button. After a file type (FASTA or GenBank) has been chosen and a folder for storing the files has been specified, the Start download button starts the download. The paths and file names of successfully downloaded files are inserted into the list of preselected sequences (cf. 2.2).
To download additional dinucleotide properties, the user is directed to the DiProDB homepage after clicking File->Download->Dinucleotide properties.
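How a pasted accession list with mixed separators (returns, spaces, tabs, commas) can be broken into individual entries is shown in the following sketch. The function name split_accessions and the accession strings are made up for illustration; the download itself is not shown.

```python
import re


def split_accessions(raw):
    """Split a pasted list on any run of whitespace and/or commas."""
    return [token for token in re.split(r"[\s,]+", raw) if token]


# Placeholder identifiers, not real accession numbers.
pasted = "ACC00001, ACC00002\nACC00003\tACC00004  "
accessions = split_accessions(pasted)
print(accessions)
```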
1.2 Feature files
In order to add annotated biological information (e.g. gene information) to the previously uploaded sequence file, a number of different file formats are supported:
PTT - files
The PTT file format is a table of protein features. A Protein Table (PTT) file contains the location, strand, length, ID, generic name,
COG assignation and functional annotation for every predicted protein in whole genomes. These protein table files can be obtained from
NCBI's ftp site: ftp://ftp.ncbi.nih.gov/genomes/.
GFF - files
GFF = "General Feature Format" is a format for describing genes and other features associated with DNA, RNA and Protein sequences.
Detailed information can be found here.
GTF - files
GTF = "Gene Transfer Format" is a refinement to GFF that tightens the specification. The first eight GTF
fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute
consists of a type/value pair. Attributes must end in a semi-colon, and be separated from any following
attribute by exactly one space. Detailed information can be found here.
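The attribute list of a GTF line can be split into type/value pairs as in the sketch below. The function name parse_gtf_attributes and the sample attribute string are assumptions for this example. The sketch does not handle semicolons inside quoted values.

```python
def parse_gtf_attributes(field):
    """Turn 'type "value"; type "value";' into a dictionary."""
    attributes = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:                          # text after the final semicolon
            continue
        key, _, value = part.partition(" ")   # type and value are separated by a space
        attributes[key] = value.strip().strip('"')
    return attributes


attrs = parse_gtf_attributes('gene_id "g1"; transcript_id "g1.t1";')
print(attrs)
```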
BED - files
BED format provides a flexible way to define the data lines that are displayed in an annotation track.
BED lines have three required fields and nine additional optional fields. The number of fields per line
must be consistent throughout any single set of data in an annotation track. The order of the optional
fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used.
Detailed information can be found here.
To upload a feature file the user can click on File->Open->Feature file. After a file in one of the standard formats (.gff, .gtf, .ptt, .bed) or a custom file in a tabular format has been selected, an interface opens. For a custom file the user can specify column separators, comments and the table header. After pressing the button Upload file the feature file is parsed. Each row of the table is regarded as a separate feature. For a custom file the user can now assign columns to the following predefined features.
Predefined feature | Description |
---|---|
Sequence name | Column that contains the name of the sequence for each feature (e.g. chromosome or organism). At the end of the upload process the user is asked to specify one sequence name matching the uploaded sequence. This allows the feature file to be filtered if it contains features for different sequences. |
Feature name | Name of the feature, e.g. CDS or gene. All other columns are handled as qualifiers for this column. |
Start | First position of the feature. |
End | Last position of the feature. |
Strand | Strand the feature is situated on. "-" denotes the minus strand; all other characters, e.g. "+" or ".", are interpreted as the positive strand. |
The assignment of all predefined features is optional (can be "None"). If a feature is not merged with an already uploaded feature, at least the two columns containing the Start and End positions have to be specified. For a standard format the columns for the predefined features are selected automatically.
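The column assignment for a custom tabular file can be pictured with the following sketch. It is not DiProGB's implementation: the function name row_to_feature, the dictionary keys and the sample row are invented for this example. It applies the rules stated above: Start and End are required, "-" means minus strand, every other strand character counts as plus, and unassigned columns become qualifiers.

```python
def row_to_feature(row, columns):
    """Build a feature from one table row.

    columns maps predefined feature names to column indices (or None).
    """
    if columns.get("Start") is None or columns.get("End") is None:
        raise ValueError("Start and End columns must be assigned")

    def cell(name):
        index = columns.get(name)
        return row[index].strip() if index is not None else None

    assigned = {i for i in columns.values() if i is not None}
    return {
        "sequence_name": cell("Sequence name"),
        "feature_name": cell("Feature name"),
        "start": int(cell("Start")),
        "end": int(cell("End")),
        # Only "-" marks the minus strand; anything else is the plus strand.
        "strand": "-" if cell("Strand") == "-" else "+",
        # Columns without an assignment are kept as qualifiers.
        "qualifiers": [v for i, v in enumerate(row) if i not in assigned],
    }


columns = {"Sequence name": 0, "Feature name": 1, "Start": 2, "End": 3, "Strand": 4}
feature = row_to_feature("chr1\tgene\t100\t250\t-\tID=g1".split("\t"), columns)
print(feature)
```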
In a last step the user has to decide whether to create a new list for the uploaded features (the old list is cleared) or to add them to the existing feature list. For the latter case there are again two options. The features can either be added to the end of the existing list, or the new features can be merged with the already existing features by matching the content of two specified columns either exactly or partially.
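The merge option can be pictured with the sketch below. How DiProGB defines a partial match is not stated here, so the sketch assumes a substring test in either direction; the function name merge_features, the dictionary representation of features and the sample values are likewise invented for illustration.

```python
def merge_features(existing, new, key_existing, key_new, partial=False):
    """Copy qualifiers from new rows into existing rows whose key columns match."""

    def matches(a, b):
        if not a or not b:                    # never match on empty cells
            return False
        # Assumption: "partial" means one value contains the other.
        return (a in b or b in a) if partial else a == b

    merged = [dict(row) for row in existing]  # leave the input lists untouched
    for new_row in new:
        for row in merged:
            if matches(str(row.get(key_existing, "")), str(new_row.get(key_new, ""))):
                for key, value in new_row.items():
                    row.setdefault(key, value)  # keep values that already exist
    return merged


existing = [{"gene": "geneA", "start": 190, "end": 255}]
new = [{"name": "geneA_v2", "product": "hypothetical protein"}]
exact = merge_features(existing, new, "gene", "name")
loose = merge_features(existing, new, "gene", "name", partial=True)
print(exact)
print(loose)
```

With exact matching "geneA" and "geneA_v2" differ, so nothing is merged; with partial matching the product qualifier is attached to the existing feature.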
All features are inserted into the feature list when the Assign content button is clicked. With the help of the Feature Selection window, all features or a subset of them can be selected to be colored on the sequence graph. The Feature List displays all selected features and their qualifiers in a separate window. More information about both windows can be found in section 2.3 Feature List, Selection.
The added information is lost if the user changes the uploaded sequence file.
1.3 Non-unique nucleotides
DiProGB copes with non-unique nucleotides. When a sequence file is loaded, a warning window pops up with information about the non-unique nucleotides contained in the sequence. Non-unique nucleotides are replaced by the standard nucleotides A, C, G, T according to the following rules:
- a,t,g,c are replaced by the capital letters A,T,G,C
- u,U are replaced by T
- r,R are randomly replaced by A or G
- y,Y are randomly assigned to C or T
- s,S are randomly assigned to C or G
- w,W are randomly assigned to A or T
- k,K are randomly assigned to G or T
- m,M are randomly assigned to C or A
- b,B are randomly assigned to C or G or T
- d,D are randomly assigned to A or G or T
- h,H are randomly assigned to A or C or T
- v,V are randomly assigned to A or C or G
- all others are randomly assigned to A or C or G or T
For all random cases the replaced nucleotides are always equally distributed and the replacement information is stored in the list of all features under non unique nucleotide information.
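The replacement rules above can be summarised in a short sketch. It is an illustration only: the names AMBIGUITY and resolve are invented, "equally distributed" is interpreted here as a uniform random choice among the allowed bases, and the input is assumed to contain sequence characters only (no digits or blanks).

```python
import random

# Allowed standard bases for each non-unique letter, as listed above.
AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def resolve(sequence, rng=None):
    """Return the resolved sequence and a log of random replacements."""
    rng = rng or random.Random()
    resolved, log = [], []
    for position, char in enumerate(sequence, start=1):
        upper = char.upper()
        if upper in ("A", "C", "G", "T"):
            base = upper                      # a,t,g,c become capital letters
        elif upper == "U":
            base = "T"                        # u,U become T
        else:
            # Any other letter falls back to all four bases.
            base = rng.choice(AMBIGUITY.get(upper, "ACGT"))
            log.append((position, char, base))
        resolved.append(base)
    return "".join(resolved), log


resolved, log = resolve("acguRN", random.Random(0))
print(resolved, log)
```

The log plays the role of the replacement information that is stored with the features; its exact format in DiProGB is not specified here.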
1.4 Dinucleotide properties
Dinucleotide in this context means two adjacent (neighbouring) nucleotides in RNA or DNA sequences. The sequence alphabet consists of the four bases A, T (U), G and C, which leads to the 16 different dinucleotides AA, AT, AC, AG, TA, TT, TC, TG, CA, CT, CC, CG, GA, GT, GC, GG. If dinucleotide properties were determined for double-stranded DNA/RNA there is no strand direction preference. In this case some of the dinucleotide steps are symmetry-related and therefore there are only ten unique dinucleotide step types: AT, TA, GC, CG, AA (=TT), AC (=GT), AG (=CT), CA (=TG), CC (=GG), GA (=TC). Dinucleotides represent the smallest unit for describing neighbouring relationships between adjacent bases.
For each of the 16 dinucleotides thermodynamic (e.g. stacking energy, free energy, ...), structural (e.g. twist, roll, ...) or other properties (e.g. sequence based) are available, based on previous experimental or computational work. More information on dinucleotides and a collection of more than 100 published dinucleotide property sets can be found in the dinucleotide property database (DiProDB).
DiProGB encodes the sequence information containing 4 bases by the 16 values of the dinucleotide properties and creates a sequence graph (see 2.1). To change or upload dinucleotide properties the user has to open the DiPro list described in section 2.2 Sequences, Dinucleotide properties. In this list several default dinucleotide property sets can be found. The user can also add files with one or more new dinucleotide property sets.
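The reduction from 16 dinucleotides to ten unique steps, and the encoding of a sequence by dinucleotide values, can be checked with a small sketch. The names reverse_complement, unique_steps and encode are invented for this example. The property used (number of G or C bases in the dinucleotide) is a simple computed stand-in, not a property set taken from DiProDB, and the sketch does not reproduce any further processing DiProGB may apply when drawing the sequence graph.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(step):
    """Return the same step as read on the opposite strand."""
    return step.translate(COMPLEMENT)[::-1]


all_steps = ["".join(pair) for pair in product("ACGT", repeat=2)]

# A step and its reverse complement are symmetry-related; keep one of each pair.
unique_steps = sorted({min(s, reverse_complement(s)) for s in all_steps})


def encode(sequence, prop):
    """Replace each pair of adjacent bases by its property value."""
    return [prop[sequence[i:i + 2]] for i in range(len(sequence) - 1)]


# Stand-in property: count of G/C bases per dinucleotide (computed, not from DiProDB).
gc_count = {s: s.count("G") + s.count("C") for s in all_steps}

print(len(all_steps), len(unique_steps), unique_steps)
print(encode("ACGT", gc_count))
```

A sequence of n bases yields n - 1 values, one per overlapping dinucleotide.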
1.5 Sessions
Uploading a session (.view file) allows the user to continue previously saved work. It is also useful for simply saving all parameters of an especially interesting session. The session files contain all information about:
- the file- and pathnames
- the start and end nucleotide of the shown part of the sequence
- all selected features and qualifiers
- the color table
- parameters governing the appearance of the sequence graph.