Introduction | Preparing Input Sequences | Output | Parameters of MAGIIC-PRO | Mirror Site
Introduction

MAGIIC-PRO employs a novel combination of intra- and inter-block gap constraints to discover functional motifs that are interleaved by several large irregular gaps, with the aim of finding functional signatures of a query protein and its related sequences. Automatic discovery of patterns in unaligned biological sequences is an important problem in molecular biology. MAGIIC-PRO differs from previously developed packages that perform similar tasks in two major ways.
MAGIIC-PRO is efficient and effective in identifying functional sites and predicting hot regions in protein-protein interactions directly from protein sequences. The smaller, rigid intra-block gap constraint relaxes the restriction within local motif blocks while keeping them compact, and the larger, flexible inter-block gap constraint allows longer irregular gaps between compact motif blocks. Using two types of gap constraints for different purposes improves the efficiency of the mining process while keeping the mining results highly accurate.

Note. The output patterns are organized so that a pattern that is closed by another pattern with the same support is not reported. Users are encouraged to start with a pilot search under strict constraints and then relax the constraints step by step.

Note. The current version of the MAGIIC-PRO service limits the number of discovered patterns to 800,000 and the number of discovered blocks to 5,000.

Main Site. http://biominer.bime.ntu.edu.tw/magiicpro at Department of Bio-Industrial Mechatronics Engineering, National Taiwan University.

Mirror Site 1. http://biominer.cse.yzu.edu.tw/magiicpro at Department of Computer Science and Engineering, Yuan-Ze University.

We assume that every user of MAGIIC-PRO has a protein sequence of interest at hand. MAGIIC-PRO takes a protein sequence as input and helps the user prepare training data for pattern mining. Related sequences of the query protein can be collected by using Swiss-Prot annotations or by running the PSI-BLAST program. Once the query protein and the training data have been determined, the mining process is executed using the parameters described in the following subsection.

Currently, MAGIIC-PRO accepts a set of sequences in FASTA format (example). The input can be specified in four different ways.
A sequence in FASTA format begins with a single-line header followed by one or more lines of sequence data. The header line starts with a greater-than symbol (">"). The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. The end of a sequence is identified when either the start of a new sequence is found or the end of the file is reached.

If the Swiss-Prot accession number is used as part of the sequence identifier in the FASTA header, e.g. ">sp|P08692|ARC1_ECOLI", MAGIIC-PRO will automatically search the PDB database for available structures. This applies to both the query protein and the sequences in the training data. The identifier in the example above contains three fields separated by the "|" symbol. The first field, "sp", is a short name for the Swiss-Prot database (obligatory); the second field, "P08692", is a Swiss-Prot accession number (obligatory); and the third field, "ARC1_ECOLI", is a Swiss-Prot entry name (optional).
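The header rules above can be checked locally before submitting a file. The following is a minimal sketch in Python, not part of MAGIIC-PRO itself: the function names `read_fasta`, `split_header`, and `parse_swissprot_id` are hypothetical, and the description and sequences in the sample input are toy data made up for illustration.

```python
from typing import Dict, Iterator, Optional, Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split the text after '>' into (identifier, description)."""
    # A space right after '>' means there is no identifier at all.
    if not header or header[0].isspace():
        return "", header.strip()
    parts = header.split(None, 1)
    description = parts[1] if len(parts) > 1 else ""
    return parts[0], description


def read_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # The end of the input closes the last record.
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


def parse_swissprot_id(identifier: str) -> Optional[Dict[str, str]]:
    """Return the 'sp|accession|entry_name' fields, or None if not in that form."""
    fields = identifier.split("|")
    if len(fields) < 2 or fields[0] != "sp" or not fields[1]:
        return None
    entry_name = fields[2] if len(fields) > 2 else ""  # entry name is optional
    return {"db": "sp", "accession": fields[1], "entry_name": entry_name}


# Toy input: the description and the sequences are placeholders.
example = """>sp|P08692|ARC1_ECOLI toy description
ACDEFGHIK
>seq2
ACDEFGHIK
LMNPQ
"""
records = list(read_fasta(example))
for identifier, description, sequence in records:
    print(identifier, parse_swissprot_id(identifier), len(sequence))
```

Records whose identifier does not follow the "sp|accession|entry_name" form simply yield `None` here; how MAGIIC-PRO treats them internally is not specified beyond the rules stated in the text.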
If the user knows that the query sequence has a 3D structure in the PDB database, the PDB code (e.g. 1exk, or 1exk:A for chain A) can be used as the identifier. The following examples are acceptable.
Note. If no characters follow the greater-than symbol (">") in the header, the program automatically assigns a unique sequential number as the sequence identifier.

The minimum support : The minimum percentage of sequences that must match a derived pattern. If you select this option and set minimum support = 60, MAGIIC-PRO will derive patterns that match at least 60% of the input sequences. A pattern with high support usually highlights the most highly conserved residues of a functional region, while a large-size but low-support pattern in general provides a complete signature of a functional site.

The minimum support constraint is the most important parameter of MAGIIC-PRO, and setting it is a subtle task. A pattern is reported as long as it is supported by at least a certain number of sequences. Too small a value may generate thousands of patterns and require a huge amount of computation, whereas too large a value may yield no patterns at all. The support constraint is critical to the mining results because it may not be possible to know in advance in what percentage of the training data a satisfactory pattern can be found.

To help find an appropriate minimum support, MAGIIC-PRO allows the minimum support to be decreased dynamically: the user starts a pilot search with a large value for the support constraint, say 90%, and decreases it iteratively by a fixed step, say 10%, until at least a certain number of patterns have been found. At this stage, we have usually observed that most patterns with the maximum support are related to a functional site of the query protein but do not serve as a complete signature of that site. To find patterns that involve more conserved blocks, we suggest that users continue decreasing the minimum support constraint by a fixed step, say 5% or 10%.

In addition to the support constraint, MAGIIC-PRO has some other parameters for advanced users.
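The dynamic decrease of the minimum support described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `relax_min_support`, its parameters, and the `toy_miner` stand-in are hypothetical names, and the toy support values are invented; the real mining step is performed by the MAGIIC-PRO server.

```python
from typing import Callable, Dict, List, Tuple


def relax_min_support(
    mine: Callable[[int], List[str]],
    start: int = 90,
    step: int = 10,
    min_patterns: int = 1,
    floor: int = 10,
) -> Tuple[int, List[str]]:
    """Lower the support threshold until enough patterns are found.

    `mine` is any function that returns the patterns found at a given
    minimum support (in percent). `floor` is an assumed lower bound that
    stops the search from running with an impractically small support.
    """
    support = start
    patterns = mine(support)
    while len(patterns) < min_patterns and support - step >= floor:
        support -= step
        patterns = mine(support)
    return support, patterns


# Stand-in for the real miner: toy patterns with invented support values.
toy_supports: Dict[str, int] = {"G-x-G": 72, "Y-x-C-G": 65, "P-x(2)-G": 40}


def toy_miner(min_support: int) -> List[str]:
    return sorted(p for p, s in toy_supports.items() if s >= min_support)


print(relax_min_support(toy_miner, start=90, step=10, min_patterns=1))
print(relax_min_support(toy_miner, start=90, step=10, min_patterns=2))
```

Starting strict and relaxing in fixed steps keeps the early runs cheap, since a high threshold prunes most candidate patterns; the loop only pays for a low-support search when the stricter ones return too few patterns.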
Before going into the details, we first give a formal definition of a pattern block. An exact pattern is a sequence of two or more pattern elements in which each successive pair of pattern elements is separated by a rigid or flexible gap. The notation x(a,b), a < b, denotes a flexible gap with a minimum length of a and a maximum length of b, and x(a) denotes a rigid gap with a fixed length of a. The wildcard x(a) is omitted if a = 0 and is written as x if a = 1, i.e. x = x(1). Small, rigid gaps are considered intra-block gaps, while large, flexible gaps are treated as inter-block gaps.

The first group of advanced parameters specifies the gap constraints.
The second group of advanced parameters specifies the size constraints.
The conservation plot and pattern snapshot: After the mining process finishes, users can first look at the conservation plot and the pattern snapshot. As shown in Figure 1(a), the locations of the conserved regions are summarized in the complete conservation plot derived from all the patterns; in Figure 1(a), nine conserved regions can be observed in the query protein. On the same web page, an interactive interface lets users collect patterns of interest in a pattern snapshot. Unlike the conservation plot, a pattern snapshot also shows which pattern blocks are simultaneously conserved during evolution.

Users are advised to browse the lists of the top ten high-support and top ten large-size patterns. The size of a pattern is defined as the number of exact components it contains. A pattern with high support usually highlights the most highly conserved residues of a functional region, while a large-size but low-support pattern in general provides a complete signature of a functional site. Users can select patterns in the interactive snapshot, move the mouse over any bar in the conservation plot to show a brief description of the associated residue, or click the bar for detailed information. For more information about the facilities of MAGIIC-PRO, see [The facilities of MAGIIC-PRO].
The format of outputted pattern:

ID   Sup.  Hits  Size  #Blocks  Posite Patterns
1.   173   180   6     2        217G-x-G-x(2)-P222-x(64,111)-316Y-x-C-G319
...

Pattern ID : 1. The ID of the derived pattern.
Support of the pattern : This pattern matches 173 different input sequences.
Hits of the pattern : This pattern has 180 hits in the 173 supporting sequences.
Size of the pattern : This pattern has 6 elements.
Number of blocks in the pattern : This pattern has 2 blocks.
Discrete Pattern (regular expression) : G-x-G-x(2)-P-x(64,111)-Y-x-C-G
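To experiment with a discrete pattern outside the web interface, it can be translated into a standard regular expression. The sketch below is an assumption-laden illustration, not MAGIIC-PRO code: `parse_pattern` and `ParsedPattern` are hypothetical names, the block count assumes that every flexible gap x(a,b) separates two blocks (consistent with the example above), and the test sequence is toy data.

```python
import re
from typing import NamedTuple

# Matches the gap tokens x, x(a) and x(a,b).
GAP_TOKEN = re.compile(r"^x(?:\((\d+)(?:,(\d+))?\))?$")


class ParsedPattern(NamedTuple):
    regex: str   # equivalent Python regular expression
    size: int    # number of exact (non-gap) elements
    blocks: int  # number of blocks, assuming x(a,b) separates blocks


def parse_pattern(pattern: str) -> ParsedPattern:
    """Translate a discrete pattern such as 'G-x-G-x(2)-P' into a regex."""
    regex_parts = []
    size = 0
    flexible_gaps = 0
    for token in pattern.split("-"):
        gap = GAP_TOKEN.match(token)
        if gap is None:
            # An exact element, e.g. a residue letter.
            regex_parts.append(re.escape(token))
            size += 1
        elif gap.group(2) is not None:
            # Flexible inter-block gap x(a,b).
            regex_parts.append(".{%s,%s}" % (gap.group(1), gap.group(2)))
            flexible_gaps += 1
        elif gap.group(1) is not None:
            # Rigid gap x(a).
            regex_parts.append(".{%s}" % gap.group(1))
        else:
            # Bare x is shorthand for x(1).
            regex_parts.append(".")
    return ParsedPattern("".join(regex_parts), size, flexible_gaps + 1)


parsed = parse_pattern("G-x-G-x(2)-P-x(64,111)-Y-x-C-G")
print(parsed)

# Toy sequence built to contain one occurrence of the pattern.
toy_sequence = "A" + "GAGAAP" + "A" * 64 + "YACG"
match = re.search(parsed.regex, toy_sequence)
print(match.start() if match else None)
```

For the example pattern, the computed size and block count agree with the Size and #Blocks columns shown in the output table.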
Links Associated with Each Pattern : To facilitate studying the patterns found by MAGIIC-PRO, we provide five useful links for evaluating the quality of each pattern.

First, the locations of the pattern can be highlighted in its supporting sequences.
Second, the derived pattern can be plotted on a three-dimensional protein structure if PDB entries are available for any of the supporting sequences.
Third, the derived pattern can be fed to the ScanProsite web service to check its selectivity, that is, its ability to reject false positive matches.
Fourth, users can perform a multiple sequence alignment on the segments of the supporting sequences that are associated with the selected pattern. This helps the user construct a more generalized pattern that takes amino acid substitutions into account.
Fifth, MAGIIC-PRO aligns each excluded sequence with the corresponding segment of the query protein. This helps to show why a particular sequence does not match the pattern.
* The results will be kept on the server for one month. The results page, if bookmarked, can be reached during this period.

Sequential pattern mining often generates an exponential number of frequent patterns that are subsequences of a long pattern, which is prohibitively expensive in both time and space. A solution to this problem, called mining frequent closed sequential patterns, was proposed recently by [Yan et al. 2003, Wang et al. 2004]. A pattern P is support-closed if there exists no super-pattern of P with the same number of supporting sequences in the database. MAGIIC-PRO employs a closure checking scheme based on the BIDE technique [Wang et al. 2004], extended to consider bounded-size gaps. Considering closed patterns that span large wildcard regions is critical to identifying functional sites of proteins, because it refines the mining results by focusing on the co-occurrence of the conserved regions. The scheme aims to eliminate more aggressively, during the mining process, patterns that are covered by super-patterns with the same occurrences. [ see the page for details]

Yan, X., Han, J., Afshar, R. (2003) CloSpan: Mining Closed Sequential Patterns in Large Databases. In Proc. of the 3rd SIAM Intl. Conf. on Data Mining (SDM'03).
Wang, J., Han, J. (2004) BIDE: Efficient Mining of Frequent Closed Sequences. In Proc. of the 20th Intl. Conf. on Data Engineering (ICDE'04), 79-90.

We provide many experiments and examples to demonstrate the capability of MAGIIC-PRO in identifying flexible and long patterns, and we show through case studies the potential of MAGIIC-PRO in identifying functional sites and predicting hot regions in protein-protein interactions directly from protein sequences. All the experiments were conducted on a machine with a 3.4 GHz Intel Pentium 4 CPU and 2 GB of memory, running Linux Fedora 4 Server. More results can be found on the web page: http://biominer.cse.yzu.edu.tw/magiicpro/Examples/index.html

Mirror Site
1. http://biominer.bime.ntu.edu.tw/magiicpro at Department of Bio-Industrial Mechatronics Engineering, National Taiwan University. (Main Site)
2. http://biominer.cse.yzu.edu.tw/magiicpro at Department of Computer Science and Engineering, Yuan-Ze University.
Feedback and application descriptions are always welcome.