Introduction | Preparing Input Sequences | Output | Parameters of MAGIIC-PRO | Mirror Site
The Facilities of MAGIIC-PRO | The Closure Checking Scheme | Experiments and Examples


Introduction

MAGIIC-PRO discovers functional signatures of a query protein and its related sequences. It employs a novel combination of intra- and inter-block gap constraints to find functional motifs whose conserved blocks are separated by several large irregular gaps.

Automatic discovery of patterns in unaligned biological sequences is an important problem in molecular biology. MAGIIC-PRO is different from several previously developed packages performing similar tasks in two major ways.

The first notable feature of MAGIIC-PRO is its efficiency. Discovering patterns that are composed of several conserved regions interleaved with long irregular gaps is time-consuming because of the large search space. By incorporating new types of gap constraints and several state-of-the-art data mining techniques, MAGIIC-PRO usually identifies satisfactory patterns within an acceptable response time. This efficiency makes it possible to quickly detect functional signatures whose residues come from more than one region of the protein sequence, or that are conserved in only a few members of a protein family.

The second notable feature of MAGIIC-PRO is its effort to refine the mining results. Pattern mining tools often produce a large number of patterns but fail to guide users directly to the important discoveries. Considering large flexible gaps improves the completeness of the detected functional sites and, at the same time, eliminates many noisy patterns. With the proposed pattern snapshot and conservation plot, the relationships between distinct conserved blocks can be easily visualized, and users can then re-judge the importance of each pattern themselves.

MAGIIC-PRO is efficient and effective in identifying functional sites and predicting hot regions in protein-protein interactions directly from protein sequences. The smaller rigid intra-block gap constraint is used to relax the restriction in local motif blocks but still keep them compact, and the larger flexible inter-block gap constraint is proposed to allow longer irregular gaps between compact motif blocks. Using two types of gap constraints for different purposes improves the efficiency of the mining process while keeping high accuracy of mining results.

Note. The output patterns are organized so that a pattern is not reported if it is contained in another pattern (a super-pattern) with the same support. Users are encouraged to start with a pilot search under strict constraints and then relax the constraints step by step.

Note. The current version of MAGIIC-PRO service limits the number of discovered patterns to 800,000, and limits the number of discovered blocks to 5,000.

Main Site. http://biominer.bime.ntu.edu.tw/magiicpro at Department of Bio-Industrial Mechatronics Engineering, National Taiwan University.

Mirror Site 1. http://biominer.cse.yzu.edu.tw/magiicpro at Department of Computer Science and Engineering, Yuan-Ze University.


Preparing Input Sequences

We assume that every user of MAGIIC-PRO has a protein sequence of interest at hand. MAGIIC-PRO takes a protein sequence as input and helps the user prepare a training data set for pattern mining. Sequences related to the query protein can be collected by using Swiss-Prot annotations or by running the PSI-BLAST program. Once the query protein and the training data have been determined, the mining process is executed using the parameters described in the section Parameters of MAGIIC-PRO.

Currently, MAGIIC-PRO accepts a set of sequences in FASTA format. The input can be specified in four different ways.

Option one : Exploit Swiss-Prot annotation. For this option, you need the Swiss-Prot accession number of the query protein.

Option two : Use PSI-BLAST to find homologues of a query protein. For this option, you need either a Swiss-Prot accession number or a query sequence in FASTA format.

Option three : Prepare the training data by protein family ID. For this option, the first sequence will be the reference protein. Users can use regular expressions to retrieve sequences of special interest from the Swiss-Prot database. Regular expressions are very powerful for specifying a pattern in a complex search.

Option four : Prepare the training data by 'copy and paste' or by specifying a file. For this option, you need the set of protein sequences. If no query/reference sequence is specified, the first protein sequence of the training data will serve as the reference protein.

Note. The current version of MAGIIC-PRO service limits the number of query sequences to 500.

A sequence in FASTA format begins with a single-line header followed by one or more lines of sequence data. The header line starts with a greater-than symbol (">"). The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. A sequence ends where the header of a new sequence begins or where the end of the file is reached.
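The FASTA rules above can be illustrated with a small reader. This is only a sketch of the format as described on this page, not the parser used by the MAGIIC-PRO server; the function name parse_fasta is hypothetical, and the numbering scheme used for empty headers (1, 2, ...) is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, description, sequence)."""
    records = []
    current = None   # [identifier, description, list of sequence lines]
    auto_id = 0      # counter for headers that carry no identifier (assumed scheme)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if current is not None:
                records.append((current[0], current[1], "".join(current[2])))
            header = line[1:].strip()
            if header:
                parts = header.split(None, 1)
                ident = parts[0]
                desc = parts[1] if len(parts) > 1 else ""
            else:
                auto_id += 1
                ident, desc = str(auto_id), ""
            current = [ident, desc, []]
        elif current is not None:
            current[2].append(line)
    # The end of the text closes the last record.
    if current is not None:
        records.append((current[0], current[1], "".join(current[2])))
    return records
```

Each record keeps the identifier and the description apart, so the identifier can be inspected separately from the free-text description.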

If the Swiss-Prot accession number is used as part of the sequence identifier in the FASTA format, e.g. ">sp|P08692|ARC1_ECOLI", MAGIIC-PRO will automatically search the PDB database for available PDB structures. This applies to both the query protein and the sequences in the training data. The identifier in the example above contains three fields separated by the "|" symbol. The first field "sp" is a short name for the Swiss-Prot database (obligatory), the second field "P08692" is a Swiss-Prot accession number (obligatory), and the third field "ARC1_ECOLI" is a Swiss-Prot entry name (optional).
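As an illustration of the three-field identifier, the following sketch splits an identifier on "|" and checks the two obligatory fields. The function name split_swissprot_id is hypothetical, and how the server itself recognizes identifiers is not documented here.

```python
def split_swissprot_id(identifier):
    """Return (database, accession, entry_name) for an 'sp|ACC|NAME' identifier.

    Returns None when the obligatory 'sp' prefix or the accession is missing.
    The entry name is optional and is reported as None when absent.
    """
    fields = identifier.split("|")
    if len(fields) < 2 or fields[0] != "sp" or not fields[1]:
        return None
    entry_name = fields[2] if len(fields) > 2 and fields[2] else None
    return fields[0], fields[1], entry_name
```

For example, "sp|P08692|ARC1_ECOLI" yields the three fields, while a PDB-style identifier such as "1EXK:A|PDBID|CHAIN|SEQUENCE" is not treated as a Swiss-Prot identifier by this sketch.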

Examples of valid header lines

>sp|P08692|ARC1_ECOLI Arsenate reductase (EC 1.20.4.1) (Arsenical pump modifier) - Escherichia coli.

or

>ARC1_ECOLI|P08692|Arsenate reductase

or

>P08622

Example of a valid sequence file

>sp|P08692|ARC1_ECOLI Arsenate reductase (EC 1.20.4.1) (Arsenical pump modifier) - Escherichia coli.
MSNITIYHNPACGTSRNTLEMIRNSGTEPTIILYLENPPSRDELVKLIADMGISVRALLR
KNVEPYEQLGLAEDKFTDDQLIDFMLQHPILINRPIVVTPLGTRLCRPSEVVLDILQDAQ
KGAFTKEDGEKVVDEAGKRLK
>sp|P52147|ARC2_ECOLI Arsenate reductase (EC 1.20.4.1) (Arsenical pump modifier) - Escherichia coli.
MSNITIYHNPHCGTSRNTLEMIRNSGIEPTVILYLETPPSRDELLKLIADMGISVRALLR
KNVEPYEELGLAEDKFTDDQLIDFMLQHPILINRPIVVTPLGTKLCRPSEVVLDILPDAQ
KAAFTKEDGEKVVDDSGKRLK

When a user knows that the query sequence has a 3D structure in the PDB database, the PDB code (e.g. 1exk, or 1exk:A for chain A) can be used as the identifier. The following examples are acceptable.

Examples of the header line for a PDB sequence

>1EXK:A|PDBID|CHAIN|SEQUENCE

or

>1EXK

or

>DNAJ|1EXK:A|THE CYSTEINE-RICH DOMAIN OF THE ESCHERICHIA COLI CHAPERONE PROTEIN DNAJ

Example of a valid sequence file for a PDB sequence
(The following FASTA example was obtained from the RCSB PDB website.)

>1EXK:A|PDBID|CHAIN|SEQUENCE
GVTKEIRIPTLEECDVCHGSGAKPGTQPQTCPTCHGSGQVQMRQGFFAVQQTCPHCQGRG
TLIKDPCNKCHGHGRVERS

Note. If there are no characters after the greater-than symbol (">") in the header, the program automatically assigns a unique sequential number as the sequence identifier.


Parameters of MAGIIC-PRO

The minimum support : The minimum percentage of sequences that match the derived pattern. If you select this option and set minimum support = 60, MAGIIC-PRO will derive patterns that match at least 60% of the input sequences.

A pattern with a high support usually highlights the most highly conserved residues of a functional region, while a large-size but low-support pattern generally provides a complete signature of a functional site.

The most important parameter of MAGIIC-PRO is the minimum support constraint, and setting it is a subtle task. A pattern is reported as long as it is supported by at least the specified percentage of sequences. Too small a value may generate thousands of patterns and require a huge amount of computation, whereas too large a value may yield no patterns at all. The support constraint is critical to the mining results because it is usually not possible to know in advance in what percentage of the training data a satisfactory pattern can be found.

To help find an appropriate minimum support, MAGIIC-PRO allows the minimum support to be decreased dynamically. The user starts a pilot search with a large value for the support constraint, say 90%, and the support constraint is decreased iteratively by a certain value, say 10%, until at least a certain number of patterns have been found. At this stage, we usually observe that most patterns with the maximum support are related to a functional site of the query protein but do not serve as a complete signature of that site. To find patterns that involve more conserved blocks, we suggest that users continue decreasing the minimum support constraint by a certain value, say 5% or 10%.
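The dynamic decreasing of the minimum support can be pictured as a simple loop. The sketch below is illustrative only: pilot_search and its parameters are hypothetical names, and the mine argument stands for any function that returns the patterns found at a given support percentage, since the server's mining routine is not exposed as a programming interface on this page.

```python
def pilot_search(mine, start=90, step=10, floor=10, wanted=1):
    """Lower the minimum support until at least `wanted` patterns are found.

    mine   -- hypothetical callable: support percentage -> list of patterns
    start  -- initial minimum support in percent (90 in the text's example)
    step   -- decrement applied after an unsuccessful round (10 in the example)
    floor  -- assumed lowest support worth trying
    """
    support = start
    while support >= floor:
        patterns = mine(support)
        if len(patterns) >= wanted:
            return support, patterns
        support -= step
    return None, []
```

After this loop returns, the search can be continued manually with a smaller step, such as 5%, to look for patterns that involve more blocks.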

In addition to the support constraint, MAGIIC-PRO has some other parameters for advanced users. Before going into the details, we first give a formal definition of a pattern and its blocks. An exact pattern is a sequence of more than one pattern element, in which each successive pair of pattern elements is separated by a rigid or flexible gap. The notation x(a,b), a < b, is used for a flexible gap with a minimum length of a and a maximum length of b, and x(a) stands for a rigid gap with a fixed length of a. The wildcard x(a) is omitted if a = 0, and is written as x if a = 1, i.e. x = x(1). Small and rigid gaps are considered intra-block gaps, while large and flexible gaps are treated as inter-block gaps.
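The gap notation can be made concrete with a small tokenizer. This is a minimal sketch of the notation defined above, and parse_pattern is a hypothetical helper name rather than part of MAGIIC-PRO.

```python
import re

# x        -> rigid gap of length 1
# x(a)     -> rigid gap of length a
# x(a,b)   -> flexible gap of length a..b
_GAP = re.compile(r"^x(?:\((\d+)(?:,(\d+))?\))?$")

def parse_pattern(pattern):
    """Return tokens: ('elem', residue) or ('gap', min_length, max_length)."""
    tokens = []
    for part in pattern.strip("{}").split("-"):
        m = _GAP.match(part)
        if m:
            low = int(m.group(1)) if m.group(1) else 1
            high = int(m.group(2)) if m.group(2) else low
            tokens.append(("gap", low, high))
        else:
            tokens.append(("elem", part))
    return tokens
```

A rigid gap comes out with equal minimum and maximum lengths, while a flexible gap keeps both bounds. Adjacent elements with no wildcard between them simply appear as consecutive 'elem' tokens, which corresponds to a gap of length 0.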

The first group of advanced parameters specifies the gap constraints.

The maximum length of an intra-block gap (default value = 3)

Using this option, you can set the maximum length of a rigid wildcard gap between two elements in one motif block (from the position of one element to the position of the next element). The minimum length of an intra-block gap is 0. An example follows:

The pattern {I-x-H-N-P-x-C-x(30,68)-E-x(2)-L-x-K-x-I} has 2 motif blocks, {I-x-H-N-P-x-C} and {E-x(2)-L-x-K-x-I}. The longest rigid intra-block gap across the two blocks is 2 (the x(2) in the second block).

Note. Currently the server requires the maximum length of an intra-block gap to be less than the minimum length of an inter-block gap.

The minimum length of an inter-block gap (default value = the maximum length of an intra-block gap + 1)

Using this option, you can set the lower bound on the length of the wildcard gap between two motif blocks of the pattern (from the end of one motif block to the beginning of the next motif block).

The pattern {I-x-H-N-P-x-C-x(30,68)-E-x(2)-L-x-K-x-I} has 2 motif blocks. The lower bound of its inter-block gap is 30.

Note. Currently the server requires the minimum length of an inter-block gap to be greater than the maximum length of an intra-block gap.

The maximum relative flexibility of an inter-block gap with respect to the length of the inter-block gap present in the query protein (default value = 30%)

Let l be the length of the inter-block gap in the query protein that connects two successive pattern blocks. The lower and upper bounds of the inter-block gap are defined as (1 - fmax)*l and (1 + fmax)*l respectively, where fmax is called the relative flexibility constraint.
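The two bounds can be computed directly from the formula above. The function name below is hypothetical, and the sketch returns unrounded values because this page does not state how the server rounds the bounds to whole residues.

```python
def inter_block_gap_bounds(query_gap_length, f_max=0.30):
    """Return (lower, upper) bounds of an inter-block gap.

    query_gap_length -- l, the gap length observed in the query protein
    f_max            -- relative flexibility constraint (default 30%)
    """
    lower = (1 - f_max) * query_gap_length
    upper = (1 + f_max) * query_gap_length
    return lower, upper
```

With the default flexibility of 30%, a gap of 50 residues in the query protein gives bounds of about 35 and 65 residues.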

The second group of advanced parameters specifies the size constraints.

The minimum size of a block (default value = 3)

The minimum number of elements in one motif block. The pattern {I-x-H-N-P-x-C-x(30,68)-E-x(2)-L-x-K-x-I} has 2 motif blocks. The first motif block {I-x-H-N-P-x-C} has 5 elements (I, H, N, P, and C), and the second block {E-x(2)-L-x-K-x-I} has 4 elements (E, L, K, and I).

Note. Currently the server requires the minimum number of elements in one motif block to be equal to or greater than 3.

The minimum number of blocks in a pattern (default value = 2)

The minimum number of blocks in the derived pattern. The pattern {I-x-H-N-P-x-C-x(30,68)-E-x(2)-L-x-K-x-I} has 9 elements and 2 blocks.
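The block and element counts used by the size constraints can be reproduced with a few lines of code. This sketch assumes that every flexible gap x(a,b) marks a block boundary and that every rigid gap lies inside a block; the helper names are hypothetical.

```python
import re

def split_blocks(pattern):
    """Split a pattern into motif blocks at each flexible gap x(a,b)."""
    return re.split(r"-x\(\d+,\d+\)-", pattern.strip("{}"))

def block_elements(block):
    """Return the non-wildcard elements of one motif block."""
    return [part for part in block.split("-") if not part.startswith("x")]

def satisfies_size_constraints(pattern, min_block_size=3, min_blocks=2):
    """Check the minimum block size and the minimum number of blocks."""
    blocks = split_blocks(pattern)
    if len(blocks) < min_blocks:
        return False
    return all(len(block_elements(b)) >= min_block_size for b in blocks)
```

For the example pattern, the split gives the two blocks named in the text, with 5 and 4 elements, so the pattern passes the default constraints.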

We argue that a pattern should have at least two blocks to be meaningful, because an important region is seldom conserved on its own, from either a structural or a functional point of view. In this way, a large number of noisy patterns can be removed and users can be guided directly to the important discoveries. We suggest that users do not change the default values of these parameters in their first search.


Output

The conservation plot and pattern snapshot:

After the mining process finishes, users can first take a look at the conservation plot and pattern snapshot. As shown in Figure 1(a), the locations of the conserved regions are summarized in the complete conservation plot derived from all the patterns. It can be observed in Figure 1(a) that there are nine conserved regions in the query protein. On the same web page, users are provided with an interactive interface to collect patterns of interest in a pattern snapshot. Unlike the conservation plot, a pattern snapshot also tells which pattern blocks are simultaneously conserved during evolution. Users are advised to browse the lists of the top ten high-support and top ten large-size patterns. The size of a pattern is defined as the number of exact components it contains. A pattern with a high support usually highlights the most highly conserved residues of a functional region, while a large-size but low-support pattern generally provides a complete signature of a functional site.

Users can use the interactive snapshot to select patterns, move the mouse over any bar in the conservation plot to show a brief description of the associated residue, or click the bar for detailed information.

For more information about facilities of MAGIIC-PRO see [The facilities of MAGIIC-PRO].


(a) Conservation plot


(b) Top ten high-support patterns with three or more blocks


(c) Top ten large-size patterns with three or more blocks

Fig. 1. Example of pattern snapshots and the conservation plot of all the patterns derived by MAGIIC-PRO in one mining process.

The format of outputted pattern:

ID   Sup.  Hits  Size  #Blocks   Posite Patterns
1.   173   180    6      2       217G-x-G-x(2)-P222-x(64,111)-316Y-x-C-G319

 ...

Pattern ID : 1. The ID of the derived pattern.

Support of the pattern : This pattern matches 173 different input sequences

Hits of the pattern : This pattern has 180 hits (occurrences) in the 173 supporting sequences

Size of the pattern : This pattern has 6 elements

Number of blocks in the pattern : This pattern has 2 blocks

Discrete Pattern (regular expression):

G-x-G-x(2)-P-x(64,111)-Y-x-C-G
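To show how the support and hit counts relate to the discrete pattern, the sketch below translates the PROSITE-style notation into a Python regular expression and counts matches. It is an illustration under stated assumptions: the function names are hypothetical, only the notation used on this page is handled (single residues, x, x(a), x(a,b)), and hits are counted as non-overlapping matches, which may differ from how MAGIIC-PRO counts them.

```python
import re

def prosite_to_regex(pattern):
    """Translate e.g. 'G-x-G-x(2)-P' into the compiled regex 'G.G.{2}P'."""
    parts = []
    for token in pattern.split("-"):
        m = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", token)
        if m is None:
            parts.append(re.escape(token))          # a residue element
        elif m.group(1) is None:
            parts.append(".")                       # x
        elif m.group(2) is None:
            parts.append(".{%s}" % m.group(1))      # x(a)
        else:
            parts.append(".{%s,%s}" % (m.group(1), m.group(2)))  # x(a,b)
    return re.compile("".join(parts))

def support_and_hits(pattern, sequences):
    """Return (number of matching sequences, total number of matches)."""
    regex = prosite_to_regex(pattern)
    counts = [len(regex.findall(seq)) for seq in sequences]
    support = sum(1 for c in counts if c > 0)
    return support, sum(counts)
```

A sequence that contains the pattern more than once contributes one unit to the support but several to the hits, which is why the number of hits can exceed the support.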

Pattern Elements : Highlighted in coffee color. This pattern has 6 elements (G, G, P, Y, C, G) and 2 blocks.

Intra-block gap : Highlighted in green. This pattern has 4 intra-block gaps; the minimum/maximum intra-block gap is 0/2.

Inter-block gap : Highlighted in blue. This pattern has 1 inter-block gap, x(64,111), between the first block G-x-G-x(2)-P and the second block Y-x-C-G.

Motif blocks : The pattern has 2 motif blocks, {G-x-G-x(2)-P} and {Y-x-C-G}, separated by one inter-block gap.

Block positions : The subscripts highlighted in gray denote the starting position (placed to the left of the block) and the ending position (placed to the right of the block) of a conserved block according to its location in the query protein.
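The position-annotated form shown in the output table can be split into blocks with their start and end positions. The sketch below is based only on the single example line on this page, so the exact text layout of the server's output is an assumption, and both function names are hypothetical.

```python
import re

def parse_positioned_pattern(annotated):
    """Split '217G-x-G-x(2)-P222-x(64,111)-316Y-x-C-G319' into blocks.

    Returns a list of (start, block, end) tuples, one per motif block.
    """
    blocks = []
    for chunk in re.split(r"-x\(\d+,\d+\)-", annotated):
        m = re.fullmatch(r"(\d+)(\D.*\D)(\d+)", chunk)
        if m is None:
            raise ValueError("unexpected block format: %r" % chunk)
        blocks.append((int(m.group(1)), m.group(2), int(m.group(3))))
    return blocks

def block_span(block):
    """Number of residues covered by a block, counting rigid gaps."""
    total = 0
    for token in block.split("-"):
        m = re.fullmatch(r"x(?:\((\d+)\))?", token)
        if m is None:
            total += 1
        else:
            total += int(m.group(1)) if m.group(1) else 1
    return total
```

For the example, the first block spans residues 217 to 222 and the second spans 316 to 319. Each range agrees with the number of residues the block covers, and the 93 residues between the two blocks fall inside the inter-block gap x(64,111).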


 

The Facilities of MAGIIC-PRO

After the mining process finishes, MAGIIC-PRO generates a pattern snapshot that maps all the derived patterns on the query protein. The residues in each pattern are collected to create a conservation plot, and the conservation level of a residue is determined by the percentage of total supporting proteins merged from different patterns. Conservation plot provides a whole picture about the conserved residues with respect to the query protein.
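One possible reading of the conservation level is sketched below: for each residue position of the query protein, the supporting proteins of all patterns that contain the position are merged, and the size of the merged set is expressed as a percentage of the input sequences. This interpretation, the input layout, and the function name are assumptions made for illustration; the server's exact computation is not given on this page.

```python
def conservation_levels(patterns, total_sequences):
    """Compute a conservation percentage per query position.

    patterns        -- list of (positions, supporting_ids) pairs, where
                       positions are query residue numbers in the pattern
                       and supporting_ids identify the matching sequences
    total_sequences -- number of input sequences
    """
    supporters = {}
    for positions, supporting_ids in patterns:
        for pos in positions:
            # Merge the supporters of every pattern covering this position.
            supporters.setdefault(pos, set()).update(supporting_ids)
    return {
        pos: 100.0 * len(ids) / total_sequences
        for pos, ids in supporters.items()
    }
```

Under this reading, a residue that appears in several patterns with different supporting sequences receives a higher level than a residue that appears in only one of them.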

Listing top-10 frequent and maximum size patterns in PROSITE language. Users are advised to browse the top ten high-support patterns and the top ten longest patterns. In our experience, patterns with structural or functional meaning can usually be found in one of these two lists. Patterns with high support generally reveal the most highly conserved regions of a functional region, and long patterns usually define a complete signature of a functional site.

Listing all the discovered patterns in PROSITE language.

The complete results for printing or file saving.

To facilitate studying the patterns found by MAGIIC-PRO, we provide five useful links for each pattern. The first one highlights the locations of the pattern in its supporting sequences. Second, the derived pattern can be plotted with a three-dimensional protein structure if there are PDB entries available for any of the supporting sequences. Third, the derived pattern can be fed to the ScanProsite web service to check its selectivity, the ability to reject false positive matches. Fourth, the users can perform a multiple sequence alignment on the segments of supporting sequences that are associated with the selected pattern. This helps the user to construct a more generalized pattern with amino acid substitutions considered. Fifth, MAGIIC-PRO aligns each excluded sequence with the segment of the query protein. This helps to tell why a particular sequence does not match the pattern.

Links Associated with Each Pattern : To facilitate studying the patterns found by MAGIIC-PRO, we provide five useful links for evaluating the quality of each pattern.

First, the users can examine the occurrences of the pattern in its supporting sequences.

Second, the derived pattern can be fed to the ScanProsite web service (http://www.expasy.org/tools/scanprosite/) to check its selectivity, the ability to reject false positive matches.

Third, the derived pattern can be inspected on a three-dimensional protein structure if there are PDB entries available for any of the supporting sequences.

Fourth, users can perform a multiple sequence alignment on the segments of the supporting sequences that are associated with the selected pattern. This helps the user to construct a more generalized pattern in which amino acid substitutions are considered.

Fifth, MAGIIC-PRO aligns each non-supporting sequence with the segment of the query protein that corresponds to the derived pattern. This facility tells the user why these sequences cannot be matched by the pattern.

* The results will be kept on the server for one month. The results page, if bookmarked, can be reached during this period.


The Closure Checking Scheme

Sequential pattern mining often generates an exponential number of frequent patterns that are subsequences of a long pattern, which is prohibitively expensive in both time and space. A solution to this problem, called mining frequent closed sequential patterns, was proposed by [Yan et al. 2003, Wang et al. 2004]. A pattern P is support-closed if there exists no super-pattern of P with the same number of supporting sequences in the database. MAGIIC-PRO employs a closure checking scheme based on the BIDE technique [Wang et al. 2004], extended to consider bounded-size gaps. Considering closed patterns that span large wildcard regions is critical to identifying functional sites of proteins, because it refines the mining results by focusing on the co-occurrence of the conserved regions. The scheme aims to eliminate, more aggressively and during the mining process itself, patterns that are covered by super-patterns with the same number of occurrences. [see the page for details]
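The definition of a support-closed pattern can be illustrated with a naive filter applied after mining. This is not the BIDE algorithm and not MAGIIC-PRO's closure checking scheme, which works during mining and takes gap constraints into account. The sketch ignores gaps entirely, represents a pattern as a tuple of elements, and uses hypothetical function names.

```python
def is_subpattern(small, large):
    """True if `small` occurs in `large` as a subsequence of elements."""
    remaining = iter(large)
    # `elem in remaining` advances the iterator, so order is respected.
    return all(elem in remaining for elem in small)

def closed_patterns(supports):
    """Keep patterns that have no longer super-pattern with equal support.

    supports -- dict mapping a pattern (tuple of elements) to its support
    """
    closed = []
    for pattern, support in supports.items():
        covered = any(
            len(other) > len(pattern)
            and other_support == support
            and is_subpattern(pattern, other)
            for other, other_support in supports.items()
        )
        if not covered:
            closed.append(pattern)
    return closed
```

A shorter pattern is dropped only when a longer pattern containing it has exactly the same support. A shorter pattern with a higher support is kept, because it carries information that the longer one does not.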

Yan, X., Han, J. and Afshar, R. (2003) CloSpan: Mining Closed Sequential Patterns in Large Databases. In Proc. of the 3rd SIAM Intl. Conf. on Data Mining (SDM'03).

Wang, J. and Han, J. (2004) BIDE: Efficient Mining of Frequent Closed Sequences. In Proc. of the 20th Intl. Conf. on Data Engineering (ICDE'04), 79-90.


Experiments and Examples

We provide a number of experiments and examples to demonstrate the capability of MAGIIC-PRO in identifying long, flexible patterns, and use several case studies to show its potential for identifying functional sites and predicting hot regions in protein-protein interactions directly from protein sequences. All the experiments were conducted on a machine with a 3.4 GHz Intel Pentium 4 CPU and 2 GB of memory, running Linux Fedora 4 Server.

More results can be found in the web page : http://biominer.cse.yzu.edu.tw/magiicpro/Examples/index.html


Mirror Site

1. http://biominer.bime.ntu.edu.tw/magiicpro at Department of Bio-Industrial Mechatronics Engineering, National Taiwan University. (Main Site)

2. http://biominer.cse.yzu.edu.tw/magiicpro at Department of Computer Science and Engineering, Yuan-Ze University.




Feedback and descriptions of applications are always welcome.
Contact cmhsu@saturn.yzu.edu.tw to report bugs or ask questions about MAGIIC-PRO.