MAGIIC-PRO employs a novel combination of intra- and inter-block gap constraints to discover long functional motifs that are interleaved with several large, irregular gaps. The smaller, rigid intra-block gap constraint relaxes the restriction within local motif blocks while keeping them compact, and the larger, flexible inter-block gap constraint allows longer irregular gaps between compact motif blocks. Using two types of gap constraints for different purposes improves the efficiency of the mining process while maintaining the accuracy of the mining results. This efficiency also helps identify functional motifs that are conserved in only a small subset of the input sequences, which is important because some highly specific signatures are conserved in only a few members of a protein family.

Why use MAGIIC-PRO?

MAGIIC-PRO can find long patterns efficiently.

What are long patterns?

We call a pattern long if its first and last elements are far apart when the pattern occurs in a protein sequence. Take the pattern N-R-N-x(101,129)-Y-x(3)-G-x(3)-D as an example: up to 139 residues may lie between the first residue N and the last residue D.
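The figure of 139 can be checked by summing the positions between the first and last elements. The following is a minimal sketch, assuming the pattern syntax shown on this page (elements separated by "-", gaps written as x, x(n) or x(m,n)); the function name span_bounds is hypothetical and not part of MAGIIC-PRO.

```python
import re

# Matches a gap element: "x", "x(n)" or "x(m,n)"
GAP = re.compile(r"x(?:\((\d+)(?:,(\d+))?\))?")

def span_bounds(pattern):
    """Return (min, max) number of residues strictly between
    the first and the last element of the pattern."""
    elements = pattern.split("-")
    lo = hi = 0
    for elem in elements[1:-1]:
        m = GAP.fullmatch(elem)
        if m is None:
            # A conserved residue occupies exactly one position
            lo += 1
            hi += 1
        else:
            a = int(m.group(1)) if m.group(1) else 1
            b = int(m.group(2)) if m.group(2) else a
            lo += a
            hi += b
    return lo, hi

print(span_bounds("N-R-N-x(101,129)-Y-x(3)-G-x(3)-D"))  # (111, 139)
```

The upper bound is 1 + 1 + 129 + 1 + 3 + 1 + 3 = 139, which matches the value quoted above.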

Are long patterns meaningful?

Yes. The following examples show that even though these residues are far apart in the primary sequence, they cluster into substructures that constitute functional sites when the protein folds.

P-x-C-x(3)-R-x(77,78)-R-P-I


(PDB code: 1i9d)


L-G-x-I-x(69,76)-L-x(2)-G-x-G-x(87,97)-P-G-x(3)-T

(PDB code: 1b14)

I-x(2)-H-x-H-x-D-x(2)-G-G-x(48,49)-Y-x-G-x-G-H-x(2)-D-x(2)-V-V-W-x-P-x(4)-L-x-G-G-C-x(2)-K-x(31,34)-V-V-x(2)-H

(PDB code: 1a7t)

Why large flexible gaps?

For a set of proteins that share a common function or structure, often only a few residues are conserved across all of them. The conserved residues in long patterns must therefore be interleaved with large gaps. The gap lengths are usually irregular because insertions and deletions may occur during evolution.

Why employ two types of gap constraints?

Allowing large flexible gaps everywhere might result in patterns whose elements are widely scattered. In protein sequences, conserved residues usually appear in clusters, and multiple clusters together contribute to an important substructure. In MAGIIC-PRO, the smaller, rigid intra-block gap constraint relaxes the restriction within local motif blocks while keeping them compact, and the larger, flexible inter-block gap constraint allows longer irregular gaps between compact motif blocks.
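To make the two gap types concrete, the sketch below splits a pattern into compact blocks and the inter-block gaps that separate them. It is only an illustration: the function split_blocks, the parameter max_intra_gap, and the rule used here (a gap counts as inter-block if it is flexible or longer than max_intra_gap) are assumptions made for this example, not MAGIIC-PRO's actual parameters or implementation.

```python
import re

# Matches a gap element: "x", "x(n)" or "x(m,n)"
GAP = re.compile(r"x(?:\((\d+)(?:,(\d+))?\))?")

def split_blocks(pattern, max_intra_gap=5):
    """Split a pattern into compact blocks and inter-block gaps.

    Assumption for this sketch: a gap is an inter-block gap if it is
    flexible (x(m,n) with m != n) or longer than max_intra_gap.
    """
    blocks, inter_gaps, current = [], [], []
    for elem in pattern.split("-"):
        m = GAP.fullmatch(elem)
        if m:
            lo = int(m.group(1)) if m.group(1) else 1
            hi = int(m.group(2)) if m.group(2) else lo
            if lo != hi or hi > max_intra_gap:
                # Close the current block and record the separating gap
                blocks.append("-".join(current))
                inter_gaps.append(elem)
                current = []
                continue
        # Residues and small rigid gaps stay inside the block
        current.append(elem)
    blocks.append("-".join(current))
    return blocks, inter_gaps

blocks, gaps = split_blocks("L-G-x-I-x(69,76)-L-x(2)-G-x-G-x(87,97)-P-G-x(3)-T")
print(blocks)  # ['L-G-x-I', 'L-x(2)-G-x-G', 'P-G-x(3)-T']
print(gaps)    # ['x(69,76)', 'x(87,97)']
```

Under these assumptions the 1b14 example decomposes into three compact blocks joined by two large flexible gaps, which is the structure the two constraints are meant to capture.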

Why use exact patterns?

We say a pattern element is exact if it permits only one amino acid type. MAGIIC-PRO considers only exact patterns during discovery, which largely refines the mining results. After the mining process terminates, users can run Clustal-W on the segments associated with the patterns to derive a more generalized pattern that takes substitutions into account.

C-x(2)-C-x-G-x-G-x(12,19)-C-x(2)-C-x-G
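The generalization step can be illustrated with a small sketch that turns already-aligned segments into a pattern with substitution sets. This is not Clustal-W and not the procedure used by MAGIIC-PRO; the function generalize, the parameter max_alternatives, and the toy segments are all hypothetical and serve only to show the idea of allowing a few residue types per position.

```python
def generalize(aligned_segments, max_alternatives=2):
    """Derive a pattern with substitution sets from aligned segments.

    Assumes the segments are already aligned and of equal length.
    Columns with more than max_alternatives residue types, or with
    an alignment gap character '-', become wildcards (a simplification).
    """
    elements = []
    wildcard_run = 0

    def flush_wildcards():
        # Emit the pending run of wildcard columns as "x" or "x(n)"
        nonlocal wildcard_run
        if wildcard_run == 1:
            elements.append("x")
        elif wildcard_run > 1:
            elements.append(f"x({wildcard_run})")
        wildcard_run = 0

    for column in zip(*aligned_segments):
        residues = sorted(set(column))
        if "-" in residues or len(residues) > max_alternatives:
            wildcard_run += 1
            continue
        flush_wildcards()
        if len(residues) == 1:
            elements.append(residues[0])
        else:
            elements.append("[" + "".join(residues) + "]")
    flush_wildcards()
    return "-".join(elements)

# Toy segments invented for illustration only
segments = ["CPTCHGSG", "CKTCNGQG", "CSRCQGKG"]
print(generalize(segments))  # C-x-[RT]-C-x-G-x-G
```

In this toy case the third column admits two residue types, so the exact element is replaced by the set [RT], while fully variable columns remain wildcards.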


Why low support?

Low support is desirable during the mining process because some highly specific signatures are conserved in only a few members of a protein family. However, it may not be possible to know in advance which proteins share exactly the same functional signature. By setting the minimum support constraint to a lower value, MAGIIC-PRO can discover patterns that are genuine structural signatures but are conserved in only a subset of the input sequences. Such patterns are more informative and useful for predicting ligand binding or protein interactions.
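As a rough illustration of what support means here, the sketch below converts a pattern into a regular expression and measures the fraction of input sequences that contain a match. The helper names pattern_to_regex and support, the threshold min_support, and the toy sequences are assumptions for this example and do not reflect how MAGIIC-PRO computes support internally.

```python
import re

# Matches a gap element: "x", "x(n)" or "x(m,n)"
GAP = re.compile(r"x(?:\((\d+)(?:,(\d+))?\))?")

def pattern_to_regex(pattern):
    """Translate a pattern such as C-x(2)-C-x-G into a compiled regex."""
    parts = []
    for elem in pattern.split("-"):
        m = GAP.fullmatch(elem)
        if m is None:
            # A single residue or a [..] set is already valid regex syntax
            parts.append(elem)
        elif m.group(1) is None:
            parts.append(".")
        elif m.group(2) is None:
            parts.append(".{%s}" % m.group(1))
        else:
            parts.append(".{%s,%s}" % (m.group(1), m.group(2)))
    return re.compile("".join(parts))

def support(pattern, sequences):
    """Fraction of sequences that contain at least one match."""
    regex = pattern_to_regex(pattern)
    hits = sum(1 for seq in sequences if regex.search(seq))
    return hits / len(sequences)

# Toy sequences invented for illustration only
sequences = ["AACKKCAGA", "CAACTG", "MMMM", "CC"]
min_support = 0.4  # hypothetical threshold
s = support("C-x(2)-C-x-G", sequences)
print(s, s >= min_support)  # 0.5 True
```

With a threshold of 0.4 the pattern is kept even though only half of the toy sequences contain it; a stricter threshold would discard it.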

Pitfalls of diagnostic patterns

A pattern is said to be diagnostic for a family if it matches all the known sequences in the family and no other known sequence. However, a diagnostic pattern does not always correspond to a functional signature. For example, the pattern PS00637 (CXXCXGXG dnaJ domain signature) does not capture the signature of the zinc binding site:

C-[DEGSTHKR]-x-C-x-G-x-[GK]-[AGSDM]-x(2)-[GSNKR]-x(4,6)-C-x(2,3)-C-x-G-x-G

In contrast, the maximum-support pattern found by MAGIIC-PRO captures the structural signature of the second zinc binding site:

C-x(2)-C-x-G-x-G-x(12,19)-C-x(2)-C-x-G

A longer pattern found by MAGIIC-PRO, with lower support, additionally detects the signature of the first zinc binding site:

C-x(2)-C-x-G-x-G-x(8,10)-C-x(2)-C-x-G-x-G-x(10,18)-C-x(2)-C-x-G-x-G-x(5,8)-C-x(2)-C-x-G-x-G

The literature notes that the second binding site is more important than the first. Our mining results are consistent with this, since the longer pattern, which also covers the first binding site, has lower support than the pattern covering only the second site.

Comparison with related methods for discovering sequence patterns:

1.     Algorithms that consider only rigid gaps (e.g. Teiresias): These algorithms can find each conserved region separately. However, they usually generate a large number of spurious short patterns, and information about the co-occurrence of these blocks is lost.

2.     Algorithms that consider flexible gaps (e.g. Pratt): Algorithms that allow flexible gaps anywhere in a pattern may derive patterns with no explicit conserved regions. Furthermore, our previous work [Hsu et al. 2006] demonstrated that allowing large flexibility indiscriminately can prevent satisfactory results from being delivered within an acceptable time.

(Figure panels: MAGIIC-PRO, Teiresias, Pratt)