(1) Statistics: The calculated regular expressions are searched against all eukaryotic and bacterial sequences in the SwissProt database. For regular expressions with more than 15 positions, statistics are switched off automatically because the search becomes too time-consuming.
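As a rough illustration of what such a search involves, the following sketch counts regular expression hits in a small set of sequences. The function name `count_hits`, the pattern and the three toy sequences are invented for this example and are not taken from SwissProt or from RegExpMaker itself; Python's `re` module is used as a stand-in for a POSIX regular expression engine.

```python
import re

def count_hits(pattern, sequences):
    """Return (number of sequences with a hit, total number of hits)."""
    regex = re.compile(pattern)
    # findall returns the non-overlapping matches in each sequence
    hits_per_sequence = {name: len(regex.findall(seq))
                         for name, seq in sequences.items()}
    sequences_hit = sum(1 for n in hits_per_sequence.values() if n > 0)
    total_hits = sum(hits_per_sequence.values())
    return sequences_hit, total_hits

# Made-up sequences, for illustration only
toy_sequences = {
    "seqA": "MKTAYIAKQR",
    "seqB": "GGSPLKRRAA",
    "seqC": "MSTNPKPQRK",
}

print(count_hits("[KR][KR]", toy_sequences))  # (2, 2)
```

Against a full database, every sequence has to be scanned for every expression, which is one reason such statistics can become slow.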
Many important functional aspects of proteins can be attributed to short linear motifs, including post-translational modification sites and protein-protein interaction peptides (see ELM - http://elm.eu.org). Regular expressions can describe such motifs and can be used to detect or predict them. However, because linear motifs are so short, the regular expressions describing them will often massively overpredict, and information beyond sequence is needed to improve predictions (ELM).
RegExpMaker takes a multiple alignment (clustal format) and uses the amino acid groups adopted from WR Taylor, J Theor Biol. 1986 Mar 21;119(2):205-18 (see figure below). If the amino acids in a position of the alignment fall within one of the predefined groups, they are represented as a list of the observed residues within brackets (e.g. [VIL]). If the amino acids fall within more than one of the predefined groups, the position is represented in Regular Expression 1 by a '.', implying that all amino acids are allowed in this position. In Regular Expression 2, if it is possible to identify groups that have no amino acid present in that position, the position is instead represented by the amino acids that are not allowed (e.g. [^DE]).
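The rules above can be sketched in Python as follows. This is a minimal illustration, not RegExpMaker's actual code: the four groups in `GROUPS` are a simplified, hypothetical stand-in for the Taylor groups, the function names are invented, a fully conserved position is assumed to be written as a single letter, and the alignment is assumed to contain no gaps.

```python
# Hypothetical, simplified residue groups for illustration only;
# the groups RegExpMaker actually uses are those adopted from Taylor (1986).
GROUPS = {
    "aliphatic": set("ILV"),
    "aromatic": set("FWY"),
    "negative": set("DE"),
    "positive": set("HKR"),
}

def column_pattern(column, use_negation=False):
    """Turn one alignment column into a regular expression token."""
    observed = set(column)
    # Assumption: a fully conserved position is written as a single letter
    if len(observed) == 1:
        return next(iter(observed))
    # All observed residues fit in one group: list them in brackets
    if any(observed <= group for group in GROUPS.values()):
        return "[" + "".join(sorted(observed)) + "]"
    # Regular Expression 2: exclude every group with no residue observed here
    if use_negation:
        excluded = set()
        for group in GROUPS.values():
            if not group & observed:
                excluded |= group
        if excluded:
            return "[^" + "".join(sorted(excluded)) + "]"
    # Regular Expression 1 (or nothing to exclude): any residue allowed
    return "."

def alignment_to_regex(alignment, use_negation=False):
    """Build a regular expression from equal-length, gap-free aligned sequences."""
    return "".join(column_pattern(col, use_negation) for col in zip(*alignment))

alignment = ["VDKV", "IEKD", "LDRV"]
print(alignment_to_regex(alignment))                     # [ILV][DE][KR].
print(alignment_to_regex(alignment, use_negation=True))  # [ILV][DE][KR][^FHKRWY]
```

In the last column, V and D belong to different groups, so the first expression falls back to '.', while the second excludes the aromatic and positive groups, neither of which is observed in that column.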
Additionally, a gap in the alignment is interpreted as meaning that the position is optional. The user can also specify whether the first or last position of the submitted alignment is N- or C-terminal, respectively, and this information is built into the resulting regular expression.
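A minimal sketch of how gaps and terminal positions might be encoded is given below. It assumes that an optional position is written with a trailing `?` and that terminal positions are written with the anchors `^` and `$`; the function `build_regex` and its parameters are invented for this example, and each position is simply listed by its observed residues, without the amino acid grouping step.

```python
import re

def build_regex(alignment, n_terminal=False, c_terminal=False):
    """Build a regular expression in which gapped columns are optional."""
    tokens = []
    for column in zip(*alignment):
        observed = sorted(set(column) - {"-"})
        if not observed:
            continue  # column consisting only of gaps
        if len(observed) == 1:
            token = observed[0]
        else:
            token = "[" + "".join(observed) + "]"
        # A gap in any sequence marks the position as optional
        if "-" in column:
            token += "?"
        tokens.append(token)
    pattern = "".join(tokens)
    if n_terminal:
        pattern = "^" + pattern   # first position is the N-terminus
    if c_terminal:
        pattern = pattern + "$"   # last position is the C-terminus
    return pattern

pattern = build_regex(["MK-L", "MRAL"], n_terminal=True)
print(pattern)                                   # ^M[KR]A?L
print(bool(re.search(pattern, "MKLQQ")))         # True
print(bool(re.search(pattern, "AMKLQQ")))        # False
```

The second search fails because the anchor only allows the motif at the very start of the sequence.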
RegExpMaker uses the POSIX syntax for regular expressions (for details on syntax and examples: ELM), in contrast to Prosite, where a slightly different syntax is used (more information here).
Pål Puntervoll (C) 2003