The Bio-Web: Resources for Molecular and Cell Biologists

Sequence Manipulation Suite: Search patterns
  • The search patterns used by the Sequence Manipulation Suite are not case sensitive. The following is a simple search pattern that will find all occurrences of the sequence fragment GGAT (and ggat):

    ggat

    The above will match GGAT but not GGAA.

  • Sequences containing residues that vary at a particular position can be matched using square brackets. The following pattern will find all occurrences of GGAT, GGAC, and GGAA:

    gga[tca]

    The above will match GGAT but not GGAG.

  • To represent a completely variable residue in a pattern, use the . character. The following pattern will find all occurrences of GCA followed by any single residue, followed by TTT:

    gca.ttt

    The above will match GCAATTT but not GCAAATTT.

  • To indicate that a residue can be repeated one or more times in a sequence, use the + character. The following pattern will find all occurrences of MVV followed by one or more R residues:

    MVVR+

    The above will match MVVRR but not MVVDR.

  • To indicate that a residue can be repeated zero or more times in a sequence, use the * character. The following pattern will find all occurrences of MD followed by zero or more K residues, followed by an L:

    MDK*L

    The above will match MDL but not MDVL.

  • To indicate that a residue can be repeated a specific number of times, use curly braces. The following pattern will find all occurrences of an M residue, followed by between one and four L residues, followed by a G residue:

    ML{1,4}G

    The above will match MLLG but not MLLLLLG.

  • The +, *, and {} characters in the above examples apply to the single residue (or bracketed set of residues) that precedes them. To find repeated sub-sequences, enclose the sub-sequence in regular parentheses and follow it with +, *, or {}. The following pattern will find all occurrences of two to five TNT sequences in a row, followed by one or more KM repeats:

    (TNT){2,5}(KM)+

    The above will match TNTTNTTNTKM but not TNTTNKM.

  • To restrict matches to the beginning of a sequence, use the ^ character. For example, the following pattern will find GACCCT only if no more than three residues precede it at the start of the sequence:

    ^.{0,3}GACCCT

    The above will find GACCCT in the sequence ATCGACCCT but not in the sequence AATCGACCCT.

  • To restrict matches to the end of a sequence, use the $ character. For example, the following pattern will find LVL only if it is located at the end of a sequence:

    LVL$

    The above will find LVL in the sequence KMHLVL but not in the sequence LVLD.

  • To find variable sequences, you can also use the | character to separate patterns for the different versions of the sequence segment you want to find. For example, to find all occurrences of MML, MAL, and MAD you could use the following:

    MML|MAL|MAD

    The above will match MML but not MMK.

  • Other examples:

    atg(...)+(tag|taa|tga)

    The above will match open reading frames that start with atg and end with an in-frame tag, taa, or tga.

    [VILMFWC]{10,}

    The above will match stretches of proteins containing ten or more consecutive hydrophobic residues.
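The match and non-match pairs in the list above can be checked outside the suite. The following is a minimal sketch that assumes Python's standard `re` module treats these patterns the same way the suite does. The names `EXAMPLES`, `contains`, and `check_examples` are hypothetical helpers and not part of the Sequence Manipulation Suite.

```python
import re

# (pattern, sequence that should match, sequence that should not match),
# copied from the examples in the list above.
EXAMPLES = [
    ("ggat", "GGAT", "GGAA"),
    ("gga[tca]", "GGAT", "GGAG"),
    ("gca.ttt", "GCAATTT", "GCAAATTT"),
    ("MVVR+", "MVVRR", "MVVDR"),
    ("MDK*L", "MDL", "MDVL"),
    ("ML{1,4}G", "MLLG", "MLLLLLG"),
    ("(TNT){2,5}(KM)+", "TNTTNTTNTKM", "TNTTNKM"),
    ("^.{0,3}GACCCT", "ATCGACCCT", "AATCGACCCT"),
    ("LVL$", "KMHLVL", "LVLD"),
    ("MML|MAL|MAD", "MML", "MMK"),
]


def contains(pattern: str, sequence: str) -> bool:
    """Return True if the pattern occurs anywhere in the sequence, ignoring case."""
    return re.search(pattern, sequence, flags=re.IGNORECASE) is not None


def check_examples(examples=EXAMPLES):
    """Return the patterns whose behaviour differs from the documented pairs."""
    failures = []
    for pattern, hit, miss in examples:
        if not contains(pattern, hit) or contains(pattern, miss):
            failures.append(pattern)
    return failures


if __name__ == "__main__":
    # An empty list means every documented pair behaved as described.
    print(check_examples())
```

The `re.IGNORECASE` flag stands in for the suite's case-insensitive matching, and `re.search` is used because the patterns are meant to be found anywhere in a sequence unless ^ or $ is present.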

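One detail of the open reading frame example is worth illustrating: in most regular expression engines the + character is greedy, so the match can run past an internal stop codon to the last in-frame stop. The sketch below uses Python's `re` module as a stand-in for the suite's own matching, which is an assumption. The lazy variant `+?`, the sequence `SEQ`, and the helper `first_match` are illustrative additions that do not appear in the text above, and whether the suite accepts `+?` is not confirmed there.

```python
import re

# The pattern from the "Other examples" item, matched case-insensitively.
ORF_GREEDY = re.compile(r"atg(...)+(tag|taa|tga)", re.IGNORECASE)

# Hypothetical variant: +? takes as few codons as possible, so the match
# ends at the first in-frame stop codon instead of the last one.
ORF_LAZY = re.compile(r"atg(...)+?(tag|taa|tga)", re.IGNORECASE)

# Made-up test sequence: cc | atg aaa tag ccc taa | gg
SEQ = "ccatgaaatagccctaagg"


def first_match(regex, sequence):
    """Return the text of the first match, or None if there is no match."""
    m = regex.search(sequence)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(first_match(ORF_GREEDY, SEQ))
    print(first_match(ORF_LAZY, SEQ))
```

With this sequence the greedy pattern returns atgaaatagccctaa, which contains the internal stop codon tag, while the lazy variant returns atgaaatag.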


Sun Oct 28 03:52:59 2012