In the previous section we learned to write simple regular expressions and to check whether a pattern has a match within a string. We did not learn, though, how to retrieve the part of the string that actually matched the pattern.
Calling preg_match() with the $matches array as a third argument
In order to retrieve the actual match we need to call the preg_match() function with a third argument, typically called “$matches” (but you may name it as you wish).
The matches (if any) to the regular expression (first argument of preg_match()) found in the string (second argument) are stored in $matches (third argument), which is an array with a particular structure. Let us start with a simple example in which we use var_dump() to explore the structure of the $matches array after a match has occurred.
<?php

$sequence1 = "CCCGAATTCTTT";
$sequence2 = "CCCTAATTATTT";

$sequences = array($sequence1, $sequence2);

$i = 1;

// The following pattern $regexp may match any of the following sequences
// GAATTC
// GAATTA
// TAATTC
// TAATTA

$regexp = "/[GT]AATT[CA]/";

foreach($sequences as $sequence){
    if(preg_match($regexp, $sequence, $matches)){
        echo "<p>\n<strong>Match to sequence $i</strong><br>\n";
        var_dump($matches);
        echo "</p>\n";
    }
    $i++;
}

?>
By running this script we get the following:
Match to sequence 1
array(1) {
  [0]=>
  string(6) "GAATTC"
}
Match to sequence 2
array(1) {
  [0]=>
  string(6) "TAATTA"
}
As we can see and as expected, a match to the regular expression was found in both sequences.
How is the $matches array, in which the results of the match are stored, structured? Since the regular expression did not contain any portion within round parentheses (used to create “capture groups”, more on this in a moment), $matches contains only one element, namely the match found for the whole regular expression. You see that each var_dump starts with “array(1)”, which means that $matches is an array containing 1 element. For both sequences, this single element, which therefore has index 0 (“[0]=>”), is a string 6 characters long (“string(6)”). The actual string differs between the two sequences, though. It is “GAATTC” for the first sequence and “TAATTA” for the second.
To rephrase this: if the regular expression (first argument of preg_match()) does not contain any capture groups, that is, portions enclosed in round parentheses, the $matches array passed as third argument will contain only one element: the part of the string (second argument) that matched the pattern defined by the whole regular expression.
There you have it, you can not only know if a regular expression finds a match inside a string, but also know what exactly this match was.
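It may also help to see what happens when no match is found. The following is a minimal sketch, using a made-up sequence that does not contain the pattern: preg_match() returns 0 and $matches ends up as an empty array, which is why it is safer to check the return value before reading $matches[0].

```php
<?php
// Hypothetical sequence, chosen so that the pattern cannot match
$sequence = "CCCCCCCCCCCC";
$regexp = "/[GT]AATT[CA]/";

// preg_match() returns 1 when a match is found and 0 when it is not
$result = preg_match($regexp, $sequence, $matches);

if ($result === 1) {
    echo "Match found: " . $matches[0] . "\n";
} else {
    // No match: $matches has been set to an empty array
    echo "No match found\n";
}
```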
Let’s run the same example as above with slight modifications to provide a cleaner output and skip the var_dump:
<?php

$sequence1 = "CCCGAATTCTTT";
$sequence2 = "CCCTAATTATTT";

$sequences = array($sequence1, $sequence2);

$i = 1;

// The following pattern $regexp may match any of the following sequences
// GAATTC
// GAATTA
// TAATTC
// TAATTA

$regexp = "/[GT]AATT[CA]/";

foreach($sequences as $sequence){
    if(preg_match($regexp, $sequence, $matches)){ // We call preg_match() with 3 arguments instead of just 2
        echo "<p>\n<strong>Match to sequence $i</strong><br>\n";
        echo "Match within sequence <span style=\"font-family:courier;\">$sequence</span> is <span style=\"font-family:courier;\">".$matches[0]."</span>\n";
        echo "</p>\n";
    }
    $i++;
}

?>
We get:
Match to sequence 1
Match within sequence CCCGAATTCTTT is GAATTC
Match to sequence 2
Match within sequence CCCTAATTATTT is TAATTA
You should be aware that preg_match() will only find the first match to the regular expression within the string.
In the example above, if the target sequence had been the following, which contains two possible matches to the pattern (GAATTC and TAATTA):
CCCGAATTCTTTCCCTAATTATTTT
only the first match, GAATTC, would have been found by the preg_match() call.
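This behaviour can be checked with a short sketch that runs the same pattern on the longer sequence shown above. The variable name $found is only illustrative.

```php
<?php
// The sequence contains two stretches that fit the pattern: GAATTC and TAATTA
$sequence = "CCCGAATTCTTTCCCTAATTATTTT";
$regexp = "/[GT]AATT[CA]/";

$found = preg_match($regexp, $sequence, $matches);

// preg_match() stops at the first match, so only one element is stored
echo $matches[0] . "\n";
echo count($matches) . "\n";
```

Even though TAATTA is also present further along the sequence, it never reaches the $matches array.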
Defining capture groups with round parentheses within a regular expression
Rather than recovering the match to the whole pattern defined in the regular expression, in many cases we may be interested in recovering the match to only a part of the pattern. We may use a pattern to identify a part of a string that contains the bit (or the bits) of information we are actually interested in.
To make this clear with an example, let’s consider a hypothetical header line of a FASTA sequence.
>gi|197107235|pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin
The GI is the number 197107235 that follows “gi|”, a number we may want to extract from the header. Learn more about GIs here.
We could indeed write a regular expression along these lines to catch it:
"/\d+/"
Which means, as you should know by now, “any digit repeated one or more times”.
This is a very broad pattern. It would match 5, 33, 453672, etc. It does catch our GI number in this specific header, but only “by chance”: the GI happens to be the first run of digits in the line.
Here’s the code we could use:
<?php

$header = '>gi|197107235|pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin';

preg_match("/\d+/", $header, $matches);

echo "<p>".$matches[0]."</p>";

?>
which gives the following output:
197107235
In this other hypothetical header:
>pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin
the very same pattern will catch “3”, the first number a preg_match() call would get to match. And it is not the GI number we wanted, which is actually not present at all within this header. Try this:
<?php

$header = '>pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin';

preg_match("/\d+/", $header, $matches);

echo "<p>".$matches[0]."</p>";

?>
As a general rule, the more specific the pattern is, and the more “context information” it contains, the higher the chances to really find what we are looking for.
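To see the problem side by side, here is a small sketch that applies the broad pattern to both hypothetical headers and collects what it catches. The $results array is introduced only for this illustration.

```php
<?php
$headers = array(
    '>gi|197107235|pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin',
    '>pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin',
);

$results = array();

foreach ($headers as $header) {
    // The broad pattern simply grabs the first run of digits in the line
    if (preg_match("/\d+/", $header, $matches)) {
        $results[] = $matches[0];
    }
}

// The first result is a real GI, the second is just the first digit of the PDB ID
print_r($results);
```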
In order to write a pattern that is specific for a GI, one that will catch a number in a FASTA header only if it really is a GI, let’s take advantage of the syntactical context in which the GI number is normally embedded in FASTA headers, and write a pattern that includes this context information.
Check the context of the GI number:
>gi|197107235|
Let’s write a regular expression that includes this context:
<?php

$gi_regexp = "/gi\|\d+\|/";

// To match the pipes | literally we have to escape them with backslashes as the pipe is a metacharacter

?>
However, if we run the usual example on it:
<?php

$header = '>gi|197107235|pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin';

$gi_regexp = "/gi\|\d+\|/";

preg_match($gi_regexp, $header, $matches);

echo "<p>".$matches[0]."</p>";

?>
This is what we get:
gi|197107235|
See how we now have the number we wanted, but it is embedded in the context we defined in the regular expression. Not surprisingly, the $matches array still contains a single element, the match to the whole regular expression.
In order to obtain the number only, we can use a capture group within the regular expression. This is done by embedding the part of the expression that matches the number within round parentheses.
The regular expression therefore becomes:
<?php

$gi_regexp = "/gi\|(\d+)\|/";

// To match the pipes | literally we have to escape them with backslashes as the pipe is a metacharacter
// See that the number part of the pattern is now surrounded by parentheses

?>
It is worth mentioning that sometimes you may want to group characters with parentheses without creating a capture group. In this case you should include ?: right after the first parenthesis. As an example, if you want to include in a pattern an optional part “ABC”, but you do not want to capture it in the $matches array, you may write:
(?:ABC)?
Don’t confuse the ? right after the parenthesis, with the one at the end, as they have a different meaning. The second one means that what precedes is optional.
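As a rough illustration of the difference, the sketch below combines an optional non-capturing “ABC” with a capture group for the digits that follow. The pattern and the two test strings are made up for this example.

```php
<?php
// (?:ABC)? groups "ABC" and makes it optional, without capturing it
// (\d+) is a regular capture group for one or more digits
$regexp = "/(?:ABC)?(\d+)/";

preg_match($regexp, "ABC123", $with_prefix);
preg_match($regexp, "123", $without_prefix);

// In both cases index 1 holds the digits: the ABC group never gets its own element
print_r($with_prefix);
print_r($without_prefix);
```

Note that “ABC” still shows up in element 0 of the first result, because it is part of the match to the whole expression.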
Now here’s the trick: for every capture group we define within the regular expression (we can create as many as we like), a new element is automatically added to the $matches array, containing the match to that capture group only. When capture groups are used, the first element of the $matches array is still the match to the whole expression, while the subsequent elements are the matches to the capture groups, in the order in which they are used in the regular expression.
If we use just one capture group, $matches[0] will contain the match to the whole expression and $matches[1] will contain the match to the capture group.
If we use 2 capture groups (two sets of parentheses), the $matches array will contain 3 elements: $matches[0] will be the match to the whole expression, $matches[1] will be the match to the first capture group and $matches[2] will be the match to the second capture group. The concept extends to any number of capture groups we may want or need to use in our expression.
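A quick way to see this numbering in action is to add two capture groups to the restriction site pattern used earlier in this section. This is only a sketch: capturing the first and last base separately is an illustrative choice, not something the previous examples needed.

```php
<?php
$sequence = "CCCGAATTCTTT";

// Two capture groups: the first base and the last base of the site
$regexp = "/([GT])AATT([CA])/";

preg_match($regexp, $sequence, $matches);

echo $matches[0] . "\n"; // match to the whole expression
echo $matches[1] . "\n"; // match to the first capture group
echo $matches[2] . "\n"; // match to the second capture group
```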
Here’s a code example to capture the GI number cleanly:
<?php

$header = '>gi|197107235|pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin';

$gi_regexp = "/gi\|(\d+)\|/";

preg_match($gi_regexp, $header, $matches);

echo "<p><strong>Match to the entire pattern:</strong><br>".$matches[0]."</p>";
echo "<p><strong>Just the GI number:</strong><br>".$matches[1]."</p>";

echo '<p><strong>The var_dump of the $matches array:</strong><br>';
var_dump($matches); // var_dump() prints by itself and returns nothing, so no echo is needed
echo "</p>";

?>
The output:
Match to the entire pattern:
gi|197107235|
Just the GI number:
197107235
The var_dump of the $matches array:
array(2) {
  [0]=>
  string(13) "gi|197107235|"
  [1]=>
  string(9) "197107235"
}
See how the $matches array now contains two elements instead of just one. The second element contains the match to the first (and in this example, only) capture group we used inside the regular expression. As in the previous examples, the first element of $matches still contains the match to the whole regular expression.
Let us consider the following FASTA sequence:
>gi|28373620|pdb|1MA9|A Chain A, Crystal Structure Of The Complex Of Human Vitamin D Binding Protein And Rabbit Muscle Actin
LERGRDYEKNKVCKEFSHLGKEDFTSLSLVLYSRKFPSGTFEQVSQLVKEVVSLTEACCAEGADPDCYDT
RTSALSAKSCESNSPFPVHPGTAECCTKEGLERKLCMAALKHQPQEFPTYVEPTNDEICEAFRKDPKEYA
NQFMWEYSTNYGQAPLSLLVSYTKSYLSMVGSCCTSASPTVCFLKERLQLKHLSLLTTLSNRVCSQYAAY
GEKKSRLSNLIKLAQKVPTADLEDVLPLAEDITNILSKCCESASEDCMAKELPEHTVKLCDNLSTKNSKF
EDCCQEKTAMDVFVCTYFMPAAQLPELPDVELPTNKDVCDPGNTKVMDKYTFELSRRTHLPEVFLSKVLE
PTLKSLGECCDVEDSTTCFNAKGPLLKKELSSFIDKGQELCADYSENTFTEYKKKLAERLKAKLPEATPT
ELAKLVNKRSDFASNCCSINSPPLYCDSEIDAELKNIL
You can find a FASTA file with this sequence here: gi-28373620.fasta
Let us write a regular expression that will allow us to catch both the GI ID and the PDB ID. This is the relevant part of the header we want to focus on, that provides all the context for the two IDs:
>gi|28373620|pdb|1MA9|
Note that while the GI ID is made of digits only, PDB identifiers are made of capital letters or digits, and they are always 4 characters long. Here is a good regular expression for the job:
<?php

$gi_plus_pdb_regexp = "/^>gi\|(\d+)\|pdb\|([A-Z0-9]{4})\|/";

// ^>gi => the line starts with >gi
// \| => then we have a pipe. It is escaped with a backslash as the pipe is a metacharacter
// (\d+) => one or more digits (the GI ID, capture group 1)
// \| => a pipe
// pdb => the three letters pdb
// \| => a pipe
// ([A-Z0-9]{4}) => capital letters or digits (0 included), exactly 4 times (the PDB ID, capture group 2)
// \| => a pipe

?>
Now that we have a good regular expression we can write a little script to parse the FASTA sequence file and extract the GI ID and the PDB ID.
<?php

$gi_plus_pdb_regexp = "/^>gi\|(\d+)\|pdb\|([A-Z0-9]{4})\|/";

$fasta_sequence = file_get_contents("http://www.cellbiol.com/bioinformatics_web_development/uploads/gi-28373620.fasta");
$fasta_sequence_lines = explode("\n", $fasta_sequence);

$gi_id = "not found"; // if the gi id is not found during the foreach cycle that follows, we still have a value for it
$pdb_id = "not found";
$match_to_whole_regexp = "not found";

foreach($fasta_sequence_lines as $line){
    if(preg_match($gi_plus_pdb_regexp, $line, $matches)){
        $match_to_whole_regexp = $matches[0]; // The first element of the $matches array contains the match to the whole regexp
        $gi_id = $matches[1]; // The second element of the $matches array contains the first capture group, the GI ID
        $pdb_id = $matches[2]; // The third element of the $matches array contains the second capture group, the PDB ID
        break; // The header has been found, no need to look at the remaining lines
    }
}

// We can now provide an output

echo "<p><strong>GI ID: </strong>$gi_id</p>";
echo "<p><strong>PDB ID: </strong>$pdb_id</p>";
echo "<p><strong>Whole match: </strong>$match_to_whole_regexp</p>";

?>
Here is the output of this script:
GI ID: 28373620
PDB ID: 1MA9
Whole match: >gi|28373620|pdb|1MA9|
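If you want to try the pattern without fetching the file over the network, the matching step can be wrapped in a small function and fed the header directly. The function name parse_gi_pdb_header() and the array keys 'gi' and 'pdb' are hypothetical, chosen only for this sketch. The character class is written as [A-Z0-9] on the assumption that PDB IDs may also contain the digit 0.

```php
<?php
// Hypothetical helper: returns array('gi' => ..., 'pdb' => ...) for a matching
// header line, or null when the line does not match the pattern
function parse_gi_pdb_header($line) {
    $regexp = "/^>gi\|(\d+)\|pdb\|([A-Z0-9]{4})\|/";
    if (preg_match($regexp, $line, $matches)) {
        return array('gi' => $matches[1], 'pdb' => $matches[2]);
    }
    return null;
}

$header = '>gi|28373620|pdb|1MA9|A Chain A, Crystal Structure Of The Complex Of Human Vitamin D Binding Protein And Rabbit Muscle Actin';

$ids = parse_gi_pdb_header($header);

// A sequence line does not start with ">gi|", so it yields null
$no_ids = parse_gi_pdb_header('LERGRDYEKNKVCKEFSHLGKEDFTSLSLVLYSRKFPSGTFEQVSQLVKEVVSLTEACCAEGADPDCYDT');
```

Returning null for non-matching lines keeps the “was anything found?” check in one place, instead of relying on placeholder strings.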
As a final example for this section, let’s unleash the power of scripting and regular expressions to extract all the GI IDs and PDB IDs from a FASTA file containing several sequences.
You can view or download the file from this link. If you look carefully at the headers of the sequences in this file, you may notice that while all of them include a GI ID, only some include a PDB ID. For this reason we cannot use the compound regular expression from the previous example to catch all the GI IDs and all the PDB IDs, as that expression assumes that both IDs are present.
Let us write a new regular expression in which the PDB ID part is optional.
<?php

$gi_plus_pdb_regexp = "/^>gi\|(\d+)\|(?:pdb\|([A-Z0-9]{4})\|)?/";

// ^>gi => the line starts with >gi
// \| => then we have a pipe. It is escaped with a backslash as the pipe is a metacharacter
// (\d+) => one or more digits (the GI ID, capture group 1)
// \| => a pipe
// (?: => group but not capture. Here we start an optional group of characters, the pdb part. After this group we have a ? that indicates that it is optional
// pdb => the three letters pdb
// \| => a pipe
// ([A-Z0-9]{4}) => capital letters or digits (0 included), exactly 4 times (the PDB ID, capture group 2)
// \| => a pipe
// ) => the end of the "group but not capture" pdb part
// ? => indicates that what precedes (the pdb part) is optional

?>
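Before using this pattern on a whole file, it is worth checking how $matches looks when the optional part is absent. In the sketch below both headers are shortened, made-up examples, and the character class is written as [A-Z0-9] on the assumption that PDB IDs may contain the digit 0. When the PDB part is missing, preg_match() leaves the trailing unmatched group out of the array, so $matches has no element with index 2.

```php
<?php
$regexp = "/^>gi\|(\d+)\|(?:pdb\|([A-Z0-9]{4})\|)?/";

// Hypothetical headers: the first has a PDB part, the second does not
preg_match($regexp, '>gi|28373620|pdb|1MA9|A Chain A', $with_pdb);
preg_match($regexp, '>gi|12345|ref|XX_000000.1| hypothetical protein', $without_pdb);

print_r($with_pdb);    // three elements: whole match, GI ID, PDB ID
print_r($without_pdb); // two elements: whole match and GI ID only

// This is why a script should test for index 2 before reading it
$has_pdb = array_key_exists(2, $without_pdb);
```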
We can now write the code that parses the FASTA file with several sequences. The results will be presented in a table; check out this page on W3Schools for how to generate a table in HTML. The thead and tbody tags (optional in tables) are described here.
Also note that we store all the results in an array called $ids. The elements of this $ids array are themselves arrays of two elements, the first being the GI ID and the second the PDB ID of a sequence. So the structure of the $ids array can be exemplified as follows:
[(GI,PDB),(GI,PDB),(GI,PDB),(GI,PDB),(GI,PDB),(GI,PDB), etc…]
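In PHP terms, the structure sketched above could look like the following. The values are placeholders chosen for illustration (the second GI ID is invented), with "not found" standing in for a missing PDB ID.

```php
<?php
// Hypothetical content, only to show the shape of the $ids array
$ids = array(
    array("28373620", "1MA9"),   // GI ID and PDB ID both present
    array("12345", "not found"), // GI ID present, PDB ID missing
);

foreach ($ids as $id_couple) {
    // Index 0 is the GI ID, index 1 is the PDB ID
    echo "GI: " . $id_couple[0] . ", PDB: " . $id_couple[1] . "\n";
}
```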
<?php

$gi_plus_pdb_regexp = "/^>gi\|(\d+)\|(?:pdb\|([A-Z0-9]{4})\|)?/"; // In this formulation the PDB part is optional

$fasta_sequences = file_get_contents("http://www.cellbiol.com/bioinformatics_web_development/uploads/actin-human-partial.fasta");
$fasta_sequence_lines = explode("\n", $fasta_sequences);

$ids = array();
$seqs_num = 0;
$gis_num = 0;
$pdb_num = 0;

foreach($fasta_sequence_lines as $line){
    // Default values, used when an ID is not captured for the current line
    $gi_id = "not found";
    $pdb_id = "not found";
    if(preg_match($gi_plus_pdb_regexp, $line, $matches)){ // We basically do stuff only when we get to a line for which a match is found
        $gi_id = $matches[1];
        $gis_num++;
        if(array_key_exists(2, $matches)){ // if a PDB ID was found during the match
            $pdb_id = $matches[2];
            $pdb_num++;
        }
        $ids[] = array($gi_id, $pdb_id); // For each sequence we add a sub-array with two elements, GI ID and PDB ID, to the $ids results array
        $seqs_num++;
    }
}

// We can now provide an output
// We generate a proper HTML page this time, with a full header and a couple of CSS styles defined for the results table

echo "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<meta charset=\"utf-8\">\n<title>Test page</title>\n<style>td{padding:5px;}thead{font-weight:bold;}</style>\n</head>\n<body>";
echo "<p>$seqs_num sequences found</p>\n";
echo "<p>$gis_num GI IDs found</p>\n";
echo "<p>$pdb_num PDB IDs found</p>\n";
echo "<table><thead><tr><th>GI ID</th><th>PDB ID</th></tr></thead><tbody>\n"; // Starting to output the HTML for the table

foreach($ids as $id_couple){ // for each couple of GI/PDB IDs we create a new table row
    echo "<tr><td>".$id_couple[0]."</td><td>".$id_couple[1]."</td></tr>\n";
}

echo "</tbody>\n</table>\n</body>\n</html>"; // Closing the various tags opened before

?>
You can do a copy-paste and run the script on your own server, or run a demo here.
If you are able to understand the last script example, you have come a long way in your PHP and regular expressions learning journey, congratulations!
In the next section we will learn how to extract all the matches to a regular expression in a string with preg_match_all(), rather than just the first match, as preg_match() does. Keep reading.