PHP Data structure for scoring matrices
We wish to import the scoring matrix data from the .csv files in a data structure in PHP that would be convenient to use for calculating the peptide binding scores. There is indeed one structure that is perfect for the job: an associative array or dictionary, in which we have a key for each amino-acid letter that has, as associated value, an array with the scoring values as elements.
To exemplify the concept, let’s consider the lines for Alanine and Leucine in the .csv file for MHC allele HLA-A1 that we generated in the previous section:
A;1.69;1.50;-0.15;-1.5;0.82;1.36;1.00;1.14;-1.20
L;0.51;1.62;1.24;-0.29;0.19;0.44;0.38;0.22;1.31
and manually build a dictionary called $matrix with just two keys, “A” and “L”, for now.
1 2 3 4 5 6 7 8 |
$matrix = array( "A" => array(1.69,1.50,-0.15,-1.5,0.82,1.36,1.00,1.14,-1.20), "L" => array(-1.00,0.00,-1.00,-1.00,0.00,-1.38,0.83,0.12,-1.67) ); |
We could use this matrix to score a 9 mer peptide solely composed of Alanine and Leucine residues, with sequence: ALLAAAALA
The scoring array for an amino-acid can be accessed very easily in the matrix by using the amino-acid itself as key. For an Alanine (“A”), this would be:
$matrix[“A”]
Then, to access the score for the A in a specific position of the peptide, we just use the position itself as an array index, where position 1 in the peptide corresponds to index 0 (the first element) in the array. So the score for an alanine in position 1 would be:
$matrix[“A”][0]
Check out the following scoring example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
<?php $matrix = array( "A" => array(1.69,1.50,-0.15,-1.5,0.82,1.36,1.00,1.14,-1.20), "L" => array(-1.00,0.00,-1.00,-1.00,0.00,-1.38,0.83,0.12,-1.67) ); $peptide = "ALLAAAALA"; $score = 0; $aminoacids = str_split($peptide, 1); $i = 0; foreach($aminoacids as $aminoacid){ $score = $score + $matrix[$aminoacid][$i]; $i++; } echo "<p>$peptide score is $score</p>"; // ALLAAAALA score is 1.29 ?> |
Based on this code, we can easily write a function that takes a peptide and a matrix as arguments and returns the peptide score. In the following code example we write the function and then use it to score a peptide. In the function, we include a check to make sure that each analysed amino-acid exists as a key in the matrix and return an error if it does not. This could be useful for code debugging and will prevent PHP errors if we supply an invalid sequence.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
<?php function pepscore($peptide, $matrix){ $score = 0; $aminoacids = str_split($peptide, 1); $i = 0; foreach($aminoacids as $aminoacid){ if(array_key_exists($aminoacid, $matrix)){ $score = $score + $matrix[$aminoacid][$i]; $i++; } else{ return "aa not found"; } } return $score; } $matrix = array( "A" => array(1.69,1.50,-0.15,-1.5,0.82,1.36,1.00,1.14,-1.20), "L" => array(-1.00,0.00,-1.00,-1.00,0.00,-1.38,0.83,0.12,-1.67) ); $peptide = "ALLAAAALA"; $score = pepscore($peptide, $matrix); echo "<p>$peptide score is $score</p>"; // ALLAAAALA score is 1.29 ?> |
Once we have a score, we can use the thresholds data available in the matrix to rank the peptide among the x% best binders.
In the previous section we did store the threshold information in the csv file in this format:
tre;6.68;5.18;4.21;3.47;2.85;2.31;1.83;1.39;0.99;0.61;
We could store this data in our matrix array in the same format as the scoring data for amino-acids:
1 2 3 4 5 6 7 |
$matrix = array( "A" => array(1.69,1.50,-0.15,-1.5,0.82,1.36,1.00,1.14,-1.20), "L" => array(-1.00,0.00,-1.00,-1.00,0.00,-1.38,0.83,0.12,-1.67), "tre" => array(6.68,5.18,4.21,3.47,2.85,2.31,1.83,1.39,0.99,0.61) ); |
Let’s calculate the score and the rank this time.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
<?php $matrix = array( "A" => array(1.69,1.50,-0.15,-1.5,0.82,1.36,1.00,1.14,-1.20), "L" => array(-1.00,0.00,-1.00,-1.00,0.00,-1.38,0.83,0.12,-1.67), "tre" => array(6.68,5.18,4.21,3.47,2.85,2.31,1.83,1.39,0.99,0.61) ); $peptide = "ALLAAAALA"; $score = 0; $aminoacids = str_split($peptide, 1); $i = 0; foreach($aminoacids as $aminoacid){ $score = $score + $matrix[$aminoacid][$i]; $i++; } $rank = "11"; // We start by assuming the rank is higher than 10. We basically bulk assign, arbitrarily, a rank 11 to all the peptides with rank higher than 10. // if the score is higher than one of the thresholds, we will upgrade that. // Mind that the lower the rank number, the best the peptide is as a binder. A peptide with rank 1 would be among the 1% of best binders. // This is a better rank than being "just" among the 10% best binders, so make no mistakes here. $i = 1; foreach($matrix["tre"] as $threshold){ if($score >= $threshold){ $rank = $i; break; } else{ $i++; } } echo "<p>$peptide score is $score, which ranks it among the $rank% best binders</p>"; // ALLAAAALA score is 1.29, which ranks it among the 9% best binders ?> |
Let’s now modify the pepscore() function we wrote before to return the score, the rank, or both, depending on a third optional argument, so that we can enjoy some flexibility while using it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 |
<?php function pepscore($peptide, $matrix, $mode="all"){ $score = 0; $aminoacids = str_split($peptide, 1); $i = 0; foreach($aminoacids as $aminoacid){ if(array_key_exists($aminoacid, $matrix)){ $score = $score + $matrix[$aminoacid][$i]; $i++; } else{ return "aa not found"; } } $rank = "11"; // We start by assuming the rank is higher than 10. We basically bulk assign, arbitrarily, a rank 11 to all the peptides with rank higher than 10. // if the score is higher than one of the thresholds, we will upgrade that. // Mind that the lower the rank number, the best the peptide is as a binder. A peptide with rank 1 would be among the 1% of best binders. // This is a better rank than being "just" among the 10% best binders, so make no mistakes here. // The threshods, in the array, are in decreasing order. The first threshold is the highest and scores higher that that rank the peptide among the best 1% binders // if the score fails to be higher than the first threshold, we check for the second threshold, and so on. // When we find a threshold that is lower than the calculated score, we stop iterating through the thresholds by breaking the foreach cycle with a break statement // if we do not find such a threshold, the rank will remain 11 $i = 1; // $i is the rank value here foreach($matrix["tre"] as $threshold){ if($score >= $threshold){ $rank = $i; break; } else{ $i++; } } if($mode == "all"){ return array($score, $rank); } elseif($mode == "score"){ return $score; } elseif($mode == "rank"){ return $rank; } } $matrix = array( "A" => array(1.69,1.50,-0.15,-1.5,0.82,1.36,1.00,1.14,-1.20), "L" => array(-1.00,0.00,-1.00,-1.00,0.00,-1.38,0.83,0.12,-1.67), "tre" => array(6.68,5.18,4.21,3.47,2.85,2.31,1.83,1.39,0.99,0.61) ); $peptide = "ALLAAAALA"; $pepscore = pepscore($peptide, $matrix); $score = $pepscore[0]; $rank = $pepscore[1]; echo "<p>$peptide score is $score, which ranks it among the $rank% best binders</p>"; // ALLAAAALA score is 1.29, which ranks it among the 9% best binders ?> |
We now have a function, pepscore(), to score and rank a peptide based on a matrix. A quite valuable and central piece of code for our T-Score application.
Let’s now see how we can generate a proper matrix associative array starting from one of the .csv files we produced in the previous section.
Generating scoring matrices from the .csv files
We will now write a little function matrix_file_to_array() to convert our .csv files into associative arrays with the format discussed above.
The flow of the code is as follows:
– We import the file to a variable
– Split on “\n” to get the lines
– For each line, we split the line to an array by using the semicolon as splitting element. The first element of this array will become a key in our final matrix associative array, while all the other elements will be part of an array assigned as value to this key. Remember how each line is structured:
A;1.69;1.50;-0.15;-1.5;0.82;1.36;1.00;1.14;-1.20
– The function then returns this matrix array
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
<?php function matrix_file_to_array($matrix_file_path){ $matrix_file = file_get_contents($matrix_file_path); // Importing the .csv file to a variable $lines = explode("\n", $matrix_file); // Splitting the file into lines $matrix_array = array(); // Our final matrix associative array, empty for now foreach($lines as $line){ $line = trim($line); if($line != ""){ // If the line is not an empty line $elements = explode(";", $line); // splitting the line on the semicolons $key = array_shift($elements); // array_shift() removes the first element from an array and returns it. The first element is our key for the matrix associative array. $matrix_array[$key] = $elements; // } } return $matrix_array; } // Let's test this function on the .csv file with the data for the HLA-A1 allele $a1_matrix = matrix_file_to_array("matrix/a1-matrix.csv"); var_dump($a1_matrix); ?> |
On executing the code above from our script.php file located in the tscore directory, we get the var_dump of the matrix for the HLA-A1 allele. It will show that the $a1_matrix variable is now an array of 22 elements, as expected, as we have 20 amino-acids, plus the X amino-acid, plus the “tre” line with the thresholds.
Here is how the var_dump starts (it’s quite long so we just report the beginning of the output, but please test it on your own server):
array(22) {
[“A”]=>
array(9) {
[0]=>
string(4) “1.69”
[1]=>
string(4) “1.50”
[2]=>
string(5) “-0.15”
[3]=>
string(5) “-1.50”
[4]=>
string(4) “0.82”
[5]=>
string(4) “1.36”
[6]=>
string(4) “1.00”
[7]=>
string(4) “1.14”
[8]=>
string(5) “-1.20”
}
[“C”]=>
array(9) {
[0]=>
string(4) “2.00”
[1]=>
string(4) “0.00”
[2]=>
string(4) “0.00”
[3]=>
string(4) “1.00”
[4]=>
string(4) “0.00”
[5]=>
string(5) “-2.00”
[6]=>
string(4) “0.00”
[7]=>
string(5) “-2.00”
[8]=>
string(4) “0.00”
}
[“D”]=>
array(9) {
[0]=>
string(5) “-1.50”
[1]=>
string(5) “-2.00”
[2]=>
string(4) “6.52”
[3]=>
string(5) “-0.67”
[4]=>
string(5) “-2.00”
[5]=>
string(4) “0.67”
[6]=>
string(5) “-1.33”
[7]=>
string(5) “-1.00”
[8]=>
string(5) “-2.00”
}
etc…
Getting real: scoring and ranking a peptide sequence by using a full scoring matrix
In the following code example we will select an allele name, generate the scoring matrix array “on the fly” and use it to score and rank an actual peptide sequence.
To get a fully woking self-contained example, we move the pepscore() and matrix_file_to_array() functions to a “functions.php” file created in a folder called “include”, inside the tscore directory.
We also create a new php file called “generate_matrices.php” where we will move all the code related to the download of the .html files from the original website and the generation of the .csv files in our matrix folder, from these .html files. The reason for moving some of the code in a different file is to keep our main file, script.php, as clean and light as possible so that it will be more readable and easier to maintain. As functions.php, we will store this file in the include folder.
Our application file structure therefore becomes, for now:
tscore (permission 777)
matrix (permission 777)
script.php
include
functions.php
generate_matrices.php
And here is the content of the files:
generate_matrices.php
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 |
<?php $urls = array("http://www.imtech.res.in/raghava/nhlapred/matrices/a1.html", "http://www.imtech.res.in/raghava/nhlapred/matrices/a2.html", "http://www.imtech.res.in/raghava/nhlapred/matrices/a3.html"); foreach($urls as $url){ preg_match("/a\d\.html/", $url, $matches); // We are looking for the final "ax.html" part of the URL $page_name = $matches[0]; // and store it into a variable called $page_name $local_page_path = "matrix/$page_name"; // We will store the files with their original names in the matrix folder, here we define the relative path with respect to script.php if(!file_exists($local_page_path)){ // We check that the file is not there already, and if not file_put_contents($local_page_path, file_get_contents($url)) or die("something went wrong"); } } $file_names = scandir("matrix"); // An array with all the names of the files in the matrix directory $score_regexp = "/<TD><CENTER>(.{1,5})<\/CENTER><\/TD>/"; $threshold_regexp = "/<td><center>(.{4})<\/center><\/td>/"; // Unlike those for the score values, html tags for the threshold values are in lowercase in the original pages we are scraping the data from foreach($file_names as $file_name){ // Iterating through the files, one by one if(preg_match("/^(.+)\.html$/", $file_name, $matches) and !file_exists("matrix/".$matches[1]."-matrix.csv")){ $html_file_path = "matrix/$file_name"; $csv_file_path = "matrix/".$matches[1]."-matrix.csv"; $html_file_contents = file_get_contents($html_file_path); $html_file_lines = explode("\n", $html_file_contents); $csv_handle = fopen($csv_file_path, "w"); $table_flag = 0; // Each time we come across an opening table tag during file parsing, this will be raised by one. $threshold_csv_line_temp = "\ntre;"; $j = 1; // Counter for the thresholds lines // File parsing to scrape the data may begin! foreach($html_file_lines as $line){ $line = trim($line); // You never know which weird characters may be lurking at the beginning or end of your lines... if(preg_match("/<table/", $line)){ $table_flag++; } if($table_flag == 5){ // We are parsing one of the lines belonging the scores table if(preg_match_all($score_regexp, $line, $matches) and !preg_match("/P1/", $line)){ $i = 1; foreach($matches[1] as $match){ if($i < 10){ fwrite($csv_handle, $match.";"); // We want to add a semicolon after each of the element but the last one and we have a total of 10 elements, one aminoacid letter plus 9 scores $i++; } elseif($i == 10){ fwrite($csv_handle, $match."\n"); // After the last score we add a newline instead of a semicolon } } } } elseif($table_flag == 6){ // We are parsing one of the lines belonging the thresholds table if(preg_match($threshold_regexp, $line, $matches)){ // We use preg_match instead of preg_match_all as in the original files HTML source each threshold is in it's own line so we just need to grab one value here // we also manage things a little differently, adding the values to a variable instead of writing them to the file directly, for the same reason. if($j < 10){ $threshold_csv_line_temp = $threshold_csv_line_temp.$matches[1].";"; $j++; } elseif($j == 10){ // The last threshold value $threshold_csv_line_temp = $threshold_csv_line_temp.$matches[1]; } } } } fwrite($csv_handle, $threshold_csv_line_temp); fclose($csv_handle); } } ?> |
functions.php
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 |
<?php function pepscore($peptide, $matrix, $mode="all"){ $score = 0; $aminoacids = str_split($peptide, 1); $i = 0; foreach($aminoacids as $aminoacid){ if(array_key_exists($aminoacid, $matrix)){ $score = $score + $matrix[$aminoacid][$i]; $i++; } else{ return "aa not found"; } } $rank = "11"; // We start by assuming the rank is higher than 10. We basically bulk assign, arbitrarily, a rank 11 to all the peptides with rank higher than 10. // if the score is higher than one of the thresholds, we will upgrade that. // Mind that the lower the rank number, the best the peptide is as a binder. A peptide with rank 1 would be among the 1% of best binders. // This is a better rank than being "just" among the 10% best binders, so make no mistakes here. // The threshods, in the array, are in decreasing order. The first threshold is the highest and scores higher that that rank the peptide among the best 1% binders // if the score fails to be higher than the first threshold, we check for the second threshold, and so on. // When we find a threshold that is lower than the calculated score, we stop iterating through the thresholds by breaking the foreach cycle with a break statement // if we do not find such a threshold, the rank will remain 11 $i = 1; // $i is the rank value here foreach($matrix["tre"] as $threshold){ if($score >= $threshold){ $rank = $i; break; } else{ $i++; } } if($mode == "all"){ return array($score, $rank); } elseif($mode == "score"){ return $score; } elseif($mode == "rank"){ return $rank; } } function matrix_file_to_array($matrix_file_path){ $matrix_file = file_get_contents($matrix_file_path); // Importing the .csv file to a variable $lines = explode("\n", $matrix_file); // Splitting the file into lines $matrix_array = array(); // Our final matrix associative array, empty for now foreach($lines as $line){ $line = trim($line); if($line != ""){ // If the line is not an empty line $elements = explode(";", $line); // splitting the line on the semicolons $key = array_shift($elements); // array_shift() removes the first element from an array and returns it. The first element is our key for the matrix associative array. $matrix_array[$key] = $elements; // } } return $matrix_array; } ?> |
script.php
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
<?php include("include/functions.php"); include("include/generate_matrices.php"); $mhc_allele = "a1"; $mhc_allele_txt = "HLA-A1"; $peptide = "KLDVHRTAC"; $matrix = matrix_file_to_array("matrix/$mhc_allele-matrix.csv"); // We dynamically generate the path of the .csv file from the mhc allele name, will be useful later on when we let the user decide which allele he wants to use for the scoring in the web form $pepscore = pepscore($peptide, $matrix); $score = $pepscore[0]; $rank = $pepscore[1]; echo "<p>Peptide $peptide has a score of $score with the $mhc_allele_txt MHC allele, which ranks it among the $rank% best binders</p>"; ?> |
See how, as usual, with all the pieces (accessory code and functions) in place, the script itself becomes short and easy to write.
Running this code will output the following:
Peptide KLDVHRTAC has a score of 3.86 with the HLA-A1 MHC allele, which ranks it among the 4% best binders
As you see, we are getting closer to be able to write a full fledged code for our T-Score web application.
The slide_window() function
There is one important part still missing though: in the web form, the user will provide a full protein sequence rather than a 9 mer peptide sequence. We need a function to extract all possible 9 mers from a protein sequence, by applying a so-called “sliding window”. To make a simple example, let’s suppose the original protein sequence provided by the user has a length of 15 amino-acids.
KTKLDVHRTACEWVM
The 9 mers it contains are the following:
1-9: KTKLDVHRT
2-10: TKLDVHRTA
3-11: KLDVHRTAC
4-12: LDVHRTACE
5-13: DVHRTACEW
6-14: VHRTACEWV
7-15: HRTACEWVM
Those are obtained by applying a window with a fixed length of 9 to the sequence, and sliding it each time by one position on the sequence, hence the term “sliding window”.
Scoring a full protein means generating all possible peptides with the sliding window and then scoring each individual peptide. The peptides that rank higher than 10 are considered “negative”. Remember that the lower the rank, the better binder a peptide is. A rank of 1 means that the peptide is among the top 1% binders, which is better than being “just” among the top 10% binders.
Let us write a slide_window() function and test it on a short sequence. The function will take a protein sequence as argument and return an array with the individual peptide sequences of the width provided as second argument (optional, as default width= 9) as elements.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
<?php function slide_window($sequence, $width=9){ $peps = array(); // The peptides array, to be returned. Empty for now // Let's get rid of the cases in which the sequence is shorter than the sliding window width, or equal to it // Mind that when a function returns, the rest of the code it's not executed if(strlen($sequence) < $width){ return "input sequence too short"; } elseif(strlen($sequence) == $width){ return array($sequence); } $start = 0; // We start from the beginning of the sequence. Isn't that a great idea // The substr() function extract a substring from a string (first argument) starting from the given start position (second argument), of a given length (third argument) while(strlen(substr($sequence, $start, $width)) == $width){ // if the substring becomes shorter than the width of the sliding window (will happen toward the end of the input sequence), the cycle will stop $peps[] = substr($sequence, $start, $width); $start++; } return $peps; } // Let's test the function $sequence = "KTKLDVHRTACEWVM"; $peps = slide_window($sequence); var_dump($peps); ?> |
This is the output of the script above, everything looks nice:
array(7) {
[0]=>
string(9) “KTKLDVHRT”
[1]=>
string(9) “TKLDVHRTA”
[2]=>
string(9) “KLDVHRTAC”
[3]=>
string(9) “LDVHRTACE”
[4]=>
string(9) “DVHRTACEW”
[5]=>
string(9) “VHRTACEWV”
[6]=>
string(9) “HRTACEWVM”
}
Scoring and ranking all the peptides in a protein for binding to an MHC allele
By maintaining the very same structure as the script we used above to score and rank a single peptide, let’s just add the slide_window() and the process_fasta() functions to the functions.php file in the include folder and revisit the script.php file in order to score and rank all the peptides from an entire target protein, the capsid protein of human rhinovirus 14 (HRV-14) – http://www.uniprot.org/uniprot/P03303.
>sp|P03303|POLG_HRV14 Genome polyprotein OS=Human rhinovirus 14 PE=1 SV=3
MGAQVSTQKSGSHENQNILTNGSNQTFTVINYYKDAASTSSAGQSLSMDPSKFTEPVKDL
MLKGAPALNSPNVEACGYSDRVQQITLGNSTITTQEAANAVVCYAEWPEYLPDVDASDVN
KTSKPDTSVCRFYTLDSKTWTTGSKGWCWKLPDALKDMGVFGQNMFFHSLGRSGYTVHVQ
CNATKFHSGCLLVVVIPEHQLASHEGGNVSVKYTFTHPGERGIDLSSANEVGGPVKDVIY
NMNGTLLGNLLIFPHQFINLRTNNTATIVIPYINSVPIDSMTRHNNVSLMVIPIAPLTVP
TGATPSLPITVTIAPMCTEFSGIRSKSIVPQGLPTTTLPGSGQFLTTDDRQSPSALPNYE
PTPRIHIPGKVHNLLEIIQVDTLIPMNNTHTKDEVNSYLIPLNANRQNEQVFGTNLFIGD
GVFKTTLLGEIVQYYTHWSGSLRFSLMYTGPALSSAKLILAYTPPGARGPQDRREAMLGT
HVVWDIGLQSTIVMTIPWTSGVQFRYTDPDTYTSAGFLSCWYQTSLILPPETTGQVYLLS
FISACPDFKLRLMKDTQTISQTVALTEGLGDELEEVIVEKTKQTVASISSGPKHTQKVPI
LTANETGATMPVLPSDSIETRTTYMHFNGSETDVECFLGRAACVHVTEIQNKDATGIDNH
REAKLFNDWKINLSSLVQLRKKLELFTYVRFDSEYTILATASQPDSANYSSNLVVQAMYV
PPGAPNPKEWDDYTWQSASNPSVFFKVGDTSRFSVPYVGLASAYNCFYDGYSHDDAETQY
GITVLNHMGSMAFRIVNEHDEHKTLVKIRVYHRAKHVEAWIPRAPRALPYTSIGRTNYPK
NTEPVIKKRKGDIKSYGLGPRYGGIYTSNVKIMNYHLMTPEDHHNLIAPYPNRDLAIVST
GGHGAETIPHCNCTSGVYYSTYYRKYYPIICEKPTNIWIEGNPYYPSRFQAGVMKGVGPA
EPGDCGGILRCIHGPIGLLTAGGSGYVCFADIRQLECIAEEQGLSDYITGLGRAFGVGFT
DQISTKVTELQEVAKDFLTTKVLSKVVKMVSALVIICRNHDDLVTVTATLALLGCDGSPW
RFLKMYISKHFQVPYIERQANDGWFRKFNDACNAAKGLEWIANKISKLIEWIKNKVLPQA
KEKLEFCSKLKQLDILERQITTMHISNPTQEKREQLFNNVLWLEQMSQKFAPLYAVESKR
IRELKNKMVNYMQFKSKQRIEPVCVLIHGTPGSGKSLTTSIVGRAIAEHFNSAVYSLPPD
PKHFDGYQQQEVVIMDDLNQNPDGQDISMFCQMVSSVDFLPPMASLDNKGMLFTSNFVLA
STNSNTLSPPTILNPEALVRRFGFDLDICLHTTYTKNGKLNAGMSTKTCKDCHQPSNFKK
CCPLVCGKAISLVDRTTNIRYSVDQLVTAIISDFKSKMQITDSLETLFQGPVYKDLEIDV
CNTPPPECINDLLKSVDSEEIREYCKKKKWIIPEIPTNIERAMNQASMIINTILMFVSTL
GIVYVIYKLFAQTQGPYSGNPPHNKLKAPTLRPVVVQGPNTEFALSLLRKNIMTITTSKG
EFTGLGIHDRVCVIPTHAQPGDDVLVNGQKIRVKDKYKLVDPENINLELTVLTLDRNEKF
RDIRGFISEDLEGVDATLVVHSNNFTNTILEVGPVTMAGLINLSSTPTNRMIRYDYATKT
GQCGGVLCATGKIFGIHVGGNGRQGFSAQLKKQYFVEKQGQVIARHKVREFNINPVNTPT
KSKLHPSVFYDVFPGDKEPAVLSDNDPRLEVKLTESLFSKYKGNVNTEPTENMLVAVDHY
AGQLLSLDIPTSELTLKEALYGVDGLEPIDITTSAGFPYVSLGIKKRDILNKETQDTEKM
KFYLDKYGIDLPLVTYIKDELRSVDKVRLGKSRLIEASSLNDSVNMRMKLGNLYKAFHQN
PGVLTGSAVGCDPDVFWSVIPCLMDGHLMAFDYSNFDASLSPVWFVCLEKVLTKLGFAGS
SLIQSICNTHHIFRDEIYVVEGGMPSGCSGTSIFNSMINNIIIRTLILDAYKGIDLDKLK
ILAYGDDLIVSYPYELDPQVLATLGKNYGLTITPPDKSETFTKMTWENLTFLKRYFKPDQ
QFPFLVHPVMPMKDIHESIRWTKDPKNTQDHVRSLCMLAWHSGEKEYNEFIQKIRTTDIG
KCLILPEYSVLRRRWLDLF
script.php
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 |
<?php include("include/functions.php"); include("include/generate_matrices.php"); $fasta_sequence = file_get_contents("http://www.uniprot.org/uniprot/P03303.fasta"); $seq_arr = process_fasta($fasta_sequence); $seq_header = $seq_arr[0]; $sequence = $seq_arr[1]; $rank_threshold = 10; $peptides = slide_window($sequence); $mhc_allele = "a1"; $mhc_allele_txt = "HLA-A1"; $matrix = matrix_file_to_array("matrix/$mhc_allele-matrix.csv"); $results = array(); // This is where we store all the data for peptides that rank at least 10 or below, "positive" peptides // Remember that the lower the rank is, the best binder a peptide is predicted to be. The other peptides, the "negatives", will be discarded and not shown in the output $pos_counter = 1; // To keep track of the start position of each peptide in the sequence foreach($peptides as $peptide){ $pepscore = pepscore($peptide, $matrix); $score = $pepscore[0]; $rank = $pepscore[1]; if($rank <= $rank_threshold){ $results[] = array($pos_counter, $peptide, $score, $rank); // In the results array each peptide is represented by an array storing the peptide sequence, start position, score and rank } $pos_counter++; // The position counter is incremented by one each time we proceed to the next peptide } $total_peptides_num = $pos_counter; $positive_peptides_num = count($results); echo "<p>Analysed protein: $seq_header</p>\n"; echo "<p>Out of $total_peptides_num total peptides, $positive_peptides_num have a rank better then $rank_threshold for binding to MHC allele $mhc_allele_txt</p>\n"; echo "<style>td{padding:5px;}\nthead{font-weight:bold}</style>\n"; echo "<table>\n<thead>\n<tr><td>Position</td><td>Sequence</td><td>Score</td><td>Rank</td></tr>\n</thead>\n<tbody>\n"; foreach($results as $result){ $start = $result[0]; $end = $start + 8; $pepseq = $result[1]; $pepscore = $result[2]; $peprank = $result[3]; echo "<tr><td>$start-$end</td><td>$pepseq</td><td>$pepscore</td><td>$peprank</td></tr>\n"; } echo "</tbody>\n</table>"; ?> |
On running this script we get the following output:
Analysed protein: >sp|P03303|POLG_HRV14 Genome polyprotein OS=Human rhinovirus 14 PE=1 SV=3
Out of 2172 total peptides, 205 rank better then 10 for binding to MHC allele HLA-A1
Position | Sequence | Score | Rank |
12-20 | SHENQNILT | 1.3 | 9 |
24-32 | NQTFTVINY | 2.5 | 6 |
25-33 | QTFTVINYY | 6.08 | 2 |
33-41 | YKDAASTSS | 3.72 | 4 |
40-48 | SSAGQSLSM | 0.94 | 10 |
47-55 | SMDPSKFTE | 1.34 | 9 |
53-61 | FTEPVKDLM | 4.47 | 3 |
70-78 | SPNVEACGY | 4.91 | 3 |
72-80 | NVEACGYSD | 1.43 | 8 |
78-86 | YSDRVQQIT | 1.09 | 9 |
93-101 | TTQEAANAV | 1.55 | 8 |
94-102 | TQEAANAVV | 1.45 | 8 |
96-104 | EAANAVVCY | 4.26 | 3 |
97-105 | AANAVVCYA | 2.49 | 6 |
104-112 | YAEWPEYLP | 1 | 9 |
111-119 | LPDVDASDV | 0.62 | 10 |
116-124 | ASDVNKTSK | 3.55 | 4 |
134-142 | TLDSKTWTT | 1.3 | 9 |
155-163 | LKDMGVFGQ | 1.99 | 7 |
167-175 | FHSLGRSGY | 1.56 | 8 |
196-204 | IPEHQLASH | 3.62 | 4 |
202-210 | ASHEGGNVS | 1.08 | 9 |
203-211 | SHEGGNVSV | 3.63 | 4 |
205-213 | EGGNVSVKY | 1.62 | 8 |
209-217 | VSVKYTFTH | 0.74 | 10 |
222-230 | GIDLSSANE | 3.46 | 5 |
228-236 | ANEVGGPVK | 4.08 | 4 |
232-240 | GGPVKDVIY | 0.79 | 10 |
235-243 | VKDVIYNMN | 5 | 3 |
264-272 | NTATIVIPY | 2.68 | 6 |
268-276 | IVIPYINSV | 2.19 | 7 |
290-298 | MVIPIAPLT | 1.08 | 9 |
297-305 | LTVPTGATP | 1.1 | 9 |
317-325 | CTEFSGIRS | 4.83 | 3 |
358-366 | NYEPTPRIH | 2.38 | 6 |
390-398 | HTKDEVNSY | 4.74 | 3 |
391-399 | TKDEVNSYL | 3.73 | 4 |
398-406 | YLIPLNANR | 1.16 | 9 |
418-426 | IGDGVFKTT | 1.22 | 9 |
426-434 | TLLGEIVQY | 1.23 | 9 |
427-435 | LLGEIVQYY | 3.58 | 4 |
440-448 | GSLRFSLMY | 4.34 | 3 |
447-455 | MYTGPALSS | 1.15 | 9 |
448-456 | YTGPALSSA | 0.79 | 10 |
454-462 | SSAKLILAY | 11.18 | 1 |
498-506 | WTSGVQFRY | 3.27 | 5 |
506-514 | YTDPDTYTS | 4.52 | 3 |
508-516 | DPDTYTSAG | 1.09 | 9 |
512-520 | YTSAGFLSC | 3.06 | 5 |
514-522 | SAGFLSCWY | 3.9 | 4 |
529-537 | PPETTGQVY | 6.74 | 1 |
553-561 | MKDTQTISQ | 4.77 | 3 |
565-573 | LTEGLGDEL | 2.32 | 6 |
572-580 | ELEEVIVEK | 2.52 | 6 |
573-581 | LEEVIVEKT | 2.64 | 6 |
577-585 | IVEKTKQTV | 2.11 | 7 |
603-611 | ANETGATMP | 3.44 | 5 |
616-624 | DSIETRTTY | 5.28 | 2 |
617-625 | SIETRTTYM | 1.57 | 8 |
631-639 | ETDVECFLG | 3.91 | 4 |
635-643 | ECFLGRAAC | 2.82 | 6 |
646-654 | VTEIQNKDA | 2.66 | 6 |
651-659 | NKDATGIDN | 0.65 | 10 |
692-700 | DSEYTILAT | 4.21 | 3 |
695-703 | YTILATASQ | 1.53 | 8 |
701-709 | ASQPDSANY | 9.23 | 1 |
711-719 | SNLVVQAMY | 5.89 | 2 |
729-737 | EWDDYTWQS | 5.8 | 2 |
730-738 | WDDYTWQSA | 1.47 | 8 |
747-755 | VGDTSRFSV | 1.5 | 8 |
756-764 | PYVGLASAY | 3.89 | 4 |
760-768 | LASAYNCFY | 5.32 | 2 |
763-771 | AYNCFYDGY | 1.36 | 9 |
767-775 | FYDGYSHDD | 3.7 | 4 |
772-780 | SHDDAETQY | 8.39 | 1 |
776-784 | AETQYGITV | 1.21 | 9 |
803-811 | KTLVKIRVY | 4.09 | 4 |
822-830 | PRAPRALPY | 3.25 | 5 |
830-838 | YTSIGRTNY | 7.29 | 1 |
841-849 | NTEPVIKKR | 4.29 | 3 |
854-862 | KSYGLGPRY | 0.96 | 10 |
858-866 | LGPRYGGIY | 0.72 | 10 |
867-875 | TSNVKIMNY | 6.16 | 2 |
880-888 | PEDHHNLIA | 1.61 | 8 |
904-912 | GAETIPHCN | 3.15 | 5 |
905-913 | AETIPHCNC | 0.77 | 10 |
910-918 | HCNCTSGVY | 4.31 | 3 |
911-919 | CNCTSGVYY | 3.51 | 4 |
914-922 | TSGVYYSTY | 4.45 | 3 |
915-923 | SGVYYSTYY | 3.75 | 4 |
918-926 | YYSTYYRKY | 3.17 | 5 |
919-927 | YSTYYRKYY | 6.56 | 2 |
930-938 | ICEKPTNIW | 5.31 | 2 |
936-944 | NIWIEGNPY | 1.02 | 9 |
937-945 | IWIEGNPYY | 3.4 | 5 |
938-946 | WIEGNPYYP | 1.46 | 8 |
959-967 | PAEPGDCGG | 4.76 | 3 |
962-970 | PGDCGGILR | 4.08 | 4 |
978-986 | LLTAGGSGY | 2.69 | 6 |
989-997 | FADIRQLEC | 2.03 | 7 |
994-1002 | QLECIAEEQ | 1.01 | 9 |
998-1006 | IAEEQGLSD | 8.07 | 1 |
999-1007 | AEEQGLSDY | 12.46 | 1 |
1004-1012 | LSDYITGLG | 0.71 | 10 |
1019-1027 | FTDQISTKV | 4.68 | 3 |
1027-1035 | VTELQEVAK | 2.07 | 7 |
1034-1042 | AKDFLTTKV | 3.58 | 4 |
1046-1054 | VVKMVSALV | 1.36 | 9 |
1074-1082 | GCDGSPWRF | 2.91 | 5 |
1095-1103 | YIERQANDG | 5.18 | 2 |
1100-1108 | ANDGWFRKF | 2.86 | 5 |
1108-1116 | FNDACNAAK | 3 | 5 |
1117-1125 | GLEWIANKI | 4.79 | 3 |
1128-1136 | LIEWIKNKV | 1.54 | 8 |
1140-1148 | AKEKLEFCS | 1.25 | 9 |
1172-1180 | KREQLFNNV | 2.01 | 7 |
1186-1194 | MSQKFAPLY | 7.62 | 1 |
1195-1203 | AVESKRIRE | 2.28 | 7 |
1219-1227 | RIEPVCVLI | 1.34 | 9 |
1245-1253 | AIAEHFNSA | 1.18 | 9 |
1246-1254 | IAEHFNSAV | 4.18 | 4 |
1247-1255 | AEHFNSAVY | 4.32 | 3 |
1255-1263 | YSLPPDPKH | 0.66 | 10 |
1269-1277 | QQEVVIMDD | 1.32 | 9 |
1274-1282 | IMDDLNQNP | 0.69 | 10 |
1281-1289 | NPDGQDISM | 2.77 | 6 |
1296-1304 | SVDFLPPMA | 3.92 | 4 |
1305-1313 | SLDNKGMLF | 3.88 | 4 |
1343-1351 | GFDLDICLH | 1.43 | 8 |
1345-1353 | DLDICLHTT | 1.55 | 8 |
1346-1354 | LDICLHTTY | 3.51 | 4 |
1369-1377 | CKDCHQPSN | 1.52 | 8 |
1392-1400 | LVDRTTNIR | 3.27 | 5 |
1393-1401 | VDRTTNIRY | 0.7 | 10 |
1402-1410 | SVDQLVTAI | 4.94 | 3 |
1420-1428 | ITDSLETLF | 0.69 | 10 |
1425-1433 | ETLFQGPVY | 6.08 | 2 |
1433-1441 | YKDLEIDVC | 2.97 | 5 |
1437-1445 | EIDVCNTPP | 3.51 | 4 |
1445-1453 | PPECINDLL | 0.9 | 10 |
1456-1464 | VDSEEIREY | 1.43 | 8 |
1457-1465 | DSEEIREYC | 5.46 | 2 |
1478-1486 | NIERAMNQA | 1.74 | 8 |
1488-1496 | MIINTILMF | 0.62 | 10 |
1496-1504 | FVSTLGIVY | 5.18 | 2 |
1518-1526 | SGNPPHNKL | 0.8 | 10 |
1540-1548 | NTEFALSLL | 1.96 | 7 |
1559-1567 | KGEFTGLGI | 0.83 | 10 |
1572-1580 | CVIPTHAQP | 3.38 | 5 |
1581-1589 | GDDVLVNGQ | 1.65 | 8 |
1593-1601 | VKDKYKLVD | 4.87 | 3 |
1599-1607 | LVDPENINL | 5.85 | 2 |
1606-1614 | NLELTVLTL | 3.01 | 5 |
1628-1636 | SEDLEGVDA | 2.09 | 7 |
1630-1638 | DLEGVDATL | 5.53 | 2 |
1633-1641 | GVDATLVVH | 1.92 | 7 |
1636-1644 | ATLVVHSNN | 2.19 | 7 |
1649-1657 | ILEVGPVTM | 2.37 | 6 |
1651-1659 | EVGPVTMAG | 1.63 | 8 |
1657-1665 | MAGLINLSS | 1.69 | 8 |
1664-1672 | SSTPTNRMI | 0.71 | 10 |
1668-1676 | TNRMIRYDY | 1.26 | 9 |
1706-1714 | FSAQLKKQY | 2.33 | 6 |
1715-1723 | FVEKQGQVI | 4.23 | 3 |
1742-1750 | SKLHPSVFY | 1.73 | 8 |
1754-1762 | PGDKEPAVL | 2.38 | 6 |
1756-1764 | DKEPAVLSD | 5.94 | 2 |
1773-1781 | LTESLFSKY | 12.02 | 1 |
1786-1794 | NTEPTENML | 5.36 | 2 |
1792-1800 | NMLVAVDHY | 3.49 | 4 |
1793-1801 | MLVAVDHYA | 2.19 | 7 |
1796-1804 | AVDHYAGQL | 7.29 | 1 |
1806-1814 | SLDIPTSEL | 1.83 | 8 |
1813-1821 | ELTLKEALY | 3.94 | 4 |
1814-1822 | LTLKEALYG | 0.84 | 10 |
1825-1833 | GLEPIDITT | 6.78 | 1 |
1828-1836 | PIDITTSAG | 3.21 | 5 |
1831-1839 | ITTSAGFPY | 3.42 | 5 |
1856-1864 | DTEKMKFYL | 1.4 | 8 |
1859-1867 | KMKFYLDKY | 0.91 | 10 |
1863-1871 | YLDKYGIDL | 8.56 | 1 |
1868-1876 | GIDLPLVTY | 9.37 | 1 |
1877-1885 | IKDELRSVD | 2.2 | 7 |
1883-1891 | SVDKVRLGK | 6.96 | 1 |
1894-1902 | LIEASSLND | 1.63 | 8 |
1906-1914 | MRMKLGNLY | 7.72 | 1 |
1930-1938 | GCDPDVFWS | 1.45 | 8 |
1942-1950 | CLMDGHLMA | 1.29 | 9 |
1943-1951 | LMDGHLMAF | 1.68 | 8 |
1959-1967 | SLSPVWFVC | 0.96 | 10 |
1967-1975 | CLEKVLTKL | 6.33 | 2 |
1999-2007 | VVEGGMPSG | 4.14 | 4 |
2023-2031 | IRTLILDAY | 0.92 | 10 |
2033-2041 | GIDLDKLKI | 0.99 | 10 |
2035-2043 | DLDKLKILA | 3.33 | 5 |
2036-2044 | LDKLKILAY | 1.7 | 8 |
2044-2052 | YGDDLIVSY | 11.68 | 1 |
2045-2053 | GDDLIVSYP | 2.31 | 6 |
2055-2063 | ELDPQVLAT | 9.12 | 1 |
2060-2068 | VLATLGKNY | 4.53 | 3 |
2115-2123 | IHESIRWTK | 1.85 | 7 |
2128-2136 | TQDHVRSLC | 4 | 4 |
2134-2142 | SLCMLAWHS | 3.84 | 4 |
2142-2150 | SGEKEYNEF | 2.01 | 7 |
2156-2164 | TTDIGKCLI | 1.82 | 8 |
As it happens in these pages, the formatting of the output, in particular of the results table in this case, will be influenced by the CSS of this web site. On running the script on your own server, which you are warmly encouraged to do, the visual results will be slightly different. Still, you can see that we are getting some interesting results here, 205 peptides out of 2172 were found positive with a few ranking as high as 1.
It would be nice to be able to dynamically sort this results table according to the scores or ranks. This can be easily achieved by using jQuery and DataTables as we will see later on during the development of the full T-Score web application. You can see that at this stage, all the pieces of code we need to fully score and rank the peptides in a protein are in place, so we are ready to proceed to the development of the web application. Keep on reading about this in the next section.
Chapter Sections
[pagelist include=”1461″]
[siblings]
WORK IN PROGRESS ON CHAPTER 5!