4-6: PHP programming language basics – built-in predefined functions, strings and biological sequences manipulation

When writing PHP code, we have many functions available that are built into the language, or if you prefer, predefined: you can use them without having to write them, and they are usually solid code you can rely on.

In this section and the next we hope to show you how a few readily available PHP functions can go a long way in manipulating strings and biological sequence data. This knowledge will be a strong base for building great web applications in the field of biology and bioinformatics. Keep reading!

Before starting, we should mention some general information on function calls.

A function call is performed by writing the name of the function followed by round brackets. Let’s imagine we have a function called “bestfunction”. We can call it like this:

bestfunction();

If the function returns something we need to collect, we can write:

$function_output = bestfunction();

so that the bestfunction() output is now stored in the $function_output variable.

It is not uncommon for functions to need “arguments” in order to perform their duty. Some functions do not require arguments at all, others can optionally take arguments, and others absolutely require one or more arguments in order to work. Arguments are passed to the function within the round brackets that follow the function name. If there is more than one argument, they are separated by commas.

Here are a few sample calls to a function:

$function_output = bestfunction($argument1); // If the argument is passed as a variable name, the variable is automatically evaluated to its value “under the hood” and that value is passed to the function. The variable passed as argument should of course have been defined in the script before the function call.

$function_output = bestfunction("gaattc"); // In this case we are passing a value directly, rather than a variable

$function_output = bestfunction($argument1, $argument2); // A function called with 2 arguments

$function_output = bestfunction("GTCTAGTGA", 2); // Depending on the function, arguments may be of different types: numbers, strings, arrays, boolean values etc.
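Since bestfunction() is only an imaginary name, the same calling patterns can be tried with real built-in functions that appear later in this section. The following is a minimal sketch; the variable names are ours and chosen only for illustration:

```php
<?php
// No argument at all: pi() simply returns the value of pi.
$pi = pi();

// One required argument: strlen() needs the string to measure.
$length = strlen("gaattc");

// Optional second argument: round() rounds to a whole number by default,
// or to the requested number of digits after the decimal point.
$rounded_default = round(3.14159);
$rounded_two     = round(3.14159, 2);

echo $length, "\n";          // 6
echo $rounded_default, "\n"; // 3
echo $rounded_two, "\n";     // 3.14
```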

Listing all available built-in PHP functions

If you want to list all PHP built-in functions, there is a special predefined function for that as well. It lets you see the names of all the functions available during the execution of a script, both the predefined functions (also called “internal” functions) and any functions defined by the user.

This function is get_defined_functions(). A call to it, without arguments, will return an array with two keys, “internal” and “user”. As you may guess, the key “internal” provides access to an array with the names of PHP built-in functions, while the “user” key provides access to an array with all the names of the functions defined by the user (you, the programmer), if any.

Please try to run the following code on your web server and check out the result for a complete listing of all available PHP built-in functions.
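A minimal sketch of such a listing is given here. It relies only on get_defined_functions() and count(); the variable names and the use of break tags for web output are our own choices:

```php
<?php
// get_defined_functions() returns an array with the keys "internal" and "user".
$functions = get_defined_functions();

// Number of built-in functions available on this server (varies by setup).
echo count($functions["internal"]), " built-in functions available<br>\n";

// Print the name of each built-in function, one per line.
foreach ($functions["internal"] as $name) {
    echo $name, "<br>\n";
}
```

The number you obtain will depend on your PHP version and on the extensions installed on your server.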

At the time of this writing 1115 PHP built-in functions are available on our Linux test server. For the purposes of this book, we will need just a few of those, which we will review in this section. This will not necessarily be a comprehensive review of the predefined PHP functions used in this book: more of them may be described in subsequent sections, as we come to use them.

strtoupper() and strtolower()

The strtoupper(), string-to-upper, and strtolower(), string-to-lower functions allow us to convert strings passed as arguments to all uppercase or all lowercase, respectively.
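As a quick illustration, here is a small sketch in which a mixed-case DNA string (a made-up example) is converted both ways:

```php
<?php
$dna = "gaaTTc";

$upper = strtoupper($dna); // "GAATTC"
$lower = strtolower($dna); // "gaattc"

echo $upper, "\n", $lower, "\n";
```

Converting sequences to a single case before comparing them is a simple way to avoid mismatches caused only by capitalization.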

Reverse strings with strrev()

The strrev() string reverse function takes a string as an argument and reverses it, as simple as that. mystring will become gnirtsym.
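A short sketch, using the string from the text and a made-up DNA sequence:

```php
<?php
echo strrev("mystring"), "\n"; // gnirtsym

// Reversing a sequence reads it from the last base to the first.
$sequence = "GAATTG";
$reversed = strrev($sequence);
echo $reversed, "\n"; // GTTAAG
```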

strlen()

The strlen() (string length) function takes a string as argument and returns the number of characters composing it, as an integer.
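For instance, the length of a short sequence can be obtained as in this small sketch (the variable names are ours):

```php
<?php
$sequence = "GTCTAGTGA";
$length = strlen($sequence);

echo "The sequence is ", $length, " bases long\n"; // The sequence is 9 bases long
```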

count()

count() takes an array as argument and returns the number of elements it contains, as an integer.
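A small sketch, using an array of three UniProt IDs that also appear later in this section:

```php
<?php
$ids = ["P21333", "P00533", "P68133"];
$how_many = count($ids);

echo $how_many, "\n"; // 3
```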

round()

The round() function lets us reduce the number of digits after the decimal point in float numbers. It can be used, for example, to adjust the precision of a number resulting from a division.

The “precision” directive (number of handled digits in a float number) in the php.ini configuration file defaults to 14. For example, the pi() function that returns the pi number (π) will actually return 3.1415926535898.

round() can be called with two arguments: the first is the float we wish to round, while the second is the desired precision (the number of digits after the decimal point). The last digit is rounded up or down depending on what follows. For example round(5.324, 2) returns 5.32 while round(5.326, 2) returns 5.33. Interestingly, round(5.325, 2) returns 5.33: by default, a value that lies exactly halfway is rounded away from zero.
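The behavior described above can be checked with a sketch like the following. The division 10 / 3 is just an arbitrary example of a fraction, and the values 2.5 and -2.5 are used because they are exact halves:

```php
<?php
echo round(5.324, 2), "\n"; // 5.32
echo round(5.326, 2), "\n"; // 5.33

// Adjusting the precision of a number that results from a division.
$ratio = round(10 / 3, 2);
echo $ratio, "\n"; // 3.33

// With the default rounding mode, exact halves are rounded away from zero.
$half_positive = round(2.5);  // 3
$half_negative = round(-2.5); // -3
echo $half_positive, " ", $half_negative, "\n";
```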

trim()

The trim() function removes unwanted characters at the beginning and end of a string. By default, when called with just one argument (the string to be trimmed), it will remove the following characters:

" " ordinary space
"\n" newline
"\r" carriage return
"\t" tab
"\0" the null byte
"\x0B" vertical tab

trim() can be called with a second argument that lists the characters to be removed. When given, this list replaces the default one above.

One reason it is often a great idea to trim strings is that there are at least two widely used ways to mark a newline. Some protocols or applications use \r\n. This is a legacy of how manual typewriters worked: to get a new line you first returned the carriage to the start position (\r), and then moved the paper up by one position to start writing a new line (\n). Hence the double character \r\n for a newline. However, many protocols and applications dropped the \r and produce a newline by using just \n.

Let’s quote Wikipedia:

“Most textual Internet protocols (including HTTP, SMTP, FTP, IRC, and many others) mandate the use of ASCII CR+LF (‘\r\n’, 0x0D 0x0A) on the protocol level, but recommend that tolerant applications recognize lone LF (‘\n’, 0x0A) as well. Despite the dictated standard, many applications erroneously use the C newline escape sequence ‘\n’ (LF) instead of the correct combination of carriage return escape and newline escape sequences ‘\r\n’ (CR+LF)”

You can split a text file ($text) into lines ($lines) with a command such as:
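A minimal sketch of such a split, assuming explode() (described in more detail later in this section) is used with \n as delimiter. The content of $text is a made-up sample in which lines are separated by \r\n:

```php
<?php
// Hypothetical sample text: three lines separated by "\r\n".
$text = "GAATTC\r\nGGATCC\r\nAAGCTT";

// Split the text into lines at each "\n".
$lines = explode("\n", $text);

echo count($lines), "\n";     // 3
echo strlen($lines[0]), "\n"; // 7: the six bases plus an invisible "\r"
```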

This will always work nicely. However, if the text used \r\n to mark newlines, you now have a trailing \r on each line. Believe it or not, this tiny \r at the end of each line can cause lots of trouble later in your code. Since you normally do not see this character, you may end up with weird consequences without having a clue of what is going on.

It is therefore an excellent idea, when you come to use these lines later on in the code, to trim them with trim() so as to get rid of the (potential) trailing \r:
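One way to do this is sketched below, again on a made-up sample text. Each line is trimmed inside a foreach cycle and written back to the array:

```php
<?php
// Hypothetical sample text: three lines separated by "\r\n".
$text = "GAATTC\r\nGGATCC\r\nAAGCTT";
$lines = explode("\n", $text);

// Trim every line to remove a possible trailing "\r" (and other whitespace).
foreach ($lines as $index => $line) {
    $lines[$index] = trim($line);
}

echo strlen($lines[0]), "\n"; // 6
```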

By specifying the second argument, trim() becomes a powerful tool for text manipulation. Say you have lines with a variable number of dots, commas or spaces at the end that you would like to remove:

$mystring = "Hello, World... ..";
$mystring1 = "Hello, World. ,";
$mystring2 = ".Hello, World,";

By calling trim() on any of the above with a second argument that lists the space, the dot and the comma, and storing the result in $cleaned, the value of $cleaned will always be “Hello, World”.
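The call described above can be sketched as follows. The three sample strings are placed in an array (our own choice) so that the same line can be applied to each of them:

```php
<?php
$strings = [
    "Hello, World... ..",
    "Hello, World. ,",
    ".Hello, World,",
];

$results = [];
foreach ($strings as $mystring) {
    // Remove spaces, dots and commas from both ends of the string.
    $cleaned = trim($mystring, " .,");
    $results[] = $cleaned;
    echo $cleaned, "\n"; // Hello, World
}
```

Note that only the ends of the string are affected: the comma and the space inside “Hello, World” are left untouched.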

So now you can’t say we didn’t tell you to always trim your lines. Maybe there is a trailing \r, maybe there isn’t (you may never know for sure), just trim and be a happier coder 🙂

file_get_contents() and how to retrieve biological sequences from the Internet

file_get_contents() reads the contents of a local or online file, whose path is passed to it as an argument, into a string.

The only required argument is the path of the file; additional optional arguments are supported.

We have already used it in the first section of this chapter to generate a dynamic webpage from contents stored in different files.

Let’s now use it to retrieve a FASTA sequence from the UniProt database instead, for instance the sequence of the human protein ABL1 (UniProt ID: P00519).

You can reach any FASTA sequence text file on the UniProt web site at an address composed as follows:

http://www.uniprot.org/uniprot/(sequence ID here).fasta

For example:

http://www.uniprot.org/uniprot/P00519.fasta
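A minimal sketch of such a retrieval follows. The helper names uniprot_fasta_url() and get_fasta() are hypothetical and exist only for this illustration. The address pattern is the one given above and may have changed since this was written, and the server must allow file_get_contents() to open URLs (the allow_url_fopen setting):

```php
<?php
// Hypothetical helper: build the FASTA address for a UniProt ID,
// following the pattern shown above.
function uniprot_fasta_url($id)
{
    return "http://www.uniprot.org/uniprot/" . $id . ".fasta";
}

// Hypothetical helper: return the FASTA text as a string,
// or false if the file could not be read.
function get_fasta($id)
{
    return file_get_contents(uniprot_fasta_url($id));
}
```

With these two functions in place, a line such as echo get_fasta("P00519"); would print the ABL1 sequence, provided the request succeeds.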

Here’s a slightly more sophisticated example in which we get FASTA sequences for 5 different proteins, store them into an array and then provide an output.

We will get the sequences corresponding to the following 5 ids:

P21333
P00533
P68133
P35222
O75369

We highly recommend you try this example yourself on your server.
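One possible way to write such a script is sketched below. The function names are hypothetical, the address pattern is the one given earlier and may have changed, and the output format (a numbered title followed by the sequence) is our own assumption:

```php
<?php
// Hypothetical helper: build the FASTA address for a UniProt ID.
function uniprot_fasta_url($id)
{
    return "http://www.uniprot.org/uniprot/" . $id . ".fasta";
}

// First foreach cycle: download each sequence and store it in an array.
function get_sequences(array $ids)
{
    $sequences = [];
    foreach ($ids as $id) {
        $fasta = file_get_contents(uniprot_fasta_url($id));
        if ($fasta !== false) { // skip IDs that could not be retrieved
            $sequences[$id] = $fasta;
        }
    }
    return $sequences;
}

// Second foreach cycle: build the output for the web page.
function render_sequences(array $sequences)
{
    $output = "";
    $number = 1;
    foreach ($sequences as $fasta) {
        // htmlspecialchars() keeps the ">" of the FASTA header valid in HTML.
        $output .= "Sequence " . $number . ":<br>\n"
                 . htmlspecialchars($fasta) . "<br><br>\n";
        $number++;
    }
    return $output;
}

$ids = ["P21333", "P00533", "P68133", "P35222", "O75369"];
```

Calling echo render_sequences(get_sequences($ids)); would then download the five sequences and print them one after the other.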

A note on the code sample above. We are accessing sequences on a third-party server, through URLs. UniProt seems to be perfectly OK with that according to their guidelines for programmatic access. They do however specify the following, which is quite a mild request:

“Please consider to provide your email address as part of the User-Agent header that your programs set. This will allow us to contact you in case of problems.”

Indeed, sending requests to a server generates a load. If this load is excessive, the server owner may have a problem with that, and may end up banning your IP address. It is therefore a good idea, if you plan to send lots of automated (software-generated) requests to a particular server, to check the server guidelines and, if needed, get in contact with its owners to check that they are OK with what you plan to do.
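With file_get_contents(), a User-Agent header can be set through a stream context passed as third argument. The sketch below shows one way to do it; the program name and email address are placeholders to be replaced with your own:

```php
<?php
// Placeholder contact details: replace with your own program name and email.
$user_agent = "MySequenceFetcher/1.0 (mailto:you@example.org)";

// Create a stream context carrying the User-Agent for HTTP requests.
$context = stream_context_create([
    "http" => ["user_agent" => $user_agent],
]);

// Passing the context as third argument sends the header with the request:
// $fasta = file_get_contents("http://www.uniprot.org/uniprot/P00519.fasta", false, $context);
```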

explode() and implode() – How to format FASTA sequences for the web

The explode() function allows us to split a string into parts based on a delimiter and store the parts as elements of an array. It takes the delimiter as first argument and the string as second argument. Mind that the delimiter cannot be an empty string.

implode() does the opposite: it takes a delimiter as first argument and an array as second argument and joins the elements of the array into a string, placing the delimiter in between.
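A small sketch of the two functions at work, on a made-up comma-separated string:

```php
<?php
// Split a string into an array at each comma.
$bases = explode(",", "A,C,G,T"); // ["A", "C", "G", "T"]

// Join the array elements back into a string, with a hyphen in between.
$joined = implode("-", $bases);   // "A-C-G-T"

echo count($bases), "\n"; // 4
echo $joined, "\n";       // A-C-G-T
```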

We can combine explode() and implode() in a single expression to perform a quick “search and replace” within a string. This comes in very handy right now. Remember the problem we had with the formatting of FASTA sequences for display in a web page? The newlines that split a FASTA sequence over several lines in the text file are ignored in HTML, so each sequence was rendered on a single line in the previous examples in this section. We have a nice fix.

Before getting to the FASTA sequences, try this simpler example in which we replace commas with hyphens surrounded by spaces in a string.
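Such a replacement could look like the following sketch, in which the sample string is our own:

```php
<?php
$string = "alanine,glycine,valine";

// Split at each comma, then join the parts with " - " in between.
$replaced = implode(" - ", explode(",", $string));

echo $replaced, "\n"; // alanine - glycine - valine
```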

And now try the following example, in which the newlines \n in FASTA sequences are replaced by break tags before being shown in the web page. This happens in the output part of the script, the second foreach cycle. We have also embedded the sequence in a span tag that was given a monospace font family (courier). This ensures that every character in the sequence has the same width, resulting in an orderly appearance. The use of monospace fonts for sequences also has important implications for alignments, but this is a topic we will discuss in another section.

Please run this example on your server. Part of the output of this script is shown below:
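The replacement step alone can be sketched as follows. The function name fasta_to_html() is hypothetical, and the FASTA record is a shortened, re-wrapped sample used in place of a downloaded sequence:

```php
<?php
// Hypothetical helper: prepare a FASTA record for display in a web page.
function fasta_to_html($fasta)
{
    // Escape HTML special characters such as the ">" of the header line,
    // after trimming the trailing newline.
    $safe = htmlspecialchars(trim($fasta));

    // Replace every newline with a break tag using explode() and implode().
    $with_breaks = implode("<br>\n", explode("\n", $safe));

    // Wrap the result in a span with a monospace font family.
    return '<span style="font-family: courier, monospace;">'
         . $with_breaks . '</span>';
}

// Shortened sample record, for illustration only.
$fasta = ">sp|P21333|FLNA_HUMAN Filamin-A\nMSSSHSRAGQ\nSAAGAAPGGG\n";
$html = fasta_to_html($fasta);
echo $html;
```

In a complete script this function would be applied to each downloaded sequence inside the output foreach cycle.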

Sequence 1:
>sp|P21333|FLNA_HUMAN Filamin-A OS=Homo sapiens GN=FLNA PE=1 SV=4
MSSSHSRAGQSAAGAAPGGGVDTRDAEMPATEKDLAEDAPWKKIQQNTFTRWCNEHLKCV
SKRIANLQTDLSDGLRLIALLEVLSQKKMHRKHNQRPTFRQMQLENVSVALEFLDRESIK
LVSIDSKAIVDGNLKLILGLIWTLILHYSISMPMWDEEEDEEAKKQTPKQRLLGWIQNKL
PQLPITNFSRDWQSGRALGALVDSCAPGLCPDWDSWDASKPVTNAREAMQQADDWLGIPQ
VITPEEIVDPNVDEHSVMTYLSQFPKAKLKPGAPLRPKLNPKKARAYGPGIEPTGNMVKK
RAEFTVETRSAGQGEVLVYVEDPAGHQEEAKVTANNDKNRTFSVWYVPEVTGTHKVTVLF
AGQHIAKSPFEVYVDKSQGDASKVTAQGPGLEPSGNIANKTTYFEIFTAGAGTGEVEVVI
QDPMGQKGTVEPQLEARGDSTYRCSYQPTMEGVHTVHVTFAGVPIPRSPYTVTVGQACNP
SACRAVGRGLQPKGVRVKETADFKVYTKGAGSGELKVTVKGPKGEERVKQKDLGDGVYGF
EYYPMVPGTYIVTITWGGQNIGRSPFEVKVGTECGNQKVRAWGPGLEGGVVGKSADFVVE
AIGDDVGTLGFSVEGPSQAKIECDDKGDGSCDVRYWPQEAGEYAVHVLCNSEDIRLSPFM
ADIRDAPQDFHPDRVKARGPGLEKTGVAVNKPAEFTVDAKHGGKAPLRVQVQDNEGCPVE
ALVKDNGNGTYSCSYVPRKPVKHTAMVSWGGVSIPNSPFRVNVGAGSHPNKVKVYGPGVA
KTGLKAHEPTYFTVDCAEAGQGDVSIGIKCAPGVVGPAEADIDFDIIRNDNDTFTVKYTP
RGAGSYTIMVLFADQATPTSPIRVKVEPSHDASKVKAEGPGLSRTGVELGKPTHFTVNAK
AAGKGKLDVQFSGLTKGDAVRDVDIIDHHDNTYTVKYTPVQQGPVGVNVTYGGDPIPKSP
FSVAVSPSLDLSKIKVSGLGEKVDVGKDQEFTVKSKGAGGQGKVASKIVGPSGAAVPCKV
EPGLGADNSVVRFLPREEGPYEVEVTYDGVPVPGSPFPLEAVAPTKPSKVKAFGPGLQGG
SAGSPARFTIDTKGAGTGGLGLTVEGPCEAQLECLDNGDGTCSVSYVPTEPGDYNINILF
ADTHIPGSPFKAHVVPCFDASKVKCSGPGLERATAGEVGQFQVDCSSAGSAELTIEICSE
AGLPAEVYIQDHGDGTHTITYIPLCPGAYTVTIKYGGQPVPNFPSKLQVEPAVDTSGVQC
YGPGIEGQGVFREATTEFSVDARALTQTGGPHVKARVANPSGNLTETYVQDRGDGMYKVE
YTPYEEGLHSVDVTYDGSPVPSSPFQVPVTEGCDPSRVRVHGPGIQSGTTNKPNKFTVET
RGAGTGGLGLAVEGPSEAKMSCMDNKDGSCSVEYIPYEAGTYSLNVTYGGHQVPGSPFKV
PVHDVTDASKVKCSGPGLSPGMVRANLPQSFQVDTSKAGVAPLQVKVQGPKGLVEPVDVV
DNADGTQTVNYVPSREGPYSISVLYGDEEVPRSPFKVKVLPTHDASKVKASGPGLNTTGV
PASLPVEFTIDAKDAGEGLLAVQITDPEGKPKKTHIQDNHDGTYTVAYVPDVTGRYTILI
KYGGDEIPFSPYRVRAVPTGDASKCTVTVSIGGHGLGAGIGPTIQIGEETVITVDTKAAG
KGKVTCTVCTPDGSEVDVDVVENEDGTFDIFYTAPQPGKYVICVRFGGEHVPNSPFQVTA
LAGDQPSVQPPLRSQQLAPQYTYAQGGQQTWAPERPLVGVNGLDVTSLRPFDLVIPFTIK
KGEITGEVRMPSGKVAQPTITDNKDGTVTVRYAPSEAGLHEMDIRYDNMHIPGSPLQFYV
DYVNCGHVTAYGPGLTHGVVNKPATFTVNTKDAGEGGLSLAIEGPSKAEISCTDNQDGTC
SVSYLPVLPGDYSILVKYNEQHVPGSPFTARVTGDDSMRMSHLKVGSAADIPINISETDL
SLLTATVVPPSGREEPCLLKRLRNGHVGISFVPKETGEHLVHVKKNGQHVASSPIPVVIS
QSEIGDASRVRVSGQGLHEGHTFEPAEFIIDTRDAGYGGLSLSIEGPSKVDINTEDLEDG
TCRVTYCPTEPGNYIINIKFADQHVPGSPFSVKVTGEGRVKESITRRRRAPSVANVGSHC
DLSLKIPEISIQDMTAQVTSPSGKTHEAEIVEGENHTYCIRFVPAEMGTHTVSVKYKGQH
VPGSPFQFTVGPLGEGGAHKVRAGGPGLERAEAGVPAEFSIWTREAGAGGLAIAVEGPSK
AEISFEDRKDGSCGVAYVVQEPGDYEVSVKFNEEHIPDSPFVVPVASPSGDARRLTVSSL
QESGLKVNQPASFAVSLNGAKGAIDAKVHSPSGALEECYVTEIDQDKYAVRFIPRENGVY
LIDVKFNGTHIPGSPFKIRVGEPGHGGDPGLVSAYGAGLEGGVTGNPAEFVVNTSNAGAG
ALSVTIDGPSKVKMDCQECPEGYRVTYTPMAPGSYLISIKYGGPYHIGGSPFKAKVTGPR
LVSNHSLHETSSVFVDSLTKATCAPQHGAPGPGPADASKVVAKGLGLSKAYVGQKSSFTV
DCSKAGNNMLLVGVHGPRTPCEEILVKHVGSRLYSVSYLLKDKGEYTLVVKWGDEHIPGS
PYRVVVP

Sequence 2:
>sp|P00533|EGFR_HUMAN Epidermal growth factor receptor OS=Homo sapiens GN=EGFR PE=1 SV=2
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
ENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKL
FGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCN
LLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVM
GENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPSIATGMVGALLLLLVV
ALGIGLFMRRRHIVRKRTLRRLLQERELVEPLTPSGEAPNQALLRILKETEFKKIKVLGS
GAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGI
CLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAA
RNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSY
GVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPK
FRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQ
QGFFSSPSTSRTPLLSSLSATSNNSTVACIDRNGLQSCPIKEDSFLQRYSSDPTGALTED
SIDDTFLPVPEYINQSVPKRPAGSVQNPVYHNQPLNPAPSRDPHYQDPHSTAVGNPEYLN
TVQPTCVNSTFDSPAHWAQKGSHQISLDNPDYQQDFFPKEAKPNGIFKGSTAENAEYLRV
APQSSEFIGA

Searching a substring or character inside a string with strstr()

It is often useful to be able to check if a sequence of characters, for example an oligonucleotide or short peptide sequence, or a single character, for example a single base or amino-acid, is present within a longer sequence. The strstr() function can help. It takes 2 arguments: the first is the string in which to search (haystack), the second the substring to be searched (needle). It returns the portion of the haystack that goes from the first occurrence of the needle to the end of the haystack, or “false” if the needle is not found in the haystack. The result will therefore generally equal (==) true if the substring is found and false if not found; a strict comparison with false (!== false) is the safer test, because a returned string such as "0" also evaluates to false. The similarly named strrchr() function behaves differently: it looks for the last occurrence of a single character (only the first character of the needle is used) and returns the haystack from that point to the end.
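Here is a small sketch of a substring search using strstr(), the built-in function that returns the haystack from the first occurrence of the needle. The sequence and the needle are made-up examples:

```php
<?php
$haystack = "GTCTAGTGAATTCAG";
$needle   = "GAATTC";

// strstr() returns the rest of the haystack from the first match, or false.
$result = strstr($haystack, $needle);

if ($result !== false) {
    echo "Found, the sequence continues as: ", $result, "\n"; // GAATTCAG
} else {
    echo "Not found\n";
}

// A needle that is absent from the haystack gives false.
$missing = strstr($haystack, "CCC");
```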

Let’s try a slightly more sophisticated and useful example in which we will check if a given amino-acid is nonpolar, polar, basic or acidic.
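A possible version of such a script is sketched below. The grouping of the one-letter codes follows one common textbook classification and is an assumption on our part (groupings vary between sources), as are the variable names:

```php
<?php
$aminoacid = "K";

// One common grouping of amino-acids by side chain (assumption: sources vary).
$groups = [
    "nonpolar" => "GAVLIMFWP",
    "polar"    => "STCYNQ",
    "basic"    => "KRH",
    "acidic"   => "DE",
];

// Look for the amino-acid in each group until it is found.
$class = "unknown";
foreach ($groups as $name => $letters) {
    if (strstr($letters, $aminoacid) !== false) {
        $class = $name;
        break;
    }
}

echo "Amino-acid " . $aminoacid . " is " . $class;
```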

This is the output of this script:

Amino-acid K is basic

In the next section we will look at a few more built-in functions that will allow us to further manipulate strings and biological sequences with PHP. Stay tuned!
