
Introductory Workbook on Perl for Biology Students



Dipl.Ing.S.Sai Giridhar, B.Tech., M.S.(Ph.D.)(Vienna)
Department of Computer-aided Automation
Vienna University of Technology
Vienna, Austria
Email: sai.giridhar@inso.tuwien.ac.at


Prof.S.Krupanidhi, M.Sc., Ph.D.,
Head, Department of Biosciences,
Coordinator UGC SAP DRS, DST FIST and DBT BIF programs
Sri Sathya Sai University,
Prasanthi Nilayam 515 134 A P, India

Department of Biosciences, Sri Sathya Sai University, Vidyagiri Complex, Prasanthi Nilayam


An article submitted on January 14, 2009, published by Biology-Online.org on January 17, 2009.



The language of DNA uses the smallest of alphabets: just four letters. A triplet of these letters constitutes a word (codon) that carries a meaning, namely the specification of an amino acid, the building block of proteins. There are 20 standard amino acids, each corresponding to one or more triplet codons. Amino acids in various combinations make up a variety of discrete proteins, which is reflected in the phenotypic diversity of organisms. Consequently, the bewildering diversity around us is dictated by the triplet combinations of the letters of genomic DNA. Moreover, because a DNA sequence can be held as a string in a scalar variable, it is readily amenable to in silico tools that evaluate its features.

Along with molecular tools, in silico tools supplement and enrich efforts to unravel the mysteries of genomic and proteomic analyses. The Perl language, conceived by Larry Wall (1987), is one of the scripting languages well suited to displaying a variety of diagnostic features of DNA, as depicted through the simple exercises in the following pages.

I. Introduction

Perl scripts can be used to examine the features of nucleotide strings in silico. In the present endeavour, a few exercises are conceived to unravel the inherent features of DNA using Perl scripts. Perl can be downloaded from the internet and installed on a UNIX (Kubuntu) platform. In the present booklet, the exercises begin with elementary arithmetic calculations and then extend to the DNA string. DNA is one of the best biological core materials on which to practise Perl, because its vast array of information has been deciphered over the years through various biochemical tools. Four nucleotides, in a unique species-specific combination, constitute genomic DNA, which is also the source of proteins, including disease-causing malformed proteins. Hence the central dogma, the transcription of genes and the translation of gene products, is the point at which computer languages enter biological systems. Moreover, the number of nucleotides in any species is enormous, and the variations within genes (polymorphisms) and the footprints of transcription factors and enzymatic machinery all add to the complexity of the function of genomic DNA in vivo. In such a scenario, programming languages such as Perl have come to the rescue of biologists. The present booklet deals with a few exercises in the in silico evaluation of DNA properties: deriving the complementary strand, transcribing DNA into RNA, identifying start and stop signals, finding the percent GC and the length of each strand, concatenating two strings, joining two strings, chopping the terminals of RNA, and translating the DNA genetic code into a protein string using the syntax of Perl.


Perl is a scripting language developed by Larry Wall in 1987 for the UNIX environment. The name is often expanded as Practical Extraction and Report Language, although this expansion is a backronym coined after the name itself. In the jargon of computer science, scripting languages are often called interpreted languages. Perl, however, is both compiled and interpreted: each time a script is run, it is first compiled into an internal form and then executed, so a modified script can be rerun immediately without a separate build step. Perl is ubiquitous and powerful, and supports structured programs, advanced data structures and object-oriented programming.


A UNIX (Kubuntu) system is one of the best choices for writing Perl scripts. The first line of a script is called the ‘shebang’ and begins with a hash and an exclamation mark (#!/usr/bin/perl -w). On every other line, the symbol ‘#’ marks the rest of the line as a comment. Names prefixed with $ and @ are scalar variables and array variables respectively. Two strings of random nucleotide sequences are written in a text file as shown in the Appendix. The individual exercises are saved as executable text files named ‘filename.pl’. The strings of nucleotides are retrieved from the text file into variables, and the ‘chomp’ and ‘join’ functions are then used to turn the array of strings into a single string. The exercises are designed to work with one string and also with an array of strings, which represent scalar data and list data respectively. The simplest example chosen here is two strings of nucleotides written in a text file "krupa.seq". A few short steps in Perl retrieve one string to begin with, using a scalar variable ($), and later both strings, using an array variable (@). The two strings are stripped of their line endings with ‘chomp’ and brought into one continuous line with ‘join’. Taking the joined string as a template, the following parameters of DNA are derived: length, complementary strand, transcribed strand, substitution of ‘start’ and ‘stop’ codons, total count of nucleotides in the transcribed string, GC count, percent GC, and the result of the chop function. The last exercise deals with retrieving the nucleotide sequence of a gene of our choice from web sources. These exercises provide an impetus for beginners to practise Perl scripts that examine the features of DNA in silico.

II. Perl Notation











Symbol    Meaning
!=        Not equals
<         Less than
>         Greater than
<=        Less than or equals
>=        Greater than or equals
.         String concatenate
x         String repeat
? :       Conditional operator
\n        New line
$_        Default variable

III. Syntax

A scalar variable stores a single (scalar) value. Perl scalar names are prefixed with a dollar sign ($); for example, $x, $y, $z, $username and $url are all scalar variable names.


An array variable is capable of storing more than one value. Array names are prefixed by an ‘@’ sign, for example @x, @y, @z. The values assigned to an array are written in parentheses and separated by commas, or separated by spaces if the list is prefixed with “qw”. You can also specify ranges, such as ("a".."z") or (1..20). Note that a range is written with exactly two dots.


@y=("Quarter","Dime","Nickel");   # Strings (characters) are quoted; numbers are not.

$y[1]                             # To retrieve one value, address it by its position in the
                                  # array (counting from 0) and prefix the name with $ rather
                                  # than @, because the element holds only one value.
                                  # Here the result is "Dime".

@z=qw(value1 value2);             # A space-separated list written with qw.

@a=(0..10);                       # Ranges can also be specified. Here the range is from 0 to 10.
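The array notation above can be tried out with a short script. This is only a minimal sketch: the variable names and the enzyme names are illustrative choices, not part of the original exercises.

```perl
use strict;
use warnings;

# Three ways of filling an array: quoted list, qw list, and a range.
my @coins   = ("Quarter", "Dime", "Nickel");
my @enzymes = qw(EcoRI BamHI HindIII);   # illustrative names
my @digits  = (0 .. 10);

# A single element is addressed with $ and a zero-based index.
print "Second coin: $coins[1]\n";

# In scalar context an array gives its number of elements.
print "Number of enzymes: ", scalar(@enzymes), "\n";

# A negative index counts from the end of the array.
print "Last digit: $digits[-1]\n";
```

Note that $coins[1] is the second element, because positions are counted from 0.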


Hash variable. A Perl hash is similar to an ordinary array, but instead of integer indexes it uses "keys", which can be any scalar value, usually strings or numbers. Hash names are prefixed with the % symbol instead of @, and hash elements are referenced with curly brackets {} rather than the square brackets [] used for arrays: compare $myarray[0] with $myhash{"george"}.
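As an illustration of how a hash might be used with sequence data, the sketch below maps a few codons to amino acid abbreviations. The hash name %codon_to_aa is hypothetical, and only a small subset of the genetic code is included.

```perl
use strict;
use warnings;

# A small, illustrative subset of the genetic code (not the full 64-codon table).
my %codon_to_aa = (
    ATG => 'Met',
    TGG => 'Trp',
    TAA => 'Stop',
);

# Look up one value by its key; note the $ prefix and the curly brackets.
print "ATG codes for $codon_to_aa{ATG}\n";

# Add a new key/value pair, then list every pair in sorted key order.
$codon_to_aa{GGC} = 'Gly';
foreach my $codon (sort keys %codon_to_aa) {
    print "$codon => $codon_to_aa{$codon}\n";
}
```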


A subroutine is a user-defined function that performs a particular task or set of tasks. A subroutine can be placed anywhere in the program body, but it is usually best to put subroutines either at the beginning or at the end of the program.


The syntax of a subroutine is (as seen in line 11 of program 2):

sub <name_of_the_subroutine>{

      <body of the subroutine>

}


Substitution is a method by which you can replace a word with a new word. If the new word is empty, the matched word is replaced by nothing; in other words, it is deleted. The syntax for substitution is:

s/<word>/<new word>/<options> as in line 13 of program 6

The options are:

g – Global substitution. That is, substitute the old word with the new word everywhere it occurs in the value of the variable.

i – Ignore case. For example, you can substitute the word “bIO” with “bioinfo” regardless of how its letters are capitalized.

If none of the options is specified, the command substitutes only the first occurrence of the word in the variable.
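The effect of the options can be seen in a short sketch. The sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "bIO, Bio and BIO";

# No options: only the first match is replaced, and matching is case sensitive.
(my $first_only = $text) =~ s/Bio/bioinfo/;

# g and i together: every match is replaced, whatever its capitalization.
(my $all = $text) =~ s/bio/bioinfo/gi;

# In scalar context, s/// returns the number of replacements it made.
my $copy  = $text;
my $count = ($copy =~ s/bio/bioinfo/gi);

print "$first_only\n$all\n$count replacements\n";
```

Copying the string before substituting, as in (my $all = $text), keeps the original value intact.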


Translation is a method by which you can translate each character to a new character; unlike substitution, it works on single characters rather than on whole words. If you give a word, each letter of the word is translated to the letter at the same position in the new word. If the new word is empty, the characters are left unchanged and the operation simply returns the number of matching characters, which is how bases are counted in program 6; to delete the characters instead, add the d option. The syntax for translation is:

tr/<letter>/<new letter>/ as in line 6 to 9 of program 6

You can also specify ranges, such as tr/a-z/A-Z/, which changes lowercase letters to uppercase.
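The different uses of tr can be compared side by side. This is a minimal sketch with a made-up sequence; the variable names are illustrative.

```perl
use strict;
use warnings;

my $dna = "ATGCGCTA";

# Empty replacement list: nothing is changed, tr just counts the matches.
my $a_count = ($dna =~ tr/A//);

# Letter-by-letter translation on a copy gives the complementary bases.
(my $complement = $dna) =~ tr/ATGC/TACG/;

# Ranges translate whole sets of characters, here uppercase to lowercase.
(my $lower = $dna) =~ tr/A-Z/a-z/;

# With the d option, matched characters that have no replacement are deleted.
(my $without_a = $dna) =~ tr/A//d;

print "$a_count $complement $lower $without_a\n";
```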


While loop. The while loop repeats an operation as long as the condition provided is true; in other words, the loop is executed until the condition turns false. The syntax is:

while (<condition>) {

      <statements>

}

Do while loop. To create a loop that executes statements first, and then tests an expression, you need to combine while with a preceding do {} statement. For example:

do {

            $calc +=($fact*$ivalue);

   }while ($calc<100);


In this case, the code block is executed first, and the conditional expression is evaluated only at the end of each loop iteration. Here we multiply $fact and $ivalue, add the product to $calc and save the result in $calc. The condition is then checked at the end, so the loop is repeated as long as $calc is less than 100. Once $calc is equal to or greater than 100, the loop ends.


Until loop. Its syntax is:

until (<condition>) {

      <statements>

}

The loop keeps executing until the condition is met. Once the condition is met, the loop exits. It is the inverse of the while loop.


Do until. Here the statements are executed first and the condition is checked afterwards. The syntax is:

do {

      <statements>

} until (<condition>);

For example:

do {

  $calc += ($fact*$ivalue);

} until ($calc >= 100);


Here we multiply $fact and $ivalue, add the product to $calc and save the result in $calc. The until condition then checks whether $calc is greater than or equal to 100. If not, the loop is executed again; it terminates only when the condition is satisfied.
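The difference between testing before and after the loop body can be sketched as follows. The values 5 and 7 for $fact and $ivalue are hypothetical, chosen only so that each pass adds 35.

```perl
use strict;
use warnings;

my ($fact, $ivalue) = (5, 7);   # hypothetical values; each pass adds 35

# while: the test comes first, so the body may run zero times.
my $calc_while = 0;
while ($calc_while < 100) {
    $calc_while += $fact * $ivalue;
}

# until: the same loop with the test inverted.
my $calc_until = 0;
until ($calc_until >= 100) {
    $calc_until += $fact * $ivalue;
}

# do ... while: the body runs once even though the test is already false.
my $calc_do = 100;
do {
    $calc_do += $fact * $ivalue;
} while ($calc_do < 100);

print "$calc_while $calc_until $calc_do\n";
```

The first two loops stop at 105 after three passes, while the do block still adds 35 to a value that already fails the test.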


For loop. Its syntax is:


for(<initialization>; <condition>; <increment>){

      <statements>

}

For example:

for($i=0; $i<10; $i++){

      print "the value of i is $i\n";

}

In the above example we initialize $i to 0 and then print its value as long as the condition is met, that is, as long as $i is less than 10. Each time the loop is executed, $i is incremented by 1 by the $i++ element. If you want to increment it by 2, write $i+=2. $i++ is just another way of writing $i=$i+1.


Foreach loop. The foreach loop is used with arrays so that each value of the array in turn undergoes the desired operation through an iterator variable. The syntax is:

foreach <variable> (<array>) {

      <statements>

}
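As a small illustration, nested foreach loops can build every dinucleotide from the four bases. This is only a sketch; the array names are illustrative.

```perl
use strict;
use warnings;

my @bases = qw(A C G T);
my @dinucleotides;

# The outer loop picks the first base, the inner loop the second.
foreach my $first (@bases) {
    foreach my $second (@bases) {
        push @dinucleotides, $first . $second;
    }
}

print "@dinucleotides\n";
print scalar(@dinucleotides), " dinucleotides\n";
```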






Pop function. The pop function removes the last value of an array and returns it. The syntax is:

pop(@array);

Push function. The push function adds a value at the last position of an array. The syntax is:

push(@array, 'value');


Sort function. The sort function returns the values of the array sorted alphabetically; the original array is not changed. The syntax is:

sort(@array);

Open. The function open associates a file handler with a file, which can then be used for reading from or writing to the file. Every open should be paired with a later “close”. The syntax is:


open(<file handler>, <name of the file>);

      <statements that use the file handler>

close <file handler>;

An example of the command is shown in program 4.


Chomp. It takes off the end character of a specified string ONLY if that character is a RETURN (Enter). The return character may come from input or may be added by the code itself. Either way, chomp is the command to use to remove it. It does not affect any other character.

Note: The RETURN character is the same as the ENTER character, which is also known as the NEWLINE character. It is symbolized as \n. The syntax is:


chomp(<variable>) example:- chomp($dna); or chomp(@dna);


Chop. This command is very similar to chomp, but it takes off the ending character of a string no matter what it is. The syntax is:


chop(<variable>) example:- chop($dna); or chop(@dna);
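The difference between the two commands shows up when they are applied twice. A minimal sketch with a made-up string:

```perl
use strict;
use warnings;

my $line = "ATGC\n";

# chomp removes the trailing newline and nothing else.
my $chomped = $line;
chomp($chomped);

# A second chomp has no effect, because no newline is left.
my $chomped_twice = $chomped;
chomp($chomped_twice);

# chop removes the last character whatever it is, here the base C.
my $chopped = $chomped;
chop($chopped);

print "$chomped $chomped_twice $chopped\n";
```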


Join. It takes two arguments: a string to use as a separator (a simple string, not a regular expression) and a list. It returns a single string that contains the elements of the list separated by the given separator. The list can be a set of scalar variables or an array. The syntax is:

join(EXPR, $a,$b,$c); or join(EXPR, @array);

example:- @array=("one", "two");

 print join("_", @array);

result:     one_two


Split. It breaks up a string according to a separator. This is useful for tab-separated, colon-separated, whitespace-separated or anything-separated data. If you can specify the separator with a regular expression, you can use split. The syntax is:


@array_variable = split(/separator/, string);

Example:- my $str = "one:two:three:";

@fields = split(/:/, $str);

print "@fields";

result:- one two three


Example2:- my $str = "one two three:";

@fields = split (/ /,$str); Here the separator is a space, so each of the three words becomes a value in the array. The array now has three values and they no longer form a sentence. The colon is not a separator in this case, so it stays attached to the last word.

print "@fields";

result:- one two three:
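Split and join are often used together. The sketch below assumes a hypothetical colon-separated record holding a gene name, a sequence and a length; the record format is invented for illustration.

```perl
use strict;
use warnings;

my $record = "lacZ:ATGACC:6";   # hypothetical name:sequence:length record

# Split on the colon into three named fields.
my ($name, $seq, $len) = split(/:/, $record);

# An empty pattern splits a string into its individual characters.
my @letters = split(//, $seq);

# join puts the pieces back together with a separator of our choice.
my $dashed = join('-', @letters);

# The special separator ' ' splits on any run of whitespace.
my @words = split(' ', "one two   three");

print "$name $seq $len\n$dashed\n", scalar(@words), " words\n";
```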



If elsif else. The syntax is:

if (boolean expression) {

    statements;

}

If the boolean expression evaluates to true, the statements in the curly brackets will be executed. The curly brackets are mandatory. The boolean expression can contain any of the comparison operators listed in the Perl Notation section.

Multiple conditions can be checked together using the boolean operators:


if ( ($x == 20) || ( ($x > 0)&&($x < 10)&&($x != 5) ) ){

    print "x is equal to 20, or is between 0 and 10 but not 5.\n";

}

Conditional statements can also be extended with the elsif and else structures:

if (boolean expression 1) {

    statement 1;

}

elsif (boolean expression 2) {

    statement 2;

}

else {

    statement 3;

}

Note that an if statement is followed by any number (including zero) of elsif statements, and finally an optional else statement. The statements of an elsif will be executed if its boolean expression is true, and no preceding (els)if statement's boolean expression is true. The trailing else (if present) is executed if none of the preceding statements' boolean expressions are true.


Unless. The syntax is:

unless (boolean expression) {

    statements;

}

The statements within the curly brackets are executed only if the boolean expression is false. The “unless” command is a conditional, not a loop command.


Substr. It returns a substring of EXPR, starting at OFFSET within the string (the first character is at offset 0). If OFFSET is negative, it starts that many characters from the end of the string. If LEN is specified, it returns that number of characters; if not, it returns everything up to the end of the string. If LEN is negative, it leaves that many characters off the end of the string.

If REPLACEMENT is specified, the substring is replaced with the REPLACEMENT string.

If you specify a substring that passes beyond the end of the string, only the valid part of the original string is returned.

The syntax is:

substr(EXPR, OFFSET, LEN, REPLACEMENT)

example:-

$temp = substr("okay", 2);

print "Substring value is $temp\n";


$temp = substr("okay", 1,2);

print "Substring value is $temp\n";


result:- Substring value is ay

Substring value is ka
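For sequence work, substr is a natural way to cut a DNA string into triplets. The following is a sketch only, using a made-up nine-base sequence and illustrative variable names.

```perl
use strict;
use warnings;

my $dna = "ATGGCCTAA";   # made-up sequence

# Step through the string three characters at a time.
my @codons;
for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
    push @codons, substr($dna, $i, 3);
}

# A negative offset counts from the end of the string.
my $last_two = substr($dna, -2);

# The four-argument form replaces the chosen substring in place.
my $edited = $dna;
substr($edited, 0, 3, "start");

print "@codons\n$last_two\n$edited\n";
```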

1. Perl program for calculating the arithmetic averages

The objective of the program is to calculate the average of any two given numbers. To begin with, store the numbers to be averaged in scalar variables ($a, $b, $c, etc.). Add any two numbers using the scalar variables and divide the sum by 2. The result of each calculation is stored in another scalar variable ($avg1, $avg2, $avg3, etc.). Once the results are obtained and stored, we print them on the screen with the command “print”. The strings given to print may include special characters such as “\n” and “\t”, for new line and tab respectively. These are optional.


#calculate averages for the following pairs:
$a=12;
$b=13;
$c=14;
$d=15;
$e=16;
$f=17;
#add each pair and divide the sum by 2
$avg1 = ($a+$b)/2;
$avg2 = ($c+$d)/2;
$avg3 = ($e+$f)/2;
$avg4 = ($b+$c)/2;
#print each average on its own line
print "$avg1\n$avg2\n$avg3\n$avg4\n";










2. Perl Program for calculating averages using subroutine

The objective of the present program is to calculate the average of any two numbers with the use of a subroutine.


In the previous program, the averaging formula had to be written out separately for each pair of numbers. Here the formula is written only once, inside a subroutine, and is reused to provide three different averages. Whenever we want to calculate an average, we just pass the two values to the subroutine. The subroutine calculates the average and returns the result.


To pass values to the subroutine, we write the name of the subroutine followed by the values in brackets (as seen in lines 8, 9 and 10). The subroutine receives the values in a special variable written as “@_” (as seen in line 12).


In line 8, the values of $a and $b are passed to the subroutine. The subroutine stores them in the variables $x and $y. The average is then calculated from these variables and the result is stored in $res. The value of $res is handed back with the command “return” and is then stored in $avg1. The same happens for $avg2 and $avg3.



  1. #calculate averages using subroutine
  2. $a=12;
  3. $b=13;
  4. $c=14;
  5. $d=15;
  6. $e=16;
  7. $f=17;
  8. $avg1 = avg($a,$b);
  9. $avg2 = avg($b,$c);
  10. $avg3 = avg($c,$d);
  11. sub avg{
  12. ($x,$y)=@_;
  13. $res=($x+$y)/2;
  14. return $res;
  15. }
  16. print "$avg1\n$avg2\n$avg3\n";
  17. exit;
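Because all arguments arrive in @_, the same idea extends to any number of values. The subroutine below is a hypothetical variant (the name average is not from the original program), sketched under the assumption that an empty argument list should return an undefined value.

```perl
use strict;
use warnings;

# Hypothetical generalization of avg: averages however many values it is given.
sub average {
    my @values = @_;
    return unless @values;          # no values: return undef instead of dividing by zero
    my $sum = 0;
    $sum += $_ foreach @values;     # $_ is the default variable of the loop
    return $sum / scalar(@values);
}

print average(12, 13), "\n";
print average(12, 13, 14), "\n";
```

Declaring the variables with my keeps them private to the subroutine, so they cannot clash with variables of the same name elsewhere in the script.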







3. Perl program for concatenating two sequences

This Perl program concatenates two sequences or strings. Concatenation, as shown in line 4, is done simply by placing a dot between the two values. Here the value of $DNA1 is concatenated with the value of $DNA2. Hence we write “$DNA1.$DNA2” and store the result as one complete string in $DNA3.


1.    #concatenate DNA sequence



4.    $DNA3=$DNA1.$DNA2;

5.    print"concatenated seq:$DNA3\n";
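A complete, runnable version of this exercise might look like the sketch below. The two short sequences are hypothetical stand-ins, not the sequences used in the original program.

```perl
use strict;
use warnings;

# Hypothetical sequences, used only for illustration.
my $DNA1 = "ACGGGA";
my $DNA2 = "TTGACC";

# The dot operator joins the two strings end to end.
my $DNA3 = $DNA1 . $DNA2;

# Interpolating both variables in one quoted string gives the same result.
my $DNA4 = "$DNA1$DNA2";

# The .= operator appends to an existing string.
my $DNA5 = $DNA1;
$DNA5 .= $DNA2;

print "concatenated seq:$DNA3\n";
```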





4. Perl program for file handling

This program demonstrates how to read the contents of a file. We first define a scalar variable, $longdna, with a null (empty) value and store the name of the file in $filename. If the file is located in the same directory as the Perl program, we need not specify the full path; otherwise we need to specify the full path of the file.


To open the file we use a command called “open”. “FILE” here is the file handler, and $filename contains the name of the file to be opened. The file is read through the file handler.


<FILE> reads one line from the file and stores it in the variable $dna. The whole line is saved as one value because the variable is a scalar. If the variable were an array, all the lines would be read at once and each line would be saved as one value in the array. The <FILE> operation is used in the condition of the while loop so that the lines of the file are read one after another. Once a line is read and stored in $dna, we perform a chomp on the variable to remove the trailing newline (enter) character. We then concatenate the value with the existing value of $longdna and save the result back in $longdna. In full, this is written as

$longdna=$longdna.$dna;

To make it short we write it as

$longdna.=$dna;

That is, we have shortened it by using “.=”, which means concatenate the two values and save the result in the left-hand variable. The same applies to “+=” for addition and to the other mathematical operations.


After every line has been read, chomped and concatenated, the variable $longdna contains the complete result, which is printed on the screen. The while loop ends when there are no more lines to be read, that is, when the end of the file has been reached.


#filehandler and nullstring
$longdna='';
$filename="sample.seq";
#stop with a message if the file cannot be opened
open(FILE,$filename) or die "cannot open $filename: $!";
#read the file one line at a time
while($dna=<FILE>){
chomp($dna);
#append the line to the growing string
$longdna.=$dna;
}
close FILE;
print"here is the dna:$longdna:\n\n";
exit;
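The same steps can be sketched in a self-contained form. This version is an assumption-laden variant rather than the original program: it writes its own small two-line file with the core File::Temp module so that it runs anywhere, and it uses the three-argument form of open with a lexical file handler ($fh).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small two-line sequence file so that the sketch is self-contained.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print {$tmp_fh} "TACTG\nGGCCA\n";
close($tmp_fh) or die "Cannot close $filename: $!";

# Read the file back line by line and build one long string.
my $longdna = '';
open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
while (my $dna = <$fh>) {
    chomp($dna);          # drop the newline at the end of the line
    $longdna .= $dna;     # append the line to what has been read so far
}
close($fh);

print "here is the dna:$longdna:\n";
```

The '<' argument states explicitly that the file is opened for reading, and the die message reports why an open failed instead of letting the script continue silently.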






5. Perl Program for File handling For Scalar And Array Variables

This program shows how the results differ when scalar variables and array variables are used to retrieve data from a file. We first assign the file name to a scalar variable ($DNA). We then use the open command to open the file. The command <dna1> reads the first line of the file and stores it in the scalar variable $DNA2. A scalar variable can hold only one value at a time, so the remaining lines of the file are not read; the program moves on to the next statement, where the close command is issued. The variable $DNA2 therefore contains only the first line of the file.


The next test uses an array. In line 9, the contents of the file are read into the array variable @DNA2. Since the variable is an array, all the lines of the file are read at once, and each line is stored as one value of the array. Once the file has been read completely, the program goes to the next line, where the file is closed.


We would now like to join all the values of the array @DNA2 into one complete string. For that we first need to remove the “enter” character at the end of each value of the array. The enter character is added each time we hit the enter key on the keyboard while writing a file, and it is invisible. As all the lines are copied into the array, the values also contain this enter character. So we first remove it by using chomp on the array. We then join the values of the array with the command “join”. The values are joined with ‘null’ as the separator ('' without any space between the quotes), as shown in line 16, so nothing is added between the values of the array. We now have one complete string made from the values of the array, which is stored in $DNA3. We find the length of this variable with the command “length”.


  1. #file handler
  2. $DNA="sample.seq";
  3. open(dna1,$DNA);
  4. $DNA2=<dna1>;
  5. close dna1;
  6. print"the sequence of krupa.seq:$DNA2\n";
  7. #file handler for two strings
  8. open(dna1,$DNA);#removed the above line
  9. @DNA2=<dna1>;
  10. close dna1;
  11. print"the sequence of two strings of krupa.seq:@DNA2\n";
  12. #chomp the two strings
  13. chomp (@DNA2);
  14. print"the results of chomp DNA:@DNA2\n";
  15. #join the two strings
  16. $DNA3=join('',@DNA2);
  17. print"the result of join of two strings:\n$DNA3\n";
  18. #length of DNA sequence
  19. $length=length$DNA3;
  20. print"the length of two strings of DNA=$length\n";





the sequence of two strings of krupa.seq:TACTGTGGCCGTGCGTGGCTGCCGTTGCGCTGC




the result of join of two strings:


the length of two strings of DNA=66

6. Perl program for calculating the length, total bases, GC and AT counts

The objective of the program is to calculate the length of a given DNA sequence, the total number of bases, and the number of occurrences of the dinucleotide sequences GC and AT in the DNA.


At first we assign the DNA sequence to a scalar variable called $DNA. The length of the DNA is obtained with the command “length”, the result of which is stored in $length. The number of ‘A’ bases is obtained from the translation in line 6: $DNA=~tr/A// has an empty replacement, so it leaves the DNA unchanged and returns the number of ‘A’ characters it found; putting it in brackets and assigning it to another variable stores that count. Similarly we get the number of ‘C’, ‘G’ and ‘T’ bases. We can then total all the bases with ‘+’, as in normal arithmetic; subtraction, multiplication and division work the same way. To count the number of GCs in the DNA (line 13) we substitute GC with GC itself, because we need only the number of occurrences and do not want to alter the DNA; a substitution with a different replacement would change the sequence. We did not use translation in line 13 because translation works on single characters, and GC is a two-letter word. Similarly we obtain the number of ATs.


The percentage of GCs is obtained with the percentage formula: the count of the dinucleotide GC is divided by the total number of bases and multiplied by 100. Note that this is the frequency of the dinucleotide GC, not the percentage of G and C bases in the sequence.


  1. #Calculating the length, total nucleotides, dinucleotide sequence GC and AT counts
  3. #length of DNA
  4. ($length=length$DNA);
  5. print"the length of DNA $length\n";
  6. $a=($DNA=~tr/A//);
  7. $b=($DNA=~tr/C//);
  8. $c=($DNA=~tr/G//);
  9. $d=($DNA=~tr/T//);
  10. $Total=$a+$b+$c+$d;
  11. print"total bases in DNA $Total:\n";
  12. #count of GC
  13. $GC=($DNA=~s/GC/GC/g);
  14. print"the total number of dinucleotide GC in DNA :$GC:\n";
  15. #count of AT
  16. $AT=($DNA=~s/AT/AT/g);
  17. print"the total number of dinucleotide AT in DNA:$AT:\n";
  18. #percentage of GC
  19. $GCper=($GC/($Total)*100);
  20. print"the percentage of GC: $GCper:\n";
  21. exit;




the length of DNA 40

total bases in DNA 40:

the total number of dinucleotide GC in DNA :5:

the total number of dinucleotide AT in DNA:2:

the percentage of GC:12.5:
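The program above reports the frequency of the dinucleotide GC. The more commonly quoted GC content, the share of bases that are G or C, can be sketched as follows. The subroutine name gc_percent and the test sequence are hypothetical and not part of the original exercise.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage of bases that are G or C.
sub gc_percent {
    my ($dna) = @_;
    my $length = length($dna);
    return 0 unless $length;               # avoid dividing by zero
    my $gc_bases = ($dna =~ tr/GCgc//);    # counts without changing the string
    return 100 * $gc_bases / $length;
}

my $dna = "GCGCATAT";   # made-up sequence

# Counting a two-letter word without touching the string:
# the match in list context returns every occurrence, and () = counts them.
my $gc_pairs = () = $dna =~ /GC/g;

print "GC content: ", gc_percent($dna), "\n";
print "GC dinucleotides: $gc_pairs\n";
```

For this sequence the two measures differ: four of the eight bases are G or C, while the dinucleotide GC occurs twice.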

7. Perl program depicting loop - functions

The aim of this program is to show the use of loops such as while, until, for and foreach in Perl. Loops are used when we need to perform a certain action a number of times.


In the present exercise, for the “until” loop, we assign a value (“venkat”) to a scalar variable $r. We then ask the user to enter a value, which is assigned to $i. The <STDIN> command tells the program that the user will supply the value during execution; STDIN stands for standard input. When the program comes across this command, it stops and waits for the user to give the input. Once the input is given, the program takes the value and continues.


In the 4th line the program checks whether the value of $r is present in $i (for example, the input “venkatraman” contains “venkat”). If the value is present, the program jumps to the 9th line. Otherwise it keeps printing “wrong password” and again asks the user to enter a value.


In line 11 we use a for loop to print the numbers from 0 to 9, each on a new line. We initialize $i to 0, specify that the loop should run as long as $i is less than 10, and increment $i by 1 each time. The body of the for loop has a print statement that prints the value of $i followed by \n, so that each value appears on a new line.


In line 17, each value of the array variable @ar is passed in turn to the scalar variable $i. In line 19 the value is increased by 2 and stored in $q, which is then printed on the screen. This way all the values of the array are printed after being increased by 2.


The while loop is used to print the numbers from 0 to 4. We first initialize $k to 0, then specify the condition ($k<5) in the while statement, print the value and increment $k by 1. This way the values of $k are printed from 0 to 4.


  1. #while, until, for, foreach, operation
  2. $r="venkat";
  3. $i=<STDIN>;
  4. until($i=~$r)
  5. {
  6. print"wrong password\n";
  7. $i=<STDIN>;
  8. }
  9. print"Welcome\n";
  10. #forloop
  11. for($i=0;$i<10;$i++)
  12. {
  13. print"$i\n";
  14. }
  15. #increment every element by 2
  16. @ar=(0..5);
  17. foreach$i(@ar)
  18. {
  19. $q=$i+2;
  20. print"$q.";
  21. }
  22. print"\n";
  23. #while
  24. $k=0;
  25. while($k<5)
  26. {
  27. print"$k\n";
  28. $k++;
  29. }
  30. exit;




















8. Perl program for DNA to RNA transcription

The objective of the program is to convert a DNA string into an RNA string. We store the DNA in $dna. We translate every T to U in the DNA and store the result back in $dna. This transcribed sequence is then copied to $rna. The triplet UGC, which this exercise treats as the start signal (the usual biological start codon is AUG), is substituted by “start” in the RNA, and the number of times it has been substituted is stored in $startCount. We need this count to work out the length of the RNA.


When we find the length of the RNA, every character in the string is counted, including all the letters of the word “start”. We need to remove these from the total to get the length of the RNA that follows the start signal. Hence, in the 10th line, we subtract $startCount multiplied by 5 (as “start” contains 5 letters) from the length of the string. This length is stored in $length.


1.    #program for transcribing

2.    $dna="TGCAGCGAAATGT";

3.    ($dna=~tr/T/U/);

4.    $rna=$dna;

5.    print"translated seq:$rna\n";

6.    #Show start codon in RNA

7.    $startCount=($rna=~(s/UGC/start/));

8.    print"the start seq:$rna\n";

9.    #length of rna

10.     $length=(length($rna)-5*$startCount);

11.     print"length of RNA:$length\n";





translated seq: UGCAGCGAAAUGU

the start seq: startAGCGAAAUGU

length of RNA:10
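Transcription and the complementary strand can also be wrapped in subroutines that leave the input untouched. The following is a sketch under that assumption; the subroutine names transcribe and reverse_complement are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the RNA version of a DNA string (T becomes U).
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ tr/Tt/Uu/;
    return $rna;
}

# Hypothetical helper: returns the complementary strand read in reverse.
sub reverse_complement {
    my ($dna) = @_;
    my $revcom = scalar reverse($dna);   # reverse the order of the bases
    $revcom =~ tr/ACGTacgt/TGCAtgca/;    # swap each base for its partner
    return $revcom;
}

my $dna = "TGCAGCGAAATGT";
print "transcribed seq:", transcribe($dna), "\n";
print "reverse complement:", reverse_complement($dna), "\n";
```

Because each subroutine works on its own copy, $dna still holds the original sequence after both calls.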

9. Perl program to operate array of variables

The present program demonstrates pop, push, copy and sort operations on arrays. These are straightforward commands that are easy to understand. The array @array is filled with 4 values. Using the pop operation we remove the last value of the array; the removed value is stored in $array. The push operation adds the value ‘iso’ at the last position of the array; it changes @array itself, so its result does not need to be assigned. The copy operation is done simply by assigning the array to a new array, @array1. The sort operation returns all the values of the array sorted alphabetically, and these are stored in @RE.


#perform pop, push, copy and sort functions
#original array operations
@array=('ecor1', 'bam', 'hind', 'avi');
print"the original array:@array\n";
#pop operation: remove the last value and keep it in $array
$array=pop(@array);
print"the pop results:@array\n";
#push operation: append 'iso' to @array in place
push(@array,'iso');
print"push function:@array\n";
#copy operations
@array1=@array;
print"the copy results:@array1\n";
#sort operations: sort returns a new, sorted list
@RE=sort(@array);
print"the result of sort:@RE\n";





the original array:ecor1 bam hind avi

the pop results:ecor1 bam hind

push function:ecor1 bam hind iso

the copy results:ecor1 bam hind iso

the result of sort:bam ecor1 hind iso

10. If and else operations

The current program demonstrates the use of “If”, “Elsif” and “Else” operations. We first store a filename in $filename. We then open the file read all the contents of the file and store them in the array and close the file. Each value of the array can be independently addressed by specifying its position. Example as shown in line 6 and 7. The 0th value (That is the first value of the array. The count starts from (0) of the array is stored in $line1 and the second value is stored in $line2.


We now ask the user to enter any base sequence. The entered value is stored in the variable $user. Since the user presses the Enter key after providing the input, the value stored in $user ends with a newline character, which should be removed. For that purpose we use chomp on the variable.


Now in line 13, we check whether the value entered by the user is present in $line1. The expression $line1=~/$user/ is true when the value entered by the user occurs anywhere within the string in $line1. To test instead whether the two strings are exactly equal, you would use $line1 eq $user (a single = assigns rather than compares, and $line1 would first need a chomp because it still ends with a newline). Here we enter only a part (substring) of the complete string, therefore we use the expression in line 13. If the value exists in $line1, the program prints “I found it”. If it is not present in $line1, we check $line2, and for that we use “elsif”. The “elsif” condition is evaluated only when the preceding “if” condition is false; if the “if” condition is true, the “elsif” line is not executed. So if the value entered by the user is not present in $line1, the program checks whether it is present in $line2, and if so prints “I found it in line2”. If both conditions fail, the final “else” branch runs. It tests nothing and simply prints “I did not find it”.


  1. #if operation
  2. $filename="sample.seq";
  3. open(FILE,$filename);
  4. @line=<FILE>;
  5. close FILE;
  6. $line1=$line[0];
  7. $line2=$line[1];
  8. print"$line1\n";
  9. print"$line2\n";
  10. print"enter the required base seq: ";
  11. $user=<STDIN>;
  12. chomp($user);
  13. if($line1=~/$user/){
  14. print"I found it\n";
  15. }
  16. elsif($line2=~/$user/)
  17. {
  18. print"I found it in line2\n";
  19. }
  20. else{
  21. print"I did not find it\n";
  22. }
  23. exit;








enter the required base seq: TCT

I found it in line2

---2nd try--------





enter the required base seq: TGC

I found it
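
The difference between a substring test and an equality test can be tried without a file. The sketch below assumes a made-up sequence line and query in place of $line1 and $user. It also uses \Q...\E, which is not in the original program, so that any regular expression metacharacters typed by the user are matched literally.

```perl
use strict;
use warnings;

# Hypothetical sequence line and user input, standing in for $line1 and $user.
my $line  = "ATGCTGCAAGT\n";
my $query = "TGC";

chomp(my $clean = $line);   # copy the line and drop its trailing newline

# Substring test with a regular expression; \Q...\E makes any regex
# metacharacters in the query match literally.
my $found_regex = ($clean =~ /\Q$query\E/) ? 1 : 0;

# Substring test without a regular expression: index returns -1 when absent.
my $position = index($clean, $query);

# Whole-string equality uses eq; a single = would assign instead of compare.
my $is_equal = ($clean eq $query) ? 1 : 0;

print "regex match: $found_regex\n";
print "first position (0-based): $position\n";
print "equal: $is_equal\n";
```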

11. Retrieving protein sequences and locating motifs

This program demonstrates how to find motifs within a protein sequence. We first give the protein file name in line 3. We then check whether the file can be opened by using the “unless” statement. It executes its block unless the condition provided is true. That is, if the condition is true (the file opens), the block is skipped (in this case the print statement in line 5 is not executed). If the condition is false, the block runs: the error message is printed and the program exits.
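
As a minimal sketch of the same “unless” check, the helper below wraps the open call in a subroutine and reports why the open failed. The name read_lines, the lexical filehandle, the three-argument form of open and the file name no_such_file.prot are assumptions made for illustration; the original program uses the bareword handle FILE.

```perl
use strict;
use warnings;

# Read all lines of a file, or return an empty list if it cannot be opened.
# read_lines is a hypothetical helper, not part of the original program.
sub read_lines {
    my ($path) = @_;
    my $fh;
    unless (open($fh, '<', $path)) {
        # $! holds the operating system's reason for the failure.
        warn "cannot open file $path: $!\n";
        return;
    }
    my @lines = <$fh>;
    close $fh;
    chomp(@lines);
    return @lines;
}

my @protein = read_lines('no_such_file.prot');   # assumed not to exist
print "lines read: ", scalar(@protein), "\n";
```

Returning an empty list instead of calling exit lets the caller decide what to do when the file is missing.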


Once the file is open, its contents are read and stored in the array variable @protein. We then use chomp to remove the newline characters and join to create one complete string, so that we have one long protein string in $protein. We ask the user to enter a motif, so that we can search for it in the string. We chomp the user’s input and then use the “if” condition to see whether the motif is present in $protein. If it is present the program prints “I found it!”, else it prints “I couldn’t find it:”. All this is put in a “do until” loop, which runs from line 14 to line 24, so that the program keeps requesting a new motif after finishing with the previous one. When the user enters only spaces or simply presses the Enter key, the condition in the “until” statement becomes true and the loop ends.



  1. #retrieving protein seq
  2. #print"please type the filename of the protein sequence data:";
  3. $proteinfilename="prot.prot";
  4. unless (open(FILE,$proteinfilename)){
  5. print"cannot open file $proteinfilename\n";
  6. exit;
  7. }
  8. @protein=<FILE>;
  9. close FILE;
  10. chomp(@protein);
  11. $protein=join('',@protein);
  12. print"protein:$protein\n";
  13. #location of motifs
  14. do{
  15. print"enter a motif to search for:\n";
  16. $motif=<STDIN>;
  17. chomp($motif);
  18. #Look for the motif
  19. if($protein=~/$motif/){
  20. print"I found it!\n\n";
  21. }else {
  22. print"I couldn\'t find it:\n\n";
  23. }
  24. }until ($motif=~/^\s*$/);
  25. exit;





enter a motif to search for:


I couldn't find it:


enter a motif to search for:


I found it!


enter a motif to search for:
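
The program reports only whether a motif occurs. A possible extension, sketched here under the assumption of a short made-up protein string and a fixed list of motifs instead of keyboard input, also reports where each match starts. The subroutine name motif_positions is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical protein string and motif list; a real run would read them
# from a file and from the keyboard as in the listing above.
my $protein = 'MGGSTVSGGE';
my @motifs  = ('GG', 'SVG', 'S[GT]');

# Return the 1-based start positions of every match of $motif in $seq.
sub motif_positions {
    my ($seq, $motif) = @_;
    my @positions;
    while ($seq =~ /$motif/g) {
        # $-[0] is the 0-based offset where the current match starts.
        push(@positions, $-[0] + 1);
    }
    return @positions;
}

for my $motif (@motifs) {
    next if $motif =~ /^\s*$/;    # skip blank input instead of searching for it
    my @hits = motif_positions($protein, $motif);
    if (@hits) {
        print "$motif found at: @hits\n";
    } else {
        print "$motif not found\n";
    }
}
```

Because the motif is used as a regular expression, a pattern such as S[GT] matches S followed by either G or T.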

12. In silico Salient features of DNA

This program aims to demonstrate a few salient features of DNA read from a file. We first store the file name in a scalar variable $DNA. We open the file, read its first line and store it in $DNA2. As $DNA2 is a scalar it holds only one value at a time, and hence the other lines of the file are not stored. We then open the same file again, but this time store the contents in an array, so that every line is read and stored as a value of the array @DNA.


We use chomp on the array (@DNA) to remove the newline character from each value. The “chop” command removes the last character of a value, whatever that character is. We demonstrate chop here: when we use it on the array @DNA (line 16), the last character of each value of the array is removed. Because chomp has already removed the newlines, this deletes the final base of each line.
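
The following small sketch, which assumes two made-up sequence lines, shows the difference between the two commands: chomp removes only a trailing newline, while chop removes whatever character comes last.

```perl
use strict;
use warnings;

# Two hypothetical lines as they would arrive from a file, newline included.
my @lines = ("ATGC\n", "GGTA\n");

chomp(@lines);                     # removes only the trailing newline
my @after_chomp = @lines;

chomp(@lines);                     # a second chomp has nothing left to remove
my @after_second_chomp = @lines;

chop(@lines);                      # removes the last character, here a base
my @after_chop = @lines;

print "chomp: @after_chomp\n";
print "chomp again: @after_second_chomp\n";
print "chop: @after_chop\n";
```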


We use the “join” command to make one complete string of DNA from the array values. In line 22, the complement of the DNA string ($DNA3) is obtained by translating A to T, C to G, G to C and T to A, and the result is saved again in $DNA3. A point to note here is that tr works only character by character and not on words. We then translate T to U to get the RNA. We find the length of the RNA by using the command called “length”. We then find the total number of bases or nucleotides in the string by counting each nucleotide separately and adding up the individual values. The counts come from tr with an empty replacement list (see lines 31 to 34), which leaves the string unchanged and returns the number of matching characters.
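
The three uses of tr described here can be tried on a short string. This is a minimal sketch assuming a made-up sequence; unlike the program, it writes each result to a new variable so that the original string is kept.

```perl
use strict;
use warnings;

my $dna = 'ATGCGCTA';   # hypothetical sequence

# tr maps single characters position by position: A->T, C->G, G->C, T->A.
(my $complement = $dna) =~ tr/ACGT/TGCA/;

# As in the listing, the complement is then rewritten with U in place of T.
(my $rna = $complement) =~ tr/T/U/;

# With an empty replacement list, tr changes nothing and returns a count.
my $adenines = ($rna =~ tr/A//);

print "DNA:        $dna\n";
print "complement: $complement\n";
print "RNA:        $rna\n";
print "adenines in RNA: $adenines\n";
```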


The program then counts the occurrences of “GC” by substituting GC with itself, which leaves the sequence undisturbed while returning the number of substitutions made; this count is divided by the total and multiplied by 100. Note that this counts the two-letter string GC rather than the individual G and C bases, so the figure printed is not the conventional GC content, which would be the number of G plus the number of C ($b+$c in this program) divided by the total. The number of adenines has already been calculated in line 31, so we simply copy it to $A. We then use the substitute command to replace AUG with “start” and UAG with “Stop”.


  1. #file handler for single strand
  2. $DNA="sample.seq";
  3. open (FILE,$DNA);
  4. $DNA2=<FILE>;
  5. close FILE;
  6. print"the sequence of single strand of DNA:$DNA2\n";
  7. #file handler for multiple strands of DNA
  8. open (FILE,$DNA);
  9. @DNA=<FILE>;
  10. close FILE;
  11. print"the sequences of multiple strands of DNA:@DNA\n";
  12. #Chomp the multiple strands of DNA
  13. chomp (@DNA);
  14. print"the result of chomp:@DNA\n";
  15. #chop the  sequences
  16. chop (@DNA);
  17. print"the result of chop function:@DNA\n";
  18. #join the two strands
  19. $DNA3=join('',@DNA);
  20. print"the result of join:$DNA3\n";
  21. #substitute for making a complementary copy of the joined strand
  22. $DNA3=~(tr/ACGT/TGCA/);
  23. print"the complementary result:$DNA3\n";
  24. #to make a transcribed copy of the string
  25. $DNA3=~(tr/T/U/);
  26. print"the transcribed RNA:$DNA3\n";
  27. #total length of RNA
  28. $length=length($DNA3);
  29. print"the length of RNA:$length\n";
  30. #total number of nucleotides
  31. $a=($DNA3=~tr/A//);
  32. $b=($DNA3=~tr/C//);
  33. $c=($DNA3=~tr/G//);
  34. $d=($DNA3=~tr/U//);
  35. $Total=$a+$b+$c+$d;
  36. print"the total nucleotides:in RNA:$Total\n";
  37. #percent GC count
  38. $GCcount=($DNA3=~s/GC/GC/g);
  39. print"the total number of GC in DNA :$GCcount:\n";
  40. $GCper=($GCcount/($Total)*100);
  41. print"the GC percentage:$GCper\n";
  42. #number of A
  43. $A=($a);
  44. print"the total number of Adenines:$A\n";
  45. #substitute the start and stop codons
  46. $DNA3=~(s/AUG/start/g);
  47. $DNA3=~(s/UAG/Stop/g);
  48. print"the start and stop codon:$DNA3\n";
  49. #chop the terminal nucleotide of RNA
  50. chop ($DNA3);
  51. print"the result of chop:$DNA3\n";
  52. exit;





the sequence of single strand of DNA:



the sequences of multiple strands of DNA:





the result of join:


the result of complementary:


the transcribed RNA:


the length of RNA:64

the total nucleotides:in RNA:64

the total number of GC in DNA :7:

the GC percentage:10.9375

the total number of Adenines:18

the start and stop codon:


the result of chop:


13. Perl Program to translate DNA into protein

The objective of the program is to read DNA from a file or through manual entry and convert the DNA string into a protein string. For this purpose we first define a subroutine which translates a three-letter DNA codon into the corresponding amino acid. The subroutine contains a hash variable called “%genetic_code”, which stores each DNA codon together with the one-letter code of its amino acid. When the subroutine gets a codon, the codon is converted to uppercase (as in line 3 with the command “uc”) and then looked up in this hash. If it exists, the subroutine returns the amino acid; otherwise it prints an error message to STDERR and the program exits.


We now ask the user to choose 1 for entering the DNA manually or 2 for reading the DNA from a file. With manual entry, the user types the DNA on the screen, a chomp removes the newline character and the value is stored in $dna. With a file, the file is opened, its contents are read into an array, and the values of this array are joined to form one complete DNA string, which is stored in $dna.


We have used the keyword “my” in front of some of the variables. It declares the variable as private (lexically scoped): the variable is visible only inside the block, subroutine or file in which it is declared and cannot be accessed from any other subroutine or program.
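
A minimal sketch of this scoping rule follows. The subroutine name shout and the sample values are assumptions for illustration; three different variables named $codon coexist without affecting one another.

```perl
use strict;
use warnings;

my $codon = 'atg';          # visible from here to the end of the file

sub shout {
    my ($codon) = @_;       # a separate variable, private to this subroutine
    $codon = uc $codon;     # changes only the inner copy
    return $codon;
}

my $upper = shout($codon);

{
    my $codon = 'taa';      # a third variable, alive only inside these braces
}

print "inside sub: $upper\n";
print "outside:    $codon\n";
```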


Now we initialize the variables $protein and $codon3. We use the “for” loop to go through the DNA three letters at a time, save each codon in $codon3 and pass it to the subroutine to get the corresponding amino acid, which we append to $protein. In this way the for loop scans the DNA string until it reaches the end.


The for loop (for(my $i=0; $i<(length($dna)-2); $i+=3)) initializes the variable $i to 0, continues while at least three bases remain from position $i (the second element of the for loop) and increments $i by 3 on each pass. Any one or two bases left over at the end are therefore ignored. You then take a substring of the DNA by using the command in line 102 (substr($dna,$i,3)), that is, the substring starting at position $i with a length of 3. This gives the three-letter codon from the DNA string. The codon is then passed to the subroutine named “codon2aa”. There the codon is converted to uppercase and the if condition checks whether an amino acid exists for this codon; if so, it is returned and appended to $protein. Each time the for loop is executed, the new amino acid is concatenated to the existing value of $protein, so that at the end we have one complete protein sequence. We can find the length of the protein sequence by using the command called length.
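
The stepping of the loop can be seen in isolation with a reduced lookup table. This is only a sketch: the four-entry hash, the subroutine name codons, the input string and the '?' placeholder for unknown codons are assumptions, whereas the full program uses all 64 codons and exits on an unknown one.

```perl
use strict;
use warnings;

# A deliberately tiny lookup table for illustration.
my %genetic_code = ('ATG' => 'M', 'TCA' => 'S', 'TTC' => 'F', 'TAA' => '_');

# Split a DNA string into codons the same way the for loop does.
sub codons {
    my ($dna) = @_;
    my @codons;
    for (my $i = 0; $i < length($dna) - 2; $i += 3) {
        push(@codons, uc substr($dna, $i, 3));
    }
    return @codons;
}

my $dna = 'atgTCATTCtaaGG';   # hypothetical input; the trailing "GG" is incomplete
my $protein = '';
for my $codon (codons($dna)) {
    # The .= operator appends each amino acid to the growing protein string.
    $protein .= exists $genetic_code{$codon} ? $genetic_code{$codon} : '?';
}

print "codons: @{[ codons($dna) ]}\n";
print "protein: $protein\n";
```

The two leftover bases at the end never form a codon, because the loop condition stops before them.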


Flow chart depicting the use of the genetic code to translate a DNA string into a protein string.




  1. sub codon2aa{
  2. my($codon)=@_;
  3. $codon= uc $codon;
  4. my(%genetic_code) = (
  5. 'TCA'=>'S', #Serine
  6. 'TCC'=>'S', #Serine
  7. 'TCG'=>'S', #Serine
  8. 'TCT'=>'S', #Serine
  9. 'TTC'=>'F', #Phenylalanine
  10. 'TTT'=>'F', #Phenylalanine
  11. 'TTA'=>'L', #Leucine
  12. 'TTG'=>'L', #Leucine
  13. 'TAC'=>'Y', #Tyrosine
  14. 'TAT'=>'Y', #Tyrosine
  15. 'TAA'=>'_', #Stop
  16. 'TAG'=>'_', #Stop
  17. 'TGC'=>'C', #Cysteine
  18. 'TGT'=>'C', #Cysteine
  19. 'TGA'=>'_', #Stop
  20. 'TGG'=>'W', #Tryptophan
  21. 'CTA'=>'L', #Leucine
  22. 'CTC'=>'L', #Leucine
  23. 'CTG'=>'L', #Leucine
  24. 'CTT'=>'L', #Leucine
  25. 'CCA'=>'P', #Proline
  26. 'CAT'=>'H', #Histidine
  27. 'CAA'=>'Q', #Glutamine
  28. 'CAG'=>'Q', #Glutamine
  29. 'CGA'=>'R', #Arginine
  30. 'CGC'=>'R', #Arginine
  31. 'CGG'=>'R', #Arginine
  32. 'CGT'=>'R', #Arginine
  33. 'ATA'=>'I', #Isoleucine
  34. 'ATC'=>'I', #Isoleucine
  35. 'ATT'=>'I', #Isoleucine
  36. 'ATG'=>'M', #Methionine
  37. 'ACA'=>'T', #Threonine
  38. 'ACC'=>'T', #Threonine
  39. 'ACG'=>'T', #Threonine
  40. 'ACT'=>'T', #Threonine
  41. 'AAC'=>'N', #Asparagine
  42. 'AAT'=>'N', #Asparagine
  43. 'AAA'=>'K', #Lysine
  44. 'AAG'=>'K', #Lysine
  45. 'AGC'=>'S', #Serine
  46. 'AGT'=>'S', #Serine
  47. 'AGA'=>'R', #Arginine
  48. 'AGG'=>'R', #Arginine
  49. 'CCC'=>'P', #Proline
  50. 'CCG'=>'P', #Proline
  51. 'CCT'=>'P', #Proline
  52. 'CAC'=>'H', #Histidine
  53. 'GTA'=>'V', #Valine
  54. 'GTC'=>'V', #Valine
  55. 'GTG'=>'V', #Valine
  56. 'GTT'=>'V', #Valine
  57. 'GCA'=>'A', #Alanine
  58. 'GCC'=>'A', #Alanine
  59. 'GCG'=>'A', #Alanine
  60. 'GCT'=>'A', #Alanine
  61. 'GAC'=>'D', #Aspartic Acid
  62. 'GAT'=>'D', #Aspartic Acid
  63. 'GAA'=>'E', #Glutamic Acid
  64. 'GAG'=>'E', #Glutamic Acid
  65. 'GGA'=>'G', #Glycine
  66. 'GGC'=>'G', #Glycine
  67. 'GGG'=>'G', #Glycine
  68. 'GGT'=>'G', #Glycine
  69. );
  70. if(exists $genetic_code{$codon}){
  71. return $genetic_code{$codon};
  72. }
  73. else{
  74. print STDERR "Bad Codon\"$codon\"!!\n";
  75. exit;
  76. }
  77. }


  78. print "Please choose one of the following for entering the DNA sequence\n";
  79. print "Press 1 for manual entry or 2 for file\n";
  80. $entry=<STDIN>;
  81. chomp($entry);
  82. if($entry==1){
  83. print "enter the sequence:-\n";
  84. $dna=<STDIN>;
  85. chomp($dna);
  86. }elsif($entry==2){
  87. print "enter the filename(if it is in another folder please specify the path)\n";
  88. $dnafile=<STDIN>;
  89. open(FILE,$dnafile);
  90. @filecontents=<FILE>;
  91. close FILE;
  92. chomp(@filecontents);
  93. $dna=join('',@filecontents);
  94. }else{
  95. print "please press only 1 or 2\n";
  96. exit;
  97. }
  98. #my $dna="TCATTCTCATTC";
  99. my $protein='';
  100. my $codon3;
  101. for(my $i=0; $i<(length($dna)-2); $i+=3){
  102. $codon3=substr($dna,$i,3);
  103. $protein.=  codon2aa($codon3);
  104. }
  105. print "\nThe protein is:-\n$protein\n";
  106. $len = length($protein);
  107. print "The length of the protein is:-\n$len\n";
  108. exit;





krupanidhi@biolin:~/Desktop/bioperl$ perl dna2prot.pl

Please choose one of the following for entering the DNA sequence

Press 1 for manual entry or 2 for file


enter the sequence:-



The protein is:-


The length of the protein is:-


krupanidhi@biolin:~/Desktop/bioperl$ perl dna2prot.pl

Please choose one of the following for entering the DNA sequence

Press 1 for manual entry or 2 for file


enter the filename(if it is in another folder please specify the path)





The protein is:-




The length of the protein is:-



press 1 for manual entry or 2 for file


press either 1 or 2 only
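
The sample files listed at the end of this workbook are in FASTA format, where the first line starts with “>” and describes the sequence. The file branch of the program joins every line, so the letters of such a header would reach codon2aa and be reported as a bad codon. The sketch below shows one possible way to drop header lines before joining. The lines in @filecontents are made up for illustration, and the filtering step is not part of the original program.

```perl
use strict;
use warnings;

# Lines as they might come from a FASTA file; header and bases are hypothetical.
my @filecontents = (
    ">example_id hypothetical sequence\n",
    "ATGTCATTC\n",
    "\n",
    "TCATTCTAA\n",
);

chomp(@filecontents);

# Keep only sequence lines: drop headers starting with ">" and blank lines.
my @sequence_lines = grep { !/^>/ && /\S/ } @filecontents;

my $dna = join('', @sequence_lines);
$dna =~ s/\s+//g;    # remove any stray spaces within the lines

print "DNA: $dna\n";
```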



14. Bio-perl programme to retrieve a protein sequence from Swissprot

This program uses the BioPerl package to connect to the internet and retrieve a Swissprot sequence from the online databases using get_sequence; the sequence is saved in $seq_object. The blast_sequence function runs a BLAST search with the sequence against the standard databases at NCBI, and the result is stored in $blast_result. The write_blast function writes the BLAST report out to a file. In this case the file name is raol1.blast.


#!/usr/bin/perl -w
use Bio::Perl;
$seq_object = get_sequence('swissprot', "ROA1_HUMAN");
$blast_result = blast_sequence($seq_object);
write_blast(">raol1.blast",$blast_result);
exit;


i. krupa.seq





ii. Beta-globin gene:













iii. Prot.prot










iv. IL2:



Genomic sequence in FASTA format


>ref|NT_016354.18|Hs4_16510:47920772-47925797 Homo sapiens chromosome 4 genomic contig, reference assembly












































































v. mRNA sequence: IL 2


>gi|125661059|ref|NM_000586.3| Homo sapiens interleukin 2 (IL2), mRNA















vi. Hemoglobin alpha 1:


Gene seq:


>ref|NT_037887.4|Hs16_37891:166679-167520 Homo sapiens chromosome 16 genomic contig, reference assembly
















vii. mRNA: alpha 1


>gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA













viii. Hemoglobin Beta Chain




>ref|NT_009237.17|Hs11_9394:4033937-4035542 Homo sapiens chromosome 11 genomic contig, reference assembly


























ix. mRNA hb b-chain


>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA












Suggested reading:

1.    Randal L. Schwartz, Tom Phoenix and brian d foy, 2005, Learning Perl, 4th edition, O’Reilly.

2.    brian d foy, 2006, Learning Perl Student Workbook, 4th edition, O’Reilly.

3.    Larry Wall, 2000, Programming Perl, 3rd edition, O’Reilly.

4.    James Tisdall, 2001, Beginning Perl for Bioinformatics, 1st edition, O’Reilly.

5.    James D. Tisdall, 2003, Mastering Perl for Bioinformatics, 1st edition, O’Reilly.

6.    Ian Korf, Mark Yandell and Joseph Bedell, 2003, BLAST, 1st edition, O’Reilly.