13. Perl Program to translate DNA into protein
- Introductory Workbook on Perl for Biology Students

The objective of the program is to read a DNA sequence, either from a file or through manual entry, and translate the DNA string into a protein string. For this purpose we first define a subroutine that translates a three-letter DNA codon into the one-letter code of the corresponding amino acid. The subroutine contains a hash variable called “%genetic_code”, which stores each codon as a key with its amino acid as the value; stop codons are stored as an underscore. When the subroutine receives a codon, it converts the codon to uppercase (the third line of the subroutine, using “uc”) and checks whether that codon exists in the hash. If it exists, the subroutine returns the amino acid; otherwise it prints an error message to STDERR and the program exits.
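The lookup described above can be tried on its own with a reduced table. This is only a minimal sketch: the names %mini_code and lookup_codon are hypothetical, the table holds three codons instead of all 64, and the helper returns undef for an unknown codon instead of exiting as the program's subroutine does.

```perl
use strict;
use warnings;

# Reduced table for illustration only; the full program lists all 64 codons.
my %mini_code = (
    'ATG' => 'M',   # Methionine
    'TTT' => 'F',   # Phenylalanine
    'TAA' => '_',   # Stop
);

# Hypothetical helper: returns the one-letter code for a codon,
# or undef when the codon is not a key of the hash.
sub lookup_codon {
    my ($codon) = @_;
    $codon = uc $codon;                      # accept lowercase input
    return exists $mini_code{$codon} ? $mini_code{$codon} : undef;
}

for my $codon ('atg', 'TTT', 'NNN') {
    my $aa = lookup_codon($codon);
    print defined $aa ? "$codon -> $aa\n" : "$codon -> not in table\n";
}
```

Returning undef leaves the decision about what to do with a bad codon to the caller, which may be more convenient than exiting when a sequence contains ambiguous bases.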


We now ask the user to choose 1 to enter the DNA manually or 2 to read the DNA from a file. For manual entry, the user types the DNA sequence on the screen, “chomp” removes the trailing newline character, and the result is stored in $dna. For file entry, the file is opened, its lines are read into an array, the newline is removed from each line, and the lines are joined to form one complete DNA string that is stored in $dna.
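The file branch can be sketched as a small subroutine. This is an illustration under assumptions, not the program's own code: the name read_dna_file is hypothetical, and the sketch uses a lexical filehandle with an error check, whereas the listing uses the bareword handle FILE. The demo writes a temporary file with the core File::Temp module so that it can run without an existing sequence file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read every line of a file, strip the newlines,
# and join the lines into one DNA string.
sub read_dna_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open '$path': $!\n";
    my @lines = <$fh>;      # one array element per line
    close($fh);
    chomp(@lines);          # remove the trailing newline from each line
    return join('', @lines);
}

# Demo: write a two-line sequence to a temporary file, then read it back.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "TCATTC\nTCATTC\n";
close($tmp_fh);

my $dna = read_dna_file($tmp_name);
print "$dna\n";             # prints TCATTCTCATTC
```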


We have used the keyword “my” in front of some of the variables. It declares the variable as private (lexically scoped). This means the variable is visible only inside the block, subroutine or file in which it is declared, and cannot be accessed from any other subroutine or program.
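A short sketch may make the effect of “my” concrete. The subroutine name show_scope is hypothetical and is not part of the program; it declares its own $codon, which is a separate variable from the $codon declared outside it.

```perl
use strict;
use warnings;

my $codon = 'ATG';           # visible in the rest of this file

sub show_scope {
    my $codon = 'TAA';       # a separate variable, private to this subroutine
    return $codon;
}

my $inner = show_scope();
print "inside: $inner, outside: $codon\n";   # prints inside: TAA, outside: ATG
```

Assigning to the inner $codon leaves the outer one unchanged, which is why the listing can use the same variable name in the subroutine and in the main program without conflict.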


Now we initialize the variables $protein and $codon3. We use a “for” loop to go through the DNA, take each three-letter codon, save it in $codon3 and pass the codon to the subroutine to get the corresponding amino acid. We then append this amino acid to $protein. In this way the for loop scans the DNA string until it reaches the end.


The for loop (for(my $i=0; $i<(length($dna)-2); $i+=3)) initializes the variable $i to 0 and increments it by 3 on each pass. The condition $i<(length($dna)-2) stops the loop once fewer than three bases remain, so an incomplete codon at the end of the sequence is ignored. Inside the loop, substr($dna,$i,3) takes the substring of the DNA that starts at position $i and has a length of 3, which gives the three-letter codon. This codon is then passed to the subroutine named “codon2aa”. There the codon is converted to uppercase, the if condition checks whether an amino acid exists for this codon, and the value returned is saved in $protein. On each following pass of the loop, the new amino acid is concatenated to the existing value of $protein, so that at the end we have one complete protein sequence. We can find the length of the protein sequence with the “length” function.
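The way the loop walks through the sequence can be checked separately from the translation step. The helper split_codons below is hypothetical and only collects the codons that the same loop condition and the same substr call would produce. The input is deliberately eight bases long to show that the two leftover bases are dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: split a DNA string into complete three-letter codons,
# using the same loop bounds as the translation loop.
sub split_codons {
    my ($dna) = @_;
    my @codons;
    for (my $i = 0; $i < length($dna) - 2; $i += 3) {
        push @codons, substr($dna, $i, 3);   # 3 bases starting at position $i
    }
    return @codons;
}

my @codons = split_codons('TCATTCTC');
print join(' ', @codons), "\n";   # prints TCA TTC
```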


Flow chart depicting the methodology of using the genetic code to translate a DNA string into a protein string.




    sub codon2aa{
        my($codon)=@_;
        $codon= uc $codon;   # accept lowercase input
        my(%genetic_code) = (
            'TCA'=>'S', #Serine
            'TCC'=>'S', #Serine
            'TCG'=>'S', #Serine
            'TCT'=>'S', #Serine
            'TTC'=>'F', #Phenylalanine
            'TTT'=>'F', #Phenylalanine
            'TTA'=>'L', #Leucine
            'TTG'=>'L', #Leucine
            'TAC'=>'Y', #Tyrosine
            'TAT'=>'Y', #Tyrosine
            'TAA'=>'_', #Stop
            'TAG'=>'_', #Stop
            'TGC'=>'C', #Cysteine
            'TGT'=>'C', #Cysteine
            'TGA'=>'_', #Stop
            'TGG'=>'W', #Tryptophan
            'CTA'=>'L', #Leucine
            'CTC'=>'L', #Leucine
            'CTG'=>'L', #Leucine
            'CTT'=>'L', #Leucine
            'CCA'=>'P', #Proline
            'CAT'=>'H', #Histidine
            'CAA'=>'Q', #Glutamine
            'CAG'=>'Q', #Glutamine
            'CGA'=>'R', #Arginine
            'CGC'=>'R', #Arginine
            'CGG'=>'R', #Arginine
            'CGT'=>'R', #Arginine
            'ATA'=>'I', #Isoleucine
            'ATC'=>'I', #Isoleucine
            'ATT'=>'I', #Isoleucine
            'ATG'=>'M', #Methionine
            'ACA'=>'T', #Threonine
            'ACC'=>'T', #Threonine
            'ACG'=>'T', #Threonine
            'ACT'=>'T', #Threonine
            'AAC'=>'N', #Asparagine
            'AAT'=>'N', #Asparagine
            'AAA'=>'K', #Lysine
            'AAG'=>'K', #Lysine
            'AGC'=>'S', #Serine
            'AGT'=>'S', #Serine
            'AGA'=>'R', #Arginine
            'AGG'=>'R', #Arginine
            'CCC'=>'P', #Proline
            'CCG'=>'P', #Proline
            'CCT'=>'P', #Proline
            'CAC'=>'H', #Histidine
            'GTA'=>'V', #Valine
            'GTC'=>'V', #Valine
            'GTG'=>'V', #Valine
            'GTT'=>'V', #Valine
            'GCA'=>'A', #Alanine
            'GCC'=>'A', #Alanine
            'GCG'=>'A', #Alanine
            'GCT'=>'A', #Alanine
            'GAC'=>'D', #Aspartic Acid
            'GAT'=>'D', #Aspartic Acid
            'GAA'=>'E', #Glutamic Acid
            'GAG'=>'E', #Glutamic Acid
            'GGA'=>'G', #Glycine
            'GGC'=>'G', #Glycine
            'GGG'=>'G', #Glycine
            'GGT'=>'G', #Glycine
        );

        if(exists $genetic_code{$codon}){
            return $genetic_code{$codon};
        }
        else{
            # unknown codon: report it and stop the program
            print STDERR "Bad Codon \"$codon\"!!\n";
            exit;
        }
    }


    print "Please choose one of the following for entering the DNA sequence\n";
    print "Press 1 for manual entry or 2 for file\n";

    my $entry=<STDIN>;
    chomp($entry);
    my $dna;
    if($entry==1){
        print "enter the sequence:-\n";
        $dna=<STDIN>;
        chomp($dna);
    }elsif($entry==2){
        print "enter the filename(if it is in another folder please specify the path)\n";
        my $dnafile=<STDIN>;
        chomp($dnafile);   # remove the newline from the filename
        open(FILE,'<',$dnafile) or die "Cannot open file \"$dnafile\": $!\n";
        my @filecontents=<FILE>;
        close FILE;
        chomp(@filecontents);
        $dna=join('',@filecontents);   # join all lines into one DNA string
    }else{
        print "please press only 1 or 2\n";
        exit;
    }


    #my $dna="TCATTCTCATTC";
    my $protein='';
    my $codon3;
    for(my $i=0; $i<(length($dna)-2); $i+=3){
        $codon3=substr($dna,$i,3);
        $protein.=  codon2aa($codon3);
    }

    print "\nThe protein is:-\n$protein\n";
    my $len = length($protein);
    print "The length of the protein is:-\n$len\n";

    exit;





~/Desktop/bioperl$ perl dna2prot.pl

Please choose one of the following for entering the DNA sequence

Press 1 for manual entry or 2 for file


enter the sequence:-



The protein is:-


The length of the protein is:-


~/Desktop/bioperl$ perl dna2prot.pl

Please choose one of the following for entering the DNA sequence

Press 1 for manual entry or 2 for file


enter the filename(if it is in another folder please specify the path)





The protein is:-




The length of the protein is:-



press 1 for manual entry or 2 for file


press either 1 or 2 only


