Login

Join for Free!
118252 members


Genomic Scan, Frequency & Distribution of SEED

Everything on bioinformatics, the science of information technology as applied to biological research.

Moderator: BioTeam

Genomic Scan, Frequency & Distribution of SEED

Postby Kwak » Fri Jan 11, 2008 3:27 pm

I want to scan a whole genome (fasta format) for its 100% exact matches with a seed sequence which is 7-mer long. Blast (and that goes for any version as far as i know) NEVER, even with the correct parameters which adjust for word size and its e-value - gets ONLY the exact matches because of its extension and similarity methodology which focusses on NEAR-exact matches (similarity).

It's seems very elementary, I want to scan a fasta formatted file, determine its sequence length and mark the seed start and end of every exact match which will allow me to map a distribution along the genome and determine it's frequency. However, I can't find any protocol, method or website that points me into the right direction eventhough the principle is SO BASIC. And trust me I have been trying. I'm aware that this must be a piece-of-cake for somebody that has the perl/coding skills I, quite obviously, lack

I would have given up and done it with notepad a long time ago if I didnt have so many different sequences, large genomes, EST's etc to go through.

Any help would be greatly appreciated, though i fear that the explanation will be too simple and humiliate me :)
Kwak
Garter
Garter
 
Posts: 3
Joined: Fri Jan 11, 2008 2:17 pm

Postby mith » Fri Jan 11, 2008 5:05 pm

so let me get this straight, you just want it to mark the ends/starts of exact matches?
Living one day at a time;
Enjoying one moment at a time;
Accepting hardships as the pathway to peace;
~Niebuhr
User avatar
mith
Inland Taipan
Inland Taipan
 
Posts: 5345
Joined: Thu Jan 20, 2005 8:14 pm
Location: Nashville, TN

Re: Genomic Scan, Frequency & Distribution of SEED

Postby mith » Fri Jan 11, 2008 7:06 pm

I don't know what visualization system you're using but something like this might work if I understood you correctly

Code: Select all
#!/usr/local/bin/perl

#fastastream from http://search.cpan.org/~reneeb/Bio-FastaStream-0.03/lib/Bio/FastaStream.pm
  use Bio::FastaStream;
  my $fasta = '/path/to/file.fasta';
  my $seq = Bio::FastaStream->new($fasta);

# get sequence data
  my $sequence = $seq->getSequence();
#get length of sequence
  my $length = $seq->getSequenceLength();

#replace matchpattern with your query sequence
$sequence =~ s/MATCHPATTERN/...../g;

#output to file


So basically this code just finds your pattern and then replaces it with period characters which would show up if you're doing a pairwise comparison.
Living one day at a time;
Enjoying one moment at a time;
Accepting hardships as the pathway to peace;
~Niebuhr
User avatar
mith
Inland Taipan
Inland Taipan
 
Posts: 5345
Joined: Thu Jan 20, 2005 8:14 pm
Location: Nashville, TN


Postby Kwak » Sat Jan 12, 2008 12:01 am

Thank you for your time mith. I apologize for any ambiguity but we are nearly there I think, if only we could replace the "#replace matchpattern with your query sequence" into a print output which would include the start and end of the position of the seed (search sequence) in the whole sequence string.
e.g. ID(occurrence)-Seed-Start-End-Length(FullSequenceLength)


To illustrate it an example - we have a string of length 42:
T G C G T C C T A C A A G A C A A C T A T G G G G C C G C G T T T T C G C C G C C A
And lets say I’m looking for a seed sequence of 3-mer, CGT.
T G.C G T.C C T A C A A G A C A A C T A T G G G G C C G.C G T.T T T C G C C G C C A
What I want to print to a file would be the following tab delimited data:

ID(occurrence)-Seed-Start-End-Length(FullSequenceLength) like:
1-CGT-3-6-42
2 -CGT-29-32-42

I'm sorry to bother any of you with my lack of perl skills, but as a european working/studying biochemistry at a chinese university ; my skills as a researcher are not that well balanced, plus IT personnel has yet to be discovered in my department.
Kwak
Garter
Garter
 
Posts: 3
Joined: Fri Jan 11, 2008 2:17 pm

Postby mith » Sat Jan 12, 2008 3:52 am

I don't have perl with me right now(on vacation) but here's how I would do it(no testing done so it's probably buggy).

Code: Select all
#!/usr/local/bin/perl

#fastastream from http://search.cpan.org/~reneeb/Bio-FastaStream-0.03/lib/Bio/FastaStream.pm
  use Bio::FastaStream;
  my $fasta = '/path/to/file.fasta';
  my $seq = Bio::FastaStream->new($fasta);
 
  #query data here   
  $query = "something";
  $querylength = 10;

  $offset=$querylength-1;


# get sequence data
  my $sequence = $seq->getSequence();
#get length of sequence
  my $length = $seq->getSequenceLength();

#replace matchpattern with your query sequence, function substitute matches pattern with "." of equal length to query sequence(they're placeholders)
$sequence =~ s/$query/...../g;

#cuts sequence into array of individual letters
chomp $sequence;
@chars = split(//, $sequence);

for ($i=0; $i<=$#chars; $i++) {
      if($char[$i] eq "."){
      push(@found, $i);
      $i=$i+$offset;
      }
      
   }

print @found;
print "\n $length";



I'm not quite sure why you would want the redundancy of printing the seed info and the sequence length multiple times. The end position is also easily found as start+seed length.
The example I posted does the following things

1. Scans thru the fasta file and gets the sequence.
2. Matches the sequence to seed sequence. Each time it matches, it replaces those letters with periods. Make sure you change it so the number of periods added equals the length of the seed sequence. Now your sequence looks something like this

TCG...AGCTA...A

3. It breaks the sequence down into an array of chars, each array element is a letter. It attempts to find the occurence of ".". The period signifies the start of the seed. Then it skips forward seed length-1 to find the next seed occurence.

4. Each seed start position is recorded in the found array which gets printed along with the sequence length.

On an unrelated note, why are you working in China? I thought the Europeans have great bioinformatics programs...the germans certainly do.
Living one day at a time;
Enjoying one moment at a time;
Accepting hardships as the pathway to peace;
~Niebuhr
User avatar
mith
Inland Taipan
Inland Taipan
 
Posts: 5345
Joined: Thu Jan 20, 2005 8:14 pm
Location: Nashville, TN

Re: Genomic Scan, Frequency & Distribution of SEED

Postby Kwak » Sun Jan 13, 2008 7:00 am

Below I posted the code that I eventually managed to frabricate with some help from tutorials that elucidated perl for me. I read the fasta seed from a fasta file and the genomic db is being read in line by line buffered. The code is not beautiful but does the job. The superfluous printing is because I have many seed sequences and many databases to scan and I don't want to get lost in my own data. The tabs in the output are for later processing.

hmm China, where to start. I originally come from Wageningen, The Netherlands. I wanted something more than just studying with mediocre effort and getting a job in the commercial sector (like many students sadly do these days) and came here. Why China? Because China is not 'Western'.The standards ofcourse are not quite up to par just yet, but they are definitly getting there. Being here is great, it has opened my eyes as a scientist and as a person because the differences are surprisingly immense. Getting things done here takes more effort but that's has been a learning experience too. The value of having been here the last two years = priceless. I wonder what it will be like though if I decide to go back to Europe, time will tell.

Mith you're a legend for helping out people during your vacation.

With great appreciation,
my best regards!

Code: Select all
#!/usr/local/bin/perl
use warnings;
use strict;
use Carp;

# Make sure the program is called with two arguments the first the db the other the fasta to scan with

open(FSTRING,$ARGV[1]) or croak("Database $ARGV[1] not found:$!\n");
my $fragmenth = <FSTRING>;
chomp $fragmenth;
my $fragment =  <FSTRING>;
my $fraglen = length($fragment);
close (FSTRING);

# $fragment: the pattern to search for
# $fraglen:  the length of $fragment
# $buffer:   a buffer to hold the DNA from the input file
# $position: the position of the buffer in the total DNA
my $position = 0;

# The first line of a FASTA file is a header and begins with '>'
open(DNA,"$ARGV[0]") or croak("Cannot open $ARGV[0]:$!\n");
my $header =  <DNA>;
chomp $header;
# Get the first line of DNA data, to start the ball rolling
my $buffer = <DNA>;
my $count = 0;
chomp $buffer;
my $str = time();
# The remaining lines are DNA data ending with newlines
while(my $newline = <DNA>) {

    # Add the new line to the buffer      
    chomp $newline;
    $buffer .= $newline;
    # Search for the DNA fragment and length
    # (Report the character at string position 0 as being at position 1,
    # as usual in biology)
    while($buffer =~ /$fragment/gi) {
    $count ++;
    print "$count\t$fragment\t$fragmenth\t$ARGV[0]\t",$position + $-[0] + 1,"\t", $position + $-[0] + $fraglen,"\t$str\n";
    open (DBMATCH, ">>output.dat");   
    print DBMATCH ("$count\t$fragment\t$fragmenth\t$ARGV[0]\t",$position + $-[0] + 1,"\t", $position + $-[0] + $fraglen,"\t$str   \n");
    close (DBMATCH);

    }

    # Reset the position counter (will be true after you reset the buffer, next)
    $position = $position + length($buffer) - $fraglen + 1;

    # Discard the data in the buffer, except for a portion at the end
    # so patterns that appear across line breaks are not missed
    $buffer = substr($buffer, length($buffer) - $fraglen + 1, $fraglen - 1);
}
my $totlength = $position + $fraglen - 1;

open (DBMATCH, ">>output.dat");   
print DBMATCH ("\t\t\t\t\t\t\tTOTAL\t$count\tmatches\t$totlength\tlengh\tath.fa\t$ARGV[0]\t$fragmenth\t$fragment\n\n");
close (DBMATCH);
print "\t\t\t\t\t\t\tTOTAL\t$count\tmatches\t$totlength\tlengh\tath.fa\t$ARGV[0]\t$fragmenth\t$fragment\n\n";

print $str;
close (DNA);
Kwak
Garter
Garter
 
Posts: 3
Joined: Fri Jan 11, 2008 2:17 pm

Postby mith » Sun Jan 13, 2008 8:16 pm

Hey, best of luck. Looks better than my code, but if you want to skip all the fasta file parsing, just install the biofastastream module.
Living one day at a time;
Enjoying one moment at a time;
Accepting hardships as the pathway to peace;
~Niebuhr
User avatar
mith
Inland Taipan
Inland Taipan
 
Posts: 5345
Joined: Thu Jan 20, 2005 8:14 pm
Location: Nashville, TN


Return to Bioinformatics

Who is online

Users browsing this forum: No registered users and 0 guests