How to count amino acids in fasta formated file? - 【StackMirror】|python|bioinformatics|biopython|fasta

I found the code that parses the fasta frmated file. I need to count how many A, T, G and so on is in each sequence, for example:

>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIIL

in this sequence theres:

M - 2
R - 4
G - 1
L - 3
I - 2
P - 1

The code is very simple:

 def FASTA(filename):
  try:
    f = file(filename)
  except IOError:                     
    print "The file, %s, does not exist" % filename
    return

  order = []
  sequences = {}

  for line in f:
    if line.startswith('>'):
      name = line[1:].rstrip('\n')
      name = name.replace('_', ' ')
      order.append(name)
      sequences[name] = ''
    else:
      sequences[name] += line.rstrip('\n').rstrip('*')

  print "%d sequences found" % len(order)
  return order, sequences

x, y = FASTA("drosoph_b.fasta")

but how can I count those amino acids? I dont want to use BioPython, I would like to know how to do this with, for example count...

2013-01-18 10:59
by mazix

Your code just puts the > lines and the sequence lines in two lists. To count the characters in the sequence lines you want collections.Counter(line) - Katriel 2013-01-18 11:13

"I dont want to use BioPython" -- why? It's fine if you just want to learn Python, but otherwise there are good reasons to use existing libraries - Lev Levitsky 2013-01-18 11:40

you could also ask biostars: http://biostar.stackexchange.com - Pierre 2013-01-18 18:20

An alternative to what katrielalex mentioned in the comments would to use another dictionary, code is below

def FASTA(filename):
  try:
    f = file(filename)
  except IOError:                     
    print "The file, %s, does not exist" % filename
    return

  order = []
  sequences = {}
  counts = {}

  for line in f:
    if line.startswith('>'):
      name = line[1:].rstrip('\n')
      name = name.replace('_', ' ')
      order.append(name)
      sequences[name] = ''
    else:
      sequences[name] += line.rstrip('\n').rstrip('*')
      for aa in sequences[name]:
        if aa in counts:
            counts[aa] = counts[aa] + 1
        else:
            counts[aa] = 1  


  print "%d sequences found" % len(order)
  print counts
  return order, sequences

x, y = FASTA("drosoph_b.fasta")

this outputs:

1 sequences found
{'G': 1, 'I': 2, 'M': 2, 'L': 3, 'P': 1, 'R': 4}

2013-01-18 11:18
by Harpal

is it possible to count sequences in this sequence, I mean: with my example: MRMRGRRLLPIILXKKK I would like to have output like this: M - 2, R - 2, RR - 2, G - 1, LL - 2, P - 1, I - 2, L - 1, X - 1, KKK - 1 - mazix 2013-01-23 10:14

As katrielalex points out, collections.Counter is a great fit for the task:

In [1]: from collections import Counter

In [2]: Counter('MRMRGRRLLPIIL')
Out[2]: Counter({'R': 4, 'L': 3, 'M': 2, 'I': 2, 'G': 1, 'P': 1})

You can apply it to the values of sequences dict in your code.

However, I'd recommend against using this code in real life. Libraries such as BioPython do it better. For example, the code you show will generate quite bulky data structures. As FASTA files are sometimes quite large, you can possibly run out of memory. Also, it doesn't handle possible exceptions in the best way possible.

Finally, using libraries just saves you time.

Example code with BioPython:

In [1]: from Bio import SeqIO

In [2]: from collections import Counter

In [3]: for entry in SeqIO.parse('1.fasta', 'fasta'):
   ...:     print Counter(entry.seq)
   ...:     
Counter({'R': 4, 'L': 3, 'I': 2, 'M': 2, 'G': 1, 'P': 1})

2013-01-18 11:18
by Lev Levitsky

is it possible to count sequences in this sequence, I mean: with my example: MRMRGRRLLPIIL I would like to have output like this: M - 2, R - 2, RR - 2, G - 1, LL - 2, P - 1, I - 2, L - 1 - mazix 2013-01-23 10:14

@mazix That's more tricky. You can use <code>itertools.groupby</code> to split the string: for aa, s in groupby('MRMRGRRLLPIILXKKK'): print ''.join(s) will give you an idea. But for repeated LL you'll have to increment the counter for L by two, etc - Lev Levitsky 2013-01-23 12:07

This can be obtained using very simple bash commands, my answer is below

cat input.fasta #my input file
>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
    MRMRGRRLLPIIL

cat input.fasta | grep -v ">" | fold -w1 | sort | uniq -c

Output:

fold -w1 splits at every character, you sort them and count the unique ones

2013-01-30 13:12
by bala

# your previous code here

x, y = FASTA("drosoph_b.fasta")

import collections

for order in x:
  print order, ':',
  print '\n'.join('%s - %d' % (k, v) for k, v in collections.Counter(y[order]).iteritems())
  print

2013-01-18 13:09
by Nikolay Vyahhi