I found the code that parses the fasta frmated file. I need to count how many A, T, G and so on is in each sequence, for example:
>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIIL
in this sequence theres:
M - 2
R - 4
G - 1
L - 3
I - 2
P - 1
The code is very simple:
def FASTA(filename):
try:
f = file(filename)
except IOError:
print "The file, %s, does not exist" % filename
return
order = []
sequences = {}
for line in f:
if line.startswith('>'):
name = line[1:].rstrip('\n')
name = name.replace('_', ' ')
order.append(name)
sequences[name] = ''
else:
sequences[name] += line.rstrip('\n').rstrip('*')
print "%d sequences found" % len(order)
return order, sequences
x, y = FASTA("drosoph_b.fasta")
but how can I count those amino acids? I dont want to use BioPython, I would like to know how to do this with, for example count...
An alternative to what katrielalex mentioned in the comments would to use another dictionary, code is below
def FASTA(filename):
try:
f = file(filename)
except IOError:
print "The file, %s, does not exist" % filename
return
order = []
sequences = {}
counts = {}
for line in f:
if line.startswith('>'):
name = line[1:].rstrip('\n')
name = name.replace('_', ' ')
order.append(name)
sequences[name] = ''
else:
sequences[name] += line.rstrip('\n').rstrip('*')
for aa in sequences[name]:
if aa in counts:
counts[aa] = counts[aa] + 1
else:
counts[aa] = 1
print "%d sequences found" % len(order)
print counts
return order, sequences
x, y = FASTA("drosoph_b.fasta")
this outputs:
1 sequences found
{'G': 1, 'I': 2, 'M': 2, 'L': 3, 'P': 1, 'R': 4}
MRMRGRRLLPIILXKKK I would like to have output like this: M - 2, R - 2, RR - 2, G - 1, LL - 2, P - 1, I - 2, L - 1, X - 1, KKK - 1 - mazix 2013-01-23 10:14
As katrielalex points out, collections.Counter is a great fit for the task:
In [1]: from collections import Counter
In [2]: Counter('MRMRGRRLLPIIL')
Out[2]: Counter({'R': 4, 'L': 3, 'M': 2, 'I': 2, 'G': 1, 'P': 1})
You can apply it to the values of sequences dict in your code.
However, I'd recommend against using this code in real life. Libraries such as BioPython do it better. For example, the code you show will generate quite bulky data structures. As FASTA files are sometimes quite large, you can possibly run out of memory. Also, it doesn't handle possible exceptions in the best way possible.
Finally, using libraries just saves you time.
Example code with BioPython:
In [1]: from Bio import SeqIO
In [2]: from collections import Counter
In [3]: for entry in SeqIO.parse('1.fasta', 'fasta'):
...: print Counter(entry.seq)
...:
Counter({'R': 4, 'L': 3, 'I': 2, 'M': 2, 'G': 1, 'P': 1})
MRMRGRRLLPIIL I would like to have output like this: M - 2, R - 2, RR - 2, G - 1, LL - 2, P - 1, I - 2, L - 1 - mazix 2013-01-23 10:14
for aa, s in groupby('MRMRGRRLLPIILXKKK'): print ''.join(s) will give you an idea. But for repeated LL you'll have to increment the counter for L by two, etc - Lev Levitsky 2013-01-23 12:07
This can be obtained using very simple bash commands, my answer is below
cat input.fasta #my input file
>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIIL
cat input.fasta | grep -v ">" | fold -w1 | sort | uniq -c
Output:
1 G
2 I
3 L
2 M
1 P
4 R
fold -w1 splits at every character, you sort them and count the unique ones
# your previous code here
x, y = FASTA("drosoph_b.fasta")
import collections
for order in x:
print order, ':',
print '\n'.join('%s - %d' % (k, v) for k, v in collections.Counter(y[order]).iteritems())
print
>lines and the sequence lines in two lists. To count the characters in the sequence lines you wantcollections.Counter(line)- Katriel 2013-01-18 11:13