Python: how do I match the string - only if the next line has a given string?

I have a text file, which looks like this:

node13 
    state = free 
np = 8 
properties = beta,eightcores 
ntype = cluster 
status = opsys=linux,uname=Linux node13 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64,sessions=? 15201,nsessions=? 01,nusers=0,idletime=6837317,totmem=20506268kb,availmem=20259728kb,physmem=20506268kb,ncpus=8,loadave=0.00,gres=,netload=17130666575,se=free,jobs=,varattr=,rectime=1333639375 

node14 
    state = job-exclusive 
np = 8 
properties = beta,eightcores 
ntype = cluster

I want to grab nodes only if they are free. For that I need a regexp that matches node(..) only if the following line contains state = free. Can you help me with this?

Edit:

Nothing works so far. Maybe that's because I'm not reading from a file, but capturing the output of a command:

import subprocess

proc = subprocess.Popen("pbsnodes", stdout=subprocess.PIPE)
# raw output of pbsnodes (a bytes object on Python 3)
listOfFreeNodes = proc.stdout.read()

Could that affect the solutions somehow? Here's the full pbsnodes output:

node01                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node01 2.6.27.19-5-01,nusers=0,idletime=861913,totmem=16432576kb,availmem=16=free,jobs=,varattr=,rectime=1333641123                  

node02                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node02 2.6.27.19-5-nusers=2,idletime=5357510,totmem=16432576kb,availmem=1617ree,jobs=,varattr=,rectime=1333641107                    

node03                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node03 2.6.27.19-5-s=1,idletime=8564681,totmem=16432576kb,availmem=16029924kobs=60966.hpchead.linux,varattr=,rectime=1333641119      

node04                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node04 2.6.27.19-5-01,nusers=0,idletime=8564678,totmem=16432576kb,availmem=1e=free,jobs=,varattr=,rectime=1333641124                 

node05                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node05 2.6.27.19-5-01,nusers=0,idletime=2072593,totmem=16432652kb,availmem=1=free,jobs=,varattr=,rectime=1333641091                  

node06                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node06 2.6.27.19-5-s=1,idletime=9038,totmem=16432576kb,availmem=16200960kb,p,varattr=,rectime=1333641096                             

node07                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node07 2.6.27.19-5-s=1,idletime=8564671,totmem=16432576kb,availmem=16173848kobs=,varattr=,rectime=1333641134                         

node08                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node08 2.6.27.19-5- 21356,nsessions=5,nusers=1,idletime=8564604,totmem=1643219260329746,state=free,jobs=,varattr=,rectime=1333641095 

node09                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node09 2.6.27.19-5-01,nusers=0,idletime=8564648,totmem=16432552kb,availmem=1e=free,jobs=,varattr=,rectime=1333641126                 

node10                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node10 2.6.27.19-5-2,nsessions=5,nusers=1,idletime=6821493,totmem=16432552kb036941,state=free,jobs=,varattr=,rectime=1333641133      

node11                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node11 2.6.27.19-5-01,nusers=0,idletime=8564599,totmem=16432556kb,availmem=1e=free,jobs=,varattr=,rectime=1333641120                 

node12                                                   
     state = free                                        
     np = 8                                              
     properties = alpha,eightcores                       
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node12 2.6.27.19-5-01,nusers=0,idletime=8564627,totmem=16432556kb,availmem=1e=free,jobs=,varattr=,rectime=1333641121                 

node13                                                   
     state = free                                        
     np = 8                                              
     properties = beta,eightcores                        
     ntype = cluster                                     
     status = opsys=linux,uname=Linux node13 2.6.27.19-5-01,nusers=0,idletime=6839072,totmem=20506268kb,availmem=2e=free,jobs=,varattr=,rectime=1333641130                 

node14                                                   
     state = job-exclusive                               
     np = 8                                              
     properties = beta,eightcores                        
     ntype = cluster                                     
     jobs = 0/66481.hpchead.linux, 1/66481.hpchead.linux,chead.linux, 6/66481.hpchead.linux, 7/66481.hpchead.linux
     status = opsys=linux,uname=Linux node14 2.6.27.19-5-,nusers=1,idletime=8568052,totmem=24635060kb,availmem=206free,jobs=66481.hpchead.linux,varattr=,rectime=1333641132

node15                                                   
     state = job-exclusive                               
     np = 8                                              
     properties = beta,eightcores                        
     ntype = cluster                                     
     jobs = 0/66482.hpchead.linux, 1/66482.hpchead.linux,chead.linux, 6/66482.hpchead.linux, 7/66482.hpchead.linux
     status = opsys=linux,uname=Linux node15 2.6.27.19-5-,nusers=1,idletime=8567636,totmem=24635012kb,availmem=212free,jobs=66482.hpchead.linux,varattr=,rectime=1333641092

node16                                                   
     state = job-exclusive                               
     np = 8                                              
     properties = beta,eightcores                        
     ntype = cluster                                     
     jobs = 0/66481.hpchead.linux, 1/66481.hpchead.linux,chead.linux, 6/66481.hpchead.linux, 7/66481.hpchead.linux
     status = opsys=linux,uname=Linux node16 2.6.27.19-5-=1,idletime=8564418,totmem=24634928kb,availmem=20700104kbbs=66481.hpchead.linux,varattr=,rectime=1333641117       

node17                                                   
     state = job-exclusive                               
     np = 8                                              
     properties = beta,eightcores                        
     ntype = cluster                                     
     jobs = 0/66482.hpchead.linux, 1/66482.hpchead.linux,chead.linux, 6/66482.hpchead.linux, 7/66482.hpchead.linux
     status = opsys=linux,uname=Linux node17 2.6.27.19-5-s=1,idletime=6824915,totmem=24634928kb,availmem=20598068kbs=66482.hpchead.linux,varattr=,rectime=1333641113       

node21                                                   
     state = job-exclusive                               
     np = 12                                             
     properties = blade                                  
     ntype = cluster                                     
     jobs = 0/66483.hpchead.linux, 1/66483.hpchead.linux,chead.linux, 6/66483.hpchead.linux, 7/66483.hpchead.linux.hpchead.linux                                           
     status = opsys=linux,uname=Linux node21 2.6.27.19-5-,nusers=1,idletime=8569176,totmem=26790348kb,availmem=203e=free,jobs=66483.hpchead.linux,varattr=,rectime=13336411

node22                                                   
     state = job-exclusive                               
     np = 12                                             
     properties = blade                                  
     ntype = cluster                                     
     jobs = 0/66475.hpchead.linux, 1/66475.hpchead.linux,chead.linux, 6/66475.hpchead.linux, 7/66475.hpchead.linux.hpchead.linux                                           
     status = opsys=linux,uname=Linux node22 2.6.27.19-5-users=1,idletime=8569178,totmem=26790348kb,availmem=21384free,jobs=66475.hpchead.linux,varattr=,rectime=1333641118

node23                                                   
     state = job-exclusive                               
     np = 12                                             
     properties = blade
     ntype = cluster
     jobs = 0/66484.hpchead.linux, 1/66484.hpchead.linux, 2/66484.hpchead.linux, 3/66484.hpchead.linux, 4/66484.hpchead.linux, 5/66484.hpchead.linux, 6/66484.hpchead.linux, 7/66484.hpchead.linux, 8/66484.hpchead.linux, 9/66484.hpchead.linux, 10/66484.hpchead.linux, 11/66484.hpchead.linux
     status = opsys=linux,uname=Linux node23 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64,sessions=10309 10370,nsessions=2,nusers=1,idletime=8569255,totmem=26790348kb,availmem=20165484kb,physmem=24685876kb,ncpus=12,loadave=12.01,gres=,netload=21742922098,state=free,jobs=66484.hpchead.linux,varattr=,rectime=1333641120

node24
     state = job-exclusive
     np = 12
     properties = blade
     ntype = cluster
     jobs = 0/66485.hpchead.linux, 1/66485.hpchead.linux, 2/66485.hpchead.linux, 3/66485.hpchead.linux, 4/66485.hpchead.linux, 5/66485.hpchead.linux, 6/66485.hpchead.linux, 7/66485.hpchead.linux, 8/66485.hpchead.linux, 9/66485.hpchead.linux, 10/66485.hpchead.linux, 11/66485.hpchead.linux
     status = opsys=linux,uname=Linux node24 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64,sessions=11157 11218,nsessions=2,nusers=1,idletime=8569254,totmem=26790348kb,availmem=21489804kb,physmem=24685876kb,ncpus=12,loadave=12.05,gres=,netload=18486923435,state=free,jobs=66485.hpchead.linux,varattr=,rectime=1333641114

node25
     state = job-exclusive
     np = 12
     properties = blade
     ntype = cluster
     jobs = 0/66469.hpchead.linux, 1/66469.hpchead.linux, 2/66469.hpchead.linux, 3/66469.hpchead.linux, 4/66469.hpchead.linux, 5/66469.hpchead.linux, 6/66469.hpchead.linux, 7/66469.hpchead.linux, 8/66469.hpchead.linux, 9/66469.hpchead.linux, 10/66469.hpchead.linux, 11/66469.hpchead.linux
     status = opsys=linux,uname=Linux node25 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64,sessions=6711 6772,nsessions=2,nusers=1,idletime=8569282,totmem=26790348kb,availmem=21082316kb,physmem=24685876kb,ncpus=12,loadave=12.00,gres=,netload=15199518313,state=free,jobs=66469.hpchead.linux,varattr=,rectime=1333641095

Edit:

Thanks to all those who have answered.

python
regex

2012-04-05 15:26
by Adobe

Am I correct in assuming that the next line should contain only the text state=free? - machine yearning 2012-04-05 15:30

@Francis W. Usher: Yes, You are - Adobe 2012-04-05 15:38

This should return the correct node value(s)

r'node\d+(?=[^\n]*\n\s*state\s*=\s*free)'

This uses a positive lookahead to peek past the end of line, but not capture anything it finds. It only matches the node value.

import re

# s holds the sample text from the question
l = re.findall(r'node\d+(?=[^\n]*\n\s*state\s*=\s*free)', s)
print(l)
# ['node13']

Edit: Inspired by a comment from @hexparrot, I realized there is a simpler regex: r'node\d+(?=\s*state\s*=\s*free)'. It also works, even though it does not explicitly search for a newline, because \s matches EOL characters. However, it does not guarantee that state = free is found on the following line, as the OP requires: it would also match node99 state=free on the same line. So explicitly searching for the \n better meets the OP's requirements.
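
To make the difference between the two patterns concrete, here is a small sketch. The sample text, including the same-line node99 entry, is made up for illustration and is not real pbsnodes output.

```python
import re

# Pattern from the answer: a newline must sit between the node name and the state.
NEXT_LINE = re.compile(r'node\d+(?=[^\n]*\n\s*state\s*=\s*free)')
# Simpler variant: \s* matches spaces as well as newlines.
ANY_WHITESPACE = re.compile(r'node\d+(?=\s*state\s*=\s*free)')

sample = (
    "node13   \n"
    "     state = free\n"
    "     np = 8\n"
    "\n"
    "node14\n"
    "     state = job-exclusive\n"
    "\n"
    "node99 state = free\n"  # hypothetical same-line case
)

strict = NEXT_LINE.findall(sample)
loose = ANY_WHITESPACE.findall(sample)
print(strict)  # ['node13']
print(loose)   # ['node13', 'node99']
```

Only the variant with the explicit \n rejects the same-line entry.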

2012-04-05 15:30
by alan

It gives me ['node0'] - Adobe 2012-04-05 15:34

@Adobe Sorry. Didn't need the '^' character. Try it now - alan 2012-04-05 15:40

Curious why you use [^\n]+ since there won't necessarily be any non-newline characters before the newline - machine yearning 2012-04-05 15:40

@FrancisW.Usher Well, on Windows there might be \r :) But you're right. * would probably be better - alan 2012-04-05 15:42

Now it gives: ['node0', 'node0', 'node0', 'node0', 'node0', 'node0', 'node0', 'node0', 'node0', 'node1', 'node1', 'node1', 'node1'] - so it can't see the second figure - Adobe 2012-04-05 15:43

@Adobe: not sure what's wrong. It works on your sample input. See edit in answer - alan 2012-04-05 15:49

It works now. I didn't use r at r'regexp' before that. Probably that was the mistake (of mine) - Adobe 2012-04-05 15:58

I have to say - this is the most virtuoso regexp I've seen so far - Adobe 2012-04-05 15:59

If you insist on regex, there's probably better methods than anticipating newlines by consuming them greedily; ^(node[\d]+).*?(\bstate\b) ?= ?free - hexparrot 2012-04-05 17:18

@hexparrot I don't understand your comment. I'm not using a greedy operator to anticipate newlines. Also, the regex you show here would require re.DOTALL. The one in my answer does not - alan 2012-04-05 22:05

My apologies; I worded it poorly. My intent was to say that there is no reason to anticipate a newline and greedily consume until one is found if the newline itself is not a relevant part of the regex. In other words, why 'take until newline'+'take until state', when you can just 'take until state'. If state were the second attribute instead of the first, such an approach would then have to repeat 'take until newline'+'take until newline'+'take until state' to accommodate - hexparrot 2012-04-05 22:18

@hexparrot. Yes, I see your point. I guess the short answer is I have a bias against using DOTALL. So since I was looking for text on the following line, I chose to search for newline first. But I admit it's entirely subjective - alan 2012-04-07 12:50

@hexparrot. I'm probably abusing comments by carrying this a step further, but one reason for including the \n in the regex is because the requirement called for finding a state==foo on the following line. Unless I'm mistaken (always possible), the only way to fulfill that requirement is to look for the \n. Now the horse is officially dead! Cheers - alan 2012-04-08 00:03

Regex is sometimes heavier than necessary if you can depend on the generated file being consistently structured (that is, it follows the same format you've shown).

Thus, here's an approach that uses simple iteration:

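with open('yourfile.txt', 'r') as fp:
    node_dict = {}
    node = None
    for line in fp:
        if line[0:4] == 'node':
            # a node name starts in the first column
            node = line.strip()
            node_dict[node] = 0
        elif line.strip().startswith('state'):
            # only the "state = ..." line, not a status line that mentions state=
            node_dict[node] = line.split('=')[1].strip()

print(node_dict)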

Returns

{'node13': 'free', 'node14': 'job-exclusive'}

It's then very easy to get just the 'free' nodes:

>>> print([k for k, v in node_dict.items() if v == 'free'])
['node13']

2012-04-05 15:42
by hexparrot

Would make more sense to construct the dictionary with the states as keys and lists of nodes having that status as values. Then a simple node_dict["free"][0] gets you the first free node - kindall 2012-04-05 18:26
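
The layout suggested in this comment could be sketched as follows, assuming the same text format as in the question. The helper name nodes_by_state and the shortened sample are made up for illustration.

```python
from collections import defaultdict

def nodes_by_state(text):
    """Group node names by the value of their 'state' line (hypothetical helper)."""
    by_state = defaultdict(list)
    node = None
    for raw in text.splitlines():
        line = raw.strip()
        if raw.startswith('node'):
            # node names start in the first column
            node = line
            continue
        key, sep, value = line.partition('=')
        # compare the whole key so that "status = ..." is not mistaken for "state"
        if node and sep and key.strip() == 'state':
            by_state[value.strip()].append(node)
    return by_state

sample = """node13
     state = free
     np = 8
     status = opsys=linux,state=free,jobs=

node14
     state = job-exclusive
     np = 8
"""

groups = nodes_by_state(sample)
print(groups['free'])           # ['node13']
print(groups['job-exclusive'])  # ['node14']
```

Because this uses a defaultdict, groups['free'] is an empty list when no node is free, so groups['free'][0] would raise IndexError in that case.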

I'd suggest parsing the text into a Python structure first and then manipulating that structure. Regular expressions are too complicated and too fragile for this job. Consider:

doc = """
node13 
    state = free 
np = 8 
properties = beta,eightcores 
ntype = cluster 
status = opsys=linux,uname=Linux node13 2.6.27.19-5-default etc

node14 
    state = job-exclusive 
np = 8 
properties = beta,eightcores 
ntype = cluster
"""

data = {}
lastkey = None
for line in map(str.strip, doc.splitlines()):
    if ' = ' in line and lastkey:
        k, v = line.split(' = ', 1)
        data[lastkey][k] = v
    elif len(line):
        lastkey = line
        data[lastkey] = {}

This creates a dictionary like this:

{'node13': {'np': '8',
            'ntype': 'cluster',
            'properties': 'beta,eightcores',
            'state': 'free',
            'status': 'opsys=linux,uname=Linux node13 2.6.27.19-5-default etc'},
 'node14': {'np': '8',
            'ntype': 'cluster',
            'properties': 'beta,eightcores',
            'state': 'job-exclusive'}}

which you can manipulate in the normal Python way, for example to collect the names of the free nodes:

free_nodes = [k for k, v in data.items() if v.get('state') == 'free']

2012-04-05 15:42
by georg

Good point about the fragile regexps - Adobe 2012-04-05 16:00

You can use the re.DOTALL flag so that . matches everything, including newlines. Here is a sample:

>>> st="""
node13 
    state = free 
np = 8 
properties = beta,eightcores 
ntype = cluster 
status = opsys=linux,uname=Linux node13 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64,sessions=? 15201,nsessions=? 01,nusers=0,idletime=6837317,totmem=20506268kb,availmem=20259728kb,physmem=20506268kb,ncpus=8,loadave=0.00,gres=,netload=17130666575,se=free,jobs=,varattr=,rectime=1333639375 

node14 
    state = job-exclusive 
np = 8 
properties = beta,eightcores 
ntype = cluster
"""

>>> import re
>>> re.findall(r"(node\d+).*?state.*?free", st, re.DOTALL)
['node13']

Please note, this can also be done without regex

>>> stlines = st.splitlines()
>>> [stlines[i].strip() for i in range(len(stlines) - 1) if stlines[i+1].partition("=")[-1].strip() == 'free']
['node13']
>>>

Note: if you need a more robust regex, as Francis has pointed out, you can use the one below:

>>> re.findall(r"(node\d+).*?state[ ]*=[ ]*free", st, re.DOTALL)
['node13']
>>>
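
One caveat is worth checking before relying on the DOTALL variant with longer input. The sketch below uses a made-up two-node input in which the first node is busy, and compares the DOTALL pattern with the lookahead pattern from the other answer.

```python
import re

# Hypothetical input: the first node is busy, the second one is free.
text = (
    "node14\n"
    "     state = job-exclusive\n"
    "     np = 8\n"
    "\n"
    "node15\n"
    "     state = free\n"
    "     np = 8\n"
)

# With DOTALL, the lazy .*? can run past node14's own block until it
# reaches the "state = free" line that belongs to node15.
dotall_hits = re.findall(r"(node\d+).*?state[ ]*=[ ]*free", text, re.DOTALL)

# Tying the state to the line right after the node name avoids that.
next_line_hits = re.findall(r"node\d+(?=[^\n]*\n\s*state\s*=\s*free)", text)

print(dotall_hits)     # ['node14']
print(next_line_hits)  # ['node15']
```

The DOTALL pattern reports the busy node because nothing in it stops the match from crossing into the next node's block.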

2012-04-05 15:32
by Abhijit

Wouldn't state.*?free match the string "statedsjc3(*@N(*RNWNWNSD*S*(Y#N(F#*(DFM#(#N(#$($#(#$N(#(*free" - machine yearning 2012-04-05 15:38

@FrancisW.Usher: Of course it will, but I have used the example to demonstrate the re.DOTALL feature. If you need a more robust regex, you can easily build on it - Abhijit 2012-04-05 15:41

Sure, but by that logic I could say that the regex .* matches his string with DOTALL on. Your regex is incorrect because it matches a super-set of the desired strings - machine yearning 2012-04-05 15:42

@FrancisW.Usher: I have used the example provided by OP. Anyway I have just updated my answer with a bit more robust version - Abhijit 2012-04-05 15:44

I agree with @thg435 that regex is too powerful for this job. I'd prefer a really simple solution:

lines = data.split('\n')
num_lines = len(lines)
[lines[i] for i in range(num_lines - 1) if 'state = free' in lines[i+1]]

This really captures the essence of what you want to do: if the next line (lines[i+1]) contains the desired text, the current line (presumably the name of the node) goes into the list.
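
The same idea can be written without index arithmetic by pairing each line with its successor. This is only a sketch: the function name is made up, and it strips every line first, on the assumption that the real output is padded with spaces as in the full pbsnodes listing.

```python
def nodes_followed_by_free(text):
    """Return each line whose next line is exactly 'state = free' (hypothetical helper)."""
    lines = [line.strip() for line in text.splitlines()]
    # zip(lines, lines[1:]) yields (current line, next line) pairs
    return [cur for cur, nxt in zip(lines, lines[1:]) if nxt == 'state = free']

sample = (
    "node01      \n"
    "     state = free      \n"
    "     np = 8\n"
    "\n"
    "node14      \n"
    "     state = job-exclusive\n"
    "     np = 8\n"
)

free = nodes_followed_by_free(sample)
print(free)  # ['node01']
```

Stripping also removes the trailing padding from the node names that are returned.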

2012-04-05 15:52
by machine yearning

Isn't this one of the solutions I posted earlier? - Abhijit 2012-04-05 15:58

It works on the given example - but doesn't work in my script. I don't know why - Adobe 2012-04-05 16:02

@Abhijit Not really, your example was a bit more complicated, and less correct since you're only checking if the right-hand side of the equality is free... what about the left-hand side? - machine yearning 2012-04-05 16:24

It's often easier to look backward than to look ahead. So don't think about getting the current line when the next line contains something; you want to get the previous line when the current line contains something. Framed in these terms, it is easy to conceive and implement:

def find_free_node(doc):
    prevline = ""
    for line in doc.splitlines():
        # report the previous line when the current one is the state line
        if line.strip() == "state = free" and prevline.startswith("node"):
            return prevline.strip()
        prevline = line

Another way is to keep track of what node you're in rather than what the previous line was. This will work even if the state = free line doesn't immediately follow the node name line.

def find_free_node(doc):
    node = ""
    for line in doc.splitlines():
        if line.startswith("node"):
            # remember which node block we are in
            node = line.strip()
        elif line.strip() == "state = free" and node:
            return node

To me, these are a lot clearer than multiline-regex-based solutions.
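
Both functions above return only the first free node. Since the question asks for all of them, the second approach could be turned into a generator. This is a sketch: the name iter_free_nodes and the sample text are made up for illustration.

```python
def iter_free_nodes(doc):
    """Yield the name of every node whose block contains 'state = free'."""
    node = ""
    for line in doc.splitlines():
        stripped = line.strip()
        if line.startswith("node"):
            # node names start in the first column; attribute lines are indented
            node = stripped
        elif stripped == "state = free" and node:
            yield node

sample = (
    "node01      \n"
    "     state = free      \n"
    "     np = 8\n"
    "\n"
    "node14\n"
    "     state = job-exclusive\n"
    "\n"
    "node02\n"
    "     np = 8\n"
    "     state = free\n"
)

free_nodes = list(iter_free_nodes(sample))
print(free_nodes)  # ['node01', 'node02']
```

The node02 block shows that the state line is found even when it is not the first attribute.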

2012-04-05 18:24
by kindall