Searching a string and returning only things I specify


0

Hopefully this post goes better..

So I am stuck on this feature of this program that will return the whole word where a certain keyword is specified.

For example, if I tell it to look for "I=" in the string "blah blah blah blah I=1mV blah blah etc?", it should return the whole word in which the keyword is found, so in this case it would return I=1mV.

I have tried a bunch of different approaches, such as,

import re

text = "One of the values, I=1mV is used"
print(re.split('I=', text))

However, this returns the pieces of the string on either side of I=, with the I= itself dropped, so it returns

['One of the values, ', '1mV is used']

If I try regex solutions, I run into the problem that the number could have more than one digit, so the code below only works if the number is a single digit. If the value were I=10mV, it would only return I=1, but if I put [/0-9] in twice, the code no longer works with a single-digit value.

text = "One of the values, I=1mV is used"
print(re.findall("I=[/0-9]", text))

['I=1'] 

When I tried using re.search, I only got a match object back:

text = "One of the values, I=1mV is used"
print(re.search("I=", text))

<_sre.SRE_Match object at 0x02408BF0>

What is a good way to retrieve the word (In this case, I want to retrieve I=1mV) and cut out the rest of the string?

2012-04-04 02:29
by user1210304
+1! For the record - thank you for asking an excellent question. You've said what you want, shown what you've tried, and generally demonstrated that you are willing to learn. Bravo - Li-aung Yip 2012-04-04 04:38


2

A better way would be to split the text into words first:

>>> text = "One of the values, I=1mV is used"
>>> words = text.split()
>>> words
['One', 'of', 'the', 'values,', 'I=1mV', 'is', 'used']

And then filter the words to find the one you need:

>>> [w for w in words if 'I=' in w]
['I=1mV']

This returns a list of all words with I= in them. We can then just take the first element found:

>>> [w for w in words if 'I=' in w][0]
'I=1mV'

Done! To clean this up a bit, we can just look for the first match rather than checking every word. We can use a generator expression for that:

>>> next(w for w in words if 'I=' in w)
'I=1mV'

Of course, you could adapt the if condition to fit your needs better. For example, you could use str.startswith() to check whether the word starts with a certain string, or re.match() to check whether the word matches a pattern.
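
As a rough sketch of that last suggestion, str.startswith() can stand in for the in test, and giving next() a default value avoids a StopIteration when no word matches. The helper name first_word_with_prefix is made up here for illustration:

```python
def first_word_with_prefix(text, prefix):
    """Return the first whitespace-separated word starting with prefix, or None."""
    words = text.split()
    # next() with a default returns None instead of raising StopIteration
    return next((w for w in words if w.startswith(prefix)), None)

found = first_word_with_prefix("One of the values, I=1mV is used", "I=")
missing = first_word_with_prefix("No current given here", "I=")
print(found)
print(missing)
```

Here found should be 'I=1mV' and missing should be None.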

2012-04-04 02:37
by Niklas B.
Very helpful. Thanks - user1210304 2012-04-04 18:59
Question though, what is the type of the value that is being returned? Is it a string or a list - user1210304 2012-04-06 03:20
@user: if there are square brackets, it's a list - Niklas B. 2012-04-06 09:01


2

Using string methods

For the record, your attempt to split the string in two halves, using I= as the separator, was nearly correct. Instead of using re.split() (or str.split()), which discards the separator, you could have used str.partition(), which keeps it.

>>> my_text = "Loadflow current was I=30.63kA"
>>> my_text.partition("I=")
('Loadflow current was ', 'I=', '30.63kA')
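
To get from that 3-tuple to the whole word, one option is to glue the separator back onto the first whitespace-delimited chunk of the tail. This is only a sketch, and word_after_separator is a hypothetical helper name, not something from the standard library:

```python
def word_after_separator(text, sep):
    """Return sep plus the run of non-space characters that follows it, or None."""
    head, found_sep, tail = text.partition(sep)
    if not found_sep:
        # partition() returns (text, '', '') when sep is absent
        return None
    # split(None, 1) splits on whitespace; an empty or all-space tail gives []
    pieces = tail.split(None, 1)
    value = pieces[0] if pieces else ""
    return found_sep + value

result = word_after_separator("Loadflow current was I=30.63kA", "I=")
print(result)
```

Note that this only looks at the first occurrence of the separator, because partition() stops at the first one it finds.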

Using regular expressions

A more flexible and robust solution is to use a regular expression:

>>> import re
>>> pattern = r"""
... I=             # specific string "I="
... \s*            # Possible whitespace
... -?             # possible minus sign
... \s*            # possible whitespace
... \d+            # at least one digit
... (\.\d+)?       # possible decimal part
... """
>>> m = re.search(pattern, my_text, re.VERBOSE)
>>> m
<_sre.SRE_Match object at 0x044CCFA0>
>>> m.group()
'I=30.63'

This accounts for a lot more possibilities (negative numbers, integer or decimal numbers).

Note the use of:

  • Quantifiers to say how many of each thing you want.
    • a* - zero or more a's
    • a+ - at least one a
    • a? - "optional" - one or zero a's
  • Verbose regular expression (re.VERBOSE flag) with comments - much easier to understand the pattern above than the non-verbose equivalent, I=\s*-?\s*\d+(\.\d+)?.
  • Raw strings for regexp patterns, r"..." instead of plain strings "..." - means that literal backslashes don't have to be escaped. Not strictly required here, because \s, \d and \. are not string escape sequences and so pass through a plain string unchanged (recent Python versions warn about relying on this), but one day you'll need to match C:\Program Files\... and on that day you will need raw strings.
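
As a quick sanity check of the points above, the verbose pattern and a one-line version can be run side by side. The compact form is written here as I=\s*-?\s*\d+(\.\d+)? so that each quantifier mirrors the verbose version, and the sample strings are made up for illustration:

```python
import re

VERBOSE_PATTERN = r"""
I=             # specific string "I="
\s*            # possible whitespace
-?             # possible minus sign
\s*            # possible whitespace
\d+            # at least one digit
(\.\d+)?       # possible decimal part
"""
COMPACT_PATTERN = r"I=\s*-?\s*\d+(\.\d+)?"

samples = ["Loadflow current was I=30.63kA", "I= -5 A", "I=10mV"]
# group() with no argument returns the whole matched text
verbose_hits = [re.search(VERBOSE_PATTERN, s, re.VERBOSE).group() for s in samples]
compact_hits = [re.search(COMPACT_PATTERN, s).group() for s in samples]
print(verbose_hits)
print(compact_hits)
```

Both lists should come out identical, which is the point: re.VERBOSE changes how the pattern is written, not what it matches.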

Exercises

  • Exercise 1: How do you extend this so that it can match the unit as well? And how do you extend this so that it can match the unit as either mA, A, or kA? Hint: "Alternation operator".

  • Exercise 2: How do you extend this so that it can match numbers in engineering notation, i.e. "1.00e3", or "-3.141e-4"?

2012-04-04 04:22
by Li-aung Yip
I appreciate the answer. Helped broaden my understanding a lot - user1210304 2012-04-04 18:59


1

import re
text = "One of the values, I=1mV is used"
l = re.split('I=', text)
# l[1] is everything after the first "I="; split() with no argument also skips leading spaces
print(l[1].split()[0])

If you have more than one I=, do the above for every index in l from 1 onward, since index 0 holds the text before the first I=.

This is a good approach because someone could write "One of the values, I= 1mV is used", and I guess you would still want to get that I is 1mV.

BTW, I is current and its units are amperes, not volts :)
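
For the case with several I= occurrences, a minimal sketch might look like the following. The sample text is invented, and stripping trailing punctuation with rstrip() is an extra assumption about what counts as part of the value:

```python
import re

text = "Readings: I= 1mV, then I=10mV and finally I=2.5mV."
parts = re.split('I=', text)
# parts[0] is the text before the first "I="; every later part starts with a value
values = [
    part.split()[0].rstrip('.,;')   # first chunk after "I=", minus trailing punctuation
    for part in parts[1:]
    if part.split()                 # skip parts that are empty or only whitespace
]
print(values)
```

Each entry in values is the value only, without the I= prefix, which matches what the snippet above prints.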

2012-04-04 02:39
by 0x90


1

With your re.findall attempt you would want to add a + which means one or more.
Here are some examples:

import re

test = "This is a test with I=1mV, I=1.414mv, I=10mv and I=1.618mv."

result = re.findall(r'I=[\d\.]+m[vV]', test)

print(result)

test = "One of the values, I=1mV is used"

result = re.search(r'I=([\d\.]+m[vV])', test)

print(result.group(1))

The first print is: ['I=1mV', 'I=1.414mv', 'I=10mv', 'I=1.618mv']

I've grouped everything other than I= in the re.search example,
so the second print is: 1mV
in case you are interested in extracting just that part.
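
If the unit is not always mV, a looser variant is to take every non-space character that follows the keyword. This is a sketch under that assumption, with words_containing_key as a made-up helper name:

```python
import re

def words_containing_key(text, key):
    """Return every run of non-space characters that starts at key."""
    # re.escape() guards against keys containing regex metacharacters
    return re.findall(re.escape(key) + r"\S+", text)

hits = words_containing_key("One of the values, I=1mV is used", "I=")
print(hits)
```

The trade-off is that \S+ also swallows punctuation attached to the word, so "I=1.618mv." at the end of a sentence would keep its final period.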

2012-04-04 03:18
by Honest Abe