Filtering Items in a List
Suppose we have a list. Often, we want to gather only the items that meet certain criteria. Below, we have a list of words, and we want to extract from it only the ones that contain 'wo'. For this, we first make a new empty list, and then iterate through the original list, appending each item that qualifies:
| >>> wood = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'.split()
>>> wood
['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if', 'a',
'woodchuck', 'could', 'chuck', 'wood?']
>>> wolist = []
>>> for x in wood:
...     if 'wo' in x:
...         wolist.append(x)
...
>>> wolist
['wood', 'would', 'woodchuck', 'woodchuck', 'wood?']
| |
OK, that works, but that's a lot of lines of code. What if I told you that you can accomplish it all with one line of Python code? Well, you can! Behold the superpower of list comprehension:
| >>> [x for x in wood if 'wo' in x]
['wood', 'would', 'woodchuck', 'woodchuck', 'wood?']
| |
You want a list of words that are 5+ characters? That too can be done with list comprehension:
| >>> [x for x in wood if len(x) >= 5]
['would', 'woodchuck', 'chuck', 'woodchuck', 'could', 'chuck', 'wood?']
| |
Words that are 5+ characters AND end with 'ck':
| >>> [x for x in wood if len(x) >= 5 and x.endswith('ck')]
['woodchuck', 'chuck', 'woodchuck', 'chuck']
| |
You get the idea. Basically, a list comprehension for filtering starts with [x for x in li], which by itself creates a new list with the same items as li, and then tacks on an if ... clause at the end, which acts as the filtering criterion. One more example, this time keeping only the short words:
| >>> [x for x in wood if len(x) <= 4]
['How', 'much', 'wood', 'a', 'if', 'a']
| |
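As a quick check that the two approaches really are interchangeable, here is a minimal sketch written as a script rather than an interactive session. The name wolist_comp is introduced here only for illustration.

```python
# The same sentence used throughout this section.
wood = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'.split()

# Loop version: start with an empty list and append each match.
wolist = []
for x in wood:
    if 'wo' in x:
        wolist.append(x)

# Comprehension version: the same filter in a single expression.
wolist_comp = [x for x in wood if 'wo' in x]

print(wolist == wolist_comp)   # True
print(len(wood))               # 13; the original list is left untouched
```

Either way, filtering builds a new list and leaves wood as it was.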
Transforming Items in a List
Another popular type of task with a list is to transform each item. For example, suppose I want to create a new list where each 'o' is replaced by 'oo' in every word. As before, the usual for-loop process gets the job done but is tedious:
| >>> wood
['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if', 'a',
'woodchuck', 'could', 'chuck', 'wood?']
>>> doubleo = []
>>> for x in wood:
...     doubleo.append(x.replace('o', 'oo'))
...
>>> doubleo
['Hoow', 'much', 'wooood', 'woould', 'a', 'woooodchuck', 'chuck', 'if',
'a', 'woooodchuck', 'coould', 'chuck', 'wooood?']
| |
Again, with list comprehension, all you need is one line of code:
| >>> [x.replace('o', 'oo') for x in wood]
['Hoow', 'much', 'wooood', 'woould', 'a', 'woooodchuck', 'chuck', 'if',
'a', 'woooodchuck', 'coould', 'chuck', 'wooood?']
| |
Another example -- capitalizing every word:
| >>> [x.capitalize() for x in wood]
['How', 'Much', 'Wood', 'Would', 'A', 'Woodchuck', 'Chuck', 'If', 'A',
'Woodchuck', 'Could', 'Chuck', 'Wood?']
| |
A list of word lengths, one for every word in wood:
| >>> [len(x) for x in wood]
[3, 4, 4, 5, 1, 9, 5, 2, 1, 9, 5, 5, 5]
| |
So you can see how handy this is. The syntax works like this: start with [x for x in li], which creates a new list with the same items as li, and replace the initial x with f(x), some function that takes x as its input. The result is a new list where each x is transformed to f(x).
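To make the f(x) idea concrete, here is a small sketch. It first shows that [x for x in li] gives a separate list with equal contents, and then plugs in a function of our own. The function shout() is a made-up example, not something built into Python.

```python
wood = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'.split()

# [x for x in li] builds a separate list holding the same items.
copy = [x for x in wood]
print(copy == wood)   # True: same items in the same order
print(copy is wood)   # False: a different list object

# f can be any function you define yourself; shout() is a made-up example.
def shout(word):
    return word.upper() + '!'

shouted = [shout(x) for x in wood]
print(shouted[:3])    # ['HOW!', 'MUCH!', 'WOOD!']
```

Anything that takes one item and returns a value can stand in for f: a string method, a built-in such as len(), or a function you wrote.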
Filtering and Transformation, Applied Together
You might ask: can we filter AND transform at the same time? Sure we can. Below, we keep only those words that contain 'wo' and then uppercase them:
| >>> [x.upper() for x in wood if 'wo' in x]
['WOOD', 'WOULD', 'WOODCHUCK', 'WOODCHUCK', 'WOOD?']
| |
What we have here is this syntax: [f(x) for x in li if ...]. Here's another example:
| >>> [x+'-away' for x in wood if len(x) <= 4]
['How-away', 'much-away', 'wood-away', 'a-away', 'if-away', 'a-away']
| |
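One detail worth spelling out is the order of operations: the if clause tests the original x, and only the items that pass are transformed. The sketch below unrolls the combined comprehension into a loop to show this. The name upper_wo is introduced here only for illustration.

```python
wood = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'.split()

# Loop equivalent of [x.upper() for x in wood if 'wo' in x]
upper_wo = []
for x in wood:
    if 'wo' in x:                   # filter on the original word
        upper_wo.append(x.upper())  # transform only what passed the filter

print(upper_wo == [x.upper() for x in wood if 'wo' in x])   # True

# The test sees each word before upper() is applied, so 'WO' never matches.
print([x.upper() for x in wood if 'WO' in x])               # []
```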
In the NLTK book, you will see a lot of examples of list comprehension in action, performing exciting operations on gigantic lists of words and other linguistic data. You should get comfortable with list comprehension: it will super-charge your text processing.