Nerding out with IPA: HEARDLE vs. CMU Pronouncing Dictionary¶

  • PyLing, September 17, 2025
  • Na-Rae Han (naraehan@pitt.edu)

We all know and love WORDLE, but have you heard of, ahem, HEARDLE (https://joshuamandel.com/heardle/)?


It's an IPA-based "phonemic version" of the English word game Wordle, and it has a way of tickling our linguist brains. HEARDLE is based on the famous CMU Pronouncing Dictionary. Let's explore the two! We'll address such burning questions as:

  • Is Heardle harder than Wordle? Why? How do we prove it?
  • What would make an excellent "opener" word? For Wordle, many folks swear by adieu. Can we find something similar?
  • CCVCC, CVCVC, VCCVC, ... which CV patterns are most common?

Trying out HEARDLE¶

  • Head to https://joshuamandel.com/heardle/ and try out today's puzzle
  • Is it harder than WORDLE? In what way?
  • Alphabet pool: how big is it for HEARDLE?
  • How big is the target word pool?

Exploring the CMU Pronouncing Dictionary¶

  • Part of NLTK's corpus resource: nltk.corpus.cmudict

Q1: How is it structured? How many entries?¶

In [1]:
from nltk.corpus import cmudict
prondict = cmudict.dict()
In [2]:
prondict['pumpkin']
Out[2]:
[['P', 'AH1', 'M', 'P', 'K', 'IH0', 'N'], ['P', 'AH1', 'M', 'K', 'IH0', 'N']]
In [3]:
len(prondict)
Out[3]:
123455
In [4]:
'linguist' in prondict
Out[4]:
True
In [5]:
prondict['often']
Out[5]:
[['AO1', 'F', 'AH0', 'N'], ['AO1', 'F', 'T', 'AH0', 'N']]
In [6]:
prondict['accent']
Out[6]:
[['AH0', 'K', 'S', 'EH1', 'N', 'T'], ['AE1', 'K', 'S', 'EH2', 'N', 'T']]
In [7]:
pronlist = cmudict.entries()  # as a list
len(pronlist)
Out[7]:
133737
In [8]:
prondict['read']    # Nya: homographic words? 
Out[8]:
[['R', 'EH1', 'D'], ['R', 'IY1', 'D']]
In [9]:
pronlist[469:475]
Out[9]:
[('accelerometers',
  ['AE0', 'K', 'S', 'EH2', 'L', 'ER0', 'AA1', 'M', 'AH0', 'T', 'ER0', 'Z']),
 ('accent', ['AH0', 'K', 'S', 'EH1', 'N', 'T']),
 ('accent', ['AE1', 'K', 'S', 'EH2', 'N', 'T']),
 ('accented', ['AE1', 'K', 'S', 'EH0', 'N', 'T', 'IH0', 'D']),
 ('accenting', ['AE1', 'K', 'S', 'EH0', 'N', 'T', 'IH0', 'NG']),
 ('accents', ['AE1', 'K', 'S', 'EH0', 'N', 'T', 'S'])]
In [10]:
prondict['calculator']  # Jenna's word 
Out[10]:
[['K', 'AE1', 'L', 'K', 'Y', 'AH0', 'L', 'EY2', 'T', 'ER0']]

Q2: How long is the longest word? How many sounds?¶

In [11]:
for x in sorted([(len(pron), w, pron) for (w, pron) in pronlist], reverse=True)[:5]:
    print(x)
(32, 'supercalifragilisticexpealidoshus', ['S', 'UW2', 'P', 'ER0', 'K', 'AE2', 'L', 'AH0', 'F', 'R', 'AE1', 'JH', 'AH0', 'L', 'IH2', 'S', 'T', 'IH0', 'K', 'EH2', 'K', 'S', 'P', 'IY0', 'AE2', 'L', 'AH0', 'D', 'OW1', 'SH', 'AH0', 'S'])
(28, 'antidisestablishmentarianism', ['AE2', 'N', 'T', 'AY0', 'D', 'IH0', 'S', 'AH0', 'S', 'T', 'AE2', 'B', 'L', 'IH0', 'SH', 'M', 'AH0', 'N', 'T', 'EH1', 'R', 'IY0', 'AH0', 'N', 'IH0', 'Z', 'AH0', 'M'])
(20, 'deinstitutionalization', ['D', 'IY0', 'IH2', 'N', 'S', 'T', 'IH0', 'T', 'UW2', 'SH', 'AH0', 'N', 'AH0', 'L', 'AH0', 'Z', 'EY1', 'SH', 'AH0', 'N'])
(19, 'supercalifragilistic', ['S', 'UW2', 'P', 'ER0', 'K', 'AE2', 'L', 'AH0', 'F', 'R', 'AE1', 'JH', 'AH0', 'L', 'IH2', 'S', 'T', 'IH0', 'K'])
(19, 'extraterritoriality', ['EH2', 'K', 'S', 'T', 'R', 'AH0', 'T', 'EH2', 'R', 'AH0', 'T', 'AO2', 'R', 'IY0', 'AE1', 'L', 'AH0', 'T', 'IY0'])

Q3: How many words end in /ʒ/?¶

In [12]:
[(w,pron) for (w,pron) in pronlist if pron[-1] == 'ZH']
Out[12]:
[('arbitrage', ['AA1', 'R', 'B', 'IH0', 'T', 'R', 'AA2', 'ZH']),
 ('barrage', ['B', 'ER0', 'AA1', 'ZH']),
 ('beige', ['B', 'EY1', 'ZH']),
 ('bruges', ['B', 'R', 'UW1', 'ZH']),
 ('camouflage', ['K', 'AE1', 'M', 'AH0', 'F', 'L', 'AA2', 'ZH']),
 ('collage', ['K', 'AH0', 'L', 'AA1', 'ZH']),
 ('concierge', ['K', 'AA2', 'N', 'S', 'IY0', 'EH1', 'R', 'ZH']),
 ('corsage', ['K', 'AO0', 'R', 'S', 'AA1', 'ZH']),
 ('cortege', ['K', 'AO0', 'R', 'T', 'EH1', 'ZH']),
 ('dhiraj', ['D', 'IH2', 'R', 'AA1', 'ZH']),
 ('dressage', ['D', 'R', 'EH0', 'S', 'AA1', 'ZH']),
 ('entourage', ['AA2', 'N', 'T', 'UH0', 'R', 'AA1', 'ZH']),
 ('entourage', ['AA2', 'N', 'T', 'ER0', 'AA1', 'ZH']),
 ('garage', ['G', 'ER0', 'AA1', 'ZH']),
 ('limoges', ['L', 'AH0', 'M', 'OW1', 'ZH']),
 ('massage', ['M', 'AH0', 'S', 'AA1', 'ZH']),
 ('mirage', ['M', 'ER0', 'AA1', 'ZH']),
 ('montage', ['M', 'AA0', 'N', 'T', 'AA1', 'ZH']),
 ('prestige', ['P', 'R', 'EH0', 'S', 'T', 'IY1', 'ZH']),
 ('raj', ['R', 'AA1', 'ZH']),
 ('rouge', ['R', 'UW1', 'ZH']),
 ('sabotage', ['S', 'AE1', 'B', 'AH0', 'T', 'AA2', 'ZH']),
 ('taj', ['T', 'AA1', 'ZH']),
 ('thivierge', ['TH', 'IH0', 'V', 'Y', 'EH1', 'R', 'ZH'])]
In [13]:
len(_)
Out[13]:
24

Q4: How many 5-phone words (aka "Heardle words")?¶

In [14]:
cmuphone5 = [(w,pron) for (w,pron) in pronlist if len(pron)==5]
len(cmuphone5)
Out[14]:
25821
In [15]:
cmuletter5 = [(w,pron) for (w,pron) in pronlist if len(w)==5]
len(cmuletter5)
Out[15]:
15469

Q5: And is that larger or smaller than 5-letter words (aka "Wordle words")?¶

  • Compare with: nltk.corpus.words.words('en')
In [16]:
import nltk
enwords = nltk.corpus.words.words('en')
len(enwords)
Out[16]:
235886
In [17]:
%pprint
enwords[:100]
Pretty printing has been turned OFF
Out[17]:
['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf', 'Aaron', 'Aaronic', 'Aaronical', 'Aaronite', 'Aaronitic', 'Aaru', 'Ab', 'aba', 'Ababdeh', 'Ababua', 'abac', 'abaca', 'abacate', 'abacay', 'abacinate', 'abacination', 'abaciscus', 'abacist', 'aback', 'abactinal', 'abactinally', 'abaction', 'abactor', 'abaculus', 'abacus', 'Abadite', 'abaff', 'abaft', 'abaisance', 'abaiser', 'abaissed', 'abalienate', 'abalienation', 'abalone', 'Abama', 'abampere', 'abandon', 'abandonable', 'abandoned', 'abandonedly', 'abandonee', 'abandoner', 'abandonment', 'Abanic', 'Abantes', 'abaptiston', 'Abarambo', 'Abaris', 'abarthrosis', 'abarticular', 'abarticulation', 'abas', 'abase', 'abased', 'abasedly', 'abasedness', 'abasement', 'abaser', 'Abasgi', 'abash', 'abashed', 'abashedly', 'abashedness', 'abashless', 'abashlessly', 'abashment', 'abasia', 'abasic', 'abask', 'Abassin', 'abastardize', 'abatable', 'abate', 'abatement', 'abater', 'abatis', 'abatised', 'abaton', 'abator', 'abattoir', 'Abatua', 'abature', 'abave', 'abaxial', 'abaxile', 'abaze', 'abb', 'Abba', 'abbacomes', 'abbacy', 'Abbadide']
In [18]:
letter5 = [w for w in enwords if len(w) == 5]
len(letter5)
Out[18]:
10230
In [19]:
letter5[:100]
Out[19]:
['aalii', 'Aaron', 'abaca', 'aback', 'abaff', 'abaft', 'Abama', 'abase', 'abash', 'abask', 'abate', 'abave', 'abaze', 'abbas', 'abbey', 'Abbie', 'abbot', 'abdal', 'abdat', 'abeam', 'abear', 'abele', 'abhor', 'abide', 'abidi', 'Abies', 'abilo', 'abkar', 'abler', 'ablow', 'abmho', 'Abner', 'abnet', 'abode', 'abody', 'abohm', 'aboil', 'aboma', 'aboon', 'abord', 'abort', 'about', 'above', 'Abram', 'abret', 'abrim', 'abrin', 'Abrus', 'absit', 'abuna', 'abura', 'abuse', 'Abuta', 'abuzz', 'abwab', 'abysm', 'abyss', 'acana', 'acapu', 'acara', 'acari', 'acate', 'accoy', 'acedy', 'acerb', 'achar', 'Achen', 'acher', 'achor', 'acier', 'acker', 'ackey', 'aclys', 'acmic', 'acock', 'acoin', 'acold', 'Acoma', 'acoma', 'acone', 'acorn', 'Acrab', 'acred', 'acrid', 'Acroa', 'acron', 'Acrux', 'acryl', 'actin', 'acton', 'actor', 'Acuan', 'acute', 'adage', 'Adapa', 'adapt', 'adati', 'adawe', 'adawn', 'adays']

Q6: ARPABET is confusing. Can we see some real IPA?¶

  • Mapping found in Na-Rae's "text samples" file: https://sites.pitt.edu/~naraehan/python3/text-samples.txt
In [20]:
arpa_map = {'IY0':'i', 'IH0':'ɪ', 'EH0':'ɛ', 'AE0':'æ', 'AA0':'ɑ', 'AO0':'ɔ', 'AH0':'ʌ/ə', 'UH0':'ʊ',
          'UW0':'u', 'ER0':'ɝ/ɚ', 'AY0':'aɪ', 'EY0':'eɪ', 'AW0':'aʊ', 'OW0':'oʊ', 'OY0':'ɔɪ',
          'IY1':'i', 'IH1':'ɪ', 'EH1':'ɛ', 'AE1':'æ', 'AA1':'ɑ', 'AO1':'ɔ', 'AH1':'ʌ/ə', 'UH1':'ʊ',
          'UW1':'u', 'ER1':'ɝ/ɚ', 'AY1':'aɪ', 'EY1':'eɪ', 'AW1':'aʊ', 'OW1':'oʊ', 'OY1':'ɔɪ',
          'IY2':'i', 'IH2':'ɪ', 'EH2':'ɛ', 'AE2':'æ', 'AA2':'ɑ', 'AO2':'ɔ', 'AH2':'ʌ/ə', 'UH2':'ʊ',
          'UW2':'u', 'ER2':'ɝ/ɚ', 'AY2':'aɪ', 'EY2':'eɪ', 'AW2':'aʊ', 'OW2':'oʊ', 'OY2':'ɔɪ',
          'P':'p', 'B':'b', 'T':'t', 'D':'d', 'K':'k', 'G':'g', 'M':'m', 'N':'n', 'NG':'ŋ',
          'F':'f', 'V':'v', 'TH':'θ', 'DH':'ð', 'S':'s', 'Z':'z', 'SH':'ʃ', 'ZH':'ʒ',
          'HH':'h', 'CH':'tʃ', 'JH':'dʒ', 'W':'w', 'R':'ɹ', 'Y':'j', 'L':'l'}

def ipa_fy(phones):
    "Converts CMU arpabet list to IPA string. Ignores stress."
    return ' '.join([arpa_map[p] for p in phones])

ipa_fy(['AE1', 'NG', 'K', 'SH', 'AH0', 'S'])
Out[20]:
'æ ŋ k ʃ ʌ/ə s'
In [21]:
prondict['anxious']
Out[21]:
[['AE1', 'NG', 'K', 'SH', 'AH0', 'S'], ['AE1', 'NG', 'SH', 'AH0', 'S']]
In [22]:
for pron in prondict['anxious']:
    print(ipa_fy(pron))
æ ŋ k ʃ ʌ/ə s
æ ŋ ʃ ʌ/ə s
In [23]:
ipa_fy(prondict['aphasia'][0])  # Ben's word
Out[23]:
'ʌ/ə f eɪ ʒ ʌ/ə'

Dissecting HEARDLE¶

  • Let's examine HEARDLE's source code. Can you find the two data files derived from the CMU Pronouncing Dictionary?
    • One for all words, of every length
    • Another for a smaller set of "target" words: 5 segments long and less obscure. These can be HEARDLE answers.
In [24]:
import pandas as pd
allwords_df = pd.read_json('https://rawcdn.githack.com/jmandel/heardle/453f0c8feb0d1755788a5a7c8d0bd16baf8be130/words.json')
In [25]:
allwords_df[100:110]
Out[25]:
         word  variant                      phonemes  stress
100      ABAD        0                [AH, B, AA, D]       2
101   ABADAKA        0     [AH, B, AE, D, AH, K, AH]       2
102     ABADI        0            [AH, B, AE, D, IY]       2
103    ABADIE        0            [AH, B, AE, D, IY]       2
104     ABAIR        0                [AH, B, EH, R]       2
105   ABALKIN        0      [AH, B, AA, L, K, IH, N]       2
106   ABALONE        0     [AE, B, AH, L, OW, N, IY]       4
107  ABALONES        0  [AE, B, AH, L, OW, N, IY, Z]       4
108    ABALOS        0         [AA, B, AA, L, OW, Z]       2
109   ABANDON        0      [AH, B, AE, N, D, AH, N]       2
In [26]:
allwords_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128456 entries, 0 to 128455
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   word      128456 non-null  object
 1   variant   128456 non-null  int64 
 2   phonemes  128456 non-null  object
 3   stress    128456 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 3.9+ MB
In [27]:
allwords_df[allwords_df.phonemes.map(len)==5]
Out[27]:
           word  variant            phonemes  stress
37      (PARENS        0   [P, ER, EH, N, Z]       2
44      )PARENS        0   [P, ER, EH, N, Z]       2
50      -HYPHEN        0  [HH, AY, F, AH, N]       1
58          3-D        0  [TH, R, IY, D, IY]       2
59           3D        0  [TH, R, IY, D, IY]       2
...         ...      ...                 ...     ...
128432    ZYCAD        0   [Z, IH, K, AE, D]       1
128437   ZYGOTE        0   [Z, AY, G, OW, T]       1
128439    ZYLKA        0   [Z, IH, L, K, AH]       1
128441    ZYMAN        0   [Z, AY, M, AH, N]       1
128442    ZYNDA        0   [Z, IH, N, D, AH]       1

26017 rows × 4 columns

In [28]:
targetwords_df = pd.read_json('https://joshuamandel.com/heardle/words-to-target.json')
targetwords_df.head()
Out[28]:
      word  variant            phonemes  stress         p
0  abashed        0  [AH, B, AE, SH, T]       2  0.941176
1    abhor        0  [AE, B, HH, AO, R]       3  0.885856
2   ablate        0   [AH, B, L, EY, T]       3  0.550000
3   ablaze        0   [AH, B, L, EY, Z]       3  0.983562
4   abloom        0   [AH, B, L, UW, M]       3  0.705234
In [29]:
targetwords_df.tail()
Out[29]:
         word  variant            phonemes  stress         p
4110  zipping        0  [Z, IH, P, IH, NG]       1  0.988180
4111   zircon        0   [Z, ER, K, AH, N]       1  0.677966
4112   zombie        0   [Z, AA, M, B, IY]       1  0.997669
4113   zoning        0  [Z, OW, N, IH, NG]       1  0.958525
4114   zygote        0   [Z, AY, G, OW, T]       1  0.884804
In [30]:
targetwords = list(zip(targetwords_df.word, targetwords_df.phonemes))
targetwords[-5:]
Out[30]:
[('zipping', ['Z', 'IH', 'P', 'IH', 'NG']), ('zircon', ['Z', 'ER', 'K', 'AH', 'N']), ('zombie', ['Z', 'AA', 'M', 'B', 'IY']), ('zoning', ['Z', 'OW', 'N', 'IH', 'NG']), ('zygote', ['Z', 'AY', 'G', 'OW', 'T'])]
In [31]:
len(targetwords)
Out[31]:
4115

4115 target words. Let's get up close!¶

Q1: What sort of words are in there?¶

In [32]:
[w for (w,pron) in targetwords][:100]
Out[32]:
['abashed', 'abhor', 'ablate', 'ablaze', 'abloom', 'aboard', 'abort', 'abound', 'abrade', 'abridge', 'abroad', 'abscess', 'absurd', 'acacia', 'accede', 'access', 'acclaim', 'accord', 'accost', 'account', 'accursed', 'accuse', 'achiever', 'acorn', 'acquire', 'acquit', 'acrid', 'across', 'acting', 'action', 'active', 'acute', 'adapt', 'addled', 'adduct', 'adept', 'adhere', 'adjust', 'admin', 'admire', 'admit', 'adobe', 'adopt', 'adorn', 'adroit', 'adverb', 'advert', 'aerial', 'aesthete', 'affect', 'afford', 'afghan', 'afield', 'aflame', 'afloat', 'afraid', 'afresh', 'ageless', 'agent', 'aggress', 'aggrieve', 'aghast', 'agony', 'agreed', 'ahold', 'aimless', 'airbag', 'airbase', 'airboat', 'aircrew', 'airfare', 'airflow', 'airfoil', 'airhead', 'airless', 'airline', 'airlock', 'airmail', 'airman', 'airpower', 'airship', 'airtight', 'airtime', 'airwave', 'alarm', 'album', 'alcove', 'alehouse', 'alias', 'alibi', 'alien', 'alleged', 'allele', 'allergy', 'alleyway', 'almond', 'aloft', 'aloha', 'alpine', 'alright']
In [33]:
[w for (w,pron) in targetwords][2000:2100]
Out[33]:
['lineup', 'lingo', 'lining', 'linked', 'linker', 'links', 'lipid', 'listen', 'lithic', 'little', 'lively', 'liven', 'livery', 'livid', 'living', 'lizard', 'loading', 'loathing', 'local', 'locale', 'locate', 'locket', 'locking', 'lockout', 'lockup', 'locus', 'lodging', 'lofty', 'logging', 'logic', 'login', 'logos', 'lonely', 'longbow', 'longer', 'longing', 'longish', 'looking', 'lookout', 'lookup', 'looming', 'looping', 'loosely', 'loosen', 'loosing', 'lopping', 'losing', 'lotion', 'lottery', 'lotus', 'loudly', 'lovely', 'loving', 'lowdown', 'lowering', 'loyally', 'lucid', 'luddite', 'lumber', 'lumen', 'lumpy', 'lunged', 'lupine', 'lupus', 'lurid', 'luscious', 'lushly', 'luster', 'lusty', 'lycra', 'lynx', 'lyric', 'lysine', 'machine', 'macro', 'madam', 'madden', 'madding', 'madly', 'mafia', 'maggot', 'magic', 'magma', 'magna', 'magpie', 'magus', 'maiden', 'mailing', 'mainly', 'maitre', 'makeup', 'making', 'malaise', 'malign', 'mallard', 'mallet', 'mamba', 'mambo', 'mammal', 'mammary']
In [34]:
[w for (w,pron) in targetwords][-10:]
Out[34]:
['youtube', 'zealous', 'zebra', 'zenith', 'zippered', 'zipping', 'zircon', 'zombie', 'zoning', 'zygote']
In [35]:
prondict['zenith']
Out[35]:
[['Z', 'IY1', 'N', 'AH0', 'TH'], ['Z', 'IY1', 'N', 'IH0', 'TH']]

Q2: Can we see some glorious IPA?¶

  • We gotta extend our arpa_map to include vowels without stress marking
In [36]:
arpa_map_nostress = {'IY':'i', 'IH':'ɪ', 'EH':'ɛ', 'AE':'æ', 'AA':'ɑ', 'AO':'ɔ', 'AH':'ʌ/ə', 'UH':'ʊ',
          'UW':'u', 'ER':'ɝ/ɚ', 'AY':'aɪ', 'EY':'eɪ', 'AW':'aʊ', 'OW':'oʊ', 'OY':'ɔɪ'}
arpa_map = arpa_map_nostress | arpa_map
In [37]:
[(w,pron) for (w,pron) in targetwords][-10:]
Out[37]:
[('youtube', ['Y', 'UW', 'T', 'UW', 'B']), ('zealous', ['Z', 'EH', 'L', 'AH', 'S']), ('zebra', ['Z', 'IY', 'B', 'R', 'AH']), ('zenith', ['Z', 'IY', 'N', 'IH', 'TH']), ('zippered', ['Z', 'IH', 'P', 'ER', 'D']), ('zipping', ['Z', 'IH', 'P', 'IH', 'NG']), ('zircon', ['Z', 'ER', 'K', 'AH', 'N']), ('zombie', ['Z', 'AA', 'M', 'B', 'IY']), ('zoning', ['Z', 'OW', 'N', 'IH', 'NG']), ('zygote', ['Z', 'AY', 'G', 'OW', 'T'])]
In [38]:
[(w,ipa_fy(pron)) for (w,pron) in targetwords][-10:]
Out[38]:
[('youtube', 'j u t u b'), ('zealous', 'z ɛ l ʌ/ə s'), ('zebra', 'z i b ɹ ʌ/ə'), ('zenith', 'z i n ɪ θ'), ('zippered', 'z ɪ p ɝ/ɚ d'), ('zipping', 'z ɪ p ɪ ŋ'), ('zircon', 'z ɝ/ɚ k ʌ/ə n'), ('zombie', 'z ɑ m b i'), ('zoning', 'z oʊ n ɪ ŋ'), ('zygote', 'z aɪ g oʊ t')]

Q3: Which sounds are most frequent? Which are the rarest?¶

In [39]:
prons_list = [pron for (w,pron) in targetwords]
prons_list[:5]
Out[39]:
[['AH', 'B', 'AE', 'SH', 'T'], ['AE', 'B', 'HH', 'AO', 'R'], ['AH', 'B', 'L', 'EY', 'T'], ['AH', 'B', 'L', 'EY', 'Z'], ['AH', 'B', 'L', 'UW', 'M']]
In [40]:
prons_flat = [p for pron in prons_list for p in pron]
prons_flat[:20]
Out[40]:
['AH', 'B', 'AE', 'SH', 'T', 'AE', 'B', 'HH', 'AO', 'R', 'AH', 'B', 'L', 'EY', 'T', 'AH', 'B', 'L', 'EY', 'Z']
In [41]:
prons_flat_ipa = [arpa_map[p] for p in prons_flat]
prons_flat_ipa[:20]
Out[41]:
['ʌ/ə', 'b', 'æ', 'ʃ', 't', 'æ', 'b', 'h', 'ɔ', 'ɹ', 'ʌ/ə', 'b', 'l', 'eɪ', 't', 'ʌ/ə', 'b', 'l', 'eɪ', 'z']
In [42]:
len(prons_flat_ipa)    # 5 * 4115 
Out[42]:
20575
In [43]:
import nltk
phone_fd = nltk.FreqDist(prons_flat_ipa)
In [44]:
len(phone_fd)
Out[44]:
39
In [45]:
phone_fd['h']
Out[45]:
173
In [46]:
phone_fd['ʌ/ə']
Out[46]:
1381
In [47]:
phone_fd.freq('ʌ/ə')
Out[47]:
0.06712029161603889
In [48]:
phone_fd.most_common()
Out[48]:
[('ʌ/ə', 1381), ('ɪ', 1304), ('l', 1263), ('t', 1247), ('s', 1161), ('ɹ', 1109), ('n', 1062), ('ɝ/ɚ', 1012), ('k', 991), ('i', 945), ('d', 880), ('p', 710), ('m', 582), ('ŋ', 574), ('b', 560), ('æ', 555), ('ɛ', 496), ('eɪ', 451), ('ɑ', 410), ('f', 385), ('oʊ', 368), ('aɪ', 349), ('g', 332), ('ɔ', 275), ('v', 263), ('w', 242), ('ʃ', 233), ('z', 230), ('u', 224), ('h', 173), ('dʒ', 171), ('tʃ', 165), ('aʊ', 144), ('θ', 89), ('ʊ', 68), ('j', 64), ('ɔɪ', 61), ('ð', 34), ('ʒ', 12)]

Q4: What's the most common initial sound? The final sound?¶

In [49]:
# only 1st sounds
phone1 = [arpa_map[p[0]] for p in prons_list]
phone1[:5], phone1[-5:]
Out[49]:
(['ʌ/ə', 'æ', 'ʌ/ə', 'ʌ/ə', 'ʌ/ə'], ['z', 'z', 'z', 'z', 'z'])
In [50]:
phone1_fd = nltk.FreqDist(phone1)
phone1_fd.most_common()
Out[50]:
[('s', 553), ('k', 352), ('b', 326), ('p', 277), ('ɹ', 243), ('f', 218), ('t', 201), ('d', 196), ('m', 183), ('ʌ/ə', 180), ('g', 166), ('l', 166), ('h', 139), ('w', 108), ('ɪ', 91), ('ʃ', 89), ('n', 84), ('æ', 72), ('ɛ', 69), ('v', 57), ('tʃ', 47), ('dʒ', 44), ('ɑ', 40), ('θ', 37), ('aʊ', 33), ('ɔ', 28), ('oʊ', 22), ('j', 19), ('aɪ', 17), ('eɪ', 16), ('ɝ/ɚ', 13), ('z', 10), ('i', 9), ('ð', 6), ('ɔɪ', 2), ('ʒ', 1), ('u', 1)]
In [51]:
phone2 = [arpa_map[p[1]] for p in prons_list]
phone2_fd = nltk.FreqDist(phone2)
phone2_fd.most_common()
Out[51]:
[('ɪ', 382), ('ɹ', 371), ('æ', 301), ('ʌ/ə', 290), ('ɛ', 251), ('ɑ', 235), ('l', 231), ('i', 226), ('eɪ', 194), ('n', 168), ('t', 161), ('ɝ/ɚ', 155), ('aɪ', 150), ('oʊ', 144), ('ɔ', 137), ('k', 122), ('p', 100), ('u', 78), ('w', 61), ('m', 58), ('ʊ', 47), ('s', 36), ('aʊ', 35), ('d', 28), ('ɔɪ', 26), ('b', 23), ('v', 21), ('g', 19), ('f', 17), ('j', 14), ('ŋ', 10), ('θ', 9), ('dʒ', 7), ('ʃ', 5), ('tʃ', 1), ('h', 1), ('z', 1)]
In [52]:
phone3 = [arpa_map[p[2]] for p in prons_list]
phone3_fd = nltk.FreqDist(phone3)
phone3_fd.most_common()
Out[52]:
[('ɹ', 331), ('l', 325), ('n', 315), ('t', 262), ('k', 240), ('s', 227), ('d', 194), ('m', 176), ('p', 146), ('ʌ/ə', 145), ('ɪ', 137), ('b', 133), ('æ', 130), ('v', 110), ('ɛ', 103), ('g', 100), ('i', 98), ('f', 86), ('z', 85), ('eɪ', 73), ('ɑ', 73), ('ŋ', 64), ('ɔ', 59), ('ʃ', 57), ('u', 57), ('w', 48), ('oʊ', 47), ('aɪ', 47), ('ɝ/ɚ', 45), ('tʃ', 41), ('dʒ', 39), ('h', 29), ('aʊ', 26), ('θ', 20), ('ð', 18), ('j', 9), ('ɔɪ', 9), ('ʊ', 6), ('ʒ', 5)]
In [53]:
phone4 = [arpa_map[p[3]] for p in prons_list]
phone4_fd = nltk.FreqDist(phone4)
phone4_fd.most_common()
Out[53]:
[('ɪ', 694), ('ʌ/ə', 683), ('ɝ/ɚ', 209), ('l', 205), ('t', 202), ('n', 184), ('k', 143), ('i', 137), ('s', 137), ('aɪ', 113), ('ɹ', 112), ('p', 109), ('eɪ', 108), ('d', 106), ('m', 96), ('oʊ', 73), ('ɛ', 72), ('b', 61), ('u', 59), ('ɑ', 57), ('æ', 52), ('tʃ', 49), ('ɔ', 47), ('aʊ', 46), ('v', 43), ('f', 40), ('g', 40), ('dʒ', 37), ('ŋ', 36), ('z', 33), ('ʃ', 27), ('w', 25), ('j', 22), ('ɔɪ', 16), ('ʊ', 14), ('ð', 10), ('θ', 10), ('h', 4), ('ʒ', 4)]
In [54]:
phone5 = [arpa_map[p[4]] for p in prons_list]
phone5_fd = nltk.FreqDist(phone5)
phone5_fd.most_common()
Out[54]:
[('ɝ/ɚ', 590), ('i', 475), ('ŋ', 464), ('t', 421), ('d', 356), ('l', 336), ('n', 311), ('s', 208), ('k', 134), ('z', 101), ('ʌ/ə', 83), ('oʊ', 82), ('p', 78), ('m', 69), ('eɪ', 60), ('ʃ', 55), ('ɹ', 52), ('dʒ', 44), ('v', 32), ('u', 29), ('tʃ', 27), ('f', 24), ('aɪ', 22), ('b', 17), ('θ', 13), ('ɔɪ', 8), ('g', 7), ('ɑ', 5), ('aʊ', 4), ('ɔ', 4), ('ʒ', 2), ('ʊ', 1), ('ɛ', 1)]

Q5: What's the most common CV pattern? Could it be... CVCVC? Perhaps VCCVC?¶

In [55]:
import re
def CV_fy(arpa_pron):
    cv_list = []
    for p in arpa_pron:
        if re.match(r'[AEIOU]', p): cv_list.append('V')
        else: cv_list.append('C')
    return ' '.join(cv_list)

CV_fy(['AH', 'B', 'AE', 'SH', 'T'])
Out[55]:
'V C V C C'
In [56]:
prons_list_cv = [CV_fy(pron) for pron in prons_list]
prons_list_cv[-5:]
Out[56]:
['C V C V C', 'C V C V C', 'C V C C V', 'C V C V C', 'C V C V C']
In [57]:
targetwords[-5:]
Out[57]:
[('zipping', ['Z', 'IH', 'P', 'IH', 'NG']), ('zircon', ['Z', 'ER', 'K', 'AH', 'N']), ('zombie', ['Z', 'AA', 'M', 'B', 'IY']), ('zoning', ['Z', 'OW', 'N', 'IH', 'NG']), ('zygote', ['Z', 'AY', 'G', 'OW', 'T'])]
In [58]:
cv_fd = nltk.FreqDist(prons_list_cv)
cv_fd.most_common()
Out[58]:
[('C V C V C', 1735), ('C V C C V', 670), ('C C V C V', 418), ('V C C V C', 349), ('C C V C C', 334), ('C V C V V', 111), ('V C V C C', 93), ('C C C V C', 73), ('V C V C V', 66), ('C V C C C', 57), ('C C V V C', 50), ('C V V C V', 40), ('V C C C V', 34), ('V C V V C', 23), ('C V V C C', 20), ('V C C V V', 16), ('V V C V C', 9), ('C V V V C', 7), ('C C V V V', 4), ('C C C V V', 3), ('V V C C V', 2), ('V C C C C', 1)]

Very curious! What is this lone VCCCC word? And VVCCV?

Before we can search/filter, let's update our targetwords DataFrame with IPA strings and CV patterns.

In [59]:
targetwords_df['ipa'] = targetwords_df.phonemes.map(ipa_fy)
targetwords_df.head()
Out[59]:
      word  variant            phonemes  stress         p           ipa
0  abashed        0  [AH, B, AE, SH, T]       2  0.941176   ʌ/ə b æ ʃ t
1    abhor        0  [AE, B, HH, AO, R]       3  0.885856     æ b h ɔ ɹ
2   ablate        0   [AH, B, L, EY, T]       3  0.550000  ʌ/ə b l eɪ t
3   ablaze        0   [AH, B, L, EY, Z]       3  0.983562  ʌ/ə b l eɪ z
4   abloom        0   [AH, B, L, UW, M]       3  0.705234   ʌ/ə b l u m
In [60]:
targetwords_df['CV'] = prons_list_cv
targetwords_df.head()
Out[60]:
      word  variant            phonemes  stress         p           ipa         CV
0  abashed        0  [AH, B, AE, SH, T]       2  0.941176   ʌ/ə b æ ʃ t  V C V C C
1    abhor        0  [AE, B, HH, AO, R]       3  0.885856     æ b h ɔ ɹ  V C C V C
2   ablate        0   [AH, B, L, EY, T]       3  0.550000  ʌ/ə b l eɪ t  V C C V C
3   ablaze        0   [AH, B, L, EY, Z]       3  0.983562  ʌ/ə b l eɪ z  V C C V C
4   abloom        0   [AH, B, L, UW, M]       3  0.705234   ʌ/ə b l u m  V C C V C
In [61]:
targetwords_df[targetwords_df.CV=='V C C C C']
Out[61]:
      word  variant           phonemes  stress         p        ipa         CV
123  angst        0  [AA, NG, K, S, T]       0  0.966667  ɑ ŋ k s t  V C C C C
In [62]:
prondict['angst']  # Go, Ben! You guessed it! 
Out[62]:
[['AA1', 'NG', 'K', 'S', 'T']]
In [63]:
targetwords_df[targetwords_df.CV=='V V C C V']
Out[63]:
         word  variant             phonemes  stress         p              ipa         CV
142     aorta        0   [EY, AO, R, T, AH]       1  0.929730     eɪ ɔ ɹ t ʌ/ə  V V C C V
184  arranger        0  [ER, EY, N, JH, ER]       1  0.850356  ɝ/ɚ eɪ n dʒ ɝ/ɚ  V V C C V
In [64]:
targetwords_df[targetwords_df.CV=='C C C V V']
Out[64]:
         word  variant           phonemes  stress         p           ipa         CV
3060   screwy        0  [S, K, R, UW, IY]       3  0.950249     s k ɹ u i  C C C V V
3236   skewer        0  [S, K, Y, UW, ER]       3  0.988067   s k j u ɝ/ɚ  C C C V V
3402  sprayer        0  [S, P, R, EY, ER]       3  0.964557  s p ɹ eɪ ɝ/ɚ  C C C V V

Q6: What is the BEST opening word? (How do we operationalize this?)¶

  • We could try to maximize "hits".

Take 1: unigram probability -- each word is a "bag of 5 sounds"

  • Basically, a word with 5 distinct sounds, each ranking very high in the overall frequency distribution (phone_fd)
  • The top 5 were: ('ʌ/ə', 1381), ('ɪ', 1304), ('l', 1263), ('t', 1247), ('s', 1161). Any word made out of these five would be it! 's ɪ t ʌ/ə l'? 't ʌ/ə l ɪ s'? 's t ɪ l ʌ/ə'? Let's hunt for candidates below.
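We can go hunting for such words directly. A minimal sketch (top5 below is simply the set of the five most frequent IPA symbols from phone_fd; it keeps target words whose entire sound inventory is a subset of those five):

top5 = {'ʌ/ə', 'ɪ', 'l', 't', 's'}
# target words drawing only on the top-5 sounds (repeats allowed)
[w for (w, pron) in targetwords if {arpa_map[p] for p in pron} <= top5]

Words like 'little' and 'subtle' show up, though note that they repeat phones.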
In [65]:
def get_bag_probs(pron):
    ipas = [arpa_map[p] for p in pron]
    probs = [phone_fd.freq(x) for x in ipas]
    return probs

get_bag_probs(['S', 'IY', 'T', 'IH', 'D'])    # seated
Out[65]:
[0.056427703523693806, 0.04592952612393682, 0.06060753341433779, 0.0633778857837181, 0.042770352369380316]
In [66]:
import math
math.prod(_)
Out[66]:
4.2578614537093817e-07
In [67]:
get_bag_probs(['D', 'IH', 'S', 'IY', 'T'])   # deceit, same 5 sounds
Out[67]:
[0.042770352369380316, 0.0633778857837181, 0.056427703523693806, 0.04592952612393682, 0.06060753341433779]
In [68]:
math.prod(_)   # same probability
Out[68]:
4.2578614537093817e-07
In [69]:
# Dan's pick! 
prondict['until'], ipa_fy(prondict['until'][0])
Out[69]:
([['AH0', 'N', 'T', 'IH1', 'L']], 'ʌ/ə n t ɪ l')
In [70]:
get_bag_probs(prondict['until'][0])
Out[70]:
[0.06712029161603889, 0.05161603888213852, 0.06060753341433779, 0.0633778857837181, 0.06138517618469016]
In [71]:
math.prod(_)   # very high! 
Out[71]:
8.168952510477823e-07
In [72]:
targetwords_df['bagprob'] = targetwords_df.phonemes.map(lambda x: math.prod(get_bag_probs(x)))
targetwords_df.head(5)
Out[72]:
      word  variant            phonemes  stress         p           ipa         CV       bagprob
0  abashed        0  [AH, B, AE, SH, T]       2  0.941176   ʌ/ə b æ ʃ t  V C V C C  3.382189e-08
1    abhor        0  [AE, B, HH, AO, R]       3  0.885856     æ b h ɔ ɹ  V C C V C  4.447256e-09
2   ablate        0   [AH, B, L, EY, T]       3  0.550000  ʌ/ə b l eɪ t  V C C V C  1.489803e-07
3   ablaze        0   [AH, B, L, EY, Z]       3  0.983562  ʌ/ə b l eɪ z  V C C V C  2.747832e-08
4   abloom        0   [AH, B, L, UW, M]       3  0.705234   ʌ/ə b l u m  V C C V C  3.453479e-08
In [73]:
targetwords_df.sort_values(by=['bagprob'], ascending=False)
Out[73]:
         word  variant              phonemes  stress         p            ipa         CV       bagprob
2009   little        0     [L, IH, T, AH, L]       1  0.997429    l ɪ t ʌ/ə l  C V C V C  9.715054e-07
3732   tittle        0     [T, IH, T, AH, L]       1  0.691932    t ɪ t ʌ/ə l  C V C V C  9.591981e-07
3539   subtle        0     [S, AH, T, AH, L]       1  0.990123  s ʌ/ə t ʌ/ə l  C V C V C  9.457801e-07
3824   tussle        0     [T, AH, S, AH, L]       1  0.963054  t ʌ/ə s ʌ/ə l  C V C V C  9.457801e-07
3841    ultra        0     [AH, L, T, R, AH]       0  0.987685  ʌ/ə l t ɹ ʌ/ə  V C C C V  9.034196e-07
...       ...      ...                   ...     ...       ...            ...        ...           ...
1425    fused        0      [F, Y, UW, Z, D]       2  0.982587      f j u z d  C C V C C  3.029703e-10
537    bureau        0     [B, Y, UH, R, OW]       2  0.990123     b j ʊ ɹ oʊ  C C V C V  2.697473e-10
1427   future        0    [F, Y, UW, CH, ER]       2  0.994845   f j u tʃ ɝ/ɚ  C C V C V  2.499505e-10
1717  hosiery        0  [HH, OW, ZH, ER, IY]       1  0.957346   h oʊ ʒ ɝ/ɚ i  C V C V V  1.981474e-10
473   boyhood        0    [B, OY, HH, UH, D]       1  0.976064     b ɔɪ h ʊ d  C V C V C  9.590833e-11

4115 rows × 8 columns

In [74]:
targetwords_df.sort_values(by=['bagprob'], ascending=False).head(10)
Out[74]:
        word  variant           phonemes  stress         p            ipa         CV       bagprob
2009  little        0  [L, IH, T, AH, L]       1  0.997429    l ɪ t ʌ/ə l  C V C V C  9.715054e-07
3732  tittle        0  [T, IH, T, AH, L]       1  0.691932    t ɪ t ʌ/ə l  C V C V C  9.591981e-07
3539  subtle        0  [S, AH, T, AH, L]       1  0.990123  s ʌ/ə t ʌ/ə l  C V C V C  9.457801e-07
3824  tussle        0  [T, AH, S, AH, L]       1  0.963054  t ʌ/ə s ʌ/ə l  C V C V C  9.457801e-07
3841   ultra        0  [AH, L, T, R, AH]       0  0.987685  ʌ/ə l t ɹ ʌ/ə  V C C C V  9.034196e-07
3812  tunnel        0  [T, AH, N, AH, L]       1  0.997416  t ʌ/ə n ʌ/ə l  C V C V C  8.651322e-07
2972  rustle        0  [R, AH, S, AH, L]       1  0.988858  ɹ ʌ/ə s ʌ/ə l  C V C V C  8.411148e-07
206   assist        0  [AH, S, IH, S, T]       2  0.994667    ʌ/ə s ɪ s t  V C V C C  8.209240e-07
3886   until        0  [AH, N, T, IH, L]       3  0.989691    ʌ/ə n t ɪ l  V C C V C  8.168953e-07
2337  occult        0  [AH, K, AH, L, T]       2  0.958025  ʌ/ə k ʌ/ə l t  V C V C C  8.072938e-07

Yep, 'until' is the top word without duplicate phones!
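We can double-check that with a filter. A quick sketch: keep only target words whose five phones map to five distinct IPA symbols, then re-sort by bag probability:

distinct = targetwords_df.phonemes.map(lambda ps: len({arpa_map[p] for p in ps}) == 5)
targetwords_df[distinct].sort_values(by=['bagprob'], ascending=False).head(3)   # 'until' comes out on top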

Take 2: take position into account (positional probability)

In [75]:
# 's' count and probability as the first phone. High! 
phone1_fd['s'], phone1_fd.freq('s')
Out[75]:
(553, 0.13438639125151883)
In [76]:
# 's' as the 2nd phone. Much less likely. 
phone2_fd['s'], phone2_fd.freq('s')
Out[76]:
(36, 0.008748481166464156)
In [77]:
def get_slot_probs(pron):
    ipas = [arpa_map[p] for p in pron]
    slot_probs = [phone1_fd.freq(ipas[0]), phone2_fd.freq(ipas[1]), phone3_fd.freq(ipas[2]), 
                  phone4_fd.freq(ipas[3]), phone5_fd.freq(ipas[4])]
    return slot_probs

get_slot_probs(['S', 'IY', 'T', 'IH', 'D'])
Out[77]:
[0.13438639125151883, 0.05492102065613609, 0.06366950182260024, 0.16865127582017012, 0.0865127582017011]
In [78]:
math.prod(_)
Out[78]:
6.856383994933627e-06
In [79]:
get_slot_probs(['D', 'IH', 'S', 'IY', 'T'])
Out[79]:
[0.04763061968408262, 0.0928311057108141, 0.0551640340218712, 0.03329283110571082, 0.10230862697448359]
In [80]:
math.prod(_)  # 'deceit' positional probability, much lower than 'seated'
Out[80]:
8.308043402913229e-07
In [81]:
targetwords_df['positprob'] = targetwords_df.phonemes.map(lambda x: math.prod(get_slot_probs(x)))
targetwords_df.tail()
Out[81]:
         word  variant            phonemes  stress         p            ipa         CV       bagprob     positprob
4110  zipping        0  [Z, IH, P, IH, NG]       1  0.988180      z ɪ p ɪ ŋ  C V C V C  4.322689e-08  1.522105e-07
4111   zircon        0   [Z, ER, K, AH, N]       1  0.677966  z ɝ/ɚ k ʌ/ə n  C V C V C  9.174892e-08  6.696916e-08
4112   zombie        0   [Z, AA, M, B, IY]       1  0.997669      z ɑ m b i  C V C C V  7.876899e-09  1.015675e-08
4113   zoning        0  [Z, OW, N, IH, NG]       1  0.958525     z oʊ n ɪ ŋ  C V C V C  1.824696e-08  1.237945e-07
4114   zygote        0   [Z, AY, G, OW, T]       1  0.884804    z aɪ g oʊ t  C V C V C  3.316702e-09  3.907032e-09
In [82]:
targetwords_df.sort_values(by=['positprob'], ascending=False)
Out[82]:
         word  variant             phonemes  stress         p            ipa         CV       bagprob     positprob
3077  searing        0   [S, IH, R, IH, NG]       1  0.976077      s ɪ ɹ ɪ ŋ  C V C V C  3.408251e-07  1.908292e-05
3228  sitting        0   [S, IH, T, IH, NG]       1  0.988479      s ɪ t ɪ ŋ  C V C V C  3.832362e-07  1.510491e-05
3595    synod        0    [S, IH, N, AH, D]       1  0.578554    s ɪ n ʌ/ə d  C V C V C  5.299214e-07  1.371262e-05
3098  selling        0   [S, EH, L, IH, NG]       1  0.997792      s ɛ l ɪ ŋ  C V C V C  1.476412e-07  1.231149e-05
2986    salad        0    [S, AE, L, AH, D]       1  1.000000    s æ l ʌ/ə d  C V C V C  2.682290e-07  1.114799e-05
...       ...      ...                  ...     ...       ...            ...        ...           ...           ...
2420   overdo        0   [OW, V, ER, D, UW]       0  0.948052   oʊ v ɝ/ɚ d u  V C V C V  5.236192e-09  5.416380e-11
2421  overdue        0   [OW, V, ER, D, UW]       0  1.000000   oʊ v ɝ/ɚ d u  V C V C V  5.236192e-09  5.416380e-11
97      aloha        0  [AH, L, OW, HH, AA]       2  0.932292   ʌ/ə l oʊ h ɑ  V C V C V  1.234740e-08  3.312555e-11
2419  overbuy        0   [OW, V, ER, B, AY]       0  0.803714  oʊ v ɝ/ɚ b aɪ  V C V C V  5.191565e-09  2.364601e-11
140    anyhow        0  [EH, N, IY, HH, AW]       0  0.963636     ɛ n i h aʊ  V C V C V  3.363159e-09  1.540477e-11

4115 rows × 9 columns

Excluding 'searing' and 'sitting', which have duplicate phones, the top 3 are 'synod', 'selling', and 'salad'.

  • I don't like 'synod'. First off, I don't even know the word! Secondly, I want to save 'ɪ' for a 2nd or 3rd guess word ending in 'ɪ ŋ'.
  • 's ɛ l ɪ ŋ' is also not ideal. Its bag probability is much lower than that of 'salad': the 'ɪ ŋ' ending boosts the positional probability, but 'ŋ' has much lower probability elsewhere. That means 'ɪ ŋ' will either produce an exact-slot hit (green) or be wasted, i.e., it won't offer up "somewhere in there" hits (yellow). This is why '-ing' words work better as the 2nd or even 3rd guess.
  • So, 'salad' it is! It's a fun word too.
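As a sanity check, the same distinct-phones filter idea from before, applied to positional probability (a sketch), reproduces this ranking:

distinct = targetwords_df.phonemes.map(lambda ps: len({arpa_map[p] for p in ps}) == 5)
targetwords_df[distinct].sort_values(by=['positprob'], ascending=False).head(3)   # -> 'synod', 'selling', 'salad'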

Q7: One opener isn't quite enough against 39 (!!) speech sounds. Can we come up with a set of three words?¶

In [83]:
# code that Na-Rae didn't transfer from Python shell session...
# Step 1: lock 'salad' as WORD1 
# Step 2: find WORD2 - no overlapping phones with 'salad', with highest positional probability
# Step 3: find WORD3 - no overlap with 'salad' and WORD2 
#   (... not strictly going by top probs, applied some favoritism...) 
# the three words I chose: 'salad', 'caring', and 'painter'.  
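A minimal sketch of what that greedy search could look like (pure greedy, no favoritism applied, so it won't necessarily land on the exact same three words):

def pick_openers(df, n=3):
    # Greedily take the highest-positprob word whose 5 phones are all
    # distinct and don't overlap with the phones already covered.
    covered, picks = set(), []
    for _, row in df.sort_values(by=['positprob'], ascending=False).iterrows():
        phones = {arpa_map[p] for p in row.phonemes}
        if len(phones) == 5 and not (phones & covered):
            picks.append(row.word)
            covered |= phones
        if len(picks) == n:
            break
    return picks

pick_openers(targetwords_df)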
In [84]:
targetwords_df[targetwords_df.word.isin({'salad', 'painter', 'caring'})]
Out[84]:
         word  variant            phonemes  stress         p           ipa         CV       bagprob  positprob
606    caring        0  [K, EH, R, IH, NG]       1  0.994490     k ɛ ɹ ɪ ŋ  C V C V C  1.106566e-07   0.000008
2440  painter        0   [P, EY, N, T, ER]       1  0.995098  p eɪ n t ɝ/ɚ  C V C C V  1.163877e-07   0.000002
2986    salad        0   [S, AE, L, AH, D]       1  1.000000   s æ l ʌ/ə d  C V C V C  2.682290e-07   0.000011

With these, my actual HEARDLE gameplay goes like this:

  • Start with 'salad'
  • Then I try 'painter'
  • If either word has produced a green ("exact position") hit in slot 4 or 5, 'caring' as the next word becomes pretty pointless, since I already know 'ɪ ŋ' will be a miss and 'ŋ' has low probability outside of slot 5.
  • At this point, I go for a different 3rd guess, such as 'marquee' [m ɑ ɹ k i].
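A quick check that 'marquee' really brings five fresh sounds relative to 'salad' and 'painter' (a sketch, assuming 'marquee' has a CMUdict entry):

opener12 = {arpa_map[p] for p in prondict['salad'][0] + prondict['painter'][0]}
marquee = {arpa_map[p] for p in prondict['marquee'][0]}   # assuming 'marquee' is in prondict
marquee & opener12   # empty set: no overlap with the first two openers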

Q8: So, the three openers. How much coverage do their 15 phonemes provide? What % of target words get zero hits? What % get all 5 phones covered?¶

In [85]:
prondict['salad'], prondict['painter'], prondict['caring']
Out[85]:
([['S', 'AE1', 'L', 'AH0', 'D']], [['P', 'EY1', 'N', 'T', 'ER0'], ['P', 'EY1', 'N', 'ER0']], [['K', 'EH1', 'R', 'IH0', 'NG']])
In [86]:
opener_phones = set(prondict['salad'][0] + prondict['painter'][0] + prondict['caring'][0])
opener_phones
Out[86]:
{'S', 'P', 'EH1', 'L', 'ER0', 'AE1', 'D', 'IH0', 'AH0', 'N', 'EY1', 'T', 'R', 'K', 'NG'}
In [87]:
opener_phones = {arpa_map[p] for p in opener_phones}
opener_phones
Out[87]:
{'t', 'k', 'p', 'ɪ', 'ʌ/ə', 'd', 'ŋ', 'ɝ/ɚ', 'eɪ', 'ɛ', 'l', 'n', 's', 'ɹ', 'æ'}
In [88]:
len(opener_phones)
Out[88]:
15
In [89]:
targetwords[:20]
Out[89]:
[('abashed', ['AH', 'B', 'AE', 'SH', 'T']), ('abhor', ['AE', 'B', 'HH', 'AO', 'R']), ('ablate', ['AH', 'B', 'L', 'EY', 'T']), ('ablaze', ['AH', 'B', 'L', 'EY', 'Z']), ('abloom', ['AH', 'B', 'L', 'UW', 'M']), ('aboard', ['AH', 'B', 'AO', 'R', 'D']), ('abort', ['AH', 'B', 'AO', 'R', 'T']), ('abound', ['AH', 'B', 'AW', 'N', 'D']), ('abrade', ['AE', 'B', 'R', 'EY', 'D']), ('abridge', ['AH', 'B', 'R', 'IH', 'JH']), ('abroad', ['AH', 'B', 'R', 'AO', 'D']), ('abscess', ['AE', 'B', 'S', 'EH', 'S']), ('absurd', ['AH', 'B', 'S', 'ER', 'D']), ('acacia', ['AH', 'K', 'EY', 'SH', 'AH']), ('accede', ['AE', 'K', 'S', 'IY', 'D']), ('access', ['AE', 'K', 'S', 'EH', 'S']), ('acclaim', ['AH', 'K', 'L', 'EY', 'M']), ('accord', ['AH', 'K', 'AO', 'R', 'D']), ('accost', ['AH', 'K', 'AO', 'S', 'T']), ('account', ['AH', 'K', 'AW', 'N', 'T'])]
In [90]:
targetwords_nohit = [(w,pron) for (w,pron) in targetwords if all([arpa_map[p] not in opener_phones for p in pron])]
len(targetwords_nohit)
Out[90]:
4
In [91]:
targetwords_nohit   #!!!
Out[91]:
[('beehive', ['B', 'IY', 'HH', 'AY', 'V']), ('mambo', ['M', 'AA', 'M', 'B', 'OW']), ('wiseguy', ['W', 'AY', 'Z', 'G', 'AY']), ('zombie', ['Z', 'AA', 'M', 'B', 'IY'])]
In [92]:
targetwords_hitcount = [(w,pron, len([p for p in pron if arpa_map[p] in opener_phones])) for (w,pron) in targetwords]
targetwords_hitcount[:5]
Out[92]:
[('abashed', ['AH', 'B', 'AE', 'SH', 'T'], 3), ('abhor', ['AE', 'B', 'HH', 'AO', 'R'], 2), ('ablate', ['AH', 'B', 'L', 'EY', 'T'], 4), ('ablaze', ['AH', 'B', 'L', 'EY', 'Z'], 3), ('abloom', ['AH', 'B', 'L', 'UW', 'M'], 2)]
In [93]:
nltk.FreqDist([count for (w,pron,count) in targetwords_hitcount])
Out[93]:
FreqDist({4: 1527, 3: 1398, 5: 543, 2: 536, 1: 107, 0: 4})
  • Only 4 (!!!) target words with zero hits. Out of 4115 words. Wow.
  • Over half of target words get 4-5 hits! Pretty good.
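To answer Q8's percentage question directly, a quick follow-up on the distribution above:

hit_fd = nltk.FreqDist([count for (w, pron, count) in targetwords_hitcount])
for hits in sorted(hit_fd):
    print(hits, 'hits:', f'{hit_fd.freq(hits):.1%}')   # 0 hits: ~0.1%; 4 or 5 hits: ~50% combined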