
Fun with grep

I was reading the sed & awk Pocket Reference and came across the regex
b[aeiou]g, which matches the words bag, beg, big, bog and bug. This got me
thinking: how many pairs of letters form a valid three-letter word with every
one of the five vowels placed between them?
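
As a quick check of what that regex does, here is a minimal sketch run against a hand-made sample rather than a real dictionary (the sample words and the matches variable are made up for illustration):

```shell
# Hand-made sample list, one word per line (an assumption, not a real dictionary).
# Anchoring with ^ and $ keeps longer words such as "bugs" out of the result.
matches=$(printf '%s\n' bag beg bgg big bog bug bugs | grep -E '^b[aeiou]g$')
echo "$matches"
```

Only the five words with a vowel in the middle survive; bgg and bugs are filtered out.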

My first attempt was to:

  1. output the contents of the words file in /usr/share/dict
  2. cut out the three-letter words
  3. get all the words on a single line for grepping, by translating newlines to spaces
  4. search for stretches that start with any two letters with an a between them and
     run on to the same two letters with an e between them, then an i, an o and a u
  5. print the first word of each stretch
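
# Keep the lowercase three-letter words, join them onto one line, then print
# the first word of every stretch running XaY ... XeY ... XiY ... XoY ... XuY.
grep -E "^[a-z]{3}$" /usr/share/dict/words \
| tr '\n' ' ' \
| grep -Eo "(\w)a(\w).*\1e\2.*\1i\2.*\1o\2.*\1u\2" \
| awk '{print $1}'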

Which outputs


Notice how bag isn’t on the list? That’s because grep -o only reports
non-overlapping matches: it starts looking for the next match after the previous
one has ended. The match that begins at bad runs all the way to bud, and bag lies
inside that stretch, so it is never tried as a starting point. To get a complete
set of results we need to grep through our word list multiple times. My solution
was to take every three-letter word with an a in the middle, save its first and
third letters as variables, and then grep for those letters with each vowel
between them in a separate grep instance, like this:

for i in $(grep -E "^\wa\w$" words); do
    # first and third letter of the current "a" word
    first=$(echo "$i" | cut -c 1)
    third=$(echo "$i" | cut -c 3)
    grep -E "^[a-z]{3}$" words | tr '\n' ' ' \
    | grep -Eo "$i.*${first}e$third.*${first}i$third.*${first}o$third.*${first}u$third" \
    | awk '{print $1}'
done
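
To see the swallowed match in isolation, here is a minimal sketch that runs the back-reference pattern from the first attempt over a hand-made ten-word sample. The sample variable and its contents are made up for illustration, and the sketch assumes a grep that accepts \w and back-references with -E, as GNU grep does.

```shell
# Hand-made sample (an assumption for illustration, not the real dictionary).
sample="bad bag bed beg bid big bod bog bud bug"

# One pass with grep -o: matches are reported left to right and never overlap.
first_words=$(echo "$sample" \
  | grep -Eo "(\w)a(\w).*\1e\2.*\1i\2.*\1o\2.*\1u\2" \
  | awk '{print $1}')

echo "$first_words"
```

Only bad comes out: the single match spans from bad to bud, and what is left after it (bug) cannot start a new match.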

While we are changing things, it’s nicer to format our results and output them to a file:

for i in $(
    for i in $(grep -E "^\wa\w$" words); do
        first=$(echo "$i" | cut -c 1)
        third=$(echo "$i" | cut -c 3)
        grep -E "^[a-z]{3}$" words | tr '\n' ' ' \
        | grep -Eo "$i.*${first}e$third.*${first}i$third.*${first}o$third.*${first}u$third" \
        | awk '{print $1}'
    done
); do
    # rebuild the five words from the matching "a" word and append them to results
    first=$(echo "$i" | cut -c 1)
    third=$(echo "$i" | cut -c 3)
    echo "$i ${first}e$third ${first}i$third ${first}o$third ${first}u$third" >> results
done

which outputs

bad bed bid bod bud
bag beg big bog bug
ban ben bin bon bun
bas bes bis bos bus
bat bet bit bot but
dab deb dib dob dub
dae dee die doe due
dag deg dig dog dug
dam dem dim dom dum
dan den din don dun
dap dep dip dop dup
dar der dir dor dur
fad fed fid fod fud
fan fen fin fon fun
far fer fir for fur
fat fet fit fot fut
gad ged gid god gud
gan gen gin gon gun
gat get git got gut
had hed hid hod hud
hae hee hie hoe hue
ham hem him hom hum
han hen hin hon hun
hap hep hip hop hup
hat het hit hot hut
jag jeg jig jog jug
lad led lid lod lud
lag leg lig log lug
lat let lit lot lut
mad med mid mod mud
mag meg mig mog mug
mam mem mim mom mum
man men min mon mun
mat met mit mot mut
nab neb nib nob nub
nat net nit not nut
pan pen pin pon pun
pap pep pip pop pup
par per pir por pur
pas pes pis pos pus
pat pet pit pot put
rab reb rib rob rub
rad red rid rod rud
rag reg rig rog rug
ram rem rim rom rum
rat ret rit rot rut
rax rex rix rox rux
sae see sie soe sue
san sen sin son sun
sap sep sip sop sup
tae tee tie toe tue
tag teg tig tog tug
tam tem tim tom tum
tan ten tin ton tun
tat tet tit tot tut
vag veg vig vog vug
wad wed wid wod wud
wan wen win won wun
wat wet wit wot wut

This list is dubious. While every word appears in the words file, bon is clearly
French, dem is an eye-dialect spelling of them, and so on. However much I enjoy
ged meaning a European pike, or rix meaning to reign, it’s time to use another
word list. So I used one I created from the SCOWL (Spell Checker Oriented Word
Lists) database and found the only two answers to our puzzle in common usage:

bag beg big bog bug
pat pet pit pot put
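
Since the answer depends so heavily on the word list, it can help to have a version that is quick to rerun against any list. The following is a rough sketch of a different technique from the grep loop used earlier: a single awk pass that remembers every lowercase three-letter word and then checks each a-word for its four siblings. The function name vowel_sets and the one-word-per-line file it takes as its argument are assumptions for illustration.

```shell
# vowel_sets: hypothetical helper; $1 is a word list with one word per line.
vowel_sets() {
  awk '
    # remember every lowercase three-letter word
    /^[a-z][a-z][a-z]$/ { seen[$0] = 1 }
    END {
      for (w in seen) {
        if (substr(w, 2, 1) != "a") continue   # only start from the "a" word
        f = substr(w, 1, 1); t = substr(w, 3, 1)
        e = f "e" t; i = f "i" t; o = f "o" t; u = f "u" t
        if ((e in seen) && (i in seen) && (o in seen) && (u in seen))
          print w, e, i, o, u
      }
    }
  ' "$1" | sort
}
```

Calling it as vowel_sets /usr/share/dict/words, or with any other word list, prints one line per complete set. Because each word is looked up in the array instead of being matched against the joined line, no set can be swallowed by an earlier match.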