
Chapter 1

 

[Figure: coarticulation between [s] and [ɹʷ], comparing the [s] in ‘I scheme’ with the [s] in ‘I scream’]

Representing two words

On this page, we look at recordings of two words of Logoori (a Bantu language of Kenya), which translate to English ‘dog’ and ‘new’. The goal is to say exactly how these words are similar, and how they are different, which then tells us something about this language. We will work through a few ways of analyzing and representing the physical properties of these words, uncovering advantages and disadvantages of these methods.

Sound Recording


Any sound can be recorded on a computer and saved as a file of numbers, which can then be played back. Here we have recordings of two Logoori words. On all of these pages, you can press ‘play’ to hear the word. I recommend that you do not listen to these recordings right now; instead, try to figure out the words based just on the numbers and pictures, then come back here to listen and see whether you were right.

‘dog’: this file contains 9,183 numbers (you can see the numbers here).
‘new’: this file contains 9,191 numbers (you can see the numbers here).
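To make “a file of numbers” concrete, here is a minimal Python sketch using only the standard library. The Logoori recordings themselves are not bundled with this text, so a short synthetic tone stands in for them: the sketch writes the tone out as a WAV file and reads it back as a plain list of integers, which is exactly the kind of number list described above.

```python
import io
import math
import struct
import wave

# Synthesize 0.01 s of a 440 Hz tone at an 8000 Hz sampling rate
# (a stand-in for the real recordings, which are not bundled here)
rate = 8000
samples = [int(3000 * math.sin(2 * math.pi * 440 * t / rate))
           for t in range(80)]

# Save it as a WAV file (here, an in-memory buffer) ...
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)            # 16-bit samples
    w.setframerate(rate)
    w.writeframes(struct.pack("<80h", *samples))

# ... and read it back: the "file of numbers" the text describes
buf.seek(0)
with wave.open(buf, "rb") as w:
    raw = w.readframes(w.getnframes())
numbers = list(struct.unpack("<80h", raw))

print(len(numbers))   # 80 numbers for 0.01 s of sound
print(numbers[:5])
```

A real one-second recording at a typical sampling rate contains thousands of such numbers, which is why the two word files above each hold around nine thousand.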

Those walls of numbers are completely uninformative by themselves: just looking at them, you learn nothing about the words. We need better visualization.

We can also graph those numbers, creating a waveform of these utterances of ‘dog’ and ‘new’.
[Waveforms of ‘dog’ and ‘new’]

An expert phonetician could tell you only a very little about these words from these pictures.
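The graphing step is nothing exotic: a waveform just plots each sample value against its position in time. As a toy illustration (again with a synthetic tone standing in for the recordings), even plain text can serve as a crude plotting device:

```python
import math

# Synthetic tone standing in for the recordings (not bundled here)
rate = 8000
samples = [math.sin(2 * math.pi * 200 * t / rate) for t in range(0, 40, 4)]

# One output row per sample; the '*' moves left and right with amplitude
lines = []
for s in samples:
    col = round(10 + 10 * s)   # map amplitude -1..1 to columns 0..20
    lines.append(" " * col + "*")
print("\n".join(lines))
```

Turn the page sideways and you have a (very low-resolution) waveform; real waveform displays do the same thing with thousands of samples per second.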

Spectrograms


A spectrogram is computed from these numbers and produces another batch of numbers, in this case a 116 × 204 matrix of floating-point numbers (23,664 numbers in all).
[Spectrograms of ‘dog’ and ‘new’]

An expert phonetician could tell you a bit more about these words from these pictures. But how do we talk about the differences systematically, without vague descriptions like “a bit darker”, “a bit further up the picture”, or “a bit further to the right”?
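For readers who want to see where such a matrix comes from: a spectrogram is a short-time Fourier analysis, measuring how much energy the signal has at each frequency within successive small windows of time. Here is a minimal sketch; the window and hop sizes are arbitrary choices for illustration, and real tools add tapered window functions, dB scaling, and much else.

```python
import cmath
import math

def spectrogram(samples, win=32, hop=16):
    """Minimal short-time Fourier magnitude spectrogram (a sketch)."""
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        # Magnitude of each DFT coefficient up to the Nyquist bin
        mags = []
        for k in range(win // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / win)
                        for n, x in enumerate(frame))
            mags.append(abs(coeff))
        frames.append(mags)
    return frames   # one row per time step, one column per frequency bin

# A pure 1000 Hz tone at an 8000 Hz sampling rate
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(256)]
S = spectrogram(tone)
print(len(S), "time steps x", len(S[0]), "frequency bins")
```

The result is exactly the kind of time × frequency matrix described above; darkness in the spectrogram picture corresponds to the size of these magnitude numbers. For the pure tone, the energy concentrates in the single frequency bin corresponding to 1000 Hz.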

Reduction of sound to formant measurements

Next, we reduce the sound recordings to a mere 228 numbers each (listed below), so that we can make comparisons like “F1 at 0.02 seconds in ‘dog’, versus F1 at 0.02 seconds in ‘new’”: the two words differ in F1 at that point in time by 32 Hz. An expert phonetician might eventually be able to tell you something about the words from these numbers, but that requires much more specialized analysis of these 228 numbers: at this point, we just have a smaller wall of numbers.
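Once the recordings are reduced to formant tracks, comparisons like the one just mentioned become simple lookups. A sketch, using the first rows of the table below:

```python
# Formant measurements (F1, F2, F3 in Hz) at the first two time
# points, taken from the table of 228 numbers per word
dog = {0.02: (459, 2198, 2863), 0.04: (451, 2174, 2885)}
new = {0.02: (491, 2151, 2849), 0.04: (475, 2155, 2851)}

t = 0.02
f1_diff = new[t][0] - dog[t][0]
print(f"F1 difference at {t} s: {f1_diff} Hz")   # 491 - 459 = 32 Hz
```

This is the payoff of the reduction: instead of “a bit darker” or “a bit further up the picture”, we can state a difference as an exact number of Hertz at an exact point in time.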

How well does this massively stripped-down representation preserve the original sound of the ‘dog’ and ‘new’ recordings? These numbers can be converted back to sound, and as you can hear (‘dog’, ‘new’), there has been a serious loss in sound quality. Go ahead and glance at the table of numbers.
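The conversion back to sound also explains the loss of quality: all that survives in the 228 numbers are three resonance frequencies per time step, so resynthesis can do little more than excite those frequencies. A deliberately crude sketch follows; real formant synthesis (Klatt-style, for example) also models bandwidths and amplitudes, and the 8000 Hz rate here is an assumption for illustration.

```python
import math

rate = 8000                        # assumed sampling rate for this sketch
frames = [(459, 2198, 2863),       # 'dog' formants at 0.02 s, from the table
          (451, 2174, 2885)]       # 'dog' formants at 0.04 s

# For each 0.02 s frame, sum one sinusoid per formant frequency
samples = []
for f1, f2, f3 in frames:
    for t in range(int(0.02 * rate)):
        s = sum(math.sin(2 * math.pi * f * t / rate) for f in (f1, f2, f3))
        samples.append(s / 3)      # keep amplitude in [-1, 1]

print(len(samples))   # 320 samples = 0.04 s of audio at 8000 Hz
```

Everything else in the original recording, including the fine spectral detail that makes a voice sound natural, has been thrown away, which is why the resynthesized versions sound so degraded.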

Time in seconds, Formants 1–3 in Hz

              ‘dog’                 ‘new’
Time      F1    F2    F3        F1    F2    F3
0.02     459  2198  2863       491  2151  2849
0.04     451  2174  2885       475  2155  2851
0.06     464  2156  2897       481  2139  2845
0.08     466  2107  2894       485  2126  2827
0.10     425  2138  2831       492  2110  2836
0.12     473  2148  2863       495  2101  2820
0.14     496  2158  2887       496  2092  2795
0.16     507  2198  2816       498  2071  2785
0.18     503  2184  2800       497  2043  2755
0.20     503  2169  2775       496  2033  2715
0.22     505  2140  2710       495  2021  2664
0.24     507  2087  2595       491  2007  2526
0.26     504  2023  2513       491  1957  2323
0.28     496  1877  2458       484  1845  2264
0.30     500  1787  2736       451  1747  2199
0.32     501  1958  3040       443  2143  3269
0.34     468  2169  3197       434  2128  3259
0.36     588  2225  3319       442  2110  3216
0.38     703  2245  3141       446  2098  3208
0.40     732  2253  3308       410  2086  3209
0.42     743  2259  3468       339  2061  3198
0.44     730  2261  3411       326  2054  3148
0.46     713  2248  3378       329  2067  3179
0.48     702  2249  3407       312  2058  3062
0.50     668  2267  3292       310  1866  2077
0.52     658  2231  3312       319  2056  3017
0.54     673  2212  3353       317  2041  3423
0.56     668  2188  3477       355  2054  3085
0.58     673  2166  3393       374  2035  3064
0.60     581  2142  3441       355  2015  3185
0.62     512  2141  3494       294  1992  3260
0.64     479  2201  3431       262  1887  3110
0.66     425  2183  3457       261  1954  3272
0.68     295  2117  3463       258  2008  3213
0.70     261  2132  3520       256  2082  3132
0.72     344  1629  2092       300  2170  2239
0.74     478  1062  2099       335  2141  2452
0.76     406   811  2197       351  2081  2645
0.78     403   824  2196       359  2065  2723
0.80     423   826  2225       369  2060  2773
0.82     441   864  2225       380  2068  2818
0.84     458   892  2246       381  2068  2842
0.86     485   906  2254       389  2068  2860
0.88     505   912  2256       396  2054  2840
0.90     529   909  2252       406  2044  2833
0.92     555   955  2323       423  2044  2821
0.94     561  1074  2454       464  2054  2812
0.96     546  1098  2452       484  2050  2804
0.98     530  1051  2307       496  2058  2774
1.00     535  1053  2176       496  2047  2748
1.02     602  1068  2194       499  2047  2713
1.04     816   849  2082       504  2019  2641
1.06     651  1090  2084       510  1997  2599
1.08     614  1122  2081       519  1984  2572
1.10     632  1111  2115       553  1958  2611
1.12     652  1122  2112       594  1952  2683
1.14     656  1154  2159       602  1917  2682
1.16     650  1162  2157       606  1905  2669
1.18     651  1168  2191       623  1909  2531
1.20     658  1176  2208       633  1896  2426
1.22     783  1172  2139       676  1947  2381
1.24     817  1185  2148       684  1996  2327
1.26     757  1167  2128       704  1935  2273
1.28     740  1177  2124       710  1909  2271
1.30     613  1145  2085       732  1744  2306
1.32     632  1150  2095       738  1716  2259
1.34     603  1173  2487       730  1605  2201
1.36     637  1182  2639       738  1641  2240
1.38    1162  2213  3497       712  1480  2248
1.40     347  1240  2774       698  1493  2266
1.42     618  1226  2903       627  1474  2101
1.44     619  1202  2740       682  1462  2139
1.46     787  1157  2561       641  1353  2271
1.48     620  1556  3132       691  1415  2290
1.50     702  1530  2991       666  1437  2246
1.52     636  1598  3025       688  1429  2274

Solution

The problem was that we were trying to do completely different things with a single instrument. There are many numeric tools for discovering the physical properties of sound (any sound at all), and there is a different set of tools that we use to understand linguistic sounds: symbolic representations of physical sound as it is used in the cognitive building blocks of language. We simply reduce these two recordings to a tiny set of technically-defined symbols specially developed for representing language sound – [ɪ́mbwá] ‘dog’ and [ɪ́mbjá] ‘new’. How we do that is the topic of Chapter 2.