Sunday, October 18, 2009

What is the piano doing? (Posted in reverse order)

I'm posting the two halves of this explanation in reverse order because, when I view my posts, more recent ones come up above later ones. If your display is the other way around, I apologize!

The piano voice synthesizer, Part I

So, what is happening here?

First a detour.

There are a number of ways to synthesize speech, but most of them involve determining a set of characteristics of speech, then building a mechanism to reproduce that kind of sound. Experiments in this direction are not new: in 1779, one C. G. Kratzenstein at the Imperial Academy in St. Petersburg constructed a device which generated vowel sounds by blowing through a reed into chambers shaped like a vocal tract.

These approaches all worked from the standpoint of analyzing the existing system and building mechanical analogs. One such analog is on display at the Exploratorium in San Francisco, CA, and a write-up on it, with pictures, can be seen here: http://www.exploratorium.edu/exhibits/vocal_vowels/vocal_vowels.html

Samuel Morse is supposed to have been able to form vocal sounds with his hands, and used the ability to prank friends and adults. (I don't remember, honestly, if that story is supposed to be true, but it goes on to say that when he was suffering some rather painful dentistry, he used it to tell the doctor to lighten up a little!)

There are other approaches which are valid: synthesize part of the system, then let the remainder of the system provide its normal function. Witness the 'Talk Box'. Whether you are more familiar with the guitar antics of Peter Frampton (or piano antics of Stevie Wonder), or the animated Casey Junior, the engine that pulled the circus train in Disney's classic Dumbo, you've heard this: a sound source is captured and applied to the vocal tract, and the vocalist merely moves their mouth and oral cavity as they would for speaking. With the talk-box, a tube leads the sound of an amplifier into the player's mouth; with the Sonovox (used for Casey Junior and numerous interesting commercials through the '60s), a pair of audio transducers is pressed lightly against the neck of the performer. In either case, the effect is the same: the vocal cords are replaced in function by another sound source, and the oral cavity, lips, tongue and teeth are employed just as in normal speech.

In each case, of course, the effort is to reproduce the physical action and the acoustical modifiers used in human vocal production.

Electronic efforts to reproduce vocal characteristics are more recent (electronics being more recent than mechanics!) The easiest of these is the recorder-reproducer, where a human speaks and the sound pressure wave from their voice is recorded electronically, whether on magnetic tape, in vinyl (and originally, recording to wax disks was totally mechanical), or as digital numbers, which themselves are recorded or stored. For playback, the recording goes through the opposite process, which takes the stored numbers or signals and turns them back into audible sound. In this case, the recording captures all the information and reproduces most of it, with some attendant noise. However, you can't record a woman saying "He saw the cat," and play back a man saying "It's a Rolex!" The playback is what was recorded. (This leaves out a whole branch of electronic music, where recorded sounds are distorted, reversed, stood on their heads and severely beaten, or simply chopped up and re-ordered. That's because the discussion of recording/reproducing is but a step toward a discussion of synthesis of speech, so please, let's not get off the track!)

One of the earliest efforts to reproduce the human voice electronically was the Voder of Homer Dudley. This machine had multiple keys and foot pedals, each assigned to a certain aspect of the electronic vocal model. For instance, there were keys to produce the gutturals, fricatives and pops produced by the tongue and lips for hard consonants. There were hiss generators which provided the SH and S sounds, and which could be mixed into the preceding consonant sounds, or into vowels for voiced consonants. And there was a set of "formant filters", which could be engaged by different amounts depending on the pressure of the performer's hands. And performer it was (and most often "she" was, since women were almost exclusively trained to operate the Voder). The Voder was demonstrated in a great hall at the 1939 World's Fair in NYC, and received rave reviews, but little came of it afterwards, probably because of the difficulty involved in operating it.

The formant filters are important. These are an electrical analog to the resonant characteristics of the human vocal tract in certain configurations. Generally, three formant filters are enough to make recognizable vowel sounds: one is tuned just above the excitation pitch, the next higher, and the third higher still. By controlling how much sound they let through in those ranges, and using a sound source which is rich in content in those ranges, the formant filters do just what the vocal tract and sinuses do to the complex sound coming from the vocal cords: they carve it away until it sounds like... well... vowels!

When I was in 7th grade, I found a Bell Labs kit in the classroom, and talked the teacher into letting me take it home and build it. My father helped me (a lot: he was a TV repair technician, and understood what the instructions said. It was a _real_ learning experience for me!) The result was an electrical circuit built on the back of a box, fed from a sound generator with a control voltage that made its pitch rise and fall. The generator turned out a sawtooth wave (very rich in harmonic content) and it fed through three formant filters, formed by capacitors and inductors. We could change the formant filter pitches by changing the capacitors, and change their strength by changing resistors, and we got it to say "ahhhh" easily, and "eeeeeee" and even long "o", but getting the long "u" (or "oo" really) was very difficult: the filters got so strong that we couldn't get enough sound out to hear it!

I'm going to require that you retain this last paragraph's information for the next post when I get back to the piano voice synthesizer, so maybe you want to go back and re-read it: three formant filters, which could have their frequency (pitch) and strength (Q is the official term, but you could think upside down and use the term damping as easily) adjusted, and a sound source that provided lots of rich components (harmonics), and which could have its pitch varied to lend a sense of emphasis. There was no effort made at consonants in this box, just vowels. For all intents and purposes, it acted just like the artificial vocal tracts shown on the Exploratorium page above!
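For the curious, that whole recipe (a rich sawtooth source carved by three formant filters) can be sketched in a few lines of Python. This is a rough sketch assuming NumPy is available; the formant frequencies and bandwidths below are illustrative values for an "ah", not measurements from the kit:

```python
import numpy as np

def sawtooth(f0, dur, fs=16000):
    """A sawtooth wave: rich in harmonics, like the buzz of the vocal cords."""
    t = np.arange(int(dur * fs)) / fs
    return 2.0 * (t * f0 % 1.0) - 1.0

def resonator(x, freq, bw, fs=16000):
    """Two-pole resonant band-pass filter: an electrical 'formant'."""
    r = np.exp(-np.pi * bw / fs)            # pole radius sets the bandwidth (Q)
    a1 = -2.0 * r * np.cos(2.0 * np.pi * freq / fs)
    a2 = r * r
    g = 1.0 - r                             # rough gain normalization
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = g * x[n] - a1 * y[n - 1] - a2 * y[n - 2]
    return y

# Three formants, roughly right for "ah": the filters boost sound near
# these frequencies and carve away the rest, just like the kit did.
fs = 16000
buzz = sawtooth(110.0, 0.25, fs)            # excitation at 110 Hz
vowel = sum(resonator(buzz, f, bw, fs)
            for f, bw in [(700, 110), (1200, 120), (2600, 160)])
```

Sweep the excitation frequency up and down and you get the rise and fall of pitch the control voltage provided; swap in different formant frequencies and you get different vowels.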

This is a good place to stop, until the next post.

What is the piano doing? (Posted in reverse order)

I'm posting the two halves of this explanation in reverse order because, when I view my posts, more recent ones come up above later ones. If your display is the other way around, I apologize!

The piano voice synthesizer, Part II

Back to the background: there are two more things to consider before we get to the piano and what it is doing. They are the Vocoder and the Sonogram.

There is a special case of the formant-filter electronic voice synthesizer called the Vocoder. This interesting machine works in the conceptual gap between the formant synthesizer and the recorder-reproducer, and understanding it really is key to understanding the piano.

The vocoder is called vocoder because it both codes the voice and produces vocals. One side at a time: to record, the human voice is presented to the vocoder's input, which is a bank of fixed filters. Each filter carves off a part of the vocal signal, based on frequency. If you've ever seen a graphic equalizer, you've seen something similar: rows of sliders with cryptic labels like "125 Hz" and "250 Hz" and "500 Hz" over each one, which let you shape sound electronically. Lift a few of the sliders that are close together, and the sound in that part of the audio range is increased: do it to sliders on the right, and high frequency sounds increase (and hiss!); do it to sliders on the left, and bass sounds are increased (and maybe thumps and booms, too!) The fun of graphic equalizers is that they slice up the audio band from low pitches to high pitches and make control of those slices easier! The vocoder's input doesn't merely change the incoming sound, though: it records the amount of power in each slice. Then, the reproducer uses that information to control the strength of power allowed in each slice in another bank of filters. If you apply a sound source that is appropriately like the vocal cords, then the output sound is essentially identical to the input sound.

Before we go on, let's review what we have: on the input side, we have a bank of filters which _analyze_ the sound into bands. Each band is associated with a slice of the frequency spectrum from low to high. The numbers that come out of each band tell how much power is in that band. If the power increases, the numbers get bigger. If it decreases, the numbers are smaller. These numbers, when applied to a similar bank of filters that can be controlled, will make the same amount of power 'be allowed' through each filter, and those filters will act on a rich source of sound to produce an output like the input. We call the numbers 'coefficients', which isn't really kind to the numbers or us, but sound people are like that.
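In Python, the analysis side might look like this rough sketch (assuming NumPy; an FFT plays the role of the fixed analog filter bank here, and the band edges are arbitrary choices of mine):

```python
import numpy as np

def band_coefficients(x, fs, edges, frame=256):
    """Analyze a signal into per-band power 'coefficients', frame by frame,
    like the input side of a vocoder.  Each row is one moment in time;
    each column is one band of the filter bank."""
    freqs = np.fft.rfftfreq(frame, 1.0 / fs)
    coeffs = []
    for start in range(0, len(x) - frame + 1, frame):
        spec = np.abs(np.fft.rfft(x[start:start + frame])) ** 2
        row = [spec[(freqs >= lo) & (freqs < hi)].sum()
               for lo, hi in zip(edges[:-1], edges[1:])]
        coeffs.append(row)
    return np.array(coeffs)        # shape: (frames, bands)

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)    # test input: a pure 440 Hz tone
c = band_coefficients(tone, fs, edges=[0, 300, 600, 1200, 2400, 4000])
```

Feed it a 440 Hz tone and only the 300-600 Hz column lights up; feed it speech and the columns rise and fall as the vowels and consonants go by.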

Now, what if we put something different into the input? Use a guitar: it's just like a talk-box: the filters act like an acoustical resonator and carve away the sound until it sounds like the guitar is talking! Use a woman's voice, and shift the coefficients so they feed higher-frequency filters than originally, and she can go "eeeeee" and out come the words a man said at the input side! Substitute a musical synthesizer with a good, buzzy output, and make the man sing!

The nice thing about the vocoder is that you can connect the output of the analyzing filters to the analogous controllable output filters, play music into the output filters while talking into the input filters, and the music comes out with words! Vocoders are used a lot in the entertainment business now, and digital versions of them are so sophisticated that they can be used to correct the pitch of a singer or add other voices in harmony with a singer, using the same enunciation and expression!
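To make the cross-synthesis idea concrete, here is a rough Python sketch (assuming NumPy; FFT bands again stand in for the analog filter banks, and the frame size and band edges are arbitrary choices of mine, not anything from a real vocoder design):

```python
import numpy as np

def vocode(modulator, carrier, fs, edges, frame=256):
    """Cross-synthesis sketch: measure per-band power of the modulator
    (the 'voice') and impose it on the same bands of the carrier
    (the 'instrument'), frame by frame."""
    freqs = np.fft.rfftfreq(frame, 1.0 / fs)
    out = np.zeros(len(carrier))
    for start in range(0, min(len(modulator), len(carrier)) - frame + 1, frame):
        M = np.fft.rfft(modulator[start:start + frame])
        C = np.fft.rfft(carrier[start:start + frame])
        Y = np.zeros_like(C)
        for lo, hi in zip(edges[:-1], edges[1:]):
            band = (freqs >= lo) & (freqs < hi)
            c_pow = np.sqrt((np.abs(C[band]) ** 2).sum()) + 1e-12
            m_pow = np.sqrt((np.abs(M[band]) ** 2).sum())
            # carrier keeps its own texture, scaled to the voice's envelope
            Y[band] = C[band] * (m_pow / c_pow)
        out[start:start + frame] = np.fft.irfft(Y, frame)
    return out
```

Played with real audio, the carrier keeps its own timbre within each band while its loudness contour tracks the voice, which is exactly the "music comes out with words" effect.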

OK, now that the concepts of filters, coefficients and power-in-a-band (of frequencies) are established, one more thing:

The sonogram is a picture of sound. Sound is complex in many ways: first of all, it depends on volume, pitch (frequency), and time. Specifically, if you remember the vocoder: each filter band has a frequency range that it is active for: it has no response to unrelated frequencies. When a sound enters the range that a band responds to, the level of output depends on the volume of the incoming sound: if it increases, the signal telling how much power is in the band increases, and so on. And all this varying happens over time. The fact that we can analyze the voice with a set of filters, as we do with the vocoder, means that we can just as easily do it with more filters (or fewer: the first vocoder only used five filters on each end!)

The sonogram, conceptually, is a display of many filters, shown with low frequencies lower and high frequencies higher. Time is shown from left (history) to right (more recent history). And volume is shown by the color of each point on the sonogram. Where no sound happens, no color happens. If the sonogram is "black and white", then it'll be black where there is no sound and get lighter as sound at that frequency increases. If a sound starts very quiet and grows loud, then dies away, but stays at one frequency, the sonogram will show one horizontal line which starts very dark, rises to a level of whiteness, then dims back to darkness. If a sound (like a drum-stroke) produces many frequencies all at once, but not for a long time, the sonogram will have a single vertical line, probably very light. Usually "false color" is used, say, blue for very quiet sounds, green for middle-loud sounds, and red for very loud sounds. And, of course, the filters are not perfect, so sound from adjoining bands "leak" into neighboring filters, so a person saying "bOOp" (which is very pure, i.e., may show up only at one frequency) may appear as a round spot with a vertical line at the start (for the b) and another at the end (for the p).
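Here is a minimal sketch of such a sonogram in Python (assuming NumPy; real sonographs use fancier analysis, and a plotting library would turn the array into the false-color picture described above):

```python
import numpy as np

def sonogram(x, fs, frame=256, hop=128):
    """A bare-bones sonogram: rows = frequency (low at row 0),
    columns = time (left = earlier), values = how loud that
    frequency is at that moment."""
    window = np.hanning(frame)
    cols = []
    for start in range(0, len(x) - frame + 1, hop):
        cols.append(np.abs(np.fft.rfft(x[start:start + frame] * window)))
    return np.array(cols).T        # shape: (frequencies, time)

fs = 8000
t = np.arange(fs) / fs
# A 1000 Hz tone that swells and fades: it should draw one horizontal
# line that brightens and then dims, just as described above.
x = np.sin(2 * np.pi * 1000 * t) * np.sin(np.pi * t)
S = sonogram(x, fs)
```

A drum stroke fed into the same function draws the vertical line instead: many rows lit at once, for only a column or two.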

The sonogram is a way to see sound. It has been used in various forms, both analog and digital, for decades to analyze speech and as a tool to train speakers, as well as a way to analyze musical and other sounds.

If each spot on a sonogram could be tied to the control of a filter tuned to exactly that frequency, it could be used like the Vocoder to impress recorded speech into other sounds. In this case, we're using filters to change the shape of the sound that goes through them, and providing a single complex sound for those filters to carve up.

So what is the piano doing?

The piano voice synthesizer

Remember that the sonogram is a map of the acoustic energy present at each point in time, showing its frequency by its vertical placement, and its power by its color. What is analogous to this?

The player piano is a device that uses vacuum to actuate the hammers of a piano. The actuators are controlled by a mechanism with a hole for each hammer's actuator, over which a roll of paper with holes is passed. Where the holes in the paper pass over a hole in the reader, the pressure drops, a hammer is launched against the strings, and a note happens. If you take the piano roll and hold it with the beginning to the left and unroll it a bit, you can actually see the notes of the introduction, now going "up" as higher strings are sounded, now going "down" as lower notes are played.

This is a very "binary" approach: a hammer is actuated, or it is not. Later player piano roll systems added additional rows of holes to control volume, and some even added rows for speed. All in all, the system is very understandable: the sonogram is very similar.

But what does the sonogram _actually_ portray? Each spot, looked at vertically, indicates a frequency, around which some acoustic energy is present. And taken horizontally, each spot is a moment in time. The sonogram is a player piano roll for speech.

Conceptually, there is only one thing left to discuss: what goes on within each spot, and how closely must that be recreated to produce recognizable speech?

The answer may be amazing to you. In actual fact, if there are enough 'bins', which are the filter bands, and they change quickly enough, the actual sound in the band doesn't need to be terribly like the original sound in the band at all!

This is why the vocoder works so nicely: as long as the formants (the output filters) are tracking the analyzers (input filters), you can let a moose bellow into the mike, and get out speech! Or, you could play harpsichord chords (or, as Don Dorsey did for his post-1977 versions of Disney's "Main Street Electrical Parade", synthesizer fanfares) and out will come sung words in harmonies!

So. What if we use a sonogram with 88 bins? Each bin is tuned to the pitch of one key of the piano, which has 88 keys tuned to 88 individual pitches. If the bins are made moderately tight, then the amount of sound power in each bin represents the amount of power in frequencies close to the associated piano note. What, then, if the sonogram is treated like a piano roll, and the strength at each moment in time of each spot on the sonogram is used as the impulse energy for the associated key/hammer of the piano?
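As a sketch of that last step, here is how one time-slice of speech could be turned into strike strengths for the 88 keys (Python with NumPy; the quarter-tone band width and the silence threshold are my own illustrative choices, not anything from the actual installation):

```python
import numpy as np

# MIDI notes 21..108 are the piano's 88 keys; A4 (MIDI 69) = 440 Hz.
PIANO_FREQS = 440.0 * 2.0 ** ((np.arange(21, 109) - 69) / 12.0)

def frame_to_strikes(x, fs, threshold=0.1):
    """One 'time slice' of the talking-piano idea: measure the power near
    each of the 88 piano pitches in one frame of speech, and return a
    strike strength (0..1) per key.  Keys below the threshold stay silent."""
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    strengths = np.zeros(88)
    for i, f in enumerate(PIANO_FREQS):
        # a bin a quarter-tone wide on each side of the piano pitch
        band = (freqs >= f * 2 ** (-1 / 24)) & (freqs < f * 2 ** (1 / 24))
        if band.any():
            strengths[i] = spec[band].max()
    strengths /= strengths.max() + 1e-12
    strengths[strengths < threshold] = 0.0   # too quiet: don't strike the key
    return strengths
```

Run this over consecutive frames of a recording and you have, in effect, punched a piano roll from a sonogram: each frame says which hammers to launch, and how hard.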

And the answer is what you see in that video of the piano speaking!

There is no need for electronics: all it would do is vary the volume of the sound from the piano at specific frequencies to impress the voice shape on the piano sound. Instead, by controlling the strength with which each key is actuated, you get the same effect! Overall, the ear knits the result into one sound, which your brain can (at least after it gets the hint) interpret as speech.

And that's all there is to it!

Thursday, October 8, 2009

Wow... this leaves me speechless!

Ok, it doesn't leave me incapable of saying anything! Dig this:

Make.Blog post about talking piano

Now, what exactly are we seeing? I am fairly sure that the voiced sounds are 100% piano sounds, i.e. mechanically produced. No synthesizers, no electronic sound sources, no electronic sound modification like filters, ring modulators, ADSRs, etc.

The piano speaks.

I'll write another post describing what I think is happening and how this relates to the world of speech synthesis after my arms stop hurting, though.

For now, watch, listen, and maybe go "wow" like I did.

Misery can be mine!

There is a time when the thought flits through your head, "That was pretty stupid." I recently had one of those moments, which is why I am writing this with cast/braces on both arms. I fell a week ago, and managed to land on both arms, extended, breaking and dislocating both elbows. After 5 days of misery, I convinced myself that I could go back to work, and tried it for 1.25 days, after which I turned myself in to the HR department. I am now entering a period of 'short-term disability.'

This may up my blog post rate, but right at the moment, every typed character costs a lot, so we'll see.

I have been learning a lot about anatomy, though...