Sunday, October 18, 2009

What is the piano doing? (Posted in reverse order)

I'm posting the two halves of this explanation in reverse order because, when I view my posts, more recent ones come up above earlier ones. If your display is the other way around, I apologize!

The piano voice synthesizer, Part II

Back to the background: there are two more things to consider before we get to the piano and what it is doing. They are the Vocoder and the Sonogram.

There is a special case of the formant-filter electronic voice synthesizer called the Vocoder. This interesting machine works in the conceptual gap between the formant synthesizer and the recorder-reproducer, and understanding it really is key to understanding the piano.

The vocoder is called a vocoder because it both codes the voice and produces vocals. One side at a time: to record, the human voice is presented to the vocoder's input, which is a bank of fixed filters. Each filter carves off a part of the vocal signal, based on frequency. You have probably seen a graphic equalizer: rows of sliders with cryptic labels like "125Hz" and "250Hz" and "500Hz" over each one, with which you can shape sound electronically. Lift a few sliders that are close together, and the sound in that part of the audio range is increased. Do it to sliders on the right, and high-frequency sounds increase (and hiss!); do it to sliders on the left, and bass sounds increase (and maybe thumps and booms, too!). The fun of graphic equalizers is that they slice up the audio band from low pitches to high pitches and make control of those slices easier! The vocoder's input doesn't merely change the incoming sound; it records the amount of power in each slice. Then the reproducer uses that information to control how much power is allowed through each slice in another bank of filters. If you feed that second bank a sound source that is appropriately like the vocal cords, then the output sound closely matches the input sound.

Before we go on, let's review what we have: on the input side, we have a bank of filters which _analyze_ the sound into bands. Each band is associated with a slice of the frequency spectrum from low to high. The numbers that come out of each band tell how much power is in that band. If the power increases, the numbers get bigger. If it decreases, the numbers get smaller. These numbers, when applied to a similar bank of controllable filters, set how much power is allowed through each filter, and those filters act on a rich source of sound to produce an output like the input. We call the numbers 'coefficients', which isn't really kind to the numbers or us, but sound people are like that.
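If it helps to see the analysis side as code, here is a minimal sketch in Python. It is not how any particular vocoder is built: it assumes the sound is already a list of samples, and it swaps the bank of analog filters for a slow but simple discrete Fourier transform whose bins are summed into bands. The name `band_powers` and the band edges are made up for illustration.

```python
import cmath
import math

def band_powers(frame, sample_rate, band_edges):
    """Return the power found in each frequency band of one short frame.

    band_edges is a rising list of frequencies in Hz; band b covers
    band_edges[b] up to (not including) band_edges[b + 1].
    """
    n = len(frame)
    powers = [0.0] * (len(band_edges) - 1)
    # Naive DFT over the positive-frequency bins: slow, but needs no libraries.
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        power = abs(coeff) ** 2 / n ** 2
        # Add this bin's power to whichever band contains its frequency.
        for b in range(len(powers)):
            if band_edges[b] <= freq < band_edges[b + 1]:
                powers[b] += power
                break
    return powers

# A 1000 Hz tone should land almost entirely in the middle band.
tone = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(256)]
coefficients = band_powers(tone, 8000, [0, 500, 1500, 4000])
```

The list that comes back is one set of 'coefficients': one number per band, bigger when there is more power in that band.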

Now, what if we put something different into the output filters? Use a guitar: it's just like a talk-box. The filters act like an acoustical resonator and carve away the sound until it sounds like the guitar is talking! Use a woman's voice, and shift the coefficients so they feed higher-frequency filters than originally, and she can go "eeeeee" and out come the words a man said at the input side! Substitute a musical synthesizer with a good, buzzy output, and make the man sing!

The nice thing about the vocoder is that you can connect the output of the analyzing filters to the analogous controllable output filters, and play music into the output filter while talking into the input filter, and the music comes out with words! Vocoders are used a lot in the entertainment business now, and digital versions of them are so sophisticated that they can be used to correct the pitch of a singer or add other voices in harmony with a singer, using the same enunciation and expression!
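The output side can be sketched the same way. This is only an illustration under some loud assumptions: instead of real controllable filters, it builds a buzzy source out of harmonics and scales each harmonic by the gain of the band it falls in, which has roughly the same effect. The function `shape_carrier` and its parameters are hypothetical, and the gains are treated as amplitude gains (for example, the square root of a measured band power).

```python
import math

def shape_carrier(fundamental_hz, band_edges, band_gains,
                  sample_rate, n_samples, n_harmonics=20):
    """Build a buzzy harmonic source and scale each harmonic by its band's gain."""
    out = [0.0] * n_samples
    for h in range(1, n_harmonics + 1):
        freq = h * fundamental_hz
        if freq >= sample_rate / 2:
            break  # stay below the Nyquist frequency
        # Look up the gain of the band this harmonic falls into (0 if none).
        gain = 0.0
        for b, g in enumerate(band_gains):
            if band_edges[b] <= freq < band_edges[b + 1]:
                gain = g
                break
        if gain == 0.0:
            continue
        for i in range(n_samples):
            out[i] += gain * math.sin(2 * math.pi * freq * i / sample_rate)
    return out

# Only the middle band is open, so only harmonics between 500 and 1500 Hz survive.
shaped = shape_carrier(200, [0, 500, 1500, 4000], [0.0, 1.0, 0.0], 8000, 400)
```

Change the gains many times a second, following the coefficients from someone talking, and the buzz starts to take on the shape of the speech.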

OK, now that the concepts of filters, coefficients and power-in-a-band (of frequencies) are established, one more thing:

The sonogram is a picture of sound. Sound is complex in many ways. First of all, it depends on volume, pitch (frequency), and time. Remember the vocoder: each filter band has a range of frequencies that it is active for, and it has no response to frequencies outside that range. When a sound enters the filter band that it responds to, the level of output depends on the volume of the incoming sound: if it increases, the signal telling how much power is in the band increases, and so on. And this varying happens with time. The fact that we can analyze the voice with a set of filters, as we do with the vocoder, means that we can just as easily do it with more filters (or fewer: the first vocoder only used five filters on each end!).

The sonogram, conceptually, is a display of many filters, shown with low frequencies lower and high frequencies higher. Time is shown from left (history) to right (more recent history). And volume is shown by the color of each point on the sonogram. Where no sound happens, no color happens. If the sonogram is "black and white", then it'll be black where there is no sound and get lighter as sound at that frequency increases. If a sound starts very quiet and grows loud, then dies away, but stays at one frequency, the sonogram will show one horizontal line which starts very dark, rises to a level of whiteness, then dims back to darkness. If a sound (like a drum-stroke) produces many frequencies all at once, but not for a long time, the sonogram will have a single vertical line, probably very light. Usually "false color" is used: say, blue for very quiet sounds, green for middle-loud sounds, and red for very loud sounds. And, of course, the filters are not perfect, so sound from one band "leaks" into neighboring filters. A person saying "bOOp" (which is very pure, i.e., may show up only at one frequency) may appear as a round spot with a vertical line at the start (for the b) and another at the end (for the p).
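Here is a small sketch of that grid in Python, for readers who like to see it as data. It assumes the sound is a list of samples, chops it into frames that do not overlap, and measures the power near each frequency of interest with the Goertzel algorithm (a standard way to check a single frequency). The function names and the frame length are my own choices for illustration, not part of any particular sonogram tool.

```python
import math

def goertzel_power(frame, sample_rate, freq):
    """Power of `frame` near `freq`, computed with the Goertzel algorithm."""
    coeff = 2 * math.cos(2 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in frame:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def sonogram(samples, sample_rate, freqs, frame_len):
    """Return a grid: one row per frequency (low to high), one column per frame."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [[goertzel_power(f, sample_rate, fq) for f in frames] for fq in freqs]

# A 500 Hz tone followed by a 1000 Hz tone: the bright spots move up a row.
rate = 8000
sound = [math.sin(2 * math.pi * (500 if i < 320 else 1000) * i / rate)
         for i in range(640)]
grid = sonogram(sound, rate, [500, 1000], 160)
```

Each number in `grid` is one spot on the picture; paint big numbers light and small numbers dark and you have a black-and-white sonogram.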

The sonogram is a way to see sound. It has been used in various forms, both analog and digital, for a century to analyze speech and as a tool to train speakers, as well as a way to analyze musical and other sounds.

If each spot on a sonogram could be tied to the control of a filter tuned to exactly that frequency, it could be used like the Vocoder to impress recorded speech into other sounds. In this case, we're using filters to change the shape of the sound that goes through them, and providing a single complex sound for those filters to carve up.

So what is the piano doing?

The piano voice synthesizer

Remember that the sonogram is a map of the acoustic energy present at each point in time, showing its frequency by its vertical placement, and its power by its color. What is analogous to this?

The player piano is a device that uses vacuum to actuate the hammers of a piano. The actuators are controlled by a mechanism with a hole for each hammer's actuator, over which a roll of paper with holes is passed. Where a hole in the paper passes over a hole in the reader, the pressure drops, a hammer is launched against the strings, and a note happens. If you take the piano roll and hold it with the beginning to the left and unroll it a bit, you can actually see the notes of the introduction, now going "up" as higher strings are sounded, now going "down" as lower notes are played.

This is a very "binary" approach: a hammer is actuated, or it is not. Later player piano roll systems added additional rows of holes to control volume, and some even added rows for speed. All in all, the system is very understandable: the sonogram is very similar.

But what does the sonogram _actually_ portray? Each spot, looked at vertically, indicates a frequency, around which some acoustic energy is present. And taken horizontally, each spot is a moment in time. The sonogram is a player piano roll for speech.

Conceptually, there is only one thing left to discuss: what goes on within each spot, and how closely must that be recreated to produce recognizable speech?

The answer may be amazing to you. In actual fact, if there are enough 'bins', which are the filter bands, and they change quickly enough, the actual sound in the band doesn't need to be terribly like the original sound in the band at all!

This is why the vocoder works so nicely: as long as the formants (the output filters) are tracking the analyzers (input filters), you can let a moose bellow into the mike, and get out speech! Or you could play harpsichord chords (or, as Don Dorsey did for his post-1977 versions of Disney's "Main Street Electrical Parade", synthesizer fanfares) and out will come sung words in harmonies!

So, what if we use a sonogram with 88 bins? The piano has 88 keys, tuned to 88 individual pitches, and each bin is tuned to the same frequency as one of those keys. If the bins are made moderately tight, then the amount of sound power in each bin represents the amount of power in frequencies close to the associated piano note. What, then, if the sonogram is treated like a piano roll, and the strength at each moment in time of each spot on the sonogram is used as the impulse energy for the associated key/hammer of the piano?
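To make the 88 bins concrete, here is a hedged sketch in Python. It assumes standard equal-temperament tuning with the A above middle C (key 49) at 440 Hz, and it assumes a hammer strength scale of 0 to 127 with a square-root mapping from power to strength. Both of those are my choices for illustration; I don't know what scale or mapping the piano in the video actually uses.

```python
import math

def key_frequency(key):
    """Frequency in Hz of piano key 1..88 (equal temperament, key 49 = A 440)."""
    return 440.0 * 2 ** ((key - 49) / 12)

def powers_to_strengths(powers, max_strength=127):
    """Turn one sonogram column (88 bin powers) into 88 hammer strengths.

    The loudest bin gets max_strength; the rest scale with the square root
    of their power, which is an assumed mapping rather than a measured one.
    """
    peak = max(powers)
    if peak <= 0:
        return [0] * len(powers)
    return [round(max_strength * math.sqrt(p / peak)) for p in powers]

# The center frequency of every bin, lowest key first.
bin_frequencies = [key_frequency(k) for k in range(1, 89)]

# One moment in time: power near A 440 and a weaker partial an octave up.
column = [0.0] * 88
column[48] = 4.0   # key 49, A 440
column[60] = 1.0   # key 61, A 880
strengths = powers_to_strengths(column)
```

Run that for every column of the sonogram, many columns a second, and you have a piano roll where each "hole" also says how hard to hit the key.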

And the answer is what you see in that video of the piano speaking!

There is no need for electronics. All they would do is vary the volume of the sound from the piano at specific frequencies to impress the voice shape on the piano sound. Instead, by controlling the strength with which each key is actuated, you get the same effect! Overall, the ear knits the result into one sound, which your brain can (at least after it gets the hint) interpret as speech.

And that's all there is to it!
