Principles for Lipsync Animation
Keith Lango, 2001

I recently have been asked by a few people (OK, more than a few) to try and touch upon the area of facial animation and lipsync. Most of these requests have come from folks reading my Pose-to-Pose Organized Keyframing tutorial who then want some ideas on breaking down lipsync and facial animation.

Originally I had replied that facial and lipsync animation was the one area of my animation that was still undefined for me. By that I mean I hadn't taken the time to sit down and really think about how I logically approach lipsync and facial animation. I've always just kinda "done it", letting it flow from within me. I enjoyed being able to pretty much do a single straight-ahead pass at face and lipsync animation, follow it with a single 'tweak' pass, and call it done. I readily admit that I don't pre-plan my lipsync at all. And I don't spend a whole lot of time breaking down my facial animation as a whole. I do mark a few seminal emotions I want to capture, but I don't do anything near as organized or mechanical as the pop-thru for my body work. Basically, face and lipsync animation was the last bastion of real heartfelt art for me, and I admit I was reluctant to quantify that little bit of remaining magic in my art. :o) But recently I have taken some steps towards actually quantifying this stuff.

As such, I have some thoughts on lip sync that I feel folks might be willing to check into. Let me preface my words by stating clearly that I do not consider myself an authority on the topic. My thoughts are pretty much just mine, and folks may disagree with my assessment of how to approach lipsync. But the purpose of my efforts is to try and give some concrete "hooks" for animators to use. I want to avoid having these thoughts coming off as rules or suchlike. They're merely ideas and theories that may help some folks get their brains around lipsync in a different way. So with the caveats offered, I expound upon my particular approach to lipsync.

This paper is not exhaustive, but it does begin to address how I tend to THINK about lipsync animation conceptually. I am limiting my comments in this paper specifically to lipsync animation. I am currently developing my thoughts for another paper on facial animation as a whole, which will fold this paper's topics into a holistic approach to animating characters' faces with convincing speech and emotional acting.


In the Beginning...
Lipsync is a tricky thing to get the hang of at first. Many an animation shows the classic example of how just about everybody approaches it at first. The tendency is this:

1) Make 'sound' targets for 'sounds' like M and E and S and Th and F and such. (Some folks even go so far as to make targets for such 'sounds' as H and G and J and Z.)
2) Listen to the sound track.
3) For every 'sound' you hear, hit the 'sound' target at or near 100%.
4) Make a preview render of the lipsync animation.
5) Watch the mouth flap out of control.
6) Wonder what went wrong.

At least that's how it went for me at first. The problem is being too literal about animating a character talking: trying to animate the letters in the words instead of emphasizing only the major sounds needed to communicate the *idea* of speech.



There's No Such Thing As Letters in Speech...
Notice how I kept putting the word 'sound' in quotes above? That's because a common mistake for beginners is to associate LETTERS with SOUNDS.

Principle #1: Letters are not sounds. Sounds are not letters. There are NO letters in lipsync animation.

They serve similar roles, but in wildly divergent forms. LETTERS are representative symbols on a page (with a corresponding, arbitrarily assigned sound) that, when strung together to form words, communicate a thought. But letters aren't made for speech. They're for writing. And we're not animating writing, but speech. SOUNDS are utterances (with a corresponding arbitrarily assigned letter value used to transcribe the sound) that, when interpreted as understood words, communicate a thought. Sounds are for speech, but serve no use in writing. See the similarities and differences? So when you animate speech, don't animate letters. There are no letters in speech, only sounds, and the shape our faces take to make those sounds.
I know this sounds like an argument in semantics, but trust me, the distinction is very real. And when you learn to approach lipsync animation from the perspective of animating sound shapes instead of letters, your world will be a much brighter place.



So What Does that Mean For Animation?
Let's take a look at an example: the line "you hafta get" from the 10-second Club's November 2001 soundtrack takes about 25 frames to say. At first look, it seems like there ought to be the following keys for the phrase:
Y (a pucker shape)

That is a very literal interpretation of what it takes to show a person saying "you hafta get". But if you go ahead and keyframe the lipsync that way, you'll soon realize that this will result in a very poppy mouth when animated. Some of those poses will be onscreen for only a single frame, which is too much information and not enough time for the viewer to interpret it. A quick analysis will show that you go from one mouth shape that is quite open (Ah in hafta) to a pretty closed one (the F in hafta) and then back open again (for the end of hafta). The result is the mouth popping from open to closed back to open in just 3 frames. That's not fun to watch, folks.
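
A minimal sketch of how one might catch that kind of pop before rendering, in Python. The per-frame jaw values, the `find_pops` name, and the 0.6 / 0.2 thresholds are all assumptions made up for illustration; substitute whatever your rig actually exposes.

```python
def find_pops(openness, open_min=0.6, closed_max=0.2):
    """Return frame indices where the mouth goes open -> closed -> open
    on three consecutive frames.

    `openness` is a list of per-frame jaw-open values (0 = closed, 1 = wide).
    The thresholds are illustrative guesses, not measured values.
    """
    pops = []
    for i in range(1, len(openness) - 1):
        before, here, after = openness[i - 1], openness[i], openness[i + 1]
        if before >= open_min and here <= closed_max and after >= open_min:
            pops.append(i)
    return pops


# Hypothetical jaw values from a too-literal pass: open, slam shut, open again.
literal_pass = [0.3, 0.8, 0.1, 0.8, 0.5]
print(find_pops(literal_pass))  # [2]
```

Any index that comes back marks a closed pose squeezed between two opens with no room to read, which is a good candidate for holding longer or dropping.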


But What About My I Mean "Sound" Shapes?
Oftentimes beginners will make a 'phoneme' that is an exact replication of one's face saying that single 'letter' in isolation. So we make E phonemes saying E by itself. And we model "K" phonemes based off our own face in a mirror saying "kuh". At first that seems logical enough. The problem is that when you say the "t" sound by itself ('tuh'), your face doesn't look at all like it does when you say something like "skate". And that "t" in 'skate' gives a face shape that is completely different from the "t" sound shapes in "petstore". And THAT "t" is very different from the "t" shape you make when you say "goatee".

Principle #2: Mouth Shapes for Sounds Must Be Animated In Context

By context I mean this:
The preceding sound shape affects the current sound shape. Likewise, the following sound shape is anticipated in the current sound shape.
So the shapes shown must all be in context with the shape/sound that precedes it and the one that follows it. When you get stuck on the idea of making all the "t" sounds in a soundtrack the same shape, regardless of the prior or following sound/shape context in the dialogue, you're setting yourself up for a very poppy mouth when animated. Remember Principle #1: animating speech is not animating letters. It's animating the *flow* of shapes that are needed to make the present sounds within what's being communicated.
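
One way to picture "in context" is as a weighted blend of a sound shape with its neighbours. The sketch below is only an illustration of that idea: the control names, the shape weights, and the 25% carry/anticipate amounts are assumptions, not values from any particular rig or from this paper.

```python
def in_context(prev_shape, shape, next_shape, carry=0.25, anticipate=0.25):
    """Blend a sound shape with the shapes before and after it.

    Each shape is a dict of control name -> weight (0..1). Missing controls
    count as 0. `carry` is how much of the previous shape lingers and
    `anticipate` is how much of the next shape leaks in early.
    """
    own = 1.0 - carry - anticipate
    controls = set(prev_shape) | set(shape) | set(next_shape)
    return {
        c: carry * prev_shape.get(c, 0.0)
        + own * shape.get(c, 0.0)
        + anticipate * next_shape.get(c, 0.0)
        for c in controls
    }


# Hypothetical shapes; control names and weights are made up for the example.
T = {"jaw_open": 0.1, "wide": 0.3}
AY = {"jaw_open": 0.5, "wide": 0.7}
OH = {"jaw_open": 0.5, "pucker": 0.6}
EE = {"jaw_open": 0.2, "wide": 0.9}
REST = {}

t_in_skate = in_context(AY, T, REST)   # "t" after "ay", before silence
t_in_goatee = in_context(OH, T, EE)    # "t" after "oh", before "ee"
print(t_in_skate)
print(t_in_goatee)
```

The same "t" target comes out as two different poses, and the second one picks up a bit of pucker the first never has. In practice you would make these choices by eye rather than by formula; the blend just shows why a single fixed "t" shape cannot fit every word.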


OK, Mr. Fancypants. So Just How Should I Animate Lipsync?
The better approach is to interpret speech, to grasp the essential elements of the communication as recorded in the sound track. To "squint your ears" and try and pick up the overall feel of the speech.
Let's take a look at art history.
For many years up until the late 19th century, the effort in Western art from the Renaissance onward was the meticulous and accurate recreation of reality. Realism was the goal, and literalism in interpreting a painting was the norm. Then a bunch of artists got an idea about capturing just the overall sense of an image. They became less interested in capturing every leaf on a tree, and began to focus on how the light and shadow and color hues projected that tree into another realm. This new realm of seeing was an interpretive realm where leaves didn't matter as much as form, color, tone and contrast. At first these guys were derided as lazy artists, too shiftless to bother with the details. But soon the world got hold of these new paintings and was amazed to see such life and beauty where before there were just leaves. The age of Impressionism was born, and we're all the better off for it.

So how does that apply to us and lipsync?

Here's how: Just as the impressionist painters got away from a literal realism in capturing a picture, we too need to get impressionistic when it comes to lipsync animation.

Principle #3: Interpret the Lipsync Animation Like an Impressionist

If in your animation you can just get the major impressions across you can let the little stuff slide if you want. Just like the impressionist would hint at a cluster of leaves with a single daub of his brush, you too should let words and sound shapes slur into the next word or sound shape. Mix the target facial weights together to show a flow. Get away from showing leaves and start showing contrast and form. Talking is more of a flowing thought than an alliterative function of letters.


Impressionism Applied To Real Live LipSync...
Let's look again at our example phrase- "you hafta get". A more impressionistic interpretation would be to emphasize the following major accents:

Ooo - aaFF - Eh


Go ahead and say that out loud. "Ooo" as in "scoop", "aaFF" as in "after" and "Eh" as in "pet".


Sounds a lot like "you hafta get", doesn't it?
Now go one further.
Grab a handheld mirror.
Now, comfortably (i.e., don't play-act or overemphasize it), just say "you hafta get".
Watch how your mouth looks as you say it again.
Now, say "oo-aaFF-eh" a few times.

See how very close the two are in how they look? You want another example of this same principle?

Say to your mirror "I love you".
Then say to it "Elephant Shoes".

You never knew that the connection between l'amour and pachydermal podiatry was this close!


The Devil is in the Details...
Let's take an even closer look at this from a lipsync animation point of view. For the phrase "you hafta get" there is one special pose along with two major open poses and two major closed poses.
The special pose is the pucker/ooo at the beginning of You.
The first major open is the "aa" at the beginning of Hafta.
The second major open pose is the "Eh" of Get.
Likewise, the first major closed pose is the FF of Hafta.
The second closed pose is the T in Get. (It's not a true closed pose, but it's close enough for us to define it as such because it is more closed than open.)
Anyhow, by choosing to do nothing more than hit these opens and closes you can get nearly all you need. (heck, the Muppets have gotten by on that for 30+ years!) These main target points are like the broad brushes in an impressionist painting. They define shape, contrast, form, direction. The details of texture come later with the specific choices you make on top of the broad brushed open and closed pose shapes and timings. The opens and closes are the foundation of your more specific choices.

Principle #4: Get the Opens and Closes Done Right and Build On Those

Even if all you ever do is properly hit the opens and closes and wide shapes of the mouth at the right time you are already more than 75% of the way to great lipsync. You can get a lot out of very little lipsync animation. And if you doubt it, look at animated properties with projected texture map mouths like "Veggietales", which have proven that this is indeed true.

Getting Specific...
Here's a breakdown of some specific choices...
You'll want to start by letting the "Yuh" of You flow into the more open "aa" at the beginning of Hafta. Skip the specific "ooo" at the end of You because it is not very strong. It's there, but it gets said while the mouth is transitioning into the beginning of hafta. Basically it slurs into the next word.

The H of Hafta is buried in the back of the throat, so the lips don't really need to show it. So skip showing a specific H target for it.

Picking up from the moderately strong "aa" of hafta, hit the F for two frames to let it read. It's the major closed point of the phrase, so that needs to line up and read clearly.

Then skip the ending "ah" of hafta altogether, as well as the G of Get. Both happen under the breath, they're slurred under the transition from FF to the Eh accent of Get.

Hit that last open pose of Eh.

Then end with an appropriately shaped nearly closed mouth to catch the idea of a T.

You've basically now animated Ooo-aaFF-Eht. And you know what? It's enough. And the best part is it flows, it feels natural, and it doesn't pop.
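
The breakdown above can be written down as a handful of keys. This is a sketch only: the frame numbers and jaw values are illustrative guesses for a roughly 25-frame phrase, not timings read off the actual soundtrack, and `jaw_curve` is a hypothetical helper that uses plain linear interpolation where a real curve editor would give you softer, less linear in-betweens.

```python
# Hypothetical keys for "you hafta get": (frame, sound shape, jaw openness).
KEYS = [
    (1, "Ooo", 0.15),   # pucker at the start of "you"
    (7, "aa", 0.70),    # first major open
    (11, "FF", 0.05),   # major closed pose, held...
    (12, "FF", 0.05),   # ...for two frames so it reads
    (18, "Eh", 0.60),   # second major open
    (23, "T", 0.15),    # nearly closed, just the idea of a T
]


def jaw_curve(keys, last_frame):
    """Linearly interpolate jaw openness for frames 1..last_frame."""
    curve = []
    for frame in range(1, last_frame + 1):
        if frame <= keys[0][0]:
            curve.append(keys[0][2])
        elif frame >= keys[-1][0]:
            curve.append(keys[-1][2])
        else:
            # Find the pair of keys surrounding this frame and blend them.
            for (f0, _, v0), (f1, _, v1) in zip(keys, keys[1:]):
                if f0 <= frame <= f1:
                    t = (frame - f0) / (f1 - f0)
                    curve.append(v0 + t * (v1 - v0))
                    break
    return curve


curve = jaw_curve(KEYS, 25)
biggest_step = max(abs(b - a) for a, b in zip(curve, curve[1:]))
print(len(curve), round(biggest_step, 3))
```

Six keys cover the whole phrase, and with these made-up values the jaw never moves more than about a sixth of its range in a single frame, which is the "it flows, it doesn't pop" quality in numeric form.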

There's Gotta be More. What about those T's and Stuff?
The short answer to this question is: don't sweat it unless you really need to. I haven't at all addressed the tongue in any of this. But if your character has a tongue, then you can get all the inner mouth sound shapes you need with that. The inner mouth sound shapes are:

G (hard)

So add your tongue work in here, keeping it as impressionistic as everything else, and you can handle the 'little stuff' quite easily. A good tip is to keep tongue movements very quick. Don't have the tongue take longer than 2 frames to get from one position to another, unless you have a specific reason. Otherwise it will look for all the world like your character is saying the "LL" sound. The word "bad" turns into "bald". "Good" becomes "gold". Keep the tongue light and quick, just like your wits.
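
A small sketch of the two-frame tongue guideline as a check you could run over your keys. The `(frame, position)` key format, the position names, and the `slow_tongue_moves` helper are assumptions for illustration.

```python
def slow_tongue_moves(tongue_keys, max_frames=2):
    """Return (start, end) frame pairs where the tongue changes position
    over more than `max_frames` frames.

    `tongue_keys` is a list of (frame, position) pairs in time order.
    Consecutive keys with the same position are holds and are ignored.
    """
    slow = []
    for (f0, p0), (f1, p1) in zip(tongue_keys, tongue_keys[1:]):
        if p0 != p1 and (f1 - f0) > max_frames:
            slow.append((f0, f1))
    return slow


# Hypothetical tongue keys: a quick flick up, a hold, then a lazy drift down.
keys = [(10, "down"), (12, "up"), (13, "up"), (18, "down")]
print(slow_tongue_moves(keys))  # [(13, 18)]
```

A flagged pair is not automatically wrong (you may have a specific reason), but it is where a "bad" is most likely to read as "bald".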

Miscellaneous Tips & Tricks & Principles...
1) Don't go from wide open to closed in one frame and vice versa. Definitely don't go from open to closed to open in 3 frames.
2) Don't hold a mouth shape static. An "Ah" shape should shift into a slightly different "Ah" as it's being held.
3) Keep M's and F's for 2 frames. If it's tight, steal from the previous sound.
4) Keep an eye on your targets and make sure they're not too linear in going from one sound shape to the next.
5) Hit the sound shape at least 2 frames before the sound is heard. Even if you're right on the nose, it will feel late when played at full speed. Humans see things faster than they hear them, so we pick up our cues from the shape before the sound.
6) Break up the mouth angles. Shift the mouth up and down, tilt it left or right, get some snarls in there. Show emotion as the character speaks. We can speak and smile, speak and frown, speak and yawn at the same time. Build rigs that allow you to keep that kind of life in your lipsync animation.
7) Upper teeth do not move. They're nailed to your skull.
8) Jaws rotate, not slide, in characters with clearly defined head/neck areas.
9) When building your sound shapes and facial controls, don't forget the cheeks and the nose! The cheeks move when we speak, as does our nose. The cheeks and nose are the great connectors in facial animation, crossing the bridge from mouth animation to eye and brow animation. By keeping your nose and cheeks in the action you tie together the entire face of the character, creating a far more believable character who can act.
10) Don't be afraid to go extreme. Avoid the Princess Fiona Final Fantasy Syndrome(tm). Keep the energy of the sound track in mind when you're doing the mouth shapes. Louder sounds with more energy should be shown with the mouth open wider, sound shapes more extreme. Watch TV announcers talk. Those faces are movin' baby!
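
Two of the timing tips above (3 and 5) are mechanical enough to sketch in code. The span and key formats, the helper names, and the sample frame numbers below are assumptions for illustration, not part of any real tool.

```python
HOLD_SOUNDS = {"M", "F"}


def steal_for_holds(spans, min_hold=2):
    """Tip 3: give short M/F spans their two frames by stealing from the
    previous sound.

    `spans` is a list of (sound, start_frame, length) tuples, contiguous in
    time. The previous sound always keeps at least one frame.
    """
    out = [list(s) for s in spans]
    for i in range(1, len(out)):
        sound, _, length = out[i]
        need = min_hold - length
        if sound in HOLD_SOUNDS and need > 0:
            spare = out[i - 1][2] - 1
            take = min(need, max(spare, 0))
            out[i - 1][2] -= take   # previous sound gets shorter
            out[i][1] -= take       # M/F starts earlier
            out[i][2] += take       # and lasts longer
    return [tuple(s) for s in out]


def lead_keys(keys, lead=2):
    """Tip 5: shift (frame, sound) keys earlier so the shape lands before
    the sound is heard. Frames are clamped at 0."""
    return [(max(0, frame - lead), sound) for frame, sound in keys]


# Hypothetical spans where the F only got one frame.
spans = [("aa", 7, 4), ("F", 11, 1), ("Eh", 12, 6)]
print(steal_for_holds(spans))            # [('aa', 7, 3), ('F', 10, 2), ('Eh', 12, 6)]
print(lead_keys([(7, "aa"), (11, "F")]))  # [(5, 'aa'), (9, 'F')]
```

The F borrows its extra frame from the "aa" before it rather than pushing later sounds back, so everything after the F stays in sync with the track.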

Before You Go...
I hope this has helped some. We've broken down one phrase for this paper and I'm sure it all makes perfect sense now- for that one phrase. :o)
Now the trick for you is to learn how to adapt this impressionist kind of thinking into other phrases, other animations, other characters. Just try to keep in mind my four "Principles" that I've stated. If you can keep those in mind then you're well on your way to animating lipsync in a convincing, flowing manner that will feel natural and have life. Last of all, the best thing I can suggest is that you keep practicing. My breakdown can get you going in the right direction, but experience is the best teacher.