Saving Private Ryan - This one is particularly moving. During a lull in the battle, the troops discuss the lyrics of Edith Piaf and reveal their understanding of emotions and memories of beautiful women. Their comments echo the lyrics translated from French, as the music echoes through the ruins.
La Haine - There is something about DJ Cut Killer’s blending of Edith Piaf with KRS-One’s “Sound Of The Police” that fits perfectly in a French ghetto. The shot is made especially wonderful by the slowly increasing echo of the music through the tenement buildings. What a great shot paired with an excellent blend of music.
Inception – The recently released movie Inception takes an interesting approach, using “Non Je Ne Regrette Rien” as source material for its sound design. Very clever stuff!
There are two big questions being addressed here. What information can be extracted from millions of hours of audio and how can that information be applied?
Background
Generally speaking, speech analysis aims to determine the following.
the topic(s) being discussed
the identities of the speaker(s)
the genders of the speakers
the emotional character of the speech
the amount and locations of speech versus non-speech (e.g. background noise or silence)
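To make the last item a little more concrete, here is a rough sketch of how the "where is there sound versus silence" part might be approached. This is only an illustration, not how Google does it: it uses a plain energy threshold, which separates loud frames from quiet ones but cannot tell speech from background noise. The function name, frame size and threshold are all made up for the example.

```python
def find_active_regions(samples, frame_size, threshold):
    """Return (start, end) sample indices for runs of frames whose
    mean energy is above the threshold.

    Hypothetical helper: a simple energy gate, not a real speech detector.
    """
    regions = []
    start = None
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        # Mean squared amplitude of this frame.
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = i  # a loud region begins here
        elif start is not None:
            regions.append((start, i))  # the loud region ended
            start = None
    if start is not None:
        # The audio ended while still inside a loud region.
        regions.append((start, len(samples)))
    return regions


# Four silent samples, eight loud ones, four silent ones.
audio = [0.0] * 4 + [0.5] * 8 + [0.0] * 4
regions = find_active_regions(audio, frame_size=4, threshold=0.01)
```

A real system would need something smarter than a volume gate to decide whether a loud region is actually speech, but the output has the same shape: a list of time ranges that can then be passed on for recognition.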
Speech is not the only data that can be extracted from uploaded video. Music has tempo, key, lyrics, timbre, instrumentation and much more. The search for sound effect reference material in web video is also currently limited by titles and keywords manually attached to videos.
Imagination
Imagine that you’re hearing impaired but would like to watch and understand political speeches posted to YouTube. With automated speech recognition generating closed captions, any spoken word video posted to YouTube would be accessible to you.
Imagine that there is a video in a language not familiar to you that you’d like to understand. Speech recognition combined with a translation service like BabelFish could help to bridge cultures worldwide. For people who are also blind, this could be combined with a speech synthesis feature for even greater accessibility.
Imagine that you’re a business that wants to track the emotions of people uploading videos mentioning your product.
Imagine that you’re a karaoke lover who has stage fright and wants to practice at home. Extracting lyrics from music videos and automatically adding them as closed captions would be a welcome feature to you. This might not be possible with “Louie Louie” but would be useful for everything else.
Imagine that you’re a DJ looking for the perfect track to mix. Being able to search for musical content with video according to key and tempo would be hugely valuable to you. This could be accomplished by combining MixMeister’s functionality with Google Audio. MixMeister extracts tempo and key from music libraries.
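Here is a minimal sketch of what that kind of search could look like once tempo and key have already been extracted. Everything in it is an assumption for illustration: the `Track` record, the `find_mixable` function and the 6% tempo tolerance are made up, and this is not MixMeister's or Google's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """Hypothetical record for a clip with already-extracted metadata."""
    title: str
    bpm: float
    key: str  # for example "A minor"


def find_mixable(tracks, current, max_tempo_shift=0.06):
    """Return tracks in the same key as `current` whose tempo is within
    `max_tempo_shift` (as a fraction of the current tempo), closest first."""
    matches = []
    for track in tracks:
        if track == current:
            continue  # skip the track that is already playing
        same_key = track.key == current.key
        tempo_shift = abs(track.bpm - current.bpm) / current.bpm
        if same_key and tempo_shift <= max_tempo_shift:
            matches.append(track)
    # Smallest tempo difference first, since that needs the least pitch shifting.
    return sorted(matches, key=lambda t: abs(t.bpm - current.bpm))


now_playing = Track("Now Playing", 120.0, "A minor")
library = [
    now_playing,
    Track("A", 124.0, "A minor"),
    Track("B", 121.0, "A minor"),
    Track("C", 122.0, "C major"),
    Track("D", 140.0, "A minor"),
]
candidates = find_mixable(library, now_playing)
```

Matching only on an identical key is the simplest rule. A DJ would probably also want to allow harmonically related keys, which this sketch leaves out.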
Imagine that you’re a musician looking to compose new music from sampled clips. Music search tools would be hugely useful to you as well. InBFlat.net is a website which presents videos composed in the same key. Kutiman is a musician who composes new songs from uploaded video clips. These are the sorts of projects that could be made using tempo and key detection in Google Audio. See a sample of Kutiman’s work below.
Imagine that you’re a part of the largest search engine company in the world and want to organize the world’s information and make it universally accessible and useful. What other features would you add?
Realization
With user generated content becoming the cornerstone of interactive media, tools and methods for parsing vast amounts of data will be essential. Speech recognition will be a huge part of this. Google is now working very hard at developing and refining their speech algorithms and is releasing multiple products to support this effort. Here are a few examples of this.
Google Audio Indexing – Search through the audio content in web video. Currently Google Audio Indexing is exclusive to the political channel of YouTube.
Google 411 – Google 411 is a voice activated search engine that can be dialed from any phone.
Google Mobile – Google Mobile has a built in speech recognition feature which is much faster than typing a search on a tiny touchscreen.
Google Voice – Google Voice is designed to consolidate phone numbers and also includes transcription services for voice mail.
As evidenced by a recent test by the New York Times, Google has a way to go before their algorithms are perfected. Nonetheless, they are on the right track and have shown a consistent interest in developing speech recognition.
BlogSonic is back! Because of recent vandalism to the blog I’ve had to rebuild it from scratch. Considering that I only had three posts at the time, it’s not so bad. I’ll repost the old articles shortly.
There are two soft synths that I’ve become a fan of recently. The first one is Magical 8bit Plug by YMCK. This one sounds amazingly close to the original Nintendo sound chip. Closer than anything else I’ve ever heard from software. It’s easy to use and is pretty lightweight on the processor load. It’s available as a VSTi and Audio Unit plug-in.
The second soft synth I’ve been enjoying is Chip32. It also has an 8-bit sound similar to the NES sound chip. One cool feature of Chip32 is that it lets the user draw the waveform and updates the sound in real time. Brad from New Grey Area turned me on to this one and I’ve been using it for a number of electro tracks I’ve been working on. It’s also available as a VSTi and Audio Unit.
About
AdamSonic is the personal blog of Adam Smith-Kipnis. It's a place for sharing ideas about music, sound, culture and technology.