From gesture to sound:

How AI converts sign language into speech

Think about your best friend. Do you remember your last good conversation? Perhaps you discussed a book, your favorite recipe, or international politics. Now imagine that friend, after decades of friendship, gradually losing their hearing and becoming dependent on sign language. Of course, this would have a huge impact on your conversations and make communication difficult. You would both have to learn sign language in order to communicate without text, which can take up to 1,320 hours to master. 

This may sound far-fetched, but according to the World Health Organization (WHO), an estimated 2.5 billion people will have some degree of hearing loss by 2050. This got us thinking: can we, as data scientists, help these people communicate more easily? How could we address this problem? Wouldn't it be great if people who rely on sign language could still easily communicate with the world around them? 

At Cmotions, we are eager to use AI to solve business and societal challenges, so we decided to use our knowledge of AI to support people who rely on sign language. Our goal was to develop a tool that can interpret (W)ASL sign language and convert it into spoken English. We chose gesture-to-speech rather than speech-to-gesture, because the former did not yet exist when we started. 

 

Our approach

Our goal, then, was to convert sign language to speech. Our approach consists of the three steps explained below.

 

Step 1 - From video to gloss: fine-tuning a video classification model

We used an existing, pre-trained video model called 'VideoMAE' as the basis of our model. To make it suitable for our purpose, we fine-tuned it on the open-source dataset from Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. This WLASL dataset is the largest video dataset for word-level American Sign Language (ASL) and contains 2,000 common "glosses" (the textual representations of signs) in ASL. 

The dataset contains videos of American Sign Language with associated "glosses". For example, a video showing the gesture for 'thank you' is labeled 'thank you'. This allowed us to adapt the existing video model so that it translates sign language into glosses. This turned out to work quite well: we achieved an accuracy of 98.2% and an F1 score of 98.3% after removing irrelevant elements (such as background color and exact hand position) and adding noise. The F1 score, which balances precision and recall, indicates how accurate the model is. For more technical details on why we added noise and why we had to split the videos into separate gestures at this point, please refer to this article. 
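As an illustration: VideoMAE-style models classify a fixed-length clip of frames, so each gesture video first has to be reduced to a small number of evenly spaced frames (VideoMAE typically uses 16) before it reaches the model. A minimal sketch of that sampling step, with a function name and defaults of our own choosing, not taken from our actual pipeline:

```python
def sample_frame_indices(num_frames: int, clip_len: int = 16) -> list[int]:
    """Pick `clip_len` evenly spaced frame indices from a video of
    `num_frames` frames, so every gesture clip has the same length."""
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")
    # Spread the indices uniformly from the first to the last frame.
    step = (num_frames - 1) / max(clip_len - 1, 1)
    return [round(i * step) for i in range(clip_len)]

# A 120-frame gesture video becomes a 16-frame clip for the classifier.
indices = sample_frame_indices(120)
```

For videos shorter than the clip length, some indices repeat, which simply duplicates frames rather than failing.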

After this first step, we can now "talk" to our friend by reading and interpreting sign language. But this is still far from an optimal form of communication. 

 

Step 2 - From words to sentences

The first translation from hand gestures to text has now been made, but this does not yet result in usable language (e.g., "you coffee like tea" instead of "Would you like coffee or tea?"). We need to improve the syntactic structure of the generated word order. Moreover, the message is not yet spoken. Thus, two additional steps are needed. 

To convert the individual words into understandable sentences, we use a Large Language Model (LLM) that was performing excellently at the time of writing: Qwen2.5. Simply put, Qwen is a generative AI model - an LLM, or 'decoder model' - that can understand and generate text as a human would. This model learned from huge amounts of text data during a pre-training phase. Because the model holds 32 billion parameters' worth of knowledge after training, we don't need to train it further on our own data; we just need to ask the right questions. This process is called prompt engineering, and it is different from training a model. 
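In practice, prompt engineering here boils down to wrapping the recognized glosses in an instruction the model can act on. A minimal sketch of such a prompt builder; the wording is illustrative and not our production prompt:

```python
def build_prompt(glosses: list[str]) -> str:
    """Turn a sequence of ASL glosses into an instruction for the LLM."""
    gloss_str = " ".join(glosses)
    return (
        "The following words are ASL glosses, in signing order: "
        f"'{gloss_str}'. Rewrite them as one natural, grammatical "
        "English sentence. Reply with the sentence only."
    )

# The gloss sequence from the article's example becomes a single request
# that a model such as Qwen2.5 can answer with a fluent sentence.
prompt = build_prompt(["you", "coffee", "like", "tea"])
```

The resulting string would then be sent to the model through whichever inference API is used; only the instruction design matters here.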

 

Step 3 - From sentence to speech

The final step is to generate an audio file based on the optimized sentences. We do this using a Python library from Google called 'gTTS' (Google Text-To-Speech). We enter the final sentences and ask the program to pronounce them aloud. This completes the translation chain from sign language to speech! 
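A minimal sketch of this last step; gTTS needs an internet connection, and the helper names and output file name are illustrative:

```python
def finalize(sentence: str) -> str:
    """Ensure the sentence ends with punctuation before synthesis."""
    sentence = sentence.strip()
    return sentence if sentence[-1] in ".!?" else sentence + "."

def speak(sentence: str, out_path: str = "output.mp3") -> str:
    """Render an English sentence to an MP3 file with Google Text-To-Speech."""
    from gtts import gTTS  # third-party: pip install gTTS; requires internet
    gTTS(text=finalize(sentence), lang="en").save(out_path)
    return out_path
```

Calling `speak("Would you like coffee or tea")` would write `output.mp3`, which can then be played back to complete the chain.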

Want to see the whole process - from a hand gesture in a video to fluently spoken text - in action? In this article we show how it works. 

 

Conclusion

In summary, we have developed a system that can convert video files with sign language into an audio file with spoken language. Of course, improvements are still needed. For example, the need to split videos into separate files makes the model less user-friendly. Ideally, a live translation from sign language to speech should be possible when filming someone, without having to record and upload a video first. Nevertheless, we have shown that translating sign language to speech is feasible and can contribute greatly to making sign language understandable to a wider audience. 

In addition to facilitating conversations, such a tool could also be used for educational purposes, such as in learning sign language. Other possible uses include customer service or communication with government agencies and insurance companies. These are just a few examples. 

 

Want to learn more about how we trained the model and see a full example from video to audio?
