Using AI to translate sign language to speech

Think of your best friend. Now think back to the last great conversation you had together. Perhaps you discussed a book, your favourite recipe or international politics? Now imagine that, after decades of friendship, your friend gradually loses their hearing and eventually depends on sign language. This would significantly impact the way you converse: to keep communicating without resorting to text, both of you would need to learn sign language, which could take up to 1,320 hours.

This might seem far-fetched, but according to the WHO, nearly 2.5 billion people are projected to have some degree of hearing loss by 2050. That got us thinking. Could we, as data scientists, help these people communicate more easily? How would we approach such a challenge? Wouldn't it be great if those affected could still easily communicate with the world around them?

At Cmotions, we enjoy leveraging AI to solve business and societal challenges. So we decided to use our knowledge of AI to help people who depend on sign language for their communication. Our goal was to develop a tool that can interpret (W)ASL, word-level American Sign Language, and convert it into spoken English. We chose sign-to-speech rather than speech-to-sign, since the former had not yet been developed at the start of our project.

Our approach

As explained, our goal was to translate hand sign language into speech. Our approach consists of three steps, which we explain below in more detail, albeit in simplified form.

 

Step 1 – From video to gloss: finetuning a video classification model

We used an existing pretrained video classification model, VideoMAE, as our base. To adapt this video model to our purpose, we fine-tuned it on the open-source dataset from “Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison”. This WLASL dataset is the largest video dataset for Word-Level American Sign Language (ASL) recognition and features 2,000 common ASL words. The dataset contains videos of American Sign Language together with their corresponding ‘glosses’ (the text representation of a sign). For example, a video in which the gloss ‘thank you’ is signed would have ‘thank you’ as its label. This enabled us to tweak the existing video model so that it translates sign language to glosses. It worked rather well: 98.2% accuracy and a 98.3% F1 score after we both removed irrelevant elements (like background colour and the exact location of the hands in the video) and added noise. For more technical details, for example on why you would want to add noise and why we had to cut the videos into separate gestures at this point, see this article.
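To give an idea of what using such a fine-tuned model looks like in Python, here is a minimal inference sketch with the Hugging Face transformers library. The checkpoint name is a hypothetical placeholder for a model fine-tuned on the WLASL glosses, and the random frames stand in for a real 16-frame clip of a single gesture.

```python
import numpy as np
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Hypothetical checkpoint fine-tuned on the 2,000 WLASL glosses (placeholder name)
checkpoint = "our-org/videomae-wlasl-gloss"
processor = VideoMAEImageProcessor.from_pretrained(checkpoint)
model = VideoMAEForVideoClassification.from_pretrained(checkpoint)

# VideoMAE expects a fixed-length clip (16 frames by default); random frames stand in for one gesture
frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

inputs = processor(frames, return_tensors="pt")   # resize and normalise the frames
logits = model(**inputs).logits                   # one score per gloss
predicted_gloss = model.config.id2label[int(logits.argmax(-1))]
print(predicted_gloss)                            # e.g. "thank you"
```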

Our model now gives us the first step towards ‘talking’ with our friend more easily, as it can read and interpret sign language. However, the communication is far from optimal yet.

 

Step 2 – From words to sentences

We now have a first translation of the hand sign videos to text, but this is still not really usable language (e.g. it will output something like “you coffee like tea” instead of “would you like coffee or tea?”). In short, we must deal with the syntactic structure of the sequence of words we have generated from the sign videos so far. Moreover, the message from our friend is translated from sign into text, but not yet spoken out loud. So we need to take two more steps.

First off, to turn these words into sentences, we use another model that creates understandable sentences from the sequence of words representing the signs. For this, we use a popular large language model (LLM) that showed great performance at the time of writing: Qwen2.5. Put simply, Qwen is a generative AI model, also known as an LLM or ‘decoder model’, a type of AI that can understand and generate text much like a human would. It has learned from a huge amount of textual example data in what is called the pretraining phase. Since the model contains so much knowledge after training, stored in its 32 billion parameters, we don’t need to train it on our own data; we just have to ask it the right questions. We’ve given the model some basic instructions on how we would like the output to be generated based on the input we give it. This is what we call prompt engineering, which is not the same as training a model.
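As an illustration of this prompt engineering step, here is a minimal sketch using the transformers library. The system prompt is an illustrative instruction, not the exact prompt we used, and the Qwen2.5 32B Instruct checkpoint is one plausible choice for the model described above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# The glosses predicted by the video model in step 1
glosses = ["you", "coffee", "like", "tea"]

messages = [
    {"role": "system",
     "content": "Rewrite the following ASL glosses as one fluent, natural English sentence."},
    {"role": "user", "content": " ".join(glosses)},
]

# Build the chat prompt in the format Qwen expects and generate a sentence
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=50)
sentence = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(sentence)  # e.g. "Would you like coffee or tea?"
```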

 

Step 3 – From sentence to speech

The last step creates an audio file from the now readable sentences. This is done with a Python package called gTTS (Google Text-to-Speech), which uses Google’s text-to-speech service. We feed it the finalised sentences and ask it to read them out loud. This completes the last step in our pipeline from sign language to speech!
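This step takes only a few lines of code. The sketch below shows basic gTTS usage; the sentence and file name are just examples.

```python
from gtts import gTTS

# The sentence produced by the language model in step 2
sentence = "Would you like coffee or tea?"

# Convert the text to speech and write it to an mp3 file
tts = gTTS(text=sentence, lang="en")
tts.save("translation.mp3")
```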

Do you want to see the whole process, from hand sign video to fluent spoken audio, in action? In this article we show you how it works.

 

Rounding up

In conclusion, we were able to create a pipeline that allows you to upload a set of videos with hand signs and returns an audio file with the translation in spoken language. Of course, there are still improvements to make. For example, having to split the video into separate files, as is currently needed, is not ideal and makes the model less user-friendly when conversing with your friend. Secondly, in an ideal world you could do a live translation while filming someone, instead of recording and uploading a video. Nevertheless, we have shown that translating sign language to speech is quite feasible and can help greatly in making sign language understandable to a wider audience.

Besides helping in conversation, a tool like this one could also be used for educational purposes, helping people learn sign language. Other use cases could be customer support or interacting with governments, insurance companies and similar organisations. These, of course, are just some examples.

 

Hungry for more detail on the model training we did and the full example from video to audio?

Fine-Tuning a Video Classification Model for Hand Sign Recognition in Python

Bridging the Gap: How AI Translates Sign Language into Speech
