{"id":10091,"date":"2025-03-03T11:56:53","date_gmt":"2025-03-03T10:56:53","guid":{"rendered":"https:\/\/cmotions.com\/?p=10091"},"modified":"2025-03-03T11:57:45","modified_gmt":"2025-03-03T10:57:45","slug":"van-gebaar-naar-geluid-hoe-ai-gebarentaal-omzet-in-spraak","status":"publish","type":"post","link":"https:\/\/cmotions.com\/en\/from-gesture-to-sound-how-ai-converts-sign-language-into-speech\/","title":{"rendered":"From gesture to sound: how AI turns sign language into speech"},"content":{"rendered":"<p>[et_pb_section fb_built=\"1\" admin_label=\"section\" _builder_version=\"4.23\" background_color=\"#000000\" custom_padding=\"0px||||false|false\" global_colors_info=\"{}\"][et_pb_row _builder_version=\"4.23\" _module_preset=\"default\" width=\"100%\" global_colors_info=\"{}\"][et_pb_column type=\"4_4\" _builder_version=\"4.23\" _module_preset=\"default\" global_colors_info=\"{}\"][et_pb_text _builder_version=\"4.25.2\" _module_preset=\"default\" header_font_size=\"37px\" background_layout=\"dark\" header_font_size_tablet=\"30px\" header_font_size_phone=\"25px\" header_font_size_last_edited=\"on|phone\" global_colors_info=\"{}\"]<\/p>\n<h1><strong>From gesture to sound: <\/strong><\/h1>\n<h1><strong>How AI converts sign language into speech<\/strong><\/h1>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][et_pb_row _builder_version=\"4.23\" _module_preset=\"default\" width=\"100%\" custom_padding=\"2px|||||\" global_colors_info=\"{}\"][et_pb_column type=\"4_4\" _builder_version=\"4.23\" _module_preset=\"default\" global_colors_info=\"{}\"][et_pb_text _builder_version=\"4.25.2\" _module_preset=\"default\" link_text_color=\"#49b69c\" header_text_color=\"#47B2A8\" header_font_size=\"32px\" header_3_text_color=\"#FFFFFF\" header_3_font_size=\"25px\" min_height=\"363.2px\" custom_padding=\"||0px|||\" header_font_size_tablet=\"28px\" header_font_size_phone=\"18px\" header_font_size_last_edited=\"on|phone\" 
header_3_font_size_tablet=\"22px\" header_3_font_size_phone=\"22px\" header_3_font_size_last_edited=\"on|phone\" global_colors_info=\"{}\"]<\/p>\n<p><span data-contrast=\"auto\">Think about your best friend. Do you remember your last good conversation? Perhaps you discussed a book, your favorite recipe or international politics. Now imagine that friend, after decades of friendship, gradually losing his hearing and becoming dependent on sign language. Of course, this would have a huge impact on your conversations and make your communication difficult. You would both have to learn sign language in order to communicate with each other without text, which can take up to <\/span><a href=\"https:\/\/www.handspeak.com\/learn\/205\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">1320 hours<\/span><\/a><span data-contrast=\"auto\"> to master.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">This may sound far-fetched, but according to the <\/span><a href=\"https:\/\/www.who.int\/news-room\/fact-sheets\/detail\/deafness-and-hearing-loss\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">World Health Organization (WHO)<\/span><\/a><span data-contrast=\"auto\">\u00a0an estimated 2.5 billion people will have some degree of hearing loss by 2050. This got us thinking: can we, as data scientists, help these people communicate more easily? How could we address this problem? Wouldn't it be great if people who rely on sign language could still easily communicate with the world around them?<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">At Cmotions, we are eager to use <a href=\"https:\/\/cmotions.com\/en\/consultancy\/data-science\/\" target=\"_blank\" rel=\"noopener\">AI<\/a> to solve business and societal challenges. We decided to use our knowledge of AI to support people who rely on sign language. 
Our goal was to develop a tool that can interpret word-level American Sign Language (ASL) and convert it into spoken English. We chose gesture-to-speech instead of speech-to-gesture, because the former did not yet exist when we started.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p>[\/et_pb_text][et_pb_text quote_border_color=\"#49b69c\" _builder_version=\"4.25.2\" _module_preset=\"default\" link_text_color=\"#49b69c\" header_text_color=\"#47B2A8\" header_font_size=\"32px\" header_2_font_size=\"35px\" header_3_text_color=\"#FFFFFF\" header_3_font_size=\"25px\" header_font_size_tablet=\"28px\" header_font_size_phone=\"18px\" header_font_size_last_edited=\"on|phone\" header_3_font_size_tablet=\"22px\" header_3_font_size_phone=\"22px\" header_3_font_size_last_edited=\"on|phone\" global_colors_info=\"{}\"]<\/p>\n<h2>Our approach<\/h2>\n<p><span class=\"NormalTextRun SCXW45690661 BCX8\">So, our goal was to convert sign language into speech. Our approach consisted of three steps, which we explain below.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3>Step 1 - From video to gloss: fine-tuning a video classification model<\/h3>\n<p><span data-contrast=\"auto\">We used an existing, pre-trained video model called '<\/span><a href=\"https:\/\/github.com\/MCG-NJU\/VideoMAE?tab=readme-ov-file\" target=\"_blank\" rel=\"noopener\"><span>VideoMAE<\/span><\/a><span data-contrast=\"auto\">' as the basis of our model. 
To make this model suitable for our purpose, we fine-tuned it by training it on the open-source <\/span><a href=\"https:\/\/dxli94.github.io\/WLASL\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">dataset<\/span><\/a> <i><span data-contrast=\"auto\">Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison<\/span><\/i><span data-contrast=\"auto\">. This WLASL dataset is the largest video dataset for word-level American Sign Language (ASL) and contains 2,000 common \"glosses\" (the textual representations of signs) in ASL.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">The dataset contains videos of American Sign Language with associated \"glosses\". For example, a video showing the gesture for 'thank you' is labeled 'thank you'. This allowed us to adapt the existing video model so that it translates sign language into 'glosses'. This turned out to work quite well: we achieved an accuracy of 98.2% and an F1 score of 98.3% after removing irrelevant elements (such as background color and exact hand position) and adding noise. The F1 score combines the model's precision and recall into a single measure of accuracy. For more technical details on why we added noise and why we had to split the videos into separate gestures at this point, please refer to this <\/span><a href=\"https:\/\/theanalyticslab.nl\/fine-tuning-a-video-classification-model-for-hand-sign-recognition-in-python\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">article<\/span><\/a><span data-contrast=\"auto\">.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">After this first step, we can now \"talk\" to our friend by reading and interpreting sign language. 
But this is still far from an optimal form of communication.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><strong>Step 2 - From words to sentences<\/strong><\/h3>\n<p><span data-contrast=\"auto\">The first translation from hand gestures to text has now been made, but this does not yet result in usable language (e.g., \"you coffee like tea\" instead of \"Would you like coffee or tea?\"). We need to improve the syntactic structure of the generated word order. Moreover, the message is not yet spoken. Thus, two additional steps are needed.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">To convert the individual words into understandable sentences, we use a Large Language Model (LLM) that performed excellently at the time of writing: <\/span><a href=\"https:\/\/huggingface.co\/Qwen\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Qwen2.5<\/span><\/a><span data-contrast=\"auto\">. Simply put, Qwen is a generative AI model - also called an LLM or \"decoder model\" - that can understand and generate text as a human would. This model learned from huge amounts of text data during a pre-training phase. Because the model has 32 billion parameters full of knowledge after training, we don't need to train it further with our own data; we just need to ask the right questions. We call this process <\/span><i><span data-contrast=\"auto\">prompt engineering<\/span><\/i><span data-contrast=\"auto\">, which is different from training a model.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3><strong>Step 3 - From sentence to speech<\/strong><\/h3>\n<p><span data-contrast=\"auto\">The final step is to generate an audio file based on the optimized sentences. 
We do this using the Python library '<\/span><a href=\"https:\/\/pypi.org\/project\/gTTS\/\" target=\"_blank\" rel=\"noopener\"><span>gTTS<\/span><\/a><span data-contrast=\"auto\">' (Google Text-To-Speech), which uses Google's text-to-speech service. We feed it the final sentences and ask it to pronounce them aloud. This completes the translation chain from sign language to speech!<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Want to see the whole process - from a hand gesture in a video to fluently spoken text - in action? This <\/span><a href=\"https:\/\/theanalyticslab.nl\/bridging-the-gap-how-ai-translates-sign-language-into-speech\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">article<\/span><\/a><span data-contrast=\"auto\"> shows how it works.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2><strong>Conclusion<\/strong><\/h2>\n<p><span data-contrast=\"auto\">In summary, we have developed a system that can convert video files with sign language into an audio file with spoken language. Of course, improvements are still needed. For example, the need to split videos into separate files makes the model less user-friendly. Ideally, a live translation from sign language to speech should be possible when filming someone, without having to record and upload a video first. Nevertheless, we have shown that translating sign language to speech is feasible and can contribute greatly to making sign language understandable to a wider audience.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">In addition to facilitating conversations, such a tool could also be used for educational purposes, such as helping people learn sign language. Other possible uses include customer service or communication with government agencies and insurance companies. 
These are just a few examples.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<h3>Want to learn more about how we trained the model and see a full example from video to audio?<\/h3>\n<ul>\n<li><span><a href=\"https:\/\/theanalyticslab.nl\/fine-tuning-a-video-classification-model-for-hand-sign-recognition-in-python\/\" target=\"_blank\" rel=\"noopener\">Fine-tuning a video classification model for hand sign recognition in Python<\/a><\/span><\/li>\n<li><span><a href=\"https:\/\/theanalyticslab.nl\/bridging-the-gap-how-ai-translates-sign-language-into-speech\/\" target=\"_blank\" rel=\"noopener\">Bridging the gap: How AI translates sign language into speech<\/a><\/span><\/li>\n<\/ul>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section]<\/p>","protected":false},"excerpt":{"rendered":"<p>From gesture to sound: How AI converts sign language into speech. Think of your best friend. Do you remember your last good conversation? Perhaps you discussed a book, your favorite recipe or international politics. Now imagine that friend, after decades of friendship, gradually losing his hearing and becoming dependent on sign language. 
[...]<\/p>","protected":false},"author":2,"featured_media":10094,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"on","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"categories":[51,55],"tags":[],"class_list":["post-10091","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artikel","category-data-science-ai-nl"],"_links":{"self":[{"href":"https:\/\/cmotions.com\/en\/wp-json\/wp\/v2\/posts\/10091","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cmotions.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cmotions.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cmotions.com\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/cmotions.com\/en\/wp-json\/wp\/v2\/comments?post=10091"}],"version-history":[{"count":8,"href":"https:\/\/cmotions.com\/en\/wp-json\/wp\/v2\/posts\/10091\/revisions"}],"predecessor-version":[{"id":10134,"href":"https:\/\/cmotions.com\/en\/wp-json\/wp\/v2\/posts\/10091\/revisions\/10134"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cmotions.com\/en\/wp-json\/wp\/v2\/media\/10094"}],"wp:attachment":[{"href":"https:\/\/cmotions.com\/en\/wp-json\/wp\/v2\/media?parent=10091"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cmotions.com\/en\/wp-json\/wp\/v2\/categories?post=10091"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cmotions.com\/en\/wp-json\/wp\/v2\/tags?post=10091"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}