Sign Language

Publication type: Article Summary
Original title: Empowering Deaf-Hearing Communication: Exploring Synergies between Predictive and Generative AI-Based Strategies towards (Portuguese) Sign Language Interpretation
Article publication date: October 2023
Source: Journal of Imaging
Authors: Telmo Adão, João Oliveira, Somayeh Shahrabadi, Hugo Jesus, Marco Fernandes, Ângelo Costa, Vânia Ferreira, Martinho Gonçalves, Miguel López, Emanuel Peres & Luís Magalhães

What are the goal, target audience, and areas of digital health addressed?
     The objective of the study is to develop and validate a system capable of transforming Portuguese Sign Language (LGP) gestures into coherent sentences through predictive and generative Artificial Intelligence (AI), thus promoting effective communication between deaf and hearing people. The target audience includes the deaf community, hearing professionals who interact with them in various contexts, LGP interpreters, caregivers, and AI researchers. In the field of digital health, it falls under the areas of digital inclusion and accessibility, patient safety, and AI-assisted communication in clinical and healthcare settings.

What is the context?
     Hearing is one of the primary human senses: it allows us to detect vibrations through the ear and, from them, understand the environment, communicate effectively, interact socially, and react to dangerous situations. An estimated 466 million people worldwide experience hearing loss, which affects communication and cognitive and social development. For many deaf people, the main barrier lies in access to healthcare: the absence of interpreters and the lack of sign language training among health professionals hinder both accurate diagnosis and understanding of treatment, which is reflected in worse health indicators such as obesity, hypertension, and depression.

     Sign language is a visual-spatial language based on conventional symbols that combine hand movements, facial expressions, and body postures. Like other languages, it has its own vocabulary, syntax, and grammatical rules (verbs, for example, remain in the infinitive form), distinguishing it from mime, which is based on improvised gestures.

     Cultural diversity is reflected in the existence of a different sign language in each country, which may even vary between regions. In Portugal, LGP emerged in 1823 at Casa Pia de Lisboa and is widely used by the deaf community, although it remains little known among the hearing population, which hinders social integration.

What are the current approaches?
     To improve communication between the deaf community and the hearing population, various technologies have been explored. Some rely on intrusive devices, such as sensor gloves that record hand and finger movements with high accuracy. Others use RGB-D cameras, which analyze 3D motion effectively but are less accessible than the cameras embedded in laptops and smartphones.

     At the same time, AI methods based on frame-by-frame video analysis have been tested, achieving around 86% accuracy in sign recognition; however, they focused on static signs and ignored dynamic gestures in continuous motion. More recently, methods based on anatomical landmarks, supported by extensive datasets and spatio-temporal neural networks, have emerged that can process continuous sign language videos and generate text transcriptions. This advance is reinforced by Large Language Models (LLMs), such as ChatGPT, which transform incomplete transcriptions into correct, consistent sentences, paving the way for hybrid solutions that combine visual recognition with natural language generation and bring the technology closer to human communication.

     Despite this progress, development remains uneven across sign languages. Extensive public databases exist for American, Indian, or Brazilian Sign Language, but resources for LGP are still scarce. Moreover, initiatives that expose sign language recognition as an accessible digital service are rare, which restricts the development of open and interoperable platforms.

What does the innovation consist of? How is the impact of this study assessed?
     The innovation of this study lies in the creation of an LGP interpretation service that captures the user’s gestures on video with an ordinary device camera, recognizes those gestures by converting them into words with predictive AI, and finally arranges the words into coherent sentences with generative AI, presenting the result as written text. Its development followed a four-stage pipeline: (1) data collection, (2) dataset construction, (3) application of AI models, and (4) frontend development.
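     To make the four stages concrete, here is a minimal, hypothetical Python skeleton of the overall flow; every function name and return value is an illustrative placeholder, not the authors’ code.

```python
from typing import List

def extract_landmarks(frames: List) -> List[List[float]]:
    """Stage 2 (stub): turn raw video frames into per-frame landmark vectors."""
    return [[0.0] * 176 for _ in frames]  # 88 landmarks x (x, y) coordinates

def classify_gesture(landmark_seq: List[List[float]]) -> str:
    """Stage 3a (stub): the predictive model maps a landmark sequence to one of 50 terms."""
    return "water"

def compose_sentence(terms: List[str]) -> str:
    """Stage 3b (stub): the generative model turns bare terms into a fluent sentence."""
    return "I would like some water, please."

def interpret_lgp(frames: List) -> str:
    """End-to-end flow consumed by the frontend (stage 4)."""
    return compose_sentence([classify_gesture(extract_landmarks(frames))])

print(interpret_lgp(frames=[None] * 30))  # -> "I would like some water, please."
```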

     For data collection, 33 individuals each recorded 50 videos, corresponding to 50 terms relevant to commerce, tourism, and health. To increase recording diversity, both LGP experts — interpreters, teachers, and deaf individuals — and people with no prior experience were included. The latter watched, analyzed, and imitated gestures performed by the experts, also using the Infopédia LGP dictionary as a reference. Of the 33 videos obtained for each term, 28 were used to train the AI models and the remaining 5 to test them.

     During dataset construction, the videos were transformed into AI-compatible data. Video-level data augmentation strategies were applied, such as horizontal flips to generate left- or right-handed variations and shear-based distortions to simulate different camera angles. In each frame, 88 landmarks of the face, hands, arms, and torso were extracted using the MediaPipe library. Additional augmentation techniques then generated spatial variations — such as random radial shifts, which slightly displace a gesture (for example, to the right) — and temporal variations — such as spline-based interpolation, which creates intermediate poses between two existing frames. The data were subsequently normalized to standardize dimensions and reduce background interference.
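     The summary names the MediaPipe library but not the exact configuration or the identity of the 88 points, so the following is a plausible sketch rather than the authors’ implementation: it uses MediaPipe Holistic to collect pose and hand coordinates (75 points instead of the paper’s 88) and shows a simplified horizontal-flip augmentation.

```python
import cv2
import mediapipe as mp
import numpy as np

def landmarks_or_zeros(lms, n_points):
    """Return (n_points, 2) normalized x/y coordinates, or zeros if nothing was detected."""
    if lms is None:
        return np.zeros((n_points, 2))
    return np.array([[p.x, p.y] for p in lms.landmark])

def extract_landmark_sequence(video_path):
    """Produce one flattened landmark vector per frame: pose (33) + two hands (21 each)."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.holistic.Holistic() as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            pts = np.concatenate([
                landmarks_or_zeros(results.pose_landmarks, 33),
                landmarks_or_zeros(results.left_hand_landmarks, 21),
                landmarks_or_zeros(results.right_hand_landmarks, 21),
            ])
            frames.append(pts.flatten())
    cap.release()
    return np.asarray(frames)  # shape: (num_frames, 150)

def horizontal_flip(sequence):
    """Mirror normalized x-coordinates (x -> 1 - x) to synthesize the opposite-handed sign.
    A full implementation would also swap the left- and right-hand channels."""
    seq = sequence.reshape(len(sequence), -1, 2).copy()
    seq[:, :, 0] = 1.0 - seq[:, :, 0]
    return seq.reshape(len(sequence), -1)
```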

     First, AI models based on long short-term memory (LSTM) networks were used, capable of identifying spatio-temporal gesture patterns and classifying them into one of the 50 defined terms. Two architectures were tested: one consisting of three stacked LSTM layers (SimpleLSTM), which processes frame sequences hierarchically, and a hybrid (ConvLSTM), which combines a 1D convolutional layer that captures local patterns, such as finger movements, with an LSTM layer that interprets the overall gesture dynamics. In both cases, the final layer assigned probabilities to each gesture and selected the most likely term. Each set of recognized terms was then sent to the Large Language Model ChatGPT-3.5, which generated the corresponding sentence, guided by dynamic rules (e.g., punctuation) and static rules (e.g., repetition elimination) to ensure consistency.
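     A compact Keras sketch of the two classifier variants follows; the frame count, feature dimension, and layer sizes are assumptions, as the summary does not report them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, NUM_FEATURES, NUM_TERMS = 30, 176, 50  # assumed shapes: 88 landmarks x (x, y)

def build_simple_lstm() -> tf.keras.Model:
    """Three stacked LSTM layers, mirroring the SimpleLSTM variant."""
    return models.Sequential([
        layers.Input(shape=(NUM_FRAMES, NUM_FEATURES)),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),
        layers.Dense(NUM_TERMS, activation="softmax"),  # probability for each of the 50 terms
    ])

def build_conv_lstm() -> tf.keras.Model:
    """A 1D convolution for local patterns (e.g., finger motion) feeding an LSTM."""
    return models.Sequential([
        layers.Input(shape=(NUM_FRAMES, NUM_FEATURES)),
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.LSTM(128),
        layers.Dense(NUM_TERMS, activation="softmax"),
    ])

model = build_simple_lstm()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

     For the generative step, the recognized terms could be handed to the model through the OpenAI API roughly as below; the authors’ actual prompt and rule set are not given in the summary, so this wording is purely illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

terms = ["wine", "drink", "can"]  # hypothetical LSTM output for one gesture sequence
prompt = (
    "Rewrite these sign language gloss terms as one grammatical English sentence. "
    "Add punctuation and drop repeated words: " + ", ".join(terms)
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # e.g., "Can I drink wine?"
```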

     The user interface, developed in JavaScript, started recording the landmark points and storing them in temporary memory, using a buffer-based interaction technique, whenever the user’s hands were detected. Users received continuous visual feedback through a progress bar indicating when data capture began and how much time remained to complete the gesture, helping them synchronize with the system. When the buffer was full, the data was sent to the LSTM model and the buffer automatically reset.
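     Although the frontend was written in JavaScript, the buffer-based interaction is easy to illustrate language-agnostically. This Python sketch, in which the buffer size and interface are our assumptions, captures the described behavior.

```python
from collections import deque

BUFFER_SIZE = 30  # frames per gesture window; the actual value is not reported

class GestureBuffer:
    """Collect landmark frames while hands are visible; when the buffer fills,
    hand the whole window to the classifier and reset automatically."""

    def __init__(self, classify):
        self.frames = deque(maxlen=BUFFER_SIZE)
        self.classify = classify  # e.g., a wrapper around the trained LSTM model

    def progress(self) -> float:
        """Fill level, driving the on-screen progress bar (0.0 to 1.0)."""
        return len(self.frames) / BUFFER_SIZE

    def push(self, landmarks, hands_detected: bool):
        """Add one frame; return a recognized term once per full window, else None."""
        if not hands_detected:
            return None
        self.frames.append(landmarks)
        if len(self.frames) == BUFFER_SIZE:
            term = self.classify(list(self.frames))
            self.frames.clear()  # automatic reset after inference
            return term
        return None

buf = GestureBuffer(classify=lambda window: "hello")  # stub classifier
for _ in range(BUFFER_SIZE):
    result = buf.push(landmarks=[0.0] * 176, hands_detected=True)
print(result)  # -> "hello"
```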

     The impact of the system was evaluated with several metrics: the success rate (%) in detecting and extracting landmark points, the term-inference accuracy across the different data augmentation techniques and LSTM architectures, the semantic similarity between reference sentences and those generated by ChatGPT, and the efficiency of ChatGPT-3.5. In addition, 12 participants without prior LGP experience independently performed a new set of gestures and rated the intuitiveness of the buffer interaction technique on a Likert scale; the same protocol was also applied to one deaf participant from the Associação Portuguesa de Surdos.
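     The summary does not state how semantic similarity was computed; one common approach is the cosine similarity of sentence embeddings, sketched here with the sentence-transformers library (the model choice is ours, not the paper’s).

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
reference = "How much does it cost?"
generated = "How much does this cost?"
embeddings = model.encode([reference, generated], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # 1.0 = identical meaning
print(f"semantic similarity: {similarity:.2f}")
```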

What are the main results? What is the future of this approach?
     The extraction of anatomical landmarks reached an average success rate of 74.6%, performing worse on signs like “bread” or “How much does it cost?” and better on signs like “chicken” or “weekend”. Data augmentation techniques proved crucial for model performance. With the SimpleLSTM architecture, horizontal flips alone achieved 80% accuracy, while combining flips and frame distortions reached 87%. Adding spatial variations increased accuracy to 94.8%, whereas adding temporal variations instead yielded 92%. Combining all methods simultaneously resulted in 93.8% accuracy. For the ConvLSTM architecture, only the most promising conditions identified with SimpleLSTM were tested, avoiding unnecessary training: accuracy was 95.6% with flips, distortions, and spatial variations, and 94.4% when all methods were combined. Despite this slight edge in accuracy, ConvLSTM training was, on average, 17.5 times slower than SimpleLSTM.

     ChatGPT-3.5 achieved an average semantic correlation of 81% between generated and reference sentences, with perfect similarity in cases like “How much does it cost?” or “Can I drink wine?” and lower values for more complex expressions. In terms of efficiency, sentence generation was fast when requested by the user (under one second) but limited when configured for automatic real-time updates.

     When the 12 participants recorded gestures, only one buffer-synchronization error occurred, along with 14 recognition errors that can be explained by the participants’ lack of experience in LGP. Most participants gave the intuitiveness of the progress bar the maximum score, an assessment also confirmed by the deaf participant.

     The future of this approach involves expanding LGP datasets, increasing the number and diversity of terms to strengthen model generalization. It also requires exploring new data augmentation strategies and more advanced neural network architectures. In parallel, refining rule-based constraints on language models will be key to generating more natural, fluent, and contextually appropriate sentences. In healthcare, this technology could significantly improve the experience of deaf patients, ensuring greater understanding of medical guidelines and promoting equity in access to care.
