
Engineering

Artificial Intelligence-Based System for Detecting Attention Levels in Students

Published: December 15, 2023 doi: 10.3791/65931

Abstract

The attention level of students in a classroom can be improved through the use of Artificial Intelligence (AI) techniques: by automatically identifying the attention level, teachers can employ strategies to regain students' focus. This identification can be achieved through various sources of information.

One source is to analyze the emotions reflected on students' faces. AI can detect emotions such as neutral, disgust, surprise, sadness, fear, happiness, and anger. Additionally, the direction of the students' gaze can also indicate their level of attention. Another source is to observe the students' body posture. By using cameras and deep learning techniques, posture can be analyzed to determine the level of attention. For example, students who are slouching or resting their heads on their desks may have a lower level of attention. Smartwatches distributed to the students can provide biometric and other data, including heart rate and inertial measurements, which can also be used as indicators of attention. By combining these sources of information, an AI system can be trained to identify the level of attention in the classroom. However, integrating the different types of data poses a challenge that requires creating a labeled dataset; expert input and existing studies are consulted for accurate labeling. In this paper, we propose the integration of such measurements, the creation of a dataset, and a potential attention classifier. To provide feedback to the teacher, we explore various methods, such as smartwatches or the teacher's computer. Once the teacher becomes aware of attention issues, they can adjust their teaching approach to re-engage and motivate the students. In summary, AI techniques can automatically identify students' attention level by analyzing their emotions, gaze direction, body posture, and biometric data. This information can assist teachers in optimizing the teaching-learning process.

Introduction

In modern educational settings, accurately assessing and maintaining students' attention is crucial for effective teaching and learning. However, traditional methods of gauging engagement, such as self-reporting or subjective teacher observations, are time-consuming and prone to biases. To address this challenge, Artificial Intelligence (AI) techniques have emerged as promising solutions for automated attention detection. One significant aspect of understanding students' engagement levels is emotion recognition1. AI systems can analyze facial expressions to identify emotions, such as neutral, disgust, surprise, sadness, fear, happiness, and anger2.

Gaze direction and body posture are also crucial indicators of students' attention3. By utilizing cameras and advanced machine learning algorithms, AI systems can accurately track where students are looking and analyze their body posture to detect signs of disinterest or fatigue4. Furthermore, incorporating biometric data enhances the accuracy and reliability of attention detection5. By collecting measurements, such as heart rate and blood oxygen saturation levels, through smartwatches worn by students, objective indicators of attention can be obtained, complementing other sources of information.

This paper proposes a system that evaluates an individual's level of attention using color cameras and other sensors. It combines emotion recognition, gaze direction analysis, body posture assessment, and biometric data to provide educators with a comprehensive set of tools for optimizing the teaching-learning process and improving student engagement. By applying AI techniques, these data can even be evaluated automatically.

The main goal of this work is to describe the system that allows us to capture all the information and, once captured, to train an AI model that allows us to obtain the attention of the whole class in real-time. Although other works have already proposed capturing attention using visual or emotional information6, this work proposes the combined use of these techniques, which provides a holistic approach allowing the use of more complex and effective AI techniques. Moreover, the datasets available so far are limited to either videos or biometric data. The literature includes no datasets that combine images of the students' faces or bodies, biometric data, data on the teacher's position, etc. With the system presented here, it is possible to capture this type of dataset.

The system associates a level of attention with each student at each point of time. This value is a probability value of attention between 0% and 100%, which can be interpreted as low level of attention (0%-40%), medium level of attention (40%-75%), and high level of attention (75%-100%). Throughout the text, this probability of attention is referred to as the level of attention, student attention, or whether students are distracted or not, but these are all related to the same output value of our system.
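For clarity, the mapping from the continuous output to the three discrete categories can be expressed as a short helper function. The sketch below is illustrative only; the function name and return labels are placeholders, and only the thresholds come from the description above.

```python
def attention_category(score: float) -> str:
    """Map a continuous attention probability (0-100) to a discrete level.

    The thresholds follow the bands described in the text:
    low (0%-40%), medium (40%-75%), and high (75%-100%).
    """
    if score < 40:
        return "low"
    elif score < 75:
        return "medium"
    return "high"

# Example usage
print(attention_category(82.5))  # -> "high"
```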

Over the years, the field of automatic engagement detection has grown significantly due to its potential to revolutionize education. Researchers have proposed various approaches for this area of study.

Ma et al.7 introduced a novel method based on a Neural Turing Machine for automatic engagement recognition. They extracted features such as eye gaze, facial action units, head pose, and body pose to create a comprehensive representation for engagement recognition.

EyeTab8, another innovative system, used model-based methods to estimate where someone is looking with both eyes. It was specially designed to work smoothly on a standard tablet with no modifications. The system harnesses well-known image processing and computer vision algorithms. Its gaze estimation pipeline includes a Haar-like feature-based eye detector, as well as a RANSAC-based limbus ellipse fitting approach.

Sanghvi et al.9 propose an approach that relies on vision-based techniques to automatically extract expressive postural features from videos recorded from a lateral view, capturing the behavior of the children. An initial evaluation is conducted, involving the training of multiple recognition models using contextualized affective postural expressions. The results obtained demonstrate that patterns of postural behavior can effectively predict the engagement of the children with the robot.

In other works, such as Gupta et al.10, a deep learning-based method is employed to detect the real-time engagement of online learners by analyzing their facial expressions and classifying their emotions. The approach utilizes facial emotion recognition to calculate an engagement index (EI) that predicts two engagement states: engaged and disengaged. Various deep learning models, including Inception-V3, VGG19, and ResNet-50, are evaluated and compared to identify the most effective predictive classification model for real-time engagement detection.

In Altuwairqi et al.11, the researchers present a novel automatic multimodal approach for assessing student engagement levels in real-time. To ensure accurate and dependable measurements, the team integrated and analyzed three distinct modalities that capture students' behaviors: facial expressions for emotions, keyboard keystrokes, and mouse movements.

Belle et al.12 propose the development of a monitoring system that uses electrocardiography (ECG) as a primary physiological signal to analyze and predict the presence or absence of cognitive attention in individuals while performing a task.

Alban et al.13 utilize a neural network (NN) to detect emotions by analyzing the heart rate (HR) and electrodermal activity (EDA) values of various participants in both the time and frequency domains. They find that an increase in the root mean square of successive differences (RMSSD) and the standard deviation of normal-to-normal intervals (SDNN), coupled with a decrease in the average HR, indicates heightened activity in the sympathetic nervous system, which is associated with fear.

Kajiwara et al.14 propose an innovative system that employs wearable sensors and deep neural networks to forecast the level of emotion and engagement in workers. The system follows a three-step process. Initially, wearable sensors capture and collect data on behaviors and pulse waves. Subsequently, time series features are computed based on the behavioral and physiological data acquired. Finally, deep neural networks are used to input the time series features and make predictions on the individual's emotions and engagement levels.

In other research, such as Costante et al.15, an approach based on a novel transfer metric learning algorithm is proposed, which utilizes prior knowledge of a predefined set of gestures to enhance the recognition of user-defined gestures. This improvement is achieved with minimal reliance on additional training samples. Similarly, a sensor-based human activity recognition framework16 is presented with the goal of impersonal recognition of complex human activities. The framework uses signal data collected from wrist-worn sensors and employs four RNN-based deep learning models (long short-term memory, bidirectional long short-term memory, gated recurrent unit, and bidirectional gated recurrent unit networks) to recognize the activities performed by the wearer of the device.


Protocol

The following protocol follows the guidelines of the University of Alicante's human research ethics committee with the approved protocol number UA-2022-11-12. Informed consent has been obtained from all participants for this experiment and for using the data here.

1. Hardware, software, and class setup

  1. Set a router with WiFi capabilities (the experiments were carried out using a DLink DSR 1000AC) in the desired location so that its range covers the entire room. Here, 25 m2 classrooms with 30 students were covered.
  2. Set one smartwatch (here Samsung Galaxy Watch 5) and one camera (here Logitech C920 cameras) for each student location. Set one embedded device for every two students. Fix two cameras on two tripods and connect them to another embedded device (hereafter referred to as zenithal cameras).
  3. Connect the cameras to the corresponding embedded devices with a USB link and turn them ON. Turn ON the smartwatches as well. Connect every embedded device and smartwatch to the WiFi network of the router set up in step 1.1.
  4. Place the cameras in the corner of each student's desk and point them forward and slightly tilted upwards so that the face of the student is clearly visible in the images.
  5. Place the two tripods with the cameras in front of the desks that are closest to the aisles in the first row of seats of the classroom. Move the tripod riser to the highest position so the cameras can clearly view most of the students. Each embedded device will be able to manage one or two students, along with their respective cameras and watches. The setup is depicted in Figure 1.
  6. Run the capturing software in the smartwatches and in the embedded devices so that they are ready to send the images and the accelerometer, gyroscope, heart rate, and lighting data. Run the server software that collects the data and stores it. A diagram of all these elements can be seen in Figure 2.
  7. Ensure the zenithal cameras dominate the scene so that they have a clear view of the students' bodies. Place an additional individual camera in front of each student.
  8. Let the students sit in their places, inform them of the goals of the experiment, and instruct them to wear the smartwatches on their dominant hand and not to interact with any of the elements of the setup.
  9. Start the capture and collection of the data in the server and resume the class lessons as usual.

Figure 1: Hardware and data pipeline. The camera and smartwatch data are gathered and fed to the machine learning algorithms to be processed.

Figure 2: Position of the sensors, teacher, and students. Diagram showing the positions of the cameras, smartwatches, and GUI in the classroom with the teacher and students.

2. Capture and data processing pipeline

NOTE: All these steps are performed automatically by processing software deployed on a server. The implementation used for the experiments in this work was written in Python 3.8.

  1. Collect the required data by gathering all the images and biometric data from the smartwatch for each student and build a data frame that covers 1 s. This data frame is composed of one image from the individual camera, one image from each zenithal camera (two in this case), fifty records of the three gyroscope values, fifty records of the three accelerometer values, one heart-rate value, and one light-condition value.
  2. To compute the head direction, retrieve an image from the webcam pointed at the student's face and process it with the BlazeFace17 landmark estimation algorithm. The output of this algorithm is a list of 2D key points corresponding to specific areas of the face.
  3. By using the estimated positions of the eyes, nose, mouth, and chin provided by the algorithm, solve the perspective-n-point problem with the cv::solvePnP method of the OpenCV library and a canonical 3D face model. The rotation matrix that defines the head direction is obtained as a result of this procedure (see the first code sketch after this list).
  4. Compute the body pose by forwarding the zenithal image from the data frame in which the student is depicted to the BlazePose18 landmark estimation algorithm and retrieve a list of 2D coordinates of the joints of the body depicted in the image. This list of landmarks describes the student's pose.
    NOTE: The body pose is important as it can accurately represent a student's engagement with different activities during the class. For instance, if the student is sitting naturally with their hands on the desk, it could mean that they are taking notes. In contrast, if they are continually moving their arms, it could mean they are speaking to someone.
  5. Obtain the image of the student's face and perform alignment preprocessing. Retrieve the key points of the eyes from the list computed in step 2.3 and project them to the image plane so that the position of the eyes is obtained.
  6. Trace a virtual vector that joins the positions of both eyes. Compute the angle between that vector and the horizontal line and build a homography matrix to apply the reverse angle to the image and the key points so that the eyes are aligned horizontally (see the second code sketch after this list).
  7. Crop the face using the key points to detect its limits and feed the patch to the emotion detection convolutional neural network19. Retrieve the output from the network, which is a vector with 7 positions, each position giving the probability of the face showing one of the seven basic emotions: neutral, happy, disgusted, angry, scared, surprised, and sad.
  8. Measure the level of attention by obtaining the accelerometer, gyroscope, and heart rate data from the data frame (Figure 3).
  9. Feed the stream of accelerometer and gyroscope data to the custom deep learning model that was built and trained from scratch, as described in the representative results. Retrieve the output from the model, which is a vector with 4 positions, each position giving the probability of the data representing one of the following possible actions: handwriting, typing on a keyboard, using a mobile phone, or resting, as shown in Figure 4.
  10. Feed the final attention classifier with a flattened vector of all the results from the previous systems by merging the head direction, emotion recognition output, body pose, and the gyroscope, accelerometer, and heart rate data. Retrieve the result from this final classifier, which is a score from 0 to 100. Classify this continuous value into one of the three possible discrete categories of attention: low level of attention (0%-40%), medium level of attention (40%-75%), and high level of attention (75%-100%). The structure of the dataset generated is shown in Table 1.
  11. Show the results of the attention levels to the teacher using the graphical user interface (GUI) from the teacher's computer, accessible from a regular web browser.
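The head-direction computation of steps 2.2-2.3 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it uses MediaPipe's face landmark solution as a stand-in for the BlazeFace-based landmark estimator, and the landmark indices, canonical 3D face coordinates, and camera intrinsics are generic placeholders.

```python
# Minimal head-direction sketch (steps 2.2-2.3). Landmark indices, the 3D
# canonical face coordinates, and the camera intrinsics are illustrative.
import cv2
import numpy as np
import mediapipe as mp

# Generic 3D reference points of a canonical face (arbitrary scale).
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye outer corner
    (225.0, 170.0, -135.0),    # right eye outer corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
], dtype=np.float64)
LANDMARK_IDS = [1, 152, 33, 263, 61, 291]  # commonly used Face Mesh indices

def head_direction(image_bgr):
    h, w = image_bgr.shape[:2]
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True) as mesh:
        result = mesh.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return None
    lm = result.multi_face_landmarks[0].landmark
    image_points = np.array(
        [(lm[i].x * w, lm[i].y * h) for i in LANDMARK_IDS], dtype=np.float64)

    # Approximate pinhole camera model (no calibration assumed).
    cam = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]], dtype=np.float64)
    ok, rvec, _ = cv2.solvePnP(MODEL_POINTS, image_points, cam,
                               np.zeros((4, 1)))
    if not ok:
        return None
    rot, _ = cv2.Rodrigues(rvec)       # rotation matrix defining head direction
    angles, *_ = cv2.RQDecomp3x3(rot)  # roll/pitch/yaw in degrees
    return rot, angles
```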
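Similarly, the eye-alignment preprocessing of steps 2.5-2.6 can be sketched with a rotation homography built from the two eye positions. The function below is a minimal, illustrative version; the variable names and the use of cv2.getRotationMatrix2D to build the homography are assumptions, not the authors' code.

```python
# Minimal eye-alignment sketch (steps 2.5-2.6). Assumes the 2D eye positions
# have already been recovered from the landmark list, with left_eye being the
# eye with the smaller x coordinate in image space.
import cv2
import numpy as np

def align_face(image, left_eye, right_eye):
    """Rotate the image so that the segment joining the eyes is horizontal."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))   # tilt vs. the horizontal
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)

    # 2x3 rotation matrix extended to a 3x3 homography that undoes the tilt.
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    homography = np.vstack([rot, [0.0, 0.0, 1.0]])

    h, w = image.shape[:2]
    aligned = cv2.warpPerspective(image, homography, (w, h))

    # The same homography can be applied to the key points to crop the face
    # region before feeding it to the emotion network (step 2.7).
    return aligned, homography
```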

Figure 3: Data captured by the smartwatch. The smartwatch provides gyroscope, accelerometer, heart rate, and light condition readings as streams of data.

Figure 4: Examples of the categories considered by the activity recognition model. Four different actions are recognized by the activity recognition model: handwriting, typing on a keyboard, using a smartphone, and resting position.


Representative Results

The target group of this study is undergraduate and master's students, and so the main age group is between 18 and 25 years old. This population was selected because they can handle electronic devices with fewer distractions than younger students. In total, the group included 25 people. This age group can provide the most reliable results to test the proposal.

The results of the attention level shown to the teacher have two parts. Part A of the result shows individual information about the current attention level of each student. Part B then provides the average attention of the whole class and its temporal history throughout the lesson. This makes it possible to capture a general trend of the students' attention in the classroom and to adapt the methodology used by the teacher in a live fashion. Every second, the interface requests new information from the server. Furthermore, this view incorporates the use of browser notifications, allowing drastic changes in the students' attention to be shown in a non-intrusive manner while the teacher performs their activities normally, without the need to keep this GUI in the foreground. An example of this GUI can be seen in Figure 5.
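As an illustration of the polling behavior described above, a minimal server endpoint could look like the following sketch. The use of Flask, the route name, and the JSON layout are assumptions made for illustration; the text only states that the GUI requests new information from the server every second and relies on browser notifications.

```python
# Illustrative sketch of a polling endpoint for the teacher GUI; not the
# authors' implementation. Flask, the route, and the data layout are assumed.
from flask import Flask, jsonify
import time

app = Flask(__name__)

# In the real system this state would be filled by the processing pipeline.
attention_state = {
    "students": {"student_01": 83.0, "student_02": 37.5},  # id -> attention (0-100)
    "history": [],                                          # (timestamp, class average)
}

@app.route("/attention")
def attention():
    values = list(attention_state["students"].values())
    average = sum(values) / len(values) if values else 0.0
    attention_state["history"].append((time.time(), average))
    return jsonify({
        "per_student": attention_state["students"],    # part A of the view
        "class_average": average,                       # part B of the view
        "history": attention_state["history"][-600:],   # last ~10 min of averages
    })

# The GUI would call GET /attention once per second and raise a browser
# notification when the class average drops sharply.
if __name__ == "__main__":
    app.run(port=8000)
```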

Figure 5: Graphical user interface of the system. The level of attention is shown in a GUI that can be accessed by any internet browser on any capable device, such as a tablet, smartphone, and desktop or laptop computer.

As for the activity recognition model, a recurrent neural network was defined that receives as input a sequence of 200 measurements with 6 values each: namely, three values from the accelerometer and three from the gyroscope. The model has an LSTM layer with 64 units followed by a softmax-activated fully connected layer with four output neurons, one per category. The architecture is depicted in Figure 6.
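A minimal sketch of this architecture, assuming a Keras implementation (the optimizer and loss below are assumptions; the layer sizes and input shape follow the description above):

```python
# Minimal Keras sketch of the described activity classifier: sequences of
# 200 time steps x 6 channels (3 accelerometer + 3 gyroscope), one LSTM layer
# with 64 units, and a softmax fully connected layer with 4 outputs.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_activity_classifier():
    model = models.Sequential([
        layers.Input(shape=(200, 6)),           # 4 s of 50 Hz inertial data
        layers.LSTM(64),                         # 64-unit LSTM layer
        layers.Dense(4, activation="softmax"),   # handwriting, typing, phone, resting
    ])
    # Optimizer and loss are illustrative assumptions, not stated in the text.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_activity_classifier()
model.summary()
```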

Figure 6: Architecture of the activity classifier. As input, the model takes smartwatch data and processes it through an LSTM layer followed by a fully connected layer. The output is the probability of the sample depicting each activity.

As an output, the classifier returns the class corresponding to the estimated action being performed by the student. This neural network was trained using data captured from 6 different individuals. Each was recorded while performing actions from the four categories for 200 s. All the data was then duplicated, generating a mirrored dataset by inverting the sign of the X-axis sensor values. This is equivalent to collecting data from both the right and left hands of all individuals. It is a common data augmentation practice in machine learning, intended to generate more samples from the existing dataset and avoid overfitting.

The recordings are grouped into 4 s windows (200 samples at 50 Hz, matching the input of the LSTM network) by moving the window one second at a time. As a result, 197 windows are obtained from each 200 s recording. In total, there are 9,456 training samples (6 people x 4 classes x 2 hands x 197 windows). The data was separated into 90% training and 10% validation, and the network was trained for 300 epochs with a batch size of 64.
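The windowing described above can be reproduced with a simple sliding-window routine. The sketch below assumes each recording is stored as a NumPy array of 50 Hz samples with six channels; the names and array layout are illustrative.

```python
# Sliding-window sketch for the training data preparation described above.
# Assumes each recording is a NumPy array of shape (n_seconds * 50, 6).
import numpy as np

SAMPLE_RATE = 50      # smartwatch samples per second
WINDOW_SECONDS = 4    # window length fed to the LSTM (200 samples)
STRIDE_SECONDS = 1    # the window moves one second at a time

def make_windows(recording: np.ndarray) -> np.ndarray:
    win = WINDOW_SECONDS * SAMPLE_RATE
    stride = STRIDE_SECONDS * SAMPLE_RATE
    starts = range(0, recording.shape[0] - win + 1, stride)
    return np.stack([recording[s:s + win] for s in starts])

# A 200 s recording yields 197 windows of shape (200, 6).
recording = np.random.randn(200 * SAMPLE_RATE, 6)  # placeholder data
windows = make_windows(recording)
print(windows.shape)  # (197, 200, 6)
```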

As shown in Figure 7, the model was trained for 300 epochs. The validation loss was less than 0.1%, and the validation accuracy was 97%. The metrics obtained highlight the good performance of the model.

Figure 7: Training and validation losses and accuracies. Training and validation losses and accuracies show that the model performance is adequate and does not suffer from overfitting.

Finally, the results of each subsystem (head pose, pose estimation, emotion prediction, and activity recognition) are merged into a boosting classifier that provides a probability value on whether or not the student is attentive to the lesson.

To ensure conceptual and procedural clarity for accurate labeling, expert input was gathered and existing studies were consulted, as described below.

Regarding expert input, the Delphi method was chosen20,21,22, a method that is increasingly relevant in the technological field23. As pointed out in a previous publication, the Delphi method is defined as an iterative, group, and anonymous process to generate opinions on a topic and explore consensus among experts on that topic23. In the case presented here, 6 experts contributed for 2 weeks and 2 rounds of consultation, in concurrence with Khodyakov et al.24. Due to the importance of the profile of the participating experts, the consultation included academic specialists from universities in the fields of Psychology, Pedagogy, and Computer Science. A quantitative method was used to collect the data. The results have led to a consensus on the labeling used in this study.

With regard to the studies consulted as a basis for the labeling, we started with an exploratory search of the main databases, such as WOS and Scopus. The contributions of earlier studies25,26,27,28 are worth mentioning in this regard. All of them address the problem of attention from specific perspectives, but not in a holistic way through an intelligent system, as this study intends. On the other hand, there are studies that combine two specific sources, such as Zaletelj et al.29, which focuses on facial and body features, but they are far from global approaches such as this study. One previous work stands out30, citing Posner's taxonomy, which is taken into account in this study. Posner considers attention as a set of isolatable neural systems (alertness, orienting, and executive control), which often work together to organize behavior30.

The boosting classifier is an ensemble algorithm that learns a weight for each weak classifier's output and generates a final value by means of a weighted combination of the individual decisions. This information, as discussed in step 2.11, is presented in real-time via a web interface so that the teacher can notice drastic changes in the attention level of the class through browser notifications. With this visualization interface, which shows the real-time evolution of the students' overall attention level, teachers can adapt their classes to engage students in their lessons and get more out of the class.
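As an illustration of this fusion stage, the sketch below concatenates the subsystem outputs into one feature vector and trains a generic boosting ensemble. The specific boosting algorithm, feature layout, labels, and hyperparameters are assumptions, not the authors' implementation.

```python
# Illustrative sketch of the fusion stage: per-student subsystem outputs are
# flattened into one feature vector and fed to a boosting ensemble whose
# positive-class probability serves as the 0-100 attention score.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def build_feature_vector(head_dir, emotions, body_pose, activity, heart_rate):
    """Concatenate subsystem outputs into a single flat vector (assumed layout)."""
    return np.concatenate([
        np.ravel(head_dir),   # 3 angles (roll, pitch, yaw)
        np.ravel(emotions),   # 7 emotion probabilities
        np.ravel(body_pose),  # 9 x 2 body key points
        np.ravel(activity),   # 4 activity probabilities
        [heart_rate],         # beats per minute
    ])

# Placeholder training data: one row per labeled data frame.
X = np.random.rand(500, 3 + 7 + 18 + 4 + 1)
y = np.random.randint(0, 2, size=500)   # label: attentive (1) or not (0)

fusion = GradientBoostingClassifier()
fusion.fit(X, y)

# The positive-class probability (x100) gives the continuous attention score,
# which can then be binned into the low/medium/high categories.
score = 100 * fusion.predict_proba(X[:1])[0, 1]
print(score)
```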

Table 1 shows the dataset structure, which is composed of the following elements. Individual camera: one image per second at 960 x 720 pixels RGB. Zenithal cameras: two images per second at 1920 x 1080 pixels RGB. Gyroscope: 50 samples per second, each composed of 3 floating point values with 19 decimal places, corresponding to the X, Y, and Z coordinates; it measures angular velocity in °/s. Accelerometer: 50 samples per second, each composed of 3 floating point values with 19 decimal places, corresponding to the X, Y, and Z coordinates; it measures acceleration in m/s2. Rotation vector: 50 samples per second, each composed of a quaternion with 4 floating point values with 19 decimal places (with values between -1 and 1). Heart rate: one value per second measuring beats per minute. Light sensor: approximately 8-10 integer values per second measuring the light level. Head direction: for each image, 3 decimal numbers representing the X-axis (roll), Y-axis (pitch), and Z-axis (yaw), which indicate the tilt of the head. Body pose: for each image, 18 decimal numbers representing the X and Y coordinates of 9 key points.

Individual camera: 960 x 720 pixels RGB image
Zenithal cameras: 2 x (1920 x 1080 pixels RGB image)
Gyroscope: 50 x 3 (XYZ) decimal numbers
Accelerometer: 50 x 3 (XYZ) decimal numbers
Rotation vector: 50 x quaternion
Heart rate: beats per minute
Light condition: 10 x lumens
Head direction: 3 (XYZ) decimal numbers
Body pose: 9 x 2 (XY) decimal numbers

Table 1: Structure of the dataset. The dataset combines biometric data and images taken from the different cameras, which are used for classification purposes.


Discussion

This work presents a system that measures the attention level of students in a classroom using cameras, smartwatches, and artificial intelligence algorithms. This information is subsequently presented to the teacher so that they have an overview of the general state of the class.

One of the main critical steps of the protocol is the synchronization of the smartwatch information with the color camera images, as these have different frequencies. This was solved by deploying Raspberry Pi boards as local servers that receive the information from the smartwatches and cameras with their respective timestamps and perform a rough matching of this information. Finally, this information is sent to a centralized server for further processing. The other critical step of the protocol is the definition of the final classifier that generates the final inference from the data obtained from the different sources. To resolve this point, the raw data must be preprocessed through different subsystems to generate valuable information, such as the pose, emotion, or action of the user. This information, together with the biometric data, is normalized and combined to estimate the attention level of each student.

The experimental results suggest that head direction and body pose can be accurately estimated using the zenithal and individual cameras, while emotion recognition performs better with the most common emotions, such as happiness or neutrality. Regarding the action classifier based on the smartwatch, highly distinguishable actions, such as writing, typing, or texting, present good detection accuracy and are suitable for the system.

This system offers several advantages and distinct features when compared to existing or alternative methods, as described below.

Objective and continuous measurements
Smartwatches and cameras provide objective and continuous measurements of students' attention throughout the class session. Traditional methods, such as self-reporting with questionnaires, can be subjective and prone to bias. The use of wearable devices and cameras eliminates reliance on self-reporting or external judgments, allowing educators to obtain more reliable and detailed data on attention levels.

Real-time information
The method captures attention data in real-time within the natural context of the classroom. It allows teachers to understand attention patterns during the course of a lesson. This is crucial as attention fluctuates during a class, and capturing real-time data allows for a more accurate assessment of attention dynamics.

Multimodal data integration
By combining data from smartwatches, which provide physiological measures, and cameras, which provide visual information, educators can gain a better understanding of attention. Physiological measures, such as movement patterns, can complement visual observations from cameras, providing a richer and more nuanced representation of attentional states. This multimodal approach increases the reliability and validity of attention assessment.

However, although this system has several advantages, there are some limitations to be considered, as described below.

Ethical and privacy concerns
Collecting physiological and visual data from students raises privacy concerns. It is essential to ensure that proper informed consent is obtained from participants and that data are anonymized and securely stored to protect the privacy and rights of the individuals involved. Unauthorized access or misuse of sensitive data must be prevented.

Reliability and validity of data
Although smartwatches and cameras can provide objective measurements, ensuring the reliability and validity of the data collected can pose challenges. Technical limitations of the devices, such as sensor accuracy or signal interference, can affect data quality. Calibration and validation procedures are necessary to establish the accuracy and consistency of the measurements.

Interpretation of attention signals
Interpreting attention-related signals obtained from smartwatches and cameras requires careful analysis. Attention is a complex cognitive process influenced by various factors, and physiological signals or visual information may not always directly correlate with attention levels, as they might not capture other important aspects of attention, such as cognitive engagement, selective focus, or mental effort.

Invasive or disruptive nature
Wearing devices or being observed by cameras may alter students' behavior or attention levels. Some students may feel self-conscious or uncomfortable, which may affect their natural attention patterns. It is important to consider the potential impact of the method itself on the attention being measured and to minimize any disruption to the learning environment. To reduce potential distractions, we simplified the use of the smartwatch: it only needs to be worn like a normal watch, and all other functions were disabled so that it does not become a distraction that can alter the learner's level of attention. The students receive a brief explanation of the experiment and its goals to give them time to get accustomed to the setting.

Regarding head pose17, pose estimation18, and emotion prediction19, comprehensive and exhaustive experimentation is carried out in the corresponding articles. In summary, it should be noted that these three systems work properly, achieving high accuracy and resolving the corresponding tasks. However, despite the good performance, the approaches have some limitations. Regarding head direction estimation, it tends to provide erroneous predictions when the head is heavily tilted towards any direction. When this happens, the landmark estimation system performs poorly as many of the interesting points are not present in the image as a consequence of the abovementioned self-occlusion. Thus, this event inevitably leads to poor performance. This case is also present in the pose estimation method, as the approach is similar. If there are important key points that are not visible within the input image, the pose estimation methods tend to make them up, leading to erroneous and impossible poses. This effect can be mitigated by correctly placing the zenithal and individual cameras. As for emotion prediction, it tends to accurately detect some facial expressions whilst struggling with others. For instance, neutral and happy emotions are consistently detected, while the system is more prone to fail on fear, disgust, and surprise emotions if they are only slightly or fleetingly shown.

Regarding network interference, the specifications of the D-Link DSR-1000AC router indicate a maximum WLAN data transfer rate of 1300 Mbps. Consequently, an average participant in the experiment would transmit approximately 4.344 Mb per second. This calculation takes into account the transmission of 3 images per second, each with an average size of 180 kB, plus a sensor data package from the smartwatch of 3 kB per second. Therefore, considering the zenithal images, the theoretical maximum number of students that can be connected simultaneously is 297. However, this number will be influenced by the final number of devices connected to the router and the congestion level of the WiFi channel at the moment of the experiment.
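The figures above can be reproduced with a short back-of-the-envelope calculation. In the sketch below, reserving capacity equivalent to two streams for the zenithal cameras is an assumption made to match the reported maximum of 297 students.

```python
# Back-of-the-envelope reproduction of the bandwidth figures above. The
# reservation of two extra streams for the zenithal cameras is an assumption.
IMAGE_KB = 180          # average size of one image, kB
IMAGES_PER_SECOND = 3   # one individual + two zenithal images per data frame
SENSOR_KB = 3           # smartwatch data package per second, kB

per_student_mbps = (IMAGES_PER_SECOND * IMAGE_KB + SENSOR_KB) * 8 / 1000
print(per_student_mbps)                       # 4.344 Mb/s per participant

WLAN_MBPS = 1300                              # router's maximum WLAN rate
max_students = int(WLAN_MBPS / per_student_mbps) - 2
print(max_students)                           # ~297
```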

Finally, the proposed method has several important implications and potential applications in various research areas, as described below.

Education and research
This method can provide valuable insights into the factors that affect students' attention and engagement in the classroom. Researchers can analyze the data collected from smartwatches and cameras to understand how different teaching methods, classroom environments, or even individual student characteristics influence attention levels.

Special education
This can be especially beneficial for studying attention-related challenges in students with special needs. Researchers can use data collected from smartwatches and cameras to identify patterns that lead to attention difficulties. This information can help develop targeted interventions and personalized strategies to support students with attention deficit disorders or other attention-related conditions.


Disclosures

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was developed with funding from Programa Prometeo, project ID CIPROM/2021/017. Prof. Rosabel Roig holds the UNESCO Chair "Education, Research and Digital Inclusion".

Materials

Name Company Catalog Number Comments
4 GPUs  Nvidia A40 Ampere NVIDIA TCSA40M-PB GPU for centralized model processing server
FusionServer 2288H V5 X-Fusion 02311XBK Platform that includes power supply and motherboard for centralized model processing server
Memory Card Evo Plus 128 GB Samsung MB-MC128KA/EU Memory card for the operation of the raspberry pi 4b 2gb.  One for each raspberry. 
NEMIX RAM - 512 GB Kit DDR4-3200 PC4-25600 8Rx4 EC NEMIX M393AAG40M32-CAE RAM for centralized model processing server
Processor Intel Xeon Gold 6330 Intel CD8068904572101 Processor for centralized model processing server
Raspberry PI 4B 2GB Raspberry 1822095 Local server that receives requests from the clocks and sends them to the general server. One every two students.
Samsung Galaxy Watch 5 (40mm) Samsung SM-R900NZAAPHE Clock that monitors each student's activity. For each student. 
Samsung MZQL23T8HCLS-00B7C PM9A3 3.84Tb Nvme U.2 PCI-Express-4 x4 2.5inch Ssd Samsung MZQL23T8HCLS-00B7C Internal storage for centralized model processing server
WebCam HD Pro C920 Webcam FullHD Logitech 960-001055 Webcam HD. One for each student plus two for student poses.


References

  1. Hasnine, M. N., et al. Students' emotion extraction and visualization for engagement detection in online learning. Procedia Comp Sci. 192, 3423-3431 (2021).
  2. Khare, S. K., Blanes-Vidal, V., Nadimi, E. S., Acharya, U. R. Emotion recognition and artificial intelligence: A systematic review (2014-2023) and research recommendations. Info Fusion. 102, 102019 (2024).
  3. Bosch, N. Detecting student engagement: Human versus machine. UMAP '16: Proc the 2016 Conf User Model Adapt Personal. , 317-320 (2016).
  4. Araya, R., Sossa-Rivera, J. Automatic detection of gaze and body orientation in elementary school classrooms. Front Robot AI. 8, 729832 (2021).
  5. Lu, Y., Zhang, J., Li, B., Chen, P., Zhuang, Z. Harnessing commodity wearable devices for capturing learner engagement. IEEE Access. 7, 15749-15757 (2019).
  6. Vanneste, P., et al. Computer vision and human behaviour, emotion and cognition detection: A use case on student engagement. Mathematics. 9 (3), 287 (2021).
  7. Ma, X., Xu, M., Dong, Y., Sun, Z. Automatic student engagement in online learning environment based on neural Turing machine. Int J Info Edu Tech. 11 (3), 107-111 (2021).
  8. Wood, E., Bulling, A. EyeTab: model-based gaze estimation on unmodified tablet computers. ETRA '14: Proc Symp Eye Tracking Res Appl. , 207-210 (2014).
  9. Sanghvi, J., et al. Automatic analysis of affective postures and body motion to detect engagement with a game companion. HRI '11: Proc 6th Int Conf Human-robot Interact. , 205-211 (2011).
  10. Gupta, S., Kumar, P., Tekchandani, R. K. Facial emotion recognition based real-time learner engagement detection system in online learning context using deep learning models. Multimed Tools Appl. 82 (8), 11365-11394 (2023).
  11. Altuwairqi, K., Jarraya, S. K., Allinjawi, A., Hammami, M. Student behavior analysis to measure engagement levels in online learning environments. Signal Image Video Process. 15 (7), 1387-1395 (2021).
  12. Belle, A., Hargraves, R. H., Najarian, K. An automated optimal engagement and attention detection system using electrocardiogram. Comput Math Methods Med. 2012, 528781 (2012).
  13. Alban, A. Q., et al. Heart rate as a predictor of challenging behaviours among children with autism from wearable sensors in social robot interactions. Robotics. 12 (2), 55 (2023).
  14. Kajiwara, Y., Shimauchi, T., Kimura, H. Predicting emotion and engagement of workers in order picking based on behavior and pulse waves acquired by wearable devices. Sensors. 19 (1), 165 (2019).
  15. Costante, G., Porzi, L., Lanz, O., Valigi, P., Ricci, E. Personalizing a smartwatch-based gesture interface with transfer learning. 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, 2530-2534 (2014).
  16. Mekruksavanich, S., Jitpattanakul, A. Deep convolutional neural network with RNNs for complex activity recognition using wrist-worn wearable sensor data. Electronics. 10 (14), 1685 (2021).
  17. Bazarevsky, V., Kartynnik, Y., Vakunov, A., Raveendran, K., Grundmann, M. BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs. arXiv. , (2019).
  18. Bazarevsky, V., et al. BlazePose: On-device Real-time Body Pose tracking. arXiv. , (2020).
  19. Mejia-Escobar, C., Cazorla, M., Martinez-Martin, E. Towards a better performance in facial expression recognition: a data-centric approach. Comput Intelligence Neurosci. , In press (2023).
  20. El-Garem, A., Adel, R. Applying systematic literature review and Delphi methods to explore digital transformation key success factors. Int J Eco Mgmt Engi. 16 (7), 383-389 (2022).
  21. Indumathi, V., Kist, A. A. Using electroencephalography to determine student attention in the classroom. IEEE Global Engineering Education Conference (EDUCON), Kuwait, Kuwait, 1-3 (2023).
  22. Ma, X., Xie, Y., Wang, H. Research on the construction and application of teacher-student interaction evaluation system for smart classroom in the post COVID-19. Studies Edu Eval. 78, 101286 (2023).
  23. Andersen, D. Constructing Delphi statements for technology foresight. Futures Foresight Sci. 5 (2), e144 (2022).
  24. Khodyakov, D., et al. Disciplinary trends in the use of the Delphi method: A bibliometric analysis. PLoS One. 18 (8), e0289009 (2023).
  25. Martins, A. I., et al. Consensus on the Terms and Procedures for Planning and Reporting a Usability Evaluation of Health-Related Digital Solutions: Delphi Study and a Resulting Checklist. J Medical Internet Res. 25, e44326 (2023).
  26. Dalmaso, M., Castelli, L., Galfano, G. Social modulators of gaze-mediated orienting of attention: A review. Psychon Bull Rev. 27 (5), 833-855 (2020).
  27. Klein, R. M. Thinking about attention: Successive approximations to a productive taxonomy. Cognition. 225, 105137 (2022).
  28. Schindler, S., Bublatzky, F. Attention and emotion: An integrative review of emotional face processing as a function of attention. Cortex. 130, 362-386 (2020).
  29. Zaletelj, J., Košir, A. Predicting students' attention in the classroom from Kinect facial and body features. J Image Video Proc. 80, (2017).
  30. Strauch, C., Wang, C. A., Einhäuser, W., Van der Stigchel, S., Naber, M. Pupillometry as an integrated readout of distinct attentional networks. Trends Neurosci. 45 (8), 635-647 (2022).

Tags

attention prediction, deep learning, machine learning, activity recognition

Cite this Article


Marquez-Carpintero, L., Pina-Navarro, M., Suescun-Ferrandiz, S., Escalona, F., Gomez-Donoso, F., Roig-Vila, R., Cazorla, M. Artificial Intelligence-Based System for Detecting Attention Levels in Students. J. Vis. Exp. (202), e65931, doi:10.3791/65931 (2023).
