Team Waldo

Blog post 1: Early Beginnings

Updated: May 21, 2019

This is the first blog post from Waldo, an academic project with IBM that aims to help carers in care homes understand Makaton. Makaton is a language programme that uses signs, symbols and speech to help children and adults communicate more easily. The main aim of this project is to train a machine learning algorithm that can recognise Makaton signs from a camera feed, delivered in a friendly package that people of all ages would be happy to interact with.


We made our way down to IBM Hursley in the lovely English countryside for our first meeting with the IBM team, getting a fuller picture of the project's aims and goals along with some technical assistance to get started. We also met the team from Imperial working on the Hexology project, as well as the team from Hexology itself, exchanging ideas and learning from each other.



After the meeting at IBM, we got to work on our project. In our first week, we worked concurrently in two groups. The first group sourced and set up a suitable yet minimal development platform with decent Graphics Processing Unit (GPU) compute and researched several deep learning algorithms to tackle the sign recognition problem, while the second group gathered a starting dataset of Makaton signs.


 



A remote server with the following specifications was chosen to kick-start the project:

- Dual-core Intel CPU with 13 GB RAM

- NVIDIA Tesla K80 GPU with 12 GB GDDR5

- 256 GB SSD


To fully exploit the processing capabilities that GPUs offer for training neural networks with the TensorFlow framework, CUDA v10.0 was installed alongside cuDNN v7.5 to speed up training. With the development platform fully set up, the team could finally get hands-on with implementing different deep learning algorithms.
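As a quick sanity check after this kind of setup (a minimal sketch assuming a TensorFlow 1.x GPU build, not the project's actual scripts), one can confirm that TensorFlow sees the CUDA/cuDNN stack and the K80 before launching any training:

```python
# Minimal check that the GPU-enabled TensorFlow build can see CUDA and a GPU.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU device:", tf.test.gpu_device_name() or "no GPU found")
```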


The first algorithm that came intuitively to mind was the CNN + LSTM (Convolutional Neural Network + Long Short-Term Memory)[1] model. In this approach, input videos are pre-processed into individual frames. The frames are then passed through a pre-trained CNN originally trained for object classification on ImageNet. The learned features are then passed through the LSTM to learn the temporal evolution of these features before a prediction is made on which gesture the video has captured. The group tested this approach using several different pre-trained CNN models available in the Keras documentation[2]. In selecting the models for this application, a balance had to be struck between model accuracy and model size, as the eventual application would have to be deployed at the edge. MobileNetV2 and InceptionV3 were thus chosen, as they delivered high Top-1 accuracy on the ImageNet validation set despite being lightweight models. After several hours of training on a subset of the 20BN-jester Dataset V1[3] containing 4 hand gesture classes, the following results were obtained and tabulated below:
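To make the structure of this pipeline concrete, here is a minimal Keras sketch of the idea (illustrative clip length, frame size and class count; not the exact project code): a frozen, ImageNet-pre-trained MobileNetV2 extracts per-frame features, TimeDistributed applies it across the clip, and an LSTM models the temporal evolution before a softmax over the gesture classes.

```python
# Illustrative CNN + LSTM gesture classifier: per-frame features from a
# frozen MobileNetV2, followed by an LSTM over the frame sequence.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

NUM_FRAMES, FRAME_SIZE, NUM_CLASSES = 16, 224, 4  # assumed values

# Frozen per-frame feature extractor (ImageNet weights, no classification head)
cnn = MobileNetV2(include_top=False, weights="imagenet",
                  input_shape=(FRAME_SIZE, FRAME_SIZE, 3), pooling="avg")
cnn.trainable = False

inputs = layers.Input(shape=(NUM_FRAMES, FRAME_SIZE, FRAME_SIZE, 3))
x = layers.TimeDistributed(cnn)(inputs)   # -> (frames, feature_dim)
x = layers.LSTM(256)(x)                   # temporal modelling over frames
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```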



The next algorithm implemented was C3D[4], which uses 3D Convolutional Neural Networks to classify which gesture a video clip captures. Because of the nature of 3D convolutions, the resulting model was massive, containing over 79 million parameters. However, the results of this approach were somewhat disappointing, as it attained a validation accuracy of just over 56% when trained on just 4 classes of gestures from the 20BN-jester Dataset V1.
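For comparison with the sketch above, a much-simplified C3D-style network can be expressed in Keras as follows (again with illustrative dimensions and far fewer layers; the real C3D stacks eight 3D convolution layers and two large fully connected layers, which is where most of its roughly 79 million parameters live):

```python
# Simplified C3D-style network: 3D convolutions learn spatio-temporal
# features directly from the clip, then dense layers classify the gesture.
from tensorflow.keras import layers, models

NUM_FRAMES, FRAME_SIZE, NUM_CLASSES = 16, 112, 4  # assumed values

model = models.Sequential([
    layers.Conv3D(64, (3, 3, 3), activation="relu", padding="same",
                  input_shape=(NUM_FRAMES, FRAME_SIZE, FRAME_SIZE, 3)),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),   # pool spatially only at first
    layers.Conv3D(128, (3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),   # pool over time and space
    layers.Conv3D(256, (3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.Flatten(),
    layers.Dense(2048, activation="relu"),      # dense layers dominate the size
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```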


The second group identified 5 simple Makaton signs that are likely to be commonly used, covering both signs that are largely static and signs that involve more movement. These signs are ‘dinner’, ‘good’, ‘home’, ‘no’ and ‘sorry’ (which can be seen below). The aim in generating the dataset was to provide a sufficient level of variation within the set, so that any neural network trained on this data would be robust and able to work under a variety of conditions. To this end, we recorded around 20 volunteers, with videos taken from different angles, against different backgrounds, and under different lighting conditions.
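As a rough illustration of how such clips could be turned into fixed-length frame sequences for the models above (the folder layout, sampling rate and helper below are hypothetical, not the project's actual data pipeline):

```python
# Hypothetical preprocessing sketch: read each volunteer clip, resize the
# frames, and sample a fixed number of evenly spaced frames per clip.
import os
import cv2  # OpenCV, used here only for video decoding and resizing

SIGNS = ["dinner", "good", "home", "no", "sorry"]
NUM_FRAMES, FRAME_SIZE = 16, 224  # assumed values
DATA_DIR = "makaton_clips"        # assumed layout: one sub-folder per sign

def clip_to_frames(video_path):
    """Return NUM_FRAMES evenly spaced, resized frames from one clip."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (FRAME_SIZE, FRAME_SIZE)))
    cap.release()
    step = max(len(frames) // NUM_FRAMES, 1)
    return frames[::step][:NUM_FRAMES]

dataset = [(clip_to_frames(os.path.join(DATA_DIR, sign, name)), sign)
           for sign in SIGNS
           for name in os.listdir(os.path.join(DATA_DIR, sign))]
```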



