
The Final Stretch

Updated: Jun 11, 2019

This past week was all about fine-tuning and fitting all the components into Waldo. The objective is to ensure that all of Waldo’s functions are as robust and as accurate as possible. With lots of hard work and determination, there is now a working sign language recognition model, working buttons, and a working ultrasonic sensor, all interfaced to a speaker that produces sound through the IBM Watson text-to-speech API. This week’s blog documents a productive and successful week, and a first working demo can be found at the end of the post.


We also submitted the leaflet and completed the poster, another deliverable for our project, which will be displayed at our booth on demo day. The A1-sized poster takes a more technical look at our solution, illustrating the design process and the salient features of our design.



MACHINE LEARNING


Having settled on a model that not only had sufficient complexity but was also compact enough to run on the Jetson Nano, we turned our attention towards improving the accuracy and robustness of the model as a whole.


Given the time that we had, there were limits as to how much data we could obtain to train our model. This certainly put a cap on the performance of the model as deep learning models rely heavily on data to generalise well. However, we had several ideas to circumvent this limitation.


Before getting into the approaches we took, here are some details about our dataset. For each person that we approached, we recorded them performing the 5 different Makaton signs, so for every video in a given class there is a corresponding video of the same person in each of the other 4 classes. We were only able to collect a total of 175 videos per class (5 classes) for our training set. Of the 175 videos, 70 were of the three group members tasked with data collection, leaving 105 diverse videos per class. For our validation set, we had 50 videos per class, all of which are diverse. It is important to note that having enough validation data is crucial to ensure that the validation accuracy/loss can be used as a proxy for the model’s true accuracy/loss on out-of-sample data.


Firstly, to test the limits of our model, we trained it on the training data (118,562 samples) from the Jester Dataset, which has 27 classes. We used 80% of this data as the training set and 20% as the validation set. As each epoch took around 4 hours, we trained the model for 5 epochs, obtaining a training accuracy of 88% and a validation accuracy of 87%. We don’t believe the model had overfitted yet (i.e. both the training and validation accuracy could likely still rise), but this was sufficient evidence of the model’s capabilities.


Next, we trained the model from randomly initialised weights on our own data, with 6 classes (5 actions + 1 no gesture). It achieved a training accuracy of 98.8% and a validation accuracy of 91.6%. This might appear to be a very positive result, but we were wary of the fact that our validation set contained only 300 samples in total.


With our limited data in mind, we turned to data augmentation, a technique that artificially creates new training data from existing training data. We performed this by applying the same affine transformation to all 30 frames of a given video sample. This involved random translation (+/- 20% in the x and y axes), scaling (80%-120%), rotation (+/- 5 deg) and shear (+/- 5 deg) within the ranges specified. In addition, contrast normalisation (0.75-1.5) and additive Gaussian noise were also introduced. For those interested, we used the Python library imgaug [1] and wrote some code to apply the same transformation to all frames of a video, roughly as sketched below.
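Here is a minimal sketch of that per-video augmentation with imgaug. The augmenter ranges mirror the ones listed above, while the noise strength and the (30, H, W, 3) clip layout are assumptions made for illustration; calling to_deterministic() freezes the sampled parameters so every frame of the clip receives the same transformation.

```python
import numpy as np
import imgaug.augmenters as iaa

# Augmentation pipeline mirroring the ranges described above.
seq = iaa.Sequential([
    iaa.Affine(
        translate_percent={"x": (-0.2, 0.2), "y": (-0.2, 0.2)},  # +/- 20% translation
        scale=(0.8, 1.2),                                        # 80%-120% scaling
        rotate=(-5, 5),                                          # +/- 5 deg rotation
        shear=(-5, 5),                                           # +/- 5 deg shear
    ),
    iaa.ContrastNormalization((0.75, 1.5)),
    iaa.AdditiveGaussianNoise(scale=(0, 0.05 * 255)),  # noise strength is an assumed value
])

def augment_video(frames):
    """Apply one randomly sampled transformation identically to all frames.

    frames: np.ndarray of shape (30, H, W, 3), dtype uint8 (assumed layout).
    """
    seq_det = seq.to_deterministic()  # freeze the sampled parameters for this clip
    return np.stack([seq_det.augment_image(f) for f in frames])

# Usage example on a dummy 30-frame clip.
clip = np.zeros((30, 112, 112, 3), dtype=np.uint8)
augmented = augment_video(clip)
```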


On top of this, we also had the idea of integrating data from the Jester Dataset into our training. Therefore, in addition to our augmented data, we also trained the model on the non-conflicting Jester actions (for example, Jester’s ‘Thumbs Up’ clashes with our ‘Good’ sign, so it was left out). The idea behind this is that having more data, from both our augmented dataset and the Jester dataset, would enable the model to better learn the important spatio-temporal features for gesture recognition. Having additional gestures from Jester would force the model to learn a better representation of the gestures and to focus on the parts of the data that are most pertinent to recognising them.


Having done both the augmentation of our data and the introduction of data from the Jester Dataset, the model achieved a peak validation accuracy of 89.5%. Whilst at first glance this may seem like a downgrade from the 91.6% obtained without augmentation, it is important to point out that the 89.5% validation accuracy was obtained over 26 classes (over four times as many as before) and on substantially more data (22,462 samples). This gives us confidence that the validation loss and accuracy reflect the model’s out-of-sample performance.


The model was then put into practical testing on the Jetson Nano. Empirically, it gave fewer false predictions, and would give a peak prediction probability of >0.99 when the gestures were performed. For the final output, we used a hard threshold probability of 0.94 and required that two consecutive predictions of the same gesture above the threshold occur before signalling that a gesture is detected.
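The exact code is specific to our pipeline, but the thresholding logic amounts to something like the sketch below; the class names and the no-gesture label are placeholders.

```python
THRESHOLD = 0.94          # hard probability threshold
CONSECUTIVE_REQUIRED = 2  # same gesture must exceed it twice in a row

last_gesture = None
streak = 0

def update(probabilities, class_names):
    """Return a gesture name once it has crossed the threshold twice in a row."""
    global last_gesture, streak
    best = int(probabilities.argmax())
    if probabilities[best] >= THRESHOLD and class_names[best] != "no_gesture":
        if class_names[best] == last_gesture:
            streak += 1
        else:
            last_gesture, streak = class_names[best], 1
        if streak >= CONSECUTIVE_REQUIRED:
            streak = 0
            return last_gesture  # signal a detection
    else:
        last_gesture, streak = None, 0
    return None
```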


Overall, the model performs well in recognising the different Makaton signs. However, the way the model is trained and the way it is used differ slightly. The model is trained on 30 frames that capture the entire motion of a given Makaton sign, whereas in practice we use a sliding-window approach that feeds the 30 most recent frames (sampled at 15 FPS) into the model. Thus, a portion of the 30 input frames may look like the beginning or ending of one of the gestures, and the model may have learnt to use that to identify the action. In this respect there is room for improvement, and further inspection and analysis of what the model focuses on would allow it to be tuned better. Additionally, the use of a hard threshold is simple and effective for our proof of concept, but there may well be better ways of determining when to signal that a sign has been detected. It might even be possible to tackle this with a machine learning approach using a slightly different data setup, one that mimics the sliding window mentioned above.
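For clarity, the sliding-window inference described above boils down to a buffer of the 30 most recent frames; the sketch below uses a deque for this, with the frame source and model call left as placeholders.

```python
from collections import deque
import numpy as np

WINDOW = 30  # frames fed to the model
FPS = 15     # sampling rate used at inference time

frame_buffer = deque(maxlen=WINDOW)  # always holds the 30 most recent frames

def on_new_frame(frame, model):
    """Push the latest frame and run the model once the window is full."""
    frame_buffer.append(frame)
    if len(frame_buffer) < WINDOW:
        return None  # not enough history yet
    clip = np.stack(frame_buffer)           # (30, H, W, 3)
    return model.predict(clip[np.newaxis])  # placeholder model call, class probabilities
```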



POWER CONSIDERATIONS


In order to power both the Jetson and the Raspberry Pi, a much more powerful external power bank had to be bought; we eventually settled on one capable of charging laptops. However, when the Jetson started to run the machine learning scripts, it simply shut down, suggesting that it still might not be getting enough power. We therefore measured the actual current the Jetson was drawing by observing the power drawn from a power supply unit. Interestingly, the current never exceeded 2.5 A, and since the power bank can supply up to 3 A, a lack of power was clearly not the problem.

The issue was in fact the surges in current. Even though the current drawn by the Jetson always stayed below 2.5 A, it would sometimes jump by around 1 A at once, for example from 1.2 A to 2.2 A. These sudden surges could be tripping a built-in safety mechanism in the power bank, cutting the supply and causing the Jetson to shut down.

The solution was to limit the amount of power the Jetson draws. In its default mode, the Jetson runs 4 CPU cores at a maximum clock speed of 1.479 GHz each, so one or both of these parameters (number of active cores and maximum clock speed) had to be reduced. Lowering them helped in terms of power, but made the entire system laggy due to the reduced computing power. The best compromise, found through trial and error, was to reduce the number of active cores to 3 and cap the clock speed at 0.8 GHz. At this setting, the power bank powers the Jetson smoothly with no issues. To keep the load on the Jetson down, we also decided to connect all other peripherals, including the speakers, to the Raspberry Pi, so that the Jetson only has to handle the machine learning inference and the video stream input.
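We do not show here exactly how we changed these settings, but on a stock Linux system such as the Nano’s this can be done through the CPU hotplug and cpufreq sysfs entries (Jetson boards also ship with the nvpmodel tool for predefined power modes). The sketch below is illustrative only: it must run as root, and the 0.8 GHz figure is assumed to correspond to an available frequency step on the device.

```python
# Rough sketch: take one CPU core offline and cap the clock of the remaining
# cores via the standard Linux hotplug/cpufreq sysfs interface. Run as root.
# The frequency value is an assumption; check scaling_available_frequencies.

def write_sysfs(path, value):
    with open(path, "w") as f:
        f.write(str(value))

# Disable the fourth core (cpu3), leaving 3 active cores.
write_sysfs("/sys/devices/system/cpu/cpu3/online", 0)

# Cap the maximum frequency (in kHz) on the cores that stay online.
MAX_FREQ_KHZ = 806400  # ~0.8 GHz, assumed to be an available step
for cpu in range(3):
    write_sysfs(
        "/sys/devices/system/cpu/cpu{}/cpufreq/scaling_max_freq".format(cpu),
        MAX_FREQ_KHZ,
    )
```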



OTHER SENSORS


Significant progress was made this week on the implementation of the buttons through the GPIO pins. Due to limited space on Waldo, the toy elephant, we decided to implement 4 buttons, one on each of Waldo’s 2 hands and 2 feet. The 4 phrases to be said are “I would like a coffee”, “I would like to go out”, “I would like to go on a bus” and “I would like to go home”. It is worth noting that these phrases can easily be customised according to the user’s needs. The phrases are spoken through IBM Watson’s text-to-speech API.
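As a rough illustration (not our exact wiring or code), one button-to-speech path could look like the following: the BCM pin numbers, credentials, service URL and voice are all placeholders, and each phrase is synthesised once with the IBM Watson SDK and cached as a WAV file that is played back on a button press.

```python
import signal
import subprocess
import RPi.GPIO as GPIO
from ibm_watson import TextToSpeechV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials, URL and pin numbers -- not the real configuration.
tts = TextToSpeechV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
tts.set_service_url("YOUR_SERVICE_URL")

PHRASES = {
    17: "I would like a coffee",
    27: "I would like to go out",
    22: "I would like to go on a bus",
    23: "I would like to go home",
}

def synthesise(text, path):
    """Fetch the phrase from Watson once and cache it as a WAV file."""
    result = tts.synthesize(text, voice="en-GB_KateVoice", accept="audio/wav").get_result()
    with open(path, "wb") as f:
        f.write(result.content)

def on_press(pin):
    subprocess.call(["aplay", "phrase_{}.wav".format(pin)])  # play the cached audio

GPIO.setmode(GPIO.BCM)
for pin, phrase in PHRASES.items():
    synthesise(phrase, "phrase_{}.wav".format(pin))
    GPIO.setup(pin, GPIO.IN, pull_up_down=GPIO.PUD_UP)  # button pulls the pin low
    GPIO.add_event_detect(pin, GPIO.FALLING, callback=on_press, bouncetime=300)

signal.pause()  # keep the script alive waiting for button presses
```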


The GPIO pins are also used for communication between the Raspberry Pi and the Jetson. Once the Jetson detects a Makaton action, it sends an encoded binary value over the GPIO pins, which the Raspberry Pi decodes before playing the corresponding sound of either ‘Dinner’, ‘Good’, ‘Home’, ‘No’ or ‘Sorry’ through the speakers.
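We do not spell out the encoding here, but with five gestures a 3-bit code over three GPIO lines is sufficient. Below is a sketch of the receiving (Raspberry Pi) side with assumed pin numbers and code table; the Jetson side mirrors it using the Jetson.GPIO library, which follows the same API.

```python
import RPi.GPIO as GPIO

DATA_PINS = [5, 6, 13]  # assumed BCM pins carrying a 3-bit gesture code
GESTURES = {1: "Dinner", 2: "Good", 3: "Home", 4: "No", 5: "Sorry"}  # 0 = idle (assumed table)

GPIO.setmode(GPIO.BCM)
for pin in DATA_PINS:
    GPIO.setup(pin, GPIO.IN, pull_up_down=GPIO.PUD_DOWN)

def read_gesture():
    """Decode the 3-bit value currently driven by the Jetson."""
    code = 0
    for bit, pin in enumerate(DATA_PINS):
        code |= GPIO.input(pin) << bit
    return GESTURES.get(code)  # None while no gesture is being signalled
```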


Essentially, the ultrasonic sensor is used as a proximity sensor: it measures the distance of an object or person from the sensor and indicates how far the user should stand from Waldo for it to accurately detect and identify the Makaton hand signs. This threshold is currently set at 75 cm, meaning the user must stand at least 75 cm away from Waldo. For example, if the user stands at 60 cm as measured by the ultrasonic sensor, Waldo will utter the phrase “Please stand further away from me”. There is also a lower threshold at 45 cm below which Waldo stops uttering the phrase. This is to accommodate the user moving towards Waldo to press the buttons: when the user stands less than 45 cm from Waldo, it recognises that the user simply wants to play with or press the buttons and will not utter any phrases.
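Assuming an HC-SR04-style trigger/echo sensor (the exact module is not named above), the distance measurement and the two thresholds can be sketched as follows; the pin numbers are placeholders and the conversion uses the usual ~343 m/s speed of sound.

```python
import time
import RPi.GPIO as GPIO

TRIG, ECHO = 20, 21        # assumed BCM pins
FAR_CM, NEAR_CM = 75, 45   # signing distance and button-press allowance

GPIO.setmode(GPIO.BCM)
GPIO.setup(TRIG, GPIO.OUT)
GPIO.setup(ECHO, GPIO.IN)

def distance_cm():
    """One trigger/echo measurement: 10 us pulse, then time the echo."""
    GPIO.output(TRIG, True)
    time.sleep(10e-6)
    GPIO.output(TRIG, False)
    start = end = time.time()
    while GPIO.input(ECHO) == 0:
        start = time.time()
    while GPIO.input(ECHO) == 1:
        end = time.time()
    return (end - start) * 34300 / 2  # sound travels there and back at ~343 m/s

def should_warn(d):
    """Warn only in the 45-75 cm band; closer than 45 cm means button use."""
    return NEAR_CM <= d < FAR_CM
```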



FITTING IT ALL IN


When trying to fit everything into Waldo, it was important to have a good spatial awareness of the toy. Here is the breakdown:

  • The Jetson, Raspberry Pi and Battery Pack take up the bulk of the space in the body of Waldo

  • Each of the 4 buttons is placed on one of Waldo’s 2 arms and 2 feet

  • The ultrasonic sensor is placed at Waldo’s ‘belly-button’


This can all be seen in the image below. Presenting the ‘new and smart’ Waldo:



Finally, here we have a first demo of the main functions, which was very encouraging considering the effort and all the long hours put in!


