Language-Assisted Deep Learning for Autistic Behaviors Recognition

Smart Health 2023


Andong Deng1, Taojiannan Yang1, Chen Chen1, Qian Chen2, Leslie Neely2, Sakiko Oyama2

1University of Central Florida    2University of Texas at San Antonio   

Abstract



Correctly recognizing the behaviors of children with Autism Spectrum Disorder (ASD) is of vital importance for the diagnosis of autism and timely early intervention. However, observations recorded by the parents of autistic children during treatment may not be accurate or objective. In such cases, automatic recognition systems based on computer vision and machine learning (in particular deep learning) can alleviate this issue to a large extent. Existing human action recognition models can now achieve impressive performance on challenging activity datasets, e.g., daily activities and sports. However, problem behaviors in children with ASD are very different from these general activities, and recognizing these problem behaviors via computer vision is less studied. In this paper, we first evaluate a strong baseline for action recognition, i.e., Video Swin Transformer, on two autism behaviors datasets (SSBD and ESBD) and show that it achieves high accuracy and outperforms previous methods by a large margin, demonstrating the feasibility of vision-based problem behavior recognition. Moreover, we propose language-assisted training to further enhance action recognition performance. Specifically, we develop a two-branch multimodal deep learning framework by incorporating the "freely available" language description for each type of problem behavior. Experimental results demonstrate that incorporating additional language supervision brings an obvious performance boost for the autism problem behavior recognition task compared to using video information only (i.e., 3.49% improvement on ESBD and 1.46% on SSBD). Our code and model will be publicly available for reproducing the results.


Datasets and Pre-processing


There are two popular autism behavior datasets: the Self-Stimulatory Behavior Dataset (SSBD) and the Expanded Stereotype Behavior Dataset (ESBD). SSBD contains 75 videos of self-stimulatory (stimming) actions of children with autism spectrum disorder, with an average duration of 90 seconds per video, and covers three typical action classes: Arm Flapping, Head Banging, and Spinning. The second dataset, ESBD, contains 99 YouTube videos with an average duration of about 2 minutes. Besides the three categories in SSBD, ESBD includes a fourth category, Hand Action; it provides 35 videos of Arm Flapping, 13 of Hand Action, 24 of Head Banging, and 37 of Spinning.

[Figure: sample frames from the SSBD and ESBD datasets]

These two datasets are noisy and contain a large portion of background or other subjects. To enhance recognition accuracy, we first preprocess the videos: specifically, we detect and crop the target child with YOLOv5 to obtain cleaner data that contain only the child performing the ASD behavior. As presented below, the cropped frames are much more focused on the child region than the raw data, which leads to a more stable training process and better action recognition results. In addition, we segment each video into several 30-frame clips to extract as much information as possible; this segmentation also increases the number of training samples.

[Figure: YOLOv5-based detection and cropping of the target child]
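A minimal sketch of this preprocessing step is given below, assuming the public `ultralytics/yolov5` model loaded through `torch.hub` and OpenCV for frame I/O; the confidence threshold, the largest-box heuristic for selecting the target child, and the non-overlapping clip split are illustrative assumptions rather than the exact settings used in the paper.

```python
# Hedged sketch of the detect-crop-segment preprocessing; the YOLOv5 hub model,
# the confidence threshold and the largest-box heuristic are illustrative.
import cv2
import torch

detector = torch.hub.load("ultralytics/yolov5", "yolov5s")  # COCO-pretrained detector
PERSON = 0  # "person" class id in COCO

def crop_and_segment(video_path, clip_len=30, conf_thres=0.5):
    """Crop every frame to the detected child and split the result
    into non-overlapping 30-frame clips."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        det = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        boxes = det.xyxy[0]  # (N, 6): x1, y1, x2, y2, confidence, class
        persons = boxes[(boxes[:, 5] == PERSON) & (boxes[:, 4] > conf_thres)]
        if len(persons) == 0:
            continue  # skip frames where no person is detected
        # keep the largest box, assumed to be the target child
        areas = (persons[:, 2] - persons[:, 0]) * (persons[:, 3] - persons[:, 1])
        x1, y1, x2, y2 = persons[areas.argmax(), :4].int().tolist()
        frames.append(frame[y1:y2, x1:x2])
    cap.release()
    # non-overlapping 30-frame clips also enlarge the training set
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]
```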


Method



We propose to use detailed action descriptions as the textual input. Specifically, we first pre-define textual descriptions for the problem behavior classes by searching the web and relevant publications, as listed below.

[Figure: pre-defined textual descriptions for each problem behavior class]
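For illustration, the class-to-description pairing can be kept as a simple lookup table. The sentences below are paraphrased placeholders written for this sketch, not the exact descriptions collected for the paper.

```python
# Illustrative placeholder descriptions (not the exact prompts from the paper);
# the real descriptions were collected from the web and clinical publications.
BEHAVIOR_DESCRIPTIONS = {
    "arm_flapping": "a child repeatedly moves both arms up and down quickly",
    "head_banging": "a child repeatedly hits their head against a surface or with their hands",
    "spinning":     "a child spins their body around in circles over and over",
    "hand_action":  "a child performs repetitive hand movements such as wringing or finger flicking",
}

def description_for(label: str) -> str:
    """Return the textual description paired with a video of the given class."""
    return BEHAVIOR_DESCRIPTIONS[label]
```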

The overall framework is shown below. Compared with the visual-only model, we add the CLIP text encoder as our text encoder to process the language input (VST+L). The original CLIP model consists of an image Transformer encoder and a text Transformer encoder and is trained on a large-scale dataset of image-text pairs. The supervision contained in natural language provides strong cross-modal representation ability for both CLIP encoders. In our work, in order to introduce additional supervision without requiring extra annotations, we utilize the abundant image-text knowledge in the CLIP text encoder. As shown below, since the language feature from the text encoder already contains sufficient corresponding visual information, we propose to minimize the distance between the visual feature from the Video Swin Transformer and the language feature in the feature space for each input video-text pair. With this strategy, the additional language knowledge is distilled into the visual branch, introducing more information for the recognition task and making the language feature more predictable.

[Figure: overall framework of the proposed two-branch VST+L model]
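The training objective can be sketched as follows, assuming OpenAI's `clip` package for the frozen text encoder and a Video Swin backbone that returns a pooled clip-level feature; the feature dimensions, the linear projection into the text space, the MSE distance, and the loss weight `alpha` are illustrative assumptions, since the description above only specifies that the distance between the visual and language features is minimized alongside the standard classification objective.

```python
# Minimal sketch of language-assisted training: a frozen CLIP text encoder
# provides an extra feature-distance target next to the usual cross-entropy.
# Feature dimensions, the MSE distance and the weight `alpha` are assumptions.
import clip
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAssistedRecognizer(nn.Module):
    def __init__(self, video_backbone, vis_dim=768, txt_dim=512, num_classes=4):
        super().__init__()
        self.backbone = video_backbone              # e.g. a Video Swin Transformer
        self.classifier = nn.Linear(vis_dim, num_classes)
        self.proj = nn.Linear(vis_dim, txt_dim)     # map visual features into the text space
        self.clip_model, _ = clip.load("ViT-B/32")  # only its text encoder is used
        for p in self.clip_model.parameters():      # the CLIP text encoder stays frozen
            p.requires_grad_(False)

    def forward(self, clips, descriptions, labels, alpha=1.0):
        vis_feat = self.backbone(clips)             # (B, vis_dim) pooled clip features
        logits = self.classifier(vis_feat)
        with torch.no_grad():
            tokens = clip.tokenize(descriptions).to(clips.device)
            txt_feat = self.clip_model.encode_text(tokens).float()  # (B, txt_dim)
        loss_cls = F.cross_entropy(logits, labels)
        # pull the projected visual feature toward the language feature
        loss_lang = F.mse_loss(F.normalize(self.proj(vis_feat), dim=-1),
                               F.normalize(txt_feat, dim=-1))
        return logits, loss_cls + alpha * loss_lang
```

Since the language knowledge is distilled into the visual branch during training, only the visual branch is needed at inference time.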


Performance



As shown below, our method surpasses previous frameworks by a large margin on both datasets.

[Figure: performance comparison with previous methods on ESBD and SSBD]


Visualization



We visualize the attention maps from the last attention layer of VST and VST+L. For each case, the first row shows the attention map without language supervision (VST) and the second row shows the attention map with language supervision (VST+L). Pixels with high attention values indicate the regions most relevant to the final prediction. It is evident that the additional language supervision makes the model focus more on the regions relevant to the target action.

[Figure: attention map visualization for VST and VST+L]
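One generic way to produce this kind of heatmap is a Grad-CAM-style visualization over the backbone's last feature map. The sketch below is not necessarily the exact attention-extraction procedure used for the figure; `feature_layer` is a hypothetical handle to the backbone's final stage, assumed to output a 5-D feature map.

```python
# Generic Grad-CAM-style sketch for locating the regions that drive the
# prediction; `feature_layer` is a hypothetical handle to the backbone's last
# stage, assumed to output a feature map of shape (1, C', T', H', W').
import torch
import torch.nn.functional as F

def gradcam_heatmap(model, feature_layer, video_clip, target_class):
    feats, grads = [], []
    fh = feature_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    bh = feature_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    logits = model(video_clip)                 # video_clip: (1, C, T, H, W)
    model.zero_grad()
    logits[0, target_class].backward()
    fh.remove(); bh.remove()
    fmap, grad = feats[0], grads[0]
    weights = grad.mean(dim=(2, 3, 4), keepdim=True)   # per-channel importance
    cam = F.relu((weights * fmap).sum(dim=1))          # (1, T', H', W')
    cam = cam / (cam.max() + 1e-8)
    # upsample to the input resolution so the map can be overlaid on the frames
    return F.interpolate(cam.unsqueeze(1), size=video_clip.shape[2:],
                         mode="trilinear", align_corners=False).squeeze(1)
```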


Citation


@article{DENG2023100444,
  title   = {Language-assisted deep learning for autistic behaviors recognition},
  author  = {Andong Deng and Taojiannan Yang and Chen Chen and Qian Chen and Leslie Neely and Sakiko Oyama},
  journal = {Smart Health},
  pages   = {100444},
  year    = {2023},
  issn    = {2352-6483},
  doi     = {10.1016/j.smhl.2023.100444},
  url     = {https://www.sciencedirect.com/science/article/pii/S2352648323000727},
}