Local Spatiotemporal Representation Learning for Longitudinally-consistent Neuroimage Analysis
NeurIPS 2022


Recent self-supervised advances in medical computer vision exploit global and local anatomical self-similarity for pretraining prior to downstream tasks such as segmentation. However, current methods assume i.i.d. acquisition, which is invalid in clinical study designs where follow-up longitudinal scans track subject-specific temporal changes. In this work, (1) we present a local and multi-scale spatiotemporal representation learning method for image-to-image architectures trained on longitudinal images, and (2) we propose a surprisingly simple self-supervised segmentation consistency regularization to exploit intra-subject correlation. Benchmarked in the one/few/full-shot segmentation setting, the proposed framework outperforms existing baselines on both longitudinal neurodegenerative adult MRI and developing infant brain MRI, yielding both higher performance and longitudinal consistency.


Given nonlinearly-registered temporal images of a subject, (a) we assume that corresponding spatial locations in various network layers should have similar representations. As U-Net skip connections can cause degenerate decoder embeddings, we (b) encourage the decoder bottleneck to be orthogonal to the encoder bottleneck and regularize the concatenated decoder features to (c) have high spatial variance and (d) be uncorrelated channel-wise. During fine-tuning, we (e) encourage temporal intra-subject prediction consistency.
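To make regularizers (b)-(d) concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the hinge-on-standard-deviation and off-diagonal-covariance penalties are written in the style of common variance/covariance regularizers, the squared-cosine orthogonality term is an assumption, and all function names and the `gamma`/`eps` parameters are hypothetical.

```python
import numpy as np

def variance_loss(feats, gamma=1.0, eps=1e-4):
    """(c) Penalize channels whose std over spatial locations falls below gamma.
    feats: (N, C) array of N spatial locations with C channels (assumed layout)."""
    std = np.sqrt(feats.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))

def covariance_loss(feats):
    """(d) Penalize squared off-diagonal entries of the channel covariance matrix."""
    n, c = feats.shape
    centered = feats - feats.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float((off_diag ** 2).sum() / c)

def orthogonality_loss(enc, dec, eps=1e-8):
    """(b) Penalize squared cosine similarity between encoder and decoder
    features at matching locations. enc, dec: (N, C) arrays (assumed layout)."""
    enc_n = enc / (np.linalg.norm(enc, axis=1, keepdims=True) + eps)
    dec_n = dec / (np.linalg.norm(dec, axis=1, keepdims=True) + eps)
    cos = (enc_n * dec_n).sum(axis=1)
    return float(np.mean(cos ** 2))

# Example: random features standing in for flattened decoder activations.
rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 16))
total = variance_loss(feats) + covariance_loss(feats)
```

In this sketch, a collapsed (constant) feature map maximizes the variance penalty, and duplicated channels raise the covariance penalty, which is the kind of degeneracy the regularization is meant to discourage.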


Learned self-similarity

On pretraining an image-to-image network with per-layer spatiotemporal self-supervision, we visualize the intra-subject multi-scale feature similarity between a query (red star) channel-wise feature and all spatial positions within the key feature at a different age.
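A similarity map of this kind can be sketched as follows. The use of cosine similarity, the `(C, *spatial)` feature layout, and the function name `similarity_map` are assumptions made for illustration; the original visualization may be computed differently.

```python
import numpy as np

def similarity_map(query_feat, key_feat, query_pos, eps=1e-8):
    """Cosine similarity between one query feature vector and every spatial
    position of a key feature map.

    query_feat, key_feat: arrays of shape (C, *spatial) taken from the same
    network layer at two registered timepoints (assumed layout).
    query_pos: tuple of spatial indices of the query location.
    """
    c = query_feat.shape[0]
    # Channel-wise feature vector at the query location, L2-normalized.
    q = query_feat[(slice(None),) + tuple(query_pos)]
    q = q / (np.linalg.norm(q) + eps)
    # Flatten key positions and L2-normalize each position's feature vector.
    k = key_feat.reshape(c, -1)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + eps)
    return (q @ k).reshape(key_feat.shape[1:])

# Example with random 2D feature maps standing in for real activations.
rng = np.random.default_rng(0)
feat_t1 = rng.normal(size=(8, 5, 6))
feat_t2 = rng.normal(size=(8, 5, 6))
sim = similarity_map(feat_t1, feat_t2, (2, 3))  # shape (5, 6)
```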


Pretraining without regularization (b)-(d) above leads to semantically implausible representations with low diversity and artifacts. Our method attains both positionally and anatomically relevant representations via proper regularization. Please refer to Suppl. E in our paper for more analysis on the need for regularization.

One-shot segmentation

Once pretrained on all unlabeled data, all benchmarked methods are fine-tuned on either a single annotated image (IBIS-wmgm) or a single annotated subject (IBIS-subcort and OASIS3). When deployed on other subjects at different ages, our method yields improved segmentation performance.


When fine-tuned on a single 36-month-old image, our method generalizes to unseen timepoints by leveraging temporal anatomical consistency, and yields much higher consistency between temporal segmentations.
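One way to express intra-subject prediction consistency as a training signal is sketched below. It assumes the two timepoints have already been registered to a common space, and it uses a soft Dice agreement term between softmax outputs; that choice, along with the names `softmax` and `consistency_loss`, is an assumption and may differ from the loss used in the paper.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(logits_t1, logits_t2, eps=1e-6):
    """1 - mean soft Dice between class probabilities of two registered
    timepoints. logits_*: (K, *spatial) arrays with K classes (assumed layout)."""
    p1, p2 = softmax(logits_t1), softmax(logits_t2)
    spatial_axes = tuple(range(1, p1.ndim))
    inter = (p1 * p2).sum(axis=spatial_axes)
    denom = (p1 ** 2).sum(axis=spatial_axes) + (p2 ** 2).sum(axis=spatial_axes)
    dice = (2.0 * inter + eps) / (denom + eps)
    return float(1.0 - dice.mean())

# Example: unlabeled pair of timepoints from one subject (random stand-ins).
rng = np.random.default_rng(0)
logits_a = rng.normal(size=(3, 8, 8))
logits_b = logits_a + 0.1 * rng.normal(size=(3, 8, 8))
loss = consistency_loss(logits_a, logits_b)
```

Because the term needs no labels, it can in principle be applied to unannotated timepoints alongside a supervised loss on the annotated image.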


On one-shot segmentation benchmarking, our method shows better quantitative performance in terms of the Dice coefficient (top) and the spatiotemporal consistency of segmentation (bottom). For few-shot and fully-supervised results, please refer to Suppl. Tables 3 and 4, respectively.
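For reference, the Dice coefficient between two label maps A and B for a given label is 2|A ∩ B| / (|A| + |B|). The sketch below computes it per label; the function name `dice_per_label` is hypothetical. Applying the same function to predictions at two registered timepoints of one subject gives a simple consistency measure, though whether this matches the exact consistency metric reported in the paper is an assumption.

```python
import numpy as np

def dice_per_label(seg_a, seg_b, labels):
    """Dice coefficient for each label between two integer label maps of
    identical shape. Returns NaN for labels absent from both maps."""
    scores = {}
    for lab in labels:
        a = seg_a == lab
        b = seg_b == lab
        denom = a.sum() + b.sum()
        if denom == 0:
            scores[lab] = float("nan")
        else:
            scores[lab] = float(2.0 * np.logical_and(a, b).sum() / denom)
    return scores

# Example: prediction vs. ground truth on a tiny 2x2 label map.
pred = np.array([[0, 1], [1, 1]])
truth = np.array([[0, 1], [1, 0]])
scores = dice_per_label(pred, truth, labels=[0, 1])
```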



    title={Local Spatiotemporal Representation Learning for Longitudinally-consistent Neuroimage Analysis}, 
    author={Mengwei Ren and Neel Dey and Martin A. Styner and Kelly Botteron and Guido Gerig},
    url = {https://arxiv.org/abs/2206.04281},


Work supported by NIH R01-HD055741-12, R01-MH118362-02S1, 1R01MH118362-01, 1R01HD088125-01A1, R01MH122447, R01-HD059854, U54-HD079124, P50-HD103573, R01ES032294, and the NYS Center for Advanced Technology in Telecommunications (CATT).

The website template was borrowed from Michaël Gharbi.