KEEP

Kalman-Inspired FEaturE Propagation for Video Face Super-Resolution

S-Lab, Nanyang Technological University
ECCV 2024

    Top: Results on complex simulated degradations.
    Bottom: Results on real-world videos.

Abstract

Despite the promising progress of face image super-resolution, video face super-resolution remains relatively under-explored. Existing approaches either adapt general video super-resolution networks to face datasets or apply established face image super-resolution models independently to individual video frames. These paradigms struggle either to reconstruct facial details or to maintain temporal consistency. To address these issues, we introduce a novel framework called Kalman-inspired Feature Propagation (KEEP), designed to maintain a stable face prior over time. Kalman filtering principles give our method a recurrent ability to use information from previously restored frames to guide and regulate the restoration of the current frame. Extensive experiments demonstrate the effectiveness of our method in capturing facial details consistently across video frames.

Method

(a) The state-space model defines the underlying dynamic system, where f describes how the latent states transition over time, g is a generative model, and h models the degradation from a clean frame to a degraded frame. (b) Block diagram of the Kalman filter model. At each time step, the predictive state from the previous frame (blue dashed box) and the newly observed state of the current frame (red dashed box) are fused by the Kalman gain from the Kalman Gain Network (KGN) to produce a more accurate estimate. The combined state is then used to generate the estimated clean frame.
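
In standard Kalman-filter notation, the caption above corresponds to the following state-space model and fusion rule. This is a generic sketch; the exact parameterization of f, g, and h and the form of the learned gain K_t are defined in the paper and may differ:

    % State-space model: latent dynamics, generation, and degradation
    z_t = f(z_{t-1}), \qquad x_t = g(z_t), \qquad y_t = h(x_t)

    % Kalman-style fusion: the predictive state \hat{z}_{t|t-1} = f(\hat{z}_{t-1})
    % is corrected toward the observed state \tilde{z}_t by a learned gain K_t
    \hat{z}_t = \hat{z}_{t|t-1} + K_t \, (\tilde{z}_t - \hat{z}_{t|t-1})

Here \hat{z}_{t|t-1} is the predictive state propagated from the previous frame, \tilde{z}_t is the observed state encoded from the degraded frame y_t, and K_t is produced by the KGN.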

The proposed KEEP consists of four modules: an encoder, a decoder, a Kalman filter network, and Cross-Frame Attention (CFA) layers. The encoder and decoder form a VQGAN generative model. The Kalman filter network incorporates Kalman filtering principles to propagate temporal information and maintain stable latent code priors. In particular, the filter recursively fuses the observed state of the current frame with the predictive state from the previous frame to form a more accurate posterior estimate of the current state. In addition, cross-frame attention (CFA) layers in the decoder further promote local temporal consistency and regularize the information propagation; a minimal sketch of the recurrent loop follows below. Together, these designs enable evidence accumulation and enhance temporal consistency for video face super-resolution.
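
As a rough illustration of this recurrent loop, the sketch below traces the per-frame filtering step in PyTorch. All module names and the elementwise sigmoid-gated form of the Kalman gain are illustrative assumptions rather than the paper's exact implementation, and cross-frame attention inside the decoder is omitted:

    import torch
    import torch.nn as nn

    class KEEPSketch(nn.Module):
        """Minimal sketch of a Kalman-inspired recurrent restorer (assumed API)."""

        def __init__(self, encoder, decoder, transition, gain_net):
            super().__init__()
            self.encoder = encoder        # degraded frame -> observed latent state
            self.decoder = decoder        # fused latent state -> restored frame
            self.transition = transition  # previous posterior -> predictive state
            self.gain_net = gain_net      # predicts the per-element Kalman gain

        def forward(self, frames):
            restored, z_post = [], None
            for y_t in frames:                 # y_t: degraded frame at time t
                z_obs = self.encoder(y_t)      # observed state (red dashed box)
                if z_post is None:
                    z_post = z_obs             # first frame: no prediction yet
                else:
                    z_pred = self.transition(z_post)  # predictive state (blue dashed box)
                    gain = torch.sigmoid(self.gain_net(torch.cat([z_pred, z_obs], dim=1)))
                    z_post = z_pred + gain * (z_obs - z_pred)  # Kalman-style fusion
                restored.append(self.decoder(z_post))          # estimated clean frame
            return restored

Fusing in latent space rather than pixel space lets the VQGAN decoder translate the stabilized code into a temporally coherent restored face.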

Video

BibTeX

@InProceedings{feng2024keep,
  title     = {Kalman-Inspired FEaturE Propagation for Video Face Super-Resolution},
  author    = {Feng, Ruicheng and Li, Chongyi and Loy, Chen Change},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024}
}