Publications

"*" indicates corresponding author and "#" indicates co-first author.

Journal Publications (Google Scholar Profile)

  1. M2Trans: Multi-Modal Regularized Coarse-to-Fine Transformer for Ultrasound Image Super-Resolution
    Zhangkai Ni, Runyu Xiao, Wenhan Yang, Hanli Wang, Zhihua Wang, Lihua Xiang, and Liping Sun.
    IEEE Journal of Biomedical and Health Informatics (J-BHI), Early Access, September 2024, DOI: 10.1109/JBHI.2024.3454068.
    Abstract | Paper | Code | BibTex Abstract: Ultrasound image super-resolution (SR) aims to transform low-resolution images into high-resolution ones, thereby restoring intricate details crucial for improved diagnostic accuracy. However, prevailing methods relying solely on image modality guidance and pixel-wise loss functions struggle to capture the distinct characteristics of medical images, such as unique texture patterns and specific colors harboring critical diagnostic information. To overcome these challenges, this paper introduces the Multi-Modal Regularized Coarse-to-fine Transformer (M2Trans) for Ultrasound Image SR. By integrating the text modality, we establish joint image-text guidance during training, leveraging the medical CLIP model to incorporate richer priors from text descriptions into the SR optimization process, enhancing detail, structure, and semantic recovery. Furthermore, we propose a novel coarse-to-fine transformer comprising multiple branches infused with self-attention and frequency transforms to efficiently capture signal dependencies across different scales. Extensive experimental results demonstrate significant improvements over state-of-the-art methods on benchmark datasets, including CCA-US, US-CASE, and our newly created dataset MMUS1K, with a minimum improvement of 0.17dB, 0.30dB, and 0.28dB in terms of PSNR. Our code and dataset will be available at: https://github.com/eezkni/M2Trans
    @article{ni2024m2trans, 
    	title={M2Trans: Multi-Modal Regularized Coarse-to-Fine Transformer for Ultrasound Image Super-Resolution}, 
    	author={Ni, Zhangkai and Xiao, Runyu and Yang, Wenhan and Wang, Hanli and Wang, Zhihua and Xiang, Lihua and Sun, Liping}, 
    	journal={IEEE Journal of Biomedical and Health Informatics}, 
    	year={2024}, 
    	publisher={IEEE} }
    
  2. Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics
    Zhangkai Ni, Yue Liu, Keyan Ding, Wenhan Yang, Hanli Wang, and Shiqi Wang.
    IEEE Transactions on Multimedia (T-MM), Early Access, May 2024, DOI: 10.1109/TMM.2024.3405729.
    Abstract | Paper | Code | BibTex Abstract: Deep learning-based methods have significantly influenced the blind image quality assessment (BIQA) field, however, these methods often require training using large amounts of human rating data. In contrast, traditional knowledge-based methods are cost-effective for training but face challenges in effectively extracting features aligned with human visual perception. To bridge these gaps, we propose integrating deep features from pre-trained visual models with a statistical analysis model into a Multi-scale Deep Feature Statistics (MDFS) model for achieving opinion-unaware BIQA (OU-BIQA), thereby eliminating the reliance on human rating data and significantly improving training efficiency. Specifically, we extract patch-wise multi-scale features from pre-trained vision models, which are subsequently fitted into a multivariate Gaussian (MVG) model. The final quality score is determined by quantifying the distance between the MVG model derived from the test image and the benchmark MVG model derived from the high-quality image set. A comprehensive series of experiments conducted on various datasets show that our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models. Furthermore, it shows improved generalizability across diverse target-specific BIQA tasks. Our code is available at: https://github.com/eezkni/MDFS
    @article{ni2024opinion,
    	title={Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics},
    	author={Ni, Zhangkai and Liu, Yue and Ding, Keyan and Yang, Wenhan and Wang, Hanli and Wang, Shiqi},
    	journal={IEEE Transactions on Multimedia},
    	year={2024},
    	publisher={IEEE}
    }
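    The quality-scoring step described in the MDFS abstract above reduces to fitting a multivariate Gaussian (MVG) to patch-wise features and measuring how far the test-image MVG sits from the benchmark MVG. Below is a minimal NumPy sketch of that idea; it assumes the multi-scale deep features have already been extracted into an (N_patches, D) array, and the NIQE-style Gaussian distance used here is an illustrative choice rather than the paper's exact metric.
    import numpy as np

    def fit_mvg(features):
        """Fit a multivariate Gaussian to an (N_patches, D) feature matrix."""
        mu = features.mean(axis=0)
        sigma = np.cov(features, rowvar=False)
        return mu, sigma

    def mvg_distance(mu_test, sigma_test, mu_ref, sigma_ref):
        """NIQE-style distance between two Gaussians; a larger value predicts lower quality."""
        diff = mu_test - mu_ref
        pooled = (sigma_test + sigma_ref) / 2.0
        return float(np.sqrt(diff @ np.linalg.pinv(pooled) @ diff))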
  3. A Dynamic Evolution Model for Decentralized Autonomous Car Clusters in a Highway Scene
    Jiujun Cheng, Huiyu Sun, Zhangkai Ni*, Aiguo Zhou, and Dongjie Ye.
    IEEE Transactions on Computational Social Systems (T-CSS), vol. 11, no. 3, pp. 3792-3802, June 2024.
    Abstract | Paper | Code | BibTex Abstract: Cluster evolution is a challenging problem for vehicular ad hoc network (VANET) in a highway scene with fast moving autonomous vehicles and frequent cluster topology changes. Most of the existing studies analyze the cluster evolution behavior of cluster heads (CHs), and these approaches lead to frequent changes in vehicle structure when CHs change, which easily makes the cluster unstable. In this work, we propose a decentralized autonomous car cluster dynamic evolution model. First, we define a decentralized cluster structure. Then, we analyze the cluster evolution behavior and propose a maintenance method. Next, we define eight vehicle states and their transitions. Finally, we introduce the cluster dynamic evolution model and the collaboration model. The results of extensive simulation experiments show that our method can effectively maintain the consistency of cluster consensus and improve the stability of the cluster structure compared with the centralized cluster maintenance method.
    @article{cheng2023dynamic,
    	title={A Dynamic Evolution Model for Decentralized Autonomous Car Clusters in a Highway Scene},
    	author={Cheng, Jiujun and Sun, Huiyu and Ni, Zhangkai and Zhou, Aiguo and Ye, Dongjie},
    	journal={IEEE Transactions on Computational Social Systems},
    	volume={11},
    	number={3},
    	pages={3792--3802},
    	year={2024},
    	publisher={IEEE}
    }
  4. Glow in the Dark: Low-Light Image Enhancement with External Memory
    Dongjie Ye, Zhangkai Ni, Wenhan Yang, Hanli Wang, Shiqi Wang, and Sam Kwong.
    IEEE Transactions on Multimedia (T-MM), vol. 26, pp. 2148-2163, July 2023.
    Abstract | Paper | Code | BibTex Abstract: Deep learning-based methods have achieved remarkable success with powerful modeling capabilities. However, the weights of these models are learned over the entire training dataset, which inevitably leads to the ignorance of sample specific properties in the learned enhancement mapping. This situation causes ineffective enhancement in the testing phase for the samples that differ significantly from the training distribution. In this paper, we introduce external memory to form an external memory-augmented network (EMNet) for low-light image enhancement. The external memory aims to capture the sample specific properties of the training dataset to guide the enhancement in the testing phase. Benefiting from the learned memory, more complex distributions of reference images in the entire dataset can be “remembered” to facilitate the adjustment of the testing samples more adaptively. To further augment the capacity of the model, we take the transformer as our baseline network, which specializes in capturing long-range spatial redundancy. Experimental results demonstrate that our proposed method has a promising performance and outperforms state-of-the-art methods. It is noted that the proposed external memory is a plug-and-play mechanism that can be integrated with any existing method to further improve the enhancement quality. More practices of integrating external memory with other image enhancement methods are qualitatively and quantitatively analyzed. The results further confirm the effectiveness of our proposed memory mechanism when combined with existing enhancement methods. Our code is available at: https://github.com/Lineves7/EMNet
    @article{ye2023glow,
    	title={Glow in the Dark: Low-Light Image Enhancement with External Memory},
    	author={Ye, Dongjie and Ni, Zhangkai and Yang, Wenhan and Wang, Hanli and Wang, Shiqi and Kwong, Sam},
    	journal={IEEE Transactions on Multimedia},
    	volume={26},
    	pages={2148--2163},
    	year={2023},
    	publisher={IEEE}
    }
  5. Neural Network Based Rate Control for Versatile Video Coding
    Yunhao Mao, Meng Wang, Zhangkai Ni, Shiqi Wang, and Sam Kwong.
    IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), vol. 33, no. 10, pp. 6072-6085, October 2023.
    Abstract | Paper | Code | BibTex Abstract: In this work, we propose a neural network based rate control algorithm for Versatile Video Coding (VVC). The proposed method relies on the modeling of the Rate-Quantization (R-Q) and Distortion-Quantization (D-Q) relationships in a data driven manner based upon the characteristics of prediction residuals. In particular, a pre-analysis framework is adopted, in an effort to obtain the prediction residuals which govern the Rate-Distortion (R-D) behaviors. By inferring from the prediction residuals with deep neural networks, the Coding Tree Unit (CTU) level R-Q and D-Q model parameters are derived, which could efficiently guide the optimal bit allocation. Subsequently, the coding parameters, including Quantization Parameter (QP) and λ, at both frame and CTU levels, are obtained according to allocated bit-rates. We implement the proposed rate control algorithm on VVC Test Model (VTM-13.0). Experimental results exhibit that the proposed rate control algorithm achieves 0.77% BD-Rate savings under Low Delay B (LDB) configurations when compared to the default rate control algorithm used in VTM-13.0. For Random Access (RA) configurations, 1.77% BD-Rate savings can be observed. Furthermore, with better bit-rate estimation, more stable buffer status can be observed, further demonstrating the advantages of the proposed rate control method.
    @article{mao2023neural,
    	title={Neural Network Based Rate Control for Versatile Video Coding},
    	author={Mao, Yunhao and Wang, Meng and Ni, Zhangkai and Wang, Shiqi and Kwong, Sam},
    	journal={IEEE Transactions on Circuits and Systems for Video Technology},
    	volume={33},
    	number={10},
    	pages={6072--6085},
    	year={2023},
    	publisher={IEEE}
    }
  6. A CTU-level Screen Content Rate Control for Low-delay Versatile Video Coding
    Yi Chen, Meng Wang, Shiqi Wang, Zhangkai Ni, and Sam Kwong.
    IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), vol. 33, no. 9, pp. 5227-5241, September 2023.
    Abstract | Paper | Code | BibTex Abstract: In this paper, a rate control scheme for screen content video coding is proposed for the Versatile Video Coding (VVC) standard. In view of the critical challenges arising from the spatial and temporal unnaturalness of screen content sequences, the proposed method relies on the specifically designed pre-analysis such that the content information regarding the scene complexity can be obtained. As such, the estimated residual complexity is then incorporated into the proposed complexity-aware rate models and distortion models, leading to the optimal bit allocations for each frame and coding tree unit (CTU). In particular, the optimization problem can be analytically solved with the proposed models, and the coding parameters such as Lagrangian multiplier λ and quantization parameter of each frame and CTU could be delicately derived according to the allocated bits through the proposed analytical models. Extensive experiments have been conducted to evaluate the effectiveness of the proposed method. Compared to the default hierarchical λ-domain rate control and other screen content rate control algorithms, the proposed method could achieve obvious RD performance gain, and the bit-rate accuracy could be improved.
    @article{chen2023ctu,
    	title={A CTU-level Screen Content Rate Control for Low-delay Versatile Video Coding},
    	author={Chen, Yi and Wang, Meng and Wang, Shiqi and Ni, Zhangkai and Kwong, Sam},
    	journal={IEEE Transactions on Circuits and Systems for Video Technology},
    	volume={33},
    	number={9},
    	pages={5227--5241},
    	year={2023},
    	publisher={IEEE}
    }
  7. High Dynamic Range Image Quality Assessment Based on Frequency Disparity
    Yue Liu, Zhangkai Ni, Shiqi Wang, Hanli Wang, and Sam Kwong.
    IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), vol. 33, no. 8, pp. 4435-4440, August 2023.
    Abstract | Paper | Code | BibTex Abstract: In this paper, a novel and effective image quality assessment (IQA) algorithm based on frequency disparity for high dynamic range (HDR) images is proposed, termed as local-global frequency feature-based model (LGFM). Motivated by the assumption that the human visual system (HVS) is highly adapted for extracting structural information and partial frequencies when perceiving the visual scene, the Gabor and the Butterworth filters are applied to the luminance component of the HDR image to extract the local and global frequency features, respectively. The similarity measurement and feature pooling strategy are sequentially performed on the frequency features to obtain the predicted single quality score. The experiments evaluated on four widely used benchmarks demonstrate that the proposed LGFM can provide a higher consistency with the subjective perception compared with the state-of-the-art HDR IQA methods. Our code is available at: https://github.com/eezkni/LGFM
    @article{liu2023high,
    	title={High Dynamic Range Image Quality Assessment Based on Frequency Disparity},
    	author={Liu, Yue and Ni, Zhangkai and Wang, Shiqi and Wang, Hanli and Kwong, Sam},
    	journal={IEEE Transactions on Circuits and Systems for Video Technology},
    	volume={33},
    	number={8},
    	pages={4435--4440},
    	year={2023},
    	publisher={IEEE}
    }
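    As a rough illustration of the global-frequency branch in the LGFM abstract above, the sketch below filters the luminance channel with a Butterworth low-pass filter in the Fourier domain and compares reference/distorted feature maps with an SSIM-style similarity. The cutoff, filter order, constant c, and mean pooling are assumptions for illustration only; the Gabor (local-frequency) branch and the paper's actual pooling strategy are omitted.
    import numpy as np

    def butterworth_lowpass(shape, cutoff=0.1, order=2):
        """Frequency-domain Butterworth low-pass transfer function H = 1 / (1 + (D/D0)^(2n))."""
        rows, cols = shape
        u = np.fft.fftfreq(rows)[:, None]
        v = np.fft.fftfreq(cols)[None, :]
        d = np.sqrt(u ** 2 + v ** 2)
        return 1.0 / (1.0 + (d / cutoff) ** (2 * order))

    def global_frequency_feature(luma, cutoff=0.1):
        """Apply the Butterworth filter to the luminance channel in the Fourier domain."""
        spectrum = np.fft.fft2(luma)
        return np.real(np.fft.ifft2(spectrum * butterworth_lowpass(luma.shape, cutoff)))

    def feature_similarity(f_ref, f_dist, c=1e-3):
        """Pointwise SSIM-style similarity between two feature maps, averaged into one score."""
        sim = (2 * f_ref * f_dist + c) / (f_ref ** 2 + f_dist ** 2 + c)
        return float(sim.mean())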
  8. CSformer: Bridging Convolution and Transformer for Compressive Sensing
    Dongjie Ye, Zhangkai Ni, Hanli Wang, Jian Zhang, Shiqi Wang, and Sam Kwong.
    IEEE Transactions on Image Processing (T-IP), vol. 32, pp. 2827-2842, May 2023.
    Abstract | Paper | Code | BibTex Abstract: Convolution neural networks (CNNs) have succeeded in compressive image sensing. However, due to the inductive bias of locality and weight sharing, the convolution operations demonstrate the intrinsic limitations in modeling the long-range dependency. Transformer, designed initially as a sequence-to-sequence model, excels at capturing global contexts due to the self-attention-based architectures even though it may be equipped with limited localization abilities. This paper proposes CSformer, a hybrid framework that integrates the advantages of leveraging both detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning. The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery. In the sampling module, images are measured block-by-block by the learned sampling matrix. In the reconstruction stage, the measurement is projected into dual stems. One is the CNN stem for modeling the neighborhood relationships by convolution, and the other is the transformer stem for adopting global self-attention mechanism. The dual branches structure is concurrent, and the local features and global representations are fused under different resolutions to maximize the complementary of features. Furthermore, we explore a progressive strategy and window-based transformer block to reduce the parameter and computational complexity. The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing, which achieves superior performance compared to state-of-the-art methods on different datasets.
    @article{ye2023csformer,
    	title={CSformer: Bridging convolution and transformer for compressive sensing},
    	author={Ye, Dongjie and Ni, Zhangkai and Wang, Hanli and Zhang, Jian and Wang, Shiqi and Kwong, Sam},
    	journal={IEEE Transactions on Image Processing},
    	volume={32},
    	pages={2827--2842},
    	year={2023},
    	publisher={IEEE}
    }
  9. Towards Unsupervised Deep Image Enhancement with Generative Adversarial Network
    Zhangkai Ni, Wenhan Yang, Shiqi Wang, Lin Ma, and Sam Kwong.
    IEEE Transactions on Image Processing (T-IP), vol. 29, pp. 9140-9151, September 2020.
    Abstract | Paper | Code | BibTex Abstract: Improving the aesthetic quality of images is challenging and eager for the public. To address this problem, most existing algorithms are based on supervised learning methods to learn an automatic photo enhancer for paired data, which consists of low-quality photos and corresponding expert-retouched versions. However, the style and characteristics of photos retouched by experts may not meet the needs or preferences of general users. In this paper, we present an unsupervised image enhancement generative adversarial network (UEGAN), which learns the corresponding image-to-image mapping from a set of images with desired characteristics in an unsupervised manner, rather than learning on a large number of paired images. The proposed model is based on single deep GAN which embeds the modulation and attention mechanisms to capture richer global and local features. Based on the proposed model, we introduce two losses to deal with the unsupervised image enhancement: (1) fidelity loss, which is defined as a ℓ2 regularization in the feature domain of a pre-trained VGG network to ensure the content between the enhanced image and the input image is the same, and (2) quality loss that is formulated as a relativistic hinge adversarial loss to endow the input image the desired characteristics. Both quantitative and qualitative results show that the proposed model effectively improves the aesthetic quality of images. Our code is available at: https://github.com/eezkni/UEGAN
    @article{ni2020towards,
    	title={Towards unsupervised deep image enhancement with generative adversarial network},
    	author={Ni, Zhangkai and Yang, Wenhan and Wang, Shiqi and Ma, Lin and Kwong, Sam},
    	journal={IEEE Transactions on Image Processing},
    	volume={29},
    	pages={9140--9151},
    	year={2020},
    	publisher={IEEE}
    }
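    The two losses named in the UEGAN abstract above are easy to state in code. The PyTorch sketch below expresses the fidelity loss as an l2 distance in the feature space of a pre-trained VGG network and the quality loss as a relativistic hinge adversarial loss for the generator; the VGG layer cut (features[:16]) and the averaging of the two hinge terms are illustrative assumptions, not necessarily the paper's exact configuration.
    import torch
    import torch.nn.functional as F
    from torchvision.models import vgg19

    # Frozen VGG feature extractor used by the fidelity loss (the layer choice is an assumption).
    vgg_features = vgg19(weights="IMAGENET1K_V1").features[:16].eval()
    for p in vgg_features.parameters():
        p.requires_grad_(False)

    def fidelity_loss(enhanced, source):
        """l2 regularization in the VGG feature domain to keep image content unchanged."""
        return F.mse_loss(vgg_features(enhanced), vgg_features(source))

    def relativistic_hinge_g_loss(d_real, d_fake):
        """Relativistic hinge adversarial loss for the generator (the quality loss)."""
        loss_real = torch.mean(F.relu(1.0 + (d_real - d_fake.mean())))
        loss_fake = torch.mean(F.relu(1.0 - (d_fake - d_real.mean())))
        return (loss_real + loss_fake) / 2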
  10. Color Image Demosaicing Using Progressive Collaborative Representation
    Zhangkai Ni, Kai-Kuang Ma, Huanqiang Zeng, and Baojiang Zhong.
    IEEE Transactions on Image Processing (T-IP), vol. 29, pp. 4952-4964, March 2020.
    Abstract | Paper | Code | BibTex Abstract: In this paper, a progressive collaborative representation (PCR) framework is proposed that is able to incorporate any existing color image demosaicing method for further boosting its demosaicing performance. Our PCR consists of two phases: (i) offline training and (ii) online refinement. In phase (i), multiple training-and-refining stages will be performed. In each stage, a new dictionary will be established through the learning of a large number of feature-patch pairs, extracted from the demosaicked images of the current stage and their corresponding original full-color images. After training, a projection matrix will be generated and exploited to refine the current demosaicked image. The updated image with improved image quality will be used as the input for the next training-and-refining stage and performed the same processing likewise. At the end of phase (i), all the projection matrices generated as above-mentioned will be exploited in phase (ii) to conduct online demosaicked image refinement of the test image. Extensive simulations conducted on two commonly-used test datasets (i.e., the IMAX and Kodak) for evaluating the demosaicing algorithms have clearly demonstrated that our proposed PCR framework is able to constantly boost the performance of any image demosaicing method we experimented, in terms of the objective and subjective performance evaluations.
    @article{ni2020color,
    	title={Color Image Demosaicing Using Progressive Collaborative Representation},
    	author={Ni, Zhangkai and Ma, Kai-Kuang and Zeng, Huanqiang and Zhong, Baojiang},
    	journal={IEEE Transactions on Image Processing},
    	volume={29},
    	number={1},
    	pages={4952--4964},
    	year={2020},
    	publisher={IEEE}
    }
  11. Just Noticeable Distortion Profile Inference: A Patch-level Structural Visibility Learning Approach
    Xuelin Shen, Zhangkai Ni, Wenhan Yang, Shiqi Wang, Xinfeng Zhang, and Sam Kwong.
    IEEE Transactions on Image Processing (T-IP), vol. 30, pp. 26-38, November 2020.
    Abstract | Paper | Code | Dataset | Project | BibTex Abstract: In this paper, we propose an effective approach to infer the just noticeable distortion (JND) profile based on patch-level structural visibility learning. Instead of pixel-level JND profile estimation, the image patch, which is regarded as the basic processing unit to better correlate with the human perception, can be further decomposed into three conceptually independent components for visibility estimation. In particular, to incorporate the structural degradation into the patch-level JND model, a deep learning-based structural degradation estimation model is trained to approximate the masking of structural visibility. In order to facilitate the learning process, a JND dataset is further established, including 202 pristine images and 7878 distorted images generated by advanced compression algorithms based on the upcoming Versatile Video Coding (VVC) standard. Extensive experimental results further show the superiority of the proposed approach over the state-of-the-art. Our dataset is available at: https://shenxuelin-cityu.github.io/jnd.html.
    @article{shen2020just,
    	title={Just Noticeable Distortion Profile Inference: A Patch-level Structural Visibility Learning Approach},
    	author={Shen, Xuelin and Ni, Zhangkai and Yang, Wenhan and Zhang, Xinfeng and Wang, Shiqi and Kwong, Sam},
    	journal={IEEE Transactions on Image Processing},
    	volume={30},
    	pages={26--38},
    	year={2020},
    	publisher={IEEE}
    }
  12. Unimodal Model-Based Inter Mode Decision for High Efficiency Video Coding
    Huanqiang Zeng, Wenjie Xiang, Jing Chen, Canhui Cai, Zhangkai Ni, and Kai-Kuang Ma.
    IEEE Access, vol. 7, pp. 27936-27947, February 2019.
    Abstract | Paper | BibTex Abstract: In this paper, a fast inter mode decision algorithm, called the unimodal model-based inter mode decision (UMIMD), is proposed for the latest video coding standard, the high-efficiency video coding. Through extensive simulations, it has been observed that a unimodal model (i.e., with only one global minimum value) can be established among the size of different prediction unit (PU) modes and their resulted rate-distortion (RD) costs for each quad-tree partitioned coding tree unit (CTU). To guarantee the unimodality and further search the optimal operating point over this function for each CTU, all the PU modes need to be first classified into 11 mode classes according to their sizes. These classes are then properly ordered and sequentially checked according to the class index, from small to large so that the optimal mode can be early identified by checking when the RD cost starts to arise. In addition, an effective instant SKIP mode termination scheme is developed by simply checking the SKIP mode against a pre-determined threshold to further reduce the computational complexity. The extensive simulation results have shown that the proposed UMIMD algorithm is able to individually achieve a significant reduction on computational complexity at the encoder by 61.9% and 64.2% on average while incurring only 1.7% and 2.1% increment on the total Bjontegaard delta bit rate (BDBR) for the low delay and random access test conditions, compared with the exhaustive mode decision in the HEVC. Moreover, the experimental results have further demonstrated that the proposed UMIMD algorithm outperforms multiple state-of-the-art methods.
    @article{zeng2019unimodal,
    	title={Unimodal Model-Based Inter Mode Decision for High Efficiency Video Coding},
    	author={Zeng, Huanqiang and Xiang, Wenjie and Chen, Jing and Cai, Canhui and Ni, Zhangkai and Ma, Kai-Kuang},
    	journal={IEEE Access},
    	volume={7},
    	pages={27936--27947},
    	year={2019},
    	publisher={IEEE}
    }
  13. A Gabor Feature-Based Quality Assessment Model for the Screen Content Images
    Zhangkai Ni, Huanqiang Zeng, Lin Ma, Junhui Hou, Jing Chen, and Kai-Kuang Ma.
    IEEE Transactions on Image Processing (T-IP), vol. 27, no. 9, pp. 4516-4528, September 2018.
    Abstract | Paper | Code | Project | BibTex Abstract: In this paper, an accurate and efficient full-reference image quality assessment (IQA) model using the extracted Gabor features, called Gabor feature-based model (GFM), is proposed for conducting objective evaluation of screen content images (SCIs). It is well-known that the Gabor filters are highly consistent with the response of the human visual system (HVS), and the HVS is highly sensitive to the edge information. Based on these facts, the imaginary part of the Gabor filter that has odd symmetry and yields edge detection is exploited to the luminance of the reference and distorted SCI for extracting their Gabor features, respectively. The local similarities of the extracted Gabor features and two chrominance components, recorded in the LMN color space, are then measured independently. Finally, the Gabor-feature pooling strategy is employed to combine these measurements and generate the final evaluation score. Experimental simulation results obtained from two large SCI databases have shown that the proposed GFM model not only yields a higher consistency with the human perception on the assessment of SCIs but also requires a lower computational complexity, compared with that of classical and state-of-the-art IQA models.
    @article{ni2018gabor,
    	title={A Gabor feature-based quality assessment model for the screen content images},
    	author={Ni, Zhangkai and Zeng, Huanqiang and Ma, Lin and Hou, Junhui and Chen, Jing and Ma, Kai-Kuang},
    	journal={IEEE Transactions on Image Processing},
    	volume={27},
    	number={9},
    	pages={4516--4528},
    	year={2018},
    	publisher={IEEE}
    }
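    To make the Gabor-feature idea above concrete, the sketch below builds an edge map from the imaginary (odd-symmetric) part of Gabor kernels, measures a pointwise similarity between the reference and distorted maps, and pools it with an edge-strength weight. The filter frequency, orientations, constant c, and max-based weighting are illustrative choices; the chrominance terms and the paper's Gabor-feature pooling strategy are simplified away.
    import numpy as np
    from scipy.signal import convolve2d
    from skimage.filters import gabor_kernel

    def gabor_edge_map(luma, frequency=0.1, orientations=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
        """Edge response from the odd-symmetric (imaginary) Gabor parts, max-pooled over orientations."""
        responses = [np.abs(convolve2d(luma, np.imag(gabor_kernel(frequency, theta=t)),
                                       mode="same", boundary="symm")) for t in orientations]
        return np.max(responses, axis=0)

    def gfm_like_score(luma_ref, luma_dist, c=200.0):
        g_ref, g_dist = gabor_edge_map(luma_ref), gabor_edge_map(luma_dist)
        sim = (2 * g_ref * g_dist + c) / (g_ref ** 2 + g_dist ** 2 + c)
        weight = np.maximum(g_ref, g_dist)  # emphasize strong edges during pooling
        return float((sim * weight).sum() / (weight.sum() + 1e-12))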
  14. Screen Content Image Quality Assessment Using Multi-Scale Difference of Gaussian
    Ying Fu, Huanqiang Zeng, Lin Ma, Zhangkai Ni, Jianqing Zhu, and Kai-Kuang Ma.
    IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), vol. 28, no. 9, pp. 2428-2432, September 2018.
    Abstract | Paper | Code | BibTex Abstract: In this paper, a novel image quality assessment (IQA) model for the screen content images (SCIs) is proposed by using multi-scale difference of Gaussian (MDOG). Motivated by the observation that the human visual system (HVS) is sensitive to the edges while the image details can be better explored in different scales, the proposed model exploits MDOG to effectively characterize the edge information of the reference and distorted SCIs at two different scales, respectively. Then, the degree of edge similarity is measured in terms of the smaller-scale edge map. Finally, the edge strength computed based on the larger-scale edge map is used as the weighting factor to generate the final SCI quality score. Experimental results have shown that the proposed IQA model for the SCIs produces high consistency with human perception of the SCI quality and outperforms the state-of-the-art quality models.
    @article{fu2018screen,
    	title={Screen content image quality assessment using multi-scale difference of gaussian},
    	author={Fu, Ying and Zeng, Huanqiang and Ma, Lin and Ni, Zhangkai and Zhu, Jianqing and Ma, Kai-Kuang},
    	journal={IEEE Transactions on Circuits and Systems for Video Technology},
    	volume={28},
    	number={9},
    	pages={2428--2432},
    	year={2018},
    	publisher={IEEE}
    }
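    A minimal sketch of the two-scale scheme described in the MDOG abstract above: difference-of-Gaussian edge maps are computed at a small and a large scale, the similarity is measured on the small-scale maps, and the large-scale edge strength serves as the pooling weight. The sigmas, the scale ratio k, the constant c, and the max-based weighting are assumptions for illustration.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dog_edge_map(luma, sigma, k=1.6):
        """Difference-of-Gaussian edge response at one scale."""
        return np.abs(gaussian_filter(luma, sigma) - gaussian_filter(luma, k * sigma))

    def mdog_like_score(luma_ref, luma_dist, sigma_small=1.0, sigma_large=4.0, c=1e-3):
        # Similarity is measured on the smaller-scale edge maps ...
        e_ref, e_dist = dog_edge_map(luma_ref, sigma_small), dog_edge_map(luma_dist, sigma_small)
        sim = (2 * e_ref * e_dist + c) / (e_ref ** 2 + e_dist ** 2 + c)
        # ... and weighted by the edge strength of the larger-scale maps.
        weight = np.maximum(dog_edge_map(luma_ref, sigma_large), dog_edge_map(luma_dist, sigma_large))
        return float((sim * weight).sum() / (weight.sum() + 1e-12))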
  15. ESIM: Edge Similarity for Screen Content Image Quality Assessment
    Zhangkai Ni, Lin Ma, Huanqiang Zeng, Jing Chen, Canhui Cai, and Kai-Kuang Ma.
    IEEE Transactions on Image Processing (T-IP), vol. 26, no. 10, pp. 4818-4831, October 2017.
    Abstract | Paper | Code | Dataset | Project | BibTex Abstract: In this paper, an accurate full-reference image quality assessment (IQA) model developed for assessing screen content images (SCIs), called the edge similarity (ESIM), is proposed. It is inspired by the fact that the human visual system (HVS) is highly sensitive to edges that are often encountered in SCIs; therefore, essential edge features are extracted and exploited for conducting IQA for the SCIs. The key novelty of the proposed ESIM lies in the extraction and use of three salient edge features-i.e., edge contrast, edge width, and edge direction. The first two attributes are simultaneously generated from the input SCI based on a parametric edge model, while the last one is derived directly from the input SCI. The extraction of these three features will be performed for the reference SCI and the distorted SCI, individually. The degree of similarity measured for each above-mentioned edge attribute is then computed independently, followed by combining them together using our proposed edge-width pooling strategy to generate the final ESIM score. To conduct the performance evaluation of our proposed ESIM model, a new and the largest SCI database (denoted as SCID) is established in our work and made to the public for download. Our database contains 1800 distorted SCIs that are generated from 40 reference SCIs. For each SCI, nine distortion types are investigated, and five degradation levels are produced for each distortion type. Extensive simulation results have clearly shown that the proposed ESIM model is more consistent with the perception of the HVS on the evaluation of distorted SCIs than the multiple state-of-the-art IQA methods.
    @article{ni2017esim,
    	title={ESIM: Edge similarity for screen content image quality assessment},
    	author={Ni, Zhangkai and Ma, Lin and Zeng, Huanqiang and Chen, Jing and Cai, Canhui and Ma, Kai-Kuang},
    	journal={IEEE Transactions on Image Processing},
    	volume={26},
    	number={10},
    	pages={4818--4831},
    	year={2017},
    	publisher={IEEE}
    }
  16. Gradient Direction for Screen Content Image Quality Assessment
    Zhangkai Ni, Lin Ma, Huanqiang Zeng, Canhui Cai, and Kai-Kuang Ma.
    IEEE Signal Processing Letters (SPL), vol. 23, no. 10, pp. 1394–1398, August 2016.
    Abstract | Paper | Code | Project | BibTex Abstract: In this letter, we make the first attempt to explore the usage of the gradient direction to conduct the perceptual quality assessment of the screen content images (SCIs). Specifically, the proposed approach first extracts the gradient direction based on the local information of the image gradient magnitude, which not only preserves gradient direction consistency in local regions, but also demonstrates sensitivities to the distortions introduced to the SCI. A deviation-based pooling strategy is subsequently utilized to generate the corresponding image quality index. Moreover, we investigate and demonstrate the complementary behaviors of the gradient direction and magnitude for SCI quality assessment. By jointly considering them together, our proposed SCI quality metric outperforms the state-of-the-art quality metrics in terms of correlation with human visual system perception.
    @article{ni2016gradient,
    	title={Gradient direction for screen content image quality assessment},
    	author={Ni, Zhangkai and Ma, Lin and Zeng, Huanqiang and Cai, Canhui and Ma, Kai-Kuang},
    	journal={IEEE Signal Processing Letters},
    	volume={23},
    	number={10},
    	pages={1394--1398},
    	year={2016},
    	publisher={IEEE}
    }
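    As a toy illustration of the direction-plus-deviation-pooling idea above, the sketch below derives per-pixel gradient directions with Sobel filters, compares them through the cosine of their difference, and pools with the standard deviation of the similarity map (a larger deviation indicating stronger perceived distortion). The Sobel-based direction and the cosine comparison are simplifications of the paper's local-magnitude-based direction extraction.
    import numpy as np
    from scipy.ndimage import sobel

    def gradient_direction(luma):
        """Per-pixel gradient direction (radians) from horizontal/vertical Sobel responses."""
        gx, gy = sobel(luma, axis=1), sobel(luma, axis=0)
        return np.arctan2(gy, gx)

    def direction_deviation_score(luma_ref, luma_dist):
        # Cosine of the direction difference: 1 = identical direction, -1 = opposite.
        sim = np.cos(gradient_direction(luma_ref) - gradient_direction(luma_dist))
        # Deviation-based pooling: the spread of the similarity map is the quality index.
        return float(sim.std())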

Conference Publications

  1. DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor
    Juncheng Wu, Zhangkai Ni, Hanli Wang, Wenhan Yang, Yuyin Zhou, and Shiqi Wang
    In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), December 2024.
    Abstract | Paper | Code | BibTex Abstract: Image deep features extracted by pre-trained networks are known to contain rich and informative representations. In this paper, we present Deep Degradation Response (DDR), a method to quantify changes in image deep features under varying degradation conditions. Specifically, our approach facilitates flexible and adaptive degradation, enabling the controlled synthesis of image degradation through text-driven prompts. Extensive evaluations demonstrate the versatility of DDR as an image descriptor, with strong correlations observed with key image attributes such as complexity, colorfulness, sharpness, and overall quality. Moreover, we demonstrate the efficacy of DDR across a spectrum of applications. It excels as a blind image quality assessment metric, outperforming existing methodologies across multiple datasets. Additionally, DDR serves as an effective unsupervised learning objective in image restoration tasks, yielding notable advancements in image deblurring and single-image super-resolution. Our code is available at: https://github.com/eezkni/DDR
    @article{wu2024ddr,
    	title={DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor},
    	author={Wu, Juncheng and Ni, Zhangkai and Wang, Hanli and Yang, Wenhan and Zhou, Yuyin and Wang, Shiqi},
    	journal={arXiv preprint arXiv:2406.08377},
    	year={2024}
    }
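    The core quantity in DDR, the change of deep features under a controlled degradation, can be prototyped in a few lines. The PyTorch sketch below uses a truncated torchvision ResNet-50 as a stand-in feature extractor, an L1 distance as a stand-in response measure, and a simple blur in place of the paper's text-driven degradation synthesis; all three choices are assumptions for illustration only.
    import torch
    import torch.nn.functional as F
    from torchvision.models import resnet50

    # Stand-in backbone: ResNet-50 up to its last convolutional stage (drops avgpool and fc).
    backbone = torch.nn.Sequential(*list(resnet50(weights="IMAGENET1K_V2").children())[:-2]).eval()

    @torch.no_grad()
    def degradation_response(image, degrade):
        """Distance between deep features of an (N, 3, H, W) image batch and its degraded version."""
        return F.l1_loss(backbone(image), backbone(degrade(image))).item()

    # Example degradation: a 5x5 box blur standing in for text-driven degradation synthesis.
    blur = lambda x: F.avg_pool2d(x, kernel_size=5, stride=1, padding=2)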
  2. Unrolled Decomposed Unpaired Learning for Controllable Low-Light Video Enhancement
    Lingyu Zhu, Wenhan Yang, Baoliang Chen, Hanwei Zhu, Zhangkai Ni, Qi Mao, and Shiqi Wang
    In Proceedings of the European Conference on Computer Vision (ECCV), September 2024.
    Abstract | Paper | Code | BibTex Abstract: Obtaining pairs of low/normal-light videos, with motions, is more challenging than still images, which raises technical issues and poses the technical route of unpaired learning as a critical role. This paper makes endeavors in the direction of learning for low-light video enhancement without using paired ground truth. Compared to low-light image enhancement, enhancing low-light videos is more difficult due to the intertwined effects of noise, exposure, and contrast in the spatial domain, jointly with the need for temporal coherence. To address the above challenge, we propose the Unrolled Decomposed Unpaired Network (UDU-Net) for enhancing low-light videos by unrolling the optimization functions into a deep network to decompose the signal into spatial and temporal-related factors, which are updated iteratively. Firstly, we formulate low-light video enhancement as a Maximum A Posteriori estimation (MAP) problem with carefully designed spatial and temporal visual regularization. Then, via unrolling the problem, the optimization of the spatial and temporal constraints can be decomposed into different steps and updated in a stage-wise manner. From the spatial perspective, the designed Intra subnet leverages unpair prior information from expert photography retouched skills to adjust the statistical distribution. Additionally, we introduce a novel mechanism that integrates human perception feedback to guide network optimization, suppressing over/under-exposure conditions. Meanwhile, to address the issue from the temporal perspective, the designed Inter subnet fully exploits temporal cues in progressive optimization, which helps achieve improved temporal consistency in enhancement results. Consequently, the proposed method achieves superior performance to state-of-the-art methods in video illumination, noise suppression, and temporal consistency across outdoor and indoor scenes.
    @article{zhu2024unrolled,
    	title={Unrolled Decomposed Unpaired Learning for Controllable Low-Light Video Enhancement},
    	author={Zhu, Lingyu and Yang, Wenhan and Chen, Baoliang and Zhu, Hanwei and Ni, Zhangkai and Mao, Qi and Wang, Shiqi},
    	journal={arXiv preprint arXiv:2408.12316},
    	year={2024}
    }
  3. Misalignment-Robust Frequency Distribution Loss for Image Transformation
    Zhangkai Ni, Juncheng Wu, Zian Wang, Wenhan Yang, Hanli Wang, Lin Ma.
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024.
    Abstract | Paper | Code | BibTex Abstract: This paper aims to address a common challenge in deep learning-based image transformation methods, such as image enhancement and super-resolution, which heavily rely on precisely aligned paired datasets with pixel-level alignments. However, creating precisely aligned paired images presents significant challenges and hinders the advancement of methods trained on such data. To overcome this challenge, this paper introduces a novel and simple Frequency Distribution Loss (FDL) for computing distribution distance within the frequency domain. Specifically, we transform image features into the frequency domain using Discrete Fourier Transformation (DFT). Subsequently, frequency components (amplitude and phase) are processed separately to form the FDL loss function. Our method is empirically proven effective as a training constraint due to the thoughtful utilization of global information in the frequency domain. Extensive experimental evaluations, focusing on image enhancement and super-resolution tasks, demonstrate that FDL outperforms existing misalignment-robust loss functions. Furthermore, we explore the potential of our FDL for image style transfer that relies solely on completely misaligned data. Our code is available at: https://github.com/eezkni/FDL
    @inproceedings{ni2024misalignment,
    	title={Misalignment-robust frequency distribution loss for image transformation},
    	author={Ni, Zhangkai and Wu, Juncheng and Wang, Zian and Yang, Wenhan and Wang, Hanli and Ma, Lin},
    	booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    	pages={2910--2919},
    	year={2024}
    }
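    A stripped-down version of the loss described in the FDL abstract above: feature maps are taken to the frequency domain with a 2D DFT, and their amplitude and phase components are compared separately. The plain L1 comparisons used here are a simplification (the paper computes a distribution distance over these components), and the weights w_amp and w_phase are hypothetical knobs.
    import torch

    def frequency_distribution_loss_sketch(feat_pred, feat_target, w_amp=1.0, w_phase=1.0):
        """Compare two feature maps in the frequency domain via separate amplitude and phase terms."""
        spec_pred = torch.fft.fft2(feat_pred)
        spec_target = torch.fft.fft2(feat_target)
        amp_loss = torch.mean(torch.abs(spec_pred.abs() - spec_target.abs()))
        phase_loss = torch.mean(torch.abs(spec_pred.angle() - spec_target.angle()))
        return w_amp * amp_loss + w_phase * phase_loss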
  4. ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field
    Zhangkai Ni, Peiqi Yang, Wenhan Yang, Hanli Wang, Lin Ma, and Sam Kwong.
    In Proceedings of the 38th Annual AAAI Conference on Artificial Intelligence (AAAI), vol. 38, no. 5, pp. 4325-4333, 2024.
    Abstract | Paper | Code | BibTex Abstract: Neural Radiance Fields (NeRF) have demonstrated impressive potential in synthesizing novel views from dense input, however, their effectiveness is challenged when dealing with sparse input. Existing approaches that incorporate additional depth or semantic supervision can alleviate this issue to an extent. However, the process of supervision collection is not only costly but also potentially inaccurate, leading to poor performance and generalization ability in diverse scenarios. In our work, we introduce a novel model: the Collaborative Neural Radiance Fields (ColNeRF) designed to work with sparse input. The collaboration in ColNeRF includes both the cooperation between sparse input images and the cooperation between the output of the neural radiation field. Through this, we construct a novel collaborative module that aligns information from various views and meanwhile imposes self-supervised constraints to ensure multi-view consistency in both geometry and appearance. A Collaborative Cross-View Volume Integration module (CCVI) is proposed to capture complex occlusions and implicitly infer the spatial location of objects. Moreover, we introduce self-supervision of target rays projected in multiple directions to ensure geometric and color consistency in adjacent regions. Benefiting from the collaboration at the input and output ends, ColNeRF is capable of capturing richer and more generalized scene representation, thereby facilitating higher-quality results of the novel view synthesis. Extensive experiments demonstrate that ColNeRF outperforms state-of-the-art sparse input generalizable NeRF methods. Furthermore, our approach exhibits superiority in fine-tuning towards adapting to new scenes, achieving competitive performance compared to per-scene optimized NeRF-based methods while significantly reducing computational costs. Our code is available at: https://github.com/eezkni/ColNeRF
    @inproceedings{ni2024colnerf,
    	title={ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field},
    	author={Ni, Zhangkai and Yang, Peiqi and Yang, Wenhan and Wang, Hanli and Ma, Lin and Kwong, Sam},
    	booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
    	volume={38},
    	number={5},
    	pages={4325--4333},
    	year={2024}
    }
  5. Cycle-Interactive Generative Adversarial Network for Robust Unsupervised Low-Light Enhancement
    Zhangkai Ni, Wenhan Yang, Hanli Wang, Shiqi Wang, Lin Ma, and Sam Kwong.
    In Proceedings of the 30th ACM International Conference on Multimedia (ACM Multimedia), October 2022.
    Abstract | Paper | Code | BibTex Abstract: Getting rid of the fundamental limitations in fitting to the paired training data, recent unsupervised low-light enhancement methods excel in adjusting illumination and contrast of images. However, for unsupervised low light enhancement, the remaining noise suppression issue due to the lacking of supervision of detailed signal largely impedes the wide deployment of these methods in real-world applications. Herein, we propose a novel Cycle-Interactive Generative Adversarial Network (CIGAN) for unsupervised low-light image enhancement, which is capable of not only better transferring illumination distributions between low/normal-light images but also manipulating detailed signals between two domains, e.g., suppressing/synthesizing realistic noise in the cyclic enhancement/degradation process. In particular, the proposed low-light guided transformation feed-forwards the features of low-light images from the generator of enhancement GAN (eGAN) into the generator of degradation GAN (dGAN). With the learned information of real low-light images, dGAN can synthesize more realistic diverse illumination and contrast in low-light images. Moreover, the feature randomized perturbation module in dGAN learns to increase the feature randomness to produce diverse feature distributions, persuading the synthesized low-light images to contain realistic noise. Extensive experiments demonstrate both the superiority of the proposed method and the effectiveness of each module in CIGAN.
    @inproceedings{ni2022cycle,
    	title={Cycle-Interactive Generative Adversarial Network for Robust Unsupervised Low-Light Enhancement},
    	author={Ni, Zhangkai and Yang, Wenhan and Wang, Hanli and Wang, Shiqi and Ma, Lin and Kwong, Sam},
    	booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
    	year={2022}
    }
  6. Unpaired Image Enhancement with Quality-Attention Generative Adversarial Network
    Zhangkai Ni, Wenhan Yang, Shiqi Wang, Lin Ma, and Sam Kwong.
    In Proceedings of the 28th ACM International Conference on Multimedia (ACM Multimedia), pp. 1697-1705, October 2020.
    Abstract | Paper | Code | BibTex Abstract: In this work, we aim to learn an unpaired image enhancement model, which can enrich low-quality images with the characteristics of high-quality images provided by users. We propose a quality attention generative adversarial network (QAGAN) trained on unpaired data based on the bidirectional Generative Adversarial Network (GAN) embedded with a quality attention module (QAM). The key novelty of the proposed QAGAN lies in the injected QAM for the generator such that it learns domain-relevant quality attention directly from the two domains. More specifically, the proposed QAM allows the generator to effectively select semantic-related characteristics from the spatial-wise and adaptively incorporate style-related attributes from the channel-wise, respectively. Therefore, in our proposed QAGAN, not only discriminators but also the generator can directly access both domains which significantly facilitate the generator to learn the mapping function. Extensive experimental results show that, compared with the state-of-the-art methods based on unpaired learning, our proposed method achieves better performance in both objective and subjective evaluations.
    @inproceedings{ni2020unpaired,
    	title={Unpaired image enhancement with quality-attention generative adversarial network},
    	author={Ni, Zhangkai and Yang, Wenhan and Wang, Shiqi and Ma, Lin and Kwong, Sam},
    	booktitle={Proceedings of the 28th ACM International Conference on Multimedia},
    	pages={1697--1705},
    	year={2020}
    }
  7. A JND Dataset Based on VVC Compressed Images
    Xuelin Shen, Zhangkai Ni, Wenhan Yang, Xinfeng Zhang, Shiqi Wang, and Sam Kwong.
    2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), June 2020.
    Abstract | Paper | Dataset | BibTex Abstract: In this paper, we establish a just noticeable distortion (JND) dataset based on the next generation video coding standard Versatile Video Coding (VVC). The dataset consists of 202 images which cover a wide range of content with resolution 1920×1080. Each image is encoded by VTM 5.0 intra coding with the quantization parameter (QP) ranging from 13 to 51. The details regarding dataset construction, subjective testing and data post-processing are described in this paper. Finally, the significance of the dataset towards future video coding research is envisioned. All source images as well as the testing data have been made available to the public.
    @inproceedings{shen2020jnd,
    	title={A JND Dataset Based on VVC Compressed Images},
    	author={Shen, Xuelin and Ni, Zhangkai and Yang, Wenhan and Zhang, Xinfeng and Wang, Shiqi and Kwong, Sam},
    	booktitle={2020 IEEE International Conference on Multimedia \& Expo Workshops (ICMEW)},
    	pages={1--6},
    	year={2020},
    	organization={IEEE}
    }
  8. SCID: A Database for Screen Content Images Quality Assessment
    Zhangkai Ni, Lin Ma, Huanqiang Zeng, Ying Fu, Lu Xing, and Kai-Kuang Ma.
    International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 774-779, November 2017.
    Abstract | Paper | Dataset | Project | BibTex Abstract: Perceptual quality assessment of screen content images (SCIs) has become a new challenging topic in the recent research of image quality assessment (IQA). In this work, we construct a new SCI database (called SCID) for subjective quality evaluation of SCIs and investigate whether existing IQA models can effectively assess the perceptual quality of distorted SCIs. The proposed SCID, which is currently the largest one, contains 1,800 distorted SCIs generated from 40 reference SCIs with 9 types of distortions and 5 degradation levels for each distortion type. The double-stimulus impairment scale (DSIS) method is then employed to rate the perceptual quality, in which each image is evaluated by at least 40 assessors. After processing, each distorted SCI is accompanied with one mean opinion score (MOS) value to indicate its perceptual quality as ground truth. Based on the constructed SCID, we evaluate the performances of 14 state-of-the-art IQA metrics. Experimental results show that the existing IQA metrics are not able to evaluate the perceptual quality of SCIs well, and an IQA metric specifically for SCIs is thus desirable. The proposed SCID will be made publicly available to the research community for further investigation on the perceptual processing of SCIs.
    @inproceedings{ni2017scid,
    	title={SCID: A database for screen content images quality assessment},
    	author={Ni, Zhangkai and Ma, Lin and Zeng, Huanqiang and Fu, Ying and Xing, Lu and Ma, Kai-Kuang},
    	booktitle={2017 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS)},
    	pages={774--779},
    	year={2017},
    	organization={IEEE}
    }
  9. Screen Content Image Quality Assessment Using Euclidean Distance
    Ying Fu, Huanqiang Zeng, Zhangkai Ni, Jing Chen, Canhui Cai, and Kai-Kuang Ma.
    International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 44-49, November 2017.
    Abstract | Paper | BibTex Abstract: Considering that the human visual system (HVS) is greatly sensitive to edges, in this study, we design a new full-reference objective quality assessment method for screen content images (SCIs). The key novelty lies in the extraction of the edge information by computing the Euclidean distance of luminance in the SCIs. Since the HVS is greatly suitable for extracting structural information, the structure information is incorporated into our proposed model. The extracted information is then used to compute the similarity maps of the reference SCI and its distorted SCI. Finally, we combine the obtained maps by using our designed pooling strategy. Experimental results have shown that the designed method achieves higher correlation with the subjective quality scores than state-of-the-art quality assessment models.
    @inproceedings{fu2017screen,
    	title={Screen content image quality assessment using Euclidean distance},
    	author={Fu, Ying and Zeng, Huanqiang and Ni, Zhangkai and Chen, Jing and Cai, Canhui and Ma, Kai-Kuang},
    	booktitle={2017 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS)},
    	pages={44--49},
    	year={2017},
    	organization={IEEE}
    }
  10. Screen Content Image Quality Assessment Using Edge Model
    Zhangkai Ni, Lin Ma, Huanqiang Zeng, Canhui Cai, and Kai-Kuang Ma.
    IEEE International conference on Image Processing (ICIP), pp. 81–85, August 2016.
    Abstract | Paper | Code | BibTex Abstract: Since the human visual system (HVS) is highly sensitive to edges, a novel image quality assessment (IQA) metric for assessing screen content images (SCIs) is proposed in this paper. The turnkey novelty lies in the use of an existing parametric edge model to extract two types of salient attributes - namely, edge contrast and edge width, for the distorted SCI under assessment and its original SCI, respectively. The extracted information is subject to conduct similarity measurements on each attribute, independently. The obtained similarity scores are then combined using our proposed edge-width pooling strategy to generate the final IQA score. Hopefully, this score is consistent with the judgment made by the HVS. Experimental results have shown that the proposed IQA metric produces higher consistency with that of the HVS on the evaluation of the image quality of the distorted SCI than that of other state-of-the-art IQA metrics.
    @inproceedings{ni2016screen,
    	title={Screen content image quality assessment using edge model},
    	author={Ni, Zhangkai and Ma, Lin and Zeng, Huanqiang and Cai, Canhui and Ma, Kai-Kuang},
    	booktitle={2016 IEEE International Conference on Image Processing (ICIP)},
    	pages={81--85},
    	year={2016},
    	organization={IEEE}
    }

Patents

  1. A Rain Removal Image Post-processing Method Based on Progressive Collaborative Representation
    Huanqiang Zeng, Xiangwei Lin, Zhangkai Ni, Jiuwen Cao, Jianqing Zhu, and Kai-Kuang Ma
    Application No. 10201906356T, July 2019. (Chinese Patent)
  2. Colour Image Demosaicing Using Progressive Collaborative Representation
    Kai-Kuang Ma, and Zhangkai Ni
    Application No. 10201906356T, July 2019. (Singapore Patent)
  3. A Multi-Exposure Fused Image Quality Assessment Method Based on Contrast and Saturation
    Huanqiang Zeng, Lu Xing, Zhangkai Ni, Jiuwen Cao, Canhui Cai, and Kai-Kuang Ma
    Application No. 2016111584053, December 2016. (Chinese Patent)
  4. A Screen Content Image Quality Assessment Method Based on Phase Congruency
    Huanqiang Zeng, Zhangkai Ni, Lin Ma, Jiuwen Cao, Canhui Cai, and Kai-Kuang Ma
    Application No. 2016108863395, October 2016. (Chinese Patent)