RINE: Leveraging Representations from Intermediate Encoder-Blocks for Synthetic Image Detection
A high-performing synthetic image detection method that utilises intermediate layers of the CLIP image encoder.

RINE is a synthetic image detection framework that leverages the intermediate representations from CLIP’s Vision Transformer blocks to build a forgery-aware feature space, enabling high accuracy in detecting synthetic images with minimal computational resources.
RINE tackles Synthetic Image Detection (SID) by exploiting information that most approaches discard: whereas traditional methods rely primarily on final-layer features, RINE extracts representations from multiple intermediate Transformer blocks of CLIP's image encoder. These intermediate blocks encapsulate low-level image details, which are crucial for identifying synthetic traces.
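
As a rough illustration of this extraction step, the snippet below harvests per-block CLS tokens from the Hugging Face `transformers` implementation of CLIP's vision encoder. This is a minimal sketch, not the authors' code; the checkpoint name, dummy image batch, and frozen-backbone treatment are assumptions.

```python
import torch
from transformers import CLIPVisionModel

# Assumed checkpoint; the method itself is not tied to this particular CLIP variant.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
encoder.eval()  # the backbone is treated as frozen in this sketch

images = torch.randn(4, 3, 224, 224)  # a dummy batch of preprocessed images

with torch.no_grad():
    out = encoder(pixel_values=images, output_hidden_states=True)

# hidden_states holds the embedding output plus one tensor per Transformer
# block, each shaped [batch, num_tokens, dim]; token 0 is the CLS token.
cls_per_block = torch.stack(
    [h[:, 0, :] for h in out.hidden_states[1:]], dim=1
)  # [batch, num_blocks, dim]
print(cls_per_block.shape)  # e.g. torch.Size([4, 12, 768])
```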
The RINE architecture first passes an input image through CLIP's image encoder and extracts the CLS token from each Transformer block. These tokens are concatenated and projected into a forgery-aware vector space by a lightweight trainable network, allowing RINE to capture the fine-grained cues that betray synthetic artifacts. A distinctive component is the Trainable Importance Estimator (TIE), which assigns each block's representation a weight reflecting its relevance to the SID task, enabling more accurate aggregation of features.
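
A minimal sketch of how the TIE and projection head could be wired together is shown below. The module name, layer sizes, and projection dimension are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ForgeryAwareHead(nn.Module):
    """Hypothetical head: TIE weights per block, then concatenation and
    projection into a forgery-aware space, plus a classification layer."""

    def __init__(self, num_blocks: int, dim: int, proj_dim: int = 128):
        super().__init__()
        # Trainable Importance Estimator: one learnable score per block
        self.block_scores = nn.Parameter(torch.zeros(num_blocks))
        # lightweight projection network (sizes are assumptions)
        self.project = nn.Sequential(
            nn.Linear(num_blocks * dim, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )
        self.classify = nn.Linear(proj_dim, 1)  # real-vs-synthetic logit

    def forward(self, cls_per_block: torch.Tensor):
        # cls_per_block: [batch, num_blocks, dim], e.g. from the sketch above
        weights = torch.softmax(self.block_scores, dim=0)  # [num_blocks]
        weighted = weights[None, :, None] * cls_per_block  # weight each block
        features = self.project(weighted.flatten(1))       # concat + project
        return features, self.classify(features).squeeze(-1)

head = ForgeryAwareHead(num_blocks=12, dim=768)
features, logits = head(torch.randn(4, 12, 768))
```

Because only this small head is trained while the CLIP backbone stays fixed, the approach remains cheap to train, which is consistent with the minimal-compute claim above.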
To further enhance learning, RINE is trained with a combination of binary cross-entropy loss, for classification accuracy, and supervised contrastive learning, which organises feature vectors of the same class into dense clusters. This combination not only improves the model's classification but also strengthens its ability to generalise across different synthetic image datasets.
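
The following sketch combines the two objectives, assuming the standard supervised contrastive formulation (Khosla et al., 2020) with one view per sample; the temperature `tau` and the balancing weight `lam` are hypothetical values, not the method's actual hyperparameters.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    """Supervised contrastive loss over a batch; same-class pairs are pulled
    together, all other pairs act as negatives."""
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / tau                                # pairwise similarities
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos = labels[:, None].eq(labels[None, :]) & ~eye     # same-class pairs
    masked = sim.masked_fill(eye, float('-inf'))         # exclude self-pairs
    log_prob = masked - torch.logsumexp(masked, dim=1, keepdim=True)
    # mean log-probability of positives per anchor; anchors with no positive
    # in the batch are skipped
    per_anchor = torch.where(pos, log_prob, torch.zeros_like(log_prob)).sum(1)
    has_pos = pos.sum(1) > 0
    return -(per_anchor[has_pos] / pos.sum(1)[has_pos]).mean()

def total_loss(cls_logits, features, labels, lam: float = 1.0):
    # binary cross-entropy for real-vs-synthetic classification, plus the
    # contrastive term; `lam` is a hypothetical balancing weight
    bce = F.binary_cross_entropy_with_logits(cls_logits, labels.float())
    return bce + lam * supcon_loss(features, labels)

features = torch.randn(8, 128)        # projected forgery-aware embeddings
cls_logits = torch.randn(8)           # classifier outputs
labels = torch.randint(0, 2, (8,))    # 0 = real, 1 = synthetic
print(total_loss(cls_logits, features, labels).item())
```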