VISReg

A scaling-friendly method with better generalizability

arXiv
Haiyu Wu¹, Randall Balestriero², Morgan Levine¹
¹Altos Labs, ²Brown University

TL;DR: We propose VISReg (Variance-Invariance-Sketching Regularization), a novel method that prevents embedding collapse while learning good representations. We systematically evaluate its scaling efficiency, robustness to low-quality datasets, in-domain and OOD performance, dense instance prediction, and generation guidance. VISReg has the best OOD performance at the ImageNet-1K scale and achieves OOD performance similar to DINOv2 with only 0.1× the training data.

🛡 Heuristic-Free Training: Stable self-supervised learning without EMA, stop-gradient, teacher-student architectures, etc.
💪 Strong Collapse Prevention: Provides stronger gradients at the collapse stage.
🌍 Best OOD Generalization: SOTA on 6 diverse out-of-distribution datasets, outperforming all compared methods.
🚀 Scales Efficiently: Complexity is linear in all scaling factors, with GPU-distributable slices that suit large-scale training.

Key Findings

1. Robust to low-quality datasets: Methods grounded in the Cramér–Wold theorem remain stable on low-quality datasets (e.g., long-tailed or sparse). The number of slices/projections used by the regularization loss increases naturally in distributed training, which further ensures training stability.

2. Best OOD performance: VISReg has the best OOD linear probe accuracy at the ImageNet-1K scale. At a larger scale, VISReg trained on ImageNet-22K (14M images) achieves an average accuracy on OOD benchmarks similar to that of DINOv2 trained on LVD-142M (142M images).

3. Good downstream applications: VISReg performs on par with or better than DINOv1 on transfer learning, dense instance prediction, and generation guidance.


Why Decoupled Regularization?

Self-supervised learning methods like DINO, iBOT, and I-JEPA rely on heavy heuristics — EMA updates, teacher-student architectures, stop-gradient, frozen layers — to prevent embedding collapse. LeJEPA showed that regularizing the embedding space to an isotropic Gaussian can replace these heuristics entirely. However, its regularizer (SIGReg) has a critical weakness: its gradient vanishes precisely when the model needs it most — at collapse.

Gradient magnitude comparison under embedding collapse for VISReg, SIGReg, Barlow Twins, VICReg, and SWD
Figure 2. Embedding collapse prevention. We simulate the gradient \(\|\nabla\mathcal{L}\|\) of regularization methods under different collapse stages. VISReg maintains strong gradients at collapse, while SIGReg’s gradient vanishes.

This motivates our approach: decouple the regularization into independent scale (variance) and shape (sketching distribution geometry) components. By operating on these separately, VISReg is more robust to collapse, more efficient in high dimensions, and more resilient to low-quality datasets.


VISReg: Variance-Invariance-Sketching Regularization

VISReg regularizes the embedding space through three complementary objectives: scale, shape, and centering.

Scale Regularization

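We constrain the per-dimension spread of the embeddings to prevent magnitude collapse: \[\mathcal{L}_{\text{scale}} = \frac{1}{D}\sum_{j=1}^{D}(1 - \sigma_j(\hat{Z}))^2\] where \(\sigma_j(\hat{Z})\) is the standard deviation of the \(j\)-th dimension of the centered embeddings \(\hat{Z}\) and \(D\) is the embedding dimension. The gradient approaches a constant when the model collapses, ensuring stable recovery.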
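A short derivation may help show why the scale gradient does not vanish at collapse. It is a sketch under stated assumptions: \(\sigma_j\) is taken to be the biased (population) standard deviation over a batch of \(N\) embeddings \(z_{ij}\) with batch mean \(\mu_j\), matching `unbiased=False` in Algorithm 1. The per-sample notation \(z_{ij}\), \(\mu_j\) is introduced here for illustration only.

```latex
\begin{align*}
\sigma_j &= \sqrt{\tfrac{1}{N}\sum_{i=1}^{N}(z_{ij}-\mu_j)^2},
\qquad
\frac{\partial \sigma_j}{\partial z_{ij}} = \frac{z_{ij}-\mu_j}{N\,\sigma_j}, \\
\frac{\partial \mathcal{L}_{\text{scale}}}{\partial z_{ij}}
&= -\frac{2}{D}\,(1-\sigma_j)\,\frac{z_{ij}-\mu_j}{N\,\sigma_j}, \\
\Big\|\nabla_{z_{\cdot j}}\mathcal{L}_{\text{scale}}\Big\|_2
&= \frac{2}{D}\,\frac{|1-\sigma_j|}{\sqrt{N}}
\;\xrightarrow{\;\sigma_j \to 0\;}\; \frac{2}{D\sqrt{N}}.
\end{align*}
```

Under these assumptions, the gradient norm for each dimension tends to the constant \(2/(D\sqrt{N})\) rather than to zero as \(\sigma_j \to 0\), because \(\sum_i (z_{ij}-\mu_j)^2 = N\sigma_j^2\) cancels the \(\sigma_j\) in the denominator.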

Shape Regularization

After normalizing out scale, we align the distribution geometry to the isotropic Gaussian using the Sliced Wasserstein Distance (SWD), grounded in the Cramér–Wold theorem: \[\mathcal{L}_{\text{shape}} = \frac{1}{K}\sum_{k=1}^{K}\left\|\text{sort}(\tilde{Z}w_k) - q_N\right\|_2^2\] where \(\tilde{Z}\) denotes the scale-normalized embeddings, \(q_N\) is the vector of \(N\) standard Gaussian quantiles, and \(w_k\) are the \(K\) random unit-norm projection directions.
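The per-slice term can be illustrated with a minimal sketch in plain Python (standard library only, so it runs without PyTorch). It assumes the 1D projections passed in are already scale-normalized, and it uses the quantile levels \(i/(N+1)\) from Algorithm 1. The helper names `gaussian_quantiles` and `slice_shape_loss` are hypothetical, and the mean (rather than the sum) of squared gaps is used, as in the code listing.

```python
import random
from statistics import NormalDist, fmean


def gaussian_quantiles(n):
    """Standard-normal quantiles at levels i/(n+1), i = 1..n (as in Algorithm 1)."""
    std_normal = NormalDist(0.0, 1.0)
    return [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]


def slice_shape_loss(projections):
    """Mean squared gap between sorted 1D projections and Gaussian quantiles."""
    ordered = sorted(projections)
    targets = gaussian_quantiles(len(ordered))
    return fmean((p - t) ** 2 for p, t in zip(ordered, targets))


rng = random.Random(0)
n = 1000
# A slice that already looks Gaussian, and one that has unit variance but two modes.
gaussian_slice = [rng.gauss(0.0, 1.0) for _ in range(n)]
bimodal_slice = [-1.0] * (n // 2) + [1.0] * (n // 2)

loss_gaussian = slice_shape_loss(gaussian_slice)
loss_bimodal = slice_shape_loss(bimodal_slice)
print(f"gaussian slice: {loss_gaussian:.4f}, bimodal slice: {loss_bimodal:.4f}")
```

Both toy slices have roughly zero mean and unit variance, so a variance constraint alone would not separate them; the sorted-versus-quantile comparison is what penalizes the bimodal one.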

Combined Objective

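\[\mathcal{L}_{\text{Reg}} = \mathcal{L}_{\text{scale}} + \mathcal{L}_{\text{shape}} + \mathcal{L}_{\text{center}}\] where \(\mathcal{L}_{\text{center}}\) penalizes the squared mean of the embeddings, keeping them centered at the origin. Combined with a Euclidean invariant prediction loss, the full VISReg objective is: \[\mathcal{L}_{\text{VISReg}} = (1-\lambda)\mathcal{L}_{\text{pred}} + \lambda\mathcal{L}_{\text{Reg}}\] where \(\lambda\) weights the regularizer against the prediction loss.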
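As a rough illustration of how the terms combine, the sketch below computes the weighted objective in plain Python. Several things are assumptions rather than details from the source: the prediction loss is taken to be the mean squared Euclidean distance between paired embeddings of two views, \(\lambda\) is assumed to lie in \([0, 1]\), and the function names and all numeric values are made up for the example.

```python
def squared_euclidean_pred_loss(view_a, view_b):
    """Mean squared Euclidean distance between paired embeddings (assumed form)."""
    total = 0.0
    for a, b in zip(view_a, view_b):
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / len(view_a)


def visreg_objective(pred_loss, scale_loss, shape_loss, center_loss, lam):
    """(1 - lam) * L_pred + lam * (L_scale + L_shape + L_center)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is expected to lie in [0, 1]")
    reg_loss = scale_loss + shape_loss + center_loss
    return (1.0 - lam) * pred_loss + lam * reg_loss


# Toy embeddings for two augmented views of the same two images (made-up numbers).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 1.0], [0.0, 0.0]]
pred = squared_euclidean_pred_loss(view_a, view_b)  # (1 + 1) / 2 = 1.0

# Made-up regularizer values, only to show the weighting.
total = visreg_objective(pred, scale_loss=0.2, shape_loss=0.3, center_loss=0.1, lam=0.5)
print(total)
```

Because the two weights sum to one, a single \(\lambda\) trades prediction against regularization without changing the overall loss scale.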

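Algorithm 1. Regularization part of VISReg in PyTorch (~15 lines of code)
import torch
from torch.distributions import Normal

def visreg(z, K=64, eps=1e-6):
    # z: (N, D) batch of embeddings; K: number of random slices
    N, D = z.shape

    # 1. Center loss: penalize a non-zero batch mean
    mu = z.mean(dim=0)
    L_center = mu.pow(2).mean()

    # 2. Scale loss: push each dimension's std toward 1
    z_cent = z - mu
    std = z_cent.std(dim=0, unbiased=False)
    L_scale = (1.0 - std).pow(2).mean()

    # 3. Shape loss: Sliced Wasserstein Distance
    # eps (an added safeguard) avoids division by zero at full collapse
    z_norm = z_cent / (std.detach() + eps)
    W = torch.randn(D, K, device=z.device, dtype=z.dtype)
    W = W / W.norm(p=2, dim=0, keepdim=True)            # unit-norm directions
    p_sorted = torch.sort(z_norm @ W, dim=0).values     # (N, K)
    u = torch.arange(1, N + 1, device=z.device) / (N + 1)
    target = Normal(0.0, 1.0).icdf(u).unsqueeze(1)      # (N, 1), broadcast over K
    L_shape = (p_sorted - target).pow(2).mean()

    return L_scale + L_shape + L_center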

Scaling Analysis

VISReg has complexity \(\mathcal{O}(NDK)\), linear in all scaling factors. Crucially, the \(K\) random slices can be distributed across GPUs: generating \(K/M\) slices per GPU on \(M\) GPUs yields the same accuracy as \(K\) slices on one GPU. This lets the per-GPU \(K\) stay constant during scaling.
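The arithmetic behind splitting slices across devices can be sketched in plain Python. This is a simulation under assumptions, not the authors' distributed code: each "rank" is just a separate random seed, every rank sees the same toy batch (real data-parallel training would shard the data as well), and combining ranks is modeled as a plain mean, as a mean all-reduce would compute. All names and sizes are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def random_unit_direction(dim, rng):
    """Draw a random direction by normalizing a Gaussian vector to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def slice_losses(z, num_slices, rng):
    """Per-slice mean squared gap between sorted projections and Gaussian quantiles."""
    n, dim = len(z), len(z[0])
    std_normal = NormalDist(0.0, 1.0)
    targets = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    losses = []
    for _ in range(num_slices):
        w = random_unit_direction(dim, rng)
        proj = sorted(sum(zi * wi for zi, wi in zip(row, w)) for row in z)
        losses.append(fmean((p - t) ** 2 for p, t in zip(proj, targets)))
    return losses


data_rng = random.Random(0)
N, D, K, M = 256, 8, 16, 4  # batch size, embedding dim, total slices, simulated GPUs
z = [[data_rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]

# Each simulated rank draws its own K/M directions from a rank-specific seed.
per_rank = [slice_losses(z, K // M, random.Random(1000 + rank)) for rank in range(M)]

# Averaging the per-rank means ...
distributed_loss = fmean(fmean(losses) for losses in per_rank)
# ... equals the mean over all K pooled slices, since every rank holds K/M of them.
pooled_loss = fmean(loss for losses in per_rank for loss in losses)
print(distributed_loss, pooled_loss)
```

The key design point is that ranks must draw different directions (here via distinct seeds); with identical seeds, the \(M\) devices would only repeat the same \(K/M\) slices.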

Linear probe accuracy scaling with number of GPUs at fixed K and D
Figure 3. Linear probe accuracy when scaling GPUs with fixed \(K\) and \(D\). Scaling GPUs compensates for insufficient \(K=\frac{1}{4}D\): with 8× more GPUs, accuracy matches the \(K=2D\) target, making constant \(K\) feasible at scale.

Results

Out-of-Distribution Performance

We evaluate on 6 OOD datasets spanning medical (ChestXRay, RetinaMNIST, OrganAMNIST), space (Galaxy10), aerial (AID), and texture (DTD) domains. VISReg achieves the best average OOD accuracy among all compared methods and backbone scales.

Average OOD accuracy comparison: VISReg vs iBOT, DINO, MoCoV3, I-JEPA, MAE, data2vec
Figure 4. Average OOD linear probe accuracy. VISReg outperforms all methods, including those with heuristics and larger backbones.

10× Data Efficiency

When pre-trained on ImageNet-22K, VISReg with ViT-L/14 achieves comparable OOD performance to DINOv2, which was pre-trained on the 10× larger LVD-142M dataset. This demonstrates the strong generality of representations learned by VISReg.

VISReg-IN22K vs DINOv2-LVD142M average OOD accuracy comparison
Figure 5. VISReg pre-trained on ImageNet-22K matches DINOv2 pre-trained on LVD-142M (10× more data) on OOD benchmarks.

Transfer Learning

Despite having lower linear probe accuracy on in-domain datasets than DINO, VISReg outperforms DINO after fine-tuning on all tested datasets (CIFAR-10, CIFAR-100, Flowers, ImageNet-1K, Galaxy10), indicating stronger transferable representations.

Transfer learning accuracy: VISReg vs DINO vs Supervised on CIFAR10, CIFAR100, Flowers, ImageNet1K, Galaxy10
Figure 6. Transfer learning comparison. VISReg outperforms both DINO and supervised pre-training after fine-tuning on all tested datasets.

Dense Prediction & Generation Guidance

Linear segmentation on ADE20K: MoCoV3, DINO, data2vec, MAE, VISReg
Figure 7. Linear segmentation on ADE20K. Without heuristics, VISReg achieves competitive mIoU, second only to MoCoV3.

Generation guidance via iREPA: DINO vs VISReg on IS, gFID, Precision, Recall
Figure 8. Image generation with SiT-B/2 guided by VISReg vs DINO features. VISReg provides better guidance across all metrics (lower gFID, higher Precision and Recall).

Robustness to Low-Quality Data

On long-tailed (ImageNet-LT) and low-rank (Galaxy10) datasets, VISReg successfully prevents collapse and learns meaningful embeddings, while DINO fails without careful hyperparameter tuning.

Table 1. Linear probe accuracy on ImageNet-LT. ViT-S/8 trained for 400 epochs from scratch. VISReg* achieves the best accuracy at all levels, while DINO fails to learn meaningful embeddings. * denotes an increased weight on the shape loss.

Method    Overall   Many    Medium   Few
SWD       31.85     51.54   22.70    8.36
SIGReg    32.00     51.86   22.88    7.92
VISReg    32.11     51.55   23.19    8.52
VISReg*   35.14     54.49   26.87    9.40
VICReg    33.08     52.29   24.63    8.54
DINO      5.13      12.22   0.82     0.24

Table 2. In-domain linear probe accuracy on Galaxy10. The model is trained from scratch to test the performance of methods on the low-rank task. SIGReg, SWD, and VISReg successfully prevent the training from collapsing while obtaining a good linear probe accuracy, whereas DINO struggles to learn meaningful features. * means increasing the weight of shape loss.

Method   SWD     SIGReg   VISReg   VISReg*   VICReg   DINO
Acc.     80.60   80.50    80.51    80.76     79.93    73.49

Conclusion

VISReg demonstrates that decoupling embedding regularization into scale and shape yields a self-supervised method that is more stable, more efficient, and produces more generalizable representations than existing approaches. Without any training heuristics, VISReg achieves SOTA OOD performance and strong transfer learning capabilities, pointing toward a promising direction for foundation model training.

Citation

@inproceedings{wu2026visreg,
  title     = {VISReg: Variance-Invariance-Sketching Regularization for JEPA training},
  author    = {Wu, Haiyu and Balestriero, Randall and Levine, Morgan},
  booktitle = {arXiv},
  year      = {2026}
}