VISReg: Variance-Invariance-Sketching Regularization for JEPA training

Wu, Haiyu; Balestriero, Randall; Levine, Morgan

Key Findings

1. Robust to low-quality datasets: Methods grounded in the Cramér–Wold theorem are robust to low-quality datasets (e.g., long-tailed, sparse), and the number of slices (projections) used by the regularization loss grows naturally in distributed training, which keeps training stable.

2. Best OOD performance: VISReg has the best OOD linear probe accuracy at ImageNet-1K scale. At a larger scale, VISReg trained on ImageNet-22K (14M images) achieves average accuracy on OOD benchmarks similar to that of DINOv2 trained on LVD-142M (142M images).

3. Good downstream applications: VISReg performs on par with or better than DINOv1 on transfer learning, dense instance prediction, and generation guidance.

Why Decoupled Regularization?

Self-supervised learning methods like DINO, iBOT, and I-JEPA rely on heavy heuristics — EMA updates, teacher-student architectures, stop-gradient, frozen layers — to prevent embedding collapse. LeJEPA showed that regularizing the embedding space to an isotropic Gaussian can replace these heuristics entirely. However, its regularizer (SIGReg) has a critical weakness: its gradient vanishes precisely when the model needs it most — at collapse.

Figure 2. Embedding collapse prevention. We simulate the gradient norm \(\|\nabla\mathcal{L}\|\) of each regularization method (VISReg, SIGReg, Barlow Twins, VICReg, and SWD) at different stages of collapse. VISReg maintains strong gradients at collapse, while SIGReg’s gradient vanishes.

This motivates our approach: decouple the regularization into independent scale (variance) and shape (sketching distribution geometry) components. By operating on these separately, VISReg is more robust to collapse, more efficient in high dimensions, and more resilient to low-quality datasets.

VISReg: Variance-Invariance-Sketching Regularization

VISReg regulates the embedding space through three complementary objectives:

Scale Regularization

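We constrain the per-dimension variance to prevent magnitude collapse: \[\mathcal{L}_{\text{scale}} = \frac{1}{D}\sum_{j=1}^{D}(1 - \sigma_j(\hat{Z}))^2\] where \(\sigma_j(\hat{Z})\) is the standard deviation of the \(j\)-th dimension of the embeddings \(\hat{Z}\) over the batch and \(D\) is the embedding dimension. As the model collapses (\(\sigma_j \to 0\)), the gradient approaches a nonzero constant, ensuring stable recovery.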
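The constant-gradient behaviour can be checked with a short derivation. This is a sketch under stated assumptions: as in Algorithm 1, \(\sigma_j\) is taken to be the biased (population) standard deviation of dimension \(j\) over a batch of \(N\) embeddings, and the symbols \(z_{ij}\) (entry \(j\) of sample \(i\)) and \(\mu_j\) (batch mean of dimension \(j\)) are introduced here for illustration only.

\[\frac{\partial \sigma_j}{\partial z_{ij}} = \frac{z_{ij}-\mu_j}{N\sigma_j}, \qquad \frac{\partial \mathcal{L}_{\text{scale}}}{\partial z_{ij}} = -\frac{2}{D}\,(1-\sigma_j)\,\frac{z_{ij}-\mu_j}{N\sigma_j}\]

\[\left\|\frac{\partial \mathcal{L}_{\text{scale}}}{\partial z_{\cdot j}}\right\|_2 = \frac{2\,|1-\sigma_j|}{D}\cdot\frac{\sqrt{\sum_{i}(z_{ij}-\mu_j)^2}}{N\sigma_j} = \frac{2\,|1-\sigma_j|}{D\sqrt{N}} \;\xrightarrow{\;\sigma_j \to 0\;}\; \frac{2}{D\sqrt{N}}\]

Under these assumptions the per-dimension gradient norm depends on the batch only through \(\sigma_j\), and it tends to the nonzero value \(2/(D\sqrt{N})\) as the dimension collapses rather than vanishing.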

Shape Regularization

After normalizing out scale, we align the distribution geometry of the normalized embeddings \(\tilde{Z}\) to an isotropic Gaussian using the Sliced Wasserstein Distance (SWD), grounded in the Cramér–Wold theorem: \[\mathcal{L}_{\text{shape}} = \frac{1}{K}\sum_{k=1}^{K}\left\|\text{sort}(\tilde{Z}w_k) - q_N\right\|_2^2\] where \(q_N\) is the vector of \(N\) standard Gaussian quantiles and \(w_k\) are the \(K\) random unit-norm projection directions.
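For intuition, the contribution of a single slice can be sketched in plain Python. This is only an illustration, not the authors' implementation: it uses the standard library instead of PyTorch, the helper names `gaussian_quantiles` and `slice_loss` are hypothetical, the quantile levels \(i/(N+1)\) follow Algorithm 1, and the squared gap is averaged over the \(N\) points as the listing does.

```python
from statistics import NormalDist

def gaussian_quantiles(n):
    """Standard-normal quantiles at levels i/(n+1), i = 1..n."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

def slice_loss(projected):
    """Mean squared gap between sorted 1D projections and Gaussian quantiles."""
    n = len(projected)
    target = gaussian_quantiles(n)
    return sum((p - q) ** 2 for p, q in zip(sorted(projected), target)) / n

# A sample that already sits on the Gaussian quantiles has zero loss,
# whatever order its points arrive in (sorting removes the order).
on_target = list(reversed(gaussian_quantiles(5)))

# A sample concentrated on two values is penalised.
two_point = [-1.0, -1.0, 1.0, 1.0, 1.0]

print(slice_loss(on_target), slice_loss(two_point))
```

Sorting is what makes each slice cheap: in one dimension the optimal transport plan simply matches the i-th smallest projection to the i-th quantile, so no pairwise matching is needed.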

Combined Objective

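\[\mathcal{L}_{\text{Reg}} = \mathcal{L}_{\text{scale}} + \mathcal{L}_{\text{shape}} + \mathcal{L}_{\text{center}}\] where \(\mathcal{L}_{\text{center}}\) penalizes the squared batch mean of the embeddings, averaged over dimensions (the center loss in Algorithm 1). Combined with a Euclidean invariant prediction loss, the full VISReg objective is: \[\mathcal{L}_{\text{VISReg}} = (1-\lambda)\mathcal{L}_{\text{pred}} + \lambda\mathcal{L}_{\text{Reg}}\] where \(\lambda\) trades off prediction against regularization.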

Algorithm 1. Regularization part of VISReg in PyTorch (~15 lines of code)

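import torch
from torch.distributions import Normal

def visreg(z, K=64, eps=1e-6):
    # z: (N, D) batch of embeddings; K: number of random slices.
    # eps is an added guard against division by zero (not in the original listing).
    N, D = z.shape

    # 1. Center loss: pull the batch mean toward the origin
    mu = z.mean(dim=0)
    L_center = mu.pow(2).mean()

    # 2. Scale loss: pull the per-dimension std toward 1
    z_cent = z - mu
    std = z_cent.std(dim=0, unbiased=False)
    L_scale = (1.0 - std).pow(2).mean()

    # 3. Shape loss: Sliced Wasserstein Distance
    # Normalize out scale; std is detached so this term does not act on it
    z_norm = z_cent / (std.detach() + eps)
    # K random unit-norm projection directions, one per column
    W = torch.randn(D, K, device=z.device, dtype=z.dtype)
    W = W / W.norm(p=2, dim=0)
    p_sorted = torch.sort(z_norm @ W, dim=0).values  # (N, K)
    # Standard-normal quantiles at levels i/(N+1), shaped (N, 1) to broadcast over slices
    u = torch.arange(1, N + 1, device=z.device, dtype=torch.float32) / (N + 1)
    target = Normal(0.0, 1.0).icdf(u).unsqueeze(1).to(z.dtype)
    L_shape = (p_sorted - target).pow(2).mean()

    return L_scale + L_shape + L_center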

Scaling Analysis

VISReg has complexity \(\mathcal{O}(NDK)\), linear in each of the batch size \(N\), the embedding dimension \(D\), and the number of slices \(K\). Crucially, the \(K\) random slices can be distributed across GPUs: generating \(K/M\) slices per GPU on \(M\) GPUs yields the same accuracy as \(K\) slices on one GPU. The per-GPU slice count can therefore stay constant as training scales.
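The bookkeeping behind slice distribution can be illustrated with a small standard-library simulation. It is a sketch under simplifying assumptions, not the authors' training code: the ranks are simulated in one process, every rank sees the same batch (in data-parallel training each GPU would also hold different samples), and the helper names and sizes are hypothetical. It only shows that averaging per-rank slice losses equals evaluating all \(K\) directions on one device; the accuracy claim itself is the empirical result reported above.

```python
import math
import random
from statistics import NormalDist

def unit_directions(num, dim, seed):
    """Draw `num` random unit vectors in R^dim from a seeded generator."""
    rng = random.Random(seed)
    dirs = []
    for _ in range(num):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        dirs.append([x / norm for x in w])
    return dirs

def shape_loss(batch, dirs):
    """Average sliced loss of `batch` (list of rows) over the given directions."""
    n = len(batch)
    target = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    total = 0.0
    for w in dirs:
        proj = sorted(sum(z * wi for z, wi in zip(row, w)) for row in batch)
        total += sum((p - q) ** 2 for p, q in zip(proj, target)) / n
    return total / len(dirs)

K, M, D, N = 8, 4, 3, 16  # hypothetical sizes for illustration
data_rng = random.Random(0)
batch = [[data_rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]

# Each simulated rank draws its own K/M directions from a rank-specific seed.
per_rank_dirs = [unit_directions(K // M, D, seed=rank) for rank in range(M)]
per_rank_loss = [shape_loss(batch, dirs) for dirs in per_rank_dirs]

# Averaging the per-rank losses (the role an all-reduce mean would play) ...
distributed = sum(per_rank_loss) / M

# ... agrees with one device evaluating all K directions at once.
all_dirs = [w for dirs in per_rank_dirs for w in dirs]
single = shape_loss(batch, all_dirs)
print(abs(distributed - single))
```

Seeding each rank differently matters in this sketch: if every rank drew the same directions, the effective number of distinct slices would stay at \(K/M\) no matter how many GPUs were added.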

Figure 6. Linear probe accuracy when scaling the number of GPUs with fixed \(K\) and \(D\). Scaling GPUs compensates for an insufficient \(K=\frac{1}{4}D\): with 8× more GPUs, accuracy matches the \(K=2D\) target, making constant \(K\) feasible at scale.

Results

Out-of-Distribution Performance

We evaluate on 6 OOD datasets spanning medical (ChestXRay, RetinaMNIST, OrganAMNIST), space (Galaxy10), aerial (AID), and texture (DTD) domains. VISReg achieves the best average OOD accuracy across all methods and backbone scales.

Figure 4. Average OOD linear probe accuracy of VISReg compared with iBOT, DINO, MoCoV3, I-JEPA, MAE, and data2vec. VISReg outperforms all methods, including those with heuristics and larger backbones.

10× Data Efficiency

When pre-trained on ImageNet-22K, VISReg with ViT-L/14 achieves comparable OOD performance to DINOv2, which was pre-trained on the 10× larger LVD-142M dataset. This demonstrates the strong generality of representations learned by VISReg.

Figure 5. Average OOD accuracy of VISReg pre-trained on ImageNet-22K versus DINOv2 pre-trained on LVD-142M. VISReg matches DINOv2 on OOD benchmarks even though DINOv2 was pre-trained on 10× more data.

Transfer Learning

Despite having lower linear probe accuracy on in-domain datasets than DINO, VISReg outperforms DINO after fine-tuning on all tested datasets (CIFAR-10, CIFAR-100, Flowers, ImageNet-1K, Galaxy10), indicating stronger transferable representations.

Dense Prediction & Generation Guidance

Figure 7. Linear segmentation on ADE20K for MoCoV3, DINO, data2vec, MAE, and VISReg. Without heuristics, VISReg achieves competitive mIoU, second only to MoCoV3.

Figure 8. Generation guidance via iREPA: image generation with SiT-B/2 guided by VISReg vs DINO features, reported on IS, gFID, Precision, and Recall. VISReg provides better guidance across all metrics (lower gFID, higher Precision and Recall).

Robustness to Low-Quality Data

On long-tailed (ImageNet-LT) and low-rank (Galaxy10) datasets, VISReg successfully prevents collapse and learns meaningful embeddings, while DINO fails without careful hyperparameter tuning.

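Table 1. Linear probe accuracy on ImageNet-LT. ViT-S/8 trained for 400 epochs from scratch. With an increased shape-loss weight (VISReg*), our method outperforms all other methods on every split; with the default weight, VISReg is comparable to SWD and SIGReg and slightly below VICReg. DINO fails to learn meaningful embeddings. * means increasing the weight of shape loss.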

Method	Overall	Many	Medium	Few
SWD	31.85	51.54	22.70	8.36
SIGReg	32.00	51.86	22.88	7.92
VISReg	32.11	51.55	23.19	8.52
VISReg*	35.14	54.49	26.87	9.40
VICReg	33.08	52.29	24.63	8.54
DINO	5.13	12.22	0.82	0.24

Table 2. In-domain linear probe accuracy on Galaxy10. The model is trained from scratch to test the performance of methods on the low-rank task. SIGReg, SWD, and VISReg successfully prevent the training from collapsing while obtaining a good linear probe accuracy, whereas DINO struggles to learn meaningful features. * means increasing the weight of shape loss.

Method	SWD	SIGReg	VISReg	VISReg*	VICReg	DINO
Acc.	80.60	80.50	80.51	80.76	79.93	73.49

Conclusion

VISReg demonstrates that decoupling embedding regularization into scale and shape yields a self-supervised method that is more stable, more efficient, and produces more generalizable representations than existing approaches. Without any training heuristics, VISReg achieves SOTA OOD performance and strong transfer learning capabilities, pointing toward a promising direction for foundation model training.

Citation

@inproceedings{wu2026visreg,
  title     = {VISReg: Variance-Invariance-Sketching Regularization for JEPA training},
  author    = {Wu, Haiyu and Balestriero, Randall and Levine, Morgan},
  booktitle = {arXiv},
  year      = {2026}
}
