JEPA Explained: How Yann LeCun's AI Predicts Meaning, Not Pixels
JEPA is Yann LeCun's self-supervised AI architecture that predicts abstract meaning instead of raw pixels or words, powering Meta's world models.
Most AI models today are obsessed with details. Show an image model a photo with part of it blacked out, and it will try to repaint every pixel, down to the exact shade of blue in the sky. Show a language model half a sentence, and it predicts the next word, one token at a time.
That sounds impressive, but it's also wasteful. A huge chunk of compute goes into guessing things that don't actually matter, like the precise texture of grass or the exact pixel value of a shadow. Humans don't learn this way. If you watch a ball get thrown into the air, you don't predict its exact RGB values frame by frame. You just know it's going to come back down.
This is the gap Yann LeCun set out to close. His answer is called JEPA, the Joint Embedding Predictive Architecture. Instead of predicting raw data, JEPA predicts the meaning of data. It is the foundation behind Meta's I-JEPA, V-JEPA, and V-JEPA 2 models, and it's quickly becoming the backbone of what LeCun calls "world models." Here's what it actually does, how it works, and how you can try it yourself.
What Is JEPA and Why Did LeCun Propose It?
JEPA stands for Joint Embedding Predictive Architecture. Yann LeCun proposed it in 2022 as a way of learning that predicts in an abstract space of meaning rather than in the raw space of pixels or words.
The core bet is simple: predicting what something means and skipping the unpredictable details is closer to how humans and animals actually learn.
LeCun's complaint about earlier architectures is specific. He argues that generative models waste compute on reconstruction and focus too much on low-level features, and that decoding outputs back into pixels is genuinely hard in many domains. He also points out a separate problem with contrastive learning, the other popular self-supervised method: it depends heavily on hand-crafted data augmentations and can suffer from representation collapse, where the model learns to output the same thing for everything.
JEPA tries to dodge both problems at once.
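To see why representation collapse is a real risk for any embedding-prediction setup, here is a toy illustration in plain Python. Both encoders are made up for this example, and the variance check is just one simple diagnostic, not something taken from the JEPA papers. The point: an encoder that outputs the same vector for everything gets a perfect prediction score while learning nothing.

```python
from statistics import pvariance

def collapsed_encoder(_inputs):
    """Degenerate encoder (hypothetical): ignores its input entirely."""
    return [1.0, 1.0]

def healthy_encoder(inputs):
    """Toy encoder (hypothetical) whose output depends on the input."""
    return [sum(inputs), max(inputs) - min(inputs)]

def prediction_loss(a, b):
    """Squared distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def embedding_variance(encoder, batch):
    """Mean per-dimension variance of embeddings over a batch (0 means collapse)."""
    embeddings = [encoder(sample) for sample in batch]
    dims = list(zip(*embeddings))
    return sum(pvariance(dim) for dim in dims) / len(dims)

batch = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2], [0.0, 0.3]]

# A collapsed encoder makes every "prediction" trivially perfect...
trivial_loss = prediction_loss(collapsed_encoder(batch[0]), collapsed_encoder(batch[1]))
# ...but its embeddings carry no information, which the variance exposes.
collapsed_var = embedding_variance(collapsed_encoder, batch)
healthy_var = embedding_variance(healthy_encoder, batch)

print(trivial_loss, collapsed_var, healthy_var > 0)
```

A low loss on its own therefore says little. You also want evidence that the embeddings still vary across inputs.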
How Does JEPA Actually Work?
Strip away the jargon and JEPA has three simple parts:
| Component | What it does |
|---|---|
| Context Encoder | Turns the visible part of the input (e.g. an unmasked region of an image) into an embedding |
| Target Encoder | Turns the hidden/target part of the input into a separate embedding (used only during training) |
| Predictor | Takes the context embedding and tries to guess the target embedding, sometimes with help from a latent variable that absorbs whatever can't be predicted |
The model never tries to reconstruct the missing pixels. It only compares embeddings to embeddings. If the predicted embedding is close to the real one, the model did a good job. This setup can be viewed as an energy-based model: it assigns low energy when the prediction matches the actual target, and high energy when it doesn't.
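Here is a toy sketch of that three-part structure in plain Python. Everything in it is an assumption for illustration: the random linear maps standing in for the two encoders and the predictor, the 4-value inputs, and squared Euclidean distance as the energy. Real JEPA models use Vision Transformers with trained weights; this only shows where the comparison happens.

```python
import random

def linear(weights, x):
    """Apply a bias-free dense layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def energy(predicted, target):
    """Squared Euclidean distance between embeddings (assumed energy function)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
context_encoder = random_matrix(3, 4, rng)  # visible input -> 3-d embedding
target_encoder = random_matrix(3, 4, rng)   # hidden input -> 3-d embedding
predictor = random_matrix(3, 3, rng)        # context embedding -> predicted target embedding

visible_part = [0.2, 0.9, 0.4, 0.1]  # made-up "unmasked" input
hidden_part = [0.3, 0.8, 0.5, 0.0]   # made-up "masked" input

context_embedding = linear(context_encoder, visible_part)
target_embedding = linear(target_encoder, hidden_part)
predicted_embedding = linear(predictor, context_embedding)

# Embeddings are compared with embeddings; nothing is decoded back to pixels.
loss = energy(predicted_embedding, target_embedding)
perfect = energy(target_embedding, target_embedding)
print(f"energy of untrained prediction: {loss:.4f}")
print(f"energy of a perfect prediction: {perfect:.4f}")
```

Training would adjust the weights to push the first number toward the second, while some extra mechanism keeps the encoders from collapsing to a constant output.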
Why does skipping pixels matter? Because, as one researcher put it plainly: if you're predicting what happens next in a video, you don't need to know the exact shade of blue in the sky three seconds from now. You need to know whether the car turns left or right. JEPA is built to focus on exactly that kind of meaningful signal and throw away the noise.
JEPA vs Generative Models vs Contrastive Learning
People often confuse JEPA with the two other major self-supervised approaches. Here's the quick breakdown:
| Approach | What it predicts | Main weakness |
|---|---|---|
| Generative (e.g. masked autoencoders) | Raw pixels or tokens | Wastes compute reconstructing irrelevant detail |
| Contrastive (e.g. SimCLR) | Whether two augmented views match | Needs heavy augmentation and large batches; prone to collapse |
| JEPA | Abstract embeddings of missing/future parts | Needs careful design to avoid trivial shortcuts (collapse) |
LeCun's view is that contrastive learning gives an extremely sparse training signal, which forces models to need huge batch sizes and massive datasets to train well. JEPA tries to get a richer signal without that cost.
The Real Models: I-JEPA, V-JEPA, and V-JEPA 2
JEPA started as a theory paper, but Meta has since shipped working models.
I-JEPA (2023) works on still images. It masks out blocks of an image and asks the predictor to guess the embedding of the masked region from the visible context.
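A rough sketch of block masking on a patch grid, in plain Python. The 14x14 grid, the four 4x4 target blocks, and the function names are all assumptions chosen for illustration, and the context here is simply "every patch that is not a target", which is simpler than the sampling used in Meta's code.

```python
import random

def sample_block(grid_h, grid_w, block_h, block_w, rng):
    """Return the patch indices of one random rectangular block on the grid."""
    top = rng.randint(0, grid_h - block_h)    # randint is inclusive on both ends
    left = rng.randint(0, grid_w - block_w)
    return {
        row * grid_w + col
        for row in range(top, top + block_h)
        for col in range(left, left + block_w)
    }

def make_masks(grid_h=14, grid_w=14, num_targets=4, block_h=4, block_w=4, seed=0):
    """Pick target blocks, then treat every remaining patch as context."""
    rng = random.Random(seed)
    target_blocks = [
        sample_block(grid_h, grid_w, block_h, block_w, rng)
        for _ in range(num_targets)
    ]
    hidden = set().union(*target_blocks)
    context = set(range(grid_h * grid_w)) - hidden
    return context, target_blocks

context, target_blocks = make_masks()
print(len(context), [len(block) for block in target_blocks])
```

The context encoder would only see the patches in `context`, and the predictor would be asked for the embedding of each target block in turn.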
V-JEPA (2024) extends the idea to video. V-JEPA learns visual representations purely by watching video, without using pretrained image encoders, text, labels, or pixel-level reconstruction.
V-JEPA 2 (2025) scales this up massively. It's a 1.2 billion-parameter model built on the same joint-embedding predictive architecture first introduced in 2022. It has two parts: an encoder that turns raw video into embeddings capturing the state of the world, and a predictor that takes a video embedding plus context about what to predict and outputs predicted embeddings.
V-JEPA 2 is trained in two stages:
- Self-supervised pretraining on roughly a million hours of unlabeled internet video, learning concepts like gravity, motion, and object permanence just from watching.
- Action-conditioned fine-tuning (called V-JEPA 2-AC) on a small set of robot arm videos, so the model can connect its world knowledge to real actions like reaching and grasping.
This is what lets V-JEPA 2 do zero-shot robot planning, interacting with unfamiliar objects in new environments after training on just 62 hours of robot data.
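The planning idea can be sketched with a toy example: imagine the outcome of many candidate action sequences in embedding space, then keep the sequence that ends closest to a goal embedding. This is a plain random-shooting planner, used here as a simple stand-in and not Meta's actual planner. The `predict_next` function, the 2-d embeddings, and the additive dynamics are all hypothetical.

```python
import random

def predict_next(state, action):
    """Hypothetical stand-in for an action-conditioned predictor."""
    return [s + a for s, a in zip(state, action)]

def distance(a, b):
    """L1 distance between two embeddings."""
    return sum(abs(x - y) for x, y in zip(a, b))

def plan(start, goal, horizon=3, num_candidates=200, seed=0):
    """Random shooting: sample action sequences, keep the one ending nearest the goal."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_candidates):
        actions = [[rng.uniform(-1, 1) for _ in start] for _ in range(horizon)]
        state = start
        for action in actions:
            state = predict_next(state, action)  # roll forward in embedding space
        cost = distance(state, goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

start_embedding = [0.0, 0.0]   # made-up embedding of the current observation
goal_embedding = [1.5, -1.0]   # made-up embedding of the desired outcome
actions, cost = plan(start_embedding, goal_embedding)
do_nothing_cost = distance(start_embedding, goal_embedding)
print(f"best imagined cost: {cost:.3f} (doing nothing: {do_nothing_cost:.3f})")
```

No pixels are generated at any point. The model only needs its predicted embedding to land near the goal embedding.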
Folder Structure of a JEPA Codebase
If you look at Meta's official I-JEPA repository, the structure is fairly approachable:
```
ijepa/
├── configs/            # YAML configs for each experiment
├── src/
│   ├── train.py        # the I-JEPA training loop
│   ├── helper.py       # model/optimizer init, checkpoint loading
│   ├── transforms.py   # pretraining data transforms
│   ├── datasets/       # datasets and data loaders
│   ├── models/         # encoder, predictor model definitions
│   └── masks/          # mask collators, masking utilities
└── main.py             # entry point for training
```
Training is launched by pointing main.py at a config file and a list of GPU devices, for example three local GPUs with a given config. Note that the largest configs are meant to run on 16 A100 80GB GPUs to reproduce the paper's results, so this isn't a laptop-friendly training job.
How to Try JEPA Yourself (Code Examples)
You don't need to train a JEPA from scratch. Meta's pretrained checkpoints are on Hugging Face, and you can run them with a few lines of transformers code.
Install what you need:
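```bash
# accelerate is needed for device_map="auto"; torchcodec and numpy are used in Example 3
pip install -U transformers torch pillow requests datasets accelerate torchcodec numpy
```
Example 1: Image classification with I-JEPA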
```python
import torch
from datasets import load_dataset
from transformers import AutoImageProcessor, IJepaForImageClassification

# Load a sample image from the Hugging Face Hub.
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

image_processor = AutoImageProcessor.from_pretrained("facebook/ijepa_vith14_1k")
model = IJepaForImageClassification.from_pretrained("facebook/ijepa_vith14_1k")

inputs = image_processor(image, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```
This loads the I-JEPA ViT-H/14 checkpoint pretrained on ImageNet-1k and runs it through a classification head. One caveat: the checkpoint is a self-supervised backbone, so if transformers warns that the classifier weights were newly initialized, the printed label is not meaningful until you fine-tune that head on labeled data.
Example 2: Comparing two images by embedding similarity
```python
import requests
import torch
from PIL import Image
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoProcessor

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"
image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)

processor = AutoProcessor.from_pretrained("facebook/ijepa_vith14_1k")
model = AutoModel.from_pretrained(
    "facebook/ijepa_vith14_1k", attn_implementation="sdpa", device_map="auto"
)

def infer(image):
    """Return one embedding per image by averaging its patch embeddings."""
    inputs = processor(image, return_tensors="pt").to(model.device)
    with torch.no_grad():  # inference only
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

embed_1 = infer(image_1)
embed_2 = infer(image_2)
similarity = cosine_similarity(embed_1, embed_2)
print(similarity)
```
This is the practical use case for JEPA embeddings: comparing meaning, not pixels. Two images of different cats should score high similarity even though their raw pixels are completely different.
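If the pooling and similarity steps feel opaque, here is the same arithmetic on tiny made-up vectors in plain Python. The 2-d "patch embeddings" are invented for illustration and do not come from any model.

```python
import math

def mean_pool(token_embeddings):
    """Average per-patch embeddings into a single vector per image."""
    count = len(token_embeddings)
    return [sum(column) / count for column in zip(*token_embeddings)]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented patch embeddings for three images, three patches each.
cat_a = [[0.9, 0.1], [0.8, 0.3], [1.0, 0.2]]
cat_b = [[0.7, 0.2], [0.9, 0.1], [0.8, 0.3]]
truck = [[0.1, 0.9], [0.2, 1.0], [0.0, 0.8]]

sim_cats = cosine(mean_pool(cat_a), mean_pool(cat_b))
sim_cat_truck = cosine(mean_pool(cat_a), mean_pool(truck))
print(f"cat vs cat: {sim_cats:.3f}, cat vs truck: {sim_cat_truck:.3f}")
```

Mean pooling throws away where each patch was, which is usually fine for a whole-image comparison like this one.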
Example 3: Video feature extraction with V-JEPA 2
```python
import numpy as np
import torch
from torchcodec.decoders import VideoDecoder
from transformers import AutoModel, AutoVideoProcessor

hf_repo = "facebook/vjepa2-vitl-fpc64-256"
model = AutoModel.from_pretrained(hf_repo, device_map="auto", attn_implementation="sdpa")
processor = AutoVideoProcessor.from_pretrained(hf_repo)

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
vr = VideoDecoder(video_url)

# Take the first 64 frames; a different sampling strategy can be swapped in here.
frame_idx = np.arange(0, 64)
video = vr.get_frames_at(indices=frame_idx).data

inputs = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():  # inference only
    outputs = model(**inputs)

# Encoder embeddings of the video.
encoder_outputs = outputs.last_hidden_state
# Embeddings predicted by the predictor.
predictor_outputs = outputs.predictor_output.last_hidden_state
```
This snippet loads the V-JEPA 2 model, takes the first 64 frames of a video, and returns both the encoder's embeddings and the predictor's predicted embeddings.
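Taking the first 64 frames only covers the start of a longer clip. One common alternative is to spread the indices evenly over the whole video. The helper below is a hypothetical sketch in plain Python, not part of transformers or torchcodec; its output could replace `frame_idx` in the example above once you know the clip's total frame count.

```python
def uniform_frame_indices(total_frames, num_samples=64):
    """Spread num_samples indices evenly over [0, total_frames - 1]."""
    if total_frames < 1 or num_samples < 1:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    # Rounding may repeat indices when the clip has fewer frames than samples.
    return [round(i * step) for i in range(num_samples)]

indices = uniform_frame_indices(total_frames=300, num_samples=64)
print(indices[:5], indices[-1])
```

For clips shorter than 64 frames this repeats some frames, which keeps the input length fixed at the cost of duplicated content.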
Why Should Developers Care About JEPA?
A few practical reasons this matters beyond the research papers:
- Cheaper representation learning. JEPA skips pixel-level reconstruction and does not depend on hand-crafted data augmentations, so each input is processed as a single view, and only the visible context blocks go through the context encoder.
- Better transfer to downstream tasks. Embeddings trained this way tend to generalize well to classification and other tasks without retraining the whole model.
- A real path toward robotics. V-JEPA 2-AC shows a working example of a model planning physical actions from a learned world model, not just describing the world in text.
- It's open source. Meta has released I-JEPA, V-JEPA, and V-JEPA 2 code and weights, so you can experiment without needing massive compute.
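The "transfer without retraining" point can be sketched with a frozen-embedding classifier. The example below uses a nearest-centroid rule as a deliberately simple stand-in for the linear probes used in the literature, and the 2-d embeddings and labels are invented. In practice the vectors would come from a frozen JEPA encoder.

```python
def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def fit_centroids(embeddings, labels):
    """Compute one mean embedding per class; the encoder is never updated."""
    by_class = {}
    for vector, label in zip(embeddings, labels):
        by_class.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in by_class.items()}

def predict(centroids, vector):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vector))

# Invented frozen embeddings standing in for encoder outputs.
train_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["cat", "cat", "truck", "truck"]

centroids = fit_centroids(train_embeddings, train_labels)
prediction = predict(centroids, [0.7, 0.3])
print(prediction)
```

Only the tiny classifier on top is fitted. That is why good embeddings make downstream tasks cheap.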
Limitations of JEPA Right Now
JEPA isn't a finished product. A few honest caveats:
- It currently learns and predicts at a single time scale, while many real tasks need planning across multiple time scales, like breaking "bake a cake" into smaller steps.
- Latent prediction alone does not guarantee that the model avoids representation collapse or anisotropy, which is why follow-up work like LeJEPA exists to add stability guarantees.
- Most JEPA models today are single-modality (just images, or just video). Meta has said multimodal JEPA models that combine vision, audio, and touch are still a future direction, not something available yet.
- Training the larger configs still requires serious GPU clusters, not consumer hardware.
Q&A
1. What does JEPA stand for?
Joint Embedding Predictive Architecture. It's a self-supervised learning method proposed by Yann LeCun.
2. Who created JEPA and when?
Yann LeCun proposed the concept in 2022. The first working model, I-JEPA, was published in 2023.
3. Is JEPA a generative model like GPT or Stable Diffusion?
No. JEPA is non-generative. It never outputs pixels or text directly. It only predicts embeddings, which are compact numerical representations of meaning.
4. What's the difference between I-JEPA and V-JEPA?
I-JEPA works on still images and predicts the embeddings of masked blocks. V-JEPA and V-JEPA 2 work on video and predict the embeddings of masked or future video regions, which lets them learn motion and physics.
5. Can I run JEPA models on my own computer?
Yes, for inference. Pretrained I-JEPA and V-JEPA 2 checkpoints are on Hugging Face and run with standard transformers code on a single GPU, or even CPU for smaller tests. Training from scratch needs much more hardware.
6. Why does LeCun think JEPA is better than predicting pixels?
Because most pixel-level detail is unpredictable noise that doesn't help with understanding or planning. Predicting meaning instead lets the model focus compute on what actually matters.
7. Does JEPA use labeled data?
No. JEPA is self-supervised. It learns by predicting parts of the input from other parts, with no human annotations needed during pretraining.
8. What is V-JEPA 2-AC?
It's a version of V-JEPA 2 fine-tuned on a small amount of robot trajectory data so it can plan physical actions, like grasping objects, using its learned world model.
9. Is JEPA related to "world models"?
Yes. JEPA is the architecture LeCun uses to build what he calls a world model, a system that predicts how the world will change and can plan actions based on that prediction.
10. Where can I find the official JEPA code?
Meta's official repositories are on GitHub under facebookresearch/ijepa, facebookresearch/jepa, and facebookresearch/vjepa2, with pretrained weights also mirrored on Hugging Face.
References
- Introducing the V-JEPA 2 world model and new benchmarks for physical reasoning - https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
- Joint Embedding Predictive Architecture (JEPA): A Complete, In-Depth Guide - https://nextwaves.com/blog/joint-embedding-predictive-architecture-jepa-a-complete-in-depth-guide
- I-JEPA Model Documentation - https://huggingface.co/docs/transformers/en/model_doc/ijepa
- V-JEPA 2 Model Documentation - https://huggingface.co/docs/transformers/model_doc/vjepa2
- GitHub: facebookresearch/ijepa: Official codebase for I-JEPA - https://github.com/facebookresearch/ijepa
