bouncer

Jeremy Howard · 2.1K views · 54 likes

Analysis Summary

20% Minimal Influence
Scale: mild · moderate · severe

“Be aware that while the technical explanation is sound, the framing of this specific research as a 'substantive breakthrough' reflects the host's personal academic bias rather than a universal industry consensus.”

Transparency: Transparent
Human Detected: 98%

Signals

The transcript exhibits clear human characteristics including natural filler words, spontaneous conversational dynamics between two speakers, and specific personal context that AI narrators currently lack. The content is a live educational demonstration with unscripted verbal patterns.

Natural Speech Disfluencies: Frequent use of 'um', 'so', 'I guess', and self-corrections like 'Facebook or Meta I guess'.
Conversational Interruption and Flow: Natural back-and-forth dialogue between Jonno and Jeremy with overlapping context and informal phrasing ('crosseyed', 'scribbling').
Personal Anecdotes and Context: References to specific office hours, personal opinions on Yann LeCun's career moves, and internal course context.

Worth Noting

Positive elements

  • This video provides a practical, high-quality framework for using modern AI tools to parse and implement complex academic research that would otherwise be inaccessible to non-experts.

Be Aware

Cautionary elements

  • The video frames a specific, non-consensus research path (JEPA) as a 'substantive breakthrough' to create a sense of urgency for the viewer to learn this specific material.

Influence Dimensions

About this analysis

Knowing about these techniques makes them visible, not powerless. The ones that work best on you are the ones that match beliefs you already hold.

This analysis is a tool for your own thinking — what you do with it is up to you.

Analyzed: March 13, 2026 at 16:07 UTC · Model: google/gemini-3-flash-preview-20251217 · Prompt Pack: bouncer_influence_analyzer 2026-03-08a · App Version: 0.1.0
Transcript

Um, so I'm going to go um hopefully you can see the the reading papers example that we were looking at. Um, so this was me figuring out what is a good recommendation. It um doesn't ultimately matter that much, but the goal is to look at a recent paper, 11th of November, from Yann LeCun and colleagues at Facebook or Meta I guess, his last paper I believe since he's now left or at least announced he's leaving. And this was one that we looked at during the office hours and Jeremy said, "Oh, I think only you know a handful of people made it. So it'd be it would be good to show how you went through it and also what we what we discovered in that process there." Um, and so to contextualize this paper, I could just try and show straight away what I did. Um, but then I know that the bulk of the audience is going to be a little bit crosseyed because it does cover on this is the end of maybe, you know, half a decade of papers building on these ideas um, in machine learning and learning representations. Um, and so I'm going to try and give a little bit of an explore of the background to this paper. Um and so yeah, we're gonna just dig into a little bit of scribbling on a whiteboard um to try and just >> high level context. Jono is like, >> you know, Yann LeCun is one of the three Turing Award winners, the kind of so-called godfathers of deep learning. He has some fairly strong theoretical basis for kind of saying he doesn't think we're actually on the path to AGI with LLMs and that there's actually a different approach. He's uh his different approach is basically the paper you're showing is kind of the latest in the sequence of research that he's doing to try to build out this different approach. So, it's kind of like a very foundational piece of research, but it's also very kind of it's true research in the sense it's quite early.
It's not one of these things that's at a point of like, okay, everything's now working, but it's uh this is where this is where things start, but it's quite a substantive breakthrough, I thought, in progress in this area, >> progress along a path that isn't the main line, right? Like everybody else is maybe running one way, it's progress along an attempt to run a slightly different way. >> Um, fantastic. Okay, so hopefully you can see um I've got a little whiteboardy thing here and actually before the lesson I was planning to put in some screenshots from the paper and stuff and I tested out um Kimi Slides which is free at the moment with lots of Google's Nano Banana Pro um image generation and so um that's what I'm going to be scribbling over the background here. Um yeah, so as Jeremy was saying, there's this kind of like clash of ideologies with Yann LeCun saying look, just predicting the next token in a sequence of text or next pixels in an image or next frame of video directly, um, that's going to be very useful and get us very far, but it's missing something. And specifically, if you're trying to do kind of long range planning and interaction and understanding of the world, you want something better than just that raw like tokenized text or tokenized video representation. So, world models are something that comes up a lot um as is um yeah, discussions around like what are you trying to learn? What are you trying to model? And so JEPA is his um joint embedding predictive architecture. It's an attempt to frame a way of learning from data such that you build extremely useful representations. And so that's going to be the key of what we're looking at today is like what are these representations? What do we mean by useful representations? What is this new paper doing for how we learn those representations? Um, and so to just spend a bit of time on what this could mean, you can imagine trying to learn a like like something like ChatGPT or or an image model.
You could imagine training it on all the world's data by saying predict the next frame in a video, or given half of an image, reconstruct the other half, or given the start of some text, construct the end of the text, right? And so this is predicting or generating, these are generative methods that are creating sort of the final artifact. Um, and the problem with this is if I give you a photo of a dog and you try and generate back that photo, most of the like quote unquote information in those pixels is going to be things like the fur textures and the grass and the leaves on the tree behind, right? And so that's at a completely different level to what we actually want, which is an understanding of dogness and dog earness and grassness, right? And so if we could have this latent representation that captured those important attributes, but it wasn't so focused on these raw like pixel level um bits and pieces, that would be ideal. So there's been other attempts to do this. You might know like CLIP and SigLIP, these contrastive methods that try and align similar images similarly, or images and text that are related should be aligned. Um, but that's got all sorts of problems around um what you see in your batch and and things that are associated but not necessarily like actually correlated or reliant on each other. And so JEPA is this alternate view where um let's say we have two frames of video or two versions of an image. The goal is to say um for each one here for each for each input I would like to produce an embedding, and this embedding is going to be a compressed representation of our input. Right? So the image might be 500 pixels x 500 pixels. This embedding could be only like a thousand values, right? So it's a highly compressed representation. And this is what we want for when we're imagining like, oh, what if I was planning, you know, 100 steps in the future. I don't want to generate 100 × 500 × 500 pixels.
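The compression being described can be made concrete with a toy sketch. The talk's numbers are a 500×500 image down to about a thousand values; the sketch below scales that down (100×100 in, 64 out) so the stand-in projection matrix stays small, and the fixed random projection is a hypothetical stand-in for a trained encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder": a fixed random projection standing in for a trained network.
# Only the idea (many pixels in, few values out) comes from the talk; the
# dimensions here are scaled down from its 500x500 -> ~1000 example.
in_pixels, embed_dim = 100 * 100, 64
W = rng.standard_normal((in_pixels, embed_dim)) / np.sqrt(in_pixels)

image = rng.standard_normal((100, 100))   # fake "photo of a dog"
embedding = image.reshape(-1) @ W         # 10,000 pixel values -> 64 values

print(image.size, "->", embedding.size)   # a >150x compression
```

The point of the shape change is the one the speaker makes: planning over the small embedding is far cheaper than planning over raw pixels.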
I just want to generate 500 like, oh, I'll walk over there and then I'll climb up the stairs. This more like condensed level representation. Um so if we have one input and we map it to its embedding, and then we take a second input and we map it to its embedding, um our goal is, if those are related, um I should be able to predict the embedding of one from the embedding of the other. So for video, I give you the first second of video. You could say, "Oh, there's a dog that's just run into the room." Like that would that idea would be your compressed representation and then you're saying, "What happens next?" And you could say, "Oh, the dog's going to skid under the table." Right? And so if you look at the first embedding and it captures dog running in on slippery floors, and then you try and predict what's going to happen next, hopefully you've got enough information in that embedding to then capture the embedding of the next second of video, which happens where the dog slips on the floor. Right? So the idea is we want to build up these really rich representations. Um, and so yeah, there's lots of like theoretical arguments for why this would be uh delightfully useful and delicious and better than all of the existing methods. Um, and they have shown various benefits of this, right? There's been papers all from Yann LeCun's lab showing that it does well on various image learning tasks. There was one called V-JEPA that showed the same for video. Um, they seem to learn faster, etc., etc. Um, but there's been a problem, which is that you probably haven't heard of JEPA, right? Um and not that many people have been using it out in the wide world. Um and the problem is the framing of JEPA, to say from one input produce an embedding that's useful for predicting the embedding of another input, that has a very easy solution that the model can come up with, which is to say just map everything to the same embedding. [laughter] Right?
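The degenerate solution just described — map everything to the same embedding — can be shown in a couple of lines. This is a deliberately broken toy, not anyone's real model:

```python
import numpy as np

# Collapse: an "encoder" that maps every input to the same vector.
# Predicting the next embedding then becomes trivially perfect,
# while the representation carries no information at all.
def collapsed_encoder(frame):
    return np.zeros(8)                   # every input -> the same embedding

frame_now = np.random.rand(1000)         # fake "first second of video"
frame_next = np.random.rand(1000)        # fake "next second of video"

z_now = collapsed_encoder(frame_now)
z_next = collapsed_encoder(frame_next)

# "Predict" the next embedding as identical to the current one:
error = np.linalg.norm(z_next - z_now)
print(error)                             # 0.0 -- perfect accuracy, zero value
```

This is exactly the 100%-accuracy-but-useless failure mode the speaker laughs about next.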
Because if I predict zero and then you say okay, what is the next frame? Oh, that's also zero. Like, correct, you've got a 100% accuracy at predicting the embeddings. Right? You're within some infinitesimally small margin of error. Um, but that's not useful at all. That that collapsed uh representation is not actually capturing anything of value. Um, and so there have been all of these attempts within the JEPA line of papers to fix this. Um, things like exponential moving averages where you have two slightly different versions of the encoder, one of which is lagging behind the other and being updated with the sort of uh regular averaging. There's been all of these extra attempts at regularization and stop gradients and duct tape fixes. Um, but it's been in practice fairly tricky to tame these models and to get them to actually learn the kinds of useful representations that we're hoping for. And so that's where we get to the current paper, LeJEPA, um which has this idea of sketched isotropic Gaussian regularization. And this is the kind of mathy term when you're reading the paper that might be very intimidating. And this is the piece that we're going to look at in Solveit to try and make it less intimidating and to try and build some intuition for it. Um, and so the idea here, and we'll look at this in the paper and in Solveit, um, but we want to enforce that our embeddings don't all collapse. They don't skew in some weird way. We want to enforce that they're this nice smooth isotropic Gaussian, right? Isotropic: it's the same variance in all the different dimensions. Um, and you could try and do this by taking your model and embedding all of your data and then doing some statistics on that, but that's going to be incredibly expensive. And so the whole trick of this um paper is finding a way to do this one batch at a time in a way that's GPU friendly, um that's not like high compute, but that still encourages this desirable property.
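The random-projection idea can be sketched crudely: project a batch of embeddings onto random 1-D directions and penalize how far each 1-D slice is from a standard Gaussian. The moment-based check below is a hypothetical stand-in for the paper's actual statistical test, and `gaussianness_penalty` is an illustrative name, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussianness_penalty(embeddings, n_directions=16):
    """Project embeddings onto random 1-D lines and penalize deviation of
    each slice from a standard Gaussian. A crude moment check stands in
    for the paper's real test; the overall shape is what matters here."""
    n, d = embeddings.shape
    total = 0.0
    for _ in range(n_directions):
        a = rng.standard_normal(d)
        a /= np.linalg.norm(a)             # random unit direction
        proj = embeddings @ a              # the batch, seen along one line
        # a standard Gaussian slice has mean 0 and variance 1
        total += proj.mean() ** 2 + (proj.var() - 1.0) ** 2
    return total / n_directions

good = rng.standard_normal((1000, 32))     # already an isotropic Gaussian
bad = np.zeros((1000, 32))                 # fully collapsed embeddings

p_good = gaussianness_penalty(good)
p_bad = gaussianness_penalty(bad)
print(p_good, p_bad)                       # small for good, large for collapsed
```

Because everything is per-batch matrix math, this kind of check is cheap and GPU friendly, which is the property the speaker highlights.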
And a lot of the paper is going into why this property is desirable and and why it's useful. Um but the core idea here, like if you zoom out, we're trying to learn um a nice distribution of embeddings or representations that capture the essence of some data so that we can use those for downstream tasks, whether that's predicting what uh an agent might do next or classifying an image or understanding what's going on in a video. Um the goal is to try and build up these useful representations. Um okay so that's that's the background to to JEPA. Um, hopefully that wasn't too long and boring, but that should give you the context then for when we say, "Okay, so let's look at what did we find in office hours." Um, and so yeah, this was one of the impromptu hangout suggestions. Someone said, "Hey, maybe you could read a paper." And this paper had just come out. Um, so I loaded it up. In this case, I used Jina. Um, and you can see here, "Hey, Solveit, can you give me a quick summary of the paper?" Um and so it's gone and given us the background. Self-supervised learning is this idea of learning from the data without labels. Um that gets rid of these heuristics that others used to have to use um and introduces this novel regularization method. Um and combining that with the standard JEPA ideas, it gives a competitive approach that has far fewer knobs to tweak and seems to do really well even scaling up to a couple of billion parameters. Um, and so then we wanted to look in closer, you know, why might you want these embeddings to follow this particular distribution? Oh, they've got various statistical arguments for why this is particularly useful for downstream tasks. Um, and if you had, you know, skew and high variance in some dimensions but not others, that might be worse. I can kind of get on board with that.
Um and so then um yeah, starting to dig into okay, what does the actual, like, what is this magical thing that they do? Um what does it look like, how does it work? Um and so the idea is that we're going to take random projections. So rather than looking at um the whole high-dimensional embedding space, we're going to take random projections, and then along that line um we're going to use some very clever statistical tests to check for how desirable or how nice the distribution of our set of sample points is. Um and so then I said okay, great, can you show me the code, and in the paper um in the paper they have this code. I think it's like um somewhere somewhere in the uh here we go uh nope, I should just um search for oh, here we go, def sigreg. Um so this is just Solveit, pretty much straight from the paper, um where it says, here we go, sigreg. It added some comments but the code is the same. Getting the device, setting [snorts] up some random seeds we don't need to worry about. Um getting ready to project our input to um some number of dimensions that you're working with. Um and then calculating this A matrix that's going to do the projection, and then okay, picking some points that we're going to be looking at for our test, um and then doing the test and then computing the distance and then returning the results. Um and so when we were looking at this um full disclosure, I I'd maybe seen some discussion that there was something up with the code in the paper. Um we calculated this A matrix, right, the um the code here, and I I went and checked this in the paper. This A matrix is created but it's never used um except that it is now. So this is the updated version of the paper, but when we looked at it in in office hours, you'll have to take my word for it, um, it wasn't there. And it's like, wait, I'm not seeing that used. Is there an error? Um, so it's like, great catch. There is.
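The bug being described — a projection matrix built but never applied — has this general shape. The function names and dimensions here are illustrative, not the paper's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32))     # a batch of full-dimensional embeddings

def project_buggy(x, n_proj=16):
    A = rng.standard_normal((x.shape[1], n_proj))   # projection matrix built...
    return x                                        # ...but never applied (the bug)

def project_fixed(x, n_proj=16):
    A = rng.standard_normal((x.shape[1], n_proj))
    A /= np.linalg.norm(A, axis=0)                  # normalize each direction
    return x @ A                                    # the missing step: project down

print(project_buggy(x).shape)         # (64, 32) -- still full-dimensional
print(project_fixed(x).shape)         # (64, 16) -- actually projected
```

An unused variable like `A` in the buggy version is exactly the kind of "lines of code that don't seem like they're doing anything" that flagged the error in the session.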
Um, and it was able to say like, look, it thinks this is the missing step where we're taking our full dimensional X and we're projecting it down. Um, and so yeah, it was pretty fun to like wait have this little like double take to say, I see some lines of code there that don't seem like they're doing anything. Is this Solveit making a mistake? And then to go to the paper, like, no, actually um it's it's picked out the same error um that others have since noted. Um so then to investigate and to see okay, like does this invalidate all the results? I I don't think so. Like they're generally pretty careful people. Um I loaded up our file creation and editing tools and said okay, can you explore the the repo? So I cloned the um open source code that they have released for the paper and basically just let Solveit go, and it was able to dig around and view the contents of the directory, uh load up different files, and then pull out for us a "here is the version that's built from the published source code" version of the sigreg function, rather than the "oh, we read this from the paper and it's got a mistake" version. Um and so Solveit made it all nice and annotated and um maybe a bit too verbose. I think I had um default mode on. Um but yeah, that was a really interesting exercise in taking a pretty complicated paper with a pretty abstract new formulation, um and then being able to go from that to a piece of code. Um but for me, like I still don't feel like I understand it. Like I get I get the general idea. We're trying to get a nicer distribution. Um but I feel like I don't know. I don't know about you. Like if I look at this, this doesn't really like I don't have an intuition for what it does. I don't know how it works. I don't know what it would do on my particular data. Um so the next part, and this is something that's really nice about being in an interactive environment, was to think like, is there a minimum demo that I can show? Um and so I think I just voice dictated.
I have this idea. What if we just create some like fake embeddings that are in a random or you know like a uniform distribution, some other non-ideal distribution, and then we measured our SIGReg measure of how good this distribution is, and then we updated our embeddings with the gradient of that, so that we kind of push them further and further into alignment with this ideal that's laid out in the paper. Um and so that's what we do here. We create some random samples, like a thousand random samples with 32 dimensions, and we lay them out in a uniform like square grid. Right? So this is not the nice smooth Gaussian that Yann LeCun says is ideal. So what we're going to do is we're going to take a um yeah, switch to concise mode, and um together me and Solveit got a nice simple version of sigreg that is the same algorithm from the paper but without all of the like multi-device multi-GPU um extra fluff, and then we create an optimizer. Now, if you haven't done deep learning, this is all a little bit scary, but the the core idea here is that we're starting with our initial distribution and then we are calculating our quote unquote loss, this measure of how good this distribution is. Um, and then we're updating the values that we're optimizing, in other words, updating our embeddings. Um, and then we are repeating that again and again and again. And when we plot, you can see they start out all spread out. And then over time, they get more and more and more mostly clustered around the center, but still evenly spreading out. In other words, more and more and more looking like an isotropic Gaussian distribution. And this is viewing two of the 32 dimensions, but you could plot any two and you'd see the same pattern. Um, and so now I have a picture in my head that is like, okay, it's not just some magical math function when I'm picturing using this as part of my loss function. Normally, you'd also have some other training objective there.
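The toy experiment just described can be sketched without a deep-learning framework at all. Here a crude per-dimension moment-matching penalty (a stand-in for the real SIGReg loss, with a hand-derived gradient instead of autograd) is minimized by plain gradient descent on the points themselves, pushing a uniform cloud toward an isotropic Gaussian-like distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 fake embeddings in 32 dims, uniformly spread -- not the Gaussian we want.
emb = rng.uniform(-2.0, 2.0, size=(1000, 32))

for step in range(500):
    mu = emb.mean(axis=0)                  # per-dimension mean (target: 0)
    var = emb.var(axis=0)                  # per-dimension variance (target: 1)
    # Gradient w.r.t. each point of loss = sum(mu^2) + sum((var - 1)^2),
    # with the 1/n factors folded into the learning rate below.
    grad = 2 * mu + 4 * (var - 1.0) * (emb - mu)
    emb -= 0.01 * grad                     # nudge points toward the target shape

print(np.abs(emb.mean(axis=0)).max())      # near 0 after optimization
print(np.abs(emb.var(axis=0) - 1).max())   # near 0 after optimization
```

Plotting any two of the 32 dimensions before and after would show the same picture the speaker describes: a spread-out cloud pulled into a smooth, centered, evenly-rounded blob.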
I'm imagining that this part of the loss function is going to be nudging whatever my embeddings happen to look like so that they look more like a nice even smooth distribution. And I've got a little toy that I can play with and poke at. And I can start to understand like, okay, everyone here, I'm getting my input. I'm projecting it down to a lower dimensionality. I'm taking a few points along a line. I'm doing my test to see how good this is. I'm summing those up. I'm returning that as my sort of average error. And the lower I get that, the nicer my distribution is. And so now it feels like, okay, I'm finally starting to understand this paper. Um, and this is all done just piece by piece, you know, somewhat comfortably because this is something I'm quite comfortable with. Um, but at any point you'd have lots of opportunities to dive in a lot more and say, "Okay, actually I'm not comfortable with Fourier features or with matplotlib or with optimizing in a loop in PyTorch." So there's lots of places where you might find a different level of wanting to understand it. But it was a very positive experience for me to take a paper, build an intuition um and yeah explore it interactively with some other people in the office hours chat.

Video description

Using Yann LeCun's LeJEPA paper as an example, Jonno Whittaker walks through his process for how to read an academic paper with an LLM. He sets up the needed context, loads the paper, and then asks for summaries and explanations. Tools allow him to explore the source code repo from within the LLM environment. He concludes by building a simple interactive demo in order to develop intuition for the ideas in the paper.

00:00 - Set up context
00:44 - Background on the paper
09:56 - Ask for summaries and targeted explanations
11:19 - Try the code from the paper
13:44 - Use file tools to explore the source repo
15:06 - Build a minimal interactive demo to develop intuition
17:00 - Iterate

This was part of a lesson in the How to Solve It With Code fast.ai course. Find out more here: https://solve.it.com/

© 2026 GrayBeam Technology Privacy v0.1.0 · ac93850 · 2026-04-03 22:43 UTC