Study log 4
Summary of the day
Learning about momentum, polishing this blog.
Momentum
Yesterday I experienced first-hand how more sophisticated optimizers like AdamW are much less dependent on choosing a good learning rate than simple stochastic gradient descent (SGD). Today I wanted to dive a bit deeper into this topic, so I read this blog post on momentum.
A simplistic explanation for momentum is as follows: The landscape of the loss function of a neural network is full of hills and valleys. Our network starts at a random point in this landscape (random initialization), and through backpropagation and gradient descent we try to find the lowest point in the loss landscape we can. With a fixed learning rate, you might set it too low, meaning you need to train for a long time until you find a local minimum, or too high, skipping over potentially good local minima (or in the worst case not converging at all, possibly even increasing the loss as you oscillate). With momentum, on the other hand, each update also carries over part of the previous updates, so the effective step size changes over time. It is akin to a heavy boulder rolling down the loss landscape: accelerating in downhill sections, rolling over small bumps, and, thanks to friction, slowing down in flat sections.
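To make the boulder picture concrete for myself, here is a minimal sketch of plain gradient descent next to the classic "heavy-ball" form of momentum. Everything specific in it is an assumption for illustration: the toy loss (a long, narrow valley with curvatures 1 and 100), the learning rate of 0.01, the momentum coefficient of 0.9, and the helper name `descend`. It is not taken from the article and not how AdamW works internally.

```python
import numpy as np

# Toy "loss landscape": a long, narrow valley.
# The curvatures 1 and 100 are made-up values for illustration.
curvatures = np.array([1.0, 100.0])

def loss(w):
    return 0.5 * np.sum(curvatures * w**2)

def grad(w):
    return curvatures * w

def descend(w0, lr, beta, steps=100):
    """Gradient descent with heavy-ball momentum; beta=0 is plain descent."""
    w = np.array(w0, dtype=float)
    velocity = np.zeros_like(w)
    for _ in range(steps):
        # Keep a fraction beta of the previous velocity (the "inertia")
        # and add the current gradient (the "slope" under the boulder).
        velocity = beta * velocity + grad(w)
        w = w - lr * velocity
    return w

start = [1.0, 1.0]
w_plain = descend(start, lr=0.01, beta=0.0)
w_momentum = descend(start, lr=0.01, beta=0.9)

print("plain loss:   ", loss(w_plain))
print("momentum loss:", loss(w_momentum))
```

On this toy problem the momentum run should end at a noticeably lower loss after the same number of steps. Part of that gain comes from momentum effectively taking larger steps along the flat direction of the valley, so this is a rough illustration and not a fair benchmark.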
The article is fantastic, though pretty math-heavy. With some examples and a lot of links to more in-depth papers, it shows how you can mathematically prove the benefits of momentum (though not on all domains, since you can construct adversarial examples). In one example, the convergence rate of gradient descent is tied to the condition number (the ratio of the largest to the smallest eigenvalue) of a matrix that appears in the equation defining the problem, and momentum improves both the rate and the range of learning rates that converge (which matches my experience from yesterday).
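As a rough numerical companion to the condition-number point, the sketch below uses the standard textbook results for minimizing a convex quadratic: with the best possible settings, plain gradient descent contracts the error by about (κ - 1) / (κ + 1) per step, heavy-ball momentum by about (√κ - 1) / (√κ + 1), and the largest stable learning rate grows from 2 / λ_max to (2 + 2β) / λ_max. I am assuming these are the same results the article discusses; the matrix `A` and the helper `max_stable_lr` are hypothetical and only there for illustration.

```python
import numpy as np

# Hypothetical symmetric positive-definite matrix with eigenvalues 1 and 100.
A = np.array([[50.5, 49.5],
              [49.5, 50.5]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigenvalues = np.linalg.eigvalsh(A)
lam_min, lam_max = eigenvalues[0], eigenvalues[-1]

# Condition number: ratio of largest to smallest eigenvalue.
kappa = lam_max / lam_min

# Best achievable per-step contraction factors on a quadratic (smaller is faster).
rate_gd = (kappa - 1) / (kappa + 1)
rate_momentum = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

def max_stable_lr(lam_max, beta=0.0):
    """Upper bound on the learning rate that still converges on a quadratic."""
    return (2 + 2 * beta) / lam_max

print("condition number:       ", kappa)
print("rate, gradient descent: ", rate_gd)
print("rate, momentum:         ", rate_momentum)
print("max lr without momentum:", max_stable_lr(lam_max))
print("max lr with beta = 0.9: ", max_stable_lr(lam_max, beta=0.9))
```

The worse the conditioning, the closer the plain rate gets to 1 (very slow), while the square root in the momentum rate softens that penalty. The wider stable learning-rate range is the part that lines up with what I saw yesterday.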
Finally setting up this blog
I spent a few hours polishing this blog: enabling comments via giscus, writing a privacy policy and some legal disclaimers, and having my girlfriend do some beta testing and proofreading. Overall this was super easy thanks to Claude Code, and I definitely would have struggled much more in the pre-LLM era. I also noticed that there are no actual blog posts yet. I want these to be of much higher quality than this diary, which is mostly for myself, with animations, graphs, links to source code, etc. However, I already have a doable first coding project in mind to hopefully fill this void soon.