Study log 3 - Sub Specie

Summary of the day

Off day yesterday. I looked at the memory layout of NumPy arrays and how they differ from torch Tensors, worked through the PyTorch tutorial, and ran a number of optimizations on it. I published what I did on GitHub. Overall, I tried:

MLP optimized with SGD
MLP (same architecture) optimized with AdamW
Convolutional NN (4 runs: basic, with hyperparameter tuning, with BatchNorm, with dropout)
Improved Convolutional NN architecture, with data augmentation, label smoothing, a massively increased number of epochs (and thus training time), and a learning rate schedule (cosine annealing).

Overall, this was really cool and great to see. I already feel much more comfortable using torch, and LLMs are great at spotting bugs in training code (which used to be incredibly demotivating). I managed to run a bunch of experiments, massively improved on the naive performance, and got somewhat within reach of SOTA. I definitely still need to figure out a good way of sanity checking and optimizing hyperparameters without running a full training loop, as well as static analysis for model code to spot bugs more easily and earlier, but overall I feel more than ready for the coming challenges.

I think I’ll start with the ARENA curriculum after some reading, but I also really want to do the Karpathy GPT-2 repro. In the coming days

NumPy arrays vs Torch tensors

Lower-level details of NumPy arrays

Numpy arrays have a “data buffer” which stores the actual entries of the array. This is essentially a 1-dimensional array in C - a contiguous block of memory. Important: Every element has to have a fixed size (e.g. int32, float64, tuples or structs of primitive data types, fixed-length strings, etc) for any of the following to work.

They additionally store metadata that makes many of the operations NumPy offers (e.g. transposing a matrix, defining a view on an existing array, broadcasting, iterating through an existing array in reverse) zero-copy, because a new array can reuse the data buffer of an existing one. Short summary of performance-relevant metadata:

Offset: Defines where in the data buffer the given array starts. This lets you define e.g. the following (example from Claude), where a and b share a single data buffer and only the metadata is duplicated:

import numpy as np
a = np.arange(10)        # buffer: [0,1,2,3,4,5,6,7,8,9]
b = a[3:]                # same buffer, offset = 3 * itemsize

Stride: for each dimension (the number of dimensions and the size of each are of course also stored), the stride defines how many bytes you need to advance in the buffer to get to the next element along that dimension (which can then be read based on the fixed element size, see above). This enables the following:
- Transposing an array: Simply swap the strides. The data buffer stays untouched.
- Broadcasting: Setting the stride for a dimension to 0 enables broadcasting along that dimension without copying. Advancing the index in that dimension does not advance the memory address, so the same data is read again.
- Reversing an array: For example, a[::-1] iterates through a vector in reverse. This is done by making the stride negative and updating the offset to point at the last element in the data buffer. Again, zero-copy.
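The three stride tricks above can be checked directly from Python. This is a minimal sketch, assuming int64 elements (8 bytes each) and the default C layout; the variable names are just for illustration.

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)  # C order, itemsize 8
print(a.strides)                 # (32, 8): 32 bytes to the next row, 8 to the next column

t = a.T                          # transpose: strides are swapped, buffer is shared
print(t.strides)                 # (8, 32)
print(np.shares_memory(a, t))    # True

row = np.arange(4, dtype=np.int64)
b = np.broadcast_to(row, (3, 4)) # read-only broadcast view of a single row
print(b.strides)                 # (0, 8): moving along axis 0 never advances in memory

r = row[::-1]                    # reversed view
print(r.strides)                 # (-8,)
print(np.shares_memory(row, r))  # True
```

Note that np.broadcast_to returns a read-only view, precisely because several indices map to the same memory location.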

After having done the NumPy exercises, it really struck me how performant most NumPy operations actually are on a computer (I mostly thought about them from a mathematical perspective, where e.g. transposing means you write the matrix down again). There are a few more metadata fields that allow for better interoperability when working with remote machines, databases, scientific file formats etc. (e.g. endianness, and C-style (rows first in memory, then columns) vs Fortran-style (columns first in memory, then rows) matrix layout), but I found the above the most interesting. Note that the matrix layout is extremely important for practical performance, as it hugely influences your cache hit rate (see e.g. the famous matrix multiplication comparison: in his blog post, Michal Pitr gets a 100x speedup over the naive implementation, which suffers from cache misses).
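To make the layout metadata concrete, here is a small sketch that builds the same 3x4 float64 array in C order and in Fortran order and prints the resulting strides and flags. It only inspects metadata and does not measure cache behavior.

```python
import numpy as np

c = np.zeros((3, 4), dtype=np.float64)  # C order (row-major) is the default
f = np.asfortranarray(c)                # Fortran order (column-major) copy

print(c.strides, c.flags["C_CONTIGUOUS"])  # (32, 8) True
print(f.strides, f.flags["F_CONTIGUOUS"])  # (8, 24) True

# Endianness is part of the dtype; '=' means the machine's native byte order.
print(c.dtype.byteorder)                   # =
```

In the C-order array, walking along a row touches adjacent memory; in the Fortran-order array, walking down a column does.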

Differences between NumPy arrays and PyTorch tensors

First of all, both are similar data structures: they allow storing and running computation on mathematical tensors, and both use the approach I discussed above (a data buffer, strides, offsets, etc.). For a PyTorch tensor on the CPU, converting between the two is zero-copy, since the data buffer can be reused. A short list of their differences (essentially PyTorch just adds ML functionality on top):
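A quick way to convince yourself of the zero-copy conversion is to write through one object and read through the other. This is a minimal sketch for a CPU tensor, using torch.from_numpy and Tensor.numpy.

```python
import numpy as np
import torch

arr = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(arr)           # shares arr's buffer, no copy
t[0, 0] = 100.0                     # write through the tensor...
print(arr[0, 0])                    # 100.0  ...and the array sees it

back = t.numpy()                    # also zero-copy for CPU tensors
print(np.shares_memory(arr, back))  # True

# Both store strides, but torch counts elements while NumPy counts bytes.
print(t.stride(), arr.strides)      # (3, 1) (12, 4)
```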

PyTorch has autograd (automatic gradient calculation) functionality, meaning each tensor has an additional field (.grad) to store its gradient, along with e.g. a requires_grad boolean that indicates whether gradients should be calculated.
PyTorch natively supports GPU acceleration via the device parameter, allowing it to store tensors on cuda, mps etc. devices. The memory layout there will likely be different than on the CPU.
Small differences between dtypes, e.g. only PyTorch has bfloat16 (a lower-precision, higher-range version of float16) and quantized types, while NumPy has a few dtypes PyTorch doesn’t have. The default float dtype differs as well (float64 in NumPy, float32 in PyTorch).
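The differences listed above can be seen in a few lines. This is a small sketch assuming PyTorch's default dtype has not been changed; it falls back to the CPU when no CUDA device is available.

```python
import numpy as np
import torch

# Default float dtypes differ.
print(np.array([1.0]).dtype)       # float64
print(torch.tensor([1.0]).dtype)   # torch.float32

# Autograd: requires_grad turns on gradient tracking, .grad stores the result.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()                       # dy/dx = 2x
print(x.grad)                      # tensor([2., 4., 6.])

# Device placement and a dtype that NumPy does not ship natively.
device = "cuda" if torch.cuda.is_available() else "cpu"
z = torch.ones(3, dtype=torch.bfloat16, device=device)
print(z.dtype, z.device.type)
```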

Basics of PyTorch

`Dataset` and `DataLoader`

A Dataset stores samples and their associated labels. A DataLoader offers a structured way of iterating through the data. For example, in stochastic gradient descent you might want to do batching, customize what happens if the last batch is not full (e.g. batch size is 10 but you have 99 samples), have a simple way to inspect a certain batch, or customize the sampling (e.g. using a certain random seed, or not doing completely random sampling). Datasets come in two styles: map-style, which allows random access to samples by index (which the sampler relies on), and iterable-style, which only knows how to produce the next sample.
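As an illustration of the 99-samples, batch-size-10 case, here is a minimal sketch with a hypothetical map-style ToyDataset holding random data. It shows the drop_last option and a seeded generator for reproducible shuffling.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical map-style dataset: 99 random feature vectors with labels."""
    def __init__(self, n=99):
        g = torch.Generator().manual_seed(0)
        self.x = torch.randn(n, 4, generator=g)
        self.y = torch.randint(0, 2, (n,), generator=g)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

ds = ToyDataset()
gen = torch.Generator().manual_seed(42)  # fixes the shuffling order
keep_last = DataLoader(ds, batch_size=10, shuffle=True, generator=gen)
drop = DataLoader(ds, batch_size=10, shuffle=True, drop_last=True)
print(len(keep_last), len(drop))         # 10 9

xb, yb = next(iter(keep_last))           # inspect a single batch
print(xb.shape, yb.shape)                # torch.Size([10, 4]) torch.Size([10])
```

With the default drop_last=False, the tenth batch simply contains the 9 leftover samples.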

High-level training overview

When you train a model, you essentially have to do the following steps:

Load your dataset(s), transform them into the right shape (and any other transformations you might want)
Define your model architecture (number of layers, which components, etc)
Choose a loss function and an optimizer
Core training loop on training set:
- Zero gradients
- Forward pass and loss computation
- Backward pass (compute gradients of the loss)
- Optimizer step
Evaluation on the test set

PyTorch has powerful primitives for all of these steps, so make sure to learn the API and use them. For debugging, you can also visualize samples or predictions mid-training.
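The core loop can be sketched as follows. This is a toy example under stated assumptions, not the setup from my experiments: a made-up regression problem (y = 3x + 1 plus noise), a tiny MLP, and full-batch updates instead of a DataLoader to keep it short.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Toy regression data (hypothetical): y = 3x + 1 plus a little noise
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3 * x + 1 + 0.01 * torch.randn_like(x)

model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model.train()
losses = []
for epoch in range(200):
    optimizer.zero_grad()        # 1. zero gradients from the previous step
    pred = model(x)              # 2. forward pass...
    loss = loss_fn(pred, y)      #    ...and loss computation
    loss.backward()              # 3. backward pass: fills .grad on each parameter
    optimizer.step()             # 4. update weights using those gradients
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

In a real run the inner body would sit inside a loop over DataLoader batches, with evaluation on the test set after each epoch.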

Gotchas I noticed

Make sure you call model.train() before the training loop and model.eval() when evaluating performance (i.e. running on the test set). Also ensure you use with torch.no_grad(): when you don’t want to collect gradients (e.g. when running on the test set).
Nobody stops you from messing up the core training loop of zeroing gradients, evaluating the model on the current batch and accumulating gradients, then updating weights. They’re separate method calls and e.g. zeroing gradients can easily be forgotten.
The above issue permeates the ecosystem IMO, e.g. the learning rate scheduler needs you to call step(), the optimizer needs you to call step(), etc.
When defining your network, often a previous layer determines the shape of the next layer. However, at least by default there is no static analysis for this, meaning you often get confusing runtime errors about shape mismatches etc. LLMs are now quite good at spotting and fixing these, but this still doesn’t seem ideal. I’ll look for tooling for this on a later day.
Wrongly set hyperparameters can cause significant trouble. E.g. if the learning rate is too high, your network might not converge at all, or might increase the loss rather than decrease it. Similarly to the point above, I’ll look for some framework that automatically checks for hyperparameters that are significantly wrong.
If you’re unhappy with your network’s performance, it makes sense to look at the accuracy (or some other metric) on both the training and test set. This lets you notice when you are overfitting, when you are undertraining (e.g. increasing the number of epochs would significantly boost your performance), or when you have a capacity problem (your network is too small/dumb to learn the function required by your dataset).
I haven’t run into this yet, but you can similarly hit numerical instability problems (e.g. vanishing gradients). These seem very hard to debug, and I’m not sure yet what you can do about them (the solution is usually changing your computation or algorithm).
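A few of these gotchas can be guarded against with small helpers. The sketch below is one possible pattern, not an established tool: a hypothetical accuracy helper that wraps model.eval() and torch.no_grad(), a dummy forward pass as a cheap shape check, and a train/test comparison. The model and the random data are made up for illustration, so the printed accuracies carry no meaning.

```python
import torch
from torch import nn

def accuracy(model, inputs, labels):
    """Fraction of correct predictions, computed without gradient tracking."""
    was_training = model.training
    model.eval()                   # disables dropout, BatchNorm uses running stats
    with torch.no_grad():          # no autograd graph is built
        logits = model(inputs)
        acc = (logits.argmax(dim=1) == labels).float().mean().item()
    model.train(was_training)      # restore the previous mode
    return acc

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 3))

# Cheap shape check: push one dummy batch through before any training.
dummy = torch.zeros(2, 8)
assert model(dummy).shape == (2, 3)

# Random stand-in data; a large train/test gap on real data hints at overfitting.
x_train, y_train = torch.randn(100, 8), torch.randint(0, 3, (100,))
x_test, y_test = torch.randn(40, 8), torch.randint(0, 3, (40,))
train_acc = accuracy(model, x_train, y_train)
test_acc = accuracy(model, x_test, y_test)
print(f"train={train_acc:.2f} test={test_acc:.2f} gap={train_acc - test_acc:+.2f}")
```

The dummy forward pass catches layer shape mismatches in well under a second, long before a full training run would.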

Side note: Experiment setup

Today I really noticed why large tech companies all have their own ML training framework that automatically registers every training run along with some metadata etc. In ML you really want to iterate quickly (e.g. in a Jupyter notebook) and try out different architectural tweaks, etc. This makes it really easy to lose track of which change caused which performance change, which setup was actually the best one, etc. I’ll need to rigorously use git, and make good notes for all my experiment runs. Hopefully the blog can help me with this :)
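One lightweight way to keep track, short of a full framework, is to append a record per run to a JSON Lines file together with the current git commit. This is a minimal sketch using only the standard library; the helper names, the file name runs.jsonl, and the values in the example call are placeholders, not real results.

```python
import json
import subprocess
import time
from pathlib import Path

def current_commit():
    """Return the short git commit hash, or 'unknown' outside a repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"

def log_run(path, name, hparams, metrics):
    """Append one experiment record as a single JSON line."""
    record = {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "commit": current_commit(),
        "name": name,
        "hparams": hparams,
        "metrics": metrics,
    }
    with Path(path).open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Placeholder values for illustration only.
rec = log_run("runs.jsonl", "cnn_batchnorm", {"lr": 1e-3, "epochs": 10}, {"test_acc": 0.0})
print(rec["name"], rec["commit"])
```

Committing before each run means the logged hash points at exactly the code that produced the metrics.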