>>64069622
I've analyzed their training before, and the only things I'm missing are AdamW8bit, a batch size of 6, and xformers. Instead I'm stuck with Adam, a batch size of 2, and sdpa.
Maybe I have been nuking the learning rate down too much after so many 2-views and stiff datasets.
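
For a rough sanity check on whether the learning rate is too low for batch size 2, here is a minimal sketch of the two common batch-size scaling heuristics (linear and square-root). Everything in it is an assumption for illustration: `scale_lr` is a hypothetical helper, and the reference learning rate of 1e-4 at batch size 6 is a made-up number, not a setting taken from their training. Neither rule is guaranteed to hold for Adam-type optimizers; square-root scaling is sometimes suggested as the gentler option.

```python
import math


def scale_lr(ref_lr, ref_batch, new_batch, rule="linear"):
    """Rescale a reference learning rate for a different batch size.

    Hypothetical helper. Both rules are heuristics, not guarantees.
    """
    ratio = new_batch / ref_batch
    if rule == "linear":
        # Linear scaling: learning rate changes in proportion to batch size.
        return ref_lr * ratio
    if rule == "sqrt":
        # Square-root scaling: a gentler adjustment.
        return ref_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


# Assumed reference point (NOT from the post): 1e-4 at batch size 6.
ref_lr, ref_batch, my_batch = 1e-4, 6, 2
for rule in ("linear", "sqrt"):
    print(rule, scale_lr(ref_lr, ref_batch, my_batch, rule))
```

If the learning rate in use is far below what both rules suggest, that would be consistent with having cut it too much, though the dataset matters as well.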