>>75470529
a paper came out a little while ago showing that the later layers of models (except the very last ones) tend to be redundant and don't do much. they found you can erase a lot of those layers, re-tune the model on a relatively small amount of data, and it ends up almost as good as it was before
https://arxiv.org/abs/2403.17887
seems to work here, loss is roughly close to where a normal llama-8b would be with this data
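rough sketch of what the pruning step looks like with HF transformers, if anyone wants to mess with it. layer range and model name are just placeholders I picked, the paper actually chooses which block to cut by measuring similarity between layer activations, and then heals the model afterwards with a small finetune:

import torch
from transformers import AutoModelForCausalLM

# placeholder model, any llama-style checkpoint with .model.layers works the same way
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

# llama-3-8b has 32 decoder layers; cut a chunk of the later ones but keep the very last few.
# indices here are arbitrary for illustration, the paper picks the block by activation similarity
drop = set(range(20, 30))
kept = torch.nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in drop
)
model.model.layers = kept
model.config.num_hidden_layers = len(kept)

# at this point quality is degraded; the "heal" step is re-tuning the pruned model
# on a small dataset (parameter-efficient finetuning like LoRA/QLoRA is enough)

the pruned checkpoint is then trained like normal, just with fewer layers, which is where the "almost as good as before" loss numbers come from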