Looking at this paper again, I think you guys are doomposting too much. According to their own numbers, the gap between the 2B and the 137B model after pretraining is not that big. And a 2B-parameter model should fit comfortably on a single decent consumer GPU.
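Quick napkin math on the VRAM claim (weights only, assuming fp16; activations and KV cache add overhead on top of this):

# Back-of-envelope VRAM estimate for holding a model's weights in memory.
# Ignores activations, optimizer state, and KV cache, so real usage is higher.

def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """GiB needed just to store the weights."""
    return n_params * bytes_per_param / 1024**3

for dtype, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"2B params @ {dtype}: ~{weight_memory_gib(2e9, nbytes):.1f} GiB")

# fp16 -> ~3.7 GiB, so 2B fits on an 8 GB card with room to spare.
# By contrast, 137B @ fp16 -> ~255 GiB, way beyond any single GPU.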