>>77388848
>Do we have an estimate or numbers for how long it will take to train?
We don't know, but my guess is training SD3 "DiT-only" will be faster than training SDXL "unet-only"
Training the text encoders will be more difficult. SD3 has 3 text encoders: 2 CLIPs like in SDXL, plus T5 with 4.7B parameters, which you won't be able to train
But if you ignore T5 and only use CLIP, it shouldn't be too bad
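Rough sketch of what "ignore T5" could look like when building the text conditioning: keep the two CLIP outputs and put zeros where the T5 embeddings would go. Everything here is an assumption for illustration (the `build_context` helper, the 77-token length, the 768/1280/4096 widths, and the pad-then-concatenate layout), and random arrays stand in for real encoder outputs. It is not the actual SD3 code.

```python
import numpy as np

# All sizes are assumptions for illustration, not taken from the post above.
SEQ_LEN = 77        # tokens per encoder
CLIP_L_DIM = 768    # width of the first CLIP encoder
CLIP_G_DIM = 1280   # width of the second CLIP encoder
T5_DIM = 4096       # width of the T5 encoder


def build_context(clip_l, clip_g, t5=None):
    """Combine per-token text embeddings into one context sequence.

    Hypothetical helper: when t5 is None, zeros stand in for the T5 embeddings.
    """
    # Join the two CLIP outputs along the channel axis.
    clip = np.concatenate([clip_l, clip_g], axis=-1)
    # Zero-pad the channels so the CLIP block matches the T5 width.
    clip = np.pad(clip, ((0, 0), (0, T5_DIM - clip.shape[-1])))
    if t5 is None:
        t5 = np.zeros((SEQ_LEN, T5_DIM), dtype=clip.dtype)
    # Stack the CLIP tokens and the T5 tokens along the sequence axis.
    return np.concatenate([clip, t5], axis=0)


# Random stand-ins for real CLIP outputs.
rng = np.random.default_rng(0)
clip_l = rng.standard_normal((SEQ_LEN, CLIP_L_DIM)).astype(np.float32)
clip_g = rng.standard_normal((SEQ_LEN, CLIP_G_DIM)).astype(np.float32)

ctx = build_context(clip_l, clip_g)  # no T5 passed, so zeros are used
print(ctx.shape)
```

The point is that the context keeps the same shape with or without T5, so the rest of the model doesn't need to change and you never have to load the 4.7B encoder.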
From their paper:
>we observe limited performance drops when using only the two CLIP-based text-encoders for the text prompts and replacing the T5 embeddings by zeros. Only for complex prompts involving either highly detailed descriptions of a scene or larger amounts of written text do we find significant performance gains when using all three text-encoders. Removing T5 has no effect on aesthetic quality ratings (50% win rate), and only a small impact on prompt adherence (46% win rate), whereas its contribution to the capabilities of generating written text are more significant (38% win rate).

>>77389731
Well, the list was auto-generated, so there are some mistakes.
Mascots shouldn't be here