>>61353077
danbooru / gelbooru is your dataset.
You can use
https://github.com/Bionus/imgbrd-grabber or hydrus to download.
I use grabber. Downloading "virtual_youtuber -filetype:mp4 -filetype:webm -filetype:gif parent:none mpixels:>1 score:>2 id:<5000000" and then same with id:>5000000 - should get about 170k images from danbooru. Danbooru hides loli by default, so dl them from gelbooru. You can also add some non-vtuber images (20-30% of your total dataset)
Add tags like "animated, virtual_youtuber multiple_girls, *_text, multiple_views, monochrome, greyscale, sketch, comic, heavily_censored, text_focus, translation_request" to the blacklist, plus very low scores: "rating:explicit score:<30, -rating:explicit score:<3"
To write tags in grabber try:
"<!"virtual_youtuber"?:%character:unsafe,separator=^, %>, @, <"rating:explicit"?nsfw, :><"rating:questionable"?nsfw, :>%general:unsafe,separator=^, %, <!"virtual_youtuber"?%character:unsafe,separator=^, %:>, <by:%artist:unsafe%>
Run something like fixup() from
https://github.com/space-nuko/sd-webui-utilities/blob/master/tagtools.py after downloading.
Character tags for vtubers will be first, so it's easy to use a script to count them or sort them into folders. If you also want to add autotags,
https://rentry.org/ckmlai#ensemblefederated-wd-taggers is a decent autotagger script. Just change wt to at in open() so it appends to your tag files instead of overwriting them.
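To show what the wt -> at change does: "wt" truncates the file on open, "at" appends to the end. This is a minimal sketch, not the autotagger's actual code. append_tags is a made-up helper, and skipping tags that are already in the file is my own addition.

```python
from pathlib import Path


def append_tags(caption_path, new_tags):
    """Append new_tags to a comma-separated caption file without overwriting it.

    Hypothetical helper for illustration; the real autotagger script has its
    own write logic, where only the open() mode needs changing.
    """
    path = Path(caption_path)
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    known = {t.strip() for t in existing.split(",")}
    # Assumption: skip tags that are already present (not part of the original advice).
    extra = [t for t in new_tags if t not in known]
    if not extra:
        return
    # "at" = append, text mode. "wt" here would wipe the downloaded tags.
    # Assumes the file does not end with a newline.
    with open(path, "at", encoding="utf-8") as f:
        prefix = ", " if existing.strip() else ""
        f.write(prefix + ", ".join(extra))
```

Usage would be something like append_tags("123.txt", ["smile", "outdoors"]) once per image after the tagger returns its tags.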
Further filtering: try removing very bad images with cafeai/cafe_aesthetic score < 0.6, and skytnt/anime-aesthetic score < 0.1. Then run a duplicate finder.
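A rough sketch of the threshold step, assuming you already have one float per model per image (how you get the scores out of each model is not shown here). Treating an image as bad when either score is under its cutoff is my reading of the above, and select_for_removal is a made-up name.

```python
from typing import Dict, List, Tuple

CAFE_MIN = 0.6   # cafeai/cafe_aesthetic cutoff from the post
ANIME_MIN = 0.1  # skytnt/anime-aesthetic cutoff from the post


def select_for_removal(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return sorted file names whose scores fall under either cutoff.

    scores maps file name -> (cafe_score, anime_score).
    Assumption: failing one of the two checks is enough to drop the image.
    """
    bad = [
        name
        for name, (cafe, anime) in scores.items()
        if cafe < CAFE_MIN or anime < ANIME_MIN
    ]
    return sorted(bad)
```

Move the returned files to a separate folder instead of deleting them right away, so you can eyeball what the cutoffs are throwing out and adjust.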
Sort vtubers into folders by name, then try balancing by setting more repeats for vtubers with fewer images or, at least, delete the unnecessary Gura/Marine art. Each of them has almost 10k images; you don't need more than 500-800 per vtuber. You also don't need shitters with <10-15 images. Delete them too.
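The counting and balancing can be sketched like this. It assumes each image has a .txt caption next to it, and that the character tags come before a standalone "@" tag (which is what the filename template above should produce). The repeats formula is just one guess at "more repeats for fewer images"; the function names and the target parameter are made up.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, Optional, Tuple


def first_character(caption: str) -> Optional[str]:
    """Return the first tag before the standalone '@' tag, or None."""
    tags = [t.strip() for t in caption.split(",")]
    if "@" not in tags:
        return None
    head = [t for t in tags[: tags.index("@")] if t]
    return head[0] if head else None


def count_characters(dataset_dir: str) -> Counter:
    """Count captions per leading character tag under dataset_dir."""
    counts: Counter = Counter()
    for txt in Path(dataset_dir).rglob("*.txt"):
        name = first_character(txt.read_text(encoding="utf-8"))
        if name:
            counts[name] += 1
    return counts


def plan(counts, min_images=10, max_images=800, target=800) -> Dict[str, Tuple[int, int]]:
    """Map character -> (images to keep, repeats).

    Drops characters under min_images, caps the rest at max_images, and
    picks repeats so images * repeats lands near target (an assumption).
    """
    result = {}
    for name, n in counts.items():
        if n < min_images:
            continue
        kept = min(n, max_images)
        result[name] = (kept, max(1, round(target / kept)))
    return result
```

For example, a vtuber with 10000 images gets capped to 800 at 1 repeat, one with 100 images gets 8 repeats, and one with 5 images is dropped. You'll probably want to cap repeats too so tiny folders don't get overfit.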
If you want a better setup, try using some things from
https://github.com/cyber-meow/anime_screenshot_pipeline/tree/main and
https://github.com/deepghs/waifuc - you can try cropping faces/upper bodies, for example.