>>85170453
if chatGPT is a decent representation of current multimodal model capabilities (haven't used one locally myself), they can't directly output coordinates of objects in images. i've heard of people working around it by overlaying a labeled grid on the image that the model can reference. but if you ask chatgpt for the coordinates of where to click next in an osu screenshot, even when it picks the right circle it has to write some quick python code to basically eyeball the position, and that's nowhere near precise enough to play the game
so besides the actual speed of the whole thing and whether or not the LLM knows how the game is played, i think there's a fundamental issue with how image analysis works that'll be a blocker
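
for anyone curious what the labeled grid workaround looks like in practice, here's a rough sketch of the bookkeeping side. everything in it is an assumption on my part: the function names are made up, the 16x9 grid and the "column letter + row number" label scheme (like C4) are just one way to do it, and it doesn't talk to any real model or game API. it only shows how a cell label the model reads off the overlay would map back to pixels, and how far off that can be.

```python
# hypothetical helpers; not from any real osu! or ChatGPT API
from string import ascii_uppercase


def cell_center(label, width, height, cols=16, rows=9):
    """Map a grid label like 'C4' (column letter, 1-based row) to the pixel center of that cell.

    Assumes at most 26 columns, since columns are labeled A-Z.
    """
    col = ascii_uppercase.index(label[0].upper())
    row = int(label[1:]) - 1
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError(f"label {label!r} is outside a {cols}x{rows} grid")
    cell_w = width / cols
    cell_h = height / rows
    return ((col + 0.5) * cell_w, (row + 0.5) * cell_h)


def worst_case_error(width, height, cols=16, rows=9):
    """Largest distance in pixels between a cell's center and a point in that cell (a corner)."""
    half_w = width / cols / 2
    half_h = height / rows / 2
    return (half_w ** 2 + half_h ** 2) ** 0.5


if __name__ == "__main__":
    # the model answers "C4" for a 1920x1080 screenshot
    print(cell_center("C4", 1920, 1080))
    print(round(worst_case_error(1920, 1080), 1))
```

so the best the grid can do is "somewhere in this cell". a finer grid shrinks the error but means more labels crammed onto the screenshot for the model to read, which is the same precision problem from the other direction.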