>>59847680
Depends on the model. If we're serving 13B, each GPU should be able to handle around 50 users prompting **simultaneously**. But if the requests are properly paced we can do hundreds to thousands of users per GPU (rough sketch of that pacing math at the bottom of the post). Keep in mind these are estimations based on benchmarks; not sure how they'd apply to real-world cases. Here's a chart I made a couple of months ago, but by now we should be around 2.5x more efficient (too lazy to update it):
13B model:
- # GPUs is the number of GPUs the model is running on
- N is how many prompts each user requests (think of it like the pre-loaded swipes from the early days of C.AI)
- Requests/s is how many requests we'll be able to serve every second
- Tokens/s is the total throughput per second
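Rough sketch of how I read those columns fitting together, assuming each request fans out into N completions of roughly the same length (the average response length below is a placeholder, not a number from the chart):

```python
# Rough sketch of how the chart columns relate. Assumes each request
# fans out into N completions of roughly equal length;
# avg_tokens_per_completion is a placeholder, not a chart value.

def total_tokens_per_s(requests_per_s: float,
                       n_prompts_per_request: int,
                       avg_tokens_per_completion: float) -> float:
    """Total generation throughput implied by the other two columns."""
    return requests_per_s * n_prompts_per_request * avg_tokens_per_completion

# e.g. 2 requests/s, N = 4 swipes each, ~250 tokens per swipe
# -> ~2000 tokens/s total throughput
print(total_tokens_per_s(2.0, 4, 250.0))
```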
Now that we natively support Exllama and AWQ, and I've improved efficiency by at least 60%, we're likely doing much better than those chart numbers.
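And for the "hundreds to thousands per GPU if properly paced" bit above, here's the back-of-the-envelope version; the generation time and the gap between a user's prompts are placeholder guesses, not measured numbers:

```python
# Back-of-the-envelope for the "properly paced" estimate above.
# concurrent_slots is the ~50 simultaneous prompts per GPU from the post;
# the timing numbers are placeholder guesses, not benchmark results.

def users_per_gpu(concurrent_slots: int,
                  gen_seconds_per_response: float,
                  idle_seconds_between_prompts: float) -> float:
    """Users one GPU can serve when prompts are spread out in time."""
    # A slot is only busy for gen_seconds out of each user's full
    # prompt cycle, so one slot can be time-shared across several users.
    cycle = gen_seconds_per_response + idle_seconds_between_prompts
    return concurrent_slots * cycle / gen_seconds_per_response

# e.g. 50 slots, ~10 s per response, a user prompting every ~90 s:
# 50 * 100 / 10 = 500 users on one GPU, i.e. "hundreds per GPU"
print(users_per_gpu(50, 10.0, 90.0))
```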