cross-posted from: https://programming.dev/post/51407459

Check what can you use and at what rate of token per seconds would it be… It has examples of many models and quantization levels. Huge resource!

  • SamuelEllis@lemmy.world
    link
    fedilink
    arrow-up
    1
    ·
    2 days ago

    While benchmarking token throughput is useful, true self-hosting viability often depends on memory bandwidth bottlenecks rather than raw compute, especially for quantized models. Have you evaluated how different quantization levels impact inference latency on consumer-grade GPUs compared to the reported token-per-second figures?