> But my students weren’t as happy as I was - they wanted to build something genuinely useful, and they were really disappointed that our “product” had strong architectural limits and couldn’t outperform titans like nginx and haproxy.
I took a (very brief) look at the github repo [1], it doesn't look like you're doing anything with cpu pinning.
You can probably eek out a bit more performance if you cpu pin your threads and cpu pin your listen sockets (sockopt SO_INCOMING_CPU).
If you also cpu align your outgoing sockets, you should get a significant boost, but afaik, there's no great api for that. Linux does have an api for compatible NICs (traffic steering) which can work, but if you know what hash your NIC uses (it's probably toeplitz) and you manage source port selection to your backend, you can pick ports that will hash properly.
The goal is for your proxy to be able to handle packets without any cross cpu communication.
The author takes a very benchmark focus on this topic which only says part of the story particularly for complex systems. Noticed that there are a number of very similar interface that exist on other platform like windows long before io_uring, but that does make Linux’s I/O system worse or slow than these platforms. A fast server is likely fast in either multiplexing or async API if implemented correctly in almost all cases.
Yes, io_uring is significantly faster than epoll (I think I had like 20% faster req/s with io_uring.) The catch is that its kernel opt-in and disabled just about everywhere for security reasons. I think that it has direct memory sharing between the kernel and user-land which is kind of yikes. There's been multiple exploits that hit io_uring in recent times. It's because of this that even engineering projects that try to reach the highest performance possible (like Go) don't really bake io_uring in as a sane default. Though if you want to take the risk you can always run it yourself for your favourite language. It is faster but the cost is possible exploits.
For a project like Go, wouldn't it be an option to do one-time iouring feature detection in the runtime startup? Exploits are an issue for the entire OS, not the program choosing to use iouring, yeah?
I switched out asio's epoll backend for its io_uring in a database server and CPU utilization shot up. Probably depends on usage and the specifics of how it's integrated into the event code.
That’s paradoxically what you can expect on a busy server - your CPU can spend time doing work that would have been previously IO wait time. Of course, it could be a bug in the implementation where you’re spinning doing no work erroneously, but depends on the details.
Yeah, the explanation that I've usually heard for this sort of thing is that it's intended to get back CPU time that's lost when too many system threads are blocking to keep something on every core even during I/O (or pay for it in terms of the context switching overhead if you compensate for this with an extremely large number of system threads). The theory is that you'll avoid idle CPU compared to the common "one thread per core" way of doing things due to some of them being idle during I/O, at the cost of using some extra CPU to handle more things in user space. Obviously how much this helps can vary between use cases, but the measure of how much it's helping (or if it's maybe not helping at all!) is throughput, not CPU utilization.
I took a (very brief) look at the github repo [1], it doesn't look like you're doing anything with cpu pinning.
You can probably eek out a bit more performance if you cpu pin your threads and cpu pin your listen sockets (sockopt SO_INCOMING_CPU).
If you also cpu align your outgoing sockets, you should get a significant boost, but afaik, there's no great api for that. Linux does have an api for compatible NICs (traffic steering) which can work, but if you know what hash your NIC uses (it's probably toeplitz) and you manage source port selection to your backend, you can pick ports that will hash properly.
The goal is for your proxy to be able to handle packets without any cross cpu communication.
[1] https://github.com/sibexico/TinyGate
Rdma, dpdk, io_uring it’s really kind of up to the user to do the memory isolation
In io_urings case tho, you can’t do much because the rings are in the kernel.
I’m hopeful though that with Llm things will get better.
But it’s just hard problem to solve . Very difficult to do in the kernel itself, and folks don’t really even understand tuning for it.