Sorry, I have no experience with Flutter, and so far I have only used ONNX via its DML EP (and the CPU for testing purposes).
If your input is fixed at 30 FPS but the fastest network that still meets your desired or minimum detection rate can only guarantee X FPS, then perhaps you can simply forward only every ceil(30/X)-th frame of the input to the network (rounding up, so that the forwarded rate never exceeds X). From what I understand, I would think it is close to impossible in practice to sustain 30 FPS inference on large images on a mobile CPU, even with the simplest of networks (there is a reason phone manufacturers are pushing NPU hardware on mobile platforms). With that in mind you will almost certainly have to reduce the load somehow, and down-sampling the input stream in time and/or size is, as far as I know, really the only robust solution, even if it's not optimal.
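To make the frame-skipping idea concrete, here is a minimal sketch in Python. The names `frame_stride` and `every_nth` are made up for illustration and are not part of any library. The stride is rounded up so that the rate of forwarded frames never exceeds what the network can sustain.

```python
import math
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def frame_stride(input_fps: float, network_fps: float) -> int:
    """Return N such that forwarding every N-th frame stays within network_fps.

    Hypothetical helper: rounds up, and never returns less than 1.
    """
    if input_fps <= 0 or network_fps <= 0:
        raise ValueError("frame rates must be positive")
    return max(1, math.ceil(input_fps / network_fps))


def every_nth(frames: Iterable[T], stride: int) -> Iterator[T]:
    """Yield frames 0, stride, 2*stride, ... and drop the rest."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    for index, frame in enumerate(frames):
        if index % stride == 0:
            yield frame


if __name__ == "__main__":
    # Assumed numbers: 30 FPS camera, network sustains about 8 FPS.
    stride = frame_stride(30, 8)
    # Integers stand in for real frames here.
    forwarded = list(every_nth(range(30), stride))
    print(stride, forwarded)
```

With a 30 FPS input and a network that manages about 8 FPS, this forwards every 4th frame, so the network sees 7.5 frames per second. Reducing the frame size before inference is the other lever and can be combined with this.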
I am not familiar with the YOLOv11 network, but I understand from its description that it is already very much designed for near real-time detection, so I doubt you can gain detection speed by making structural changes yourself. With your fixed input rate, CPU hardware and YOLO network, I don't see any other option than reducing the network load as mentioned above.