Runq_overload

jbrzozoski · June 13, 2023, 8:13pm

We’re getting this alarm on our production broker (v5.0.19):

emqx_runq_overload

There was definitely a surge of traffic right at the time it started, but the load has dropped back down and the alarm hasn’t gone away. I kinda suspect the alarm is stuck on, as it’s been about 50 minutes and everything is operating normally now.

But, more importantly, our system may need to handle surges of traffic like that every now and then and I’d like to know what to tune to avoid that. I’m reading through the v5.0 tuning guide and suspect it might be the Erlang VM tuning options, but can’t tell for sure.

We left most of our emqx.conf file default/blank, so what would the default values for node.process_limit and node.max_ports be on our system? Are those the proper things to tune to avoid the runq_overload in the future?

I’d like a moderate level of confidence before having to bump the config on our production server.

liuxy · June 14, 2023, 7:56am

Hi, I will investigate it. For now there are no configs available for tuning this. If the alarm cannot clear automatically, that is a problem.

william · June 14, 2023, 12:23pm

Please check whether it is a false alarm:

emqx eval 'emqx_olp:is_overloaded()'

Off topic: we do have a performance issue in EMQX 5.0.19, so please try upgrading to the latest 5.0.x.

We do not suggest that users tune the condition that triggers this alarm.

If you get the alarm from time to time, it means your setup is not ideal from time to time.

jbrzozoski · June 14, 2023, 3:25pm

Unfortunately, our production server is running in an AWS ECS container and I haven’t enabled the ability to exec shell commands yet. Is there any way I can check the false alarm via the dashboard?

(I am also looking into enabling the ability to exec shell commands, but that is not a high priority.)

I am not looking to change the condition that triggers this alarm. I would like to know what part of my setup I should be looking to change to avoid this alarm. Is it just a matter of more CPU/RAM and EMQX will automatically make use of the increased capacity?

liuxy · June 15, 2023, 2:32am

If surges of traffic are not very common, then just ignore the alarm. You can monitor your CPU usage with a system monitoring tool.

We suggest you upgrade EMQX to the latest 5.0.x release, as lots of bugs have been fixed.

william · June 15, 2023, 10:01am

I am not looking to change the condition that triggers this alarm. I would like to know what part of my setup I should be looking to change to avoid this alarm. Is it just a matter of more CPU/RAM and EMQX will automatically make use of the increased capacity?

This alarm means there is a long queue of work to do. Most often the reason is that you don’t have enough computing resources. Either increasing the number of cores or using more capable cores will help.

But we have also seen this issue when the cloud provider limits I/O (network/disk): the scheduler gets blocked for a long time and cannot switch back to handle the work in the VM. So you should also check whether your deployment has hit other limits of the cloud provider.

There could be other reasons, so we suggest you have a good monitoring system pulling metrics and forwarding logs from EMQX, so we can get an overview of what is happening.
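
For setups where shell access to the container is not available, one way to watch for this alarm from outside is to poll the broker's HTTP API. The following is only a rough sketch, not something confirmed in this thread: it assumes the EMQX 5 REST API exposes active alarms at `GET /api/v5/alarms?activated=true`, that it accepts an API key and secret via HTTP basic auth, and that the response wraps the alarm list in a `data` field. The base URL, the credentials, and the helper names `fetch_active_alarms` and `find_alarm` are hypothetical placeholders.

```python
import base64
import json
import urllib.request

# Assumed values; adjust to your deployment.
BASE_URL = "http://localhost:18083/api/v5"  # assumed API address
ALARM_NAME = "runq_overload"                # also matches "emqx_runq_overload"


def fetch_active_alarms(base_url, api_key, api_secret):
    """Fetch currently active alarms.

    Assumes GET /alarms?activated=true with basic auth, and a JSON
    response shaped like {"data": [...]} (both are assumptions).
    """
    req = urllib.request.Request(f"{base_url}/alarms?activated=true")
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.load(resp)
    return body.get("data", [])


def find_alarm(alarms, name_fragment):
    """Return the alarms whose name contains name_fragment."""
    return [a for a in alarms if name_fragment in str(a.get("name", ""))]
```

Calling `find_alarm(fetch_active_alarms(BASE_URL, key, secret), ALARM_NAME)` on a schedule and recording whether the result is empty would give a simple history of when the alarm was active, which can then be compared against CPU and I/O metrics from the cloud provider.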