I was very happy to notice this summit session proposal by Mike Wilson for the OpenStack summit in Hong Kong. Its title is “Rethinking scheduler design” and the body reads:
Currently the scheduler exhaustively searches for an optimal solution to requirements given in a provisioning request. I would like to explore breaking down the scheduler problem in to less-than-optimal, but “good enough” answers being given. I believe this approach could deal with a couple of current problems that I see with the scheduler and also move us towards a generic scheduler framework that all of OpenStack can take advantage of:
-Scheduling requests for a deployment with hundreds of nodes take seconds to fulfill. For deployments with thousands of nodes this can be minutes.
-The global nature of the current method does not lend itself to scale and parallelism.
-There are still features that we need in the scheduler such as affinity that are difficult to express and add more complexity to the problem.
Finally. Someone gets it.
My take on this is the same as it was a couple of years ago. Yes, the scheduler is “horizontally scalable” in the sense that you can spin up N of them and have the load automatically spread evenly across them, but — as Mike points out — the problem each of them is solving grows significantly as your deployment grows. Hundreds of nodes isn’t a lot. At all. Spending seconds on a simple scheduling decision is not near good enough.
Get rid of the scheduler and replace it with a number of work queues that distribute resource requests to nodes with spare capacity. I don’t care about optimal placement. I care about placement that will suffice. Even if I did, the metrics that the current scheduler takes into account aren’t sufficient to identify “optimal placement” anyway.
Someone is inevitably going to complain that some of the advanced scheduling options don’t lend themselves well to this “scheduling” mechanism. Well.. Tough cookies. If “advanced scheduling options” prevents us from scaling beyond a few hundred nodes, the problem is “advanced scheduling options”, not the scheduling mechanism. If you never expect to grow beyond a few hundred nodes and you’re happy with scheduling decisions taking a couple of seconds, that’s great. The rest of us who are building big, big deployments need something that’ll scale.