In my last post, I talked about how we’ve deviated from our original design tenets. I’d like to talk a bit about how we can move forward.
I guess first of all I should point out that I think the design tenets are sound and we’re doing it wrong by not following them, so the solution isn’t to just throw away the design tenets or replace them with new, shitty ones.
I should also point out that my criticism does not apply to Swift. Swift mostly gets it right.
If we want OpenStack to scale, to be resilient in the face of network failures, etc., we need to start going through the various components and see how they violate the design tenets. It’s no secret that I think our central data store is our biggest embarrassment. I cannot talk about our “distributed architecture” and then go on to talk about our central datastore while keeping a straight face.
I don’t believe there’s anything you can do to MySQL to make it acceptable for our use case. That goes for any flavour of MySQL, including Galera. You can make it “HA” in various ways, but at best you’ll be compartmentalising failures. A network partition will inevitably render a good chunk of your cloud unusable, since it won’t be able to interact completely with its datastore.
What we need is a truly distributed, fault tolerant, eventually consistent data store. Things like Riak and Cassandra spring to mind. And, while we’re at it, I think it’s time we stop dealing with the data store directly from the individual projects and instead abstract it away as a separate service that we can expose publically as well as consume internally. I know this community enjoys defining our own API’s with our own semantics, but I think we’d be doing ourselves a horrible disservice by not taking a good, hard look at AWS’s database services and working out how we can rework our datastore usage to function under the constraints these services impose.
I’m delighted to learn that we also have a queueing service in the works. As awesome as RabbitMQ is(!), it’s still a centralised component. ZeroMQ would probably solve a lot of this as well, but having an actual queueing service that we can expose publically as well as consume internall makes a ton of sense to me.
If we make these changes, that’ll take us a long way. What else do you think we should do?