Consistency: Eventual vs. (the illusion of) Strong

Lemma:
All distributed database systems are eventually consistent.

Summary of argument:
Strong consistency algorithms (2PC, T2PC, D2PC, etc.) do not remove eventual consistency from our distributed systems; they simply allow the transaction manager to know when consistency has been achieved or when a problem has occurred. This fact isn't widely understood, and so we end up with debates about eventual consistency vs. strong consistency.

In my mind, which, admittedly, could be considered a universe in its own right, there is no such thing as strong consistency. It's a shared consensual illusion that allows us to sleep at night without having to worry about an entire class of problems. But it's just an illusion.

OK, it's possible to synchronize multiple data sources so that they hold the same information at the same time, using distributed transactions: two-phase commit (XA) and the like (T2PC, D2PC, etc.). But does this give us strong consistency, where all data sources are in sync? Not really, for a couple of reasons.

First, an XA transaction is considered complete when all data sources have committed their portion of the transaction and reported back to the transaction manager that the commit was successful. What happens if one of the commits fails? It happens. The XA manager will report the transaction as having failed (assuming it's not the manager that's at fault), but the database instances will be out of sync. In this case 2PC doesn't guarantee consistency; as long as the transaction manager isn't the thing that failed, you will at least be informed of the problem. Assuming that someone will take action to resync the data at some point, we have dropped back to eventual consistency.

Another issue is that distributed data sources take different amounts of time to commit. Say my XA manager progresses to the commit phase of a two-node transaction, everything is fine, and the transaction completes with both acks turning up. There is still a period during which the first node has committed but the second has not yet done so, and during that window we will see inconsistent data across the nodes. Note that this holds for every possible model of distributed transactions: there is always a period of inconsistency. Therefore, we have eventual consistency again. Yes, we also get a callback to tell us when things are in sync again, but that's all we have.
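To make the window concrete, here is a minimal sketch of the two-phase protocol; the Resource interface and TwoPhaseCommit class are hypothetical illustrations, not any particular XA API:

```java
import java.util.List;

// Hypothetical resource participating in a distributed transaction.
interface Resource {
    boolean prepare();   // phase 1: vote yes/no
    void commit();       // phase 2: make the change durable
    void rollback();
}

class TwoPhaseCommit {
    // Returns true only when every resource has committed.
    static boolean run(List<Resource> resources) {
        for (Resource r : resources) {
            if (!r.prepare()) {                        // any "no" vote aborts the lot
                resources.forEach(Resource::rollback);
                return false;
            }
        }
        for (Resource r : resources) {
            r.commit();  // between the first and last commit(), a reader can
        }                // observe inconsistent data across nodes: the window
        return true;     // only now do we "know" the sources agree
    }
}
```

A commit() that fails after a successful prepare() is exactly the first problem above: the manager knows, but the nodes disagree until someone resyncs them.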

Then, let's consider the human element. If a human is driving the transaction somehow, say by using a credit card to pay for some goods, the databases involved may be considered synchronised, but the human definitely is not. And the human can be considered part of the system: a part that is always eventually consistent.

CQRS systems have even bigger problems. While XA may be used on the write side, it is rarely tied into the updates on the read side, so reads may be out of sync with writes.

Then there is the issue of where the data goes once it is read after a transaction. If it's being pulled into a GUI, for example, seconds could pass between the commit and the display being updated, during which time another commit may have happened. What is the point of having the data sources up to date if the GUI is reporting stale data? Of course, it doesn't have to be a GUI; that's just an example. It could be any upstream process that uses this data.

So, all distributed systems are just eventually consistent. You need to decide whether you want an event every time a limited group of data sources is found to be consistent (if that's of any use at all), or whether you're willing to live with the fact that parts of your system may be out of sync with other parts, and relax.

UK Contracting Market — Rant

I have been a contractor on the UK market for most of my career, which now spans four decades. So my opinion is probably valid.

In my slightly biased opinion, contracting is the best way to work, and using contractors is the best way to get software built.

Contractors, Permies, and Consultants: A contractor's only real interest is their revenue stream, and this is protected by doing as good a job as possible so that they are extended or recommended for other work. Permanent staff, on the other hand, are always too interested in politics and in protecting their jobs to do what's necessary and right; and anyway, they're not paid very well, so they're not that engaged. They don't really care about the company they work for, because the company really isn't interested in them. Also, generally speaking, if you're any good you'd go contracting and triple your salary (I said generally speaking: I've met some great permies who stay for location, love of the company, or the experience). Consultants have one job and one job only: selling more products and services to the client. They do not have the client's best interests at heart, because the client isn't the one paying them.

The Market:

The biggest problem with the contractor market is the agency model that has persisted since the late 1970s. The idea is that agents provide useful services for both contractor and client; however, this simply isn't true.

For example, I'm currently looking for my next gig, so in the past couple of weeks I've applied for a large number of roles through Jobserve.com, via agencies. When I apply for a role I state my day rate, so that agents with low-paying jobs won't waste their time getting back to me. Now, I'm very experienced, and very good, and I charge well above average (right at the higher end of investment bank contracts, because I often work for investment banks), so I would expect a lot of interest in my CV. But I got nothing. Hundreds of applications, not a single response. So what happened? Are the roles evaporating? Is my CV hideous? Is my experience in things like DevOps, automated testing, DDD, REST, Spring, JBoss, JEE, Scrum and XP not really in demand?

No, what I think is happening is this: a client calls an agent and says they have a role and they'll pay a thousand a day. The agency wants to keep as much of that as it can, so after finding some CVs that might meet the client's needs, it pushes them to the client starting with the lowest expected rate. So, if contractor A asks for 600 a day and contractor B asks for 500, the agency sends contractor B to the client so that it can make more money.

Is the client upset by this? Well, they're probably getting the skills they asked for, and the project will go ahead. But I'm not just good: I'm a hyper-productive polymath. The client has been given someone who will suffice rather than someone who will excel. The cost to the client doesn't change, but they get a lower-quality resource for their money (that's an assumption, but let's run with it). If they'd seen all the CVs they would have recognised me as special, and might have preferred someone with much more experience, and skills in other areas that they could use to their advantage. Instead they get the lowest suitable bidder.

Now, my argument here isn't that I'm special and more deserving of work than someone else. My argument is that this should be the client's decision, and it shouldn't be distorted by an agency chasing a bigger profit, because the end result is a lower-quality project team.

I'm also assuming that price == quality, which isn't always true, but I charge more because I can, and contractors are pretty smart in general, so I believe almost everyone else is doing the same.

So, agencies are manipulating the market in a way that's bad for their clients. No surprise there: they don't make money from projects succeeding. In fact, they benefit more from a failure that's retried time and time again.

Inflation:

I noticed something else today too. The average high-end rate for the roles I'm interested in is 600/day. Funnily enough, it's been that way for decades. Given inflation (low as it is at the moment), I would have expected it to have risen well beyond that by now. So what's happening here?

Again, it's the agents suppressing our rates. By creating a market where the lowest competent bidder wins, they keep the low rates around the 500 mark, which keeps the high rates at 600. OK, there are outliers pushing 750, but that's for algorithmic trading specialists with years of experience doing exactly that in front-office investment banking. There aren't many people who can do that, so the rates reflect it. However, I don't believe there are only 20% fewer front-office developers than other developers; such developers are incredibly rare. There are probably fewer than a hundred people in the whole of London who can do that job properly, against tens of thousands of contractors in general. So the rates for the top people should be a lot higher, and they are: when I did this six years ago I was told that I could name my own rate, that the client didn't really care how much I wanted, because it was peanuts compared to what they stood to earn. So why do these roles show at 750 on Jobserve? Because the agents don't want other contractors knowing how much disparity there really is in the market, so that they can continue to offer 500 to everyone else without a fight.

Not all agencies are like this. I've worked with some lovely people in the past, but the majority of them are playing both sides off against each other, and they are the only ones benefiting from the arrangement.

Solution:

We need to remove the agencies from the equation and have contractors and companies deal directly with each other. I understand that the agencies have the corporate interface, invoicing and credit worked out, but surely this isn't hard (I know it isn't). We need a contractor exchange, where companies can list their requirements and contractors can bid for work, with the work managed by the exchange, which takes no active part in the matching process. Only then will contractors and clients get what they need, the best deal for them, without a third party in the middle looking out for itself.

The problem is that, as a contractor, I've written and rewritten my CV so many times that I'm sick of it. We could use LinkedIn, which at least has employment history, but it needs skill information too to be really useful here (skills per assignment, not the 'recommendation' nonsense it currently has).

Another problem is getting a consistent definition of the skills needed for a role. Based on an informal review of a few hundred job specs, it's clear that some of them have been written by junior associates at the agency, because I haven't heard anyone else in the industry ask for J2EE as a skill in over 10 years. And pitifully minor skills are continually listed: 'must know XML and JAXB'. It's hard to imagine an experienced Java contractor who doesn't have that as a core skill, because that's what it is, and so it doesn't need to be listed. If you know Java, and you're charging 400+ quid a day, you will know JAXB; it's as simple as that. I think half the skills listed are in that category.

REST and HATEOAS

So, we're going to use JSON for HATEOAS REST, but first we have to decide what to do with the collections on our objects.

Problem: in a very popular online school system, I have a Course object that has a collection of the Students who are attending or have attended. Typically there are a lot of students for one course (maybe millions). How is this represented in HATEOAS?

We don't really want to embed the student collection in the course object. If we did, we'd have a number of problems. First, the user may not be interested in that information, yet we've forced them to pay the cost of downloading it. Second, the list is long and will take a long time to transfer. Third, the app or page displaying the course object will have to know how to deal with students too, because they're embedded in the course.

Rather than embed the collection, we should include a link to it, which the client can present to the user so they can choose whether they want to see any of that information. HTML solves this with the head section of a document, which may include a list of link elements. HATEOAS suggests that we copy that approach, but we can't really, because there is no HATEOAS JSON standard: a JSON object represents a resource on the server and has no header or links section.

Spring Data REST embeds Atom-style link lists in its JSON responses (the HAL format), so the JSON object representing the requested resource carries a list of links. But this really is an arbitrary choice. I think I could design a much better links protocol for JSON myself, something tidier, simpler and easier, but it wouldn't matter if I did, because it would still be a solution I wrote myself that won't interoperate with anyone else's. This means we can't yet write a generic JSON/REST/HATEOAS client that will work against all possible services. We need a standard for this.
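To make that concrete, a Spring Data REST style response looks roughly like this; the course resource and its links are invented for the example:

```json
{
  "name": "Distributed Systems 101",
  "_links": {
    "self":     { "href": "http://server/courses/42" },
    "students": { "href": "http://server/courses/42/students" }
  }
}
```

The client follows the students link only if the user actually asks for that data, which solves the embedding problem above, but only for clients that know this particular link convention.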

Personally, I think the JSON response sent by the server should be wrapped in a response wrapper containing a header and a body section, just like an HTML document, and the header should include links to JavaScript, CSS, and collections. That way, if I request a resource, my client can simply pull down the code necessary to render it.
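As a sketch only (this is my proposal, not any standard, and every name in it is made up), the wrapper might look like:

```json
{
  "header": {
    "links": [
      { "rel": "students",   "href": "http://server/courses/42/students" },
      { "rel": "renderer",   "href": "http://server/assets/course.js" },
      { "rel": "stylesheet", "href": "http://server/assets/course.css" }
    ]
  },
  "body": {
    "id": 42,
    "name": "Distributed Systems 101"
  }
}
```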

Now, if the rendering code (JavaScript) were also standardized, so that an object always displays in the given div, and the CSS were standardized so that the renderer is guaranteed not to draw outside its container, then we'd have a way of accessing and viewing arbitrary resources without prior knowledge of what they are, where they are, and how they work. Unfortunately, until we have these standards we're left floundering around creating our own point solutions.

Basic Artifact Workflow

We build our software locally, on a developer's machine, and we run the tests. The artifacts generated from our code at this point are disposable. Once a developer is happy that the tests are passing, they commit their code to the source repository and the CI box takes over.

In my world, the CI box is the ultimate arbiter of truth. Once it has built and successfully tested an artifact, we upload it to Nexus so that we have a central copy. A copy is then deployed to the functional test environment, where it is examined for flaws. Once through functional testing, we take the same file from Nexus and put it in our integration test environment, and from there it goes through UAT and pre-prod. If it passes all the tests and the users agree, we deploy it to production.

The key here is that we use the same artifact across all environments. We don't rebuild for specific environments, because that introduces a huge risk. You really don't want to be putting an untested war in production, even if the only change from the tested one was in a config file listing a database connection string. Can you count the things that could go wrong?

Now, we automate as much of this as possible, of course, using tools from what's now called DevOps plus custom automation scripts. This removes the chance of human error (at least at that level), and once a process is automated and tested it will work tirelessly, and rapidly, without flaw.

In my current gig it takes a couple of minutes to build our suite of software and about 20 minutes to run all the unit and integration tests. We have automated tests in each environment too, written in Gherkin: smoke tests in Fun and integration tests in Int. We haven't automated UAT yet, and I don't think we will; it's just too useful to watch humans break things.

If your process is vastly different from this, you might have problems that you haven't thought about. Propagating tested artifacts through test environments is a core principle of software engineering, and if you aren't doing it you're missing out.

Object Orientation – you’re probably doing it wrong

Is it possible that almost everyone really has no idea that they’re doing it wrong?

If taken literally, and to the extreme, encapsulation means that objects should have no getters or setters, only command methods. If we follow this pattern, our code becomes a lot simpler and easier to understand. But how, then, do we write user interfaces? We follow the CQRS pattern of having two distinct sets of code: one containing our object model, which is used for writing and holds the true state of our application, and a second containing the information used by the UI. The read-side model is updated by responding to events from the write-side model. As a consequence, we gain the ability to tune the performance of the two sides of the application independently.

Greg Young named this pattern CQRS (Martin Fowler later wrote it up), but I think it's more than that: I think this is object orientation. Having accessor and mutator methods (getters and setters) leads to exactly the sort of trouble that encapsulation was designed to fix. A long time ago, the industry jumped on interfaces as the de facto pattern for implementing encapsulation. After all, you're not exactly exposing internal representation if you use a method; you can vary the implementation separately from the interface, the getter method, without changing your dependent code. Problem solved, right? No. The problem is still very much in effect, because this naive approach still exposes far too many implementation details. It is almost always the case that when we change the implementation, we also change the interface. This is seen very commonly in the JavaBeans approach to OO, a very influential approach that I now believe is holding engineering practice back.

Try the CQRS approach for a while and you'll probably find that your code is cleaner and more elegant, and, just as importantly, easier to maintain, with fewer defects, and faster, and therefore less expensive, to develop. All of this adds up to reduced costs, shorter time to market, and reduced risk.

In a typical webapp, the pattern is simple: JavaScript fires off REST requests to the server for data to display; these are handled by the read side, which is optimised for such queries. Eventually the user submits a change that the server needs to know about, so the JavaScript at the front formulates a command and sends it, along with any appropriate data, to the back end for processing. The command is processed by the REST controller, which calls a method on an aggregate root object that is responsible for propagating instructions to its subservient parts. While this is happening, events are raised, which are handled both by other parts of the write-side model and by the read-side model, which updates itself to reflect the changes (this is a one-way street: no read-side events are handled on the write side). When command processing is complete, the REST server returns the appropriate code to the client, which then requests updated data from the read side.

This pattern is remarkably simple, but it leads to very clear code and a very clear separation of concerns. REST services provide data to be rendered by the client. REST services provide command-processing interfaces. Write-side models process those commands. Read-side models react to notifications that the model has changed. All independent of each other. Not a getter in sight.
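Here's a minimal, framework-free sketch of that shape; the Course, StudentEnrolled, and CourseSummaryView names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

record StudentEnrolled(String courseId, String studentId) {}    // write-side event

class Course {                                                  // write-side aggregate root
    private final String id;
    private final List<String> studentIds = new ArrayList<>();  // no getters: state stays private
    private final Consumer<StudentEnrolled> publish;

    Course(String id, Consumer<StudentEnrolled> publish) {
        this.id = id;
        this.publish = publish;
    }

    void enrol(String studentId) {                              // command method, not a setter
        if (studentIds.contains(studentId)) return;             // invariant enforced here
        studentIds.add(studentId);
        publish.accept(new StudentEnrolled(id, studentId));     // the read side learns via the event
    }
}

class CourseSummaryView {                                       // read-side model, query-optimised
    private final Map<String, Integer> enrolmentCounts = new ConcurrentHashMap<>();

    void on(StudentEnrolled event) {                            // one-way: events flow write -> read
        enrolmentCounts.merge(event.courseId(), 1, Integer::sum);
    }

    int enrolmentCount(String courseId) {
        return enrolmentCounts.getOrDefault(courseId, 0);
    }
}
```

Wiring is just `new Course("maths-101", view::on)`: the aggregate exposes a command method and raises an event, and the read model answers the queries. Not a getter on the write side.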

Where I have done this, we had separate databases for the two sides, and we used the excellent Axon framework for the write side and the excellent spring-data framework for the read side. I'd like to have used spring-data-rest, but that's just not ready yet. We also used spring-webmvc for our REST controllers.

We build all of this through Maven, once, and release it to a common repo. Builds are repeatable, but we don't rely on that: we promote a single built artifact through all of our environments, ensuring that the artifact in production is the one we tested in functional testing, integration, QA, UAT, and pre-prod.

Optimistic Locking and Object Identity

Optimistic locking is the process by which we block a client from saving changes to an object that was changed by someone else during that client's edits. The process is common: users A and B both load the same resource and start work. User B submits their work first. When user A submits their work, it's rejected, because the underlying object changed while they were editing it. We call this optimistic locking because we assume this won't happen very often, and the user optimistically assumes their edits won't be wasted.

Usually we implement this with a version number (or timestamp, same thing) on the object. We get a request to change an object, so we load it by ID and check its version number. If it differs from the version number on the submitted object, we know the submitted object is out of date and we can't accept the changes.

Of course, we don't need to load the whole object just to find out that we're going to reject the changes with an optimistic locking exception. For efficiency, we could query for the object using a compound key, the ID and version number together; if we don't find an object to update, we reject the changes. We could even simply issue the DB update (if we're using a DB), using the compound ID+version key in the where clause, and note whether a row was changed, returning a rejection response if nothing was updated. Efficient it may be, but it's logically equivalent to load-and-check (actually, the direct update is better than load-and-check, unless you want to risk the underlying data changing between the load and the check).
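A sketch of that direct-update variant in plain JDBC; the orders table, its columns, and the repository class are assumed for the example:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class OrderRepository {
    private final Connection connection;

    OrderRepository(Connection connection) {
        this.connection = connection;
    }

    /** Returns true if the update won; false signals an optimistic-locking conflict. */
    boolean updateStatus(long id, long expectedVersion, String newStatus) throws SQLException {
        String sql = "UPDATE orders SET status = ?, version = version + 1 " +
                     "WHERE id = ? AND version = ?";   // compound id+version key
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, newStatus);
            ps.setLong(2, id);
            ps.setLong(3, expectedVersion);
            return ps.executeUpdate() == 1;            // zero rows: someone else got there first
        }
    }
}
```

A false return is your optimistic locking exception: no row matched the id+version pair, so the caller's copy was stale.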

But, given the optimisation above, I noticed that, effectively, the object's identity changes on every update. An object with optimistic locking is an object whose identity is encoded in the sum of its attributes; if a single byte changes, then so does its identity.

This is deep. We often assume that an object's identity is tied to its ID, but here we're deliberately subverting the ID to ensure that a user is fully informed before they take an action. In doing so, we're forcing the object's identity to change when we should be changing only the values of its attributes.

REST Non-CRUD

REST interfaces are often seen as basic CRUD interfaces, and the pattern is so strong that it's sometimes difficult to see how to use REST for anything other than CRUD. In this post I'll discuss how to issue commands against resources.

REST Command Pattern

Problem

Given an object, ask the REST server to perform a non-CRUD operation on it.

Example

My user wants to submit an order, but all orders must be approved before processing.

Naive Design:

The client could change the order's status field and issue a PUT to the service, sending back a complete copy of the order. The server could use the resource URI to load the order and compare it to the order sent from the client. It would discover that the status field has changed and infer the correct command from its value.

The reason this is naive is that in any non-trivial application the object graph representing the order may be large and complex, and navigating through it to the correct status indicator may be a chore. Also, you may not actually want the client changing the status, or even sending that field across the wire. If the client changes the status to 'approved' and the PUT then fails, it has to change the status back again, which means the server and client models are out of sync for a while: something quite undesirable.

This leads to a lot of code pulling the model apart looking for changes and trying to guess the appropriate command/action to take based on them. Even when you find a class of commands ('approve', 'reject') you end up with a string of if-statements, which isn't particularly OO: if (status == 'approve') then new ApproveCommand() else if (status == 'reject') then new RejectCommand()…

Also, it seems like a poor idea to try to infer the user's intent from the data, when the client actually has this information; it just hasn't communicated it clearly to the server.

Solution

The client should send extra information with the PUT to allow the server to route to the correct, specific handler.

Implementation

There are four possible implementations:

  1. Send the command as a query parameter: http://server/service/model/id?command=approve
  2. Send the command as part of the URI: http://server/service/model/approve/id or http://server/service/approve/model/id
  3. Send the command as part of the object being PUT: order.command=’approve’
  4. Send the command as a header parameter: headers : {command:’approve’}

Options one and two suffer from problems like cache busting (because the URI appears to change). The third option adds an unnecessary name/value pair to the object, which may break the service (the server may not be expecting it, or may need extra code to deserialize it), and the handler code still ends up with many if-statements based on the value of the command.

The final option, using a header parameter, seems perfect. The resource isn't infected with the command, the URI doesn't change (and why should it? we're operating on a resource, not changing which resource we're using), and, at least in Spring, we can route to specific methods in our service implementation based on header values, rather than scattering if-statements everywhere.

In Spring we can use header values as part of the routing, which is easily configured via annotations on the service implementation, so it's easy to invoke a specific method based on a header value: @RequestMapping(headers = "command=approve", method = RequestMethod.PUT).
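Putting that together, here's a sketch of a Spring MVC controller routed on the header; the paths, the command names, and the OrderService are illustrative, not a definitive implementation:

```java
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PutMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/service/orders")
class OrderCommandController {

    private final OrderService orders;          // hypothetical application service

    OrderCommandController(OrderService orders) {
        this.orders = orders;
    }

    // Invoked only when the request carries the header "command: approve".
    @PutMapping(value = "/{id}", headers = "command=approve")
    ResponseEntity<Void> approve(@PathVariable long id) {
        orders.approve(id);
        return ResponseEntity.noContent().build();
    }

    // Same URI, different header value, different method: no if-statements.
    @PutMapping(value = "/{id}", headers = "command=reject")
    ResponseEntity<Void> reject(@PathVariable long id) {
        orders.reject(id);
        return ResponseEntity.noContent().build();
    }
}

interface OrderService {                        // assumed for the sketch
    void approve(long id);
    void reject(long id);
}
```

Spring does the dispatch: two PUTs to the same URI land in different methods purely on the command header, which is exactly the routing the final option promises.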