Teamwork

Building a Team for your next Microservices Project

Praveen Ray
11 min read · Oct 11, 2021


We are tasked with bringing a Mainframe application over to a modern distributed architecture. The Mainframe-based system has been in use for over thirty years. To give credit where it’s due, the system has served us well and is quite stable and feature rich. However, while it has reached a state of stable equilibrium and keeps servicing daily business, moving forward and adding new features is quite painful due to its monolithic nature. On-boarding new clients is cumbersome since the inevitable client customizations are hard to implement reliably and quickly. Quick feedback loops are almost non-existent since faster deployments are not possible with the existing system. Not to mention the batch-oriented processing, which can only happen at certain times of the day and requires us to ‘stop the world’ for a few hours while it runs.

Here is a blueprint we followed for implementing our new microservices-based design.

We started with two simple goals:

  • Do not disrupt existing business and client base
  • Fast Feature development

Starting From Scratch

Coding Standards

Start with a set of simple guidelines on coding standards: things like variable and function names, tabs vs. spaces, size of functions. I like to emphasize writing pure functions as much as possible. Languages like Kotlin and Clojure make it easier, but even in Java one can strive to write side-effect-free functions. Codify these simple rules and make them easily available. Enforce them on each code merge until they become second nature to the wider team. You do not want to go crazy, so make sure your list doesn’t run beyond ten points.
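To illustrate the pure-function point, here is a minimal Kotlin sketch; the names are made up for illustration. The pure version depends only on its inputs and is trivially testable, while the impure one reaches into shared mutable state:

```kotlin
// Impure: reads and mutates shared state, hard to test in isolation.
var runningTotal = 0.0
fun addToTotal(amount: Double) {
    runningTotal += amount
}

// Pure: result depends only on the inputs, no side effects.
fun applyDiscount(price: Double, discountPct: Double): Double =
    price * (1.0 - discountPct / 100.0)
```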

Technology Stack

One of the ‘promises’ of independent services is that each service exposes an interface and internally can be developed using whatever language or tech stack that particular team desires. While this sounds nice and romantic, it should be strongly discouraged. There is immense value in picking a company-wide tech stack and programming language. Shared libraries and a shared knowledge base are only a few of the advantages. It also makes it easier for developers to change teams, and troubleshooting across services is easier when there is some uniformity in the stack. Individual teams can have the freedom to pick frameworks, but at the very least the same programming language should be used across teams.

Debugging/Tracing

This is the most important piece to get right. Each request must be traceable from start to finish. Start is defined as the point of ingress into your network. As the request proceeds through the maze of services, we need to be able to track its progress. It’s imperative to be able to query your centralized logs for each unique request.

Each request must be assigned a unique ID at the ingress point. This could be done by attaching a gatekeeper service to the Kafka topic which is bringing events into the ecosystem. You can have the gatekeeper service perform basic validations and attach a unique request ID to the incoming event before sending it downstream. From there on, each service needs to forward this unique ID. If you are using Avro, this can be formalized by adding the ID to your Avro schema as a mandatory field. For REST requests, a system-wide validation library must check that each incoming request contains this unique ID. This is another place a single tech stack becomes useful.
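As a rough sketch of what such a gatekeeper might look like with the plain Kafka client (the topic names and the validation rule here are hypothetical):

```kotlin
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
import java.time.Duration
import java.util.UUID

// Gatekeeper sketch: validate each incoming event and stamp it with a unique request ID
// before forwarding it downstream. "raw-events" and "validated-events" are illustrative names.
fun runGatekeeper(consumer: KafkaConsumer<String, String>, producer: KafkaProducer<String, String>) {
    consumer.subscribe(listOf("raw-events"))
    while (true) {
        for (record in consumer.poll(Duration.ofMillis(500))) {
            if (record.value().isNullOrBlank()) continue            // basic validation: reject empty payloads
            val requestId = UUID.randomUUID().toString()
            val out = ProducerRecord<String, String>("validated-events", record.key(), record.value())
            out.headers().add("requestId", requestId.toByteArray(Charsets.UTF_8))
            producer.send(out)
        }
    }
}
```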

Logging

Application Logging is not to be taken lightly. Since Production issues are a given, one must have good and searchable logs.

Understand that free-text logging is no substitute for good metrics collection. Each application should track basic metrics and auditing in its own database instead of writing them to logs. For example, application start time, last request received, configuration parameters, the deployed git tag, etc. should be recorded in the database. Use async database writes if performance is a concern.
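A hedged sketch of what that could look like, writing a few startup facts to an audit table on a background thread (the table name, columns and JDBC URL are made up for illustration):

```kotlin
import java.sql.DriverManager
import java.time.Instant
import kotlin.concurrent.thread

// Record basic runtime facts in the service's own database instead of burying them in free-text logs.
// The "app_audit" table and the connection string are illustrative.
fun recordStartup(jdbcUrl: String, gitTag: String, configSummary: String) {
    thread(isDaemon = true) {                          // async so a slow write doesn't block startup
        DriverManager.getConnection(jdbcUrl).use { conn ->
            conn.prepareStatement(
                "INSERT INTO app_audit (started_at, git_tag, config) VALUES (?, ?, ?)"
            ).use { stmt ->
                stmt.setString(1, Instant.now().toString())
                stmt.setString(2, gitTag)
                stmt.setString(3, configSummary)
                stmt.executeUpdate()
            }
        }
    }
}
```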

Applications must be able to increase or decrease logging levels dynamically. You are not going to restart an application just to get more logs. Provide a REST API, or listen for special command events on Kafka, to adjust log levels at run time.
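With Logback, for instance, the level of any logger can be flipped at run time. The sketch below assumes Logback is the SLF4J backend and would sit behind whatever REST endpoint or Kafka command handler you expose:

```kotlin
import ch.qos.logback.classic.Level
import ch.qos.logback.classic.Logger
import org.slf4j.LoggerFactory

// Change a logger's level at run time, e.g. in response to a REST call or a Kafka command event.
// Assumes Logback is the SLF4J implementation; the cast fails otherwise.
fun setLogLevel(loggerName: String, level: String) {
    val logger = LoggerFactory.getLogger(loggerName) as Logger
    logger.level = Level.toLevel(level, Level.INFO)    // falls back to INFO on an unknown level string
}
```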

Practice 12 factor app development and output your Logs to STDOUT/STDERR.

Lastly, use centralized logging infrastructure such as Splunk or ELK. Both of these require you to run an agent alongside your application, so that must be part of your deployment. Kubernetes makes log aggregation somewhat simpler through something like Fluentd. Built-in log aggregation is another advantage of managed Kubernetes offerings such as EKS and GKE.

Another approach is to write a log appender that outputs your logs to a Kafka topic. A Kafka Streams application can then generate alerts and other stats. Another consumer can feed the raw logs from Kafka into Splunk or Elasticsearch.
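Off-the-shelf Kafka appenders exist for the common logging frameworks, but a bare-bones custom Logback appender is only a few lines. This is a sketch; the bootstrap server, topic name and plain-string serialization are illustrative choices:

```kotlin
import ch.qos.logback.classic.spi.ILoggingEvent
import ch.qos.logback.core.AppenderBase
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
import java.util.Properties

// Minimal Logback appender that ships every log line to a Kafka topic ("app-logs" is illustrative).
class KafkaLogAppender : AppenderBase<ILoggingEvent>() {
    private lateinit var producer: KafkaProducer<String, String>

    override fun start() {
        val props = Properties().apply {
            put("bootstrap.servers", "localhost:9092")
            put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
            put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        }
        producer = KafkaProducer(props)
        super.start()
    }

    override fun append(event: ILoggingEvent) {
        // Key by logger name so logs from the same component land in the same partition.
        producer.send(ProducerRecord("app-logs", event.loggerName, event.formattedMessage))
    }

    override fun stop() {
        producer.close()
        super.stop()
    }
}
```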

Communication among Services

One of the biggest issues is how to make sure all your services are speaking the same vocabulary. Free flowing JSON can get out of sync quickly.

One solution is to use something like Avro and generate code at each service. Care must be taken to make sure Avro schema evolution stays backward compatible. Both your REST and Kafka pipes would then be configured to speak Avro.

Another solution is to ditch REST altogether and use Kafka for all your communication needs. Use REST for the services which serve a web UI, but everything else uses Kafka for request and response. This has a few advantages. One is Kafka’s persistence of your request and response messages. Another is that back pressure is built into the system. There is also no need for an API discovery engine, since Kafka serves that purpose. The downsides are the obvious need for more Kafka hardware and increased latency, since requests and responses must be persisted by Kafka whereas HTTP traffic stays in memory. Your applications will also need to handle requests and responses asynchronously, since Kafka requests and responses are decoupled from each other. This doesn’t mean you have to go completely reactive on day one, but your app will have to maintain enough state to be able to tie responses to requests arriving asynchronously.
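One simple way to keep that state is a map of pending futures keyed by a correlation ID that travels with the Kafka message. A minimal sketch, assuming string payloads and a separate reply topic (both assumptions, not prescribed by Kafka itself):

```kotlin
import java.util.UUID
import java.util.concurrent.CompletableFuture
import java.util.concurrent.ConcurrentHashMap

// Just enough state to match responses arriving on a reply topic with the requests that caused them.
class PendingRequests {
    private val pending = ConcurrentHashMap<String, CompletableFuture<String>>()

    // Called when a request is published; the correlation ID is sent along with the message.
    fun register(): Pair<String, CompletableFuture<String>> {
        val correlationId = UUID.randomUUID().toString()
        val future = CompletableFuture<String>()
        pending[correlationId] = future
        return correlationId to future
    }

    // Called by the consumer of the reply topic when a response event arrives.
    fun complete(correlationId: String, payload: String) {
        pending.remove(correlationId)?.complete(payload)
    }
}
```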

Another very important consideration is event delivery semantics. Event streaming will require your application to withstand either missed events or duplicate events, since exactly-once delivery of events in distributed systems is simply not possible. Financial systems usually can’t afford to lose events, so they must be coded to handle duplicate events. Each service should maintain enough state to detect a duplicate event and reject it, or handle it in an idempotent way. If duplicate detection via database queries is too slow, look into Bloom filters to maintain this information in memory; depending upon your requirements, you can tune the probability of false positives.
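Guava’s BloomFilter, for example, lets you trade memory for a tunable false-positive rate. A rough sketch of an in-memory duplicate check, with the fallback database lookup only hinted at in a comment:

```kotlin
import com.google.common.hash.BloomFilter
import com.google.common.hash.Funnels

// Probabilistic duplicate detection: a Bloom filter never misses an ID it has already seen,
// but may flag a small fraction of new events as "possible duplicates" (tunable via fpp).
class DuplicateDetector(expectedEvents: Long = 10_000_000, fpp: Double = 0.001) {
    private val seen = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), expectedEvents, fpp)

    // Returns true if the event is definitely new; on a possible duplicate, confirm against the database.
    fun markIfNew(eventId: String): Boolean {
        if (seen.mightContain(eventId)) {
            return false   // possibly a duplicate; fall back to the service's own store to be sure
        }
        seen.put(eventId)
        return true
    }
}
```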

Orchestration of Workflows

This is somewhat abstract and may not be obvious right away. The microservices ecosystem you are building exists to serve a higher-order workflow. For example, the sale of an item involves multiple steps, and each step requires work to be performed by a separate service. You’d need to check inventory, check credit, call the payment processor, work with shipping, send notifications, etc. You might even have a manual ‘approval’ step. Often, the sequence of operations is not linear and can involve a flow chart of operations to complete the entire transaction. Errors and remediations can make this flow even more complex. The entire workflow needs to be coordinated in the proper order.

Most of the time this need is not articulated and the workflow gets implemented implicitly. For example, event A gets picked up by service A, which triggers events B and C, which in turn trigger further events to be processed by other services. There is a workflow in progress, but it’s not explicitly defined anywhere. This is almost never desirable. You don’t have a well-defined set of steps, and there is no visibility into how these events are tied together to implement a higher-order business function.

One should look into building a separate service to define and ‘run’ these workflows. It can be thought of as a distributed state machine. Unless your workflow is simple, the implementation can get complex quickly, so it’s better to look into off-the-shelf tools such as Camunda. The key to such a system is a good UI to keep track of, and manipulate, system state. You can start by building a simple system which keeps a table of incoming triggers (e.g. Kafka events), the code to run in response to each trigger, and the events to emit after the trigger code completes successfully. Another important requirement is to maintain extensive auditing and a history of each state change. An immutable database such as Datomic can come in handy to build such state machines and track system evolution over time.
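A minimal version of that trigger table might look something like the sketch below; the event names and the audit sink are illustrative, and a real engine like Camunda adds persistence, retries and a UI on top of the same idea:

```kotlin
// A tiny, explicit workflow definition: which handler runs for each trigger event,
// and which events fire next on success. All event names here are illustrative.
data class WorkflowStep(
    val trigger: String,
    val handler: (payload: String) -> Unit,
    val nextEvents: List<String>
)

class WorkflowRunner(steps: List<WorkflowStep>, private val audit: (String) -> Unit) {
    private val byTrigger = steps.associateBy { it.trigger }

    // Called for every incoming event, e.g. from a Kafka consumer loop.
    fun onEvent(event: String, payload: String): List<String> {
        val step = byTrigger[event] ?: return emptyList()
        audit("handling $event")
        step.handler(payload)
        audit("completed $event, emitting ${step.nextEvents}")
        return step.nextEvents        // publish these downstream to advance the workflow
    }
}
```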

Testing at Scale

If you do not have a full suite of automated tests, you will never reach the promised land of microservices. The entire premise of the microservices architecture is to be able to deploy a service fast and without disruption. We want to be able to build and deploy a service on demand at any time of the day. This is only possible if you have a full suite of automated tests which are fast to execute on your pipeline. The holy grail is when a git commit to your main branch triggers a pipeline that runs tests and performs production deploys automatically.

This will take some time to achieve, but it can be done if you’re focused and make it a requirement from day one. We make use of Docker extensively and our tests do NOT use mock objects. Our git pipeline spins up dockerized database and Kafka servers, and our tests run against a real database. We do not believe in mocking objects or substituting a lightweight in-memory database for testing. Use Docker, docker-compose and even Kubernetes to spin up the infrastructure you need for tests. This is not a unit test. It’s not an integration test. Call it what you will; it serves the purpose and tests against real infrastructure instead of mocks.
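Testcontainers is one way to get this directly from JUnit. A sketch of a test running against a real Postgres instance spun up in Docker; the image tag, table and assertions are illustrative:

```kotlin
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test
import org.testcontainers.containers.PostgreSQLContainer
import java.sql.DriverManager

// Spin up a real Postgres in Docker for the duration of the test; no in-memory substitutes.
class RepositoryTest {
    @Test
    fun `writes and reads against a real database`() {
        PostgreSQLContainer<Nothing>("postgres:15").use { pg ->
            pg.start()
            DriverManager.getConnection(pg.jdbcUrl, pg.username, pg.password).use { conn ->
                conn.createStatement().execute("CREATE TABLE orders(id INT PRIMARY KEY)")
                conn.createStatement().execute("INSERT INTO orders VALUES (1)")
                val rs = conn.createStatement().executeQuery("SELECT count(*) FROM orders")
                rs.next()
                assertEquals(1, rs.getInt(1))
            }
        }
    }
}
```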

Although not part of the CI/CD pipeline, you’ll need to plan for performance tests. Obviously a close replica of the production data set is ideal for running perf tests, but this is easier said than done. The hard truth is you will end up doing performance tests in production. Accept it and plan for it. Good old feature flags that turn sections of code on and off come in handy when you want to test multiple scenarios in running production code.
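Even a trivial flag around the code path under test goes a long way. A minimal sketch, assuming an in-process toggle that your admin endpoint or Kafka command handler can flip at run time; all names here are made up:

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Simplest possible feature flag: an in-process toggle flipped at run time,
// so both code paths can be exercised in production without a redeploy.
object Flags {
    val newPricingEngine = AtomicBoolean(false)
}

fun priceOrder(orderId: String): Double =
    if (Flags.newPricingEngine.get()) priceWithNewEngine(orderId) else priceWithLegacyEngine(orderId)

// Hypothetical implementations of the two paths being compared.
fun priceWithNewEngine(orderId: String): Double = TODO("implementation under test")
fun priceWithLegacyEngine(orderId: String): Double = TODO("existing implementation")
```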

DevOps/Deployments

Running a flawless CI/CD pipeline is an absolute requirement. I would say this is even more important than your mainline codebase. Development and deployment should use the same pipeline, and each developer should be able to spin up the entire setup either locally or in the cloud. Spend time upfront getting the pipeline working and make it as frictionless as possible. The Agile concept of autonomous teams really works great here. Each team should be responsible for the entire lifecycle of its service: they should own the code, own the automated test suites, own the deployment pipeline, be able to take their service from development to production, and be able to provide support. Having a separate SRE team is an impediment. Your infrastructure team needs to keep the street lights on, not drive your car. A suboptimal CI/CD pipeline is like trying to drive your Ferrari on a cobblestone street. It’ll end up wasting an enormous amount of valuable developer time and has to be avoided at all costs. This often requires some convincing of senior management, but it’s worth every penny, so spend some time upfront to get the build and deploy pipeline running smoothly.

No Shared Databases

Each microservice should have its own data store. If you are able to split your microservices along the correct domain boundaries, a database becomes a natural fit for each service. Expose APIs, not databases.

This setup is quite standard; however, it makes analytics quite cumbersome or even impossible since you don’t have an overarching view of the entire dataset. Not having the entire dataset in one place also makes discovering usage patterns impossible. You’ll eventually be forced into setting up a central data lake. It can be a Postgres database or a Spark cluster. It might even be an Elasticsearch cluster. The usual practice is to feed your data lake from the same Kafka streams that feed your services.

One other consideration is to understand the effect of eventual consistency in a distributed system. You will need to prepare for the system not being in a consistent state at any given moment. A failure in one of the services will require a rollback of ‘committed’ events in other services, which can only be done by issuing a compensating transaction. Rollbacks can be a manual step, but it’s imperative to be able to detect the inconsistency in an automated fashion. Most of the time, you cannot afford to silently lose a failed transaction in any service. This might involve running another service that checks system invariants (by running reconciliations across multiple services periodically) or making sure your database writes are in sync with error writes on the Kafka topic. Something like Debezium should be investigated to make sure you don’t lose a failed database write.
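A reconciliation job can be as simple as periodically comparing what two services believe about the same transaction and emitting a compensating event when they disagree. A rough sketch; the data shape, the two fetch functions and the compensation callback are hypothetical:

```kotlin
// Periodic invariant check across two services' own stores. When a payment exists without a
// matching shipment request, emit a compensating event instead of letting the mismatch linger.
data class OrderView(val orderId: String, val paid: Boolean, val shipmentRequested: Boolean)

fun reconcile(
    fetchPayments: () -> List<OrderView>,                  // e.g. query the payment service's store
    fetchShipment: (orderId: String) -> OrderView?,        // e.g. query the shipping service's store
    emitCompensation: (orderId: String) -> Unit            // e.g. publish a reversal event to Kafka
) {
    for (payment in fetchPayments().filter { it.paid }) {
        val shipment = fetchShipment(payment.orderId)
        if (shipment == null || !shipment.shipmentRequested) {
            emitCompensation(payment.orderId)
        }
    }
}
```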

A Good UI

Often overlooked, but if you are building a service that will eventually face a human user, a good-looking and engaging UI is an absolute must. People respond to beautiful and simple presentations. I believe pairing a UI designer with a psychologist would do wonders for such products.

As far as UI development goes, it’s often over-engineered with Angular, React, SPAs and whatever framework happens to be in vogue at the time. Keep it simple. CSS and simple interactions go a long way. I don’t think SPAs are warranted unless you’re building an application with complex interactions. You can go quite far with simple HTML templating, CSS and cross-platform JavaScript libraries. Most projects do not need complex Angular, React, Babel, WebPack and npm pipelines to function. Keep it simple, and be absolutely sure you need an SPA with its added complexity before buying into it.

Team Formation

It goes without saying that proper team formation is perhaps the hardest part. Good developers are hard to find and even harder to engage while keeping motivation high. There has to be the right balance between individual creative freedom and the discipline that a business dictates. That said, IMO, keeping a lean team of mostly engineers and perhaps one or two subject matter experts is the best way to move quickly. Do not go around staffing the team with fuzzy titles such as Product Manager, Program Manager, Agile Coach, Scrum Master and so on. Being agile is quite simple, actually, if you do not fall for unnecessary ceremonies. There is no need for long sprint plannings, sprint retros, story grooming sessions or point poker exercises. Work with your SME to maintain a small backlog of no more than 5 items and work off of it. A bug, any bug, has to be fixed right away. It doesn’t go to the backlog to wait for the next release. If your deployment pipeline is functioning as it should, bug fixes should be deployed as soon as they are ready and verified. Build a feature, get it verified by the SME and move to deployment. If you are truly adventurous, do not hesitate to look into trunk-based development, since it’ll force you to be nimble under all circumstances!

Last but not least, hire folks who have the grit and are bold enough to try new approaches on a regular basis. Once hired, give ’em elbow room to experiment. Tolerate failures, although that’s easier said than done. Do not be quick to hit the ‘fire’ button. Hire good folks and trust they will do a good job. Everyone wants to perform well; it’s when they feel micromanaged that people lose interest in the shared mission.


Written by Praveen Ray

System Design, Leading High Performance Teams
