Third in a series, this post covers our nearly complete re-write and re-design of our Image Template rendering system. We replaced our existing Angular, JavaScript, PHP, and ImageMagick-based service with one written in React, TypeScript, and rendered with Google Chrome. At the same time, we moved the infrastructure from bare metal servers to Kubernetes on bare metal.
Since its earliest days, rendering images based on ever-changing product catalog data has been our largest service by traffic volume at Smartly.io. Customers can create templates in our web-based editor, combine them with their product catalog data, and automatically push endless variations of images to their advertising campaigns.
The original codebase had aged over its 7+ years of existence. By taking inspiration from our existing services, load testing extensively, and harnessing Kubernetes, we were able to meet the new demands on the system.
One of the factors that led us to a re-write was the aging codebase. Our original Image Template service was scaling and working very well, requiring relatively modest sustaining work. However, we came to a point where supporting customers' increasingly complex needs started to make the whole system hard to maintain. Additionally, the re-write would allow us to align technologies with other teams, as the growing Smartly.io had standardized on a different tech stack than the one used in the original service.
When work started on the service in 2014, the frontend was written in JavaScript, Angular, and jQuery UI, and the backend in PHP - the standard stack used throughout the Smartly.io codebase at the time. These days, all of our new services use TypeScript and React on the frontend, and many backend services are written in TypeScript. In addition, we make extensive use of reusable frontend components between services.
Although our PHP backend had proved to be very scalable, ensuring that the PHP and ImageMagick-based backend produced images that matched what customers saw in the JavaScript-based frontend was a significant software maintenance challenge.
Over time, our customers' use cases grew more complex and demanding. For example, customers often wanted to customize a single template across multiple languages and geographic locations, which led to recurring feedback about our limited support for non-Latin characters and right-to-left text with custom fonts. Adding similar smaller features on top of the service made the code harder to maintain, and we had to find workarounds to support more sophisticated functions. With all these factors considered, we decided to embark on a re-write of the entire service.
In 2019, we created an entirely new Video Template service that does for videos what our Image Template service does for images - it is a browser-based video editor that automatically generates endless variations of the video based on product catalog data. In our re-write of Image Templates, we decided to build on top of our Video Templates codebase because it solved many of the issues that the original Image Templates service had.
Learn more about how we built Video Templates here:
The Video Template service uses TypeScript for both the frontend and the backend and, critically, uses the same rendering code for the editor and the server-side rendering. In addition, we use headless Chrome to render the videos, ensuring that the final rendered videos look the same as the previews in our editor.
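To illustrate why sharing one rendering path keeps previews and final output in sync, here is a minimal sketch. It is not the actual service code: the `Template`, `Layer`, and `renderToHtml` names are hypothetical, and plain HTML strings stand in for the React components that would be rendered in Chrome. The point is only that the editor and the server call the same pure function with the same inputs.

```typescript
// Hypothetical template model: names and fields are illustrative assumptions.
interface TextLayer {
  kind: "text";
  x: number;
  y: number;
  text: string; // may contain {{placeholders}} filled from catalog data
}

interface ImageLayer {
  kind: "image";
  x: number;
  y: number;
  srcField: string; // name of the product field holding the image URL
}

type Layer = TextLayer | ImageLayer;

interface Template {
  width: number;
  height: number;
  layers: Layer[];
}

type Product = Record<string, string>;

// Escape values so catalog data cannot inject markup.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Replace {{field}} placeholders with product values (empty string if missing).
function fillPlaceholders(text: string, product: Product): string {
  return text.replace(/\{\{(\w+)\}\}/g, (_match, field: string) => product[field] ?? "");
}

// The single rendering function that both an editor preview and a
// server-side renderer would call.
function renderToHtml(template: Template, product: Product): string {
  const layers = template.layers.map((layer) => {
    const style = `position:absolute;left:${layer.x}px;top:${layer.y}px`;
    if (layer.kind === "text") {
      return `<div style="${style}">${escapeHtml(fillPlaceholders(layer.text, product))}</div>`;
    }
    return `<img style="${style}" src="${escapeHtml(product[layer.srcField] ?? "")}">`;
  });
  const size = `width:${template.width}px;height:${template.height}px`;
  return `<div style="position:relative;${size}">${layers.join("")}</div>`;
}

// Usage: one template, one product row, one markup string for both sides.
const banner: Template = {
  width: 1080,
  height: 1080,
  layers: [
    { kind: "image", x: 0, y: 0, srcField: "image" },
    { kind: "text", x: 40, y: 960, text: "{{title}} from {{price}}" },
  ],
};
const markup = renderToHtml(banner, {
  title: "Red sneakers",
  price: "49 EUR",
  image: "https://example.com/sneakers.png",
});
```

Because the function has no editor-specific or server-specific branches, any new layer type added to it is available in both places at once.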
Using browser technologies like React and CSS makes it easier to implement new features than implementing a rendering system from scratch - we don't need to write low-level graphics code to draw text, shapes, and images on a screen. For example, adding a new capability to the Image Template editor will automatically work in our renderer because both use a browser. The only issues we've had with this approach have been related to Chrome on Linux behaving differently to Chrome on other platforms. However, we have been able to work around these issues thus far.
Once using Chrome for rendering had been validated for videos, it seemed obvious we should use it for images, too. In simple terms, we could take the video rendering stack and use it to render a single frame of video and voilà, Image Templates! As you'll see, this has proven to be a good approach, but it hasn't been without its challenges.
Taking our video rendering system and replacing our Image Templates backend was simple conceptually, but some key differences between the two systems required special attention.
Replacing an existing production-hardened and highly scalable service with a new system based on Google Chrome and Node.js wasn't straightforward. We were confident that we could make it work, but we knew there would be hard-to-predict problems when the service was running at scale. So we decided to manage the risks in two ways:
While Alpha testing is standard practice at Smartly.io, load testing is less common. We started by creating our own test data set to test the service with a realistic load. We took a random sample of existing customer templates and converted them into our new Image Template format. We knew these real templates might not use all the new system's features, but they would still have multiple images and fonts. It would be sufficient to break the system - and break it we did!
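Drawing a random sample of templates for a load test can be sketched as follows. This is an illustration under assumptions, not the tooling described above: the function names are hypothetical, and the seeded generator (the well-known mulberry32 routine) is used only so that a sample can be reproduced between test runs.

```typescript
// Small deterministic PRNG (mulberry32) so a load-test sample is reproducible.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Pick `count` distinct items using a partial Fisher-Yates shuffle.
// The input array is copied, so the caller's data is left untouched.
function sampleWithoutReplacement<T>(
  items: readonly T[],
  count: number,
  random: () => number,
): T[] {
  const pool = items.slice();
  const n = Math.max(0, Math.min(count, pool.length));
  for (let i = 0; i < n; i++) {
    // Choose a random index from the not-yet-selected tail and swap it forward.
    const j = i + Math.floor(random() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, n);
}

// Usage: hypothetical template IDs; a real run would load them from storage.
const allTemplateIds: string[] = [];
for (let i = 0; i < 1000; i++) {
  allTemplateIds.push(`template-${i}`);
}
const loadTestSample = sampleWithoutReplacement(allTemplateIds, 50, mulberry32(2021));
```

A fixed seed makes it possible to rerun the same load test after an optimization and compare results on identical inputs.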
Pushing the system to its limit with dummy data was useful for generating and prioritizing our product backlog. Thanks to the tests, we knew exactly where we needed to optimize next. Sometimes we hit limits with network bandwidth between servers, and other times, we overwhelmed external systems with requests. Sometimes we solved these bottlenecks by adding caches. Other times we re-designed the service to increase its performance. We had a roadmap of catch-up features we knew we'd need based on operating the old Image Templates system, but having a good load test meant we could delay implementing each one until it would actually improve performance. In addition, delaying some of the work freed up some of our time to react to feedback from the Alpha customers, which allowed us to build an even better product.
Our load testing didn't come without mistakes, and we learned valuable lessons. Most critically, we failed to include broken Image Templates in the load test. When we moved from the Alpha to the Beta phase, mistakes in real Image Templates produced more errors than the artificial templates we had used in our testing. Unfortunately, the system we had built to collect, collate, and store errors couldn't cope with the load, and for a brief moment, we were only able to display a fraction of errors to some customers. Luckily, this happened during the Beta phase, when our customers were willing to test our systems in exchange for early access and a chance to influence product development.
At Smartly.io, almost all services run on Kubernetes. It's proven to be a robust and powerful platform to deploy and operate our 50+ services across hundreds of servers. One exception has been our old Image Template system, as it predates our use of Kubernetes and uses considerably more servers than the rest of our services combined. The scale and volume of traffic were something we'd not tried to handle with Kubernetes before. When we chose Kubernetes for the new Image Template service, we knew it was an ambitious decision that would need extra engineering work to achieve the required scale.
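As an illustration of the "add a cache" kind of fix, here is a minimal in-memory least-recently-used cache. The eviction policy, the class name, and the example keys are all assumptions made for this sketch; the post does not say which caching strategy or technology was actually used.

```typescript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
// Re-inserting a key on every read keeps the least recently used key first.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();

  constructor(private readonly capacity: number) {
    if (capacity < 1) {
      throw new Error("capacity must be at least 1");
    }
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) {
      return undefined;
    }
    const value = this.entries.get(key) as V;
    // Move the key to the "most recently used" end.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // Evict the least recently used entry (the first key in the Map).
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage: cache hypothetical font metadata keyed by URL to avoid refetching it.
const fontCache = new LruCache<string, { family: string }>(2);
fontCache.set("https://example.com/fonts/a.woff2", { family: "A" });
fontCache.set("https://example.com/fonts/b.woff2", { family: "B" });
fontCache.get("https://example.com/fonts/a.woff2"); // marks A as recently used
fontCache.set("https://example.com/fonts/c.woff2", { family: "C" }); // evicts B
```

A bounded cache like this trades memory for fewer requests to an external system, which is the kind of bottleneck a load test makes visible.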
There are many advantages of Kubernetes, but for us specifically, its value comes from:
It turned out to be a good decision to run Image Templates on Kubernetes, but we had to work quite a bit to meet our performance and scalability objectives. We especially wanted to have our entire service configuration as code, not just the JavaScript parts. Having the whole stack, including all the load balancers, caches, message queues, etc., deployed from our CI system made it much faster to make significant changes.
For example, we began with our rendering, caching and metadata services distributed across the cluster but later realized that networking bottlenecks limited scalability. So, we decided to switch to running our entire stack on each node and scale horizontally. We were able to make this big architectural change incrementally in production just by adjusting the Kubernetes manifests.
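One way to express "one copy of the whole stack on every node" in Kubernetes is a DaemonSet, which schedules a pod on each node. The sketch below builds such a manifest as a plain object. It is an assumption-heavy illustration: the post does not say which resource type was actually used, and the `buildPerNodeStack` helper, the component names, and the image references are all hypothetical.

```typescript
// Hypothetical description of one component of the stack.
interface StackComponent {
  name: string;
  image: string;
}

// Minimal subset of the Kubernetes DaemonSet schema used in this sketch.
interface DaemonSetManifest {
  apiVersion: "apps/v1";
  kind: "DaemonSet";
  metadata: { name: string; labels: Record<string, string> };
  spec: {
    selector: { matchLabels: Record<string, string> };
    template: {
      metadata: { labels: Record<string, string> };
      spec: { containers: StackComponent[] };
    };
  };
}

// Build a manifest that runs every component side by side in one pod per node,
// so traffic between them stays on the node instead of crossing the network.
function buildPerNodeStack(name: string, components: StackComponent[]): DaemonSetManifest {
  const labels = { app: name };
  return {
    apiVersion: "apps/v1",
    kind: "DaemonSet",
    metadata: { name, labels },
    spec: {
      // The selector must match the pod template labels.
      selector: { matchLabels: labels },
      template: {
        metadata: { labels },
        spec: {
          containers: components.map((c) => ({ name: c.name, image: c.image })),
        },
      },
    },
  };
}

// Usage: placeholder names and images; kubectl accepts JSON as well as YAML.
const manifest = buildPerNodeStack("image-templates", [
  { name: "renderer", image: "registry.example.com/renderer:1.0.0" },
  { name: "cache", image: "registry.example.com/cache:1.0.0" },
  { name: "metadata", image: "registry.example.com/metadata:1.0.0" },
]);
const manifestJson = JSON.stringify(manifest, null, 2);
```

Generating manifests from code like this is one way to keep an architectural change down to a reviewed diff that the CI system can roll out.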
We knew from the start our Image Template service would put extreme demands on Kubernetes due to the scale and ongoing growth of our business. We also knew that our existing self-hosted bare metal Kubernetes clusters would need extra work to scale and tune them for this application.
We would, in effect, experiment in production with different service architectures, ingress controllers, and tuning parameters. For this reason, we decided to use dedicated Kubernetes clusters, at least until we better understood our service design and requirements. We also decided to use at least two dedicated clusters for operational and implementation flexibility. Having two clusters means we can experiment with new Kubernetes releases, new Kubernetes features like topology-aware routing, and different ingress controllers.
We're still scaling up our new Image Template platform and slowly migrating customers to the new system. At the same time, usage of the old system continues to grow! Busy times ahead!
We've got a packed roadmap of improvements planned to handle the explosive growth from new and existing customers. A few examples:
Now we need to get back to work to prepare for our busiest season of the year. Wish us luck!
We are also hiring new engineers to build even more exceptional services. If you are interested, we'd love to hear from you!