Friday, May 3, 2024

Using the cloud to scale Etsy



Etsy, an online marketplace for unique, handmade, and vintage items, has
seen high growth over the last five years. Then the pandemic dramatically
changed shoppers’ habits, leading to more consumers shopping online. As a
result, the Etsy marketplace grew from 45.7 million buyers at the end of
2019 to 90.1 million buyers at the end of 2021 (a 97% increase), and from
2.5 million to 5.3 million sellers (a 112% increase) in the same period.

The growth massively increased demand on the technical platform, scaling
traffic almost 3X overnight. And Etsy had significantly more customers for
whom it needed to continue delivering great experiences. To keep up with
that demand, they had to scale up infrastructure, product delivery, and
talent drastically. While the growth challenged teams, the business was never
bottlenecked. Etsy’s teams were able to deliver new and improved
functionality, and the marketplace continued to provide an excellent customer
experience. This article and the next form the story of Etsy’s scaling strategy.

Etsy’s foundational scaling work had started long before the pandemic. In
2017, Mike Fisher joined as CTO. Josh Silverman had recently joined as Etsy’s
CEO, and was establishing institutional discipline to usher in a period of
growth. Mike has a background in scaling high-growth companies, and along
with Martin Abbott wrote several books on the topic, including The Art of Scalability
and Scalability Rules.

Etsy relied on physical hardware in two data centers, presenting several
scaling challenges. With their expected growth, it was apparent that the
costs would ramp up quickly. Owning hardware also hurt product teams’
agility, as they had to plan far in advance for capacity. In addition, the
data centers were
based in one state, which represented an availability risk. It was clear
they needed to move onto the cloud quickly. After an assessment, Mike and
his team chose the Google Cloud Platform (GCP) as the cloud partner and
started to plan a program to move their many systems onto the cloud.

While the cloud migration was happening, Etsy was growing its business and
its team. Mike identified the product delivery process as being another
potential scaling bottleneck. The autonomy afforded to product teams had
caused an issue: each team was delivering in different ways. Joining a team
meant learning a new set of practices, which was problematic as Etsy was
hiring many new people. In addition, they had noticed several product
initiatives that did not pay off as expected. These indicators led leadership
to re-evaluate the effectiveness of their product planning and delivery
processes.

Strategic Principles

Mike Fisher (CTO) and Keyur Govande (Chief Architect) created the
initial cloud migration strategy with these principles:

Minimum viable product – A typical anti-pattern Etsy wanted to avoid
was rebuilding too much and prolonging the migration. Instead, they used
the lean concept of an MVP to validate as quickly and cheaply as possible
that Etsy’s systems would work in the cloud, and removed the dependency on
the data center.

Local decision making – Each team can make its own decisions for what
it owns, with oversight from a program team. Etsy’s platform was split
into a number of capabilities, such as compute, observability and ML
infra, along with domain-oriented application stacks such as search, bid
engine, and notifications. Each team ran proofs of concept to develop a
migration plan. The main marketplace application is a famously large
monolith, so it required creating a cross-team initiative to focus on it.

No changes to the developer experience – Etsy views a high-quality
developer experience as core to productivity and employee happiness. It
was important that the cloud-based systems continued to provide
capabilities that developers relied upon, such as fast feedback and
sophisticated observability.

There also was a deadline associated with existing contracts for the
data center that they were very keen to hit.

Using a partner

To accelerate their cloud migration, Etsy wanted to bring on outside
expertise to help in the adoption of new tooling and technology, such as
Terraform, Kubernetes, and Prometheus. Unlike a lot of Thoughtworks’
typical clients, Etsy didn’t have a burning platform driving their
fundamental need for the engagement. They are a digital native company
and had been using a thoroughly modern approach to software development.
Even without a single problem to focus on though, Etsy knew there was
room for improvement. So the engagement approach was to embed across the
platform organization. Thoughtworks infrastructure engineers and
technical product managers joined search infrastructure, continuous
deployment services, compute, observability and machine learning
infrastructure teams.

An incremental federated approach

The initial “lift & shift” to the cloud for the marketplace monolith was
the most difficult.
The team wanted to keep the monolith intact with minimal changes.
However, it used a LAMP stack and so would be difficult to re-platform.
They did a number of dry runs testing performance and capacity. Though
the first cut-over was unsuccessful, they were able to quickly roll
back. In typical Etsy style, the failure was celebrated and used as a
learning opportunity. The migration was eventually completed in 9 months,
less time than the full year originally planned. After the initial
migration, the monolith was tweaked and tuned to fit the cloud better,
adding features like autoscaling and auto-fixing bad nodes.

Meanwhile, other stacks were also being migrated. While each team
created its own journey, the teams were not completely on their own.
Etsy used a cross-team architecture advisory group to share broader
context, and to help pattern match across the company. For example, the
search stack moved onto GKE as part of the cloud, which took longer than
the lift and shift operation for the monolith. Another example is the
data lake migration. Etsy had an on-prem Vertica cluster, which they
moved to BigQuery, changing everything about it in the process.

Unsurprisingly to Etsy, optimization for the cloud didn’t stop once the
migration was done. Each team continued to look for opportunities
to utilize the cloud to its full extent. With the help of the
architecture advisory group, they looked at things such as: how to
reduce the amount of custom code by moving to industry-standard tools,
how to improve cost efficiency and how to improve feedback loops.

Figure 1: Federated cloud migration

As an example, let’s look at the journey of two teams, observability
and ML infra:

The challenges of observing everything

Etsy is famous for measuring everything, “If it moves, we track it.”
Operational metrics – traces, metrics and logs – are used by the full
company to create value. Product managers and data analysts leverage the
data for planning and proving the predicted value of an idea. Product
teams use it to support the uptime and performance of their individual
areas of responsibility.

With Etsy’s commitment to hyper-observability, the amount of data
being analyzed isn’t small. Observability is self-service; each team
gets to decide what it wants to measure. They use 80M metric series,
covering the site and supporting infrastructure. On top of that, the
platform generates 20 TB of logs a day.

When Etsy originally developed this strategy there weren’t a lot of
tools and services on the market that could handle their demanding
requirements. In many cases, they ended up having to build their own
tools. An example is StatsD, a stats aggregation tool, now open-sourced
and used throughout the industry. Over time the DevOps movement
exploded, and the industry caught up. A lot of innovative
observability tools such as Prometheus appeared. With the cloud
migration, Etsy could assess the market and leverage third-party tools
to reduce operational cost.
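
To give a feel for what a StatsD-style metric looks like on the wire, here is a minimal sketch in Python. It is not Etsy's code: the metric names are hypothetical, and the helper functions are ours. It relies only on the StatsD plain-text line format (`name:value|type`) sent over UDP, with 8125 as the conventional default port.

```python
import socket


def format_metric(name: str, value, metric_type: str) -> str:
    """Build one StatsD line, e.g. 'checkout.completed:1|c'.

    metric_type is a StatsD type code such as 'c' (counter),
    'ms' (timer) or 'g' (gauge).
    """
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric line to a StatsD daemon over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, chosen only for illustration.
counter_line = format_metric("checkout.completed", 1, "c")
timer_line = format_metric("search.latency", 320, "ms")
```

Calling `send_metric(counter_line)` would emit the counter to a daemon listening on the local machine. Because the transport is UDP, the application never blocks waiting for the metrics backend, which is part of what makes it cheap to track everything.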

The observability stack was the last to move over due to its complex
nature. It required a rebuild, rather than a lift and shift. They had
relied on large servers, whereas to efficiently use the cloud it should
use many smaller servers and easily scale horizontally. They moved large
parts of the stack onto managed services and third party SaaS products.
An example of this was introducing Lightstep, which they could use to
outsource the tracing processing. It was still necessary to do some
amount of processing in-house to handle the unique scenarios that Etsy
relies on.

Migration to the cloud enabled a better ML platform

A big source of innovation at Etsy is the way they utilize machine
learning.

“Etsy leverages machine learning (ML) to create personalized experiences
for our millions of buyers around the world with state-of-the-art search,
ads, and recommendations. The ML Platform team at Etsy supports our machine
learning experiments by developing and maintaining the technical
infrastructure that Etsy’s ML practitioners rely on to prototype, train,
and deploy ML models at scale.”

– Kyle Gallatin and Rob Miles

The move to the cloud enabled Etsy to build a new ML platform based
on managed services that both reduces operational costs and improves the
time from idea generation to production deployment.

Because their resources were in the cloud, they could now rely on
cloud capabilities. They used Dataflow for ETL and Vertex AI for
training their models. As they saw success with these tools, they made
sure to design the platform so that it was extensible to other tools. To
make it widely accessible they adopted industry-standard tools such as
TensorFlow and Kubernetes. Etsy’s productivity in developing and testing
ML leapfrogged their prior performance. As Rob and Kyle put it, “We’re
estimating a ~50% reduction in the time it takes to go from idea to live
ML experiment.”

This performance growth wasn’t without its challenges, however. As the
scale of data grew, so too did the importance of high-performing code.
With low-performing code, the customer experience could be impacted, and
so the team had to produce a system which was highly optimized.
“Seemingly small inefficiencies such as non-vectorized code can result
in a massive performance degradation, and in some cases we’ve seen that
optimizing a single TensorFlow transform function can reduce the model
runtime from 200ms to 4ms.” In numeric terms, that’s a 50x improvement,
but in business terms, this is a change in performance easily perceived
by the customer.
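
To illustrate what “non-vectorized” means here, the sketch below contrasts an element-by-element loop with a single whole-array operation. It is only an illustration: it uses NumPy as a stand-in rather than TensorFlow, the scaling function is a hypothetical example of a feature transform, and it is not Etsy's code.

```python
import numpy as np


def scale_loop(values, mean, std):
    """Standardize values one element at a time in a Python loop."""
    out = np.empty(len(values), dtype=np.float64)
    for i in range(len(values)):
        out[i] = (values[i] - mean) / std
    return out


def scale_vectorized(values, mean, std):
    """Same arithmetic, expressed as one operation over the whole array."""
    return (values - mean) / std


# Hypothetical feature column, chosen only for illustration.
values = np.arange(100_000, dtype=np.float64)
mean, std = values.mean(), values.std()

loop_result = scale_loop(values, mean, std)
vec_result = scale_vectorized(values, mean, std)
```

Both functions return the same numbers. The difference is where the work happens: the loop pays interpreter overhead for every element, while the vectorized form hands the whole array to optimized native code in one call. Timing the two with `timeit` on your own hardware shows the size of the gap.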

We’re releasing this article in installments. The last installment will
include how Etsy handled the stresses of the pandemic, and its work on
measuring cost and carbon consumption.

To find out when we publish the next installment, subscribe to the
site’s RSS feed, or Martin’s Twitter stream or Mastodon feed.





