Pinterest

Best Company-wide Implementation

2022
Providing a stable, scalable, efficient, and usable Apache Kafka platform to power business-critical functions at Pinterest, such as the home feed and recommendations, ads revenue, and search.
As much as 100% reduction in engineers' time spent per operation

PubSub: Apache Kafka®

Technologies Used:

  • Apache Kafka®
  • Apache Kafka MirrorMaker
  • Amazon Web Services

Short description of project/implementation:

Providing a stable, scalable, efficient, and usable Apache Kafka platform to power business-critical functions at Pinterest, such as the home feed and recommendations, ads revenue, and search.

What problem was the nominee organization looking to solve with Data Streaming Technology?

Some of the key problems addressed so far include:

  1. Density of Kafka clusters: how to maintain dense Kafka clusters that easily tolerate replays and backfills, broker replacements, traffic spikes, etc.?
  2. MirrorMaker instability: how to make MirrorMaker resilient to traffic spikes on compressed source topics?
  3. Error-prone, manual, repetitive maintenance operations such as topic movements, optimal topic partition placements, broker health checks, broker replacements, etc.: how to avoid human error and save engineers' time?
  4. Client library instability and usability: how to reduce the impact of bugs in the client library and simplify configuring it?
  5. Platform cost: how to run the platform efficiently and save on infrastructure costs?
  6. Pipeline visibility: how to make it easier for the platform team and its customers to see the lineage of pipelines?

How did they solve the problem?

  1. Kafka cluster density: One size does not fit all. The type of workload should determine the choice of hardware for each Kafka cluster, and workloads were categorized into separate clusters to allow for this optimal hardware selection. NVMe SSD-based instance types allowed Pinterest to host a large number of partitions per broker (over 300).
  2. MirrorMaker instability: To avoid recompressing already-compressed messages, KIP-712 was proposed and implemented internally, reducing resource usage and improving stability under load and backpressure during traffic spikes (a sketch of the conventional mirroring path this improves on follows this list).
  3. Error-prone repetitive maintenance operations and issue remediations: The steps and logic behind those tasks were implemented in Orion, which provides both reactive and proactive maintenance via automation and gives substantial time back to platform engineers to work on actual hard problems.
  4. Client library instability: The Pinterest PSC (PubSub Client) SDK was designed and implemented to address client library bugs and complex client configuration, freeing customers from having to solve cluster discovery, observability, and graceful error handling on their own (a conceptual sketch follows this list).
  5. Cost efficiency: Beyond optimizing the hardware each cluster runs on and increasing density, the cost of network transfer in and out of Kafka clusters was significantly reduced by applying an AZ-aware producer partitioning and consumer assignment strategy (see the configuration sketch after this list).
  6. Improved visibility: A lineage tool that illustrates the pipelines running through Kafka clusters gives the platform team and its customers a simplified view of each pipeline, its configuration, key metrics, etc.
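
To make the MirrorMaker item above concrete, the sketch below is a minimal, naive mirroring loop written against the public Apache Kafka clients API. It marks where the conventional path decompresses and recompresses batches, which is the overhead a KIP-712-style shallow mirror avoids; it illustrates the baseline only, not Pinterest's MirrorMaker setup, and all endpoints and topic names are placeholders.

  // Minimal, illustrative mirroring loop using the public Kafka clients API.
  // It shows the conventional path that KIP-712-style "shallow" mirroring avoids:
  // the consumer decompresses fetched batches and the producer recompresses them.
  import org.apache.kafka.clients.consumer.*;
  import org.apache.kafka.clients.producer.*;
  import org.apache.kafka.common.serialization.ByteArrayDeserializer;
  import org.apache.kafka.common.serialization.ByteArraySerializer;

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;

  public class NaiveMirror {
      public static void main(String[] args) {
          Properties cProps = new Properties();
          cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "source-kafka:9092");
          cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "naive-mirror");
          cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
          cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

          Properties pProps = new Properties();
          pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "destination-kafka:9092");
          pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
          pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
          // Recompression happens here: even if the source batch was already compressed,
          // the producer builds and compresses new batches, burning CPU during spikes.
          pProps.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");

          try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cProps);
               KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pProps)) {
              consumer.subscribe(List.of("source-topic"));
              while (true) {
                  // poll() hands back individual records, meaning compressed batches
                  // were already decompressed inside the consumer.
                  ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                  for (ConsumerRecord<byte[], byte[]> r : records) {
                      producer.send(new ProducerRecord<>("mirrored-topic", r.key(), r.value()));
                  }
              }
          }
      }
  }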

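The PSC item above can be illustrated with a thin wrapper that resolves a logical topic URI to a physical cluster and applies vetted defaults, which is the general idea behind a PubSub client SDK. Everything below (the pubsub:// URI scheme, PubSubConsumerFactory, the hard-coded endpoint map) is invented for this sketch and is not the actual PSC API.

  // Hypothetical illustration of a PSC-style SDK: applications address a topic by a
  // logical URI, and the SDK resolves the physical cluster, applies vetted defaults,
  // and standardizes client setup. All names here are placeholders for this sketch.
  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.serialization.ByteArrayDeserializer;

  import java.net.URI;
  import java.util.List;
  import java.util.Map;
  import java.util.Properties;

  public class PubSubConsumerFactory {
      // In a real deployment this lookup would come from a discovery service;
      // here it is a hard-coded placeholder map.
      private static final Map<String, String> CLUSTER_ENDPOINTS = Map.of(
              "ads-cluster", "ads-kafka.example.internal:9092",
              "search-cluster", "search-kafka.example.internal:9092");

      public static KafkaConsumer<byte[], byte[]> create(String logicalUri, String group) {
          // e.g. "pubsub://ads-cluster/spend-events" -> cluster "ads-cluster", topic "spend-events"
          URI uri = URI.create(logicalUri);
          String bootstrap = CLUSTER_ENDPOINTS.get(uri.getHost());

          Properties props = new Properties();
          props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
          props.put(ConsumerConfig.GROUP_ID_CONFIG, group);
          props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
          props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
          // Centrally vetted defaults replace per-team hand tuning.
          props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
          props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);

          KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
          consumer.subscribe(List.of(uri.getPath().substring(1)));
          return consumer;
      }
  }
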
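For the AZ-aware item above, one standard open-source building block is KIP-392 rack-aware replica selection, where brokers advertise their AZ and consumers fetch from an in-zone replica. The configuration sketch below shows that technique with placeholder endpoints; Pinterest's actual producer partitioning and consumer assignment strategy is only described at a high level here, so treat this as an assumed illustration rather than their exact mechanism.

  // One standard way to keep consumer fetch traffic inside an availability zone:
  // KIP-392 rack-aware replica selection. Brokers advertise their AZ via broker.rack
  // and enable the rack-aware selector; consumers set client.rack to their own AZ.
  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import java.util.Properties;

  public class AzAwareConsumerConfig {
      public static Properties build(String localAz) {
          // Broker side (server.properties), shown here as comments:
          //   broker.rack=us-east-1a
          //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
          Properties props = new Properties();
          props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.example.internal:9092");
          props.put(ConsumerConfig.GROUP_ID_CONFIG, "az-aware-example");
          // Matching client.rack to the consumer's AZ lets the broker serve fetches
          // from a replica in the same zone, avoiding cross-AZ transfer charges.
          props.put(ConsumerConfig.CLIENT_RACK_CONFIG, localAz);  // e.g. "us-east-1a"
          return props;
      }
  }
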
Positive outcome

Increased platform stability, scalability, efficiency, and usability:

  1. Kafka cluster stability: 70% drop in the number of Kafka-related incidents
  2. MirrorMaker stability: 75% reduction in CPU usage on MirrorMaker nodes
  3. PubSub client stability: Significant reduction in engineers' time spent on client configuration, optimization, troubleshooting, etc.; e.g., roughly one FTE-month with automatic error resolution.
  4. Automation of repetitive operations: As much as 100% reduction in engineers' time spent per operation
  5. Cost efficiency: 300% growth in data throughput with a 50% reduction in cost.


Streaming: Apache Flink®

Technologies Used:

  • Apache Flink®
  • Apache Hive®
  • Apache Kafka®
  • Amazon Web Services

Short description of project/implementation:

Xenon provides a stable and efficient stream processing platform for deploying, monitoring and maintaining Apache Flink-based applications.

What problem was the nominee organization looking to solve with Data Streaming Technology?

Since Apache Flink was chosen as the stream processing engine at Pinterest, the Stream Processing Platform (SPP) team has had to offer its customers a platform for:

  • stably running their streaming jobs
  • quickly productionizing new use cases and features
  • efficiently using infra resources and implementing best practices
  • automating key operations that originally required engineering time from the platform team and the customers

How did they solve the problem?

The SPP team

  • increased platform stability by implementing a self-service job deployment and job management framework
  • significantly improved developer velocity by providing application developers with a CI/CD framework, a job debugger, a job development framework, and an application certification program (a minimal example of the kind of job deployed on the platform follows this list)
  • reduced infra costs through application auto-scaling, job health dashboards and reports, load balancing, and resource usage optimization
  • provided native Thrift format support, an NRTG framework, and a SQL abstraction, federating source/sink system metadata into a unified catalog built on top of the Hive Metastore (see the catalog sketch after this list)
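
As referenced above, the sketch below shows a minimal example of the kind of Flink job that runs on such a platform: consume a Kafka topic, apply a trivial transformation, and print the result. The topic, cluster, and group names are placeholders, and a real job on the platform would go through the team's deployment framework, serializers, and sinks.

  // Minimal Flink DataStream job: read from Kafka, transform, print.
  import org.apache.flink.api.common.eventtime.WatermarkStrategy;
  import org.apache.flink.api.common.serialization.SimpleStringSchema;
  import org.apache.flink.connector.kafka.source.KafkaSource;
  import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class ExampleStreamingJob {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

          KafkaSource<String> source = KafkaSource.<String>builder()
                  .setBootstrapServers("kafka.example.internal:9092")
                  .setTopics("events")
                  .setGroupId("example-flink-job")
                  .setStartingOffsets(OffsetsInitializer.latest())
                  .setValueOnlyDeserializer(new SimpleStringSchema())
                  .build();

          env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events")
                  .map(String::toUpperCase)   // placeholder transformation
                  .print();

          env.execute("example-streaming-job");
      }
  }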

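The unified-catalog bullet above can be sketched with Flink's Table API and a HiveCatalog: once source/sink metadata is registered in a catalog backed by the Hive Metastore, jobs can reference tables by name instead of repeating connector configuration. The catalog name, database, Hive config path, and the events table below are placeholders, not Pinterest's actual catalog layout.

  // Sketch of exposing a Hive Metastore-backed catalog to Flink SQL users.
  import org.apache.flink.table.api.EnvironmentSettings;
  import org.apache.flink.table.api.TableEnvironment;
  import org.apache.flink.table.catalog.hive.HiveCatalog;

  public class UnifiedCatalogExample {
      public static void main(String[] args) {
          TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

          // Point Flink at the shared Hive Metastore so source/sink metadata is discoverable.
          HiveCatalog catalog = new HiveCatalog("unified", "default", "/etc/hive/conf");
          tEnv.registerCatalog("unified", catalog);
          tEnv.useCatalog("unified");

          // Tables registered in the catalog (e.g. Kafka-backed sources) can now be
          // queried with plain SQL instead of per-job connector configuration.
          tEnv.executeSql("SELECT event_type, COUNT(*) FROM events GROUP BY event_type").print();
      }
  }
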
Positive outcome

Enabled more than 60 use cases from key initiatives, ranging from user growth, shopping, and user engagement to monetization. SPP has been efficiently and stably processing over 6 PB of data per day in near real time, at a scale of over 10M TPS, for more than two years.


Disclaimer

Apache®, Apache Kafka, Apache Flink, Kafka, and Flink are trademarks of the Apache Software Foundation.