Pinterest

Best Company-wide Implementation

2022
Providing a stable, scalable, efficient, and usable Apache Kafka platform to power business-critical functions at Pinterest, such as the home feed and recommendations, ads revenue, and search.
As much as 100% reduction in engineers' time spent per operation

PubSub: Apache Kafka®

Technologies Used:

  • Apache Kafka®
  • Apache Kafka MirrorMaker
  • Amazon Web Services

Short description of project/implementation:

Providing a stable, scalable, efficient, and usable Apache Kafka platform to power business-critical functions at Pinterest, such as the home feed and recommendations, ads revenue, and search.

What problem was the nominee organization looking to solve with Data Streaming Technology?

Some of the key problems addressed so far include:

  1. Density of Kafka clusters: how to maintain dense Kafka clusters that easily tolerate replays and backfills, broker replacements, traffic spikes, etc.?
  2. MirrorMaker instability: how to make MirrorMaker resilient to traffic spikes on compressed source topics?
  3. Error-prone, manual, repetitive maintenance operations such as topic movements, optimal topic partition placements, broker health checks, broker replacements, etc.: how to avoid human error and save engineers' time?
  4. Client library instability and usability: how to reduce the impact of bugs in the client library and simplify configuring it?
  5. Platform cost: how to run the platform efficiently and save on infrastructure costs?
  6. Pipeline visibility: how to make it easier for the platform team and its customers to see the lineage of pipelines?

How did they solve the problem?

  1. Kafka cluster density: One size does not fit all. The type of workload should determine the choice of hardware for each Kafka cluster, and workloads were categorized into separate clusters to allow for this optimal hardware selection. NVMe SSD-based instance types allowed Pinterest to host a large number of partitions per broker (over 300).
  2. MirrorMaker instability: To avoid recompressing already-compressed messages, KIP-712 was proposed and implemented internally, reducing resource usage and improving stability under load and backpressure during traffic spikes (a sketch of the conventional mirroring path this improves on follows this list).
  3. Error-prone repetitive maintenance operations and issue remediations: The steps and logic behind those tasks were implemented in Orion, which provides both reactive and proactive maintenance via automation and gives substantial time back to platform engineers to work on actual hard problems.
  4. Client library instability: The Pinterest PSC (PubSub Client) SDK was designed and implemented to address client library bugs and complex client configuration, freeing customers from having to solve cluster discovery, observability, and graceful error handling on their own (a conceptual sketch follows this list).
  5. Cost efficiency: Beyond optimizing the hardware each cluster runs on and increasing density, the cost of network transfer in and out of Kafka clusters was significantly reduced by applying an AZ-aware producer partitioning and consumer assignment strategy (see the configuration sketch after this list).
  6. Improved visibility: A lineage tool that illustrates the pipelines running through Kafka clusters gives the platform team and its customers a simplified view of each pipeline, its configuration, key metrics, etc.
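
To make the MirrorMaker item above concrete, the sketch below is a minimal, naive mirroring loop written against the public Apache Kafka clients API. It marks where the conventional path decompresses and recompresses batches, which is the overhead a KIP-712-style shallow mirror avoids; it illustrates the baseline only, not Pinterest's MirrorMaker setup, and all endpoints and topic names are placeholders.

  // Minimal, illustrative mirroring loop using the public Kafka clients API.
  // It shows the conventional path that KIP-712-style "shallow" mirroring avoids:
  // the consumer decompresses fetched batches and the producer recompresses them.
  import org.apache.kafka.clients.consumer.*;
  import org.apache.kafka.clients.producer.*;
  import org.apache.kafka.common.serialization.ByteArrayDeserializer;
  import org.apache.kafka.common.serialization.ByteArraySerializer;

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;

  public class NaiveMirror {
      public static void main(String[] args) {
          Properties cProps = new Properties();
          cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "source-kafka:9092");
          cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "naive-mirror");
          cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
          cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

          Properties pProps = new Properties();
          pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "destination-kafka:9092");
          pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
          pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
          // Recompression happens here: even if the source batch was already compressed,
          // the producer builds and compresses new batches, burning CPU during spikes.
          pProps.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");

          try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cProps);
               KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pProps)) {
              consumer.subscribe(List.of("source-topic"));
              while (true) {
                  // poll() hands back individual records, meaning compressed batches
                  // were already decompressed inside the consumer.
                  ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                  for (ConsumerRecord<byte[], byte[]> r : records) {
                      producer.send(new ProducerRecord<>("mirrored-topic", r.key(), r.value()));
                  }
              }
          }
      }
  }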

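The PSC item above can be illustrated with a thin wrapper that resolves a logical topic URI to a physical cluster and applies vetted defaults, which is the general idea behind a PubSub client SDK. Everything below (the pubsub:// URI scheme, PubSubConsumerFactory, the hard-coded endpoint map) is invented for this sketch and is not the actual PSC API.

  // Hypothetical illustration of a PSC-style SDK: applications address a topic by a
  // logical URI, and the SDK resolves the physical cluster, applies vetted defaults,
  // and standardizes client setup. All names here are placeholders for this sketch.
  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.serialization.ByteArrayDeserializer;

  import java.net.URI;
  import java.util.List;
  import java.util.Map;
  import java.util.Properties;

  public class PubSubConsumerFactory {
      // In a real deployment this lookup would come from a discovery service;
      // here it is a hard-coded placeholder map.
      private static final Map<String, String> CLUSTER_ENDPOINTS = Map.of(
              "ads-cluster", "ads-kafka.example.internal:9092",
              "search-cluster", "search-kafka.example.internal:9092");

      public static KafkaConsumer<byte[], byte[]> create(String logicalUri, String group) {
          // e.g. "pubsub://ads-cluster/spend-events" -> cluster "ads-cluster", topic "spend-events"
          URI uri = URI.create(logicalUri);
          String bootstrap = CLUSTER_ENDPOINTS.get(uri.getHost());

          Properties props = new Properties();
          props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
          props.put(ConsumerConfig.GROUP_ID_CONFIG, group);
          props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
          props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
          // Centrally vetted defaults replace per-team hand tuning.
          props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
          props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);

          KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
          consumer.subscribe(List.of(uri.getPath().substring(1)));
          return consumer;
      }
  }
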
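For the AZ-aware item above, one standard open-source building block is KIP-392 rack-aware replica selection, where brokers advertise their AZ and consumers fetch from an in-zone replica. The configuration sketch below shows that technique with placeholder endpoints; Pinterest's actual producer partitioning and consumer assignment strategy is only described at a high level here, so treat this as an assumed illustration rather than their exact mechanism.

  // One standard way to keep consumer fetch traffic inside an availability zone:
  // KIP-392 rack-aware replica selection. Brokers advertise their AZ via broker.rack
  // and enable the rack-aware selector; consumers set client.rack to their own AZ.
  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import java.util.Properties;

  public class AzAwareConsumerConfig {
      public static Properties build(String localAz) {
          // Broker side (server.properties), shown here as comments:
          //   broker.rack=us-east-1a
          //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
          Properties props = new Properties();
          props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.example.internal:9092");
          props.put(ConsumerConfig.GROUP_ID_CONFIG, "az-aware-example");
          // Matching client.rack to the consumer's AZ lets the broker serve fetches
          // from a replica in the same zone, avoiding cross-AZ transfer charges.
          props.put(ConsumerConfig.CLIENT_RACK_CONFIG, localAz);  // e.g. "us-east-1a"
          return props;
      }
  }
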
Positive outcome

Increased platform stability, scalability, efficiency, and usability:

  1. Kafka cluster stability: 70% drop in the number of Kafka-related incidents
  2. MirrorMaker stability: 75% reduction in CPU usage on MirrorMaker nodes
  3. PubSub client stability: Significant reduction in engineers' time spent on client configuration, optimization, troubleshooting, etc.; e.g., roughly one FTE-month with automatic error resolution.
  4. Automation of repetitive operations: As much as 100% reduction in engineers' time spent per operation
  5. Cost efficiency: 300% growth in data throughput with a 50% reduction in cost.


Streaming: Apache Flink®

Technologies Used:

  • Apache Flink®
  • Apache Hive®
  • Apache Kafka®
  • Amazon Web Services

Short description of project/implementation:

Xenon provides a stable and efficient stream processing platform for deploying, monitoring and maintaining Apache Flink-based applications.

What problem was the nominee organization looking to solve with Data Streaming Technology?

Since Apache Flink was chosen as the stream processing engine at Pinterest, the Stream Processing Platform (SPP) team has had to offer its customers a platform for:

  • stably running their streaming jobs
  • quickly productionizing new use cases and features
  • efficiently using infra resources and implementing best practices
  • automating key operations that originally required engineering time from the platform team and the customers

How did they solve the problem?

The SPP team

  • increased platform stability by implementing a self-service job deployment and job management framework
  • significantly improved developer velocity by providing application developers with a CI/CD framework, a job debugger, a job development framework, and an application certification program (a minimal example of the kind of job deployed on the platform follows this list)
  • reduced infra costs through application auto-scaling, job health dashboards and reports, load balancing, and resource usage optimization
  • provided native Thrift format support, an NRTG framework, and a SQL abstraction, federating source/sink system metadata into a unified catalog built on top of the Hive Metastore (see the catalog sketch after this list)
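
As referenced above, the sketch below shows a minimal example of the kind of Flink job that runs on such a platform: consume a Kafka topic, apply a trivial transformation, and print the result. The topic, cluster, and group names are placeholders, and a real job on the platform would go through the team's deployment framework, serializers, and sinks.

  // Minimal Flink DataStream job: read from Kafka, transform, print.
  import org.apache.flink.api.common.eventtime.WatermarkStrategy;
  import org.apache.flink.api.common.serialization.SimpleStringSchema;
  import org.apache.flink.connector.kafka.source.KafkaSource;
  import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class ExampleStreamingJob {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

          KafkaSource<String> source = KafkaSource.<String>builder()
                  .setBootstrapServers("kafka.example.internal:9092")
                  .setTopics("events")
                  .setGroupId("example-flink-job")
                  .setStartingOffsets(OffsetsInitializer.latest())
                  .setValueOnlyDeserializer(new SimpleStringSchema())
                  .build();

          env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events")
                  .map(String::toUpperCase)   // placeholder transformation
                  .print();

          env.execute("example-streaming-job");
      }
  }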

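The unified-catalog bullet above can be sketched with Flink's Table API and a HiveCatalog: once source/sink metadata is registered in a catalog backed by the Hive Metastore, jobs can reference tables by name instead of repeating connector configuration. The catalog name, database, Hive config path, and the events table below are placeholders, not Pinterest's actual catalog layout.

  // Sketch of exposing a Hive Metastore-backed catalog to Flink SQL users.
  import org.apache.flink.table.api.EnvironmentSettings;
  import org.apache.flink.table.api.TableEnvironment;
  import org.apache.flink.table.catalog.hive.HiveCatalog;

  public class UnifiedCatalogExample {
      public static void main(String[] args) {
          TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

          // Point Flink at the shared Hive Metastore so source/sink metadata is discoverable.
          HiveCatalog catalog = new HiveCatalog("unified", "default", "/etc/hive/conf");
          tEnv.registerCatalog("unified", catalog);
          tEnv.useCatalog("unified");

          // Tables registered in the catalog (e.g. Kafka-backed sources) can now be
          // queried with plain SQL instead of per-job connector configuration.
          tEnv.executeSql("SELECT event_type, COUNT(*) FROM events GROUP BY event_type").print();
      }
  }
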
Positive outcome

Enabled more than 60 use cases from key initiatives, ranging from user growth, shopping, and user engagement to monetization. SPP has been efficiently and stably processing over 6 PB of data per day in near real time, at a scale of over 10M TPS, for more than two years.


Disclaimer

Apache®, Apache Kafka, Apache Flink, Kafka, and Flink are trademarks of the Apache Software Foundation.