Canva, the popular graphic design platform, has built a product analytics pipeline that processes 25 billion events per day, roughly 750 billion events per month. This system plays a crucial role in driving data-informed decisions and powering various user-facing features. Let’s delve into Canva’s analytics infrastructure and explore how they manage this massive data stream while maintaining 99.999% uptime.
The Foundation: Structured Event Schema
At the core of Canva’s analytics success lies a fundamental decision: every collected event must adhere to a machine-readable, well-documented schema. This ensures data consistency and makes processing and analysis easier.
Protobuf-based Schema Definition
Canva uses the Protobuf language to define analytics event schemas, building on the team’s existing familiarity with the format from microservice contracts. They’ve also imposed an additional rule: event schemas must maintain full transitive compatibility, meaning any schema version can deserialize any version of an event. This guarantees both forward and backward compatibility, as the sketch below illustrates.
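To make the compatibility rule concrete, here is a minimal TypeScript sketch of what full transitive compatibility implies for readers of an event; the event name and fields are hypothetical, not Canva’s actual schemas:

```typescript
// Hypothetical event: fields may only be added, and only as optional;
// nothing is ever removed or renamed. Any reader version can then
// handle any event version.

// v1 of a design_opened event.
interface DesignOpenedV1 {
  designId: string;
}

// v2 adds an optional field. Old readers ignore it (forward compatible);
// new readers tolerate its absence in old events (backward compatible).
interface DesignOpenedV2 {
  designId: string;
  templateId?: string; // added later, so it must be optional
}

function readDesignOpened(raw: unknown): DesignOpenedV2 {
  const event = raw as Partial<DesignOpenedV2>;
  if (typeof event.designId !== "string") {
    throw new Error("designId is required in every schema version");
  }
  // Unknown extra fields are ignored; missing optional fields stay undefined.
  return { designId: event.designId, templateId: event.templateId };
}
```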
Datumgen: Custom Code Generator
To enforce schema compatibility rules and streamline development, Canva created Datumgen, a custom code generator built on top of protoc. Beyond verifying compatibility, it generates several artifacts from each event definition (an illustrative sample of the TypeScript output follows this list):
- TypeScript definitions for frontend type checking
- Java definitions for backend event handling
- SQL definitions for Snowflake table schemas
- An Event Catalog for easy exploration of collected events
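The article doesn’t show Datumgen’s output, but a generated TypeScript definition could plausibly look like the following sketch; the event shape, function, and queue are illustrative:

```typescript
// Illustrative only: generated from a hypothetical design_opened.proto.
export interface DesignOpenedEvent {
  designId: string;
  templateId?: string;
}

// Hypothetical in-memory queue standing in for the real batching logic.
const analyticsQueue: Array<{ type: string; payload: unknown }> = [];

// Type-checked entry point: the compiler rejects events that don't match
// the generated schema, catching mistakes before they reach the pipeline.
export function trackDesignOpened(event: DesignOpenedEvent): void {
  analyticsQueue.push({ type: "design_opened", payload: event });
}
```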
The Collection Process
Canva’s event collection pipeline is designed for efficiency and reliability:
- Client-side Collection: Events are initially captured by Canva’s application, using a shared core library written in TypeScript for consistency across platforms.
- Server-side Validation: Events are sent to a server endpoint, which validates each event against the predefined schemas before forwarding it to an ingest-worker via a Kinesis Data Stream (KDS), as sketched after this list.
- Event Enrichment: The ingest-worker processes events, adding details like geolocation data and device information.
- Data Streaming: Processed events are sent to another KDS for routing to various consumers.
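As a rough illustration of the validation step, here is a TypeScript sketch of a validate-and-forward endpoint using the AWS SDK v3 Kinesis client; the stream name, event shape, and validateAgainstSchema helper are assumptions, not Canva’s actual code:

```typescript
import { KinesisClient, PutRecordsCommand } from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({ region: "us-east-1" });

interface IncomingEvent {
  type: string;
  payload: Record<string, unknown>;
}

// Placeholder for validation against the generated schema definitions.
function validateAgainstSchema(event: IncomingEvent): boolean {
  return typeof event.type === "string" && event.payload !== undefined;
}

export async function forwardValidEvents(events: IncomingEvent[]): Promise<void> {
  const valid = events.filter(validateAgainstSchema);
  if (valid.length === 0) return;

  await kinesis.send(
    new PutRecordsCommand({
      StreamName: "analytics-ingest", // hypothetical stream name
      Records: valid.map((event) => ({
        Data: new TextEncoder().encode(JSON.stringify(event)),
        // Partitioning by event type spreads load across shards.
        PartitionKey: event.type,
      })),
    }),
  );
}
```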
Optimizing for Performance and Cost
Canva has implemented several optimizations to enhance their analytics pipeline:
Kinesis Data Stream (KDS) Adoption
After evaluating options, Canva migrated from AWS SQS and SNS to KDS, resulting in an 85% cost reduction while maintaining acceptable performance.
Data Compression
By implementing zstd compression on batches of events, Canva achieved a 10x compression ratio, saving an estimated $600K per year in AWS costs.
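A minimal sketch of batch compression, assuming the @mongodb-js/zstd binding (the batch shape and compression level are illustrative):

```typescript
import { compress } from "@mongodb-js/zstd";

async function compressBatch(events: object[]): Promise<Buffer> {
  // Compressing whole batches rather than single events is what makes a
  // ~10x ratio plausible: field names and common values repeat across
  // events and compress away.
  const payload = Buffer.from(JSON.stringify(events), "utf8");
  return compress(payload, 3); // level 3: a common speed/ratio trade-off
}
```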
SQS Fallback Mechanism
To handle KDS throttling and high tail latency, Canva implemented an SQS fallback system. This ensures consistent response times and provides a failover option during KDS outages.
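A minimal sketch of that KDS-first, SQS-fallback write path, assuming AWS SDK v3 clients; the stream name, queue URL, and error-handling policy are illustrative:

```typescript
import { KinesisClient, PutRecordCommand } from "@aws-sdk/client-kinesis";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const kinesis = new KinesisClient({});
const sqs = new SQSClient({});

export async function publishWithFallback(data: string, key: string): Promise<void> {
  try {
    await kinesis.send(
      new PutRecordCommand({
        StreamName: "analytics-ingest", // hypothetical
        Data: new TextEncoder().encode(data),
        PartitionKey: key,
      }),
    );
  } catch {
    // On throttling or a slow/failed write, divert the event to SQS so the
    // caller still gets a fast, successful acknowledgement.
    await sqs.send(
      new SendMessageCommand({
        QueueUrl: "https://sqs.us-east-1.amazonaws.com/123456789012/analytics-fallback", // hypothetical
        MessageBody: data,
      }),
    );
  }
}
```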
Distribution and Consumption
Canva’s decoupled router allows for flexible event distribution to various consumers:
- Snowflake: All event types are delivered to Snowflake using Snowpipe Streaming for analytics and dashboarding.
- Real-time Consumers: Backend services can subscribe to events via KDS or SQS, depending on their needs; a consumer sketch follows this list.
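Here is a sketch of what an SQS-based consumer might look like, assuming AWS SDK v3; the queue URL and handler are illustrative:

```typescript
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL =
  "https://sqs.us-east-1.amazonaws.com/123456789012/my-consumer"; // hypothetical

export async function pollOnce(handle: (body: string) => Promise<void>): Promise<void> {
  const { Messages } = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20, // long polling cuts down on empty receives
    }),
  );
  for (const message of Messages ?? []) {
    await handle(message.Body ?? "");
    // Deleting only after successful processing preserves at-least-once
    // semantics: a crash before this line means the message is redelivered.
    await sqs.send(
      new DeleteMessageCommand({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle!,
      }),
    );
  }
}
```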
Deduplication and Guarantees
Canva’s analytics service provides an at-least-once delivery guarantee, with the responsibility for deduplication falling on individual consumers when necessary.
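A minimal sketch of consumer-side deduplication under at-least-once delivery, assuming each event carries a unique eventId; the in-memory set is illustrative (a production consumer would more likely use Redis or a database uniqueness constraint):

```typescript
const seenEventIds = new Set<string>();

function processOnce(
  event: { eventId: string; payload: unknown },
  handle: (payload: unknown) => void,
): void {
  if (seenEventIds.has(event.eventId)) {
    return; // duplicate redelivery: already handled, safely ignore
  }
  seenEventIds.add(event.eventId);
  handle(event.payload);
}
```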
With this robust and scalable analytics pipeline, Canva has built a powerful foundation for data-driven decision-making and feature development, processing 25 billion events daily while keeping the system reliable and cost-effective.