Tim Burns

Persistence and Transience - Navigating the Modern Data Architecture



Exhibit at the Southern VT Arts Center


A Case for Transient Data Sources


The modern data architecture spans a wide range of data storage systems. A few examples of the data engines the modern data architect encounters include big data engines like Databricks and Snowflake, caching systems like Redis, and message queues like SQS. When data needs persistence and transfer, a data architect must comprehend, design, integrate, govern, and manage it.


Not surprisingly, with such a diverse array of tools and paradigms, the best path for the data architect is often to avoid persistence. Transient data reduces cost and technical complexity because it disappears when you finish with it.


An Amazon team achieved significant cost savings and operational efficiency by transitioning from a storage system that relied on S3 to one that stores all intermediate data in memory. This successful move is detailed in the article 'Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%.'


"We realized that distributed approach wasn't bringing a lot of benefits in our specific use case, so we packed all of the components into a single process. This eliminated the need for the S3 bucket as the intermediate storage for video frames because our data transfer now happened in the memory. We also implemented orchestration that controls components within a single instance."
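
The single-process hand-off described in the quote can be sketched with Python generators. This is a minimal illustration, not the Prime Video implementation: the stage names (`extract_frames`, `detect_defects`, `summarize`) and the "empty frame is a defect" rule are hypothetical, chosen only to show data moving between stages in memory with no intermediate bucket.

```python
from typing import Iterable, Iterator


def extract_frames(raw_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Stage 1 (hypothetical): yield frames one at a time, never writing them out."""
    for chunk in raw_chunks:
        yield chunk


def detect_defects(frames: Iterable[bytes]) -> Iterator[dict]:
    """Stage 2 (hypothetical): flag a frame as defective when it is empty."""
    for index, frame in enumerate(frames):
        yield {"frame": index, "defect": len(frame) == 0}


def summarize(results: Iterable[dict]) -> dict:
    """Stage 3: consume the stream; each result is discarded once counted."""
    total = 0
    defects = 0
    for result in results:
        total += 1
        defects += int(result["defect"])
    return {"frames": total, "defects": defects}


# The stages are chained in one process, so frames only ever live in memory.
report = summarize(detect_defects(extract_frames([b"abc", b"", b"xyz"])))
print(report)
```

Because each stage pulls from the previous one lazily, only one frame is in flight at a time, and nothing remains to clean up when the run ends.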

A second case in point is the use of transient and in-memory data structures in processing data. Limiting persistence helps maintain clean and performant data environments and saves the data architect many headaches before they start.


Let's explore some advantages of building complex systems using transient data for distribution.


Reduced Storage Requirements

Using transient data reduces the need for long-term storage capacity, which can be a substantial cost component, especially in big-data applications. A system using transient data releases any allocated resources when it completes the operation, and data does not accumulate.
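
As a rough sketch of "release resources when the operation completes," the example below stages a batch in a scratch directory that exists only for the duration of the call. The function name `process_batch` and the file name `batch.tmp` are illustrative assumptions; the standard-library `tempfile.TemporaryDirectory` removes the directory and its contents when the `with` block exits.

```python
import os
import tempfile


def process_batch(records):
    """Stage records in scratch storage, total them, and leave nothing behind."""
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "batch.tmp")  # hypothetical file name
        with open(path, "w", encoding="utf-8") as handle:
            for record in records:
                handle.write(f"{record}\n")
        with open(path, encoding="utf-8") as handle:
            total = sum(int(line) for line in handle)
    # The scratch directory was deleted when the with-block exited.
    return total, os.path.exists(scratch)


result = process_batch([1, 2, 3])
print(result)
```

The second element of the returned tuple reports whether the scratch directory still exists after processing, which makes the "data does not accumulate" property easy to check.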


Enhanced Performance

Transient data typically lives in faster storage, either in memory or in temporary high-bandwidth storage. In applications that require high performance while processing data, the data is ready when the processors need it, so throughput is limited by processing capability rather than data transfer rates.


Scalability

A well-designed system that relies on transient data can scale horizontally with little effort because each worker can operate on its own isolated data set. We can add workers to handle larger workloads without network overhead and synchronization bottlenecks.
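
The isolation property can be illustrated with a small sketch in which every worker receives its own partition and builds only local state. The names `partition`, `process_partition`, and `run` are hypothetical, and a thread pool stands in for what would more likely be separate processes or containers in a real deployment.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, workers):
    """Split records into one isolated slice per worker."""
    return [records[i::workers] for i in range(workers)]


def process_partition(part):
    """Count items using only local, transient state; no shared store is touched."""
    local_counts = {}
    for item in part:
        local_counts[item] = local_counts.get(item, 0) + 1
    return local_counts


def run(records, workers=4):
    """Fan out to workers, then merge their small partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_partition, partition(records, workers)))
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged


counts = run(["a", "b", "a", "c", "a"], workers=2)
print(counts)
```

Since workers never read or write a common data set, raising `workers` adds capacity without adding locks; the only coordination is the final merge of partial results.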


Security and Compliance

Fewer copies of data mean fewer sources of data exposure, which can also help with audits. Access to persistent data sources can be hardened, and transient data sources can be protected by hardening access to the systems where they operate rather than individual data sets. 


Reproducibility and Resilience

Transient data sets do not need to be maintained over time, yet they are resilient: conditions can be reproduced easily by rerunning the processes that create and use the data. Recreating the active data set with every execution allows for effective unit and guardrail tests to ensure data integrity, quality, and consistency.
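
One way to picture this is a pipeline that rebuilds its working set from the source on every run and checks it before use. The sketch below assumes a hypothetical source of order rows; `build_active_set`, `check_guardrails`, and the two rules (unique ids, non-negative amounts) are illustrative, not a prescribed set of checks.

```python
def build_active_set(source_rows):
    """Rebuild the transient working set from scratch; rows without an amount are dropped."""
    return [
        {"id": row["id"], "amount": round(float(row["amount"]), 2)}
        for row in source_rows
        if row.get("amount") is not None
    ]


def check_guardrails(rows):
    """Hypothetical guardrails: fail fast if the rebuilt set violates basic expectations."""
    ids = [row["id"] for row in rows]
    assert len(ids) == len(set(ids)), "duplicate ids in active set"
    assert all(row["amount"] >= 0 for row in rows), "negative amount in active set"


SOURCE = [
    {"id": 1, "amount": "10.50"},
    {"id": 2, "amount": None},
    {"id": 3, "amount": "4"},
]

first_run = build_active_set(SOURCE)
check_guardrails(first_run)

# Rerunning the process reproduces the same data set; nothing is carried over.
second_run = build_active_set(SOURCE)
print(first_run == second_run)
```

Because the working set is a pure function of the source, the same checks double as unit tests: feed in a small fixture and compare the output.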


The Takeaway - Look for Solutions that Embrace Transience and Avoid the Auxiliary Database Anti-Pattern

The individual technologies that use transience share one property: the data is gone once the process has consumed it. In a database, this can take the form of a temporary table or a CTE. In a data transfer process, it can be a set of records in a FIFO message queue that a system removes and processes. It could also be a large hash set held in the memory of a container instance, processed and then discarded.
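
The database form of transience can be sketched with Python's built-in `sqlite3` module, used here only as a stand-in for whichever engine you run. The table name `staged_orders`, the function `top_customers`, and the sample rows are hypothetical. The temporary table lives only for the life of the connection, and the CTE (`totals`) exists only for the duration of the query.

```python
import sqlite3


def top_customers(orders, minimum):
    """Stage orders in a temp table, aggregate through a CTE, and keep nothing."""
    conn = sqlite3.connect(":memory:")
    try:
        # TEMP tables are private to this connection and vanish when it closes.
        conn.execute("CREATE TEMP TABLE staged_orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO staged_orders VALUES (?, ?)", orders)
        rows = conn.execute(
            """
            WITH totals AS (
                SELECT customer, SUM(amount) AS total
                FROM staged_orders
                GROUP BY customer
            )
            SELECT customer, total
            FROM totals
            WHERE total >= ?
            ORDER BY customer
            """,
            (minimum,),
        ).fetchall()
    finally:
        conn.close()
    return rows


result_rows = top_customers([("ann", 5.0), ("bob", 2.0), ("ann", 7.0)], 10)
print(result_rows)
```

Only the final result leaves the function; the staged rows and the intermediate totals never become objects that someone has to govern, back up, or drop later.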


When writing new systems or fixing data transfer performance or quality problems, watch for the anti-pattern of an auxiliary database. If you need to create a database or S3 bucket that is not your CRM, ERP, or OLAP database, consider whether a transient data source is a better alternative. Storing data so that multiple instances can access it often indicates that the problem lies in your process design, and that you need to refactor your architecture so each worker process can access all the data it needs without depending on the others.


In conclusion, be as transient as you can. Use memory first, then temporary tables, and finally, transient tables. By focusing on keeping active data transient, you will deliver more scalable, flexible, and cost-effective architectures.


