<!--toc:start-->
- [The Composable Codex](#the-composable-codex)
- [A New Frontier](#a-new-frontier)
- [Micro Trends](#micro-trends)
- [Macro trends](#macro-trends)
- [Building a new frontier](#building-a-new-frontier)
- [Structure of a composable data system](#structure-of-a-composable-data-system)
- [Standards Over Silos](#standards-over-silos)
- [Examples of adopting standards](#examples-of-adopting-standards)
- [Arrow](#arrow)
- [Hierarchy of needs](#hierarchy-of-needs)
- [Bridging Language Divides](#bridging-language-divides)
<!--toc:end-->
# The Composable Codex
https://voltrondata.com/codex
# A New Frontier
> Instead of coding from scratch, the pieces are now in place to build by composing first.
> What would push a company over the edge to choose to build their own data system?
## Micro Trends
> 1. Lock-in: Nobody likes feeling locked into their stack. Especially when they know that any change to the system means they will need to:
> - Migrate all the data into the new system,
> - Retrain all the data teams to use the new system, and
> - Rewrite all the existing data workloads in a new user interface.
>
> 2. Scale: Nobody likes running out of resources. Especially when they have to rewrite queries to scale a workload.
>
> 3. Performance: Nobody likes a slow system. Speed matters. Compute time matters. Data transport time matters.
## Macro trends
1. The AI Arms Race
2. The rise of GPUs and other accelerated hardware
> What was once a field where CPUs reigned is now “a wild west of hardware”, where chips like GPUs, FPGAs, ASICs, TPUs, and other specialized hardware have been rapidly evolving. But, as noted at JupyterCon in 2018, “the tools we use have not scaled so gracefully to take advantage of modern hardware.”
> In the future, software would provide a gateway to accelerated hardware, but only for those whose systems were positioned to win the hardware lottery. Just as with AI, no one wanted to hear the question “GPUs are here - are we ready?” and answer “no.” The FUD (fear, uncertainty, and doubt) was real.
- https://hardwarelottery.github.io/
## Building a new frontier
[[composable-data-management-system-manifesto]]
3 main layers
![[diagram-03-txt.svg]] ![[diagram-03.svg]]
- User interface
- language frontend/api
- Execution engine
- performs operations on data, as specified by users
- Data storage
> Many talented engineers are developing the same thing over and over again, building systems that are essentially minor incremental improvements over the status quo.
1. Open source gives you more choices
- unix philosophy: build a component that does one thing well
- > You are starting to see various projects, in the open source community or organizations, to break out portions of a modern OLAP system into standalone components that are open source that other people can build on, other people can take advantage of, and other people could reuse for their own new OLAP systems. The idea here is that instead of everyone building the same thing, let's everyone work together on one thing, and have that be really really good and everyone gets to reap the rewards.
- > The downside of having more choices is that ultimately you do have to make choices.
2. Standards help you make better choices
- > interoperability is the "ability of a system or a product to work with other systems or products without special effort on the part of the customer. Interoperability is made possible by the implementation of standards."
- interoperability in data systems
1. Data interop
- common in-memory data structures (a minimal sketch follows this list)
- > This requires common data structures to represent datasets in-memory while they are being processed.
2. Query interop
- portable format for representing query plans
- > This requires a common format for representing query plans that are portable across engines, and not dependent on a specific storage engine, database, or SQL dialect.
3. System interop
- serialization and data interchange interfaces
- > This requires serialization and data interchange interfaces (network wire protocols, database clients, etc.) for moving data.
- Challenges of BYODB teams (writing your own)
1. Maintenance is forever
2. Performance is unpredictable
3. Change is constant
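
A minimal sketch of the data-interop idea above, assuming only that `pyarrow` and `pandas` are installed (the sample data is made up): once a dataset is in the Arrow in-memory format, different libraries can work against the same columnar structures instead of each inventing its own representation.

```python
import pyarrow as pa

# Build a dataset once, in the standard Arrow in-memory format.
table = pa.table({
    "city": ["Oslo", "Lima", "Pune"],
    "temp_c": [4.0, 19.5, 31.2],
})

# Any Arrow-aware library can consume the same columnar layout.
# (to_pandas() is a convenience bridge; Arrow-native engines can often
# read the Arrow buffers directly, with no conversion step at all.)
df = table.to_pandas()
print(df.dtypes)

# Round-tripping stays in the same standard format.
table_again = pa.Table.from_pandas(df)
print(table_again.schema)
```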
> The shift is to move away from building systems by coding first and to instead start building by composing first.
---
## Structure of a composable data system
same 3 core layers "do-ers"
1. User interface
2. execution engine
3. data storage
"gluers": core standards that glue layers together
1. standard way to represent query plans
2. standard for accessing databases
3. standard format for representing data in memory
> Standards serve as the strips of glue that are needed to bridge the gaps between the user interface, execution engine, and data storage layers in a way that is:
>
> - Documented
> - Consistent
> - Reusable
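
A rough sketch of two of those glue strips working together, assuming the `adbc-driver-sqlite` and `pyarrow` packages are installed: ADBC is the standard database-access layer, and results come back already in the standard Arrow in-memory format, so no bespoke glue sits between the user interface and the engine.

```python
import adbc_driver_sqlite.dbapi as sqlite_adbc

# ADBC: one standard client API for talking to databases
# (this driver opens an in-memory SQLite database by default).
conn = sqlite_adbc.connect()
cur = conn.cursor()
cur.execute("SELECT 1 AS answer, 'arrow' AS fmt")

# Arrow: one standard in-memory format for the results.
result = cur.fetch_arrow_table()
print(result.schema)

cur.close()
conn.close()
```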
---
Why now?
https://www.kleinerperkins.com/perspectives/infrastructure-in-23/#:~:text=The%20unbundling%20of%20OLAP
https://ottertune.com/blog/2022-databases-retrospective
https://www.usenix.org/publications/login/winter2018/khurana
https://www.cs.cmu.edu/~natassa/courses/15-721/papers/P001.pdf
> What is new now is that mere mortals can build composable data systems. Because of the BYODB movement, we now have the pieces in place for everybody else to compose with:
>
> - A reliable ecosystem of open source projects that paves the way for everyone else to “stand on the shoulders of giants.”
> - Standards to ensure interoperability between those open source components.
"the adjacent possible"
> Switching to a composable system framework is a process. In our experience, you cannot completely move to a composable data system easily or quickly. What you can do is minimize the friction involved in switching by:
>
> - Arming yourself with knowledge about every layer of the system. This involves doing a bunch of research for each and every component, and asking the important interoperability questions.
> - Discouraging developing in the dark. Teams need to be very cautious about letting engineers in a vacuum just implement the version of each component aspect the best they could at the time. A lot of the time, this development work is filling in interoperability gaps that standards could bridge better.
> - Designing for deviance in the system. You will have to work to retain enough extensibility to avoid lock-in or any single point of failure for a given layer in the system. Anytime you have only one component in a layer, you risk lock-in.
> Start where you are, use what you have, do what you can.
# Standards Over Silos
- standards in modern data systems
- https://mad.firstmark.com/
- [[emerging-architectures-for-modern-data-infrastructure]] https://a16z.com/emerging-architectures-for-modern-data-infrastructure/
## Examples of adopting standards
1. Streamlit: migrating its serialization infra to Arrow
- > In our legacy serialization format, as DataFrame size grew, the time to serialize also increased significantly… Just compare the performance of our legacy format vs Arrow. It's not even funny!
- got to delete over 1k lines of legacy serialization code
- [[all-in-on-apache-arrow]] https://blog.streamlit.io/all-in-on-apache-arrow/
2. Meta
- [[shared-foundations-modernizing-metas-data-lakehouse]] https://research.facebook.com/publications/shared-foundations-modernizing-metas-data-lakehouse/
> Over the last three years, we have implemented a generational leap in the data infrastructure landscape at Meta through the Shared Foundations effort. The result has been:
>
> - a more modern, composable, and consistent stack,
> - with fewer components, richer features, consistent interfaces, and
> - better performance for the users of our stack, particularly, machine learning and analytics.
>
> We have deprecated several large systems and removed hundreds of thousands of lines of code, improving engineering velocity and decreasing operational burden.
- save devs from writing and maintaining thousands of lines of glue code.
- benefits of developing on top of an open standards ecosystem:
1. Faster and more productive engineering teams - less duplicated work means more time for innovation
2. Tighter innovation cycles - targeted feature development on a smaller code base means faster releases
3. Co-evolution of database software and hardware - Unifying the core layers means better performance and scalability
4. Better user experience - More consistent interfaces and semantics means a smoother user experience
- standard = rules + community
- good standard has rules that are:
- documented
- consistent
- reusable
- stable
- community
- [[First Follower: Leadership Lessons from a Dancing Guy]] https://sive.rs/ff
- nurture first few followers. be public. be easy to follow.
- *first follower*: best way to make a movement is to courageously follow and show others how to follow
- followers follow other followers, not the leader
- factors to consider
1. Open - Is the standard open to be inspected by self-selecting members of the community? Does it have an open source license (see https://choosealicense.com/)?
2. Community – Is there an active community that contributes to the standard? Has the standard evolved since it was created? Can it adapt when the world changes?
3. Governance - Does the standard have established governance? Is the group made up of people from more than one company?
4. Adoption - Are people actually using the standard? Is there a list of organizations that are on record about adopting the standard?
5. Ecosystem - Is there an ecosystem of software projects that build on top of and extend the standard?
## Arrow
- standardized in-memory format for structured tabular data
> Because when you are building data-intensive analyses and applications, systems get stuck on two main tasks:
1. Moving data
- When a workload is transport-bound (or input/output [I/O]-bound), the speed of execution depends on the rate of transfer of data into or out of a system.
2. Processing data
- When a workload is compute-bound, the speed of execution depends on the speed of the processor, whether it is a CPU, GPU, or another type of hardware.
![[figure-07.svg]]
- columnar format
1. Better I/O
- How: Reading and writing only the columns that are needed for a particular query
- Why: Each column is stored separately, so the processor can skip the columns a query does not touch
2. Lower memory usage
- How: Holding in memory only the values for the needed columns, rather than entire rows
- Why: Because columns are stored separately, unused columns never have to be loaded at all
3. Significantly faster computation
- How: Allowing processors to process data in parallel
- Why: Multiple elements in a column (vector) can be processed simultaneously through vectorized execution, taking advantage of multi-core processors
- zero copy
1. The Arrow format is the same across libraries, so you can share data between processes without copying (see the sketch at the end of this section).
2. It is also the same format on the wire, so you can pass data around the network without the costs of serialization and deserialization.
- data in cloud storage, points of friction
1. From storage to system memory: Getting the data into the Arrow format for in-memory computation often meant that developers needed to write glue code to convert or preserve the Arrow format up and down the layers of the stack.
2. From system A memory to system B memory: Because most pipelines involved moving data through the layers of the stack via multiple processes or systems, engineers were writing glue code to de- and re-format the data so that each system could operate on it.
- building with arrow
- Augment data systems: developers can layer Arrow into existing systems by developing with open source Arrow standards & components individually.
- Compose Arrow-native data systems: developers who are starting from scratch can build complete systems with Arrow standards & components at their core.
- composable standards
- Intermediate Representation: Substrait
- Connectivity: ADBC
- Data memory layout: Arrow
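
To make the columnar-format and zero-copy points above concrete, a small sketch using only `pyarrow` (an assumption about the environment; the data is invented): the in-memory layout is columnar, and the IPC wire format is that same layout, so moving a table between processes is mostly shipping buffers rather than serializing and deserializing rows.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.ipc as ipc

# Columnar layout: each column is a contiguous vector, so vectorized
# kernels can run over it without touching the other columns.
table = pa.table({
    "sensor": ["a", "b", "a", "c"],
    "reading": [0.1, 0.4, 0.35, 0.2],
})
print(pc.mean(table["reading"]))

# Same format on the wire: the IPC stream is just the Arrow buffers,
# so the receiver gets them back without row-by-row deserialization.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

received = ipc.open_stream(sink.getvalue()).read_all()
assert received.equals(table)
```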
## Hierarchy of needs
> People and politics are the bottleneck. Not performance.
All you need is MICE
1. Modular: Standalone + swappable parts
2. Interoperable: Parts can connect and exchange information
3. Customizable: Parts can be combined into a complete custom system
4. Extensible: Extend system parts to new hardware, engines, etc.
# Bridging Language Divides
- language choice for data people
1. helps them do what they need to do: context-dependent language choices
2. access to tools that "fit their hand"
> In reality, programming languages are how programmers express and communicate ideas - and the audience for those ideas is other programmers, not computers. The reason: the computer can take care of itself, but programmers are always working with other programmers, and poorly communicated ideas can cause expensive flops.
> On the road from “think it” to “do it,” a lot of daily data drudgery happens at the “describe it” stop. The first step in finding common ground is to acknowledge what parts of the system have let us down:
1. **The friction of trying to work with multiple suboptimal tools**, because there is no single, perfect tool
2. **The frustration of the illusion of tool choice when there is none**, because the powers that be make those choices for you
3. **The struggle of working without interoperable pieces**, because things do not work together as well as they could
> Unfortunately, to do anything useful with data, any single UI choice will not suffice - there is no magic wand. Most tasks require more than one, and every choice has a tradeoff:
- SQL (Structured Query Language)
- Examples: MySQL, PostgreSQL
- Benefits: “intergalactic dataspeak”, i.e., the standard way for programmers to talk to databases (Michael Stonebraker)
- Limitations:
- So many SQL dialects, because no two databases implement the SQL standard in quite the same way
- Common programming tasks are more difficult than with a scripting language
- Difficult to adapt to modern workloads
- Programming language
- Examples: Python, R, Go, Java
- Benefits: Stay in the flow of a single language made for programming
- Limitations:
- General-purpose language APIs are not designed for analytics or tabular data manipulation
- Dataframe API
- Examples: pandas or Ibis for Python, dplyr/dbplyr or data.table for R
- Benefits: API designed for analytics with tabular data that has rows and columns
- Limitations
- So many different APIs
- Features have not kept up with the needs of analytics workloads
- Sparse engine coverage
> The point of this table is not to point out how all the tools are flawed. The point is that any changes to the system should start by asking “How might we design our data system to support people to do good work using imperfect tools?”
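
For a feel of the tradeoffs in the list above, here is the same toy aggregation expressed through a SQL UI and a dataframe UI. This sketch assumes `duckdb` and `pandas` are installed; the penguin-style sample data is made up.

```python
import duckdb
import pandas as pd

df = pd.DataFrame({
    "species": ["gentoo", "adelie", "gentoo"],
    "body_mass_g": [5100, 3700, 4900],
})

# SQL UI: "intergalactic dataspeak", but dialect-dependent in practice.
sql_result = duckdb.sql(
    "SELECT species, avg(body_mass_g) AS avg_mass FROM df GROUP BY species"
).df()

# Dataframe UI: the same question, expressed in the host language.
df_result = df.groupby("species", as_index=False)["body_mass_g"].mean()

print(sql_result)
print(df_result)
```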
---
> People look for ways to make complex problems easy rather than make it easy to work on complex problems... Asking how we can support our teams who NEED the time and space to work through that, it's often a much more tractable lever than trying to change the inherent nature of the work.
- intermediate representation
- translator between UIs and engines.
- UI turns a user's query into IR
- execution engine turns IR into the specific kind of code that it can execute
- users do not learn it, use it, or touch it
- adding a new frontend or backend goes from an M×N problem to an M+N problem
![[figure-07 1.svg]]
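
A back-of-the-envelope illustration of that M×N vs M+N point (the frontend and engine names are just placeholders):

```python
frontends = ["dataframe_api", "sql_dialect", "r_api"]   # M user interfaces
engines = ["single_node", "distributed", "gpu"]         # N execution engines

# Without an IR: one bespoke translator per (frontend, engine) pair.
pairwise = len(frontends) * len(engines)   # M x N = 9

# With an IR: each frontend targets the IR, each engine consumes it.
via_ir = len(frontends) + len(engines)     # M + N = 6

print(pairwise, via_ir)
```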
---
## #substrait
- open IR standard for UIs and engines to represent analytical plans
![[figure-08.svg]]
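
A minimal sketch of producing a Substrait plan from an Ibis expression. This assumes the `ibis-framework` and `ibis-substrait` packages (and that their versions are compatible); the table schema is invented and the exact compiler API may differ between releases.

```python
import ibis
from ibis_substrait.compiler.core import SubstraitCompiler

# A lazy Ibis expression built against a schema-only table.
t = ibis.table({"species": "string", "body_mass_g": "int64"}, name="penguins")
expr = t.group_by("species").aggregate(avg_mass=t.body_mass_g.mean())

# Compile the expression into a Substrait plan (a protobuf message)
# that any Substrait-consuming engine could, in principle, execute.
plan = SubstraitCompiler().compile(expr)
print(type(plan))
```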
---
## Case study: #Ibis
### Modular
1. choosing one UI should not change the end result, should not require changes in other layers of the system like the engine or storage layers, and should not limit your future choices.
- separates data transformation code from the compute operations inside the execution engine
- avoids data duplication by connecting to and running in the execution engine
2. it should be “plug and play”: the hurdles to entry and exit should be minimal. A UI needs to support two system changes:
- adopting the ui
- keeps users in Python programming language
- data transformation API is consistent; what you know from other dataframe libraries like pandas helps you learn it
- generates SQL as an escape hatch
- offers feature coverage comparable to similar libraries, allowing you to migrate an existing codebase
- migrating away from the ui
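
A short sketch of the adoption path above, assuming a recent Ibis with its default DuckDB backend installed (the sample data is made up): write the transformation in Python, peek at the generated SQL as an escape hatch, and let an engine do the execution.

```python
import ibis

# The UI layer: a dataframe API in plain Python.
penguins = ibis.memtable({
    "species": ["gentoo", "adelie", "gentoo"],
    "body_mass_g": [5100, 3700, 4900],
})
expr = penguins.group_by("species").aggregate(
    avg_mass=penguins.body_mass_g.mean()
)

# The SQL escape hatch: inspect what would be sent to an engine.
print(ibis.to_sql(expr))

# Execution happens in an engine (Ibis's default local backend is DuckDB),
# not by materializing the data inside the UI layer.
print(expr.to_pandas())
```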
---
### Interoperable
A UI needs to interoperate on two levels:
1. Query plans: Produce a standard query plan to pass to a compatible engine
- produces Substrait plans (WIP on the roadmap)
2. Data: Return results as a standard in-memory data format
- returns Arrow-formatted data
- reads from and writes to other dataframe libraries
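
A minimal sketch of that data-level interop, again assuming a recent Ibis with the default DuckDB backend and `pyarrow` installed (sample data invented): results come back as Arrow, and Arrow data can flow back in.

```python
import ibis

t = ibis.memtable({"species": ["gentoo", "adelie"], "body_mass_g": [5100, 3700]})
expr = t.group_by("species").aggregate(avg_mass=t.body_mass_g.mean())

# Results in the standard Arrow in-memory format...
arrow_table = expr.to_pyarrow()
print(arrow_table.schema)

# ...and Arrow (or pandas) data can flow back in as an Ibis table.
roundtrip = ibis.memtable(arrow_table)
```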
---
### Customizable
- Change system components without rewriting code or query logic
- Changing the engine or storage is possible with a one-line change in the configuration code.
- Prototype locally and deploy to production faster
- Changing from local/development environment to distributed is possible with a one-line change in the configuration code.
- Supports testing of queries in case of migration
- Uses deferred evaluation and type checking to ensure that your queries fail early if they contain errors.
- Supports connections to data across disparate sources
- Connects to multiple sources at once to combine data in a single analysis.
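
A sketch of the one-line swap, assuming a recent Ibis with the DuckDB backend installed; the Postgres connection details are hypothetical and only there to show that the query code below does not change.

```python
import ibis

# Development: a local, in-memory engine.
con = ibis.duckdb.connect()

# Production: same transformation code, different one-line connection.
# (Hypothetical connection details, shown only for illustration.)
# con = ibis.postgres.connect(host="db.internal", database="analytics")

events = con.create_table("events", ibis.memtable({"kind": ["a", "b", "a"]}))
daily = events.group_by("kind").aggregate(n=events.count())
print(daily.to_pandas())
```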
---
### Extensible
1. Piggybacked extensions
- multiple backends implemented for Ibis
2. Greenfield extensions
- User-defined functions (UDFs)
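
A minimal sketch of a greenfield extension via a scalar UDF, assuming a recent Ibis (6+) where the `@ibis.udf.scalar.python` decorator is available, running on the default DuckDB backend; the conversion function is just an example.

```python
import ibis


# A user-defined scalar function, registered through Ibis's UDF decorator
# and executed by whichever backend runs the expression.
@ibis.udf.scalar.python
def grams_to_kg(grams: float) -> float:
    """Convert a mass in grams to kilograms."""
    return grams / 1000.0


t = ibis.memtable({"body_mass_g": [5100.0, 3700.0]})
expr = t.mutate(body_mass_kg=grams_to_kg(t.body_mass_g))
print(expr.to_pandas())
```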