Sunday, October 5, 2025

Data Storage For Analytics And AI

For a small or medium-sized company, storing all the data in a relational database like PostgreSQL or MySQL is sufficient. If analytics is needed, they might also use a columnar store, something closer to a data warehouse.

What if you have large amounts of unstructured data as well? Maybe logs from your e-commerce site, emails, support tickets, and so on. Those need to be queried, aggregated, summarized, and reported on as well.

As your business grows to handle large volumes of such data, plus images or customer audio, storing everything in a single RDBMS becomes impractical. These new data types require specialized architectures designed for scale, flexibility, and advanced analytics (like Machine Learning and Generative AI).

Here is a brief guide to the key data storage paradigms you will encounter:

1. Relational Database Management Systems (RDBMS)


This needs no introduction.

Primary Use Case: Online Transaction Processing (OLTP). Applications requiring fast, frequent reads and writes, and ACID compliance.

Data Structure: Data is modeled as normalized rows and columns. Explicit relationships are enforced using foreign keys. In most cases, the underlying storage is implemented as a B+ tree.

Schema Approach: The schema must be defined and enforced before data can be written.

Examples: PostgreSQL (Open Source), MySQL (Open Source/Commercial), Oracle, Microsoft SQL Server.
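
To make schema-on-write and enforced relationships concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customers/orders tables are invented for illustration:

```python
import sqlite3

# Schema-on-write: tables and constraints are declared before any data is stored.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
con.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    )
""")
con.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total_cents INTEGER NOT NULL
    )
""")

# ACID in miniature: both inserts commit together or not at all.
with con:
    con.execute("INSERT INTO customers (id, email) VALUES (1, 'a@example.com')")
    con.execute("INSERT INTO orders (customer_id, total_cents) VALUES (1, 4999)")

print(con.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # -> 1
```

If either insert inside the `with con:` block fails, neither is committed; that transactional guarantee is exactly what OLTP workloads rely on.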

2. Data Warehouse (DW)


Primary Use Case: Online Analytical Processing (OLAP). Business Intelligence (BI), historical reporting, and generating complex aggregated reports across years of data.

Data Structure: Columnar data store. Data is often denormalized into Star or Snowflake schemas to optimize large, analytical JOIN queries.

Schema Approach: Schema-on-Write: Data is cleaned, transformed, and structured via ETL/ELT pipelines before loading.

Examples: Snowflake, Google BigQuery, Amazon Redshift, Apache Pinot, Apache Druid, ClickHouse.
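
To illustrate the star-schema idea, here is a small sketch using DuckDB, an open-source in-process columnar engine; the fact and dimension tables are hypothetical:

```python
import duckdb  # pip install duckdb

con = duckdb.connect()  # in-memory database with columnar execution

# A tiny star schema: one fact table referencing one dimension table.
con.execute("CREATE TABLE dim_date (date_key INTEGER, year INTEGER, month INTEGER)")
con.execute("CREATE TABLE fact_sales (date_key INTEGER, amount DOUBLE)")
con.execute("INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2)")
con.execute("INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 75.0)")

# A typical OLAP aggregation: join the fact table to a dimension and group.
rows = con.execute("""
    SELECT d.year, d.month, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d USING (date_key)
    GROUP BY d.year, d.month
    ORDER BY d.year, d.month
""").fetchall()
print(rows)  # [(2024, 1, 150.0), (2024, 2, 75.0)]
```

A real warehouse runs the same shape of query over billions of fact rows, which is where columnar storage pays off.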

3. Data Lake


Primary Use Case: Storing all data (raw and processed) at massive scale for Data Science, Machine Learning (ML), and exploratory analytics.

Data Structure: Stores data in its native, raw format—structured, semi-structured (JSON, XML), and unstructured (logs, images, audio).

Schema Approach: Schema-on-Read: Structure is applied dynamically by the query engine when the data is read. This offers maximum flexibility.

Examples: Amazon S3 (object storage), Azure Data Lake Storage (ADLS), Apache Hadoop (HDFS), plus open table formats such as Delta Lake and Apache Hudi that are often layered on top.
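
Schema-on-read can be illustrated with plain raw log lines: the files are stored as-is, and structure is applied only when a query runs. A toy sketch (the log format and field names are made up):

```python
import json

# In a real lake these would be objects in S3 or ADLS; here, raw JSON lines.
raw_lines = [
    '{"event": "click", "user": "u1", "ts": 1700000000}',
    '{"event": "purchase", "user": "u2", "ts": 1700000060, "amount": 19.99}',
]

# Schema-on-read: each consumer decides which fields it cares about at query time.
purchases = [
    rec for rec in (json.loads(line) for line in raw_lines)
    if rec.get("event") == "purchase"
]
print(sum(rec["amount"] for rec in purchases))  # 19.99
```

Nothing validated those records on the way in; the flexibility (and the risk of messy data) both come from deferring structure to read time.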

4. Data Lakehouse


Primary Use Case: Unifying the scale and flexibility of a Data Lake with the reliability and performance of a Data Warehouse.

Data Structure: Hybrid: Stores all raw data in the lake but adds a metadata and transaction layer (e.g., Delta Lake) to enforce quality and provide table-like features.

Schema Approach: Hybrid: allows Schema-on-Read for raw ingestion while providing schema enforcement and ACID transactions for curated tables.

Examples: Databricks, Apache Iceberg.
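
As one hedged sketch of the transaction layer, the open-source deltalake Python package writes a Delta table as ordinary files plus a transaction log; the path and columns below are hypothetical:

```python
# pip install deltalake pandas
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Writes land in ordinary files, but the Delta transaction log adds
# ACID commits and schema enforcement on top of them.
df = pd.DataFrame({"user": ["u1", "u2"], "amount": [19.99, 5.00]})
write_deltalake("/tmp/sales_delta", df, mode="append")

# Readers see a consistent table snapshot, not a pile of raw files.
print(DeltaTable("/tmp/sales_delta").to_pandas())
```

The same directory of files can still be read as a plain data lake; the log is what upgrades it to lakehouse semantics.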

5. NoSQL Database


Primary Use Case: High-volume, dynamic, operational use cases where schemas change frequently and extreme horizontal scaling is needed (e.g., user profiles, content management).

Data Structure: Varies (Document, Key-Value, Graph, Wide-Column). Data is often stored as flexible records or objects without strict relationships.

Schema Approach: Schema-less or Dynamic Schema: Structure can evolve on a per-document basis without downtime.

Examples: MongoDB (Document), Redis (Key-Value/Cache), Apache Cassandra (Wide-Column), Neo4j (Graph).
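
A minimal sketch of the document model using the pymongo client, assuming a MongoDB server on localhost; the collection and document shapes are invented:

```python
# pip install pymongo -- assumes a MongoDB server on localhost:27017
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
profiles = client["appdb"]["profiles"]

# No upfront schema: documents in the same collection can differ in shape.
profiles.insert_one({"user": "u1", "name": "Ada"})
profiles.insert_one({"user": "u2", "name": "Lin", "preferences": {"theme": "dark"}})

print(profiles.find_one({"user": "u2"})["preferences"]["theme"])  # dark
```

The second document added a nested field without any migration, which is exactly the per-document schema evolution described above.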

6. Vector Database


Given the rise of LLMs and Generative AI, one more specialized option has become critical for working with unstructured data.

A vector database is designed to store and index vector embeddings: numerical representations of unstructured data (text, images, audio) created by AI models. It allows similarity search (finding "like" data) rather than exact keyword matches.

Primary Use Case: Retrieval-Augmented Generation (RAG), semantic search, recommendation engines, and high-dimensional ML applications.


Examples: Pinecone (Commercial), Weaviate (Open Source/Commercial), Qdrant.
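
The core operation a vector database indexes, nearest-neighbor search over embeddings, can be sketched in a few lines of NumPy; the vectors below are toy stand-ins for real model embeddings:

```python
import numpy as np

# Toy 4-dimensional "embeddings"; real ones come from a model and have
# hundreds or thousands of dimensions.
docs = {
    "refund policy":  np.array([0.9, 0.1, 0.0, 0.0]),
    "shipping times": np.array([0.1, 0.9, 0.1, 0.0]),
    "return an item": np.array([0.8, 0.2, 0.1, 0.0]),
}
query = np.array([0.85, 0.15, 0.05, 0.0])  # embedding of "how do I get my money back?"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank by cosine similarity, not by keyword match.
for name, vec in sorted(docs.items(), key=lambda kv: -cosine(query, kv[1])):
    print(f"{cosine(query, vec):.3f}  {name}")
```

Note that "refund policy" and "return an item" rank highest even though the query shares no keywords with them; a production vector database does the same ranking over millions of vectors using approximate nearest-neighbor indexes.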

Summary


All of these options, from the structured RDBMS to the fluid Vector DB, combine to form a modern enterprise data architecture.

In essence, the modern enterprise no longer relies on a single data storage solution. The journey usually starts with the RDBMS for transactional integrity, moves to the Data Warehouse for structured BI, and expands into the Data Lake to capture all the raw, unstructured data necessary for Machine Learning and discovery.

The Data Lakehouse is the cutting-edge step, unifying these functions by bringing governance and performance directly to the lake. Vector Databases bridge the gap between unstructured data and the world of Generative AI.

Understanding the specialized role of each platform is the first and most critical step in designing a future-proof data strategy that extracts maximum value from every piece of information your business creates.

Note that there is some overlap between the categories. For example, PostgreSQL supports JSONB documents and vector storage (via the pgvector extension), making it useful for some NoSQL and AI use cases, and some products that started out as data lakes have added features to become lakehouses.