“Data Lake” Simulation of Data Processing and Analysis with Schema on Read Approach
Authors
Muhammad Irfan Zuhri
Institut Teknologi Sepuluh Nopember
Ade Irma Rosida Wijaya
Dwi Oktavianto Wahyu Nugroho
Abstract
Advances in digital technology have increased the volume, velocity, and variety of data generated by modern systems. Organizations must now manage structured, semi-structured, and unstructured data from diverse sources, which creates the need for a flexible and scalable storage architecture. A data lake is a storage architecture designed to hold data of various types in its original form, without structural constraints, so that structured, semi-structured, and unstructured data can be integrated on a single platform. This study implements a simulation of data processing and analysis using a schema-on-read approach, in which the schema is applied only when the data is accessed or analyzed. The simulation illustrates the flow of ingestion, storage, metadata management, and analytical processes that take advantage of the flexibility of schema-on-read. It provides an overview of how a data lake can support adaptive big data analysis without requiring complex initial transformations.
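As an illustration of the schema-on-read idea described in the abstract, the following is a minimal sketch in Python, not the implementation used in this study. The names `RAW_ZONE`, `SCHEMA`, and `read_with_schema`, as well as the sample records, are hypothetical and chosen only for illustration. Records are kept as raw text exactly as they arrived, and types and defaults are imposed only when the data is read for analysis.

```python
import json

# Hypothetical raw zone: records are stored as they arrived, with no
# validation or type conversion at write time.
RAW_ZONE = [
    '{"id": "1", "temp": "21.5", "site": "A"}',
    '{"id": "2", "temp": "19.0"}',
    '{"id": "3", "temp": "n/a", "site": "B", "extra": "ignored"}',
]

# Hypothetical read-time schema: field name -> (converter, default value).
SCHEMA = {
    "id": (int, None),
    "temp": (float, None),
    "site": (str, "unknown"),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and apply the schema only at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            try:
                row[field] = convert(record[field])
            except (KeyError, ValueError, TypeError):
                # Missing or malformed values fall back to the default
                # instead of being rejected at ingestion.
                row[field] = default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_ZONE, SCHEMA)

# A simple analysis step over the typed view of the raw data.
temps = [r["temp"] for r in rows if r["temp"] is not None]
mean_temp = sum(temps) / len(temps)
```

Because the raw records are never rewritten, a different analysis could read the same `RAW_ZONE` with another schema (for example, one that also extracts the `extra` field) without re-ingesting the data.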