Table format (or simply "format") was first proposed by Iceberg and can be described as follows:
- It defines the relationship between tables and files, so that any engine can query and retrieve data files according to the table format.
- Newer formats such as Iceberg, Delta Lake, and Hudi further define the relationship between tables and snapshots, and between snapshots and files. Every write to the table generates a new snapshot, and every read is based on a snapshot. Snapshots bring MVCC, ACID, and transactional capabilities to data lakes.
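The snapshot model above can be sketched with a toy example. This is an illustrative simplification, not Iceberg's actual implementation (real formats persist this metadata as manifest and metadata files); the `Table` and `Snapshot` classes here are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: an id plus the set of data files."""
    snapshot_id: int
    data_files: tuple

class Table:
    def __init__(self):
        self._snapshots: List[Snapshot] = []
        self._next_id = 1

    def commit(self, added_files) -> int:
        # Every write produces a new immutable snapshot referencing the
        # previous snapshot's files plus the newly added ones.
        base = self._snapshots[-1].data_files if self._snapshots else ()
        snap = Snapshot(self._next_id, base + tuple(added_files))
        self._snapshots.append(snap)
        self._next_id += 1
        return snap.snapshot_id

    def scan(self, snapshot_id: Optional[int] = None):
        # Reads resolve against a snapshot, giving MVCC-style isolation:
        # a concurrent writer never changes the files a reader sees, and
        # passing an older snapshot_id yields a time-travel query.
        if snapshot_id is None:
            snap = self._snapshots[-1]
        else:
            snap = next(s for s in self._snapshots if s.snapshot_id == snapshot_id)
        return list(snap.data_files)

table = Table()
s1 = table.commit(["a.parquet"])
s2 = table.commit(["b.parquet"])
print(table.scan(s1))  # time travel: sees only the first file
print(table.scan())    # latest snapshot: sees both files
```

Because snapshots are immutable and writes only append new ones, readers and writers never block each other, which is the essence of the MVCC guarantee these formats provide.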
In addition, newer table formats such as Iceberg provide many advanced features, including schema evolution, hidden partitioning, and data skipping. Hudi and Delta Lake differ in some specifics, but the functional convergence of these three open-source projects over the past two years shows that a standard for table formats is gradually taking shape.
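Hidden partitioning is worth a brief sketch: the format records, per data file, a partition value derived from a column transform, so queries filter on the raw column and the planner skips files that cannot match, without the user ever referencing a partition column. The following is a toy illustration under assumed names (`day_transform`, `plan_scan`), not Iceberg's real partition-transform API:

```python
from datetime import datetime

def day_transform(ts: datetime) -> str:
    # A partition transform: derive a day bucket from a timestamp column.
    return ts.strftime("%Y-%m-%d")

# File-level metadata the format would keep: (file path, derived partition value).
data_files = [
    ("f1.parquet", "2024-01-01"),
    ("f2.parquet", "2024-01-02"),
]

def plan_scan(files, ts_filter: datetime):
    # Data skipping at plan time: apply the same transform to the query
    # predicate and prune files whose recorded value cannot match.
    wanted = day_transform(ts_filter)
    return [path for path, part in files if part == wanted]

print(plan_scan(data_files, datetime(2024, 1, 2)))  # ['f2.parquet']
```

The user filters on the timestamp column itself; the partitioning scheme stays "hidden" in table metadata, which is also what lets it evolve without rewriting queries.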
For users, the design goal of Amoro is to provide an out-of-the-box data lake system. Internally, Amoro's design philosophy is to use different table formats as the storage engines of the data lake, a pattern familiar from open-source systems such as MySQL and ClickHouse, which support pluggable storage engines.
Currently, Amoro mainly provides the following three table formats:
- Iceberg format: users can entrust their existing Iceberg tables to Amoro for maintenance, retaining all the capabilities of Iceberg tables while gaining the performance and stability improvements brought by Amoro.
- Mixed-Iceberg format: on top of the Iceberg format, Amoro provides a format further optimized for streaming update scenarios. Users with high performance requirements for streaming updates, or who need CDC incremental data reading, can choose the Mixed-Iceberg format.
- Mixed-Hive format: many users do not want data lake adoption to disrupt businesses already built on Hive. Amoro therefore provides the Mixed-Hive format, which upgrades a Hive table to a Mixed-Hive table through a metadata-only migration; the original Hive table remains fully usable. This preserves business stability while bringing the benefits of data lake computing.