Benchmark Report

Test purpose

This test aims at comparing the OLAP benchmark performance of various data lake formats in the scenario of continuous streaming ingestion in the CDC database.

Meanwhile, particular attention was paid during the testing process to the impact of enabling self-optimizing on the analytical performance of the table.

Test envrionment

Hardware configuration

Number OS Cpu core Memory Disk type Deployed components
1 CentOS 7 40 256 SAS HDFS、Hive、Yarn
2 CentOS 7 40 256 SAS HDFS、Hive、Yarn
3 CentOS 7 40 256 SAS HDFS、Hive、Yarn
4 CentOS 7 40 256 SAS Trino、Presto
5 CentOS 7 40 256 SAS Trino、Presto
6 CentOS 7 40 256 SAS Trino、Presto

Software version

Software Version
Trino 380
Presto 274
Iceberg 0.13
Amoro 0.4
Hudi 0.11.1

Test plan

Overview

This test is based on CHbenchmark, which is a hybrid testing standard that integrates TPCC and TPCH. The overall testing workload can be divided into two categories:

  • 5 OLTP workloads based on TPC-C: NewOrder, Payment, OrderStatus, Delivery, and StockLevel.

  • 22 OLAP workloads based on TCP-H, where Q15 was abandoned in this test due to its association with views and the lack of view rewriting.

OLTP AND OLAP

Prepare test data

Based on the TPC-C, the raw data was constructed in MySQL for this test. The dataset includes a total of 12 tables, with the relationship between TPC-C and TPC-H tables shown in the following diagram:

TPCC AND TPCH

In addition, the relationship between the data sizes of each table is shown in the following table, where w represents the number of warehouses. It can be observed that the data sizes of intermediate tables such as new_order and stock are affected by the number of warehouses. Therefore, the data set size can be adjusted by controlling the number of warehouses during testing.

chbenchmark

In this test, the number of warehouses was set to 100, and the initial data set size in the MySQL database was approximately 10GB. The following table shows the number of data records for each table in the initial data set and the changes in the number of data records for each table after running the one-hour TPC-C test.

Table name The number of records (rows) in beginning The number of records (rows) after running a one-hour TPC-C test
warehouse 100 100
item 100000 100000
stock 10000000 10000000
district 1000 1000
customer 3000000 3000000
history 3000000 3119285(+119285)
oorder 3000000 3124142(+124142)
new_order 893709 907373(+13664)
order_line 29996774 31252799(+1256025)
region 5 5
nation 62 62
supplier 1000 1000

Perform the test

Before starting TPC-H testing, start a Flink job to synchronize both the history data and real-time data from MySQL into the data lake. Afterwards, execute TPC-H testing through query engines such as Trino or Presto.

TPC-H contains 22 query statements, only three of which are listed here due to space limitation:

-- query1 
SELECT ol_number,
    sum(ol_quantity) AS sum_qty,
    sum(ol_amount) AS sum_amount,
    avg(ol_quantity) AS avg_qty,
    avg(ol_amount) AS avg_amount,
    count(*) AS count_order
FROM order_line
WHERE ol_delivery_d > '2007-01-02 00:00:00.000000'
GROUP BY ol_number
ORDER BY ol_number;
        
 -- query2 
SELECT su_suppkey,
    su_name,
    n_name,
    i_id,
    i_name,
    su_address,
    su_phone,
    su_comment
FROM item, supplier, stock, nation, region, 
    (SELECT s_i_id AS m_i_id,
        MIN(s_quantity) AS m_s_quantity
    FROM stock, supplier, nation, region
    WHERE MOD((s_w_id*s_i_id), 10000) = su_suppkey
        AND su_nationkey = n_nationkey
        AND n_regionkey = r_regionkey
        AND r_name LIKE 'Europ%'
    GROUP BY s_i_id) m
WHERE i_id = s_i_id
    AND MOD((s_w_id * s_i_id), 10000) = su_suppkey
    AND su_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
    AND i_data LIKE '%b'
    AND r_name LIKE 'Europ%'
    AND i_id=m_i_id
    AND s_quantity = m_s_quantity
ORDER BY n_name, su_name, i_id;
        
 -- query3 
SELECT ol_o_id,
    ol_w_id,
    ol_d_id,
    sum(ol_amount) AS revenue,
    o_entry_d
FROM customer, new_order, oorder, order_line
WHERE c_state LIKE 'A%'
    AND c_id = o_c_id
    AND c_w_id = o_w_id
    AND c_d_id = o_d_id
    AND no_w_id = o_w_id
    AND no_d_id = o_d_id
    AND no_o_id = o_id
    AND ol_w_id = o_w_id
    AND ol_d_id = o_d_id
    AND ol_o_id = o_id
    AND o_entry_d > '2007-01-02 00:00:00.000000'
GROUP BY ol_o_id, ol_w_id, ol_d_id, o_entry_d
ORDER BY revenue DESC, o_entry_d;

Test results

Static result

arctic iceberg static performence

The figure above shows a performance comparison of Iceberg and Mixed-Iceberg table formats for querying static data. It can be seen from the figure that the query performance of the two table formats is similar.

Dynamic result

Amoro Iceberg Hudi 100 warehouse performence with TPCC time

The figure above shows a performance comparison of Iceberg 、Mixed-Iceberg and Hudi table formats for querying dynamic data. The test recorded the results of running TPC-C for different time periods.

The following are the specific results of each test group:

Amoro Iceberg Hudi 100 warehouse performence on TPCC 0-30 minutes

Amoro Iceberg Hudi 100 warehouse performence on TPCC 30-60 minutes

Amoro Iceberg Hudi 100 warehouse performence on TPCC 60-90 minutes

Amoro Iceberg Hudi 100 warehouse performence on TPCC 90-120 minutes

Conclusion

  • In the case of static data, the query performance of Iceberg and Mixed-Iceberg tables is similar.
  • Without enabling self-optimizing, the query performance of all table formats will continue to decline as dynamic data is continuously written.
  • After enabling self-optimizing, the query performance of all table formats remains stable as dynamic data is continuously written.