Table Configurations

Multi-level configuration management

Amoro configurations can be set at the Catalog, Table, and Engine levels. When the same key is set at more than one level, the Engine value takes priority, followed by the Table value, and finally the Catalog value.

Catalog: Set default values for tables through the Catalog properties, for example Self-optimizing related configurations.
Table: Specify customized configurations when creating a table with Create Table; they can be modified later through Alter Table operations.
Engine: If tuning is required in an engine, configure it at the engine level; refer to Spark and Flink.
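
The priority order described above can be sketched as a lookup that checks the Engine level first, then the Table level, then the Catalog level. This is a minimal illustration rather than Amoro's internal implementation: the function `resolve_property` and the three dictionaries are hypothetical names introduced only for this example.

```python
from typing import Mapping, Optional


def resolve_property(
    key: str,
    catalog_props: Mapping[str, str],
    table_props: Mapping[str, str],
    engine_props: Mapping[str, str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the effective value of `key` (hypothetical helper).

    Levels are checked from highest to lowest priority:
    Engine, then Table, then Catalog.
    """
    for scope in (engine_props, table_props, catalog_props):
        if key in scope:
            return scope[key]
    return default


# Illustrative values only.
catalog = {"self-optimizing.group": "default", "self-optimizing.quota": "0.1"}
table = {"self-optimizing.quota": "0.5"}
engine = {}

effective_quota = resolve_property("self-optimizing.quota", catalog, table, engine)
effective_group = resolve_property("self-optimizing.group", catalog, table, engine)
```

Here the Table value overrides the Catalog default for `self-optimizing.quota`, while `self-optimizing.group` falls back to the Catalog value because no higher level sets it.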

Self-optimizing configurations

Self-optimizing configurations are applicable to both Iceberg Format and Mixed streaming Format.

Key	Default	Description
self-optimizing.enabled	true	Enables Self-optimizing
self-optimizing.group	default	Optimizer group for Self-optimizing
self-optimizing.quota	0.1	Quota for Self-optimizing, indicating the CPU resource the table can take up
self-optimizing.execute.num-retries	5	Number of retries after failure of Self-optimizing
self-optimizing.target-size	134217728(128MB)	Target size for Self-optimizing
self-optimizing.max-file-count	10000	Maximum number of files processed by a Self-optimizing process
self-optimizing.max-task-size-bytes	134217728(128MB)	Maximum file size bytes in a single task for splitting tasks
self-optimizing.fragment-ratio	8	The fragment file size threshold, expressed as a ratio. The actual fragment file size is self-optimizing.target-size divided by this ratio
self-optimizing.minor.trigger.file-count	12	The minimum number of fragment files to trigger minor optimizing
self-optimizing.minor.trigger.interval	3600000(1 hour)	The time interval in milliseconds to trigger minor optimizing
self-optimizing.major.trigger.duplicate-ratio	0.1	The ratio of duplicate data of segment files to trigger major optimizing
self-optimizing.full.trigger.interval	-1(closed)	The time interval in milliseconds to trigger full optimizing
self-optimizing.full.rewrite-all-files	true	Whether full optimizing rewrites all files or skips files that do not need to be optimized
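
As a worked example of how self-optimizing.target-size and self-optimizing.fragment-ratio combine, the sketch below computes the fragment file size from the defaults in the table above. The function name `fragment_size_bytes` is hypothetical, and integer division is an assumption made for this illustration.

```python
def fragment_size_bytes(target_size_bytes: int = 134217728, fragment_ratio: int = 8) -> int:
    """Fragment file size = target size / fragment ratio (hypothetical helper).

    Defaults mirror self-optimizing.target-size (128MB) and
    self-optimizing.fragment-ratio (8). Integer division is an assumption.
    """
    if fragment_ratio <= 0:
        raise ValueError("fragment_ratio must be positive")
    return target_size_bytes // fragment_ratio


default_fragment_size = fragment_size_bytes()  # 16777216 bytes, i.e. 16MB
```

With the defaults, the fragment file size works out to 128MB / 8 = 16MB.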

Data-cleaning configurations

Data-cleaning configurations are applicable to both Iceberg Format and Mixed streaming Format.

Key	Default	Description
table-expire.enabled	true	Enables periodic table expiration
change.data.ttl.minutes	10080(7 days)	Time to live in minutes for data of ChangeStore
snapshot.base.keep.minutes	720(12 hours)	Table-Expiration keeps the latest snapshots of BaseStore within a specified time in minutes
clean-orphan-file.enabled	false	Enables periodic cleaning of orphan files
clean-orphan-file.min-existing-time-minutes	2880(2 days)	Cleaning orphan files keeps the files modified within a specified time in minutes
clean-dangling-delete-files.enabled	true	Whether to enable cleaning of dangling delete files
data-expire.enabled	false	Whether to enable data expiration
data-expire.level	partition	Level of data expiration. Supported values are partition and file
data-expire.field	NULL	Field used to determine data expiration. Supported field types are timestamp, timestamptz, long, and string in date format
data-expire.datetime-string-pattern	yyyy-MM-dd	Pattern used for matching string datetime
data-expire.datetime-number-format	TIMESTAMP_MS	Timestamp unit for a long field. Supported values are TIMESTAMP_MS and TIMESTAMP_S
data-expire.retention-time	NULL	Retention period for data expiration. For example, 1d means retaining data for 1 day. Other supported units include h (hour), min (minute), s (second), ms (millisecond), etc.
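
To make the data-expire.retention-time format concrete, the sketch below parses values such as `1d` into a duration. It is only an illustration of the units listed above (d, h, min, s, ms), not Amoro's actual parser; `parse_retention` is a hypothetical name, and any further units hinted at by "etc." are deliberately not guessed.

```python
import re
from datetime import timedelta

# Units named in the table above; other units are intentionally not covered.
_UNITS = {
    "d": timedelta(days=1),
    "h": timedelta(hours=1),
    "min": timedelta(minutes=1),
    "s": timedelta(seconds=1),
    "ms": timedelta(milliseconds=1),
}

_PATTERN = re.compile(r"\s*(\d+)\s*(ms|min|d|h|s)\s*")


def parse_retention(value: str) -> timedelta:
    """Parse a retention string such as '1d' or '90min' (hypothetical helper)."""
    match = _PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported retention value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNITS[unit]


one_day = parse_retention("1d")
```

Data older than the current time minus this duration would then be a candidate for expiration, judged by the field named in data-expire.field.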

Mixed Format configurations

If you are using Iceberg Format, please refer to Iceberg configurations. The following configurations are only applicable to Mixed Format.

Reading configurations

Key	Default	Description
read.split.open-file-cost	4194304(4MB)	The estimated cost to open a file
read.split.planning-lookback	10	Number of bins to consider when combining input splits
read.split.target-size	134217728(128MB)	Target size when combining data input splits
read.split.delete-ratio	0.05	When the ratio of delete files is below this threshold, the read task will be split into more tasks to improve query speed

Writing configurations

Key	Default	Description
base.write.format	parquet	File format for the table for BaseStore, applicable to KeyedTable
change.write.format	parquet	File format for the table for ChangeStore, applicable to KeyedTable
write.format.default	parquet	Default file format for the table, applicable to UnkeyedTable
base.file-index.hash-bucket	4	Initial number of buckets for BaseStore auto-bucket
change.file-index.hash-bucket	4	Initial number of buckets for ChangeStore auto-bucket
write.target-file-size-bytes	134217728(128MB)	Target size when writing
write.upsert.enabled	false	Enables upsert mode. When enabled, multiple inserted rows with the same primary key are merged
write.distribution-mode	hash	Shuffle rules for writing. UnkeyedTable can choose between none and hash, while KeyedTable can only choose hash
write.distribution.hash-mode	auto	Auto-bucket mode, which supports primary-key, partition-key, primary-partition-key, and auto

LogStore configurations

Key	Default	Description
log-store.enabled	false	Enables LogStore
log-store.type	kafka	Type of LogStore, which supports `kafka` and `pulsar`
log-store.address	NULL	Address of LogStore, required when LogStore is enabled. For Kafka, this is the Kafka bootstrap servers. For Pulsar, this is the Pulsar Service URL, such as `pulsar://localhost:6650`
log-store.topic	NULL	Topic of LogStore, required when LogStore enabled
properties.pulsar.admin.adminUrl	NULL	HTTP URL of Pulsar admin, such as `http://my-broker.example.com:8080`. Only required when log-store.type=pulsar
properties.XXX	NULL	Other configurations of LogStore. For Kafka, all the configurations supported by Kafka Consumer/Producer can be set by prefixing them with `properties.`, such as `'properties.batch.size'='16384'`; refer to Kafka Consumer Configurations and Kafka Producer Configurations for more details. For Pulsar, all the configurations supported by Pulsar can be set by prefixing them with `properties.`, such as `'properties.pulsar.client.requestTimeoutMs'='60000'`; refer to Flink-Pulsar-Connector for more details
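
The `properties.` prefix convention can be illustrated with a small sketch that collects every prefixed table property and strips the prefix, yielding the keys that would be handed to the LogStore client. This is an assumption about how the convention reads, not a description of Amoro's code; `logstore_client_properties` is a hypothetical name.

```python
from typing import Dict, Mapping

PREFIX = "properties."


def logstore_client_properties(table_props: Mapping[str, str]) -> Dict[str, str]:
    """Collect `properties.`-prefixed entries with the prefix removed (hypothetical helper)."""
    return {
        key[len(PREFIX):]: value
        for key, value in table_props.items()
        if key.startswith(PREFIX)
    }


# Illustrative table properties; the address and topic values are made up.
table_props = {
    "log-store.enabled": "true",
    "log-store.type": "kafka",
    "log-store.address": "localhost:9092",
    "log-store.topic": "example_topic",
    "properties.batch.size": "16384",
}

client_props = logstore_client_properties(table_props)
```

In this example only `batch.size` ends up in `client_props`; the `log-store.*` keys are left out because they do not carry the prefix.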

Watermark configurations

Key	Default	Description
table.event-time-field	_ingest_time	The event time field for calculating the watermark. The default `_ingest_time` indicates calculating with the time when the data was written
table.watermark-allowed-lateness-second	0	The allowed lateness time in seconds when calculating watermark
table.event-time-field.datetime-string-format	`yyyy-MM-dd HH:mm:ss`	The format of event time when it is in string format
table.event-time-field.datetime-number-format	TIMESTAMP_MS	The format of event time when it is in numeric format, which supports TIMESTAMP_MS (timestamp in milliseconds) and TIMESTAMP_S (timestamp in seconds)
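
The sketch below shows one plausible way the watermark settings above fit together: event times are normalized to epoch milliseconds according to their format, and the allowed lateness is subtracted from the largest event time seen. This is a common watermarking scheme assumed for illustration, not a confirmed description of Amoro's calculation. The function names are hypothetical, string values are assumed to be in UTC, and the Python format `%Y-%m-%d %H:%M:%S` stands in for the default `yyyy-MM-dd HH:mm:ss`.

```python
from datetime import datetime, timezone
from typing import Iterable, Union


def event_time_to_millis(
    value: Union[int, str],
    number_format: str = "TIMESTAMP_MS",
    string_format: str = "%Y-%m-%d %H:%M:%S",
) -> int:
    """Normalize an event time to epoch milliseconds (hypothetical helper)."""
    if isinstance(value, str):
        # Assumption: string event times are interpreted as UTC.
        parsed = datetime.strptime(value, string_format).replace(tzinfo=timezone.utc)
        return int(parsed.timestamp() * 1000)
    if number_format == "TIMESTAMP_S":
        return int(value) * 1000
    if number_format == "TIMESTAMP_MS":
        return int(value)
    raise ValueError(f"unsupported number format: {number_format!r}")


def watermark_millis(event_times_ms: Iterable[int], allowed_lateness_seconds: int = 0) -> int:
    """Assumed scheme: largest event time minus the allowed lateness."""
    return max(event_times_ms) - allowed_lateness_seconds * 1000


watermark = watermark_millis([1000, 5000, 3000], allowed_lateness_seconds=2)
```

With table.watermark-allowed-lateness-second left at its default of 0, the watermark under this scheme is simply the largest event time observed.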

Mixed-Hive format configurations

Key	Default	Description
base.hive.auto-sync-schema-change	true	Whether to synchronize schema changes of the Hive Table from HMS
base.hive.auto-sync-data-write	false	Whether to synchronize data changes of the Hive Table from HMS. This should be true when writing to Hive
base.hive.consistent-write.enabled	true	To avoid writing dirty data, files are written to the Hive directory as hidden files and renamed to visible files upon commit