What is ClickHouse?
ClickHouse® is a high-performance, column-oriented SQL database management system (DBMS) for online analytical processing (OLAP). It is available as both an open-source software and a cloud offering.
What are analytics?
Analytics, also known as OLAP (Online Analytical Processing), refers to SQL queries with complex calculations (e.g., aggregations, string processing, arithmetics) over massive datasets.
Unlike transactional queries (or OLTP, Online Transaction Processing) that read and write just a few rows per query and, therefore, complete in milliseconds, analytics queries routinely process billions and trillions of rows.
In many use cases, analytics queries must be “real-time”, i.e. return a result in less than one second.
Row-oriented vs. column-oriented storage
Such a level of performance can only be achieved with the right data “orientation”.
Databases store data either row-oriented or column-oriented.
In a row-oriented database, consecutive table rows are sequentially stored one after the other. This layout allows to retrieve rows quickly as the column values of each row are stored together.
ClickHouse is a column-oriented databases. In such systems, tables are stored as a collection of columns, i.e. the values of each column are stored sequentially one after the other. This layout makes it harder to restore single rows (as there are now gaps between the row values) but column operations such as filters or aggregation becomes much faster than in a row-oriented database.
The difference is best explained with an example query which filters on three columns:
SELECT *
FROM table
WHERE time > ‘2024-09-01 13:14:15’
AND location = ‘Berlin’
AND `Mobile Phone` LIKE ‘%61010%`
Row-oriented DBMS
Column-oriented DBMS
Data Replication and Integrity
ClickHouse uses an asynchronous multi-master replication scheme to ensure that data is stored redundantly on multiple nodes. After being written to any available replica, all the remaining replicas retrieve their copy in the background. The system maintains identical data on different replicas. Recovery after most failures is performed automatically, or semi-automatically in complex cases.
Role-Based Access Control
ClickHouse implements user account management using SQL queries and allows for role-based access control configuration similar to what can be found in ANSI SQL standard and popular relational database management systems.
SQL Support
ClickHouse supports a declarative query language based on SQL that is identical to the ANSI SQL standard in many cases. Supported query clauses include GROUP BY, ORDER BY, subqueries in FROM, JOIN clause, IN operator, window functions and scalar subqueries.
Approximate calculation
ClickHouse provides ways to trade accuracy for performance. For example, some of its aggregate functions calculate the distinct value count, the median, and quantiles approximately. Also, queries can be run on a sample of the data to compute an approximate result quickly. Finally, aggregations can be run with a limited number of keys instead of for all keys. Depending on how skewed the distribution of the keys is, this can provide a reasonably accurate result that uses far fewer resources than an exact calculation.
Adaptive join algorithms
ClickHouse chooses the join algorithm adaptively, it starts with fast hash joins and falls back to merge joins if there’s more than one large table.
Superior query performance
ClickHouse is well known for having extremely fast query performance. To learn why ClickHouse is so fast, see the Why is ClickHouse fast? guide.