Still Using COUNT(*) to Count Rows? Explore Faster Alternatives!

When working with databases, one of the most common questions you may find yourself asking is, “How many rows are in this table?”

In PostgreSQL, the answer might not be as straightforward as it seems at first glance. Although the SQL command to count rows appears simple, the underlying mechanics and performance implications can be more complex.

In this article, we’ll explore the different methods to count rows in PostgreSQL, their performance characteristics, and the factors that influence the results.

Table of Contents

The Basic `COUNT(*)` Statement

The most direct way to count the rows in a table is by using the SQL COUNT(*) function. The basic syntax is:

Get Your Linux Course!

Join our Linux Course and discover the power of open-source technology. Enhance your skills and boost your career! Learn Linux today!

SELECT COUNT(*) FROM table_name;

This command will return a single integer representing the total number of rows in the specified table.

For example, if you run SELECT COUNT(*) FROM orders;, the result might be something like 345, indicating that there are 345 rows in the “orders” table.

However, while this might seem simple, there are important underlying mechanisms that determine how this count is calculated.

PostgreSQL’s Approach to `COUNT(*)`

PostgreSQL employs two primary techniques to compute the COUNT(*) function efficiently:

Sequential Scan: In a sequential scan, PostgreSQL reads every data block in the table, one after another, counting the rows as it goes. This method is effective when the table is not indexed or if the entire table needs to be scanned for some other purpose. This approach is often beneficial for disk access patterns as it ensures that PostgreSQL reads data in the most efficient way possible.
Index-Only Scan: An index-only scan can be faster than a sequential scan when an appropriate index exists for the table. In this case, PostgreSQL can leverage the index to count the rows directly, without needing to access the actual table data. The index-only scan can skip the visibility checks on table blocks that are already known to be visible to all sessions, improving performance significantly.

PostgreSQL automatically selects the most efficient method for counting rows based on factors like the table size, available indexes, and other optimization strategies.

Multi-Version Concurrency Control (MVCC) and Row Counting

One of the reasons row counting isn’t always a simple task in PostgreSQL is because of the Multi-Version Concurrency Control (MVCC) model. MVCC allows multiple transactions to run concurrently without interfering with each other.

This mechanism ensures that COUNT(*) queries do not conflict with other operations like inserts, updates, or deletes happening on the same table at the same time.

In the context of counting rows, MVCC means that PostgreSQL must perform visibility checks on each row during a COUNT(*) query.

Rows that have been updated or deleted after the query started must be excluded from the count to ensure that the result reflects a consistent snapshot of the table at the time the query was executed.

As a result, the outcome of a COUNT(*) query is valid only for a specific point in time, and the row count could change as new data is inserted or modified.

Estimating Row Counts

In some cases, a full row count might not be necessary, especially for large tables where performing a full count could be time-consuming. Instead, PostgreSQL provides ways to estimate row counts, which can be done much faster than executing a full COUNT(*) query.

Here are some ways you can obtain an estimated row count:

Using EXPLAIN: The EXPLAIN command provides a query planner’s estimate of the number of rows that will be processed in a query. For example:
```
EXPLAIN SELECT * FROM mytable;
```
The output might look something like this:
```
(cost=0.00..2640.00 rows=100000 width=97)
```
This indicates that the query planner estimates that the table contains 100,000 rows. While this is an estimate, it can give you a general sense of the table’s size.
Using pg_class: PostgreSQL maintains statistics about tables in the system catalog, which can be used to estimate row counts. The pg_class catalog table contains data about the number of rows and data blocks in a table, which is updated by the VACUUM process. To retrieve an estimate from pg_class, you can run the following query:
```
SELECT 
    (CASE WHEN reltuples > 0 THEN pg_relation_size(oid)*reltuples/(8192*relpages)
    ELSE 0
    END)::bigint AS estimated_row_count
FROM pg_class
WHERE oid = 'mytable'::regclass;
```
This query calculates an estimated row count using the number of data blocks and the number of rows stored in the table’s catalog data. Keep in mind that this estimate is based on statistics and can sometimes be inaccurate.

Creating a Function: If you’d like to create a reusable way to estimate row counts, you can define a custom SQL function that encapsulates the logic from the previous query. Here’s an example of how you can do that:

CREATE OR REPLACE FUNCTION estimated_row_count(text)
RETURNS bigint
LANGUAGE sql
AS $$
SELECT 
    (CASE WHEN reltuples > 0 THEN 
    pg_relation_size($1)*reltuples/(8192*relpages)
    ELSE 0
    END)::bigint
FROM pg_class
WHERE oid = $1::regclass;
$$;

Once created, you can use the function like this:

SELECT estimated_row_count('myschema.mytable');

Key Considerations

Performance: While COUNT(*) is straightforward, it can be slow for large tables, especially if the table lacks indexes or the database must perform a full sequential scan. Using an index-only scan, if possible, can improve the performance of row counting.
Estimates: Estimates of row counts can be generated quickly but may not always be accurate. They are based on statistical data that is periodically updated during the VACUUM process. Therefore, the accuracy of estimates can degrade over time if statistics are not kept up to date.
MVCC: The need to perform visibility checks due to PostgreSQL’s MVCC system adds complexity to the process of counting rows, especially in highly concurrent environments.
Locking: Certain methods, such as using pg_relation_size(), might require acquiring a lock on the table. If another transaction has an exclusive lock on the table, the estimation may need to wait for that lock to be released.

Conclusion

Counting rows in PostgreSQL can be as simple as running a SELECT COUNT(*) query, but the underlying mechanisms, such as MVCC and the choice of scan method, add complexity to the process.

By understanding the different approaches to row counting, including full counts and estimates, and being aware of factors such as performance, concurrency, and locking, you can make more informed decisions about how to manage and query your data efficiently.

Whether you need an exact count or a quick estimate, PostgreSQL provides powerful tools to help you get the information you need with optimal performance.

The Basic COUNT(*) Statement

PostgreSQL’s Approach to COUNT(*)