This is probably the most common example of doing statistics on data. All modern databases compute (or can compute, depending on settings) basic or more advanced statistics on their data. The main reason is query optimization: relational databases depend heavily on statistics because their query planners use them to determine the most efficient strategy for processing a query. Incorrectly calculated or out-of-date statistics are therefore a real disaster for performance.
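To illustrate why planners need these numbers, here is a minimal sketch of a cost-based plan choice. The cost constants and the function name are hypothetical, not taken from any real database; the point is only that the decision hinges on the estimated number of matching rows, which comes from statistics.

```python
# Hypothetical sketch: how a query planner might use row-count statistics
# to choose between a sequential scan and an index scan.
def choose_plan(total_rows, estimated_matching_rows,
                seq_cost_per_row=1.0, index_cost_per_row=4.0):
    """Pick the cheaper access path based on estimated selectivity."""
    seq_scan_cost = total_rows * seq_cost_per_row
    index_scan_cost = estimated_matching_rows * index_cost_per_row
    return "index scan" if index_scan_cost < seq_scan_cost else "sequential scan"

# With accurate statistics the planner picks a cheap, selective index scan...
print(choose_plan(1_000_000, 100))       # index scan
# ...but stale statistics that overestimate matching rows push it to
# a full sequential scan instead.
print(choose_plan(1_000_000, 900_000))   # sequential scan
```

If the statistics claim 900,000 matching rows when the true number is 100, the planner falls back to scanning the whole table, which is exactly the "real disaster for performance" mentioned above.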

Statistics can cover information like:

  • table statistics:
    • number of rows
    • number of data blocks
    • average row length
    • number of read data blocks
    • number of written data blocks
    • number of read rows
    • number of written rows
    • number of deleted rows
    • number of updated rows
  • index statistics:
    • number of read data blocks
    • number of written data blocks
    • (B-tree indexes) – number of leaves, number of levels
  • table column statistics:
    • number of distinct values
    • number of nulls
    • histogram of data
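The column statistics in the list above are simple enough to sketch in a few lines. This is an illustrative toy, not any database's actual implementation; the function name and the equi-width bucketing scheme are assumptions made for the example.

```python
from collections import Counter

def column_statistics(values, histogram_buckets=4):
    """Compute toy column statistics: distinct count, null count,
    and an equi-width histogram over the non-null values."""
    non_null = [v for v in values if v is not None]
    n_distinct = len(set(non_null))
    n_nulls = len(values) - len(non_null)
    lo, hi = min(non_null), max(non_null)
    width = (hi - lo) / histogram_buckets or 1  # avoid zero width
    bucket_counts = Counter(
        min(int((v - lo) / width), histogram_buckets - 1) for v in non_null
    )
    return {
        "distinct": n_distinct,
        "nulls": n_nulls,
        "histogram": [bucket_counts.get(b, 0) for b in range(histogram_buckets)],
    }

stats = column_statistics([1, 2, 2, None, 5, 8, 8, 8], histogram_buckets=4)
print(stats)  # {'distinct': 4, 'nulls': 1, 'histogram': [3, 0, 1, 3]}
```

A planner can read an estimate straight off such a histogram: a predicate that falls entirely in an empty bucket is predicted to match almost no rows, so an index scan looks attractive.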

Not all databases implement all of these. Some of the statistics may also be optional, because they make sense only for specific usage patterns.

Statistics are usually calculated from a sample of rows in the table or index; computing them over every row of a really big table would be disastrous for performance.
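The sampling idea can be sketched for a statistic that scales well from a sample, such as the fraction of NULLs in a column. The function name, sample size, and data are assumptions for the example; real databases use their own sampling schemes.

```python
import random

def estimate_null_fraction(column, sample_size=10_000, seed=1):
    """Estimate the fraction of NULLs from a random sample of rows
    instead of scanning the entire column."""
    rng = random.Random(seed)
    sample = rng.sample(column, min(sample_size, len(column)))
    return sum(v is None for v in sample) / len(sample)

# One million rows, exactly 25% of them NULL: the estimate computed from
# a 10,000-row sample lands close to 0.25 while touching only 1% of the data.
column = [None if i % 4 == 0 else i for i in range(1_000_000)]
print(estimate_null_fraction(column))
```

Note that not every statistic samples this gracefully: the number of distinct values, for instance, is notoriously hard to extrapolate from a sample, which is one reason databases let you tune how statistics are gathered.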