%load_ext sql
from sqlalchemy import create_engine
%sql postgresql://postgres:****@localhost/dvdrental
engine = create_engine('postgresql://postgres:****@localhost/dvdrental')
An SQL query is written in a declarative format, but when it comes time to run a query, the order of execution is important to understand. Some queries are more efficient than others, even when the end result is the same. Efficiency becomes very important when working with large datasets.
SQL is executed in a systematic manner. The EXPLAIN
statement can be used in PostgreSQL to output how the query will be executed. Instead of returning the rows of data, a Query Plan is returned. Contained within the parentheses are some statistics related to the Query Plan.
Below is an example of the EXPLAIN
statement. The first part cost=0.00..253.96
, is stating that the computational effort started at 0 and ended at 253.96 computations. This is an internal arbitrary number used by PostgreSQL, and does not refer to real time.
The second part rows=14596 width=26
refers to the estimated amount of rows that will return, and the width, which is an expression of the estimated average width of rows that will output in bytes of data. It is important to note that these are estimations, the actual execution statistics may differ slightly.
By using the EXPLAIN
and ANALYSE
statements, the output states the time it took to build the execution plan (Planning Time), as well as the actual time it took to execute that plan (Execution Time), 1.777
and 1.311
milliseconds, respectively.
These statistics can be very useful when optimising the execution time when querying very large tables.
%%sql
EXPLAIN ANALYSE
SELECT *
FROM payment;
* postgresql://postgres:***@localhost/dvdrental 3 rows affected.
QUERY PLAN |
---|
Seq Scan on payment (cost=0.00..253.96 rows=14596 width=26) (actual time=0.013..0.917 rows=14596 loops=1) |
Planning Time: 1.777 ms |
Execution Time: 1.311 ms |
The next examples compare the execution time of the WHERE
statement versus not using it.
In this example using the WHERE
statement, the planning time and execution time actually increases:
WHERE
statement, the planning time is 0.126
milliseconds, versus 0.056
milliseconds when not using the WHERE
statementWHERE
statement, the excution time is 2.094
milliseconds, versus 1.553
milliseconds when not using the WHERE
statementThis means that the total estimated time can actually be less when returning more data if there are fewer steps in the execution plan.
%%sql
EXPLAIN ANALYSE
SELECT *
FROM payment
WHERE payment.amount > 10;
* postgresql://postgres:***@localhost/dvdrental 5 rows affected.
QUERY PLAN |
---|
Seq Scan on payment (cost=0.00..290.45 rows=107 width=26) (actual time=0.091..2.083 rows=107 loops=1) |
Filter: (amount > '10'::numeric) |
Rows Removed by Filter: 14489 |
Planning Time: 0.126 ms |
Execution Time: 2.094 ms |
%%sql
EXPLAIN ANALYSE
SELECT *
FROM payment;
* postgresql://postgres:***@localhost/dvdrental 3 rows affected.
QUERY PLAN |
---|
Seq Scan on payment (cost=0.00..253.96 rows=14596 width=26) (actual time=0.008..1.029 rows=14596 loops=1) |
Planning Time: 0.056 ms |
Execution Time: 1.553 ms |
Scanning is a linear operation, moving from one row to the next row, while applying an operation. This operation could be a filter, for example the WHERE
statement demonstrated above.
Scanning small tables with few rows can be efficient, however for large tables with many rows, scanning becomes inefficient. Scanning all the rows in a table is called a 'full table scan' or a 'sequential scan'.
In future posts I will introduce indexes that can combat the poor inefficiency of full table scans.