%load_ext sql
from sqlalchemy import create_engine
%sql postgresql://postgres:****@localhost/dvdrental
engine = create_engine('postgresql://postgres:****@localhost/dvdrental')
A table contains data. A table is used to organise data so that it can be manipulated and analysed in various ways applicable to the questions being asked of the data. A data table can be broken down further to its constituent parts:
Each row in a table represents one entry, in the sketch below, the row entry from left to right is, Jane, Lawyer, England and 27. Each column in a table represents a feature, the below example shows the "Country" feature. It is important to note that each entry in the column is of the same datatype.
The most fundamental keyword in SQL is the SELECT
command. This command allows retrieval of selected data from a database. Usually the SELECT
command is used in conjunction with the FROM
command.
SELECT
specifies columns of interestFROM
specifies where these columns are storedOften SQL scripts will work without terminating a statement with a semicolon, but it is best practice to end a statement with a semi colon (;). It is also good practice to write all commands in uppercase, such as SELECT
.
The SELECT
can specify what columns should be returned from what tables, or what records should be returned based on the value of certain columns. The following example uses SELECT
to select data FROM
the payment table, the query reads like plain english.
%%sql
SELECT * FROM payment
LIMIT 5;
* postgresql://postgres:***@localhost/dvdrental 5 rows affected.
payment_id | customer_id | staff_id | rental_id | amount | payment_date |
---|---|---|---|---|---|
17503 | 341 | 2 | 1520 | 7.99 | 2007-02-15 22:25:46.996577 |
17504 | 341 | 1 | 1778 | 1.99 | 2007-02-16 17:23:14.996577 |
17505 | 341 | 1 | 1849 | 7.99 | 2007-02-16 22:41:45.996577 |
17506 | 341 | 2 | 2829 | 2.99 | 2007-02-19 19:39:56.996577 |
17507 | 341 | 2 | 3130 | 7.99 | 2007-02-20 17:31:48.996577 |
The asterisk (*) is a wildcard, and returns all columns from a given table.
In the above query, the SELECT
statement uses the wildcard to return all of the columns FROM
the table called payment. The LIMIT
command at the end of the query restricts the rows affected to 5, in order not to return all of the rows, which could be many thousands.
Above is returned 6 columns of data, where the rows returned have been limited to 5. If the LIMIT
command was not used, 14596 rows would be returned.
There are times when retrieving all of the column data in a table is necessary, but in most cases this leads to slow performance. Especially when dealing with very large datasets. For this reason, the wildcard should be used sparingly when writing queries.
Note also the convention to use all lower case for column titles as well as underscores to seperate words.
Instead of using the wildcard to SELECT
all of the columns data, querys can be more specific. Explicit notation lists specific columns of interest, which can increase readability, and make code easier to troubleshoot.
The following query selects only the payment_id
and customer_id
columns from the payment table, so instead of all 6 columns, what is returned is the specified 2.
%%sql
SELECT payment_id, customer_id
FROM payment
LIMIT 5;
* postgresql://postgres:***@localhost/dvdrental 5 rows affected.
payment_id | customer_id |
---|---|
17503 | 341 |
17504 | 341 |
17505 | 341 |
17506 | 341 |
17507 | 341 |
A PRIMARY KEY
is a column that contains a unique identifier for the entire row. Its only job is to make sure each row has one piece of information that makes sure the entire row is uniquely identifiable. It does not have any other significance. A PRIMARY KEY
cannot contain NULL
values.
A FOREIGN KEY
is the link between two tables in a database that allows for a relational database. A FOREIGN KEY
is a field (or collection of fields) in one table, that refers to the PRIMARY KEY
in another table. FOREIGN KEY
constraints are useful in that they protect the integrity of the database by not allowing data to exist that does not relate to the PRIMARY KEY
constraint.
To demonstrate primary and foreign keys using PostgreSQL, the same as in the sketch above, in the following table called "rental", customer_id
is a FOREIGN KEY
that refers to the PRIMARY KEY
in the "payment" table. customer_id
is the relation between the two tables. This relational concept is very important to understand when working with relational databases.
%%sql
SELECT DISTINCT customer_id
FROM rental
Limit 5;
* postgresql://postgres:***@localhost/dvdrental 5 rows affected.
customer_id |
---|
87 |
184 |
477 |
273 |
550 |
%%sql
SELECT payment_id, customer_id
FROM payment
LIMIT 5;
* postgresql://postgres:***@localhost/dvdrental 5 rows affected.
payment_id | customer_id |
---|---|
17503 | 341 |
17504 | 341 |
17505 | 341 |
17506 | 341 |
17507 | 341 |