CS430 - Database Systems - Indexing Lab
Last updated : Thursday, November 18, 2004 at 09:34:39 EST
This lab is designed to explore how indexes can be used to improve
the performance of SQL statements, the different types of indexes
supported by Postgres, and the issues involved around indexes and the optimizer.
If you haven't read chapter 8 in the textbook do that before completing this lab. You will get much less out of the experience, if anything,
if you don't have a handle on the terminology and the fundamentals of
storage and indexing.
This lab is due on Monday November 22nd. See the instructions below for
turning in your work.
All of the files you need to get started on this lab are in ~charliep/courses/database-systems/indexing on the CS filesystem. Here's a rundown on what's there:
- create_tables.sql - creates the tables and populates the small tables.
- load.pl - populates the big tables.
Write SQL statements to answer each of the following questions using
the data model I supplied:
- How many part numbers in NYC have more than 90 parts on_hand?
- How many part numbers in either NYC or SFO have 95 parts
on_hand in either place?
- How many part numbers in NYC have between 90 and 95 parts
(inclusive) on hand?
- How many total parts are on_hand in SFO for suppliers whose name
begins with the letter 'B'?
- How many total parts are on_hand in NYC for suppliers whose name
ends with the letter 'y'?
- How many total parts on_hand, in both NYC and SFO, are Red?
- How many different part numbers are available in NYC from
suppliers who supply part 334 in NYC?
- List all the suppliers that have more total on_hand parts
in NYC than they do in SFO.
- List all suppliers that supply parts in NYC that aren't supplied
by anyone in SFO.
- Update all of the NYC on_hand values to on_hand - 3.
- Update all of the SFO supplier_name values to supplier_name (no net
change).
- Delete all parts from NYC where there are less than 10 on_hand.
Some notes about the data model and the questions:
- part_number is the primary key for each part table. It is not
unique across both locations (tables).
- If a part has the same number in NYC and SFO it is the same part,
regardless of color, etc.
- If I say, e.g. "Red parts", I mean color_name = 'Red' not
color = 0.
- Different suppliers may supply the same part in NYC and SFO.
Package all of those SQL statements up into one script, queries.sql, so that you can easily run them as a single batch from the command line using the Un*x time command, i.e.
$ time psql < queries.sql
You can also set it up so that you can run and time individual queries, or groups of queries, as part of your exploration. Postgres' psql \timing command can be helpful here. Be sure to use both timing mechanisms.
Write a script called create_indexes.sql that contains create statements
for all the indexes you decide are best for the queries you wrote. Developing
this is a very iterative process involving repeated test, measurement, and consideration steps. Make sure you follow the necessary precautions when measuring, e.g. no other loads on the system and averaging multiple measurements.
You should experiment with both B-tree and hash index types
(regardless of what the documentation says). You can find the Postgres
documentation on indexes at
http://www.postgresql.org/docs/7.4/static/indexes.html Be sure to
learn about and use the analyze command.
Some questions to consider in your write-up:
- How much faster does queries.sql run when your indexes are in place
as compared to having no indexes? Quantify the difference.
- How much longer do the update statements take with the indexes in
place compared to having no indexes? Quantify the difference.
- What data model changes would you suggest that might significantly
improve performance?
- Estimate the amount of disk space in bytes that each of the
tables and indexes (that you propose) occupies. Use the Postgres
data dictionary and "ls -l" to figure-out how much space each of
them really occupies. List your estimates (and how you developed
them) and the actual values for each database object (tables and indexes).
- What effect does wrapping each of the update statements in
"begin transaction" and "commit" have? Why? Quantify the difference.
- What is the fastest type of index to maintain during updates?
The slowest?
- What happens if you create your indexes before you load the data
in the two big tables?
- Are there any select statements which can't be helped by indexes?
- How much of a difference did analyze make? Where did you use
it? Compare plain tables, tables with indexes before analyze, and
then tables with indexes after analyze for a given set of select
statements.
- How can you reconcile the different timing information
provided by Un*x's time command and Postgres' psql \timing command?
Which operating system process(es) are using the time?
Turn in the following files via a tarball called e.g. charliep.tgz:
- queries.sql
- create_indexes.sql
- Typescript showing load.pl, create_indexes.sql, and queries.sql
being run one after the other. This output should include timing
information for each script.
- A write-up which addresses the questions I pose above.
Put the tarball in ~charliep/courses/homework/indexing on the CS filesystem when you are done.