CS430 - Database Systems - Indexing Lab

Last updated : Thursday, November 18, 2004 at 09:34:39  EST


This lab is designed to explore how indexes can be used to improve the performance of SQL statements, the different types of indexes supported by Postgres, and the issues involved around indexes and the optimizer.

If you haven't read chapter 8 in the textbook do that before completing this lab. You will get much less out of the experience, if anything, if you don't have a handle on the terminology and the fundamentals of storage and indexing.

This lab is due on Monday November 22nd. See the instructions below for turning in your work.

All of the files you need to get started on this lab are in ~charliep/courses/database-systems/indexing on the CS filesystem. Here's a rundown on what's there:

Write SQL statements to answer each of the following questions using the data model I supplied:

  1. How many part numbers in NYC have more than 90 parts on_hand?

  2. How many part numbers in either NYC or SFO have 95 parts on_hand in either place?

  3. How many part numbers in NYC have between 90 and 95 parts (inclusive) on hand?

  4. How many total parts are on_hand in SFO for suppliers whose name begins with the letter 'B'?

  5. How many total parts are on_hand in NYC for suppliers whose name ends with the letter 'y'?

  6. How many total parts on_hand, in both NYC and SFO, are Red?

  7. How many different part numbers are available in NYC from suppliers who supply part 334 in NYC?

  8. List all the suppliers that have more total on_hand parts in NYC than they do in SFO.

  9. List all suppliers that supply parts in NYC that aren't supplied by anyone in SFO.

  10. Update all of the NYC on_hand values to on_hand - 3.

  11. Update all of the SFO supplier_name values to supplier_name (no net change).

  12. Delete all parts from NYC where there are less than 10 on_hand.

Some notes about the data model and the questions:

Package all of those SQL statements up into one script, queries.sql, so that you can easily run them as a single batch from the command line using the Un*x time command, i.e.

You can also set it up so that you can run and time individual queries, or groups of queries, as part of your exploration. Postgres' psql \timing command can be helpful here. Be sure to use both timing mechanisms.

Write a script called create_indexes.sql that contains create statements for all the indexes you decide are best for the queries you wrote. Developing this is a very iterative process involving repeated test, measurement, and consideration steps. Make sure you follow the necessary precautions when measuring, e.g. no other loads on the system and averaging multiple measurements.

You should experiment with both B-tree and hash index types (regardless of what the documentation says). You can find the Postgres documentation on indexes at http://www.postgresql.org/docs/7.4/static/indexes.html Be sure to learn about and use the analyze command.

Some questions to consider in your write-up:

  1. How much faster does queries.sql run when your indexes are in place as compared to having no indexes? Quantify the difference.

  2. How much longer do the update statements take with the indexes in place compared to having no indexes? Quantify the difference.

  3. What data model changes would you suggest that might significantly improve performance?

  4. Estimate the amount of disk space in bytes that each of the tables and indexes (that you propose) occupies. Use the Postgres data dictionary and "ls -l" to figure-out how much space each of them really occupies. List your estimates (and how you developed them) and the actual values for each database object (tables and indexes).

  5. What effect does wrapping each of the update statements in "begin transaction" and "commit" have? Why? Quantify the difference.

  6. What is the fastest type of index to maintain during updates? The slowest?

  7. What happens if you create your indexes before you load the data in the two big tables?

  8. Are there any select statements which can't be helped by indexes?

  9. How much of a difference did analyze make? Where did you use it? Compare plain tables, tables with indexes before analyze, and then tables with indexes after analyze for a given set of select statements.

  10. How can you reconcile the different timing information provided by Un*x's time command and Postgres' psql \timing command? Which operating system process(es) are using the time?

Turn in the following files via a tarball called e.g. charliep.tgz:

  1. queries.sql
  2. create_indexes.sql
  3. Typescript showing load.pl, create_indexes.sql, and queries.sql being run one after the other. This output should include timing information for each script.
  4. A write-up which addresses the questions I pose above.

Put the tarball in ~charliep/courses/homework/indexing on the CS filesystem when you are done.