CS 450 Indexed Sequential/B-Tree Assignment

J. Dichter

 

General Instructions:

You are to write a data management program that offers the user a front end for displaying, updating, inserting, and deleting data records. The front end should be a GUI, such as VC++ MFC or a Java Application/Applet. You may use a simpler package such as the UNIX curses package, or even a plain text interface. If you use plain text, try to make it as clear and neat as possible.

The program should handle a large number of records, at least 10,000. You should try to run your system with 100,000, if the system will support that much memory. The content of your records is up to you, but each record should contain a numeric key value which will act as the file's primary key. I suggest that the data portion of the record be either randomly generated text or, better still, some large data file taken from the internet - be creative! The data portion should be at least 50 bytes per record.

The program will access the data directly from the file using random file access. C code can use the fseek( ) and ftell( ) functions; C++ can use the stream methods seekg( ) and tellg( ) for reading, and seekp( ) and tellp( ) for writing; Java can use the RandomAccessFile method seek(). All these options allow the program to read and write data at an arbitrary position in a file. The file will be organized in one of two ways:
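As an illustration of random file access, here is a minimal sketch in Python (chosen only for brevity; the same seek-then-read pattern applies with fseek( ), seekg( ), or RandomAccessFile.seek()). The record layout, a 4-byte key followed by 50 bytes of data, and the names write_record, read_record, and demo are assumptions made for this example, not part of the assignment.

```python
import os
import struct
import tempfile

# Assumed record layout: 4-byte unsigned integer key + 50 bytes of data.
RECORD = struct.Struct("<I50s")

def write_record(f, slot, key, text):
    """Overwrite the record stored at the given slot number."""
    f.seek(slot * RECORD.size)          # jump straight to the record's byte offset
    f.write(RECORD.pack(key, text.encode("ascii")))

def read_record(f, slot):
    """Read the record at the given slot number without scanning the file."""
    f.seek(slot * RECORD.size)
    key, raw = RECORD.unpack(f.read(RECORD.size))
    return key, raw.rstrip(b"\x00").decode("ascii")

def demo():
    """Write four records to a scratch file, then fetch the third one directly."""
    path = os.path.join(tempfile.mkdtemp(), "records.dat")
    with open(path, "w+b") as f:
        for slot, key in enumerate([1001, 1004, 1006, 1012]):
            write_record(f, slot, key, f"data for {key}")
        return read_record(f, 2), f.tell()
```

Because every record has the same size, the byte offset of a record is simply its slot number times the record size, which is what makes direct access possible.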

1. Indexed Sequential Access

2. B-Tree Access

 

Specific Directions:

The following will be divided into 2 sections, one for each of the two implementation options.

1. Indexed Sequential Access

In this case you will create a file big enough to hold your record set. The file can be generated by an option of your menu, or it can be a separate program. The idea is that you will create a data file and a two-level index file (actually two separate files), each of which will be composed of logical blocks.

The data file: this file will have blocks, each with room for six logical records, a pointer to the next record block, and a count of the vacant logical record slots. As you load the data into the file, you will enter the data into the blocks in increasing key sequence, say key = 1001, 1004, 1006, 1012, 1023, etc. You will place the first 4 records into the first block, leaving two empty slots, record the number of empty slots, and store the pointer to the next block; the next 4 records go into the second block, and so on. For example, if you were to load only 25 records, you would place 4 records in each of the first 6 blocks, and one record in the 7th block. The number of empty slots in each of the first 6 blocks would be 2, and in the 7th would be 5. Block 1 would point to block 2, block 2 to block 3, and so forth. Block 7 would have a null pointer, since no more data would exist. You would have a data pointer referring to the 1st (offset 0) block, and a free block pointer referring to the 8th block. Block 8 would point to block 9, block 9 to block 10, and so forth. The last block would have a null pointer, since no more free blocks exist. The blocking factor in this scheme is 6 because each block can hold 6 logical records.
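The loading scheme described above can be sketched as follows. This is an in-memory model in Python, assuming 0-based block numbers and -1 as the null pointer; the names load_blocks, INITIAL_FILL, and NULL are hypothetical. A real solution would write each block to the data file rather than keep it in a list.

```python
SLOTS_PER_BLOCK = 6      # blocking factor from the assignment
INITIAL_FILL = 4         # records loaded per block, leaving 2 slots free
NULL = -1                # assumed encoding of a null block pointer

def load_blocks(sorted_keys, total_blocks):
    """Return (blocks, data_head, free_head) for keys given in increasing order."""
    needed = -(-len(sorted_keys) // INITIAL_FILL)   # ceiling division
    if needed > total_blocks:
        raise ValueError("file too small for the record set")
    blocks = [{"keys": [], "empty": SLOTS_PER_BLOCK, "next": NULL}
              for _ in range(total_blocks)]
    # Place 4 keys per block, in key order, and record the vacant slot count.
    for used in range(needed):
        start = used * INITIAL_FILL
        block = blocks[used]
        block["keys"] = list(sorted_keys[start:start + INITIAL_FILL])
        block["empty"] = SLOTS_PER_BLOCK - len(block["keys"])
    # Chain the data blocks together; the last data block keeps a null pointer.
    for i in range(needed - 1):
        blocks[i]["next"] = i + 1
    # Chain the unused blocks as the free list; the last one keeps a null pointer.
    for i in range(needed, total_blocks - 1):
        blocks[i]["next"] = i + 1
    data_head = 0 if needed else NULL
    free_head = needed if needed < total_blocks else NULL
    return blocks, data_head, free_head
```

With 25 keys and 10 blocks, this reproduces the example in the text: blocks 1 through 6 hold 4 records each, block 7 holds one, and the free list starts at block 8 (index 7 here, since the sketch counts from 0).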

The index file(s): this is a set of 2 files: the 2nd level file (closest to the data file) and the 1st level file (further from the data file). You need to build the index files at the same time as the data file is being generated. Each index file will have a blocking factor of 50, meaning that 50 logical index records can be placed in a block. Each record in a 2nd level index block will hold a key (the maximum key in the corresponding data block) and a pointer to that data block. Each record in a 1st level index block will hold a key (the maximum key in the corresponding 2nd level index block) and a pointer to that 2nd level index block.
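A minimal Python sketch of building and using the two index levels follows. It assumes index blocks are packed completely full (the assignment does not say whether to leave spare index slots), models blocks as in-memory lists, and uses the hypothetical names build_indexes and find_data_block.

```python
INDEX_FACTOR = 50    # logical index records per index block

def build_indexes(data_blocks):
    """data_blocks: one sorted key list per data block, in chain order.

    Returns (level1, level2), each a list of index blocks, where an index
    block is a list of (max_key, pointer) entries.
    """
    # 2nd level: one (max key, data block number) entry per non-empty data block.
    entries = [(keys[-1], n) for n, keys in enumerate(data_blocks) if keys]
    level2 = [entries[i:i + INDEX_FACTOR]
              for i in range(0, len(entries), INDEX_FACTOR)]
    # 1st level: one (max key, 2nd level block number) entry per 2nd level block.
    tops = [(block[-1][0], n) for n, block in enumerate(level2)]
    level1 = [tops[i:i + INDEX_FACTOR]
              for i in range(0, len(tops), INDEX_FACTOR)]
    return level1, level2

def find_data_block(level1, level2, key):
    """Return the data block number whose key range covers key, else None."""
    for block in level1:
        for max_key, level2_no in block:
            if key <= max_key:
                # The first entry whose maximum is >= key marks the right block.
                for max_key2, data_no in level2[level2_no]:
                    if key <= max_key2:
                        return data_no
    return None
```

A lookup reads one 1st level block, one 2nd level block, and one data block, so three block reads reach any record as long as the 1st level index fits in a single block.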

As records are deleted and inserted, you need to manage the data blocks and the index blocks. For example, when the record with the maximum key in a data block is deleted, you must update the key value in the 2nd level index entry which led you to the data block. If that key is also the maximum key in its 2nd level index block (and therefore appears in the 1st level index), you will also need to update the 1st level index. If the deleted record is the only record left in a data block, then you need to bypass this block by linking its predecessor in the data block chain to its successor. You also need to remove the block's entry from the 2nd level index. If the 2nd level index block has other higher keyed entries, you just need to "compress" the index block, eliminating the entry for the recently emptied block. Also be sure to update the number of empty slots in the index block (it will increase by one). If the index entry was the maximum entry in its block, then you will need to fix the 1st level index appropriately.
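The deletion cases above can be sketched as follows, again as an in-memory Python model with hypothetical names (delete_record, NULL). The sketch repairs a single 2nd level index block and reports whether the 1st level needs fixing; returning the emptied block to the free list and moving the data head pointer when the first block empties are left out to keep it short.

```python
NULL = -1    # assumed encoding of a null block pointer

def delete_record(blocks, index_block, block_no, key):
    """Delete key from data block block_no and repair one 2nd level index block.

    blocks: dict mapping block number -> {"keys": sorted list, "next": block number}
    index_block: list of [max_key, block_no] entries in increasing key order
    Returns True when the index block's own maximum changed, meaning the
    1st level entry that points at this index block must be fixed as well.
    """
    block = blocks[block_no]
    old_index_max = index_block[-1][0]
    was_max = key == block["keys"][-1]
    block["keys"].remove(key)
    pos = next(i for i, entry in enumerate(index_block) if entry[1] == block_no)
    if not block["keys"]:
        # Block is now empty: bypass it in the chain and compress the index block.
        for other in blocks.values():
            if other["next"] == block_no:
                other["next"] = block["next"]
        block["next"] = NULL
        del index_block[pos]
    elif was_max:
        # The block's maximum changed: store the new maximum in its index entry.
        index_block[pos][0] = block["keys"][-1]
    new_index_max = index_block[-1][0] if index_block else None
    return new_index_max != old_index_max
```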

Additions to the data file will be handled in a similar way. You will follow the indexes to the block to which the data should belong. If that block has available room, add the data into the block, and reduce the number of empty slots in the block. If the block is already full, then you need to grab another block from the free blocks. You will need to remove it from the free block list, link it into the data blocks, and then split the data records between the "full" block and the new block. To link the new block in, point the former full block to the new block, and the new block to the block that the former full block pointed to. This is the same procedure as linked list insertion.
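A minimal Python sketch of insertion with a block split follows. The names insert_key and NULL are hypothetical, blocks are modeled in memory, and the even split of the seven keys is one reasonable choice rather than a requirement. The index updates (a new maximum for the old block and a new entry for the new block) are not shown.

```python
import bisect

SLOTS_PER_BLOCK = 6
NULL = -1    # assumed encoding of a null block pointer

def insert_key(blocks, free_head, block_no, key):
    """Insert key into data block block_no, splitting the block if it is full.

    blocks: list of {"keys": sorted list, "next": block number}
    Returns (new_free_head, new_block_no), where new_block_no is None
    when no split was needed.
    """
    block = blocks[block_no]
    if len(block["keys"]) < SLOTS_PER_BLOCK:
        bisect.insort(block["keys"], key)      # room available: keep keys sorted
        return free_head, None
    if free_head == NULL:
        raise RuntimeError("no free blocks left")
    # Take the first block off the free list.
    new_no = free_head
    new_block = blocks[new_no]
    free_head = new_block["next"]
    # Split the seven keys (six old plus the new one) between the two blocks.
    keys = sorted(block["keys"] + [key])
    half = (len(keys) + 1) // 2
    block["keys"], new_block["keys"] = keys[:half], keys[half:]
    # Linked list insertion: old block -> new block -> old block's former successor.
    new_block["next"] = block["next"]
    block["next"] = new_no
    return free_head, new_no
```

The number of empty slots is not stored separately in this sketch; it can be derived as SLOTS_PER_BLOCK minus the number of keys in the block.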

Display: When the user asks to display a record, you should display ALL the records in that record's data block. That means you will always display 1 - 6 records, depending on the fullness of the data block. If a block contains records with key values 2002, 2122, and 2223, and you add a record with key 2130, then the next display of the record with key 2122 should show the records with keys 2002, 2122, 2130, and 2223.

 

2. B-Tree Access

In this alternative implementation, you can create the data file in the same way as in the other approach. You can even create a data file with a single record per block. In this way each record would have a key in the B-Tree. Now, you can create the B-Tree index as a separate step. Since a B-Tree is a tree, and construction of the B-Tree is well-defined, you can scan the file and enter into the B-Tree either the key of the maximum record in each block (in the case of blocking factor > 1) or the key of each record. The B-Tree will create its own shape and will grow as keys are entered into it.

Searching for and deleting a record is done in either of two ways:

a. In case one, where we have blocked records, we simply traverse the tree nodes following a well-defined path until we get to a leaf node. The leaf node will then have a pointer to the block where the data should be found. Note that only leaf nodes will have block pointers.

b. In case two, one record per block, when we find a key in the B-Tree, we immediately follow its pointer to the record. Note that all nodes now need to have record pointers. If we get to a leaf node without a key match, the record is not in the system.
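The search for case two can be sketched as follows. This is a minimal Python model in which each node is a dictionary holding sorted keys, one record pointer per key, and a list of children (empty for a leaf); the field names and the function name btree_search are assumptions for this example. In a file-based tree the children would be block offsets read with a seek, not in-memory references.

```python
import bisect

def btree_search(node, key):
    """Return the record pointer stored with key, or None if the key is absent."""
    while node is not None:
        # Position of the first key in this node that is >= the search key.
        i = bisect.bisect_left(node["keys"], key)
        if i < len(node["keys"]) and node["keys"][i] == key:
            return node["records"][i]      # found: follow the pointer to the record
        if not node["children"]:
            return None                    # reached a leaf without a match
        node = node["children"][i]         # descend into the subtree between keys
    return None
```

Case one differs only in that the descent always continues to a leaf, and the pointer found there refers to a data block that must then be searched for the record.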

In case one, deleting the last remaining record in a block requires the block to be re-linked to the free block list. Deleting a maximum keyed record will require an update to the key in the leaf node. In case two, deleting any record will require recycling its data slot. In addition, the B-Tree needs to be reorganized, as a key entry in a node will need to be removed.

If you choose to use blocked records, then the display should show the entire block of records (see the explanation given for the Indexed Sequential implementation). Otherwise, you can only show a single record. The advantage of the blocked method is that it is easier to cluster data into the blocks.

Note: You should allow the order of the B-Tree to be software configurable. You can work with powers of 2 (if you wish), allowing the order parameter m to be 64, 128, 256, or 512. The logic for the implementation of B-Trees is available in most advanced Data Structures books.
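To illustrate a configurable order, here is a minimal in-memory sketch of B-Tree insertion in Python. It assumes the order m is the maximum number of children per node (so a node holds at most m - 1 keys), that keys are unique, and that an overflowing node is split around its middle key; definitions of order vary between textbooks. Record pointers, deletion, and disk storage are omitted, and the class and method names are hypothetical.

```python
import bisect

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []    # empty list for a leaf

class BTree:
    def __init__(self, order):
        if order < 3:
            raise ValueError("order must be at least 3")
        self.order = order                # maximum number of children per node
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:             # root overflowed: the tree grows one level
            middle, right = split
            self.root = Node([middle], [self.root, right])

    def _insert(self, node, key):
        """Insert below node; return (middle_key, new_right_node) if node split."""
        i = bisect.bisect_left(node.keys, key)
        if node.children:
            split = self._insert(node.children[i], key)
            if split is None:
                return None
            middle, right = split         # child split: absorb its middle key
            node.keys.insert(i, middle)
            node.children.insert(i + 1, right)
        else:
            node.keys.insert(i, key)
        if len(node.keys) < self.order:   # at most order - 1 keys are allowed
            return None
        mid = len(node.keys) // 2
        middle = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return middle, right

    def height(self):
        h, node = 1, self.root
        while node.children:
            h, node = h + 1, node.children[0]
        return h

    def walk(self, node=None):
        """Yield all keys in increasing order."""
        node = node or self.root
        for i, key in enumerate(node.keys):
            if node.children:
                yield from self.walk(node.children[i])
            yield key
        if node.children:
            yield from self.walk(node.children[-1])
```

Building the same key set with BTree(4) and BTree(64) and comparing height() shows how a larger order flattens the tree, which is what reduces the number of block reads per lookup.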

Optional:

To show how the B-Tree order affects the access time, you could measure the time it takes to process a set of transactions. To do this, you would need to have a start time option in your interface, and an option to display the total time. A similar method can be applied to the Indexed Sequential approach.
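One way to take such a measurement is sketched below in Python. The function name time_transactions and the idea of passing each transaction to a handler callback are assumptions for this example; in the actual program the start and stop points would be tied to the interface options mentioned above.

```python
import time

def time_transactions(transactions, handler):
    """Run handler on each transaction and return (count, elapsed seconds)."""
    start = time.perf_counter()           # monotonic, high-resolution clock
    count = 0
    for transaction in transactions:
        handler(transaction)
        count += 1
    return count, time.perf_counter() - start
```

Running the same transaction set against trees of different order, or against the Indexed Sequential files, and comparing the elapsed times gives a rough picture; repeating each run a few times helps smooth out caching effects.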

Requirements:

Your program should be handed in with all design work and documentation, as well as a running version (software) which will be demonstrated in class. The B-Tree version carries a higher difficulty, and those implementing it may score higher. Note that it is possible to get a grade of A with either implementation.

Due: February 20, 2000