
Sunday, September 11, 2011

CSC-410 Advanced Database: File Organization


File Organization:

Introduction

A file organization is a way of arranging the records in a file when the file is stored on disk. A file of records is likely to be accessed and modified in a variety of ways, and different ways of arranging the records enable different operations over the file to be carried out efficiently. For example, if we want to retrieve employee records in alphabetical order, sorting the file by name is a good file organization. On the other hand, if we want to retrieve all employees whose salary is in a given range, sorting employee records by name is not a good file organization. A DBMS supports several file organization techniques, and an important task of a DBA is to choose a good organization for each file, based on its expected pattern of use.

There are three basic file organizations: files of randomly ordered records (i.e., heap files), files sorted on some field, and files hashed on some field.
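
To make the distinction concrete, the following sketch models the three organizations as simple in-memory Python structures. The record layout, field choices, and toy hash function are assumptions for illustration, not any particular DBMS's implementation.

# Illustrative sketch of the three basic file organizations.
# Records are (name, age, salary) tuples; details are assumed for the example.

records = [
    ("Smith", 44, 3000),
    ("Jones", 40, 6003),
    ("Tracy", 44, 5004),
    ("Ashby", 25, 3000),
]

# 1. Heap file: records in no particular order (insertion order).
heap_file = list(records)

# 2. Sorted file: records ordered on some field, here name.
sorted_file = sorted(records, key=lambda r: r[0])

# 3. Hashed file: a hash function maps each record (on some field,
#    here name) into one of N buckets.
N = 4
def h(key):
    return hash(key) % N           # toy hash function, for illustration only

hashed_file = [[] for _ in range(N)]
for r in records:
    hashed_file[h(r[0])].append(r)

print(sorted_file)                 # alphabetical retrieval is trivial
print(hashed_file[h("Tracy")])     # equality search touches one bucket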

Each file organization makes certain operations efficient, but often we are interested in supporting more than one operation. For example, sorting a file of employee records on the name field makes it easy to retrieve employees in alphabetical order, but we may also want to retrieve all employees who are 55 years old; for this, we would have to scan the entire file. To deal with such situations, a DBMS builds an index.

An index on a file is designed to speed up operations that are not efficiently supported by the basic organization of records in that file.
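
As a toy illustration, the sketch below builds an index on age over a file organized on name, so the age query from the example above no longer requires a full scan. The dictionary-of-positions structure merely stands in for a real index and is an assumption for illustration.

# A toy "index" on age over a file organized (sorted) on name.
# The dict-of-positions stands in for a real index structure.

file_sorted_by_name = [
    ("Ashby", 25, 3000),
    ("Jones", 55, 6003),
    ("Smith", 44, 3000),
    ("Tracy", 55, 5004),
]

# Build the index: search key value -> positions of matching records.
age_index = {}
for pos, rec in enumerate(file_sorted_by_name):
    age_index.setdefault(rec[1], []).append(pos)

# "Retrieve all employees who are 55 years old" without a full scan:
for pos in age_index.get(55, []):
    print(file_sorted_by_name[pos])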

Cost Model

In this section we introduce a cost model that allows us to estimate the cost (in terms of execution time) of different database operations. We will use the following notation and assumptions in our analysis. There are B data pages with R records per page.

The average time to read or write a disk page is D, and the average time to process a record (e.g., to compare a field value to a selection constant) is C. In the hashed file organization, we will use a function, called a hash function, to map a record into a range of numbers; the time required to apply the hash function to a record is H.

Typical values today are D = 15 milliseconds, C and H = 100 nanoseconds; we therefore expect the cost of I/O to dominate. This conclusion is supported by current hardware trends, in which CPU speeds are steadily rising, whereas disk speeds are not increasing at a similar pace. On the other hand, as main memory sizes increase, a much larger fraction of the needed pages are likely to fit in memory, leading to fewer I/O requests.
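
As a quick sanity check on the claim that I/O dominates, we can plug these typical values into the simplest operation, a full file scan, whose cost under this model is B(D + RC): read each of the B pages and process its R records. The values of B and R below are assumptions for illustration.

# Plugging the typical values into the cost model for a full file scan.
# Scan cost in this model is B * (D + R*C).

D = 15e-3      # average time to read/write a disk page: 15 milliseconds
C = 100e-9     # average time to process a record: 100 nanoseconds
H = 100e-9     # time to apply the hash function (not used in a scan)

B = 1000       # number of data pages (assumed for illustration)
R = 100        # records per page (assumed for illustration)

io_cost  = B * D          # 15.0 seconds
cpu_cost = B * R * C      # 0.01 seconds

print(f"I/O: {io_cost:.3f} s")
print(f"CPU: {cpu_cost:.3f} s")   # I/O dominates by ~3 orders of magnitude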

We emphasize that real systems must consider other aspects of cost, such as CPU costs (and transmission costs in a distributed database). However, our goal is primarily to present the underlying algorithms and to illustrate how costs can be estimated. Therefore, for simplicity, we have chosen to concentrate on only the I/O component of cost. Given the fact that I/O is often (even typically) the dominant component of the cost of database operations, considering I/O costs gives us a good first approximation to the true costs.

Even with our decision to focus on I/O costs, an accurate model would be too complex for our purposes of conveying the essential ideas in a simple way. We have therefore chosen to use a simplistic model in which we just count the number of pages that are read from or written to disk as a measure of I/O. We have ignored the important issue of blocked access: typically, disk systems allow us to read a block of contiguous pages in a single I/O request. The cost is equal to the time required to seek the first page in the block and to transfer all pages in the block.

Such blocked access can be much cheaper than issuing one I/O request per page in the block, especially if those requests do not follow each other consecutively, since we would then incur an additional seek cost for each page in the block.
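
A back-of-the-envelope comparison makes the point. The split of page access time into seek and transfer components, and the block size, are assumed values for illustration only.

# Rough comparison of blocked vs. per-page I/O for one block of pages.
# The seek/transfer split and block size are assumptions for illustration.

seek     = 10e-3    # time to seek to the first page (assumed)
transfer = 5e-3     # time to transfer one page once positioned (assumed)
pages    = 8        # contiguous pages in the block (assumed)

blocked  = seek + pages * transfer        # one seek, then stream the block
per_page = pages * (seek + transfer)      # a separate seek for every page

print(f"blocked:  {blocked * 1e3:.0f} ms")    # 50 ms
print(f"per-page: {per_page * 1e3:.0f} ms")   # 120 ms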

This discussion of the cost metric we have chosen must be kept in mind when we discuss the cost of various algorithms in this chapter and in later chapters. We discuss the implications of the cost model whenever our simplifying assumptions are likely to affect the conclusions drawn from our analysis in an important way.


Index overview

An index can be viewed as a collection of data entries, with an efficient way to locate all data entries with search key value k. Each such data entry, which we denote as k*, contains enough information to enable us to retrieve (one or more) data records with search key value k. Exactly what a data entry stores, and whether it points to the data record or contains it outright, depends on which of the alternatives below is used.

Alternatives for Data Entries

A data entry k* allows us to retrieve one or more data records with key value k. We need to consider three main alternatives, sketched in code after this list:

1. A data entry k* is an actual data record (with search key value k).

2. A data entry is a <k, rid> pair, where rid is the record id of a data record with search key value k.

3. A data entry is a <k, rid-list > pair, where rid-list is a list of record ids of data records with search key value k.
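
The sketch below renders the three alternatives as Python values for a single search key value. The record ids, shown here as (page, slot) pairs, and all concrete values are assumptions for illustration.

# The three alternatives for a data entry k*, for search key value k = 55.
# Record ids (rids) are shown as (page, slot) pairs; values are assumed.

k = 55

# Alternative 1: the data entry IS the data record itself.
entry_alt1 = ("Jones", 55, 6003)

# Alternative 2: a <k, rid> pair; one entry per matching record.
entry_alt2 = (k, (4, 2))             # record lives at page 4, slot 2

# Alternative 3: a <k, rid-list> pair; one entry per key value.
entry_alt3 = (k, [(4, 2), (7, 0)])   # all records with key value 55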

Properties of Indexes (Clustered vs. Unclustered, Dense vs. Sparse, Primary and Secondary, Indexes Using Composite Search Keys)

