5-4 FILES AND RECORDS
**********************
(Thanks to Arne Vajhoej, without him this chapter would have been a mess,
to Yehavi Bourvine, and to Steve Lionel who checked the DEC information)
We will avoid implementation-dependent details here, e.g. the internal
data structures of the filesystem, the allocation of disk storage area
to individual files, etc.
To be useful, a file must be kept with some information on its content:
name, date, size, etc. This information may be kept in various ways;
we will ignore this subject completely.
File organization vs. access method
-----------------------------------
In principle a file is just a series of bytes. However, many operating
systems impose a logical structure by having all low level file I/O
performed by special system routines that interpret the bytes comprising
the file according to some predefined scheme.
Disk controllers (and even some tape drives) can access the recorded
bytes "randomly". In I/O jargon that means the disk is divided into
"chunks" of predefined size, and you can read any "chunk" you want.
The random access capability of the hardware, together with the software
of the filesystem (the OS interface to the I/O hardware) makes it possible
to implement different file organizations and allow several access methods.
File organization and access methods are separate but related concepts:
organization refers to the internal structure of the file, while an
access method is an "allowed method" of reading from or writing to the
file. It may be possible to access a file with an access method other
than the "natural" one.
Possible file organizations are:
1) Sequential - The information in the file can be accessed only in
the order it was written. The writing order defines the
"natural" order of the data; in simple cases the data
will reside on the disk in consecutive locations.
2) Relative - The file is a sequence of equal-sized "data cells".
You can access any "cell" you want using its serial
number, and the system will calculate the offset.
Relative files are just like arrays [of structures],
but instead of residing in main memory, they are
recorded on a magnetic medium.
3) Indexed - The file is made of "data cells", not necessarily
of the same size, and contains "indexes": lists of
"pointers" to these cells arranged in some order.
Standard FORTRAN 77 doesn't require indexed files
to be implemented, but some vendors supply this
nice extension.
Access methods are classified by the way they find the location of
the data on the disk:
1) By physical address - The real hardware address, composed of
three components (at least): the number of the magnetic head used,
number of the track and number of the sector.
2) By physical/logical/virtual block number - This is the serial
number of the required disk block ("atomic" unit of disk area).
The three variants are different numbering methods.
3) Sequential - First data item is at the start of the file,
other items follow one after the other.
4) Direct - Data item location is calculated from its serial
number and the constant "cell size", this gives an offset
from the file's beginning.
5) Keyed - First one or more indexes are consulted (in a complex
process); they yield the address of the data item.
6) Memory mapping - The operating system creates an association
between the data in the file and a part of the main memory.
The system supports accesses to the "mapped" main memory as
if they were accesses to the file's data.
You can think of this process as if the system copied all of
the file contents to a large array residing in main memory,
but in order to conserve physical memory it is paging it in
and out as necessary.
7) Byte stream - The system buffers file accesses and lets you
read a specified number of bytes at a time.
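The offset calculation of the direct access method (item 4 above) can
be sketched in standard FORTRAN 77. The file name 'cells.dat', the unit
number and the 16-character cell size are arbitrary choices made for
this illustration, and the printed byte offset is only meaningful on
systems that store the cells back to back with nothing between them:

```fortran
      PROGRAM DIRACC
C     A sketch of direct access to equal-sized "cells".
C     File name, unit number and cell size are arbitrary.
      INTEGER CELLSZ
      PARAMETER (CELLSZ = 16)
      INTEGER I, N
      CHARACTER*16 CELL
C     For formatted direct files RECL is given in characters.
      OPEN (UNIT=10, FILE='cells.dat', STATUS='UNKNOWN',
     &      ACCESS='DIRECT', FORM='FORMATTED', RECL=CELLSZ)
C     Fill five cells; shorter data is blank-padded to RECL.
      DO 10 I = 1, 5
         WRITE (10, '(A,I2)', REC=I) 'cell number ', I
   10 CONTINUE
C     Fetch cell N directly, without reading cells 1..N-1.
      N = 3
      READ (10, '(A)', REC=N) CELL
      WRITE (*,*) 'cell   = ', CELL
C     The offset the system computes for us, assuming the
C     cells are stored back to back.
      WRITE (*,*) 'offset = ', (N-1)*CELLSZ
      CLOSE (10, STATUS='DELETE')
      END
```

A formatted file is used here because the unit of RECL for unformatted
direct files (bytes or words) differs between compilers.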
Compatibility of different organizations and access methods:
Access \ Organization    Sequential    Relative    Indexed
---------------------    ----------    --------    -------
Sequential                   +            +           +
Direct                       ?            +           ?
Keyed                        -            -           +
Physical address             +            +           +
By block number              +            +           +
Memory mapping               +            +           +
Byte stream                  +            +           +
The two basic types of file-systems
-----------------------------------
There are two major types of file-systems, characterized by the way
they implement file I/O:
Byte-oriented file systems
--------------------------
In this type of file-system, a file is considered a sequence
of bytes, and the operating system supplies routines that can
read/write a specified number of bytes.
To have any structure in a file, the system and application programs
accessing the file must adopt some convention that has to be respected
by all programs. For example, a line-feed (ASCII 10) character
in a file containing text may denote an end-of-line.
Record-oriented file systems
----------------------------
In these file-systems a file is a sequence of records of the same
type. A record is a sequence of data bytes together with control
information about the record's size and maybe some attributes.
The unit of I/O operations is one or more record(s).
This structure is imposed by consistently interpreting the data
and control information kept in the records, by system routines
you call in order to perform file I/O.
Having all records in a file share the same type makes it
reasonable to use the terms file type and record type interchangeably.
Why use records?
----------------
This is one of those questions that start "religious wars".
Various vendors (IBM, CDC, UNIVAC, DEC) made this design
decision, and Fortran adopted it.
Using structured files gives the filesystem more information
on the possible data save/retrieval requests, the information
can be used to improve performance and simplify the runtime
libraries. The performance enhancement is especially useful
when accessing large databases.
It's agreed that indexed and relative files that are useful for
creating databases benefit from a record-oriented filesystem.
In fact as byte-oriented filesystems become dominant, database
vendors are driven to write their own filesystems.
As for sequential files, even when trying to simplify file I/O
as much as possible (e.g. UNIX) you have to introduce some
structure in text files (working with text files that use the
line-feed (LF) character is really like working with
delimited-variable-size/stream-LF records), so why not optimize
the filesystem for it?
We found it is useful to structure files intended for printing
or displaying on a terminal screen, as a sequence of variable
(or fixed) sized 'lines'.
Only binary sequential files are left to be "justified",
but this is more difficult. The following points may
be relevant:
1) Again, using records may help I/O buffering, on the
other hand there is surely some processing overhead.
2) Using records you can "space" and "backspace" in
the file, but this could be done without records
if you had special I/O operations.
3) Using records makes it possible to create a text
file with embedded control characters, e.g. a line-feed (LF),
but you could invent a method to "escape" the
delimiting character...
In other words, structuring files may make data retrieval simpler,
and data recording more flexible. It is clear in the case of indexed
and direct files, but it may also be true in the case of sequential
files.
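Point 2 above ("spacing" and "backspacing" over records) can be tried
with a few standard FORTRAN 77 statements. This is only a sketch; the
unit number and the values written are arbitrary:

```fortran
      PROGRAM BACKSP
C     A sketch of "spacing" and "backspacing" over records.
      INTEGER I, K
C     A scratch file is used so nothing is left on disk.
      OPEN (UNIT=13, FORM='UNFORMATTED', STATUS='SCRATCH')
C     Write three records holding 10, 20 and 30.
      DO 10 I = 1, 3
         WRITE (13) 10*I
   10 CONTINUE
      REWIND (13)
C     Each READ moves forward by exactly one record.
      READ (13) K
      READ (13) K
      WRITE (*,*) 'second record:  ', K
C     Step back one record and read the same record again.
      BACKSPACE (13)
      READ (13) K
      WRITE (*,*) 'after BACKSPACE:', K
      CLOSE (13)
      END
```

Both output lines should show the value 20, since BACKSPACE repositions
the file just before the record that was read last.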
Files/Records types
-------------------
There are at least four basic types of files/records:
Fixed-length (minimal structured)
---------------------------------
All records contain the same number of bytes. A typical value is
512 bytes (the disk block size). The size information is kept outside
of the file, or assumed by convention.
Counted-variable-size (popular)
-------------------------------
Contains any number of bytes up to a specified limit. Records are
prefixed by a count field indicating the number of bytes in the
record. The count field may comprise 2-4 bytes on disk files and
4 bytes on tape files, and its size sets the record size limit.
The same prefix count field may be appended to the record to make
it easier to 'step back' in the file.
The count field is usually transparent to the user. On VMS and IRIX
it can be read with the non-standard 'Q' format specifier and then
used in further reading of the record.
Segmented (record package)
--------------------------
A FORTRAN-specific file type on VMS, an interesting invention that
removes the inherent size limitation of variable records.
Every single segmented record consists of one or more counted variable
size records called 'segments'. The segmented record can have any length
because each segment contains control information indicating its place
(first, last, the only segment, none of these). The control information
is kept in the attribute field that comprises the first two bytes of each
segment (see below).
Segmented records are the default type on VMS when writing unformatted
sequential files with sequential access. This is probably because
the VMS count field is only 2 bytes long, is treated as a signed integer,
and counts bytes, so the maximal length of a single counted variable
size record is quite small (at most 32767 bytes). DEC's segmented
records solve this problem, and at the same time remove the size limit
'barrier' of variable records.
Delimited-variable-size (text files)
------------------------------------
Variable length records whose length is indicated by explicit record
terminators embedded in the data and not by a count field. These
terminators are automatically added when you write records to a
stream-type file and removed when you read records.
Obviously stream files can't serve as structured binary files.
There are at least 3 varieties of stream-type files:
1) Stream - the terminator is the two-character
sequence CR LF (ASCII 13,10)
2) Stream/CR - the terminator is CR (ASCII 13)
3) Stream/LF - the terminator is LF (ASCII 10)
Stream files are used to store printable characters,
they can't be used to keep binary data.
Record/Files types
==================
                     | Fixed-size | Variable  | Segmented | Stream
---------------------|------------|-----------|-----------|---------
Binary capability    | Yes        | Yes       | Yes       | No
---------------------|------------|-----------|-----------|---------
Size limitation      | Yes        | Yes       | No        | No
---------------------|------------|-----------|-----------|---------
Processing overhead  | Minimal    |           |           | Minimal
---------------------|------------|-----------|-----------|---------
Portable?            | Yes        |           |           | Yes
---------------------|------------|-----------|-----------|---------
Commercial file systems/compiler support
----------------------------------------
FORTRAN 77 I/O is record oriented: reading and writing always start
at the beginning of a record and use an integral number of records.
On byte oriented file-systems there is a problem, and the necessary
structure has to be somehow created. In this case formatted files are
usually some stream type, and support for unformatted files is done
by the FORTRAN compiler. The compiler creates variable-length records
by prefixing (and suffixing) a count-field to each write operation,
and reads the records using this information.
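The record-oriented behaviour described above can be seen with a small
experiment in standard FORTRAN 77. This is a sketch only; the unit
number and the data values are arbitrary:

```fortran
      PROGRAM RECORI
C     A sketch: every READ starts at the beginning of a record.
      INTEGER A, B
      OPEN (UNIT=14, FORM='FORMATTED', STATUS='SCRATCH')
C     Two records, each holding three integers.
      WRITE (14, '(3I4)') 1, 2, 3
      WRITE (14, '(3I4)') 4, 5, 6
      REWIND (14)
C     Only the first value of each record is requested; the
C     rest of the record is skipped by the I/O system.
      READ (14, '(I4)') A
      READ (14, '(I4)') B
      WRITE (*,*) A, B
      CLOSE (14)
      END
```

The program prints 1 and 4, not 1 and 2: the second READ does not
continue where the first one stopped, it starts at the next record.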
VMS is record-oriented and directly supports most record and
file types, except segmented files that are supported by
the FORTRAN compiler.
UNIX is byte-oriented. Text files are Stream/LF by convention
(the line-feed character is called the newline character).
Binary files are sequences of bytes, or fixed-length.
FORTRAN requires record-oriented I/O, so the FORTRAN compilers
have to supply it; unformatted files are implemented as
counted-variable-length files (see table below).
DOS is byte-oriented, using Stream (CR/LF) for text files,
binary files are sequences of bytes or fixed-length.
Macs use Stream/CR for text files, binary files are sequences
of bytes or fixed-length.
IBM mainframes support a wealth of file access modes, including
record-oriented and byte oriented.
Formatted vs. unformatted - two ways to store numbers
-----------------------------------------------------
There are two different ways to represent integer and floating
point numbers:
1) UNFORMATTED - just the binary representation the computer
uses in the CPU registers and in the main memory. Such files
can be read/written efficiently because no translations are
needed; they usually also take less space on disk.
2) FORMATTED - the sequence of characters used to
describe the number in some radix (usually 10). It may contain
the ten digits and a few more characters like '+', '-', or 'E'.
For character data there should be no difference between the formatted
and the unformatted methods of recording data.
For example, the integer 1024 may be represented as:
Unformatted representation
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
   3                   2                   1
We assume here the INTEGER*4 data type. Remember that 1024 = 2 ** 10,
so only bit 10 is set.
Formatted representation
+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
|0|0|1|1|0|0|0|1| |0|0|1|1|0|0|0|0| |0|0|1|1|0|0|1|0| |0|0|1|1|0|1|0|0|
+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
 7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0
The four digits '1', '0', '2', '4' are each kept in one byte; the
bytes are ordered here from left to right.
An excerpt from the ASCII table will help here:
Character    ASCII value    ASCII in binary
---------    -----------    ---------------
    0            48            00110000
    1            49            00110001
    2            50            00110010
    3            51            00110011
    4            52            00110100
    5            53            00110101
    6            54            00110110
    7            55            00110111
    8            56            00111000
    9            57            00111001
    +            43            00101011
    -            45            00101101
    E            69            01000101
Floating point numbers will need '+', '-', 'E'
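The formatted representation of 1024 shown above can be reproduced with
a short FORTRAN 77 program. This is a sketch; it uses an internal file
to convert the integer to characters, and the codes it prints agree
with the table only on machines that use ASCII:

```fortran
      PROGRAM DIGITS
C     A sketch: the formatted representation of 1024, digit by
C     digit. The codes printed are ASCII only on ASCII machines.
      CHARACTER*4 TEXT
      INTEGER I
C     An internal file converts the integer to characters.
      WRITE (TEXT, '(I4)') 1024
C     Print each character together with its character code.
      DO 10 I = 1, 4
         WRITE (*, '(1X,A1,I5)') TEXT(I:I), ICHAR(TEXT(I:I))
   10 CONTINUE
      END
```

On an ASCII machine the four codes are 49, 48, 50 and 52, the same
values that appear in the bit diagram of the formatted representation.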
Formatted file contents are a sequence of printable characters and
may be viewed with a text editor etc. Unformatted files contain
sequences of compiler/machine dependent representations of data values.
Formatted/Unformatted I/O methods
---------------------------------
When reading/writing unformatted files it is enough to specify the
variables you wish to read/write. Since you are using the
compiler-defined internal representations, the compiler has all the
information it needs.
Reading/writing formatted files requires specifying more information,
e.g. how many digits should be displayed for the mantissa of a REAL
number. The universal solution is a FORMAT SPECIFICATION.
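The difference between the two I/O methods can be sketched by writing
the same two values both ways. The file names 'demo.unf' and
'demo.txt', the unit numbers and the edit descriptors are arbitrary
choices made for this illustration:

```fortran
      PROGRAM FMTDEM
C     A sketch: the same values written unformatted and formatted.
      REAL X
      INTEGER N
      X = 3.14159
      N = 1024
C     Unformatted: only the I/O list is given.
      OPEN (UNIT=11, FILE='demo.unf', FORM='UNFORMATTED',
     &      STATUS='UNKNOWN')
      WRITE (11) N, X
      CLOSE (11)
C     Formatted: a format specification says how each value
C     is converted to characters.
      OPEN (UNIT=12, FILE='demo.txt', FORM='FORMATTED',
     &      STATUS='UNKNOWN')
      WRITE (12, 100) N, X
  100 FORMAT (I6, F10.4)
      CLOSE (12)
      END
```

The file demo.txt can be viewed with a text editor, while demo.unf
holds the machine representations and whatever control information the
compiler adds.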
Text and binary files
---------------------
A useful (?) classification, which can be described in simple language
as human-readable and machine-readable files (this definition makes
some excellent programmers who can freely read hex dumps belong to the
machine class, but they would probably take this as a compliment).
Files for people - text files - are supposed to be record-oriented
(see below) and to contain only printable characters, with maybe a few
control characters like 'form-feed'.
Text files may be written and read with an editor program,
printed, emailed, etc.
Files for machines - binary files - are allowed to contain every
possible character, and have minimal internal structure.
A good example is an executable program file.
A related distinction is between binary and text I/O operations
(e.g. with FTP transfer modes):
Binary I/O operations use the content of files as is, text I/O may
perform some kind of translation. For example, 'end of line'
characters may have to be translated to another 'magic combination'
or start a new record when reading, and translated back on writing.
Unformatted files on different machines
=======================================
             | VMS Segmented   | VMS Variable     | IRIX | SunOS | OSF/1
-------------|-----------------|------------------|------|-------|---------
Default?     | YES             | NO               | YES  | YES   | YES
-------------|-----------------|------------------|------|-------|---------
'Endianity'  | little          | little           | BIG  | BIG   | little
Byte-order   | 1234            | 1234             | 4321 | 4321  | 1234
-------------|-----------------|------------------|------|-------|---------
Count field  | 2               | 2                | 4    | 4     | 4
size (Bytes) |                 |                  |      |       |
-------------|-----------------|------------------|------|-------|---------
Count field  | Signed integer  | Signed integer   |      |       |
integer type |                 |                  |      |       |
-------------|-----------------|------------------|------|-------|---------
Attribute    | YES             | NO               | NO   | NO    | NO
field        | see table below |                  |      |       |
-------------|-----------------|------------------|------|-------|---------
Suffix count | NO              | NO               | YES  | YES   | YES
field        |                 |                  |      |       |
-------------|-----------------|------------------|------|-------|---------
Alignment    | NULL at end of  | NULL at end of   | NO   | NO    | NO
padding      | odd size record | odd size record  |      |       |
-------------|-----------------|------------------|------|-------|---------
End-of-file  | count = FFFF    | count = FFFF     | NO   | NO    | NO
marker       |                 | (not required)   |      |       |
-------------|-----------------|------------------|------|-------|---------
Comments     | The padding is  | The padding is   |      |       |
             | not counted     | not counted      |      |       |
-------------|-----------------|------------------|------|-------|---------
A UNIX example
--------------
Let's initialize the variables to convenient values:
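      program unfor
C     Write two unformatted records to unit 2. There is no OPEN
C     statement, so the file name is chosen by the compiler.
      REAL A,B,C
      INTEGER D,E,F
      a = 1.0
      b = 2.0
      c = 3.0
      d = 1
      e = 2
      f = 3
C     First record: three REALs. Second record: three INTEGERs.
      WRITE(2) A,B,C
      WRITE(2) D,E,F
      end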
Compiling and running on a Sun machine, we get a file called fort.2.
Using "od" (Octal Dump) with a suitable option, we get the following
result (in decimal!):
od -l fort.2
0000000 12 1065353216 1073741824 1077936128
0000020 12 12 1 2
0000040 3 12
0000050
The strange long integers are the 3 floats, with another od option
they come out right (again, it's base 10):
od -f fort.2
0000000 1.6815582e-44 1.0000000e+00 2.0000000e+00 3.0000000e+00
0000020 1.6815582e-44 1.6815582e-44 1.4012985e-45 2.8025969e-45
0000040 4.2038954e-45 1.6815582e-44
0000050
Let's omit the offset column of the first dump, and rearrange each
record on a separate line:
12 1065353216 1073741824 1077936128 12
12 1 2 3 12
We can see that every data record is prefixed and suffixed by the
count field (4 bytes interpreted as a signed or unsigned integer;
perhaps you can find out and tell me?).
Note that the count doesn't include itself!
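The count fields seen in the dump can also be inspected from FORTRAN
itself. The following is only a sketch under several assumptions: a
compiler that supports the Fortran 2003 ACCESS='STREAM' specifier (not
part of FORTRAN 77), 4-byte default INTEGERs, and 4-byte prefix and
suffix count fields as in the dump above. The file name 'peek.dat' is
arbitrary:

```fortran
      PROGRAM PEEK
C     A sketch: look at the count fields of an unformatted record.
C     Assumes Fortran 2003 stream access, 4-byte default INTEGERs
C     and 4-byte prefix/suffix count fields.
      INTEGER D, E, F
      INTEGER HEAD, TAIL, V(3)
      D = 1
      E = 2
      F = 3
C     Write one ordinary unformatted sequential record.
      OPEN (UNIT=2, FILE='peek.dat', FORM='UNFORMATTED',
     &      STATUS='UNKNOWN')
      WRITE (2) D, E, F
      CLOSE (2)
C     Reopen the same file as a plain stream of bytes, so the
C     control information is no longer hidden from the program.
      OPEN (UNIT=2, FILE='peek.dat', FORM='UNFORMATTED',
     &      ACCESS='STREAM', STATUS='OLD')
      READ (2) HEAD, V, TAIL
      CLOSE (2, STATUS='DELETE')
      WRITE (*,*) 'prefix count =', HEAD
      WRITE (*,*) 'data         =', V
      WRITE (*,*) 'suffix count =', TAIL
      END
```

If the assumptions hold, both counts should come out as 12 (three
4-byte integers), in agreement with the second record of the dump.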
Some remarks on VMS
-------------------
The VMS 'End Of File' marker is the count field of the 'one after
the last' record. Count fields on VMS are signed integers, so
FFFF is equal to -1, an impossible value for a count.
The VMS EOF marker is not important, as the offset to the end
of the file is kept in the file header (in [000000]INDEXF.SYS).
Segment control flags (attribute field values) on VMS
=====================================================
0000 None of the following (i.e. continuation segment)
0001 First segment
0002 Last segment
0003 One and only segment (First + Last)
An example of a segmented record composed of 3 variable-size
records. The count field (and length) of each sub-record may
be different:
+----+----+------+ +----+----+------+ +----+----+------+
|nnnn|0001| data | |nnnn|0000| data | |nnnn|0002| data |
+----+----+------+ +----+----+------+ +----+----+------+
  /\   /\
  ||   ||
Count Attribute
Control information transparency (unformatted files)
----------------------------------------------------
The count field(s) is USER-TRANSPARENT on all systems. That means
that as long as you use FORTRAN I/O statements to read disk, diskette,
or tape files that were written using the corresponding FORTRAN I/O
statements on the same machine/OS combination, you don't have to
know anything about the count field(s).
The various VMS control fields are also user-transparent in that sense.
When support for FORTRAN files is provided by the compiler, the said
user-transparency doesn't extend to programs written in languages other
than FORTRAN. For example, VMS C programs can read/write unformatted
FORTRAN files (except segmented), but a UNIX C program sees only the
raw bytes, count fields included, and has to interpret them itself.
The transparency also breaks down when you port unformatted files between
different systems, or read/write them with system routines. Porting
FORTRAN unformatted files between different systems is covered in
another chapter.
Remarks on VMS
--------------
Segmented records and files are not supported by the
record management services (RMS).
All control fields (except the segment attribute field) are
transparent to the user when reading/writing files in the
'normal' way - with RMS record I/O routines.
The control fields will be visible if you use RMS block I/O,
or read physical/logical/virtual blocks with SYS$QIO(W).
ANSI labeled tapes written on VMS don't contain the VMS
control fields (except the segment attribute field).
A remark
--------
Some file terms are often encountered, and the meaning is not always
clear. The following list is an ATTEMPT to define these terms. Please
correct me if I'm wrong.
ASCII FILE Same as TEXT FILE?
ANSI FILE Contains printable characters and ANSI
formatting escape sequences. May be
displayed on an ANSI terminal.
END-OF-FILE (EOF)
END-OF-RECORD (EOR)
END-OF-LINE (EOL)