Files & records

 5-4  FILES AND RECORDS 
 **********************

 (Thanks to Arne Vajhoej, without him this chapter would have been a mess, 
  to Yehavi Bourvine, and to Steve Lionel who checked the DEC information)

 We will avoid implementation-dependent details here, e.g. the internal 
 data structures of the filesystem, the allocation of disk storage area 
 to individual files, etc.
 
 To be useful, a file must be kept with some information on its content: 
 name, date, size, etc. This information may be kept in various ways; 
 we will ignore this subject completely.


 File organization vs. access method
 -----------------------------------
 In principle a file is just a series of bytes. However, many operating
 systems impose a logical structure by having all low level file I/O 
 performed by special system routines that interpret the bytes comprising 
 the file according to some predefined scheme.

 Disk controllers (and even some tape drives) can access the recorded 
 bytes "randomly". In I/O jargon that means the disk is divided into
 "chunks" of predefined size, and you can read any "chunk" you want.

 The random access capability of the hardware, together with the software
 of the filesystem (the OS interface to the I/O hardware) makes it possible 
 to implement different file organizations and allow several access methods.

 File organization and access method are separate but related concepts:
 organization refers to the internal structure of the file, while an 
 access method is an "allowed method" to read/write from/to the file. 
 It may be possible to access a file with an access method other than 
 the "natural" one.

 Possible file organizations are:

    1) Sequential - The information in the file can be accessed only in 
                    the order it was written. The writing order defines 
                    the "natural" order of data; in simple cases the data
                    will reside on the disk in consecutive locations.

    2) Relative   - The file is a sequence of equal-sized "data cells",
                    you can access any "cell" you want using its serial 
                    number, and the system will calculate the offset. 

                    Relative files are just like arrays [of structures],
                    but instead of residing in main memory, they are
                    recorded on a magnetic medium.

    3) Indexed    - The file is made of "data cells", not necessarily 
                    of the same size, and contains "indexes", lists of 
                    "pointers" to these cells arranged in some order.

                    Standard FORTRAN 77 doesn't require that indexed 
                    files are to be implemented, but some vendors 
                    supply this nice extension.
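
 The offset arithmetic behind a relative file can be sketched in a few 
 lines. This is only an illustration, written in Python rather than 
 FORTRAN so that the bytes can be handled directly. The cell layout 
 (a 4-byte integer followed by an 8-byte name) and the names CELL, 
 write_cell and read_cell are made up for the example, and an in-memory 
 buffer stands in for the disk file:

```python
import io
import struct

CELL = struct.Struct("<i8s")   # hypothetical cell: 4-byte integer + 8-byte name

def write_cell(f, number, ident, name):
    """Store one cell at serial number `number` (counted from 1)."""
    f.seek((number - 1) * CELL.size)      # offset = (serial number - 1) * cell size
    f.write(CELL.pack(ident, name))       # short names are padded with NUL bytes

def read_cell(f, number):
    """Fetch the cell with serial number `number` without reading the others."""
    f.seek((number - 1) * CELL.size)
    ident, name = CELL.unpack(f.read(CELL.size))
    return ident, name.rstrip(b"\0")

# An in-memory stand-in for a disk file.
f = io.BytesIO()
for n, name in enumerate([b"alpha", b"beta", b"gamma"], start=1):
    write_cell(f, n, 100 + n, name)

print(read_cell(f, 3))   # (103, b'gamma')
print(read_cell(f, 1))   # (101, b'alpha')
```

 Because every cell has the same size, the third cell is found without 
 touching the first two, just like indexing an array.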


 Access methods are classified by the way they find the location of
 the data on the disk: 

    1) By physical address - The real hardware address, composed of 
       three components (at least): the number of the magnetic head used, 
       number of the track and number of the sector.

    2) By physical/logical/virtual block number - This is the serial 
       number of the required disk block ("atomic" unit of disk area);
       the three variants are different numbering methods.

    3) Sequential - First data item is at the start of the file,
       other items follow one after the other.

    4) Direct - Data item location is calculated from its serial 
       number and the constant "cell size", this gives an offset
       from the file's beginning.

    5) Keyed - First, one or more indexes are consulted (in a complex
       process); they yield the address of the data item.

    6) Memory mapping - The operating system creates an association
       between the data in the file and a part of the main memory.
       The system supports accesses to the "mapped" main memory as 
       if they were accesses to the file's data.

       You can think of this process as if the system copied all of 
       the file contents to a large array residing in main memory, 
       but in order to conserve physical memory it is paging it in 
       and out as necessary.

    7) Byte stream - The system buffers file accesses and lets you
       read a specified number of bytes at a time.
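
 Memory mapping is easy to try out on most current systems. The sketch 
 below uses Python's standard mmap module; the scratch file and its 
 contents are made up for the example. Once the file is mapped, its 
 bytes are reached with ordinary array-style indexing and no explicit 
 read call:

```python
import mmap
import os
import tempfile

# Create a small scratch file to map (hypothetical contents).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"0123456789" * 100)          # 1000 bytes

with open(path, "rb") as f:
    # Length 0 maps the whole file; the system pages it in on demand.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        size = len(m)
        middle = m[495:505]               # a slice of the file, no read() call
        last = m[size - 1]                # a single byte comes back as an integer

os.remove(path)
print(size, middle, last)                 # 1000 b'5678901234' 57
```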


 Compatibility of different organizations and access methods:

    Access \ Organization   Sequential   Relative   Indexed
    ---------------------   ----------   --------   -------
    Sequential                  +           +          +
    Direct                      ?           +          ?
    Keyed                       -           -          +

    Physical address            +           +          +
    By block number             +           +          +
    Memory mapping              +           +          +
    Byte stream                 +           +          +
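
 The '+' for keyed access on an indexed organization can be illustrated 
 with a toy model. In the Python sketch below the cells have different 
 sizes and a dictionary plays the role of the index. Real indexed files 
 keep their indexes on disk in a more elaborate structure; the function 
 names and the sample data are invented for the example:

```python
import io

def build_indexed_file(items):
    """Write variable-size cells back to back and return (file, index).

    The index maps each key to the (offset, length) of its cell.
    """
    f = io.BytesIO()
    index = {}
    for key, data in items:
        index[key] = (f.tell(), len(data))
        f.write(data)
    return f, index

def keyed_read(f, index, key):
    offset, length = index[key]     # step 1: consult the index
    f.seek(offset)                  # step 2: go straight to the cell
    return f.read(length)

f, index = build_indexed_file([
    ("smith", b"John Smith, 12 Elm St."),
    ("cohen", b"Dana Cohen"),
    ("wu", b"Li Wu, PO Box 7"),
])
print(keyed_read(f, index, "cohen"))   # b'Dana Cohen'
print(sorted(index))                   # ['cohen', 'smith', 'wu']
```

 Walking the sorted keys gives sequential access in key order, which is 
 why the same organization supports both access methods.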



 The two basic types of file-systems
 -----------------------------------
 There are two major types of file-systems, characterized by the way 
 they implement file I/O: 


    Byte-oriented file systems
    --------------------------
       In this type of file-system, a file is considered as a sequence 
       of bytes, and the operating system supplies routines that can 
       read/write a specified number of bytes. 

       To have any structure in a file, system and application programs 
       accessing the file must adopt some convention that has to be 
       respected by all programs. For example, a line-feed (ASCII 10) 
       character in a file containing text may denote an end-of-line.

    Record-oriented file systems
    ----------------------------
       In these file-systems a file is a sequence of records of the same 
       type. A record is a sequence of data bytes together with control
       information about the record's size and maybe some attributes.
       The unit of I/O operations is one or more record(s).

       This structure is imposed by the system routines you call to 
       perform file I/O, which consistently interpret the data and 
       control information kept in the records. 

       Having all records in a file share the same type makes it 
       reasonable to use the terms "file type" and "record type" 
       interchangeably.

    Why use records?
    ----------------
       This is one of those questions that start "religious wars". 
       Various vendors (IBM, CDC, UNIVAC, DEC) made this design 
       decision, and Fortran adopted it. 

       Using structured files gives the filesystem more information 
       on the possible data save/retrieval requests. This information
       can be used to improve performance and simplify the runtime
       libraries. The performance enhancement is especially useful 
       when accessing large databases. 

       It's agreed that indexed and relative files that are useful for 
       creating databases benefit from a record-oriented filesystem.
       In fact as byte-oriented filesystems become dominant, database 
       vendors are driven to write their own filesystems.

       As for sequential files, even when trying to simplify file I/O 
       as much as possible (e.g. UNIX) you have to introduce some 
       structure in text files (text files using the <LF> character 
       are really like working with delimited-variable-size/stream-LF 
       records), so why not optimize the filesystem for it?

       We found it is useful to structure files intended for printing 
       or displaying on a terminal screen, as a sequence of variable 
       (or fixed) sized 'lines'.

       Only binary sequential files are left to be "justified", 
       but this is more difficult. The following points may 
       be relevant:

       1) Again, using records may help I/O buffering, on the 
          other hand there is surely some processing overhead.

       2) Using records you can "space" and "backspace" in 
          the file, but this could be done without records 
          if you had special I/O operations. 

       3) Using records makes it possible to create a text 
          file with embedded control characters, e.g. <LF>, 
          but you could invent a method to "escape" the 
          delimiting character...

       In other words, structuring files may make data retrieval simpler, 
       and data recording more flexible. It is clear in the case of indexed 
       and direct files, but it may also be true in the case of sequential 
       files.


 Files/Records types
 -------------------
 There are at least four basic types of files/records:


   Fixed-length (minimal structured)
   ---------------------------------
   All records contain the same number of bytes. A typical value is 
   512 bytes (disk block size). The size information is kept outside
   of the file, or assumed by convention. 

                        
   Counted-variable-size (popular)
   -------------------------------
   Contains any number of bytes up to a specified limit. Records are 
   prefixed by a count field indicating the number of bytes in the 
   record. The count field may comprise 2-4 bytes on disk files and 
   4 bytes on tape files, and its size sets the record size limit.
   The same prefix count field may be appended to the record to make
   it easier to 'step back' in the file.

   The count field is usually transparent to the user. On VMS and IRIX
   it can be read with the non-standard 'Q' format specifier and then 
   used in further reading of the record.
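
 The prefix/suffix scheme is simple enough to model directly. The Python 
 sketch below assumes a 2-byte unsigned little-endian count that does 
 not include itself; real systems differ in the size, signedness and 
 byte order of the field. The function names are made up for the 
 example. It shows how the trailing copy of the count makes it possible 
 to 'step back' one record:

```python
import io
import struct

COUNT = struct.Struct("<H")     # assumed: 2-byte unsigned count, little-endian

def write_record(f, data):
    """Write count, data, count; the trailing copy allows stepping backwards."""
    f.write(COUNT.pack(len(data)))   # raises struct.error above 65535 bytes
    f.write(data)
    f.write(COUNT.pack(len(data)))

def read_record(f):
    """Read the next record, or return None at end of file."""
    head = f.read(COUNT.size)
    if len(head) < COUNT.size:
        return None
    (n,) = COUNT.unpack(head)
    data = f.read(n)
    f.seek(COUNT.size, io.SEEK_CUR)  # skip the trailing count
    return data

def backspace(f):
    """Move from the start of one record to the start of the previous one."""
    f.seek(-COUNT.size, io.SEEK_CUR)             # trailing count of previous record
    (n,) = COUNT.unpack(f.read(COUNT.size))
    f.seek(-(n + 2 * COUNT.size), io.SEEK_CUR)   # back over count, data, count

f = io.BytesIO()
for rec in (b"short", b"a somewhat longer record", b""):
    write_record(f, rec)

f.seek(0)
first = read_record(f)
second = read_record(f)
backspace(f)                 # reposition before the second record
again = read_record(f)
print(first, second == again)
```

 The size of the count field sets the record size limit: with this 
 2-byte field, a record longer than 65535 bytes cannot be written.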


   Segmented (record package)
   --------------------------
   A FORTRAN-specific file type on VMS, an interesting invention that 
   removes the inherent size limitation of variable records.

   Every single segmented record consists of one or more counted variable 
   size records called 'segments'. The segmented record can have any length
   because each segment contains control information indicating its place
   (first, last, the only segment, none of these). The control information 
   is kept in the attribute field that comprises the first two bytes of each 
   segment (see below). 

   Segmented records are the default type on VMS when writing unformatted 
   sequential files with sequential access. This is probably because 
   the VMS count field is only 2 bytes long, is treated as a signed integer, 
   and counts bytes, so the maximal length of a single counted variable 
   size record is quite small (at most 32767 bytes). DEC's segmented 
   records remove this size limit 'barrier' of variable records.


   Delimited-variable-size (text files)
   ------------------------------------
   Variable length records whose length is indicated by explicit record 
   terminators embedded in the data and not by a count field. These 
   terminators are automatically added when you write records to a 
   stream-type file and removed when you read records. 
   Obviously stream files can't serve as structured binary files.

   There are at least 3 varieties of stream-type files:

      1) Stream            the terminator is the two character 
                           sequence <CR><LF> (ASCII 13,10)

      2) Stream/CR         the terminator is <CR> (ASCII 13)

      3) Stream/LF         the terminator is <LF> (ASCII 10)

   Stream files are used to store printable characters. They can't be 
   used to keep binary data, because a data byte that happens to equal 
   the terminator would be taken as the end of a record.
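
   The three stream varieties, and the trouble with binary data, can be 
   demonstrated with a small Python sketch. The names TERMINATORS, 
   write_records and read_records are invented for the example, and file 
   contents are modelled as byte strings:

```python
TERMINATORS = {"stream": b"\r\n", "stream_cr": b"\r", "stream_lf": b"\n"}

def write_records(records, kind):
    """Join records into file contents, adding the terminator after each."""
    term = TERMINATORS[kind]
    return b"".join(rec + term for rec in records)

def read_records(contents, kind):
    """Split file contents back into records, removing the terminators."""
    term = TERMINATORS[kind]
    parts = contents.split(term)
    return parts[:-1]     # in a well-formed file the last piece is empty

# Text survives a round trip in every variety.
text = [b"first line", b"", b"third line"]
for kind in TERMINATORS:
    assert read_records(write_records(text, kind), kind) == text

# Binary data does not: byte 10 inside a record looks like a terminator.
binary = [bytes([1, 10, 2])]
damaged = read_records(write_records(binary, "stream_lf"), "stream_lf")
print(damaged)            # [b'\x01', b'\x02']
```

   One record went in and two came out, which is the reason stream files 
   are kept for printable characters.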


    Record/Files types
    ==================

                      | Fixed-size | Variable  | Segmented | Stream
 ---------------------|------------|-----------|-----------|---------
 Binary capability    |    Yes     |   Yes     |    Yes    |   No
 ---------------------|------------|-----------|-----------|---------
 Size limitation      |    Yes     |   Yes     |    No     |   No
 ---------------------|------------|-----------|-----------|---------
 Processing overhead  |   Minimal  |           |           | Minimal
 ---------------------|------------|-----------|-----------|---------
 Portable?            |    Yes     |           |           |  Yes
 ---------------------|------------|-----------|-----------|---------


 Commercial file systems/compiler support
 ----------------------------------------
 FORTRAN 77 I/O is record oriented: reading and writing always start 
 at the beginning of a record and use an integral number of records.

 On byte oriented file-systems there is a problem, and the necessary 
 structure has to be somehow created. In this case formatted files are 
 usually some stream type, and support for unformatted files is done
 by the FORTRAN compiler. The compiler creates variable-length records
 by prefixing (and suffixing) a count-field to each write operation, 
 and reads the records using this information.

    VMS   is record-oriented and directly supports most record and 
          file types, except segmented files, which are supported by 
          the FORTRAN compiler. 

    UNIX  is byte-oriented. Text files are Stream/LF by convention 
          (the line-feed character is called the newline character). 
          Binary files are sequences of bytes, or fixed-length. 

          FORTRAN requires record-oriented I/O, so the FORTRAN compilers 
          have to supply it. Unformatted files are implemented as 
          counted-variable-length files (see table below).

     DOS  is byte-oriented, using Stream (CR/LF) for text files; 
          binary files are sequences of bytes or fixed-length.

    Macs  use Stream/CR for text files; binary files are sequences 
          of bytes or fixed-length.

     IBM  mainframes support a wealth of file access modes, including
          record-oriented and byte oriented.


 Formatted vs. unformatted - two ways to store numbers
 -----------------------------------------------------
 There are two different ways to represent integer and floating 
 point numbers: 

   1) UNFORMATTED - is just the binary representation the computer 
      uses in the CPU registers and in the main memory. Such files
      can be read/written efficiently because no translations are 
      needed, and they usually take less space on disk. 

   2) FORMATTED representation is the sequence of characters used to
      describe the number in some radix (usually 10). It may contain 
      the ten digits and a few more characters like '+', '-', or 'E'.

 For character data there should be no difference between the formatted
 and the unformatted methods of recording data.

 For example, the integer 1024 may be represented as:

    Unformatted representation
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
       3                   2                   1

    We assume here INTEGER*4 data type. Remember that: 1024 = 2 ** 10


    Formatted representation
    +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
    |0|0|1|1|0|0|0|1| |0|0|1|1|0|0|0|0| |0|0|1|1|0|0|1|0| |0|0|1|1|0|1|0|0|
    +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
     7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0

    The four digits '1', '0', '2', '4' are each kept in one byte; the 
    bytes are ordered here from left to right. 

    An excerpt from the ASCII table will help here:

            Character   ASCII value        ASCII in binary
            ---------   -----------        ---------------
              0             48                 00110000
              1             49                 00110001
              2             50                 00110010
              3             51                 00110011
              4             52                 00110100
              5             53                 00110101
              6             54                 00110110
              7             55                 00110111
              8             56                 00111000
              9             57                 00111001

              +             43                 00101011
              -             45                 00101101
              E             69                 01000101
    
    Floating point numbers will also need '+', '-', 'E' 
    (and the decimal point '.', ASCII 46)

 Formatted file contents are a sequence of printable characters and 
 may be viewed with a text editor etc. Unformatted files contain 
 sequences of compiler/machine dependent representations of data values.
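
 The two bit patterns drawn above for the integer 1024 can be reproduced 
 with a few lines of Python. The sketch assumes a 4-byte two's-complement 
 integer written most significant byte first, as in the drawing; the 
 helper name bits is made up for the example:

```python
import struct

n = 1024

unformatted = struct.pack(">i", n)      # 4-byte binary integer, big-endian
formatted = str(n).encode("ascii")      # one ASCII character per decimal digit

def bits(data):
    """Render bytes as groups of eight binary digits."""
    return " ".join(format(byte, "08b") for byte in data)

print(bits(unformatted))   # 00000000 00000000 00000100 00000000
print(bits(formatted))     # 00110001 00110000 00110010 00110100

# The binary form always takes 4 bytes; the character form grows with the value.
print(len(struct.pack(">i", 1000000)), len(str(1000000)))   # 4 7
```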


 Formatted/Unformatted I/O methods
 ---------------------------------
 When reading/writing unformatted files it is enough to specify the
 variables you wish to read/write. Since you are using the 
 compiler-defined internal representations, the compiler already has 
 all the information it needs.

 Reading/Writing formatted files requires specifying more information,
 e.g. how many digits should be displayed for the mantissa of a REAL
 number. The universal solution is a FORMAT SPECIFICATION.


 Text and binary files
 ---------------------
 A useful (?) classification - can be described in simple language as 
 human and machine readable files (this definition makes some excellent
 programmers who can freely read hex dumps belong to the machines class,
 but they would probably take this as a compliment).

    Files for people - text files, are supposed to be record-oriented 
       (see below) and to contain only printable characters, and maybe 
       a few control characters like 'form-feed'. 

       Text files may be written and read with an editor program, 
       printed, emailed, etc.

    Files for machines - binary files, are allowed to contain every 
       possible character, and have minimal internal structure. 
       Executable program files are a good example.

 A related distinction is between binary and text I/O operations
 (e.g. with FTP transfer modes):

    Binary I/O operations use the content of files as is, text I/O may 
    perform some kind of translation. For example, 'end of line' 
    characters may have to be translated to another 'magic combination' 
    or start a new record when reading, and translated back on writing.
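
    The difference between the two kinds of operation can be sketched 
    in Python. The function name to_local_text and the sample data are 
    invented for the example; it models a text-mode transfer from a 
    CR/LF system to an LF system:

```python
def to_local_text(data, remote_eol, local_eol):
    """Text-mode transfer: replace the sender's end-of-line with the receiver's."""
    return local_eol.join(data.split(remote_eol))

dos_file = b"line one\r\nline two\r\n"

binary_copy = dos_file                                  # binary mode: bytes as is
text_copy = to_local_text(dos_file, b"\r\n", b"\n")     # text mode: CR/LF -> LF
print(len(binary_copy), len(text_copy))                 # 20 18

# The same translation silently damages data that is not text.
data = bytes([0, 13, 10, 255])
print(list(to_local_text(data, b"\r\n", b"\n")))        # [0, 10, 255]
```

    This is why binary files must be moved with binary operations only.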



    Unformatted files on different machines
    =======================================

              | VMS Segmented   |   VMS Variable   | IRIX | SunOS | OSF/1
 -------------|-----------------|------------------|------|-------|---------
 Default?     |      YES        |        NO        | YES  |  YES  | YES
 -------------|-----------------|------------------|------|-------|---------
 'Endianity'  |    little       |     little       | BIG  |  BIG  | little
 Byte-order   |     1234        |      1234        | 4321 | 4321  | 1234
 -------------|-----------------|------------------|------|-------|---------
 Count field  |       2         |        2         |  4   |   4   |  4
 size (Bytes) |                 |                  |      |       |
 -------------|-----------------|------------------|------|-------|---------
 Count field  | Signed integer  |  Signed integer  |      |       |
 integer type |                 |                  |      |       |
 -------------|-----------------|------------------|------|-------|---------
 Attribute    |      YES        |        NO        | NO   |  NO   | NO
 field        | see table below |                  |      |       |
 -------------|-----------------|------------------|------|-------|---------
 Suffix count |       NO        |        NO        | YES  |  YES  | YES
 field        |                 |                  |      |       |
 -------------|-----------------|------------------|------|-------|---------
 Alignment    | NULL at end of  | NULL at end of   | NO   |  NO   | NO
 padding      | odd size record | odd size record  |      |       |
 -------------|-----------------|------------------|------|-------|---------
 End-of-file  |  count = FFFF   | count = FFFF     | NO   |  NO   | NO
 marker       |                 | (not required)   |      |       |
 -------------|-----------------|------------------|------|-------|---------
 Comments     | The padding is  | The padding is   |      |       | 
              | not counted     | not counted      |      |       | 
 -------------|-----------------|------------------|------|-------|---------
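
 The 'Endianity' row matters as soon as a file moves between machines. 
 As a small worked example in Python, here is a 4-byte count field of 12 
 written in BIG-endian order and then read with each byte order. The 
 helper swap4 is made up for the example and only handles data made 
 entirely of 4-byte items:

```python
import struct

count_on_disk = struct.pack(">i", 12)      # bytes 00 00 00 0C, big-endian order

as_big = struct.unpack(">i", count_on_disk)[0]
as_little = struct.unpack("<i", count_on_disk)[0]
print(as_big, as_little)                   # 12 201326592

def swap4(data):
    """Reverse the byte order of every 4-byte item in `data`."""
    return b"".join(data[i:i + 4][::-1] for i in range(0, len(data), 4))

print(struct.unpack("<i", swap4(count_on_disk))[0])   # 12
```

 A count of 201326592 where 12 was expected is a typical sign that a 
 file was written on a machine with the other byte order.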


 A UNIX example
 --------------
 Let's initialize the variables to convenient values: 

      program unfor
      REAL 	A,B,C
      INTEGER	D,E,F
      a = 1.0
      b = 2.0
      c = 3.0
      d = 1
      e = 2
      f = 3
      WRITE(2) A,B,C
      WRITE(2) D,E,F
      end

 Compiling and running on a Sun machine, we get a file called fort.2.  
 Using "od" (Octal Dump) with a suitable option, we get the following
 result (in decimal!):

   od -l fort.2
   0000000           12  1065353216  1073741824  1077936128
   0000020           12          12           1           2
   0000040            3          12
   0000050

 The strange long integers are the 3 floats. With another od option 
 they come out right (again, it's base 10):

   od -f fort.2
   0000000   1.6815582e-44  1.0000000e+00  2.0000000e+00  3.0000000e+00
   0000020   1.6815582e-44  1.6815582e-44  1.4012985e-45  2.8025969e-45
   0000040   4.2038954e-45  1.6815582e-44
   0000050

 Let's omit the offset column of the first dump, and rearrange each 
 record on a separate line:

   12          1065353216  1073741824  1077936128          12
   12                   1           2           3          12

 We can see that every data record is prefixed and suffixed by the 
 count field (4 bytes interpreted as a signed or unsigned integer, 
 perhaps you find out and tell me?).  

 Note that the count doesn't include itself!
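
 The layout seen in the dump can be rebuilt and parsed outside FORTRAN. 
 The Python sketch below assumes what the dump suggests: 4-byte 
 big-endian counts before and after the data, and IEEE single precision 
 REALs. The counts are read as signed integers, which is an assumption. 
 The function names are made up, and an in-memory buffer stands in for 
 fort.2:

```python
import io
import struct

def write_unformatted(f, payload):
    """Mimic one unformatted WRITE: 4-byte count, data, 4-byte count."""
    count = struct.pack(">i", len(payload))
    f.write(count + payload + count)

def read_unformatted(f):
    """Return the data of the next record, or None at end of file."""
    head = f.read(4)
    if len(head) < 4:
        return None
    (n,) = struct.unpack(">i", head)
    payload = f.read(n)
    (tail,) = struct.unpack(">i", f.read(4))
    if tail != n:
        raise ValueError("prefix and suffix counts differ")
    return payload

# Rebuild the bytes produced by WRITE(2) A,B,C and WRITE(2) D,E,F
f = io.BytesIO()
write_unformatted(f, struct.pack(">3f", 1.0, 2.0, 3.0))
write_unformatted(f, struct.pack(">3i", 1, 2, 3))

# The view "od -l" gave: every 4 bytes taken as an integer.
raw = f.getvalue()
print(struct.unpack(">10i", raw))

# The view a FORTRAN READ gives: only the data, counts hidden.
f.seek(0)
print(struct.unpack(">3f", read_unformatted(f)))   # (1.0, 2.0, 3.0)
print(struct.unpack(">3i", read_unformatted(f)))   # (1, 2, 3)
```

 The first print reproduces the ten integers of the dump, including the 
 four count fields with the value 12.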



 Some remarks on VMS
 -------------------
 The VMS 'End Of File' marker is the count field of the 'one after
 the last' record. Count fields on VMS are 2-byte signed integers, so
 FFFF (hex) is equal to -1, an impossible value for a count.
 The VMS EOF marker is not important, as the offset to the end
 of the file is kept in the file header (in [000000]INDEXF.SYS).
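
 The arithmetic is quickly checked in Python: the same two bytes give 
 -1 when read as a signed 2-byte integer and 65535 when read as 
 unsigned (the variable names are made up for the example):

```python
import struct

marker = b"\xff\xff"                            # the two bytes of the EOF count
as_signed = struct.unpack("<h", marker)[0]      # read as a 2-byte signed integer
as_unsigned = struct.unpack("<H", marker)[0]    # the same bits read as unsigned
print(as_signed, as_unsigned)                   # -1 65535
```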

   Segment control flags (attribute field values) on VMS
   =====================================================
   0000     None of the following (i.e. continuation segment)
   0001     First segment
   0002     Last segment
   0003     One and only segment (First + Last)

 An example of a segmented record composed of 3 variable-size
 records. The count field (and length) of each sub-record may 
 be different:

      +----+----+------+  +----+----+------+  +----+----+------+    
      |nnnn|0001| data |  |nnnn|0000| data |  |nnnn|0002| data |
      +----+----+------+  +----+----+------+  +----+----+------+  
        /\   /\
        ||   ||
     Count  Attribute
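
 The idea of segmenting can be tried out with a toy model in Python. 
 This is not a byte-exact reproduction of the VMS format: it assumes 
 a 2-byte signed count that covers the data only, followed by the 
 2-byte attribute field, and it ignores alignment padding. The function 
 names are made up for the example:

```python
import struct

FIRST, LAST = 1, 2                  # flag values 0001 and 0002 from the table above
HEADER = struct.Struct("<hH")       # assumed: 2-byte signed count, 2-byte flags

def split_into_segments(data, max_data):
    """Cut one logical record into segments of at most max_data data bytes."""
    chunks = [data[i:i + max_data] for i in range(0, len(data), max_data)] or [b""]
    segments = []
    for k, chunk in enumerate(chunks):
        flags = 0
        if k == 0:
            flags |= FIRST
        if k == len(chunks) - 1:
            flags |= LAST
        segments.append(HEADER.pack(len(chunk), flags) + chunk)
    return segments

def join_segments(segments):
    """Rebuild the logical record from its segments, checking the flags."""
    data = b""
    for k, seg in enumerate(segments):
        count, flags = HEADER.unpack_from(seg)
        is_first, is_last = (k == 0), (k == len(segments) - 1)
        if bool(flags & FIRST) != is_first or bool(flags & LAST) != is_last:
            raise ValueError("segment flags out of sequence")
        data += seg[HEADER.size:HEADER.size + count]
    return data

record = bytes(range(256)) * 10                     # 2560 data bytes
segments = split_into_segments(record, 1000)
print([HEADER.unpack_from(s) for s in segments])    # [(1000, 1), (1000, 0), (560, 2)]
print(join_segments(segments) == record)            # True
```

 A record that fits in one segment gets the flag value 3 (first + last), 
 and a longer record simply gets more continuation segments, so the 
 logical record length is not bounded by the size of the count field.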


 Control information transparency (unformatted files)
 ----------------------------------------------------
 The count field(s) is USER-TRANSPARENT on all systems. That means 
 that as long as you use FORTRAN I/O statements to read disk, diskette, 
 or tape files that were written using the corresponding FORTRAN I/O 
 statements on the same machine/OS combination, you don't have to 
 know anything about the count field(s). 

 The various VMS control fields are also user-transparent in that sense.

 When support for FORTRAN files is provided by the compiler, the said
 user-transparency doesn't extend to programs written in languages other 
 than FORTRAN. For example VMS C programs can read/write unformatted 
 FORTRAN files (except segmented), but UNIX C can't read/write such files.

 The transparency also breaks down when you port unformatted files between 
 different systems, or read/write them with system routines. Porting 
 FORTRAN unformatted files between different systems is covered in 
 another chapter. 

    Remarks on VMS
    --------------
    Segmented records and files are not supported by the 
    record management services (RMS).

    All control fields (except the segment attribute field) are 
    transparent to the user when reading/writing files in the 
    'normal' way - with RMS record I/O routines. 

    The control fields will be visible if you use RMS block I/O, 
    or read physical/logical/virtual blocks with SYS$QIO(W).

    ANSI labeled tapes written on VMS don't contain the VMS 
    control fields (except the segment attribute field).


 
 A remark
 --------
 Some file terms are often encountered, and the meaning is not always 
 clear. The following list is an ATTEMPT to define these terms. Please 
 correct me if I'm wrong.

    ASCII FILE           Same as TEXT FILE?
    ANSI FILE            Contains printable characters and ANSI 
                         formatting escape sequences. May be 
                         displayed on an ANSI terminal.
    END-OF-FILE (EOF)    
    END-OF-RECORD (EOR)  
    END-OF-LINE (EOL)