NOAA/OERD Data Decoding Project: e1 Compression Format

This project was started in September 2008.  The objective was to decode waveform data (e.g. seismic data) stored in the CSS e1 format.  The data were received from the CTBTO.

The e1 data format uses a variable-difference, Steim-type compression method to store the data.  It encodes the data by storing the differences between consecutive samples.  Because neighboring waveform samples do not vary much, their differences should be small.  Packing these small numbers into short bit patterns reduces the data file size.
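The idea of storing consecutive differences can be illustrated with a minimal Python sketch.  The function name and the sample values are made up for illustration; this shows only plain first differences, not the e1 bit packing itself.

```python
def first_differences(samples):
    """Keep the first sample, then store each sample minus its predecessor."""
    if not samples:
        return []
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]


# Hypothetical waveform values: large numbers, small changes between neighbors.
samples = [1000, 1003, 1001, 1004, 1010]
print(first_differences(samples))  # [1000, 3, -2, 3, 6]
```

After the first value, every stored number is small enough to fit in a few bits, which is what makes the packing worthwhile.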

The general idea of e1 data compression can be found in the IDC Documentation: Formats and Protocols for Continuous Data, CD-1.1, Revision 0.2, Chapter 3, pages 33 to 38.  Unfortunately, information about e1 data compression was not widely available on the internet (at least before September 2008).  With help from Dr. Delwayne Bohnenstiehl, enough information about the e1 data format was gathered to solve the problem of decoding it.

File Format for the e1 Compressed Data

The CSS waveform files store the data in binary form, and the file names have the suffix w, e.g. CSSdata.w.  When the e1 format is used, the CSS binary file contains multiple records, each with its own record length in bytes.  The e1 data file format is as follows:

The First 8 Bytes of each record contain the following information:

The 1st 2 Bytes form an integer that gives the Size of the Current Record (Record_Size) in Bytes, including the first 8 bytes.
The 2nd 2 Bytes form an integer that gives the Number of Samples (N_Samples), i.e. the Number of Data Points, in the Current Record.

The 5th Byte gives the Number of Differences (N_DIFFER) Used in Compression.
N_DIFFER is usually 1 or 2, but it can also be 3.

The Last 3 Bytes form a 24-Bit Unsigned integer, the Check_Value.
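The header layout above can be sketched in Python as follows.  The function name, the byte_order parameter, and the treatment of Record_Size and N_Samples as unsigned 16-bit values are assumptions made for illustration; the byte order actually used must be checked against the data.

```python
import struct


def parse_e1_header(header, byte_order=">"):
    """Split an 8-byte record header into its four fields.

    byte_order is a struct prefix: ">" for big-endian, "<" for little-endian.
    Record_Size and N_Samples are assumed here to be unsigned.
    """
    if len(header) != 8:
        raise ValueError("an e1 record header is exactly 8 bytes")
    record_size, n_samples, n_differ = struct.unpack(byte_order + "HHB", header[:5])
    # The last 3 bytes form the 24-bit unsigned Check_Value.
    check_value = int.from_bytes(header[5:8], "big" if byte_order == ">" else "little")
    return record_size, n_samples, n_differ, check_value


# Hypothetical header: 16-byte record, 7 samples, N_DIFFER = 2.
header = bytes([0x00, 0x10, 0x00, 0x07, 0x02, 0xFF, 0xFF, 0xFE])
print(parse_e1_header(header))  # (16, 7, 2, 16777214)
```

Passing the byte order explicitly avoids depending on the endian format of the decoding machine.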


Note that the instrument that encoded the e1 data may have used either Big- or Little-Endian format to represent integer data.  If the decoding computer uses a different Endian format, the byte orders of the integer values Record_Size, N_Samples and Check_Value must be Swapped.

Also, the Check_Value is stored as an unsigned integer, and it must be converted into a signed (+ and -) range before it can be checked against the Last Uncompressed Sample of the Current Record.  To do the conversion, when the Check_Value is 2^23 or greater, make it negative by computing ( Check_Value - 2^24 ).  After the conversion, the Check_Value can be compared with the Last Uncompressed Sample of the Current Record.  The two values must be the same; otherwise, there is an error in the procedure.
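The conversion of the Check_Value can be sketched in Python as below.  This assumes the value is a 24-bit two's-complement number, so that every value with the top bit set (2^23 and above) is negative; the helper name to_signed is made up for illustration.

```python
def to_signed(value, bits):
    """Interpret an unsigned integer of the given bit width as two's complement."""
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


print(to_signed(0xFFFFFE, 24))  # -2
print(to_signed(5, 24))         # 5
```

The same helper works for any bit width, so it can also be applied to the packed samples of other sizes.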

The Compressed Bit-Patterns of the Data Points:

Once the current Record_Size in Bytes is known, the next N = ( Record_Size - 8 ) bytes can be retrieved; they contain the N_Samples compressed data points.  The data are compressed and coded into either 4-byte or 8-byte words.  The first 1, 2, or 4 bits of each word define the format of that word.

     Leading Bits
1)   0      7   9-bit samples = 1 + 7 x  9 = 64 Bits = 8-Byte word to store Seven  9-bit data samples.
2)   10     3  10-bit samples = 2 + 3 x 10 = 32 Bits = 4-Byte word to store Three 10-bit data samples.
3)   1100   4   7-bit samples = 4 + 4 x  7 = 32 Bits = 4-Byte word to store Four   7-bit data samples.
4)   1101   5  12-bit samples = 4 + 5 x 12 = 64 Bits = 8-Byte word to store Five  12-bit data samples.
5)   1110   4  15-bit samples = 4 + 4 x 15 = 64 Bits = 8-Byte word to store Four  15-bit data samples.
6)   1111   1  28-bit sample  = 4 + 1 x 28 = 32 Bits = 4-Byte word to store One   28-bit data sample.
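The six cases can be turned into a small lookup, sketched here in Python.  The table and function names are hypothetical, and the sketch assumes that the leading bits are the most significant bits of the first byte of each word.

```python
# (prefix value, prefix length in bits, samples per word, bits per sample, word size in bytes)
WORD_FORMATS = [
    (0b0,    1, 7,  9, 8),
    (0b10,   2, 3, 10, 4),
    (0b1100, 4, 4,  7, 4),
    (0b1101, 4, 5, 12, 8),
    (0b1110, 4, 4, 15, 8),
    (0b1111, 4, 1, 28, 4),
]


def classify_word(first_byte):
    """Return (prefix_len, samples, bits_per_sample, word_bytes) for a word's first byte."""
    for prefix, prefix_len, count, width, word_bytes in WORD_FORMATS:
        if first_byte >> (8 - prefix_len) == prefix:
            return prefix_len, count, width, word_bytes
    raise ValueError("no format matches")  # not reachable for values 0..255


print(classify_word(0xF0))  # (4, 1, 28, 4)
print(classify_word(0x3A))  # (1, 7, 9, 8)
```

Every possible first byte matches exactly one case, because the prefixes 0, 10, 1100, 1101, 1110 and 1111 cover all bit patterns without overlap.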


For example, if the first 4 bits of the 9th byte (the first byte after the 8-byte header) match case 6) above, then the second half of the 9th byte and all bits of the 10th to 12th bytes form one 28-bit unsigned integer.  This integer must be byte swapped if the current machine's endian format differs from that of the original encoding instrument, and it must then be converted into a signed (+ and -) range.

The next group of data samples begins at the 13th byte.  Its length is either 4 or 8 bytes, depending on which of the leading bit patterns listed in 1) to 6) above it starts with.  This process is repeated until all N bytes are processed.
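The word-by-word walk through the N bytes might look like the following Python sketch.  It rests on several assumptions that are not stated on this page and should be verified against real data: each word is stored most-significant byte first, the samples follow the leading bits from the most significant end, and each sample is a two's-complement number of its stated width.  The function name is made up.

```python
def unpack_e1_words(payload):
    """Unpack signed samples from a run of 4-byte and 8-byte e1 words."""
    samples = []
    pos = 0
    while pos < len(payload):
        first = payload[pos]
        if first < 0x80:                      # leading bit 0
            prefix_len, count, width, nbytes = 1, 7, 9, 8
        elif first < 0xC0:                    # leading bits 10
            prefix_len, count, width, nbytes = 2, 3, 10, 4
        else:                                 # leading bits 1100 to 1111
            prefix_len = 4
            count, width, nbytes = {
                0xC: (4, 7, 4),
                0xD: (5, 12, 8),
                0xE: (4, 15, 8),
                0xF: (1, 28, 4),
            }[first >> 4]
        chunk = payload[pos:pos + nbytes]
        if len(chunk) < nbytes:
            raise ValueError("payload ends in the middle of a word")
        word = int.from_bytes(chunk, "big")   # assumed big-endian words
        for k in range(count):
            shift = nbytes * 8 - prefix_len - (k + 1) * width
            raw = (word >> shift) & ((1 << width) - 1)
            if raw >= 1 << (width - 1):       # convert to the + and - range
                raw -= 1 << width
            samples.append(raw)
        pos += nbytes
    return samples


# Hypothetical case 6) word: leading bits 1111, then the 28-bit value 5.
print(unpack_e1_words(bytes([0xF0, 0x00, 0x00, 0x05])))  # [5]
```

Whether the final word of a record can carry unused sample slots is not stated here, so the sketch returns everything it unpacks and leaves any trimming to N_Samples to the caller.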

Decompress the Samples into the Original Data Points

After the whole series of Data Points (based on the record in the CSS*.wfdisc file) has been retrieved, it must be decompressed using the following looping process.

  FOR S = 1, N_DIFFER DO  BEGIN  ; where N_DIFFER is usually 1 or 2, sometimes 3.
      FOR i = 1, N_DATA - 1 DO  BEGIN
          DATA[i] = DATA[i] + DATA[i-1]
      ENDFOR
  ENDFOR   ; where DATA[i], i = 0, 1, ..., N_DATA - 1.

Note that because there is no way to know in advance what N_DIFFER will be, the two nested loops above cover all situations, but they take more time to run.
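For readers who do not use IDL, the same looping process can be written in Python as a minimal sketch.  The function name and the sample values are made up for illustration.

```python
def integrate(data, n_differ):
    """Undo n_differ levels of differencing with repeated running sums."""
    out = list(data)  # work on a copy so the input is left unchanged
    for _ in range(n_differ):
        for i in range(1, len(out)):
            out[i] += out[i - 1]
    return out


print(integrate([1000, 3, -2, 3, 6], 1))  # [1000, 1003, 1001, 1004, 1010]
print(integrate([5, 1, 1, 1], 2))         # [5, 11, 18, 26]
```

Each pass of the outer loop removes one level of differencing, so N_DIFFER = 2 simply runs the running sum twice.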

The information about the Decoding process above comes from the Matlab program cnv_e1.m, written by Mark Harris, mharris@sandia.gov.  Copyright (c) 1996-2001 Sandia National Laboratories. All rights reserved.  The program is part of the SEIA software: MatSeis 1.6.

Important Points

The two points that need the most care are Byte Swapping and Converting unsigned integers of different bit sizes into their respective signed (+ and -) ranges.

The document above shows only the steps needed to decode the data.  Not all of the details or the mathematical concepts are shown.  Also, please read the disclaimer below.


Disclaimer:

The contents of this page are intended as documentation for the e1 Data Decoding project and of how e1 Compressed data can be decoded.  It is Not a Full How-To Manual for the public.
