NOAA/OERD
Data Decoding Project: e1 Compression Format
This project was started in September 2008. The objective was to
decode waveform data (e.g. seismic data) stored in the CSS e1 format.
The data were received from the CTBTO.
The e1 data format uses a variable-difference, Steim-type compression
method to store the data. It encodes the data by storing the differences
between consecutive samples. Because neighboring waveform samples do not
vary much, their differences should be small. Packing the bit patterns
of these small numbers reduces the data file size.
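The differencing idea described above can be illustrated with a minimal Python sketch. The helper name `differences` and the sample values are made up for illustration; they are not part of the e1 format or of any existing program.

```python
def differences(samples):
    """Keep the first sample, then store each sample minus its predecessor."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

# Made-up waveform values that change slowly from one sample to the next.
samples = [1000, 1003, 1007, 1006, 1002]
print(differences(samples))  # [1000, 3, 4, -1, -4]
```

After the first value, every stored number is small and therefore needs only a few bits.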
The general idea of the e1 data compression can be found in the IDC
Documentation: Formats and Protocols for Continuous Data, CD-1.1,
Revision 0.2, Chapter 3, pages 33 to 38. Unfortunately, information
about e1 data compression was not widely available on the internet
(at least before September 2008). With help from Dr. Delwayne
Bohnenstiehl, enough information about the e1 data format was gathered
to solve the problem of decoding it.
File Format for the e1 Compressed Data
The CSS waveform files store the data in binary form, and the file
names have the suffix .w, e.g. CSSdata.w. When the e1 format is used,
the CSS binary file contains multiple records, each with its own record
length in bytes. The e1 data file format is as follows:
The first 8 bytes of each record contain the following information:
   Bytes 1-2 form an integer giving the size of the current record
   (Record_Size) in bytes, including the first 8 bytes.
   Bytes 3-4 form an integer giving the number of samples (N_Samples),
   i.e. the number of data points, in the current record.
   Byte 5 gives the number of differences (N_DIFFER) used in
   compression. N_DIFFER is usually 1 or 2, but it can be 3.
   Bytes 6-8 form a 24-bit unsigned integer, the Check_Value.
Note that the instrument that encoded the e1 data may have used either
a big-endian or a little-endian format to represent integer data.
Unless the decoding computer uses the same endian format, the bytes of
the integer values Record_Size, N_Samples and Check_Value must be
swapped.
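As a rough sketch of reading the 8-byte header, the Python below uses the standard `struct` module, which handles byte order without manual swapping. The function name `parse_e1_header` and its `byteorder` parameter are hypothetical, and treating Record_Size and N_Samples as unsigned 16-bit values is an assumption, since the text above only says "integer".

```python
import struct

def parse_e1_header(header, byteorder=">"):
    """Split the 8-byte record header into its four fields.

    byteorder is a struct prefix: ">" for big-endian, "<" for little-endian.
    Assumption: Record_Size and N_Samples are unsigned 16-bit integers.
    """
    if len(header) < 8:
        raise ValueError("an e1 record header needs 8 bytes")
    # Two 16-bit integers followed by one single byte (N_DIFFER).
    record_size, n_samples, n_differ = struct.unpack(byteorder + "HHB", header[:5])
    # struct has no 3-byte type, so the 24-bit Check_Value is read separately.
    order = "big" if byteorder == ">" else "little"
    check_value = int.from_bytes(header[5:8], order)
    return record_size, n_samples, n_differ, check_value

# Made-up header: 16-byte record, 3 samples, N_DIFFER = 2, Check_Value = 42.
print(parse_e1_header(bytes([0x00, 0x10, 0x00, 0x03, 0x02, 0x00, 0x00, 0x2A])))
```

The Check_Value returned here is still unsigned; it has to be converted to a signed value before it is compared with a decoded sample.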
Also, the Check_Value is an unsigned integer and must be converted into
a signed (+ and -) range before it can be checked against the last
uncompressed sample of the current record. To do the conversion, when
the Check_Value is greater than or equal to 2^23, subtract 2^24 from it
( Check_Value - 2^24 ). After the conversion, the Check_Value can be
compared to the last uncompressed sample of the current record. The two
values must be the same; otherwise, there is an error in the procedure.
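The conversion of the Check_Value can be sketched in Python as below. The helper name `to_signed` is hypothetical. The sketch follows the usual two's-complement convention, under which the value 2^23 itself maps to -2^23.

```python
def to_signed(value, bits):
    """Interpret an unsigned integer of the given bit width as two's complement."""
    if value >= 1 << (bits - 1):   # top bit set, so the value is negative
        value -= 1 << bits
    return value

print(to_signed(0xFFFFFE, 24))  # -2
print(to_signed(0x000005, 24))  # 5
```

The same helper works for the 7-, 9-, 10-, 12-, 15- and 28-bit samples by changing the `bits` argument.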
The Compressed Bit-Patterns of the Data Points:
Once the current Record_Size in bytes is known, the next N = (
Record_Size - 8 ) bytes can be retrieved; they contain the N_Samples
compressed data points. The data are compressed and coded into either
4-byte or 8-byte words. The first 1, 2, or 4 bits of each word define
the format of that word.
    Leading bits   Samples per word     Bit count           Word size
1)  0              7   9-bit samples    1 + 7 x  9 = 64     8 bytes
2)  10             3  10-bit samples    2 + 3 x 10 = 32     4 bytes
3)  1100           4   7-bit samples    4 + 4 x  7 = 32     4 bytes
4)  1101           5  12-bit samples    4 + 5 x 12 = 64     8 bytes
5)  1110           4  15-bit samples    4 + 4 x 15 = 64     8 bytes
6)  1111           1  28-bit sample     4 + 1 x 28 = 32     4 bytes
For example, if the first 4 bits of the 9th byte (the byte immediately
after the 8-byte header above) match case 6), then the second half of
the 9th byte and all bits of the 10th to 12th bytes form one 28-bit
unsigned integer. This integer must be byte swapped if the current
machine's endian format differs from that of the original encoding
instrument, and then converted into the + and - range.
The next group of data samples begins at the 13th byte. Its length will
be either 4 or 8 bytes, depending on the leading bit pattern listed in
1) to 6) above. This process is repeated until all N bytes are
processed.
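The word-by-word unpacking described above might look roughly like the following Python sketch. All names (`FORMATS`, `to_signed`, `unpack_words`) are hypothetical. It assumes each word is read most-significant byte first, that the samples follow the leading bits from the most significant end of the word, and that samples are two's-complement values; none of this is taken from the original decoding program.

```python
# (leading bits, number of leading bits, word bytes, samples per word, bits per sample)
FORMATS = [
    (0b0,    1, 8, 7, 9),
    (0b10,   2, 4, 3, 10),
    (0b1100, 4, 4, 4, 7),
    (0b1101, 4, 8, 5, 12),
    (0b1110, 4, 8, 4, 15),
    (0b1111, 4, 4, 1, 28),
]

def to_signed(value, bits):
    """Interpret an unsigned integer of the given bit width as two's complement."""
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value

def unpack_words(payload):
    """Decode the Record_Size - 8 payload bytes into signed difference values."""
    out = []
    pos = 0
    while pos < len(payload):
        top4 = payload[pos] >> 4              # first 4 bits of the word
        # Exactly one entry matches, because the six prefixes cover all patterns.
        for prefix, plen, nbytes, count, width in FORMATS:
            if top4 >> (4 - plen) == prefix:
                break
        if pos + nbytes > len(payload):
            raise ValueError("payload ends in the middle of a word")
        word = int.from_bytes(payload[pos:pos + nbytes], "big")
        mask = (1 << width) - 1
        for k in range(count):
            # Samples are packed left to right after the leading bits.
            shift = nbytes * 8 - plen - (k + 1) * width
            out.append(to_signed((word >> shift) & mask, width))
        pos += nbytes
    return out

# One case-6 word: leading bits 1111 followed by the 28-bit value -3.
print(unpack_words(bytes([0xFF, 0xFF, 0xFF, 0xFD])))  # [-3]
```

The values returned are still differences, not waveform samples. If a record yields more values than N_Samples, trimming the extras is left to the caller in this sketch.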
Decompress the Samples into the Original Data Points
After the whole series of data points (based on the record in the
CSS*.wfdisc file) has been retrieved, it must be decompressed using the
following looping process.
   FOR S = 1, N_DIFFER DO BEGIN        ; where N_DIFFER = 1 or 2 mostly, or 3
      FOR i = 1, N_DATA - 1 DO BEGIN
         DATA[i] = DATA[i] + DATA[i-1]
      ENDFOR
   ENDFOR                              ; where DATA[i], i = 0, 1, ..., N_DATA - 1
Note that because there is no way to know in advance what N_DIFFER will
be, the two nested loops above cover all situations, but they take more
time to run.
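For readers who do not use IDL, the same looping process can be written as a short Python sketch. The function name `integrate` is hypothetical, and the input values are made up.

```python
def integrate(data, n_differ):
    """Undo n_differ rounds of differencing with repeated running sums."""
    out = list(data)                  # work on a copy, leave the input intact
    for _ in range(n_differ):
        for i in range(1, len(out)):
            out[i] += out[i - 1]
    return out

# First differences back to samples (N_DIFFER = 1).
print(integrate([1000, 3, 4, -1, -4], 1))  # [1000, 1003, 1007, 1006, 1002]
# Second differences back to samples (N_DIFFER = 2).
print(integrate([10, -8, 1, 1], 2))        # [10, 12, 15, 19]
```

The last element of the result is the value that should match the signed Check_Value of the record.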
The information on the decoding process above comes from the Matlab
program cnv_e1.m, written by Mark Harris (mharris@sandia.gov),
Copyright (c) 1996-2001 Sandia National Laboratories. All rights
reserved. The program is part of the SEIA software: MatSeis 1.6.
Important Points
Byte swapping, and converting unsigned integers of different bit sizes
into their respective + and - ranges.
The document above shows only the steps needed to decode the data. Not
all of the details or the mathematical concepts are shown. Also, please
read the disclaimer below.
Disclaimer:
The contents of this page are intended as documentation for the e1 Data
Decoding project and of how e1 compressed data can be decoded. It is
not a full how-to manual for the public.