The joys of writing a WAV file
I recently had an interesting challenge. I needed to write out a WAV file from GNURadio. Any output in GNURadio goes through something called a "sink". There is a built in WAV file sink, but it writes out a single WAV file. What I needed was to write out a series of WAV files, each covering the same time period. The purpose of this was to allow me to create WAV files containing entire FT8 transmissions which I could later decode. The WAV file format is uncompressed, so it can be written out quickly without using much CPU.
So I needed to write my own sink for this to happen. I assumed I would be able to use a library to handle the WAV file format details, but it did not turn out that way. Python has a built in wave library, but it doesn't do everything I wanted. So this describes what I had to learn along the way. GNURadio is written primarily in C++, but you can also extend it using Python, so I opted to write my code in Python. My use case is straightforward. Since I am using this to store audio from a radio receiver, I only need to write a single audio channel at a constant sample rate.
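For context, a custom sink in GNURadio's Python API is a class that derives from gr.sync_block and receives its samples in a work() method. This is only a rough skeleton with made-up names, not my actual sink code:

import numpy
from gnuradio import gr


class timed_wav_sink(gr.sync_block):
    """Rough sketch of a sink block that consumes float audio samples."""

    def __init__(self, sample_rate):
        # A sink declares an input signature but no outputs
        gr.sync_block.__init__(self, name='timed_wav_sink',
                               in_sig=[numpy.float32], out_sig=None)
        self.sample_rate = sample_rate

    def work(self, input_items, output_items):
        samples = input_items[0]  # numpy array of floats in [-1.0, 1.0]
        # ... hand the samples off to the WAV writer here ...
        return len(samples)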
What is the WAV file format anyways?
As it turns out, there is no such thing as a standalone WAV file format. Instead, WAV data is actually contained within a RIFF file. The way this works is you write out a RIFF file that indicates it contains a WAVE file. RIFF is always written in little-endian byte order. A RIFF file consists of the following:
- chunk descriptor
- fmt sub-chunk
- data sub-chunk
All of these sections are mandatory. This table describes the start of all of those sections:
Reference | Section | Byte offset | Length | Type | Description |
---|---|---|---|---|---|
A | chunk descriptor | 0 | 4 | ASCII string | Chunk ID |
B | chunk descriptor | 4 | 4 | uint32 | Chunk Size |
C | chunk descriptor | 8 | 4 | ASCII string | Chunk Format |
D | fmt | 12 | 4 | ASCII string | sub chunk ID |
E | fmt | 16 | 4 | uint32 | sub chunk size |
F | fmt | 20 | 2 | uint16 | audio format |
G | fmt | 22 | 2 | uint16 | number of channels |
H | fmt | 24 | 4 | uint32 | sample rate |
J | fmt | 28 | 4 | uint32 | byte rate |
K | fmt | 32 | 2 | uint16 | block alignment |
M | fmt | 34 | 2 | uint16 | bits per sample |
P | data | 36 | 4 | ASCII string | sub chunk ID |
Q | data | 40 | 4 | uint32 | sub chunk size |
After the final item comes the actual data samples.
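Before going through the fields one by one, here is a sketch of how this entire 44-byte header could be packed in one call to Python's struct module. The values are for my case of mono, 16-bit audio at 8000 Hz, and the zero data size is just a stand-in for the placeholder discussed later; this isn't the code from my sink, only an illustration of the layout:

import struct

sample_rate = 8000
num_channels = 1
bits_per_sample = 16
data_size = 0  # stand-in; the real value isn't known until the file is closed

header = struct.pack(
    '<4sI4s4sIHHIIHH4sI',
    b'RIFF', 36 + data_size,                            # A, B: 36 header bytes after this field, plus the data
    b'WAVE',                                            # C
    b'fmt ', 16,                                        # D, E
    1,                                                  # F: audio format (linear quantization)
    num_channels,                                       # G
    sample_rate,                                        # H
    sample_rate * num_channels * bits_per_sample // 8,  # J: byte rate
    num_channels * bits_per_sample // 8,                # K: block alignment
    bits_per_sample,                                    # M
    b'data', data_size,                                 # P, Q
)
assert len(header) == 44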
Item A - Chunk ID
This is always the ASCII string "RIFF" without a terminator.
Item B - Chunk size
This is a 32 bit unsigned integer that is the size of the file, minus the first 8 bytes.
Item C - Chunk format
This is always the ASCII string "WAVE" without a terminator.
Item D - Sub chunk ID
This is always the ASCII string "fmt " without a terminator. Notice the trailing space.
Item E - Sub chunk size
This is the size of the sub chunk, minus the first 8 bytes. As a result it is always the value 16.
Item F - Audio format
This describes the audio format. This is always the value 1, which indicates "linear quantization". Samples in this format are stored completely uncompressed.
Item G - Number of channels
This value indicates the number of audio channels in the file. Since I only have a single audio channel, this value is always 1.
Item H - Sample rate
This value indicates the sample rate. The sample rate is the number of audio samples in exactly one second. For a regular audio file the sample rate is rarely higher than 44100 Hz, which is the sample rate used by compact discs. For my application I used a sample rate of 8000 Hz.
Item J - Byte Rate
This is the number of bytes required to store one second of audio. This is computed as
\((S * N * B)/8\)
Where
- \(S\) - sample rate, item H
- \(N\) - number of channels, item G. This is always 1 for my application
- \(B\) - bits per sample, item M. This is always 16
Since \(N\) and \(B\) are known in advance for my application, this can actually be computed as
\((S * 16)/8 = S*2\)
So this value is just twice the sample rate.
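For example, at my 8000 Hz sample rate this works out to \((8000 * 1 * 16)/8 = 16000\) bytes per second.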
Item K - Block alignment
This is the number of bytes for one sample. This is computed as
\((N * B)/8\)
where \(N\) and \(B\) have the same values as in Item J. Since N & B are both constants in my application, this is computed as
\(16/8 = 2\)
So it is always the value 2 in my application.
Item M - Bits per sample
This item is the number of bits in a single sample. If you ever hear someone referring to 8-bit audio or 16-bit audio, this is what they mean: the number of bits of precision in a single audio sample. It's possible to write 8-bit WAV files and use half the space, but I have no reason to do this. So for my application I always use 16.
Item P - Sub Chunk ID
This is always the ASCII string "data" without a terminator.
Item Q - Sub chunk Size
This is the length of the data, computed as
\((T * N * B )/8\)
Where
- \(T\) - the total number of audio samples
- \(N\) - number of channels, item G. This is always 1 for my application
- \(B\) - bits per sample, item M. This is always 16
Since \(N\) & \(B\) are always constant we can compute this as
\((T * 1 * 16)/8 = T * 2\)
So this value is just twice the number of samples.
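For example, one minute of audio at 8000 Hz contains \(60 * 8000 = 480000\) samples, so this field would be \(480000 * 2 = 960000\) bytes.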
Streaming out a WAV file
Once I understood the header of a WAV file, I realized that this file format is designed to be written out assuming you already have all the audio samples. I need to actually record audio samples as they come in, then close the file once I have recorded the desired number of samples. If you look at items B & Q, they both require knowledge of the remainder of the file to compute. Since I don't know this up front, I have to write out placeholder values and then come back and write the correct values when closing the file.
Writing samples
Actually writing out data samples is straightforward. Since I am writing out 16-bit data samples, each sample is a 16 bit signed integer in the range [-32768, 32767]. In GNURadio, audio samples are always delivered as a numpy array. Each value in the array is a floating point number in the range [-1.0, 1.0]. So I need to convert this to an integer before writing it out. This example program shows how I did that:
import numpy

data = numpy.array([-1.0, 1.0, 0.5, -0.5, -0.1, 0.1], dtype=float)  # Pretend this is audio sample data
output = numpy.array([0], dtype=numpy.short)
output.resize(len(data))
numpy.multiply(data, 32767, out=output, casting='unsafe')
print(output)
This produces the following output
[-32767 32767 16383 -16383 -3276 3276]
So each of the values that were originally in the range [-1.0, 1.0] has been rescaled to the correct range for writing out into the WAV file. Once you have a series of numbers that are in the correct range you can use the Python struct module to pack each value into a byte array. This is fine if you're just writing out numbers from a regular Python list. But if you're using numpy there is a method on a numpy array called .tofile that just dumps the entire contents to a file. So in my example I could write out my output by calling output.tofile(fout, sep='') where fout is a regular file opened with the Python built in open() function.
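To make the two approaches concrete, here is a small comparison using the scaled values from the earlier example. The file names are just for illustration, and on a little-endian machine both files end up with identical bytes:

import struct
import numpy

output = numpy.array([-32767, 32767, 16383], dtype=numpy.short)

# Pack one little-endian 16-bit integer at a time with struct
with open('samples_struct.raw', 'wb') as fout:
    for v in output:
        fout.write(struct.pack('<h', v))

# Or dump the whole numpy array in a single call
with open('samples_numpy.raw', 'wb') as fout:
    output.tofile(fout, sep='')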
Closing the file
To close the file, there are a couple of things we need to do. Technically speaking, the file needs to be padded to a length that is a multiple of 2. Since all samples written out are 16-bit, this is already the case, so no padding needs to be added. The remaining steps are:
- Populate the chunk size: seek back to an offset of 4 bytes, then write out the length of the file minus 8 bytes.
- Populate the data sub chunk size: seek back to an offset of 40 bytes, then write out twice the number of samples.
That's it, no additional steps are required; both patches are sketched below.
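Assuming fout is the open file object positioned at the end of the file and sample_count is the number of 16-bit samples written, the two patches look roughly like this (patch_sizes is a hypothetical helper; the real logic lives in the finish() method of the example code below):

import struct

def patch_sizes(fout, sample_count):
    file_size = fout.tell()          # position is assumed to be at the end of the file
    fout.seek(4)                     # Item B: chunk size
    fout.write(struct.pack('<I', file_size - 8))
    fout.seek(40)                    # Item Q: data sub chunk size
    fout.write(struct.pack('<I', sample_count * 2))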
Example code
To recap, everything we need to do is:
- Write out a header, with placeholders for any values we don't know yet
- Scale each audio sample to the range [-32768, 32767] and write it out as a 16-bit integer
- Go back and populate all the placeholders when done writing samples
I'm using Python, specifically Python 3.8 for this. This is what I came up with to write out the file:
import struct

LITTLE_ENDIAN_INT = struct.Struct('<I')
LITTLE_ENDIAN_SIGNED_SHORT = struct.Struct('<h')

CHUNK_SIZE_OFFSET = 4
DATA_SUB_CHUNK_SIZE_SIZE_OFFSET = 40


class WaveWriter(object):
    def __init__(self, fout, sample_rate):
        self.sample_count = 0
        self.sample_rate = sample_rate
        self.fout = fout
        self._write_header()

    def _write_header(self):
        self.fout.write(b'RIFF')
        self.fout.write(b'\xFF\xFF\xFF\xFF')  # placeholder for chunk size
        self.fout.write(b'WAVE')
        self.fout.write(b'fmt ')
        self.fout.write(b'\x10\x00\x00\x00')  # sub chunk size, little endian 16
        self.fout.write(b'\x01\x00')  # audio format, always little endian 1
        self.fout.write(b'\x01\x00')  # number of channels, always 1
        self.fout.write(LITTLE_ENDIAN_INT.pack(self.sample_rate))  # sample rate
        self.fout.write(LITTLE_ENDIAN_INT.pack(self.sample_rate * 2))  # byte rate, bytes per second
        self.fout.write(b'\x02\x00')  # block alignment
        self.fout.write(b'\x10\x00')  # bits per sample
        self.fout.write(b'data')
        self.fout.write(b'\xFF\xFF\xFF\xFF')  # placeholder for sub chunk size

    def write_samples(self, samples):
        self.sample_count += len(samples)
        for v in samples:
            self.fout.write(LITTLE_ENDIAN_SIGNED_SHORT.pack(v))

    def write_sample(self, sample):
        self.write_samples((sample,))

    def finish(self):
        v = self.fout.tell() - 8
        self.fout.seek(CHUNK_SIZE_OFFSET)
        self.fout.write(LITTLE_ENDIAN_INT.pack(v))
        self.fout.seek(DATA_SUB_CHUNK_SIZE_SIZE_OFFSET)
        v = 2 * self.sample_count
        self.fout.write(LITTLE_ENDIAN_INT.pack(v))
This is significantly simpler than it seems. Almost everything in the header winds up being a constant value.
Now the question is how to test this code. The easiest way is to write out a file with a pure tone in it. I'll also add in a small amount of random noise, which just sounds like a quiet hiss. This is much simpler than you might think. All you really need to do is compute a value with Python's math.sin for every sample, then scale it to the correct range.
import math
import random

# WaveWriter is the class defined above

sample_rate = 10000
duration_seconds = 6.23
number_of_samples = math.ceil(duration_seconds * sample_rate)
target_freq_hz = 232
rng = random.Random(1)

with open('example_output.wav', 'wb') as fout:
    wav_writer = WaveWriter(fout, sample_rate)
    for i in range(number_of_samples):
        v = math.sin(2 * math.pi * i * target_freq_hz / sample_rate) * 0.9
        v += rng.random() * 0.099
        v *= (2**15 - 1)
        v = int(v)
        wav_writer.write_sample(v)
    wav_writer.finish()
You can play the output of this with your favorite media player, or, if you have ffmpeg installed, you can visualize it using this command
ffmpeg -y -nostdin -i ./example_232_hz.wav -lavfi showspectrumpic=s=1920x1080:mode=separate:orientation=horizontal:color=rainbow:scale=lin:start=20:stop=300 example_232_hz.png
That creates an image like this, showing us the waterfall plot of the audio file that we have created
The bright line indicates where most of the sound energy is concentrated, in our case at 232 Hz.
Adding metadata
Now that I had a way to write out a WAV file, my main objective was complete. The secondary goal was that, since I am recording audio from a radio receiver, it would be nice to know when the file was recorded and what frequency the radio was receiving at the time. It's common in audio files to include information like the artist & track name for music, so I figured this had to be possible.
It turns out this is done by writing out a LIST chunk with an INFO sub-chunk after the RIFF chunk. Inside of the INFO sub-chunk you can write out any number of pieces of metadata. The overall format of this chunk is the same as the RIFF chunk, so the first twelve bytes can just be written out as b"LIST\xFF\xFF\xFF\xFFINFO" in Python. After that come the metadata entries. Each piece of metadata has the following format
Section | Byte offset | Type | Description |
---|---|---|---|
identifier | 0 | ASCII string, length 4 | the identifier for this metadata |
value-length | 4 | uint32 | the length of the value that follows, including the null terminator |
value | 8 | ASCII string, null terminated | the metadata value |
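For example, an ISFT (creating software) entry with the value "demo" would be laid out roughly like this; the value is purely illustrative:

import struct

value = b'demo'
entry = b'ISFT'                               # identifier
entry += struct.pack('<I', len(value) + 1)    # value length, including the null terminator
entry += value + b'\x00'                      # null terminated value
if len(entry) % 2:
    entry += b'\x00'                          # pad to an even length

print(entry)  # b'ISFT\x05\x00\x00\x00demo\x00\x00'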
The biggest difference here is that, since this chunk comes after all the samples, its size field isn't at a fixed offset; you need to record the file offset of the placeholder so it can be overwritten with the final length. This is done as part of closing the file after all the audio samples have been recorded. One important detail is that after each piece of metadata is written, the output needs to be padded to an even number of bytes. So the above example can be changed by just adding all this logic into the finish() method
def finish(self, metadata=None):
    if metadata is not None:
        list_chunk_start = self.fout.tell()
        self.fout.write(b'LIST')
        list_chunk_size_offset = self.fout.tell()
        self.fout.write(b'\xFF\xFF\xFF\xFF')  # placeholder for chunk size
        self.fout.write(b'INFO')
        for identifier_enum, value in metadata:
            self.fout.write(identifier_enum.value)
            self.fout.write(LITTLE_ENDIAN_INT.pack(len(value) + 1))
            self.fout.write(value.encode())
            self.fout.write(b'\x00')
            self._pad_output_to_even()
        end_of_list = self.fout.tell()
        self._pad_output_to_even()
        self.fout.seek(list_chunk_start + 4)
        self.fout.write(LITTLE_ENDIAN_INT.pack(end_of_list - list_chunk_start - 8))
    v = self.fout.tell() - 8
    self.fout.seek(CHUNK_SIZE_OFFSET)
    self.fout.write(LITTLE_ENDIAN_INT.pack(v))
    self.fout.seek(DATA_SUB_CHUNK_SIZE_SIZE_OFFSET)
    v = 2 * self.sample_count
    self.fout.write(LITTLE_ENDIAN_INT.pack(v))
The identifiers are ASCII strings of length 4. I used a Python enumeration to capture all of the known ones.
from enum import Enum


class WavMetadataIdentifier(Enum):
    TRACK_TITLE = b'INAM'
    ALBUM_TITLE = b'IPRD'
    ARTIST = b'IART'
    CREATION_DATE = b'ICRD'
    TRACK_NUMBER = b'ITRK'
    COMMENT = b'ICMT'
    KEYWORDS = b'IKEY'
    CREATING_SOFTWARE = b'ISFT'
    ENGINEER = b'IENG'
    TECHNICIAN = b'ITCH'
    GENRE = b'IGNR'
    COPYRIGHT = b'ICOP'
    SUBJECT = b'ISBJ'
    CREATOR_NAME = b'ISRC'
Then I ran this code to produce an output tone that slowly sweeps upward in frequency
sample_rate = 7000
duration_seconds = 10.0
number_of_samples = math.ceil(duration_seconds * sample_rate)
sweep_start_hz = 200
sweep_end_hz = 650
delta_hz = sweep_end_hz - sweep_start_hz
rng = random.Random(1)

metadata = (
    (WavMetadataIdentifier.CREATING_SOFTWARE, 'write_wav_with_metadata.py'),
    (WavMetadataIdentifier.COMMENT, "sweep from %d Hz to %d Hz" % (sweep_start_hz, sweep_end_hz,)),
    (WavMetadataIdentifier.COPYRIGHT, "creative commons attribution")
)

with open('example_output_sweep.wav', 'wb') as fout:
    wav_writer = WaveWriter(fout, sample_rate)
    for i in range(number_of_samples):
        i_hz = delta_hz * (i / number_of_samples)
        v = math.sin(2 * math.pi * i * (sweep_start_hz + i_hz) / sample_rate) * 0.8
        v += rng.random() * 0.19
        v *= (2**15 - 1)
        v = int(v)
        wav_writer.write_sample(v)
    wav_writer.finish(metadata)
To see the actual metadata in this file you can use ffprobe.
$ ffprobe example_200hz_to_650hz_sweep.wav
Input #0, wav, from 'example_200hz_to_650hz_sweep.wav':
  Metadata:
    encoder         : write_wav_with_metadata.py
    comment         : sweep from 200 Hz to 650 Hz
    copyright       : creative commons attribution
  Duration: 00:00:10.00, bitrate: 112 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 7000 Hz, 1 channels, s16, 112 kb/s
mediainfo bugs
Above, I used ffprobe to view the metadata in the output file. I should be able to use mediainfo to get the same information.
$ mediainfo site/joys-of-wav-file/example_200hz_to_650hz_sweep.wav
General
Complete name            : example_200hz_to_650hz_sweep.wav
Format                   : Wave
File size                : 137 KiB
Duration                 : 10 s 0 ms
Overall bit rate mode    : Constant
Overall bit rate         : 112 kb/s
IsTruncated              : Yes

Audio
Format                   : PCM
Format settings          : Little / Signed
Codec ID                 : 1
Duration                 : 10 s 0 ms
Bit rate mode            : Constant
Bit rate                 : 112 kb/s
Channel(s)               : 1 channel
Sampling rate            : 7 000 Hz
Bit depth                : 16 bits
Stream size              : 137 KiB (100%)
But the metadata doesn't appear here; instead we see that IsTruncated is Yes according to mediainfo. So it seems that a problem is present with the file. After about an hour of trying to figure this out, I decided mediainfo must have a bug. All the media players I tried played the file fine and could view the metadata. As far as I can tell, mediainfo expects the LIST chunk to come before the data chunk. This appears to be an informal aspect of the file format, as it isn't called for explicitly anywhere.
Source code
All of this code is in a library available on GitHub.
Sources
All of the information presented here is taken from the following sources