I often need to save a series of arrays in which one dimension varies in length---sometimes called a ragged array [1]. For example, I'm running particle tracking experiments, and I need to save the 2D coordinates of all particles in each video frame. The number of particles in each frame will vary due to movement across the edges of the frame and velocity components normal to the focal plane; so, I can't simply save a (dense) 3D array. Instead, I just store this data in a Python list of N-by-2 numpy arrays, where N is the number of particles in a frame and varies for each array.

The question is: How do you save this list of arrays? In my first attempt, I saved each array individually (as separate keys in an .npz file); this approach gave slow save/load times and larger file sizes. A better approach is to stack all the ragged arrays along the dimension that varies in length---i.e. the ragged dimension. Then, I use numpy's .npz file to save the array data.

Stacking and splitting arrays

numpy provides a number of functions to stack arrays: concatenate, hstack, vstack, and dstack. The main difference here is that we want to save the starting indices of the sub-arrays so that we can slice them back out later:

import numpy as np

def stack_ragged(array_list, axis=0):
    lengths = [np.shape(a)[axis] for a in array_list]
    idx = np.cumsum(lengths[:-1])
    stacked = np.concatenate(array_list, axis=axis)
    return stacked, idx

This returns the stacked array and the starting index of each sub-array. To use stack_ragged, just pass in a list of arrays:

>>> array_list = [np.array([(0, 0), (1, 1)]),
...               np.array([(2, 2), (3, 3), (4, 4)]),
...               np.array([(5, 5)])]
>>> stacked, idx = stack_ragged(array_list)
>>> print idx
[2 5]
>>> print stacked
[[0 0]
 [1 1]
 [2 2]
 [3 3]
 [4 4]
 [5 5]]

Here, the ragged arrays are stacked vertically, since axis = 0 by default.

To split up this array back into a list of ragged arrays, just pass in the stacked array and the starting indices (and the axis, if necessary) to numpy's split function:

>>> for array in np.split(stacked, idx):
...     print array
[[0 0]
 [1 1]]
[[2 2]
 [3 3]
 [4 4]]
[[5 5]]

which returns our original list of arrays. (Note: the loop is just for prettier printing.)

Saving and loading

So stacking turns our list of arrays into a single array, which we can easily save using numpy's save (single array) or savez (dict of arrays) functions. If we want to get back our original arrays, however, we also need to save the start indices:

def save_stacked_array(fname, array_list, axis=0):
    stacked, idx = stack_ragged(array_list, axis=axis)
    np.savez(fname, stacked_array=stacked, stacked_index=idx)

Finally, we define a function to load our original list of arrays:

def load_stacked_arrays(fname, axis=0):
    npzfile = np.load(fname)
    idx = npzfile['stacked_index']
    stacked = npzfile['stacked_array']
    return np.split(stacked, idx, axis=axis)

Alternatively, the save function could store the stacking-axis in the .npz file so that you don't have to specify it in the load function. Another improvement would be to guess the stacking axis in stack_ragged by checking which axis varies in size (this would fail, however, for constant N). And finally, you can use savez_compressed instead of savez to reduce storage.

P.S. After implementing this approach, I learned that NetCDF files support ragged arrays out of the box (using VLEN types)---it's not the first time I've reinvented the wheel; it won't be the last.



comments powered by Disqus