
New Functions for Vectorizing Operations on Any Data Type

By Vadim Teverovsky, MathWorks


Vectorization is one of the core concepts of MATLAB. With one command it lets you process all elements of an array, avoiding loops and making your code more readable and efficient.

For data stored in numerical arrays, most MATLAB functions are inherently vectorized. Often, however, your data may not be stored in a simple numerical array. Instead, it could be stored in cell arrays, structures, or structure arrays. For example:

  • You may have a cell array containing both string and non-string data, and need to know which elements of the array are strings and which are not.
  • You may have a structure array, and need to know which elements contain data above a given threshold.
  • You may have sensor data stored in a structure where each field of the structure corresponds to data for a particular sensor, and need to compute the standard deviation for each sensor, or need to "clean" the data by removing any NaN values.
  • You may have a structure array filled with data that corresponds to the locations of various tokens (or substrings) in multiple files, with each file containing multiple lines of text (for example, M-code), and you need to present the data in a relatively transparent format.

Previous versions of MATLAB had limited support for processing data stored in these ways. Typically you would write one or more for loops, pre-allocate storage for the output, and so on. While the amount of code may not have been large, such code is usually repetitive and error-prone, and the practice runs counter to the MATLAB concept of vectorization. The only tool for generalized operations on arrays in prior versions of MATLAB was the function cellfun, which operated on every element of a cell array but supported only a few operations.

New Capabilities

In MATLAB 7.1, the capabilities of cellfun have been generalized, and two new functions in the same family have been added: arrayfun and structfun.

These functions offer the following benefits:

  • In general, they enable you to focus on the algorithm, rather than the mechanics of either creating a new structure or iterating through loops.
  • They are relatively generic, and so are applicable to a variety of problems.
  • Many simple examples use anonymous functions, and fit on one line.
  • The ErrorHandler parameter lets you introduce your own error-handling functions, so that if a particular set of inputs causes an error on one of the calls to the underlying function, the whole computation is not aborted.
  • Arrayfun and cellfun can take multiple input arguments and produce multiple output arguments.
  • The functions are built in for greater speed.
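
As a brief illustration of the ErrorHandler parameter listed above, the following sketch extracts the third element of each vector in a cell array. The variable names (data, third) and the choice of NaN as a fallback value are assumptions made for this example. The handler receives a structure describing the error, followed by the inputs that caused it, and its return value stands in for the failed call.

```matlab
% Hypothetical sample data: the second vector has only two elements,
% so indexing its third element raises an error.
data = {[1 2 3], [4 5], [6 7 8 9]};

% Without an ErrorHandler, the whole call would abort on the second cell.
% With one, the handler's return value (NaN here) is used for that cell.
third = cellfun(@(v) v(3), data, 'ErrorHandler', @(err, v) NaN);
% third is [3 NaN 8]
```

Returning a sentinel such as NaN keeps the output uniform, so the failed cells can be located afterward with isnan.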

Example 1: cellfun

Cellfun now takes any function handle (including anonymous functions) as its first argument and one or more cell arrays of the same size as subsequent arguments. It then applies the function to each cell in the array. The output is either in the form of another cell array or in a “uniform” array, such as an array of doubles.
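
The calling patterns described above can be sketched as follows. The sample values and variable names (sums, largest, position, mixed, types) are illustrative assumptions, not taken from the original examples.

```matlab
% Two cell arrays of the same size are processed cell by cell.
sums = cellfun(@(a, b) a + b, {1, 2, 3}, {10, 20, 30});
% sums is [11 22 33]

% Requesting two outputs returns one array per output of max.
[largest, position] = cellfun(@max, {[1 5 2], [7 3]});
% largest is [5 7], position is [2 1]

% Results that are not scalars need 'UniformOutput', false,
% which returns a cell array of the same size as the input.
mixed = {'abcde', 3; [5 6], 'mnopqr'};
types = cellfun(@class, mixed, 'UniformOutput', false);
% types is {'char', 'double'; 'double', 'char'}
```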

To find out which cells in a cell array contain strings, formerly you might have written code such as the following:

cellArray = {'abcde', 3; [5 6], 'mnopqr'};
 
b = true(size(cellArray));
for i = 1:size(cellArray,1)
    for j = 1:size(cellArray,2)
        b(i,j) = ischar(cellArray{i,j});
    end
end
b

b =
     1    0
     0    1

Now you could use code like this:

b = cellfun(@ischar, cellArray)

b =
     1    0
     0    1

The smaller amount of code is clearer, making it more obvious that the function ischar is applied to each cell of the array and that a logical array is returned.

Example 2: arrayfun

arrayfun is similar to cellfun, but it operates on each element of one or more MATLAB arrays of any type. Given a cell array as input, arrayfun performs the operation on each cell of the array (a 1-by-1 cell array), whereas cellfun operates on the contents of each cell. The difference is essentially the difference between array(i) and array{i}. arrayfun is most commonly used for structure arrays.
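
A minimal sketch of that difference, using a small cell array invented for this illustration:

```matlab
c = {'abc', 42};

% cellfun passes the contents of each cell, c{i}, to the function.
viaCellfun = cellfun(@class, c, 'UniformOutput', false);
% viaCellfun is {'char', 'double'}

% arrayfun passes each element c(i), which is itself a 1-by-1 cell array.
viaArrayfun = arrayfun(@class, c, 'UniformOutput', false);
% viaArrayfun is {'cell', 'cell'}
```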

Consider a structure array with the following data:

sArray(1).Data = [12 5 10];
sArray(2).Data = [];
sArray(3).Data = [4];
sArray(4).Data = [12];

If you had to find which elements of sArray contain data greater than some value, X, formerly you might have written code such as the following:

output = true(size(sArray));
X = 5;
for i = 1:length(sArray)
    output(i) = ~isempty(find(sArray(i).Data > X));
end
output

output =
         1    0    0    1

But with the new functionality of arrayfun, you can now write code like this:

output = arrayfun(@(y) ~isempty(find(y.Data > 5)), sArray)

output =
         1    0    0    1

Here an anonymous function describes the desired operation, and arrayfun applies the function to each structure in the array. With neither pre-allocation nor loops, the rationale for this line of code is more transparent than it would have been otherwise.
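
One possible variation, offered as a sketch rather than as the article's method, replaces ~isempty(find(...)) with any and captures the threshold X in the anonymous function. The resulting logical array can then be used to select the matching structures. The names aboveX and selected are assumptions made for this example.

```matlab
sArray(1).Data = [12 5 10];
sArray(2).Data = [];
sArray(3).Data = [4];
sArray(4).Data = [12];

X = 5;   % threshold, captured by the anonymous function below

% any returns false for an empty array, so empty Data fields yield 0.
aboveX = arrayfun(@(s) any(s.Data > X), sArray);
% aboveX is [1 0 0 1]

% Logical indexing keeps only the structures that passed the test.
selected = sArray(aboveX);
```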

Example 3: structfun

Structfun operates on each field of a single scalar structure. Consider sensor data stored in a scalar structure, where each field of the structure contains data from a single sensor. Here is a simple example:

sensorData.sensor1 = [12 34 23 28 43];
sensorData.sensor2 = [14 38 44 38 56];

You can perform certain analyses regardless of which sensor is in use. For example, you may need to compute the standard deviation of the data values for each sensor. The function std can do that for a single vector of data, but to operate on each sensor, you can now write something like this:

result = structfun(@std, sensorData)

result =
         11.6404
         15.2971

Alternatively, if you need to retain the link between the sensor’s name and the data, you can set the UniformOutput flag to false so that the return value is a new structure with the same field names as the original data.

result = structfun(@std, sensorData, 'UniformOutput', false)

result =
         sensor1: 11.6404
         sensor2: 15.2971
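
If you prefer to keep the uniform (numeric) output but still want to label the values, one option is to pair the result with fieldnames, since structfun visits the fields in the order in which they are stored. The following is a minimal sketch; the variable names (names, spread) are assumptions for this example.

```matlab
sensorData.sensor1 = [12 34 23 28 43];
sensorData.sensor2 = [14 38 44 38 56];

names  = fieldnames(sensorData);        % cell array of field names
spread = structfun(@std, sensorData);   % column vector, one row per field

% Print each sensor name next to its standard deviation.
for k = 1:numel(names)
    fprintf('%s: %.4f\n', names{k}, spread(k));
end
```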

Another example is the need to "clean" data by replacing all NaNs with the average of the rest of the data. You can easily write a function that will do so for a single vector of data.

function output = cleanNaN(data)
% Error checking and complexity deliberately left out.
 
nonNans = find(~isnan(data));
output(nonNans) = data(nonNans);
average = mean(data(nonNans));
output(isnan(data)) = average;
end

If your original data has NaNs in it, such as the following:

sensorData.sensor1 = [12 34 23 NaN 43];
sensorData.sensor2 = [14 NaN 44 NaN 56];

You could create a new structure with the same fields, but with cleaned-up data:

cleanedData = structfun(@cleanNaN, sensorData, 'UniformOutput', false)

cleanedData =
              sensor1: [12 34 23 28 43]
              sensor2: [14 38 44 38 56]

Again, these functions let you focus on the algorithm rather than the mechanics of creating a new structure or iterating through loops.
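
A related check, sketched here with assumed variable names (nanCount, needsCleaning), uses structfun with anonymous functions to report how many NaN values each sensor contains before any cleaning takes place.

```matlab
sensorData.sensor1 = [12 34 23 NaN 43];
sensorData.sensor2 = [14 NaN 44 NaN 56];

% Count the NaN values in each field.
nanCount = structfun(@(v) nnz(isnan(v)), sensorData);
% nanCount is [1; 2]

% Flag the fields that contain at least one NaN.
needsCleaning = structfun(@(v) any(isnan(v)), sensorData);
% needsCleaning is [1; 1]
```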

Example 4: cellfun and arrayfun

These functions can be used in combination if you need to perform operations on each field of a structure and for each element of an array.

Consider data corresponding to the positions of various "tokens" (or substrings) in several files, where the data is in the following form:

subStringData = {struct('location', {[51 2 12], [62 21 31]}, 'filename', 'foo.m'), ...
struct('location', {[43 5 26], [72 22 43]}, 'filename', 'bar.m')}

The location field contains the line number and the start and end columns of the token. The variable subStringData is a cell array of structure arrays, with one cell for each file and one structure for each token in the file. For presentation purposes, you may need a more user-friendly way of storing and displaying the data. You have a function, tokenReorg, that takes a scalar structure as an input argument and separates the location into three different fields, creating a new scalar structure. Note that tokenReorg focuses entirely on the algorithm. In this case, the algorithm is the transformation of a single token location into a more user-friendly format.

function sTransform = tokenReorg(input)
% Reorganize a single record of a location structure
sTransform.line = input.location(1);
sTransform.start = input.location(2);
sTransform.end = input.location(3);
sTransform.filename = input.filename;
end

To transform the entire set of data, you can now write code like this:

cs = cellfun(@(x) arrayfun(@tokenReorg, x), subStringData,'UniformOutput',false);
cs{1}(1)
cs{1}(2)
cs{2}(1)
cs{2}(2)

ans =
      line:       51
      start:      2
      end:        12
      filename:   'foo.m'

ans = 
      line:       62
      start:      21
      end:        31
      filename:   'foo.m'

ans = 
      line:       43
      start:      5
      end:        26
      filename:   'bar.m'

ans = 
      line:       72
      start:      22
      end:        43
      filename:   'bar.m'

What did this do? Remember that we have a cell array of structure arrays, and a function that operates on a single scalar structure. To reorganize all of our data, we must reorganize each structure in each cell of the data. Starting from the inside, arrayfun applies the tokenReorg function to each structure in a structure array. The result of the call to arrayfun (that is, of the function call: arrayfun(@tokenReorg, x)) is to produce a new structure array containing the required format for one cell (which is equivalent to one file). For example, the following line:

a = arrayfun(@tokenReorg, subStringData{1})

gives information regarding the file, foo.m:

a(1)	 
line:       51
start:      2
end:        12
filename:   'foo.m'
 
a(2)	 
line:       62
start:      21
end:        31
filename:   'foo.m'

Each call to arrayfun produces such a result. The call to cellfun applies the anonymous function, @(x) arrayfun(@tokenReorg, x), to each cell of the cell array. If you set the UniformOutput flag to false, cellfun returns a cell array of outputs. Thus, the result of the call to cellfun is a cell array in which each cell contains the structure array for a given file.

This example is adapted from code used within The MathWorks to analyze M-code. Although the simplified sample data structures can be represented in other, perhaps simpler ways, they serve as an example of what can be done with the new set of functions.
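
The same inside-out pattern (arrayfun within cellfun) can be sketched for a simpler task on the same sample data: counting the tokens in each file and collecting their line numbers. The variable names tokensPerFile and lineNumbers are assumptions for this illustration.

```matlab
subStringData = {struct('location', {[51 2 12], [62 21 31]}, 'filename', 'foo.m'), ...
                 struct('location', {[43 5 26], [72 22 43]}, 'filename', 'bar.m')};

% One count per file: each cell holds a structure array of tokens.
tokensPerFile = cellfun(@numel, subStringData);
% tokensPerFile is [2 2]

% Inner arrayfun pulls the line number from each token of one file;
% outer cellfun repeats that for every file and returns a cell array.
lineNumbers = cellfun(@(f) arrayfun(@(t) t.location(1), f), ...
                      subStringData, 'UniformOutput', false);
% lineNumbers{1} is [51 62], lineNumbers{2} is [43 72]
```

The inner call can return a uniform numeric array because each token yields a scalar, while the outer call needs 'UniformOutput', false because each file yields a vector.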

Conclusion

The newly generalized functionality of cellfun and the new functions arrayfun and structfun let you vectorize code that you could not have vectorized previously. Using these functions can result in code that contains fewer loops and lets you focus on the algorithm rather than the programming infrastructure. Combining these functions with each other can give you additional flexibility in handling complicated arrays.

A selection of the capabilities of these functions has been discussed here. Please refer to the documentation for more information and examples (cellfun, arrayfun, structfun).

Published 2006