Masking and Missing Values

The astropy.table package provides support for masking and missing values in a table by using the numpy.ma masked array package to define masked columns and by supporting Mixin Columns that provide masking. This allows handling tables with missing or invalid entries in much the same manner as for standard (unmasked) tables. It is useful to be familiar with the masked array documentation when using masked tables within astropy.table.

In a nutshell, the concept is to define a boolean mask that mirrors the structure of a column data array. Wherever a mask value is True, the corresponding entry is considered to be missing or invalid. Operations involving column or row access and slicing are unchanged. The key difference is that arithmetic or reduction operations involving columns or column slices follow the rules for operations on masked arrays.

Important

Changes in astropy 4.0

In astropy 4.0 the behavior of masked tables was changed in a way that could impact program functionality. See Masking Change in astropy 4.0 for details.

Note

Reduction operations like numpy.sum or numpy.mean follow the convention of ignoring masked (invalid) values. This differs from the behavior of the floating point NaN, for which the sum of an array including one or more NaNs will result in NaN.

See this page for information on NumPy Enhancement Proposals 24, 25, and 26.
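
The difference between the two conventions can be sketched with plain numpy. The array values below are made up for illustration:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])

# Masked array: the second entry is flagged as missing.
masked = np.ma.MaskedArray(values, mask=[False, True, False])

# NaN-based array: the second entry is replaced by NaN.
with_nan = values.copy()
with_nan[1] = np.nan

masked_sum = np.sum(masked)    # masked entry is ignored
masked_mean = np.mean(masked)  # mean over the two unmasked entries
nan_sum = np.sum(with_nan)     # NaN propagates into the result

print(masked_sum, masked_mean, nan_sum)
```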

Table Creation

A masked table can be created in several ways:

Create a new table object and specify masked=True

>>> from astropy.table import Table, Column, MaskedColumn
>>> Table([(1, 2), (3, 4)], names=('a', 'b'), masked=True, dtype=('i4', 'i8'))
<Table masked=True length=2>
  a     b
int32 int64
----- -----
    1     3
    2     4

Notice the table attributes mask and fill_value that are available for a masked table.

Create a table with one or more columns as a MaskedColumn object

>>> a = MaskedColumn([1, 2], name='a', mask=[False, True], dtype='i4')
>>> b = Column([3, 4], name='b', dtype='i8')
>>> Table([a, b])
<Table length=2>
  a     b
int32 int64
----- -----
    1     3
   --     4

The MaskedColumn is the masked analog of the Column class and provides the interface for creating and manipulating a column of masked data. The MaskedColumn class inherits from numpy.ma.MaskedArray, in contrast to Column which inherits from numpy.ndarray. This distinction is the main reason there are different classes for these two cases.

Notice that masked entries in the table output are shown as --.

Create a table with one or more columns as a numpy MaskedArray

>>> from numpy import ma  # masked array package
>>> a = ma.array([1, 2])
>>> b = [3, 4]
>>> t = Table([a, b], names=('a', 'b'))

Add a MaskedColumn object to an existing table

>>> t = Table([[1, 2]], names=['a'])
>>> b = MaskedColumn([3, 4], mask=[True, False])
>>> t['b'] = b

Prior to astropy 4.0, adding the first MaskedColumn resulted in converting the entire table to be masked, which meant converting every existing Column to MaskedColumn. An informational warning was issued:

INFO: Upgrading Table to masked Table. Use Table.filled() to convert to unmasked table. [astropy.table.table]

In astropy 4.0 and later, existing columns are not changed.

Add a new row to an existing table and specify a mask argument

>>> a = Column([1, 2], name='a')
>>> b = Column([3, 4], name='b')
>>> t = Table([a, b])
>>> t.add_row([3, 6], mask=[True, False])

Convert an existing table to a masked table

>>> t = Table([[1, 2], ['x', 'y']])  # standard (unmasked) table
>>> t = Table(t, masked=True, copy=False)  # convert to masked table

This operation will convert every Column to MaskedColumn and ensure that any subsequently added columns are masked.

Table Access

Nearly all of the standard methods for accessing and modifying data columns, rows, and individual elements also apply to masked tables.

There is one minor difference for the Row object that is obtained by indexing a single row of a table:

  • For standard tables, two such rows can be compared for equality, but in masked tables this comparison will produce an exception.

This difference is due to issues in the underlying numpy.ma.MaskedArray implementation.

Masking and Filling

Both the Table and MaskedColumn classes provide attributes and methods to support manipulating tables with missing or invalid data.

Mask

The mask for a column can be viewed and modified via the mask attribute:

>>> t = Table([(1, 2), (3, 4)], names=('a', 'b'), masked=True)
>>> t['a'].mask = [False, True]  # Modify column mask (boolean array)
>>> t['b'].mask = [True, False]  # Modify column mask (boolean array)
>>> print(t)
 a   b
--- ---
  1  --
 --   4

Masked entries are shown as -- when the table is printed. You can view the mask directly, either at the column or table level:

>>> t['a'].mask
array([False,  True]...)

>>> t.mask
<Table length=2>
  a     b
 bool  bool
----- -----
False  True
 True False

To get the indices of masked elements, use an expression like:

>>> t['a'].mask.nonzero()[0]
array([1])

Filling

The entries which are masked (i.e., missing or invalid) can be replaced with specified fill values. In this case the MaskedColumn or masked Table will be converted to a standard Column or table. Each column in a masked table has a fill_value attribute that specifies the default fill value for that column. To perform the actual replacement operation the filled() method is called. This takes an optional argument which can override the default column fill_value attribute.

>>> t['a'].fill_value = -99
>>> t['b'].fill_value = 33

>>> print(t.filled())
 a   b
--- ---
  1  33
-99   4

>>> print(t['a'].filled())
 a
---
  1
-99

>>> print(t['a'].filled(999))
 a
---
  1
999

>>> print(t.filled(1000))
 a    b
---- ----
   1 1000
1000    4

Masking Change in astropy 4.0

In astropy 4.0 a change was introduced in the behavior of Table that impacts the handling of masked columns.

Prior to 4.0, in order to include one or more MaskedColumn columns in a table, it was required that every column be masked, even those with no missing or masked data. This was a holdover from the original implementation of Table that used a numpy structured array as the underlying container for the column data. Since astropy 1.0, the Table object is an ordered dictionary of columns (Table Implementation Details) and there is no requirement that column types be homogeneous.

Starting with 4.0, a Table can contain both Column and MaskedColumn columns, and by default the column type is determined solely by the data for each column.

The details of this change are discussed in the sections below.

Note

For most applications, even those with masked column data, we now recommend using the default Table behavior, which allows heterogeneous column types. This implies creating tables without specifying the masked keyword argument.

Meaning of the masked Table Attribute

The Table object has a masked attribute which determines the table behavior when adding a new column:

  • masked=True : non-mixin columns or data are always converted to MaskedColumn, and mixin columns have a mask attribute added if necessary.

  • masked=False : each column is added based on the type or contents of the data.

The behavior associated with the masked attribute has not changed in version 4.0. What has changed is that from 4.0 onward a table with masked=False may contain MaskedColumn columns.

It is important to recognize that the masked attribute for a table does not imply whether any of the column data are actually masked. A table can have masked=True but not have any masked elements in any table column. Starting with version 4.0 there are two table properties which give more useful information about masking:

  • has_masked_columns : table has at least one MaskedColumn column. This does not check if any data values are actually masked.

  • has_masked_values : table has one or more column data values which are masked. This may be relatively slow for large tables as it requires checking the mask values of each column.

Starting with version 4.0, the term “masked table” should be reserved for the narrow and less-common case of a table created with masked=True. In most cases there should be no need to worry about “masked” or “unmasked” at the table level; instead, focus on the individual columns.

Auto-upgrade to Masked

Prior to version 4.0, adding a MaskedColumn or a new row with masked elements to a table with masked=False would set masked=True and automatically “upgrade” other columns to be masked. In many cases this upgrade of the other columns was unnecessary and an annoyance.

Starting with 4.0, new columns are added using the column type which is appropriate for the data. For instance, if a numpy masked array is added, then that will turn into a MaskedColumn, but no other columns will be affected and the masked attribute will remain as False.

A commonly encountered implication of this change is that tables read with Table.read() will always have masked=False, and only columns with masked values will be MaskedColumn. Prior to 4.0, if the input table had any masked values, the returned table would have masked=True and every column would be a MaskedColumn. An example is in the next section.

Recovering the Pre-4.0 Behavior

For code that requires every existing or newly added column to be masked, it is now required to explicitly specify masked=True when creating the table. Previously the table would be auto-upgraded to use MaskedColumn for all columns as soon as the first masked column was added. If the table already exists (e.g., after using Table.read() to read a data file), then you need to make a new table:

>>> dat = Table.read('data.fits')
>>> dat = Table(dat, masked=True, copy=False)  # Convert to masked table
>>> dat['new_column'] = [1, 2, 3, 4, 5]  # Will be added as a MaskedColumn

For most applications this should not be necessary, and the preferred idiom is the more explicit version below:

>>> import numpy as np
>>> dat = Table.read('data.fits')
>>> dat['new_column'] = np.ma.MaskedArray([1, 2, 3, 4, 5])  # Added as a MaskedColumn