Module: featurize

cesium.featurize.assemble_featureset(…[, …]) Transforms raw feature data (as returned by featurize_single_ts) into a pd.DataFrame.
cesium.featurize.featurize_single_ts(ts, …) Compute feature values for a given single time-series.
cesium.featurize.featurize_time_series(…) Versatile feature generation function for one or more time series.
cesium.featurize.featurize_ts_files(…[, …]) Feature generation function for on-disk time series (.npz) files.
cesium.featurize.generate_dask_graph(t, m, e)
cesium.featurize.impute_featureset(fset[, …]) Replace NaN/Inf values with imputed values as defined by strategy.
cesium.featurize.load_featureset(path) Load feature DataFrame from .npz file.
cesium.featurize.save_featureset(fset, path, …) Save feature DataFrame in .npz format.
cesium.featurize.TimeSeries([t, m, e, …]) Class representing a single time series of measurements and metadata.

assemble_featureset

cesium.featurize.assemble_featureset(features_list, time_series=None, meta_features_list=None, names=None)

Transforms raw feature data (as returned by featurize_single_ts) into a pd.DataFrame.

Parameters:

features_list : list of pd.Series

List of series (one per time series file) with (feature name, channel) multiindex.

time_series : list of TimeSeries

If provided, the name and metafeatures from the time series objects will be used, overriding the meta_features_list and names values.

meta_features_list : list of dict

If provided, the columns of metadata will be added to the featureset.

names : list of str

If provided, the (row) index of the featureset will be set accordingly.

Returns:

pd.DataFrame

DataFrame with columns containing feature values, indexed by name.

featurize_single_ts

cesium.featurize.featurize_single_ts(ts, features_to_use, custom_script_path=None, custom_functions=None, raise_exceptions=True)

Compute feature values for a given single time-series. Data is returned as dictionaries/lists of lists.

Parameters:

ts : TimeSeries object

Single time series to be featurized.

features_to_use : list of str

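List of feature names to be generated.

custom_script_path : str, optional

Path to Python script containing function definitions for the generation of any custom features. Defaults to None.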

custom_functions : dict, optional

Dictionary of custom feature functions to be evaluated for the given time series, or a dictionary representing a dask graph of function evaluations. A dictionary of functions should map each feature name to a function that takes the arguments (t, m, e). In a dask graph, these arrays should be referenced as ‘t’, ‘m’, ‘e’, respectively, and any values whose keys are present in features_to_use will be computed.

raise_exceptions : bool, optional

If True, exceptions during feature computation are raised immediately; if False, exceptions are suppressed and np.nan is returned for the given feature and any dependent features. Defaults to True.

Returns:

dict

Dictionary with feature names as keys, lists of feature values (one per channel) as values.
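As a rough illustration of the custom_functions dictionary form and of the raise_exceptions behaviour, here is a minimal plain-Python sketch. It does not call cesium: the helper evaluate_custom_features and the three feature names are hypothetical, and it handles a single channel only, whereas the real function returns one value per channel.

```python
import math


def evaluate_custom_features(t, m, e, custom_functions, features_to_use,
                             raise_exceptions=True):
    """Hypothetical helper: call each requested function as f(t, m, e)."""
    results = {}
    for name in features_to_use:
        try:
            results[name] = custom_functions[name](t, m, e)
        except Exception:
            if raise_exceptions:
                raise
            results[name] = math.nan  # a suppressed failure becomes NaN
    return results


# Keys are feature names; values are functions of (t, m, e).
custom_functions = {
    "mean_value": lambda t, m, e: sum(m) / len(m),
    "duration": lambda t, m, e: max(t) - min(t),
    "mean_error": lambda t, m, e: sum(e) / len(e),  # fails when e is empty
}

t = [0.0, 1.0, 2.0, 3.0]
m = [1.0, 2.0, 3.0, 6.0]
e = []  # no errors supplied, so "mean_error" raises ZeroDivisionError

features = evaluate_custom_features(
    t, m, e, custom_functions,
    features_to_use=["mean_value", "duration", "mean_error"],
    raise_exceptions=False,
)
print(features)
```

With raise_exceptions=True the same call would stop at the first failing feature instead of recording NaN for it.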

featurize_time_series

cesium.featurize.featurize_time_series(times, values, errors=None, features_to_use=[], meta_features={}, names=None, custom_script_path=None, custom_functions=None, scheduler=dask.threaded.get, raise_exceptions=True)

Versatile feature generation function for one or more time series.

For a single time series, inputs may have the form:

  • times: (n,) array or (p, n) array (for p channels of measurement)
  • values: (n,) array or (p, n) array (for p channels of measurement)
  • errors: (n,) array or (p, n) array (for p channels of measurement)

For multiple time series, inputs may have the form:

  • times: list of (n,) arrays, list of (p, n) arrays (for p channels of measurement), or list of lists of (n,) arrays (for multichannel data with different time values per channel)
  • values: list of (n,) arrays, list of (p, n) arrays (for p channels of measurement), or list of lists of (n,) arrays (for multichannel data with different time values per channel)
  • errors: list of (n,) arrays, list of (p, n) arrays (for p channels of measurement), or list of lists of (n,) arrays (for multichannel data with different time values per channel)

In the case of multichannel measurements, each channel will be featurized separately, and the index of the output featureset will contain a channel coordinate.
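The per-channel behaviour described above can be pictured with a small plain-Python sketch. It is only an analogy and does not use cesium: the helper featurize_channels, the two-argument feature functions, and the (feature name, channel) keys of the result are assumptions made for illustration.

```python
def featurize_channels(times, values, feature_functions):
    """Hypothetical helper: apply each feature to each channel separately."""
    # A flat list of numbers is treated as a single channel.
    if not isinstance(values[0], (list, tuple)):
        values = [values]
    # A flat list of times is shared by all channels.
    if not isinstance(times[0], (list, tuple)):
        times = [times] * len(values)
    out = {}
    for channel, (t, m) in enumerate(zip(times, values)):
        for name, func in feature_functions.items():
            out[(name, channel)] = func(t, m)  # key carries the channel index
    return out


feature_functions = {
    "maximum": lambda t, m: max(m),
    "n_points": lambda t, m: len(t),
}

times = [0.0, 1.0, 2.0]                       # (n,) times shared by both channels
values = [[1.0, 5.0, 2.0], [7.0, 3.0, 4.0]]   # (p, n) measurements with p = 2

result = featurize_channels(times, values, feature_functions)
print(result)
```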

Parameters:

times : array, list of array, or list of lists of array

Array containing time values for a single time series, or a list of arrays each containing time values for a single time series, or a list of lists of arrays for multichannel data with different time values per channel.

values : array, list of array, or list of lists of array

Array containing measurement values for a single time series, or a list of arrays each containing (possibly multivariate) measurement values for a single time series, or a list of lists of arrays for multichannel data with different time values per channel.

errors : array, list/tuple of array, or list of lists of array, optional

Array containing measurement error values for a single time series, or a list of arrays each containing (possibly multivariate) measurement error values for a single time series, or a list of lists of arrays for multichannel data with different time values per channel.

features_to_use : list of str, optional

List of feature names to be generated. Defaults to an empty list, in which case only the meta_features are stored.

meta_features : dict/pd.Series or list of dicts/pd.DataFrame, optional

dict or Series (for a single time series), or list of dicts or DataFrame (for multiple time series), of metafeature information; these features are added to the output featureset, and their values are available to custom feature scripts.

names : str or list of str, optional

Name or list of names for each time series, if applicable; will be stored in the (row) index of the featureset.

custom_script_path : str, optional

Path to Python script containing function definitions for the generation of any custom features. Defaults to None.

custom_functions : dict, optional

Dictionary of custom feature functions to be evaluated for the given time series, or a dictionary representing a dask graph of function evaluations. A dictionary of functions should map each feature name to a function that takes the arguments (t, m, e). In a dask graph, these arrays should be referenced as ‘t’, ‘m’, ‘e’, respectively, and any values whose keys are present in features_to_use will be computed.

scheduler : function, optional

dask scheduler function used to perform feature extraction computation. Defaults to dask.threaded.get.

raise_exceptions : bool, optional

If True, exceptions during feature computation are raised immediately; if False, exceptions are suppressed and np.nan is returned for the given feature and any dependent features. Defaults to True.

Returns:

pd.DataFrame

DataFrame with columns containing feature values, indexed by name.
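The dask-graph form of custom_functions mentioned above follows the general dask graph convention: a dictionary in which a task is a tuple whose first element is a callable and whose remaining elements are arguments, with string arguments that match other keys standing for those keys' results. The sketch below evaluates such a graph with a tiny hand-written function rather than a real dask scheduler. The names evaluate, mean, and centered_max are hypothetical, and placing the arrays in the graph by hand under ‘t’, ‘m’, ‘e’ is an assumption made so the example is self-contained.

```python
def evaluate(graph, key):
    """Minimal recursive evaluator for a dask-style graph (illustrative only)."""
    value = graph[key]
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value
        # String arguments that name another key are computed first.
        resolved = [evaluate(graph, a) if isinstance(a, str) and a in graph else a
                    for a in args]
        return func(*resolved)
    return value  # plain data such as the 't', 'm', 'e' arrays


def mean(m):
    return sum(m) / len(m)


def centered_max(m, mu):
    return max(m) - mu


graph = {
    "t": [0.0, 1.0, 2.0, 3.0],
    "m": [2.0, 4.0, 6.0, 8.0],
    "e": [0.1, 0.1, 0.1, 0.1],
    "mean": (mean, "m"),                          # depends on the 'm' array
    "centered_max": (centered_max, "m", "mean"),  # depends on another feature
}

# Only keys listed in features_to_use are requested; dependencies follow.
features_to_use = ["centered_max"]
computed = {name: evaluate(graph, name) for name in features_to_use}
print(computed)
```

Here "mean" is computed only because "centered_max" depends on it, which mirrors the statement that values with keys present in features_to_use are the ones computed.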

featurize_ts_files

cesium.featurize.featurize_ts_files(ts_paths, features_to_use, custom_script_path=None, custom_functions=None, scheduler=dask.threaded.get, raise_exceptions=True)

Feature generation function for on-disk time series (.npz) files.

By default, computes features concurrently using the dask.threaded.get scheduler. Other possible options include dask.local.get for synchronous computation (e.g., when debugging), or dask.distributed.Executor.get for distributed computation.

In the case of multichannel measurements, each channel will be featurized separately, and the index of the output featureset will contain a channel coordinate.

Parameters:

ts_paths : list of str

List of paths to time series data, stored in numpy .npz format. See time_series.load for details.

features_to_use : list of str

List of feature names to be generated. Passing an empty list results in only meta_features being stored.

custom_script_path : str, optional

Path to Python script containing function definitions for the generation of any custom features. Defaults to None.

custom_functions : dict, optional

Dictionary of custom feature functions to be evaluated for the given time series, or a dictionary representing a dask graph of function evaluations. A dictionary of functions should map each feature name to a function that takes the arguments (t, m, e). In a dask graph, these arrays should be referenced as ‘t’, ‘m’, ‘e’, respectively, and any values whose keys are present in features_to_use will be computed.

scheduler : function, optional

dask scheduler function used to perform feature extraction computation. Defaults to dask.threaded.get.

raise_exceptions : bool, optional

If True, exceptions during feature computation are raised immediately; if False, exceptions are suppressed and np.nan is returned for the given feature and any dependent features. Defaults to True.

Returns:

pd.DataFrame

DataFrame with columns containing feature values, indexed by name.

generate_dask_graph

cesium.featurize.generate_dask_graph(t, m, e)

impute_featureset

cesium.featurize.impute_featureset(fset, strategy='constant', value=None, max_value=1e+20, inplace=False)

Replace NaN/Inf values with imputed values as defined by strategy. Output should satisfy sklearn.utils.validation.assert_all_finite so that training a model will not produce an error.

Parameters:

fset : pd.DataFrame

Feature data frame to be imputed.

strategy : str, optional

The imputation strategy. Defaults to ‘constant’.

  • ‘constant’: replace all missing values with value
  • ‘mean’: replace all missing values with the mean along axis
  • ‘median’: replace all missing values with the median along axis
  • ‘most_frequent’: replace all missing values with the mode along axis

value : float or None, optional

Replacement value to use for strategy='constant'. Defaults to None, in which case a very large negative value is used (a good choice for e.g. random forests).

max_value : float, optional

Maximum (absolute) value above which values are treated as infinite. Used to prevent overflow when fitting sklearn models.

inplace : bool, optional

If True, fill in place. If False, return a copy.

Returns:

pd.DataFrame

Feature data frame with no missing/infinite values.
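To make the strategies concrete, the following plain-Python sketch imputes a single list of floats. It is not cesium's implementation: the helper impute_column is hypothetical, it works on one column rather than a DataFrame, and the fallback constant of -1e9 used when value is None is an arbitrary placeholder, since the exact default is not stated here.

```python
import math
import statistics


def impute_column(column, strategy="constant", value=None, max_value=1e20):
    """Hypothetical helper: impute NaN/Inf/oversized entries in one column."""
    def is_missing(x):
        # inf also fails the magnitude check, so it is treated as missing
        return math.isnan(x) or abs(x) > max_value

    finite = [x for x in column if not is_missing(x)]
    if strategy == "constant":
        fill = -1e9 if value is None else value  # placeholder default (assumption)
    elif strategy == "mean":
        fill = statistics.mean(finite)
    elif strategy == "median":
        fill = statistics.median(finite)
    elif strategy == "most_frequent":
        fill = statistics.mode(finite)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if is_missing(x) else x for x in column]


column = [1.0, float("nan"), 3.0, float("inf"), 3.0]

constant_filled = impute_column(column, strategy="constant", value=0.0)
median_filled = impute_column(column, strategy="median")
print(constant_filled)
print(median_filled)
```

Both results contain only finite values, which is the property the finite-values check mentioned above looks for.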

load_featureset

cesium.featurize.load_featureset(path)

Load feature DataFrame from .npz file.

Feature information is returned as a single DataFrame, while any other arrays that were saved (class labels/predictions, etc.) are returned in a single dictionary.

Parameters:

path : str

Path where feature data is stored.

Returns:

pd.DataFrame

Loaded feature data frame.

dict

Additional variables passed to save_featureset, including labels, etc.

save_featureset

cesium.featurize.save_featureset(fset, path, **kwargs)

Save feature DataFrame in .npz format.

Can optionally store class labels/targets and other metadata. All other keyword arguments will be passed on to np.savez; data frames are saved as record arrays and converted back into data frames by load_featureset.

Parameters:

fset : pd.DataFrame

Feature data frame to be saved.

path : str

Path to store feature data.

kwargs : dict of array or data frame

Additional keyword arguments, e.g.:

  • labels -> class labels
  • preds -> predicted class labels
  • pred_probs -> (n_sample, n_class) data frame of class probabilities

TimeSeries

class cesium.featurize.TimeSeries(t=None, m=None, e=None, label=None, meta_features={}, name=None, path=None, channel_names=None)

Bases: object

Class representing a single time series of measurements and metadata.

A TimeSeries object encapsulates a single set of time-domain measurements, along with any metadata describing the observation. Typically the observations will consist of times, measurements, and (optionally) measurement errors. The measurements can be scalar- or vector-valued (i.e., “multichannel”); for multichannel measurements, the times and errors can also be vector-valued, or they can be shared across all channels of measurement.

Attributes

time ((n,) or (p, n) array or list of (n,) arrays) Array(s) of times corresponding to measurement values. If measurement is two-dimensional, this can be one-dimensional (same times for each channel) or two-dimensional (different times for each channel). If time is one-dimensional then it will be broadcast to match measurement.shape.
measurement ((n,) or (p, n) array or list of (n,) arrays) Array(s) of measurement values; can be two-dimensional for multichannel data. In the case of multichannel data with different numbers of measurements for each channel, measurement will be a list of arrays instead of a single two-dimensional array.
error ((n,) or (p, n) array or list of (n,) arrays) Array(s) of measurement errors for each value. If measurement is two-dimensional, this can be one-dimensional (same errors for each channel) or two-dimensional (different errors for each channel). If error is one-dimensional then it will be broadcast to match measurement.shape.
label (str, float, or None) Class label or regression target for the given time series (if applicable).
meta_features (dict) Dictionary of feature names/values specified independently of the featurization process in featurize.
name (str or None) Identifying name for the given time series (if applicable). Typically the name of the raw data file from which the time series was created.
path (str or None) Path to the file where the time series is stored on disk (if applicable).
channel_names (list of str) List of names of channels of measurement; by default these are simply channel_{i}, but can be arbitrary depending on the nature of the different measurement channels.

Methods

channels() Iterates over measurement channels (whether one or multiple).
save([path]) Store TimeSeries object as a single .npz file.
sort() Sort times, measurements, and errors by time.

__init__(t=None, m=None, e=None, label=None, meta_features={}, name=None, path=None, channel_names=None)

Create a TimeSeries object from measurement values/metadata.

See TimeSeries documentation for parameter values.

channels()

Iterates over measurement channels (whether one or multiple).

save(path=None)

Store TimeSeries object as a single .npz file.

Attributes are stored in the following arrays:
  • time
  • measurement
  • error
  • meta_feat_names
  • meta_feat_values
  • name
  • label

If path is omitted then the path attribute from the TimeSeries object is used.
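The array names meta_feat_names and meta_feat_values suggest that the meta_features dictionary is split into two parallel arrays when saved. That reading is an assumption based on the names alone; a plain-Python sketch of such a split and its inverse, using made-up metafeature names, might look like this:

```python
# Hypothetical metafeatures for one time series.
meta_features = {"redshift": 0.12, "ra": 150.1}

# Split the dictionary into two parallel sequences (assumed storage layout).
meta_feat_names = list(meta_features.keys())
meta_feat_values = list(meta_features.values())

# Rebuild the dictionary by pairing names with values again.
restored = dict(zip(meta_feat_names, meta_feat_values))
print(restored)
```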

sort()

Sort times, measurements, and errors by time.
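Sorting by time means that measurements and errors are reordered with the same permutation that puts the times in ascending order. A minimal single-channel sketch in plain Python, using a hypothetical helper sort_by_time rather than the TimeSeries method itself:

```python
def sort_by_time(t, m, e):
    """Hypothetical helper: reorder m and e to follow ascending t."""
    order = sorted(range(len(t)), key=lambda i: t[i])  # indices in time order
    return ([t[i] for i in order],
            [m[i] for i in order],
            [e[i] for i in order])


t = [2.0, 0.0, 1.0]
m = [30.0, 10.0, 20.0]
e = [0.3, 0.1, 0.2]

t_sorted, m_sorted, e_sorted = sort_by_time(t, m, e)
print(t_sorted, m_sorted, e_sorted)
```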