Module: `featurize`

`cesium.featurize.assemble_featureset`(...[, ...])	Transforms raw feature data (as returned by featurize_single_ts) into a pd.DataFrame.
`cesium.featurize.featurize_single_ts`(ts, ...)	Compute feature values for a given single time-series.
`cesium.featurize.featurize_time_series`(...)	Versatile feature generation function for one or more time series.
`cesium.featurize.featurize_ts_files`(...[, ...])	Feature generation function for on-disk time series (.npz) files.
`cesium.featurize.generate_dask_graph`(t, m, e)
`cesium.featurize.impute_featureset`(fset[, ...])	Replace NaN/Inf values with imputed values as defined by strategy.
`cesium.featurize.load_featureset`(path)	Load feature DataFrame from .npz file.
`cesium.featurize.save_featureset`(fset, path, ...)	Save feature DataFrame in .npz format.
`cesium.featurize.TimeSeries`([t, m, e, ...])	Class representing a single time series of measurements and metadata.

assemble_featureset

cesium.featurize.assemble_featureset(features_list, time_series=None, meta_features_list=None, names=None)

Transforms raw feature data (as returned by featurize_single_ts) into a pd.DataFrame.

Parameters:

features_list (list of pd.Series): List of series (one per time series file) with (feature name, channel) multiindex.
time_series (list of TimeSeries): If provided, the name and metafeatures from the time series objects will be used, overriding the meta_features_list and names values.
meta_features_list (list of dict): If provided, the columns of metadata will be added to the featureset.
names (list of str): If provided, the (row) index of the featureset will be set accordingly.

Returns:

pd.DataFrame: DataFrame with columns containing feature values, indexed by name.
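
As a rough illustration of the layout described above, the following sketch stacks per-series feature vectors into one frame using pandas alone. It is not cesium's implementation: `assemble_sketch` and the sample feature names are hypothetical, and metafeature handling is omitted.

```python
import pandas as pd


def assemble_sketch(features_list, names=None):
    """Stack per-series feature vectors into one row per time series.

    Hypothetical helper for illustration; not part of cesium.
    """
    # Each Series becomes one column; transposing turns it into one row.
    fset = pd.concat(features_list, axis=1, ignore_index=True).T
    if names is not None:
        fset.index = pd.Index(names, name="name")
    return fset


# (feature name, channel) MultiIndex shared by both series.
index = pd.MultiIndex.from_tuples(
    [("amplitude", 0), ("std", 0)], names=["feature", "channel"]
)
features_list = [
    pd.Series([1.0, 0.5], index=index),
    pd.Series([2.0, 0.7], index=index),
]
fset = assemble_sketch(features_list, names=["ts_a", "ts_b"])
```

The resulting frame has one row per time series and a two-level column index, so a single value is addressed as `fset.loc["ts_a", ("amplitude", 0)]`.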

featurize_single_ts

cesium.featurize.featurize_single_ts(ts, features_to_use, custom_script_path=None, custom_functions=None, raise_exceptions=True)

Compute feature values for a given single time-series. Data is returned as dictionaries/lists of lists.

Parameters:

ts (TimeSeries object): Single time series to be featurized.
features_to_use (list of str): List of feature names to be generated.
custom_functions (dict, optional): Dictionary of custom feature functions to be evaluated for the given time series, or a dictionary representing a dask graph of function evaluations. Dictionaries of functions should map each feature name to a function that takes arguments (t, m, e); in the case of a dask graph, these arrays should be referenced as ‘t’, ‘m’, ‘e’, respectively, and any values with keys present in features_to_use will be computed.
raise_exceptions (bool, optional): If True, exceptions during feature computation are raised immediately; if False, exceptions are suppressed and np.nan is returned for the given feature and any dependent features. Defaults to True.

Returns:

dict: Dictionary with feature names as keys, lists of feature values (one per channel) as values.
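
The shape of a custom_functions dictionary and of the returned per-channel lists can be sketched without cesium. The helper below is a hypothetical stand-in, not the library's code: it evaluates each function once per channel and mimics raise_exceptions=False by storing NaN when a function fails. It does not model dependent features.

```python
import math


def evaluate_custom_functions(channels, custom_functions, raise_exceptions=True):
    """Evaluate each custom function on every (t, m, e) channel.

    Hypothetical helper for illustration; not part of cesium.
    """
    results = {}
    for name, func in custom_functions.items():
        values = []
        for t, m, e in channels:
            try:
                values.append(float(func(t, m, e)))
            except Exception:
                if raise_exceptions:
                    raise
                # Suppressed failure: record NaN for this channel.
                values.append(math.nan)
        results[name] = values
    return results


def mean_value(t, m, e):
    return sum(m) / len(m)


def always_fails(t, m, e):
    raise ValueError("no data")


# Two channels, each given as a (t, m, e) triple.
channels = [
    ([0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [0.1, 0.1, 0.1]),
    ([0.0, 1.0, 2.0], [10.0, 20.0, 30.0], [1.0, 1.0, 1.0]),
]
features = evaluate_custom_functions(
    channels,
    {"mean_value": mean_value, "always_fails": always_fails},
    raise_exceptions=False,
)
```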

featurize_time_series

cesium.featurize.featurize_time_series(times, values, errors=None, features_to_use=[], meta_features={}, names=None, custom_script_path=None, custom_functions=None, scheduler=<function get>, raise_exceptions=True)

Versatile feature generation function for one or more time series.

For a single time series, inputs may have the form:

times: (n,) array or (p, n) array (for p channels of measurement)
values: (n,) array or (p, n) array (for p channels of measurement)
errors: (n,) array or (p, n) array (for p channels of measurement)

For multiple time series, inputs may have the form:

times: list of (n,) arrays, list of (p, n) arrays (for p channels of measurement), or list of lists of (n,) arrays (for multichannel data with different time values per channel)
values: list of (n,) arrays, list of (p, n) arrays (for p channels of measurement), or list of lists of (n,) arrays (for multichannel data with different time values per channel)
errors: list of (n,) arrays, list of (p, n) arrays (for p channels of measurement), or list of lists of (n,) arrays (for multichannel data with different time values per channel)

In the case of multichannel measurements, each channel will be featurized separately, and the index of the output featureset will contain a channel coordinate.
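
The accepted layouts for one time series can be made concrete with a few NumPy arrays. This is only a sketch of the shapes listed above; `count_channels` is a hypothetical helper and not part of cesium.

```python
import numpy as np

# Single-channel series: times and values are both (n,) arrays.
t_single = np.linspace(0.0, 1.0, 5)
m_single = np.sin(t_single)

# Two channels of equal length: values form a (p, n) array with p = 2.
m_multi = np.vstack([np.sin(t_single), np.cos(t_single)])

# Two channels sampled at different times: lists of (n,) arrays,
# one entry per channel, whose lengths may differ.
t_ragged = [np.linspace(0.0, 1.0, 5), np.linspace(0.0, 1.0, 8)]
m_ragged = [np.sin(t) for t in t_ragged]


def count_channels(values):
    """Hypothetical helper: number of channels in one time series."""
    if isinstance(values, list):
        return len(values)
    return 1 if np.ndim(values) == 1 else np.shape(values)[0]
```

For multiple time series, each of these per-series inputs would be placed in an outer list.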

Parameters:

times (array, list of array, or list of lists of array): Array containing time values for a single time series, or a list of arrays each containing time values for a single time series, or a list of lists of arrays for multichannel data with different time values per channel.
values (array or list of array): Array containing measurement values for a single time series, or a list of arrays each containing (possibly multivariate) measurement values for a single time series, or a list of lists of arrays for multichannel data with different time values per channel.
errors (array or list/tuple of array, optional): Array containing measurement error values for a single time series, or a list of arrays each containing (possibly multivariate) measurement error values for a single time series, or a list of lists of arrays for multichannel data with different time values per channel.
features_to_use (list of str, optional): List of feature names to be generated. Defaults to an empty list, which will result in only meta_features features being stored.
meta_features (dict/Pandas.Series or list of dicts/Pandas.DataFrame): dict/Series (for a single time series) or DataFrame (for multiple time series) of metafeature information; features are added to the output featureset, and their values are consumable by custom feature scripts.
names (str or list of str, optional): Name or list of names for each time series, if applicable; will be stored in the (row) index of the featureset.
custom_script_path (str, optional): Path to Python script containing function definitions for the generation of any custom features. Defaults to None.
custom_functions (dict, optional): Dictionary of custom feature functions to be evaluated for the given time series, or a dictionary representing a dask graph of function evaluations. Dictionaries of functions should map each feature name to a function that takes arguments (t, m, e); in the case of a dask graph, these arrays should be referenced as ‘t’, ‘m’, ‘e’, respectively, and any values with keys present in features_to_use will be computed.
scheduler (function, optional): dask scheduler function used to perform feature extraction computation. Defaults to dask.threaded.get.
raise_exceptions (bool, optional): If True, exceptions during feature computation are raised immediately; if False, exceptions are suppressed and np.nan is returned for the given feature and any dependent features. Defaults to True.

Returns:

pd.DataFrame: DataFrame with columns containing feature values, indexed by name.
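
The dask-graph form of custom_functions is a plain dictionary in which each key maps either to data or to a task tuple of the form (function, arg1, ...), with string arguments referring to other keys. The sketch below shows that structure with the ‘t’, ‘m’, ‘e’ keys described above. `tiny_get` is a hypothetical, simplified stand-in for a dask scheduler, used only so the example runs without dask; the feature names are also made up.

```python
def tiny_get(graph, key):
    """Evaluate one key of a dask-style graph.

    Hypothetical stand-in for a scheduler; not part of dask or cesium.
    """
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # String arguments that name other keys are resolved recursively.
        resolved = [
            tiny_get(graph, arg) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        return func(*resolved)
    return task


def peak_to_peak(m):
    return max(m) - min(m)


def halve(x):
    return x / 2.0


graph = {
    "t": [0.0, 1.0, 2.0],
    "m": [1.0, 5.0, 3.0],
    "e": [0.1, 0.1, 0.1],
    "value_range": (peak_to_peak, "m"),
    "half_range": (halve, "value_range"),
}

# Only keys listed in features_to_use are requested; "value_range" is
# computed along the way because "half_range" depends on it.
features_to_use = ["half_range"]
computed = {name: tiny_get(graph, name) for name in features_to_use}
```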

featurize_ts_files

cesium.featurize.featurize_ts_files(ts_paths, features_to_use, custom_script_path=None, custom_functions=None, scheduler=<function get>, raise_exceptions=True)

Feature generation function for on-disk time series (.npz) files.

By default, computes features concurrently using the dask.threaded.get scheduler. Other possible options include dask.local.get for synchronous computation (e.g., when debugging), or dask.distributed.Executor.get for distributed computation.

In the case of multichannel measurements, each channel will be featurized separately, and the index of the output featureset will contain a channel coordinate.

Parameters:

ts_paths (list of str): List of paths to time series data, stored in numpy .npz format. See time_series.load for details.
features_to_use (list of str, optional): List of feature names to be generated. Defaults to an empty list, which will result in only meta_features features being stored.
custom_script_path (str, optional): Path to Python script containing function definitions for the generation of any custom features. Defaults to None.
custom_functions (dict, optional): Dictionary of custom feature functions to be evaluated for the given time series, or a dictionary representing a dask graph of function evaluations. Dictionaries of functions should map each feature name to a function that takes arguments (t, m, e); in the case of a dask graph, these arrays should be referenced as ‘t’, ‘m’, ‘e’, respectively, and any values with keys present in features_to_use will be computed.
scheduler (function, optional): dask scheduler function used to perform feature extraction computation. Defaults to dask.threaded.get.
raise_exceptions (bool, optional): If True, exceptions during feature computation are raised immediately; if False, exceptions are suppressed and np.nan is returned for the given feature and any dependent features. Defaults to True.

Returns:

pd.DataFrame: DataFrame with columns containing feature values, indexed by name.

generate_dask_graph

cesium.featurize.generate_dask_graph(t, m, e)

impute_featureset

cesium.featurize.impute_featureset(fset, strategy='constant', value=None, max_value=1e+20, inplace=False)

Replace NaN/Inf values with imputed values as defined by strategy. Output should satisfy sklearn.utils.validation.assert_all_finite so that training a model will not produce an error.

Parameters:

strategy (str, optional)

The imputation strategy. Defaults to ‘constant’.

‘constant’: replace all missing with value
‘mean’: replace all missing with mean along axis
‘median’: replace all missing with median along axis
‘most_frequent’: replace all missing with mode along axis

value (float or None, optional)

Replacement value to use for strategy=’constant’. Defaults to None, in which case a very large negative value is used (a good choice for e.g. random forests).

max_value (float, optional)

Maximum (absolute) value above which values are treated as infinite. Used to prevent overflow when fitting sklearn models.

inplace (bool, optional)

If True, fill in place. If False, return a copy.

Returns:

pd.DataFrame: Feature data frame with no missing/infinite values.
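
The four strategies can be approximated with pandas as follows. This is a minimal sketch of the behavior described above, not cesium's implementation: `impute_sketch` is hypothetical, and the fallback constant of -1e9 is an assumption standing in for the "very large negative value" that the library chooses itself.

```python
import numpy as np
import pandas as pd


def impute_sketch(fset, strategy="constant", value=None, max_value=1e20):
    """Return a copy of fset with NaN/Inf/oversized entries filled in.

    Hypothetical helper for illustration; not part of cesium.
    """
    out = fset.astype(float)
    # Treat anything beyond max_value (including +/-inf) as missing.
    out = out.mask(out.abs() > max_value)
    if strategy == "constant":
        if value is None:
            value = -1e9  # assumed placeholder, not cesium's actual default
        return out.fillna(value)
    if strategy == "mean":
        return out.fillna(out.mean())
    if strategy == "median":
        return out.fillna(out.median())
    if strategy == "most_frequent":
        return out.fillna(out.mode().iloc[0])
    raise ValueError(f"Unknown strategy: {strategy!r}")


fset = pd.DataFrame(
    {"amplitude": [1.0, np.nan, 3.0], "std": [np.inf, 2.0, 2.0]},
    index=["ts_a", "ts_b", "ts_c"],
)
imputed = impute_sketch(fset, strategy="mean")
```

Here the statistics are computed per column, so each feature is filled from its own finite values.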

load_featureset

cesium.featurize.load_featureset(path)

Load feature DataFrame from .npz file.

Feature information is returned as a single DataFrame, while any other arrays that were saved (class labels/predictions, etc.) are returned in a single dictionary.

Parameters:

path (str): Path where feature data is stored.

Returns:

pd.DataFrame: Feature data frame that was loaded.
dict: Additional variables passed to save_featureset, including labels, etc.

save_featureset

cesium.featurize.save_featureset(fset, path, **kwargs)

Save feature DataFrame in .npz format.

Can optionally store class labels/targets and other metadata. All other keyword arguments will be passed on to np.savez; data frames are saved as record arrays and converted back into data frames by load_featureset.

Parameters:

fset (pd.DataFrame): Feature data frame to be saved.
path (str): Path to store feature data.
kwargs (dict of array or data frame): Additional keyword arguments, e.g.:
labels -> class labels
preds -> predicted class labels
pred_probs -> (n_sample, n_class) data frame of class probabilities
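
The record-array round trip mentioned above can be sketched with NumPy and pandas directly. This is an assumption-laden illustration rather than cesium's file layout: `save_sketch`, `load_sketch`, and the array keys `features` and `names` are hypothetical, and the frame is assumed to have flat string column names.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def save_sketch(fset, path, **kwargs):
    """Store features as a record array, plus any extra arrays (hypothetical)."""
    np.savez(
        path,
        features=fset.to_records(index=False),
        names=np.array([str(name) for name in fset.index]),
        **kwargs,
    )


def load_sketch(path):
    """Rebuild the frame and collect the extra arrays in a dict (hypothetical)."""
    with np.load(path) as data:
        fset = pd.DataFrame.from_records(data["features"])
        fset.index = data["names"]
        extras = {
            key: data[key] for key in data.files if key not in ("features", "names")
        }
    return fset, extras


fset = pd.DataFrame(
    {"amplitude": [1.0, 2.0], "std": [0.5, 0.7]}, index=["ts_a", "ts_b"]
)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "featureset.npz")
    save_sketch(fset, path, labels=np.array([0, 1]))
    loaded, extras = load_sketch(path)
```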

`TimeSeries`

class cesium.featurize.TimeSeries(t=None, m=None, e=None, label=None, meta_features={}, name=None, path=None, channel_names=None)

Bases: object

Class representing a single time series of measurements and metadata.

A TimeSeries object encapsulates a single set of time-domain measurements, along with any metadata describing the observation. Typically the observations will consist of times, measurements, and (optionally) measurement errors. The measurements can be scalar- or vector-valued (i.e., “multichannel”); for multichannel measurements, the times and errors can also be vector-valued, or they can be shared across all channels of measurement.

Attributes:

time ((n,) or (p, n) array or list of (n,) arrays): Array(s) of times corresponding to measurement values. If measurement is two-dimensional, this can be one-dimensional (same times for each channel) or two-dimensional (different times for each channel). If time is one-dimensional then it will be broadcast to match measurement.shape.
measurement ((n,) or (p, n) array or list of (n,) arrays): Array(s) of measurement values; can be two-dimensional for multichannel data. In the case of multichannel data with different numbers of measurements for each channel, measurement will be a list of arrays instead of a single two-dimensional array.
error ((n,) or (p, n) array or list of (n,) arrays): Array(s) of measurement errors for each value. If measurement is two-dimensional, this can be one-dimensional (same errors for each channel) or two-dimensional (different errors for each channel). If error is one-dimensional then it will be broadcast to match measurement.shape.
label (str, float, or None): Class label or regression target for the given time series (if applicable).
meta_features (dict): Dictionary of feature names/values specified independently of the featurization process in featurize.
name (str or None): Identifying name for the given time series (if applicable). Typically the name of the raw data file from which the time series was created.
path (str or None): Path to the file where the time series is stored on disk (if applicable).
channel_names (list of str): List of names of channels of measurement; by default these are simply channel_{i}, but can be arbitrary depending on the nature of the different measurement channels.
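
The broadcasting of one-dimensional time and error arrays to measurement.shape can be illustrated with NumPy. This sketch shows only the shape arithmetic and makes no claim about how TimeSeries performs it internally.

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0])        # shared times, shape (n,)
m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # two channels, shape (p, n)
e = np.array([0.1, 0.1, 0.2])        # shared errors, shape (n,)

# Repeat the shared arrays across channels so every array is (p, n).
t_full = np.broadcast_to(t, m.shape)
e_full = np.broadcast_to(e, m.shape)
```

Note that `np.broadcast_to` returns a read-only view; calling `.copy()` on the result gives a writable array if one is needed.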

Methods

`channels`()	Iterates over measurement channels (whether one or multiple).
`save`([path])	Store TimeSeries object as a single .npz file.
`sort`()	Sort times, measurements, and errors by time.

__init__(t=None, m=None, e=None, label=None, meta_features={}, name=None, path=None, channel_names=None)

Create a TimeSeries object from measurement values/metadata.

See TimeSeries documentation for parameter values.

channels(): Iterates over measurement channels (whether one or multiple).

save(path=None)

Store TimeSeries object as a single .npz file.

Attributes are stored in the following arrays:

time
measurement
error
meta_feat_names
meta_feat_values
name
label

If path is omitted then the path attribute from the TimeSeries object is used.

sort(): Sort times, measurements, and errors by time.
