Introducing cesium

From the reading of electroencephalograms (EEGs) to earthquake seismograms to light curves of astronomical variable stars, gleaning insight from time series data has been central to a broad range of scientific and medical disciplines. When simple analytical thresholds or models suffice, technicians and experts can be easily removed from the process of inspection and discovery by employing custom algorithms. But when dynamical systems are not easily modeled (e.g., through standard regression techniques), classification and anomaly detection have traditionally been reserved for the domain expert: digitally recorded data are visually scanned to ascertain the nature of the time variability and find important (perhaps life-threatening) outliers. Does this person have an irregular heartbeat? What type of supernova occurred in that galaxy? Even in the presence of sensor noise and intrinsic diversity of the samples, well-trained domain specialists show a remarkable ability to make discerning statements about the complex data.

In an era when more time series data are being collected than can be visually inspected by domain experts, however, computational frameworks must necessarily act as human surrogates. Capturing the subtleties that domain experts intuit in time series data (and perhaps even besting the experts) is a non-trivial task. In this respect, machine learning (ML) has already been used to great success in several disciplines, including text classification, image retrieval, segmentation of remote sensing data, internet traffic classification, video analysis, and classification of medical data. Even if the results are similar, some obvious advantages over human involvement are that ML algorithms are tunable, repeatable, and deterministic. A computational framework built with elasticity can scale, whereas experts (and even crowdsourcing) cannot.

Machine learning for domain-science time series inference

Despite the importance of time series in scientific research, there are few resources available that allow domain scientists to easily build robust computational inference workflows for their own data, let alone data gathered more broadly in their field.

Traditional statistical models of time series are often too rigid to explain complex time domain behavior, while popular machine learning packages deal almost exclusively with ‘fixed-width’ datasets requiring a uniform number of pre-computed features. cesium allows researchers to apply modern machine learning techniques to time series data in a way that is simple, easily reproducible, and extensible. Being a modern data-driven scientist should not, we believe, require an army of software engineers, machine learning experts, statisticians and production operators.
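
To make that difference concrete, here is a minimal sketch of turning a handful of time series of different lengths into the kind of fixed-width feature table that standard machine learning tools expect. The calls and feature names follow cesium's Python API but are hedged: they may differ between versions.

    # Minimal sketch: three time series of *different* lengths become one
    # fixed-width feature table. Feature names are assumptions and may vary
    # between cesium versions.
    import numpy as np
    from cesium import featurize

    times = [np.linspace(0, 1, n) for n in (75, 100, 150)]
    values = [np.sin(10 * t) + 0.1 * np.random.randn(t.size) for t in times]

    fset = featurize.featurize_time_series(
        times=times,
        values=values,
        errors=None,
        features_to_use=["amplitude", "std", "skew"],
    )
    print(fset)  # one row per time series, one column per feature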

Simplifying the analysis workflow

Building a functioning machine learning pipeline involves much more than choosing a mathematical model for your data. The goal of cesium is to simplify the analysis pipeline so that scientists can spend less time solving technical computing problems and more time answering scientific questions. cesium comes with a number of out-of-the-box feature engineering workflows, such as periodogram analysis, which transform raw time series data to pull signal from the noise. By also streamlining the process of fitting models and studying relationships within datasets, cesium allows researchers to iterate rapidly and quickly answer new questions that arise out of previous lines of inquiry. We also aim to make cesium analyses easily shareable, so that an entire process of discovery can be reproduced by other researchers. Saved cesium workflows are meant to be production-ready, meaning that comprehensive machine learning can be applied not just to data in retrospect but to live, streaming data as well.
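
For intuition about what a periodogram-style feature buys you, the following illustrative sketch (using SciPy's Lomb-Scargle routine as a stand-in for cesium's own periodogram features) recovers a 2 Hz signal from noisy, irregularly sampled data:

    # Illustration only: a Lomb-Scargle periodogram pulls a 2 Hz signal out of
    # noisy, irregularly sampled data. SciPy stands in here for cesium's
    # built-in periodogram-based features.
    import numpy as np
    from scipy.signal import lombscargle

    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 10, 200))            # irregular sample times
    y = np.sin(2 * np.pi * 2.0 * t) + rng.normal(0, 0.5, t.size)

    freqs = np.linspace(0.1, 5.0, 1000)             # trial frequencies (Hz)
    power = lombscargle(t, y - y.mean(), 2 * np.pi * freqs, normalize=True)

    print("Peak at %.2f Hz" % freqs[np.argmax(power)])   # close to 2.00 Hz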

Scaling to large datasets

In many fields, the volumes of time series data available can be immense. cesium makes the process of analyzing time series easily parallelizable and scalable; scaling an analysis from a single machine to a large cluster should be straightforward, even for researchers who are not computing specialists. The analysis workflow of cesium can be used in two forms: a web front end which allows researchers to upload their data, perform analyses, and visualize their models all within the browser; and a Python library which exposes more flexible interfaces to the same analysis tools. The code for both the web front end and the Python library is open source and available on GitHub. In our next post, we’ll show how the Python library can be used to quickly solve a classic EEG classification problem.
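
As a preview of that next post, here is a hedged end-to-end sketch of the Python-library workflow: featurize labeled time series with cesium, then hand the resulting table to an ordinary scikit-learn classifier. The cesium calls and feature names are assumptions that may differ between versions; in recent versions the featurization result is a pandas DataFrame, which scikit-learn accepts directly.

    # Sketch: featurize labeled time series with cesium, then fit a standard
    # scikit-learn model on the resulting fixed-width feature table.
    import numpy as np
    from cesium import featurize
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    times = [np.linspace(0, 1, 200) for _ in range(40)]
    labels = np.array([0, 1] * 20)                    # two toy classes
    values = [rng.normal(0, 1, 200) + lab * np.sin(20 * t)
              for t, lab in zip(times, labels)]       # noise vs. sinusoid + noise

    fset = featurize.featurize_time_series(
        times=times, values=values, errors=None,
        features_to_use=["amplitude", "std", "skew"])

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(fset, labels)                             # one row per series
    print(clf.score(fset, labels))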

With thanks to the National Science Foundation and the Gordon and Betty Moore Foundation for support.


Written by Cesium Developers in cesium on Fri 08 July 2016.
