How to annotate and label your time series data?

Bora Kizil
8 min read · Mar 2, 2023

(The original article was written in Nov 2022 by Julien Muller, CTO of Ezako and my co-founder. In Feb 2023, I made several contributions to complete it.)

What is time series data?

At Ezako we are experts in time series data. Time series data is a specific type of data that differs greatly from other types in how it is produced and used, and its properties are also different. It is nevertheless part of our daily life: think of the data from a thermometer, an ECG, or the voltmeter of an electronic component. This data can be tracked for monitoring (servers), prediction (finance), or predictive maintenance (industry). The range of applications is extremely broad, and new use cases constantly appear with the invention of new objects (such as IoT devices) and the digitalization of production chains.

Time series have technical specificities that differentiate them from other types of data, in particular:

  1. A sequence of values that change over time
  2. Often a large number of points (generated by machines), both in frequency and number of series
  3. Often a great imbalance of labels and/or classes. For example, the number of anomalous points is very low compared to normal data

Labeling or annotation for time series

Labeling or annotation of time series data is the action of enriching one or more time series or adding metadata to the series. It can be carried out point by point, for a period, or for an entire series.

The type of label and the techniques used will directly depend on the objective sought. Indeed, we label for a specific and precise purpose. Labeling for predictive maintenance is not the same as classifying different states of a system. In the first case, the label might be a point-by-point remaining lifetime, and in the second, a simple class like “ON”, “OFF”, “SWITCHING”…

Why label data?

To provide more information on the time series available to us. The objective can be classification, formalising the understanding of behaviour, or learning from anomalous events to detect or explain them.

In Machine Learning, the interest is almost obvious. For the data scientist, addressing a supervised problem is much easier than an unsupervised one: with an equivalent dataset, supervised approaches give considerably better results than unsupervised ones.

Why?

Firstly, because the two approaches are not on equal terms: supervised approaches have much more information available than unsupervised ones. Secondly, unsupervised approaches are believed to offer qualities that supervised ones lack. For example, an unsupervised anomaly detection algorithm should be able to detect previously unknown anomalies. But hybrid approaches can also achieve this goal. We can consider different degrees of hybridization between the unsupervised and the supervised:

  1. Unsupervised model
  2. Unsupervised learning, with detection quality evaluated on labeled data (supervised evaluation) or other supervised actions
  3. Semi-supervised: unsupervised algorithms that make assumptions about the data (for example, an auto-encoder that considers the whole training set to be normal) or, more generally, algorithms that use a restricted set of annotations
  4. Self-supervised, where the model performs a two-step learning process: it builds its own pseudo-labels and then learns from them. A simple example is an AI playing chess against itself.
  5. Totally supervised approach

In general, in machine learning, having a labeled dataset is advantageous because it allows several supervised approaches to be evaluated clearly with metrics such as precision, recall, F-score, etc. However, a pitfall common to all supervised approaches must be addressed: the risk of overfitting. This risk is all the more present when there are few labels, but being aware of it makes it possible to put mechanisms in place that mitigate it, such as k-fold cross-validation. If, despite this, we still face overfitting, we can change our approach and switch, for example, from supervised models to unsupervised models with hyperparameter optimization.
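As an illustration of the k-fold idea, here is a minimal sketch (scikit-learn and the synthetic dataset are our own additions, not part of the original article):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced dataset standing in for labeled time-series features
# (95% "normal" class, mimicking the label imbalance discussed earlier).
X, y = make_classification(n_samples=500, weights=[0.95], random_state=0)

# 5-fold cross-validation scores the model on five held-out folds,
# giving a more honest estimate than a single train/test split.
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=5, scoring="f1")
print(scores.mean())
```

Each fold is evaluated on data the model never saw during training, which is exactly what exposes overfitting when labels are scarce.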

How to label?

You can think of different approaches to annotating your dataset. Among others:

  1. The manual approach
  2. The field expert approach
  3. Coding-based (Python)
  4. With a labeling tool like Upalgo Labeling

Let’s work through a small practical case with a 50,000-line time-series CSV file. This file has a timestamp column and 4 sensor columns. The vibration sensors show significant peaks that we want to annotate in order to create a classification model. The objective is to create a new column containing one of three classes: “peak high”, “peak low” or “normal”.
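The original file is not provided, so as a sketch we build a synthetic stand-in DataFrame with the same shape (the column names, and the commented-out file name, are assumptions):

```python
import numpy as np
import pandas as pd

# 50,000 rows: a timestamp column plus 4 sensor columns, as in the article.
n = 50_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=n, freq="s"),
    **{f"sensor_{i}": rng.normal(size=n) for i in range(1, 5)},
})
# With the real file, one would instead do something like:
# df = pd.read_csv("sensors.csv", parse_dates=["timestamp"])
print(df.shape)
```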

  • The manual approach

For the manual approach, we will use Excel. After importing the CSV file, Excel offers a chart feature, which seems very practical for this work. However, it turns out to be of limited use: of the 50,000 points, we were only able to display 5,000 at a time.

Creating a “label” column is very simple, but the task itself turns out to be very tedious: you have to locate the timestamp of each peak on the chart, find it in the table, and annotate the column, then move on to the next 5,000 points. Once this is done, the table can be exported in CSV format.

  • Field expert approach

For this simplistic example, we can imagine that the expert told us that the peaks are often large, so we set a limit and automate it. The implementation language doesn’t matter, Python or otherwise; for graphical reasons we used an Excel formula:
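Since the formula itself is only shown as an image, here is a rough Python equivalent of the fixed-threshold rule on synthetic data (only the ±2 limit comes from the article; the injected peaks are our own):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a vibration sensor with two injected peaks.
rng = np.random.default_rng(0)
sensor = pd.Series(rng.normal(scale=0.5, size=1000))
sensor.iloc[500] = 5.0
sensor.iloc[700] = -5.0

# Fixed-threshold rule from the field expert: anything beyond +/-2 is a peak.
label = np.select([sensor > 2, sensor < -2],
                  ["peak high", "peak low"],
                  default="normal")
print(pd.Series(label).value_counts())
```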

While this solution has the advantage of being very compact and quick to implement, it quickly shows its limits. Since the peaks are not made up of a single value, we get low-quality labels. For the following pattern, we expect around 15 “peak low” points, but the result is quite erratic.

Moreover, this rule only applies to this specific dataset and must be redefined for every new dataset, even with a similar need to find peaks:

Of course, we can go back over each event and validate the data, but that will take a long time. In addition, we make a strong assumption by setting the limit at 2, which will probably lead to small events being ignored.

  • Coding-based (Python)

Here we will exploit the capabilities of pandas to perform a quick labeling operation with better quality. This addresses some of the limitations of the previous approach, in particular the predefined anomaly threshold.

We identify that the peaks are outliers; that is to say, the peaks do not all have the same amplitude. We therefore define an algorithm that seeks to isolate the peaks by computing the mean and standard deviation over sliding windows, which determines a variable threshold from these parameters:

The window size and the number of deviations from the mean that triggers the separation remain to be configured when running the “rolling_std” algorithm; here the pair (100, 5) is not the only good setting, but one among many. The next step is to create the ‘class’ column and assign the calculated annotations to it.
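Since the original snippet is only shown as an image, here is a minimal pandas sketch of such a rolling-window approach (the data is synthetic and the exact “rolling_std” implementation is an assumption; only the (100, 5) parameters come from the text):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one sensor: Gaussian noise plus a sustained peak.
rng = np.random.default_rng(1)
s = pd.Series(rng.normal(size=2000))
s.iloc[1000:1010] += 10.0

window, n_std = 100, 5  # the (100, 5) pair mentioned in the text

# Rolling estimates of the local mean and standard deviation.
mean = s.rolling(window, min_periods=1).mean()
std = s.rolling(window, min_periods=1).std().fillna(0)

# A point is flagged (class 10) when it lies more than n_std rolling
# standard deviations away from the rolling mean; otherwise class 0.
df = pd.DataFrame({"value": s})
df["class"] = np.where((s - mean).abs() > n_std * std, 10, 0)
print(df["class"].value_counts())
```

The threshold adapts to the local behaviour of the series, which is what removes the need for a fixed limit like the expert’s 2.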

Then we plot the result:

We then transform the numerical annotations, which separated our data into two classes (0, 10) equivalent to (normal, peak), into a triplet (normal, peak_high, peak_low).
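As a hedged sketch of that transformation (the column names “value” and “class” are assumptions, since the original code is only shown as an image), the sign of the flagged value can decide between the two peak classes:

```python
import numpy as np
import pandas as pd

# A few rows standing in for the binary (0, 10) annotation produced earlier.
df = pd.DataFrame({
    "value": [0.1, 9.0, -8.5, 0.2],
    "class": [0, 10, 10, 0],
})

# Split class 10 into peak_high / peak_low according to the sign of the value.
df["label"] = np.select(
    [(df["class"] == 10) & (df["value"] > 0),
     (df["class"] == 10) & (df["value"] < 0)],
    ["peak_high", "peak_low"],
    default="normal",
)
print(df["label"].tolist())  # ['normal', 'peak_high', 'peak_low', 'normal']
```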

We obtain labeling of rather good quality, more generalizable than the fixed-rule approach for detecting peaks, provided we trust the algorithm used; the responsibility is therefore transferred to the developer and the recipe they have identified.

  • With a labeling tool like Upalgo Labeling

Upalgo Labeling is a tool with a graphical interface to visualize time series and quickly label series with human validation. This allows for more finesse in the annotation while guaranteeing faster work.

Here are the steps for a good annotation with the tool:

Step 1: In the Upalgo Labeling tool it is possible to manually view and label an event:

It is also possible to use the automatic feature of proposing areas to annotate, which will quickly provide several quality labels:

Step 2: Then we ask Upalgo Labeling to propagate the previously defined “high peak” and “low peak” classes, which finalizes the annotation of the entire file:

Step 3: We can export the result in a CSV file containing a new “label” column:

Here, the data scientist has total control over what they annotate, which guarantees better-quality labeling.

Conclusion

We can see that with generic tools such as Excel, annotating time series is possible but quite laborious. A programmatic approach based on field expertise is quick but shows fairly obvious shortcomings. The Python approach is already much smarter but requires effort and programming knowledge.

Using a tool like Upalgo Labeling is much more efficient and easier: it allows automatic and fast annotation without having to code dedicated algorithms. However, there is a risk of bias with these tools. To mitigate this risk, it is important to use a large number of datasets and human confirmation.


Bora Kizil

Co-founder at Ezako (www.ezako.com), the time-series solutions company. We help our clients with anomaly detection, labeling and forecasting problems.