Time series data is omnipresent in our lives. It is everywhere, from weather forecasts and stock prices to sensors and monitoring systems in industrial automation environments. What is time series data? What are the best databases, time series data visualization tools and techniques to use?
Time series data is omnipresent in our lives. We can encounter it in pretty much any domain: sensors, monitoring, weather forecasts, stock prices, exchange rates, application performance, and a multicity of other metrics that we rely on in our professional and daily lives.
In this article, we will summarize all theoretical information about time series data. This basic knowledge is required for you to use time series data sets as a way to gain business insights or conduct a study. We will cover the main concepts related to gathering, organizing and usage of time series data. You will learn about the types and formats of time series data, ways to store and collect it. As well as diving into the fundamentals of analysis and visualization techniques explained on multiple time series data examples.
Let’s begin by clearly defining what time series data actually is, as well as what it isn’t.
Basically, time series data is any type of information presented as an ordered sequence. It can be defined as a collection of observations for a single subject assembled over different, generally equally spaced, time intervals. Or, to put it simply, time series is data (observations or behavior) collected at different points in time and organized chronologically.
Time is the central attribute that distinguishes time series from other types of data. The time intervals applied to assemble the collected data in a chronological order are called the time series frequency. So having time as one of the main axes would be the main indicator that a given dataset is time-series.
Now, let’s clarify what kinds of data are NOT time series to get this out of the way. Even though it is very common to use time as an axis in datasets, not all collected data is time-series. Data can be divided into several types based on time or, in other words, how and when it was recorded.
Here are three basic types of data, divided by the role of time in a dataset presentation:
Since time series are used for multiple purposes and in various fields, the datasets can vary. Here are the three main characteristics that would allow you to easily identify such datasets:
Being able to analyze and access insightful data is quickly becoming a necessity for almost any organization, including businesses, public and educational institutions. Time series data is a valuable commodity, and its value in today’s world appreciates mightily.
It wouldn’t be an overstatement to say that data—and our ability to collect and organize data with previously unseen speed and efficiency—is a key component of technological development. Being able to utilize gathered data allows us to build better systems, improve the efficiency of virtually all processes and predict future events in real time.
In today’s world, time series data models are truly ubiquitous. Time-axed datasets are widely used across various industries, from finance and economics to sciences, healthcare, weather forecasting and engineering.
As data rapidly becomes one of the most valuable commodities in today’s business world, the importance of time series datasets is increasing as well. Organizations across the globe face the need to implement more IT systems, sensors and tools, all of which are able to produce and/or collect time series data.
This is why it is so important to understand what time series data is, and how using it in the right way can allow an organization to improve and optimize virtually all work processes.
Since time is fundamental and the most integral part of our life, it plays a role in gathering any observable information. In other words, it won’t be wrong to say that you can find it everywhere.
Being so omnipresent, time series obviously has a myriad of different applications across industries. In order for you to understand how it is used in practice, here are several examples.
Time series data can be divided into a number of types based on different criteria. Knowing the differences between the types of data is important as it affects the way you can interact with time series data and how the database can store and compress it. Let’s go through some of the most common ways to categorize, illustrations and examples of time series data.
One characteristic of time series data that is important to understand is based on the relationship to the regularity of measurements collected in the series. Depending on whether the data is collected at regular or irregular time intervals, it can be classified as metrics or events.
Metrics is a type of measurement gathered at regular time intervals.
Events is a type of measurement gathered at irregular time intervals.
Regular data that is evenly distributed across time and thus can be modeled or used for processes such as forecasting is called metrics. You need to have metrics in order to use modeling, forecasting and producing aggregates of all kinds.
Here are several examples of metrics.
Heart rate monitoring (ECG) and brain monitoring (EEG)
The above-mentioned heart rate monitoring, also known as electrocardiogram (ECG) and brain monitoring (electroencephalogram, EEG), as well as other similar health-related monitoring methods, are examples of metrics: they are measuring the activity of human organs at regular time periods.
All kinds of weather conditions monitoring, such as daily temperature, wind, air pressure, heights of ocean tides, etc., is another example of metrics that are used to build data models, which allow us to predict weather changes that will occur in the future.
Stock price changes
Measurements of stock price fluctuations are also gathered at regular time intervals.
Events measurements are gathered at irregular time intervals. Because they are unpredictable, intervals between events are inconsistent, which means that this data can’t be used for forecasting and modeling as this would lead to unreliable future projections.
In simpler words, events are a type of data that is focused on capturing the required data whenever it is generated, which could happen with varying time intervals or as a burst.
Here are several examples of events.
Bank account deposits/withdrawals
Financial monitoring of bank accounts and ATMs is one example of events time series when the data gathered is the deposits and/or withdrawals, which typically occur at unpredictable time periods.
In computing systems, log files serve as a tool that records either various kinds of events that occur in an operating system and other software, or messages between different users of a communication software. Since all of this happens at irregular time intervals, the information in logs can be classified as events.
Another way to classify it, is by the functional forms of data modeling, which divides it into linear and nonlinear.
In linear time series, each data point can be viewed as a linear combination of past or future values. In other words, a linear time series is generated by a linear equation and can be modeled as a linear AR (auto-regressive) model.
One way to specify if the time series is linear is by looking at how X and Y change in accordance to each other. When X increases by 1 and Y increases by a constant rate accordingly, then the data is linear.
Here’s an example of a linear graph.
Nonlinear time series, on the other hand, are generated by nonlinear dynamic equations and have features that can’t be modelled by linear processes, such as breaks, thresholds, asymmetric cycles and changing variances of different kinds. This is why generating and processing nonlinear time series data is typically a lot more difficult and requires complex modeling processes.
Here’s one example of a nonlinear data:
Moving further, let’s take a look at several key aspects of time series data that are important for you to understand in order to be able to work with and use it.
One approach that is fundamental to time series data, and applied across all computing systems in general, is immutability. Which is quite a simple concept to understand.
Immutability is the idea that data (objects in functional programming languages) should not be changed after a new data record was created. Instead, all new records should only be added to the existing data.
Information updates in time series data models are always recorded as a new entry, coming in a time order. In line with the immutability principle, entries in time series data are usually not being changed after they were recorded and are simply added to all the previous records in a database.
So now it should be fairly easy to understand why immutability is a core concept to time series data. Time series database is always (or ‘typically,’ to be more precise) treated as a time-ordered series of immutable objects. Having immutability incorporated allows to keep time series databases consistent and unchanging as all queries are performed at a particular point-in-time.
Immutability as a core principle is something that differentiates time series data sets from relational data. As opposed to time series, the relational data, which is stored in relational databases, is typically mutable. It is used for online transaction processing and other requirements where new events take place on a generally random basis. Although these events still occur in chronological order (events of any kind just can’t exist outside of time), time is not relevant to such data and therefore is not used as an axis. For time series data, on the other hand, time changes are always essential.
Now let’s talk in more detail about the three basic types of data that we mentioned previously, and how time series data is different from cross-sectional, pooled and panel data. Understanding these concepts is another key element to being able to utilize time series data with multiple benefits.
As we learnt, time series data is a collection of observations for a single subject assembled over different time intervals. Having time as an axis is a distinctive feature of time series data.
Now let’s compare it to cross-sectional and pooled/panel data.
Cross-sectional data serves as an opposite concept, as this type relies on collecting and organizing various kinds of data at a single point in time. Cross-sectional data doesn’t rely on natural ordering of the observations so the data can be entered in any order.
Data on max temperature, wind or humidity at any day in a year can be a good example of cross-sectional data. As well as closing prices of stocks on a stock market, or sales of a store inventory at any given day.
Pooled (or cross-sectional) and panel (also called longitudinal) data are two rather close concepts, both used to describe data where time series and cross-sectional measurements are combined.
Essentially, both panel and pooled data rely on combining cross-sectional and time series measurements. The only difference is in the relationship to the units (entities), around which the data is collected (e.g. companies, cities, industries, etc.).
Panel data refers to multi-dimensional data that uses samples of the same cross-sectional units observed at multiple points in time.
Pooled data uses random samples of cross-functional units in different time periods, so each sample of cross-sectional data taken can be populated by different units.
Undoubtedly, these data type explanations can still leave some people confused. Here’s an easy way to differentiate these data types from each other.
Time series data is typically stored in time series databases (TSDBs) that are specifically built or optimized for working with timestamp data, be it metrics or events. Since time series data is frequently monitored and collected in huge volumes, it needs a database that can handle massive amounts of data.
TSDBs are also required to be able to support functions that are crucial to the efficient utilization of time series records, such as data summarization, data lifecycle management, and time-related queries.
Here is a list of features and capabilities that a modern purpose-built or optimized time series database should have:
Apache Druid is an open-source column-oriented time series database designed for high-speed analytics applications that require fast ingestion of large amounts of data and the ability to run low-latency operational queries on that data with high concurrency.
The project to develop the Druid database was originally started in 2011 by four developers—Eric Tschetter, Fangjin Yang, Gian Merlino and Vadim Ogievetsky—who needed to create a database for the Metamarkets project. In 2012 the project was transferred to GPL open-source license, and moved to an Apache License in 2015. The database name, Druid, was inspired by the Druid class of characters in many role-playing games, reflecting the fact that the shape of this database can quickly change to address different types of data problems.
Nowadays Apache Druid is commonly used in high-load business intelligence applications to stream data from message buses such as Kafka or Amazon Kinesis, and to batch load files from data lakes such as HDFS or Amazon S3.
Apache Pinot is an open-source time series database designed to answer OLAP queries with low latency on immutable and mutable data.
The project to develop Apache Pinot was originally started by an internal team at LinkedIn with the goal to create a database that would meet the social network’s requirements for rich real-time data analytics. It was open-sourced in 2015 under the Apache 2.0 license and donated to the Apache Software Foundation, which manages it now, in 2019.
Apache Pinot is a column-oriented database that supports compression schemes such as Run Length or Fixed Bit Length and pluggable indexing technologies like Sorted Index, Bitmap Index, or Inverted Index. It is able to ingest time series data from both batch (Hadoop HDFS, Amazon S3, Azure ADLS) and stream (Apache Kafka) data sources.
Amazon Timestream, part of Amazon Web Services (AWS), is a purpose-built time series database designed for collecting, storing, and processing time-series data in IoT and operational applications.
Fast time series data processing for real-time analytics is one distinctive feature of Amazon Timestream. The service was designed to provide 1,000 times faster query performance compared to relational databases, and includes features such as scheduled queries, multi-measure records, and data storage tiering. Automating scaling is another benefit of Amazon Timestream deriving from the cloud nature of this service. The solution scales automatically in accordance with the needs of user applications.
Amazon Timestream has SQL support with built-in time series functions for smoothing, approximation, and interpolation. It also supports advanced aggregates, window functions, and complex data types such as arrays and rows.The scheduled queries feature offers a serverless solution for calculating and storing aggregates, rollups, and other real-time analytics. Timestream queries are expressed in a SQL grammar with extensions for time series-specific data types and functions. Queries are then processed by an adaptive and distributed query engine that uses metadata from the tile tracking and indexing service to seamlessly access and combine data across data stores at the time the query is issued. Queries are run by a dedicated fleet of workers, the number of which is determined by query complexity and data size. Performance for complex queries over large data sets is achieved through massive parallelism, both on the query execution fleet and the storage fleets of the system.
Graphite is an open-source database and monitoring solution designed for fast real-time processing, storage, retrieval and visualization of time series data. Originally developed by Orbitz Worldwide company in 2006, Graphite was released under the open source Apache 2.0 license in 2008.
Graphite is most commonly used in production and by e-commerce companies to track the performance of servers, applications, websites and software solutions. The tool relies on Whisper, a simple database library component, to store time series data. Graphite isn’t a data collection agent, providing multiple easy integrations with third-party data tools instead. It uses Carbon as its primary backend daemon, which listens for any time-series data sent to the system and takes care of its processing. An additional Graphite webapp manages data visualizations, rendering graphs on-demand using Cairo library.
OpenTSDB is a distributed open-source time series database originally based on the Apache HBase open-source database. It is designed to address the need to collect, store and visualize time series data at a large scale. OpenTSDB is freely available under the GNU Lesser General Public License (LGPL) and GNU GPL licenses.
OpenTSDB was designed for maximum scalability, supporting the collection of time series data from tens of thousands of sources within an industrial automation network at a high rate. OpenTSDB databases consist of Time Series Daemons (TSD) and a set of command line utilities. TSDs run in each container and are responsible for redirecting newly collected data into the storage tier immediately upon its receival. Each TSD in an OpenTSDB database is independent from others and uses either HBase database or hosted Google Bigtable service to store and access time-series data. The communication with Time Series Daemons can be executed via a built-in GUI, an HTTP API, or using any telnet-style communication protocol.
InfluxDB is an open-source time series database developed by InfluxData company and designed to handle large volumes of time-stamped data from multiple sources, including sensors, applications and infrastructure. InfluxDB uses Flux—fourth-generation programming language designed for data scripting, ETL, monitoring and alerting—as its query language.
Written in the Go programming language, InfluxDB was designed for high-performance time series data storage. InfluxDB can handle millions of data points per second and allows for high throughput ingest, compression and real-time querying. InfluxDB automatically compacts, compresses, and downsamples data to minimize storage costs, keeping high-precision raw data for a limited time and storing the lower-precision, summarized data for much longer. The InfluxDB platform comprises a multi-tenanted time series database, background processing, a monitoring agent, UI and dashboarding tools for data visualization.
InfluxDB’s user interface includes a Data Explorer, dashboarding tools, and a script editor. Data Explorer allows users to quickly browse through the metric and collected event data, applying common transformations. The script editor is designed to help users quickly learn Flux query language with easily accessible examples, auto-completion and real-time syntax checking.
IBM Informix is a family of relational database management system (RDBMS) solutions originally developed by Informix Corporation and designed for high-frequency OLTP (Online Transaction Processing) applications in industries such as manufacturing, retail, finance, energy and utilities, etc. In 2001, Informix database was acquired by IBM and is now a part of IBM's Information Management division. In 2017, IBM outsourced the maintenance and support of Informix to India-based HCL Technologies.
There are multiple editions of the Informix database, including free versions for developers, small and medium-sized businesses, etc. It is optimized for networks with low database administration capabilities, incorporating multiple self-management and automated administrative features.
IBM Informix supports the integration of time series, SQL, NoSQL, BSON, JSON, and spatial data. The database includes Informix TimeSeries feature designed for fast and easy work with time series data generated by various smart devices in IoT networks.
MongoDB is a non-relational document-oriented database designed for the storage of JSON-like documents and other types of unstructured data. The project to develop MongoDB was started by the 10gen company (which later changed its name to MongoDB Inc) in 2007. The database is currently licensed under the Server Side Public License (SSPL), a non-free source-available license developed by the project.
MongoDB supports field, range, and regular-expression queries, being able to return entire documents, specific fields of documents, or random samples in the query results. It uses sharding, a horizontal partition of data in a database, for load balancing and scaling. The data stored in a MongoDB database is split into ranges based on the shard key selected by the user, and gets distributed across different shards.
Replica sets are utilized by MongoDB to provide high availability of data stored in the database. Each replica set consists of two or more copies of the data, with every set member able to act as primary or secondary replica at any moment in time.
TimescaleDB is an open-source relational database for time-series data, built on top of PostgreSQL. It uses full SQL and is easy to use like a classic relational database, but is able to provide scaling and other features that are typically reserved for purpose-built NoSQL databases.
TimescaleDB was created as an all-in-one database solution for data-driven applications, providing users with a purpose-built time-series database combined with a classic relational (PostgreSQL) database with full SQL support. Accelerated performance is one of the main distinctive features of TimescaleDB. According to the developers, this database is able to run queries 10 to 100 times faster compared to PostgreSQL, InfluxDB, and MongoDB. They also claim TimescaleDB supports up to 10 times faster inserts and is able to ingest more metrics per second per server for high-cardinality workloads.
The ability to handle massive amounts of data is another key feature of TimescaleDB as it was designed to effectively record and store IoT data. TimescaleDB allows users to store hundreds of billions of rows and dozens of terabytes of data per server. The database uses data type‑specific compression, which allows it to increase storage capacity up to 16 times. Another key advantage of TimescaleDB is the support of both relational and time-series databases, which allows users to simplify their technology stacks and store relational data alongside time‑series data.
Prometheus is an open-source systems monitoring and alerting application originally created by SoundCloud, an online audio distribution platform and music sharing website. Prometheus collects and stores real-time metrics in a time series database. Time series collection is enabled by using a HTTP pull model. SoundCloud started developing Prometheus in 2012 as an open source project, upon realizing that existing monitoring tools and metrics are not fully applicable to their needs.
Prometheus databases store time series data in memory and on local disk in an efficient custom format. Scaling is achieved by functional sharding and federation. Prometheus's local storage is limited to a single node's scalability and durability. Instead of trying to solve clustered storage in Prometheus itself, Prometheus offers a set of interfaces that allow integrating with remote storage systems.
Prometheus provides a functional query language called PromQL (Prometheus Query Language) that lets the user select and aggregate time series data in real time. The results of their operations with data can be visualized as graphs or viewed as tabular data in Prometheus's expression browser. It can also be exported to external systems via the HTTP API.
Data or process historian (also sometimes being called operational historian or simply “historian”) is a set of time-series database applications designed and typically used for collecting and storing data related to industrial operations. Read more about data historians.
Data historians were originally developed in the second half of the 1980s to be used with industrial automation systems such as SCADA (supervisory control and data acquisition). Primarily, they were utilized for the needs of the process manufacturing sector in industries such as oil and gas, chemicals, pharmaceuticals, pipelines and refining, etc.
Today, however, process historians are widely used across industries, serving as an important tool for performance monitoring, supervisory control, analytics, and quality assurance. They allow industrial facility managers and stakeholders, as well as engineers, data scientists and various machinery operators, to access the data collected from a variety of automated systems and sensors. The collected data can be utilized for performance monitoring, process tracking or business analytics. Modern-day historians also often include other features related to the utilization of collected data, such as reporting capabilities that allow users to generate automated or manual reports.
Now, after we learnt what time series data is and how it is stored, the next logical step would be to talk about the actual utilization of time series data models. This is where time series analysis comes into play.
Time series analysis refers to a specific way of analyzing a time series dataset—or simply a sequence of data points gathered over a period of time—to extract insights, meaningful statistics and other characteristics of this data.
Naturally, time series analysis requires time series data, which has a natural temporal ordering and has to be recorded at consistent time intervals. This is something that differentiates time series analysis from cross-sectional data studies where data is tied to one specific point in time and can be entered in any order.
You should also not confuse time series analysis with forecasting, which is a type of time series analysis, as it basically uses historical TS datasets to make predictions using the same approach.
Time series data analysis relies on identifying and observing patterns that serve as a way to extract actual information from a dataset using one of the available models.
Here are some of the most common patterns observed in time series data.
Detecting these and other patterns by applying various models to time series databases is how you can use time series analysis to achieve different goals.
Here are some of the most common types of time series analysis that can be utilized depending on the end-goal of your study.
As we learnt earlier, forecasting uses historical time series data to make predictions. Historical data is used as a model for the same data in the future, predicting various kinds of scenarios with future plot points.
Descriptive analysis is the main method used to identify the above-described patterns in time series data (trends, cycles, and seasonality).
Explanative analysis models attempt to understand data, relationships between data points, what caused them and what was the effect.
Interrupted (intervention) analysis
Interrupted (intervention) time series analysis, also known as quasi-experimental analysis, is used to detect changes in a long-term time series from before to after a specific interruption (any external influence or a set of influences) took place, potentially affecting the underlying variable. Or, in simpler words, it studies how a time series data can be changed by a curtain event.
Regression analysis is most commonly used as a way to test relationships between one or more different time series. As opposed to regression analysis, regular time series analysis refers to testing relationships between different points in a single time series.
Exploratory analysis highlights main features identified in time series data, typically in visual format.
Association analysis is used to identify associations between any two features in a time series dataset.
Classification is used to identify and assign properties to time series data.
Segmentation analysis is applied to split the time series data into segments based on assigned properties.
Curve fitting is used to study the relationships of different variables within a dataset by organizing them along a curve.
Now let’s talk a little bit about models that are commonly used for time series data. There is a huge variety of models, representing different forms and stochastic processes. Not to get into too much detail, in this article we will only describe a few of the most popular and common models.
When looking at the process level, three broad classes of linear time series models should be specified.
Here they are:
In order to perform time series analysis, you would normally require a fairly large number of data points as a way to ensure reliability and consistency.
One problem that can limit your ability to utilize time series data for analytics, forecasting and other purposes is the absence of high-quality extensive datasets. This issue is easily solvable, however, as you can find a fair amount of time series data sets that include enormous volumes of data, publicly available online.
Here are several great sources where you can find large numbers of highly variable datasets.
Despite having all these wonderful sources of time series data around the world, depending on your specific needs, you may still face a challenge of not having enough data. In such a case, the only solution would be to create your own time series dataset.
Now, creating and using time series datasets is a very broad topic that would require an extensive guide or tutorial on its own to cover all the aspects of this work. We can say, however, that collecting time series data, generating datasets and utilizing them for various purposes gets easier day by day thanks to a number of great libraries and frameworks that are designed to make it easier to work with time series data.
Let’s take a look at several great Python libraries and packages that you can use to create time series datasets, as well as using them afterwards to build models, generate predictions, etc.
Another crucial element of utilizing time series data is visualization. In order to extract valuable information and insights, your data has to be presented as temporal visualization to showcase the changes at different points in time.
Time series data visualization is typically performed with specialized tools that provide users with multiple visualization types and formats to choose from. Let’s take a look at some of the most common data visualization options.
Graph is a visual representation of data in an organized manner. Time series graphs, also called time series charts or time series plots, are probably the most common data visualization instrument used to illustrate data points at a temporal scale where each point corresponds to both time and the unit of measurement.
Even though these two terms are frequently used interchangeably, they are not exactly the same thing. Charts is the term for all kinds of representations of time series datasets in order to make the information clear and more understandable. Graphs, being mathematical diagrams showing the relationship between the data units over a period of time, are commonly used as a type of charts when visualizing time series data.
Real time graphs, also known as data streaming charts, are used to display time series data in real time. This means that a real time graph will automatically update after every several seconds or when the new data point is received from the server.
Here are some of the most common types of time series visualizations.
Line graph is probably the most simple way to visualize time series data. It uses points connected to illustrate the changes. Being the independent variable, time in line graphs is always presented as the horizontal axis.
Histogram charts visualize time series data by grouping it into bins, where bins are displayed as segmented columns. All bins in a histogram have equal width while their height is proportional to the number of data points in the bin.
Dot plots or dot graphs present data points vertically with dot-like markers, with the height of each marker group representing the frequency of the elements in each interval.
Scatter plots or charts use the same dot-like markers that are scattered across the chart area of the plot, representing each data point. Scatter plots are usually used as a way to visualize random variables in time series data.
Trend line is based on standard line and plot graphs, adding a straight line that has to connect at least two points on a chart, extending forward into the future to identify areas of support and resistance.
Time series data can be translated into charts, graphs, and other kinds of consumable analytical information that provides organizations with (often invaluable) insights. Time series visualization tools, which can come in the form of software or SaaS solutions, are required to analyze and present the data in a digestible form.
Time series data visualization tools allow organizations to create dashboards with easy to understand visualizations of key trends and KPIs. It is also increasingly common for modern data visualization solutions to have a simple drag-and-drop interface, allowing users with no technical and/or coding skills to work with time series data and create dashboards. These tools are typically designed specifically to visualize already organized time series data and are not intended to be used for time series data analysis. Some of them, however, may include additional features, allowing users to perform multiple types of activities, such as exploring, organizing and interacting with data, as well as collaborating and sharing data between members of the same organization.
When choosing a data visualization tool for use, check if it meets all the main general requirements for such a solution:
Here’s a list of the best tools for your time series data visualization based on the requirements listed above.
Hopefully, this article helped you to gain a better understanding of the fundamentals of time series data. Today, being aware about the true value of data is crucially important. You have probably heard the phrase “data is the new gold.” In the current era of digital transformation and Industry 4.0/5.0 solutions, the ability to collect, store and use time series data is one of the decisive factors and keys to business success.
Clarify is a solution that provides industrial teams with a next-gen level of time series data intelligence, helping to make data points from data historians, SCADA and IoT devices useful for the whole workforce, from field workers to data scientists.
Clarify makes it easy to visualize time series data from your IoT network, access it on web and mobile devices, share data amongst teammates and collaborate on it in real time. With Clarify, your organization can focus on key business goals instead of configuring dashboards or developing expensive custom solutions.