An Overview of Amazon Timestream

By Derrick Cheng

Amazon Timestream is a powerful, purpose-built time series database designed for storing and analyzing timestamped data. This blog post gives a brief overview of how Timestream works, as well as how Timestream was used to store IoT sensor data for the Health Platform project. In this project, timestamped biometric and environmental sensor data is collected, sent to an IoT endpoint, and stored in Timestream for epileptic seizure detection, supporting individuals on the autism spectrum who are often at risk of experiencing unpredictable epileptic seizures. Originally, DynamoDB was the primary database for storing this data, but we switched to Timestream because it was easier to use and more cost-effective for time series processing. As such, I will highlight some of the advantages Timestream has over DynamoDB.

Data Format

Timestream is designed to store continuous measurements, such as heart rate readings. Therefore, it is not possible to update, delete, or otherwise change already ingested data; Timestream only supports writes and reads. Each data entry has a timestamp, dimensions, a measure name, a measure value, and a measure type. Dimensions are metadata attributes that follow a constant format across all data entries; in the Health Platform project they were the sensor ID and the patient ID. Records can be ingested as either single-measure or multi-measure records, depending on application needs. For the Health Platform project, since data was sent to IoT in large batches, it was ingested all at once using multi-measure records. When using multi-measure records, ensure that the schema is the same for all data points. If an application receives single data points at a lower rate, single-measure records may be a better choice.
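
Below is a minimal sketch of what a single multi-measure record could look like when building a write request with the AWS SDK for JavaScript v3. The dimension values and measure names (heart_rate, spo2) are hypothetical placeholders for illustration, not the project's actual schema.

    // A single multi-measure Timestream record (AWS SDK for JavaScript v3 shape).
    // Dimension values and measure names below are hypothetical placeholders.
    const record = {
        Dimensions: [
            { Name: 'patient_id', Value: 'patient-123' },
            { Name: 'sensor_id', Value: 'sensor-456' },
        ],
        MeasureName: 'vitals',
        MeasureValueType: 'MULTI',
        MeasureValues: [
            { Name: 'heart_rate', Value: '72', Type: 'DOUBLE' },
            { Name: 'spo2', Value: '98', Type: 'DOUBLE' },
        ],
        Time: Date.now().toString(),
        TimeUnit: 'MILLISECONDS',
    };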

Timestream’s data format is roughly the opposite of DynamoDB’s. Timestream has a set schema in which the dimensions, measure name, and timestamp serve as keys for querying the measure value. DynamoDB, on the other hand, holds a flexible set of attributes that are all identified by a single partition key. Timestream’s format makes much more sense for the Health Platform project, as it makes searching for sensor data using multiple parameters far easier.

Writing and Querying

When writing records to Timestream, there is a maximum of 100 records per ingestion request, so larger batches have to be split up into chunks of 100 data points. Writing into Timestream is very fast: there was no throttling when writing large batches of data, and writes were practically instantaneous. For events coming into the Health Platform project, writes were completed in less than a tenth of a second. Writing into Timestream is also very cheap; 1 million record ingestions will only set you back $0.50 USD. If Timestream rejects any records, whether due to an invalid format or another reason, an S3 bucket can be set up for Timestream to send the rejected records to.
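
As a rough illustration, here is how a Lambda function might split a large incoming batch into chunks of 100 records and write each chunk with the AWS SDK for JavaScript v3. The database and table names match the CDK example later in this post, but the writeRecordsInChunks helper itself is an assumption, not the project's actual code.

    import {
        TimestreamWriteClient,
        WriteRecordsCommand,
        WriteRecordsCommandInput,
    } from '@aws-sdk/client-timestream-write';

    const writeClient = new TimestreamWriteClient({});

    // Split a large batch into chunks of 100 records, the maximum allowed
    // per WriteRecords request, and write each chunk in turn.
    async function writeRecordsInChunks(
        records: NonNullable<WriteRecordsCommandInput['Records']>
    ): Promise<void> {
        for (let i = 0; i < records.length; i += 100) {
            await writeClient.send(new WriteRecordsCommand({
                DatabaseName: 'HealthDatabase',
                TableName: 'HealthMetricsData',
                Records: records.slice(i, i + 100),
            }));
        }
    }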

Querying from Timestream is very simple. Since there is a set schema, everything can be queried with a single SQL statement. There is no limit on how many records can be queried, but adding WHERE clauses that reduce the number of records scanned makes queries cheaper and faster. This is where Timestream really outshines DynamoDB: SQL is much easier to write and understand than DynamoDB query operations, so the switch made processing the health data much simpler. Querying is charged at $0.01 USD per GB scanned, and this will most likely make up a good chunk of the costs for a project utilizing Timestream.

Below is the main SQL statement for querying user data for the Health Platform project.

    SELECT to_iso8601(BIN(time, ${period})) AS binned_timestamp,
        patient_id,
        measurement_type,
        ROUND(${statisticQueryVal}, 2) AS measure_val
    FROM HealthDatabase.HealthMetricsData
    WHERE patient_id IN ('${patientIds.join("', '")}')
        AND time BETWEEN from_iso8601_timestamp('${start}') AND from_iso8601_timestamp('${end}')
    GROUP BY BIN(time, ${period}), patient_id, measurement_type
    ORDER BY patient_id, measurement_type, binned_timestamp ASC
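
As a sketch of how a statement like this could be executed from a Lambda function, the snippet below uses the Timestream Query client from the AWS SDK for JavaScript v3 and pages through results with NextToken. The runQuery helper is illustrative and not the project's actual implementation.

    import {
        TimestreamQueryClient,
        QueryCommand,
        QueryCommandOutput,
    } from '@aws-sdk/client-timestream-query';

    const queryClient = new TimestreamQueryClient({});

    // Run a Timestream SQL statement and collect all result rows,
    // following NextToken until the result set is exhausted.
    async function runQuery(queryString: string): Promise<NonNullable<QueryCommandOutput['Rows']>> {
        const rows: NonNullable<QueryCommandOutput['Rows']> = [];
        let nextToken: string | undefined;
        do {
            const response: QueryCommandOutput = await queryClient.send(
                new QueryCommand({ QueryString: queryString, NextToken: nextToken })
            );
            rows.push(...(response.Rows ?? []));
            nextToken = response.NextToken;
        } while (nextToken);
        return rows;
    }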

Storage

There are two different storage tiers in Timestream: a memory store and a magnetic store. The memory store holds recent data and is optimized for fast point-in-time queries, while the magnetic store holds historical data and is optimized for fast analytical queries. The data retention policy for each store is customizable, so the user can set how long data stays in each store. The archival of data from the memory store to the magnetic store, the deletion of data once it exceeds the magnetic store retention period, and the auto-scaling of stored data are all fully managed by AWS, which means Timestream requires very little user maintenance. All data written into Timestream is also automatically encrypted, so there’s no need to manually encrypt the data.

Example TypeScript CDK for creating a Timestream table with memory and magnetic stores:

    const dataTable = new cdk.CfnResource(this, 'HealthMetricsData', {
        type: 'AWS::Timestream::Table',
        properties: {
            DatabaseName: healthDatabase.ref,
            MagneticStoreWriteProperties: {
                EnableMagneticStoreWrites: true,
                // Rejected magnetic store writes are delivered to this S3 bucket.
                MagneticStoreRejectedDataLocation: {
                    S3Configuration: {
                        BucketName: timestreamRejectedDataBucket.bucketName,
                        EncryptionOption: 'SSE_S3',
                    },
                },
            },
            RetentionProperties: {
                // Keep data in the memory store for 2160 hours (90 days),
                // then in the magnetic store for 365 days before deletion.
                MemoryStoreRetentionPeriodInHours: '2160',
                MagneticStoreRetentionPeriodInDays: '365',
            },
            TableName: 'HealthMetricsData',
        },
    });

How Timestream Was Used in the Health Platform Project

[Architecture diagram of the Health Platform project, with numbered steps in green for the IoT Event Flow and in blue for the Frontend Website Event Flow.]
The architecture diagram for the Health Platform project gives insight into two different event flows: 1) IoT sensor data and API-based data, and 2) user-initiated events from the frontend dashboard. Green numbers indicate the event flow from IoT sensors and APIs, while blue numbers indicate the event flow from user interaction with the frontend Health Platform website.

IoT Event Flow

  1. Sensor data flows into the AWS cloud via AWS IoT. The iOS application leverages an Amazon Cognito user pool to authenticate the user, while the DIY Arduino Gas Sensor authenticates into the solution using private certificate keys.
  2. The Airthings API is called every 5 minutes by a Lambda function. Authentication occurs via a token obtained from the Airthings integration API.
  3. The Biostrap API is called every 5 minutes by a Lambda function. The authentication occurs via an organization API key from the Biostrap dashboard.
  4. The incoming data’s sensor ID is mapped to the corresponding patient ID stored in a DynamoDB table, which is manually populated via the web interface (see the sketch after this list).
  5. The data along with the matching patient ID is written into the Amazon Timestream database.
  6. The data along with the matching patient ID is also converted to Parquet format via Kinesis Firehose, and the conversion schema is stored in a Glue table. The Parquet file is saved in an S3 bucket with the path data/year/month/day/hour/{filename}, creating a data lake with the patient biometric information for later use in model creation.
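
For step 4, a minimal sketch of the sensor-to-patient lookup might look like the following. The table name SensorMappingTable and the sensor_id / patient_id attribute names are assumptions for illustration, not the project's actual DynamoDB schema.

    import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
    import { DynamoDBDocumentClient, GetCommand } from '@aws-sdk/lib-dynamodb';

    const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

    // Look up the patient ID mapped to an incoming sensor ID.
    // Table and attribute names are hypothetical placeholders.
    async function getPatientId(sensorId: string): Promise<string | undefined> {
        const result = await ddb.send(new GetCommand({
            TableName: 'SensorMappingTable',
            Key: { sensor_id: sensorId },
        }));
        return result.Item?.patient_id;
    }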

UI User Initiated Event Flow

  1. The frontend website enables managing two user roles, admins and caregivers, who authenticate into the application through an Amazon Cognito user pool. Caregivers register patients and their device IDs into the system, which are stored in Amazon Timestream and S3. The solution’s backend API is based on GraphQL, leveraging AWS AppSync.
  2. Whenever the user clicks the search button to update the dashboard, a Lambda function queries Timestream for the data based on the timeframe parameters specified by the user. This displays the most up-to-date data for the user to see.
  3. Newly created users are stored in the DynamoDB users table. Existing users and their information such as the patients they manage are also stored in this table.
  4. Indexed events are searched using OpenSearch, which displays the search results on the frontend website for the user.
  5. Events created using the website are indexed by Amazon OpenSearch so that they can be searched by users later. OpenSearch is a search engine that allows the user to match search keywords with event data. The event data is also converted to Parquet format via Kinesis Firehose, and the conversion schema is stored in a Glue table. The Parquet file is saved in an S3 bucket with the path data/year/month/day/hour/{filename}, creating a data lake with the patient event data for later use in model creation.
  6. Users can download all of a patient’s sensor readings via a button on the frontend website. A Glue database points to the data in the Health Platform Metrics S3 bucket and the Health Platform Events S3 bucket. A GraphQL query then triggers a Lambda function, which triggers Athena to query the Glue database. The patient data is exported to the Patient Export Data bucket and returned as a CSV file available for download through a pre-signed S3 URL that expires after 5 minutes (see the sketch after this list).
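
For step 6, the pre-signed download link could be generated along the following lines. The bucket and key arguments are placeholders, and the 300-second expiry corresponds to the 5-minute lifetime mentioned above.

    import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
    import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

    const s3 = new S3Client({});

    // Create a pre-signed URL for the exported CSV that expires after 5 minutes.
    async function getDownloadUrl(bucket: string, key: string): Promise<string> {
        const command = new GetObjectCommand({ Bucket: bucket, Key: key });
        return getSignedUrl(s3, command, { expiresIn: 300 }); // 300 seconds = 5 minutes
    }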

View the full Health Platform project at https://github.com/UBC-CIC/health-platform