Data Aggregation Across Data Sources

We have 3 different sources of data:

  1. Our sensor data: Indoor Air Quality and Indoor Environmental Data.

  2. SINAICA: Outdoor Air Quality Monitoring Data from the Government.

  3. OpenWeatherData: Outdoor Environmental Data.

This data needs to be made available to the models we plan to train. The following sections detail this process.

Indoor Data

Outdoor Air Quality Data

Outdoor Weather Data

Merging the 3 Datasets: Indoor Data, Outdoor Air Quality Data, Outdoor Weather Data.

Merging Air Quality and Weather Data

Merging Indoor and Outdoor (Air Quality and Weather) Data

Imputations

After merging the two datasets (Outdoor Data, sampled every hour, and Indoor Data, sampled every 3 seconds), we found that the resulting dataframe contains repeated records in the hourly columns: those coming from SINAICA (the government's air quality monitoring) and OpenWeatherData. Each hourly value is repeated across every 3-second indoor record that falls within that hour.
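The repetition can be reproduced on synthetic data. The sketch below is only an illustration: the column names `timestamp`, `indoor_co2` and `outdoor_temp`, the values, and the use of `pd.merge_asof` are assumptions, not necessarily how the merge was performed in this project.

```python
import pandas as pd

# Hypothetical indoor readings every 3 seconds (names and values are assumptions).
indoor = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01 10:00:00", periods=2400, freq="3s"),
    "indoor_co2": range(2400),
})

# Hypothetical outdoor readings every hour.
outdoor = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01 10:00:00", periods=3, freq="60min"),
    "outdoor_temp": [18.0, 21.0, 19.5],
})

# Attach to each indoor row the most recent outdoor reading at or before it.
merged = pd.merge_asof(indoor, outdoor, on="timestamp", direction="backward")

# Count how many indoor rows share each hourly outdoor value.
rows_per_value = merged.groupby("outdoor_temp").size()
print(rows_per_value)
```

In this toy case every hourly value is copied onto 1200 consecutive rows, and the outdoor column changes in a single step at the hour boundary.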

We think the repeated data can be an issue because the values jump abruptly between consecutive records, for example between a record at 10:57 and one at 11:00. This matters because the data then misrepresents the real world: temperature, pressure and other natural variables move gradually from one value to the next. However, we do not have data at that finer resolution, and it is not easily obtainable.

Therefore, we propose an approach similar to imputation: interpolating between the hourly values and adding noise, which could help avert overfitting when training our machine learning and deep learning models.
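One possible reading of this proposal is sketched below. It assumes linear interpolation in time between consecutive hourly values plus small Gaussian noise; the helper name `interpolate_with_noise`, the noise level and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

def interpolate_with_noise(hourly, target_index, noise_std, seed=0):
    """Interpolate an hourly series onto target_index and add Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Reindex on the union of both indexes so the hourly anchors are kept.
    combined = hourly.reindex(hourly.index.union(target_index))
    # Linear interpolation weighted by the time elapsed between anchors.
    smooth = combined.interpolate(method="time").reindex(target_index)
    # Assumption: zero-mean Gaussian noise with a fixed standard deviation.
    noise = rng.normal(0.0, noise_std, size=len(smooth))
    return smooth + noise

# Two hypothetical hourly temperature readings.
hourly = pd.Series(
    [18.0, 21.0],
    index=pd.to_datetime(["2021-01-01 10:00:00", "2021-01-01 11:00:00"]),
)

# Target timestamps at the 3-second indoor sampling rate.
target = pd.date_range("2021-01-01 10:00:00", "2021-01-01 11:00:00", freq="3s")

temp = interpolate_with_noise(hourly, target, noise_std=0.05)
temp_no_noise = interpolate_with_noise(hourly, target, noise_std=0.0)
```

With `noise_std=0.0` the result is a straight ramp between the two hourly values. A non-zero value adds small fluctuations around that ramp, and a suitable magnitude would have to be chosen per variable.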

Here we can see the first and last data points, which are used to create the interpolation for the first and last values:

Resampling

To reduce training time, we propose resampling the data to coarser time intervals.

In the following subsections we create the resampled dataframes.
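As a rough sketch of what each subsection does, the snippet below resamples a synthetic 3-second dataframe to the six window lengths used here. The column name `indoor_co2`, the synthetic values and the choice of the mean as the aggregation are assumptions. Another aggregation, such as the median or the last value, may suit some columns better.

```python
import pandas as pd

# Hypothetical merged frame indexed by timestamp at 3-second steps.
index = pd.date_range("2021-01-01 10:00:00", periods=1200, freq="3s")
df = pd.DataFrame({"indoor_co2": range(1200)}, index=index)

# Window lengths in minutes, matching the subsections that follow.
windows = [1, 2, 5, 10, 15, 30]

# Assumption: each window is summarised by the mean of its records.
resampled = {m: df.resample(f"{m}min").mean() for m in windows}

for minutes, frame in resampled.items():
    print(minutes, len(frame))
```

One hour of 3-second data (1200 rows) shrinks to 60 rows at 1 minute and to 2 rows at 30 minutes, which is where the reduction in training time would come from.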

1 Minute Resampling

2 Minute Resampling

5 Minute Resampling

10 Minute Resampling

15 Minute Resampling

30 Minute Resampling

References