The sensor data can be processed sequentially, as a time series.
Some readings are missing, for the reasons described below, so we look at how best to impute them and avoid "gaps" in our observations.
import pandas as pd
from plotnine import *
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display, Markdown
airdata = pd.read_pickle("data/airdata/air.pickle")
airdata
index | temperature | pressure | humidity | gasResistance | IAQ | iaqAccuracy | datetime | year | month | day | hour | minute |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21.54 | 777.41 | 43.93 | 151328 | 37.5 | 1 | 2021-02-12 06:04:09.089621067 | 2021 | 2 | 12 | 6 | 4 |
1 | 21.56 | 777.41 | 43.89 | 152702 | 35.6 | 1 | 2021-02-12 06:04:12.087778807 | 2021 | 2 | 12 | 6 | 4 |
2 | 21.53 | 777.41 | 43.97 | 151328 | 37.5 | 1 | 2021-02-12 06:04:15.072475433 | 2021 | 2 | 12 | 6 | 4 |
3 | 21.51 | 777.41 | 44.03 | 151464 | 38.5 | 1 | 2021-02-12 06:04:18.070170164 | 2021 | 2 | 12 | 6 | 4 |
4 | 21.51 | 777.41 | 44.05 | 152425 | 36.9 | 1 | 2021-02-12 06:04:21.061994791 | 2021 | 2 | 12 | 6 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2068162 | 29.56 | 777.20 | 22.02 | 1544202 | 26.7 | 2 | 2021-04-24 22:16:08.900718689 | 2021 | 4 | 24 | 22 | 16 |
2068163 | 29.57 | 777.22 | 22.10 | 1527541 | 27.5 | 2 | 2021-04-24 22:16:11.896806479 | 2021 | 4 | 24 | 22 | 16 |
2068164 | 29.57 | 777.20 | 22.23 | 1521493 | 28.2 | 2 | 2021-04-24 22:16:14.893116951 | 2021 | 4 | 24 | 22 | 16 |
2068165 | 29.57 | 777.20 | 22.32 | 1511236 | 29.0 | 2 | 2021-04-24 22:16:17.889270782 | 2021 | 4 | 24 | 22 | 16 |
2068166 | 29.58 | 777.20 | 22.38 | 1509540 | 29.6 | 2 | 2021-04-24 22:16:20.885603666 | 2021 | 4 | 24 | 22 | 16 |
2068167 rows × 12 columns
We compute the time difference between consecutive readings to detect missing ones, since the sensor is expected to deliver one observation every 3 seconds.
If a reading is missing, the difference between an observation and the previous one will therefore exceed those 3 seconds.
#airdata["minute"] = [dt.minute for dt in airdata.datetime]
#airdata["second"] = [dt.second for dt in airdata.datetime]
airdata["datetime-1"] = airdata["datetime"].shift(1)
airdata["delta"] = airdata["datetime"] - airdata["datetime-1"]
# .dt.seconds keeps only the whole-second component (fractions are dropped)
airdata["delta"] = airdata["delta"].dt.seconds
airdata["imputated"] = False
# We discard the first 6 readings, which had gaps caused by
# unexpected restarts while the software was being tuned.
airdata = airdata.iloc[6:].reset_index(drop=True)
# drop the rows for which no lag could be computed
#airdata.drop(airdata.head(1).index,inplace=True)
#airdata.drop(airdata.tail(1).index,inplace=True)
airdata.head(10)
index | temperature | pressure | humidity | gasResistance | IAQ | iaqAccuracy | datetime | year | month | day | hour | minute | datetime-1 | delta | imputated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21.51 | 777.41 | 44.04 | 152149 | 34.7 | 1 | 2021-02-12 06:05:35.846304417 | 2021 | 2 | 12 | 6 | 5 | 2021-02-12 06:05:29.856916904 | 5.0 | False |
1 | 21.51 | 777.41 | 43.98 | 152841 | 33.6 | 1 | 2021-02-12 06:05:38.837326527 | 2021 | 2 | 12 | 6 | 5 | 2021-02-12 06:05:35.846304417 | 2.0 | False |
2 | 21.54 | 777.41 | 43.73 | 153259 | 31.5 | 1 | 2021-02-12 06:05:47.812360048 | 2021 | 2 | 12 | 6 | 5 | 2021-02-12 06:05:38.837326527 | 8.0 | False |
3 | 21.53 | 777.41 | 43.70 | 152841 | 31.5 | 1 | 2021-02-12 06:05:50.803695202 | 2021 | 2 | 12 | 6 | 5 | 2021-02-12 06:05:47.812360048 | 2.0 | False |
4 | 21.52 | 777.41 | 43.70 | 153399 | 30.2 | 1 | 2021-02-12 06:05:53.795462847 | 2021 | 2 | 12 | 6 | 5 | 2021-02-12 06:05:50.803695202 | 2.0 | False |
5 | 21.54 | 777.41 | 43.77 | 152702 | 30.9 | 1 | 2021-02-12 06:05:56.786891460 | 2021 | 2 | 12 | 6 | 5 | 2021-02-12 06:05:53.795462847 | 2.0 | False |
6 | 21.55 | 777.40 | 43.76 | 152980 | 30.7 | 1 | 2021-02-12 06:05:59.778601646 | 2021 | 2 | 12 | 6 | 5 | 2021-02-12 06:05:56.786891460 | 2.0 | False |
7 | 21.59 | 777.40 | 43.61 | 152841 | 30.8 | 1 | 2021-02-12 06:06:02.770255804 | 2021 | 2 | 12 | 6 | 6 | 2021-02-12 06:05:59.778601646 | 2.0 | False |
8 | 21.59 | 777.41 | 43.56 | 152980 | 30.6 | 1 | 2021-02-12 06:06:05.761730671 | 2021 | 2 | 12 | 6 | 6 | 2021-02-12 06:06:02.770255804 | 2.0 | False |
9 | 21.63 | 777.43 | 43.45 | 153679 | 28.8 | 1 | 2021-02-12 06:06:08.753019810 | 2021 | 2 | 12 | 6 | 6 | 2021-02-12 06:06:05.761730671 | 2.0 | False |
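The `delta` values of 2.0 in the table above sit next to timestamps that are roughly 2.99 s apart. A minimal sketch of why, using synthetic timestamps (not taken from the dataset): `Series.dt.seconds` returns the whole-second component of a timedelta, while `Series.dt.total_seconds()` returns the full duration as a float.

```python
import pandas as pd

# Synthetic timestamps (illustrative only): nominal 3 s cadence with
# a little jitter, followed by one longer gap.
ts = pd.Series(pd.to_datetime([
    "2021-01-01 00:00:00.000",
    "2021-01-01 00:00:02.991",
    "2021-01-01 00:00:05.990",
    "2021-01-01 00:00:41.500",
]))

gap = ts.diff()                             # timedelta between consecutive rows
whole_seconds = gap.dt.seconds              # whole-second component, fraction dropped
elapsed_seconds = gap.dt.total_seconds()    # full duration as a float

print(pd.DataFrame({"seconds": whole_seconds,
                    "total_seconds": elapsed_seconds}))
```

If fractional gaps matter, or if a gap could exceed one day, `total_seconds()` is probably the safer choice, because `.dt.seconds` also leaves out the days component.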
(
    ggplot(airdata) +
    geom_histogram(aes(x="delta"), bins=20)
)
[Figure: histogram of `delta` in seconds, 20 bins]
#range(airdata[airdata["delta"] != 3].min(), airdata[airdata["delta"] != 3].max(), 1)
display(Markdown("Minimum and maximum gaps between consecutive readings:"))
display(Markdown(f"* Minimum: {airdata['delta'].min()} seconds."))
display(Markdown(f"* Maximum: {airdata['delta'].max()} seconds."))
Minimum and maximum gaps between consecutive readings:
These gaps typically have the following causes:
The 2-second minimum does not mean that a reading arrived a full second early. Consecutive timestamps are about 2.99 s apart (see the first rows of the table above), and `.dt.seconds` keeps only the whole-second part of each difference, so those intervals are reported as 2. The slightly short interval comes from the sensor rounding seconds, together with the "job" that stores the data. In addition, our system is not real-time, unlike control systems where events are triggered deterministically in time.
Larger values were caused by restarts: system reboots and power supply failures. These are typically isolated events, as shown below.
airdata[airdata["delta"] > 3]
index | temperature | pressure | humidity | gasResistance | IAQ | iaqAccuracy | datetime | year | month | day | hour | minute | datetime-1 | delta | imputated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21.51 | 777.41 | 44.04 | 152149 | 34.7 | 1 | 2021-02-12 06:05:35.846304417 | 2021 | 2 | 12 | 6 | 5 | 2021-02-12 06:05:29.856916904 | 5.0 | False |
2 | 21.54 | 777.41 | 43.73 | 153259 | 31.5 | 1 | 2021-02-12 06:05:47.812360048 | 2021 | 2 | 12 | 6 | 5 | 2021-02-12 06:05:38.837326527 | 8.0 | False |
11471 | 19.95 | 778.34 | 43.60 | 124814 | 236.8 | 1 | 2021-02-12 15:38:09.454870701 | 2021 | 2 | 12 | 15 | 38 | 2021-02-12 15:37:33.558219671 | 35.0 | False |
11495 | 20.41 | 778.38 | 42.94 | 122095 | 243.2 | 1 | 2021-02-12 15:39:24.238069534 | 2021 | 2 | 12 | 15 | 39 | 2021-02-12 15:39:18.254687548 | 5.0 | False |
194038 | 25.24 | 777.47 | 25.27 | 188409 | 28.7 | 1 | 2021-02-18 23:30:02.871531487 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:23:30.376312494 | 392.0 | False |
711281 | 24.44 | 782.68 | 29.77 | 361856 | 57.8 | 3 | 2021-03-08 21:39:54.148881435 | 2021 | 3 | 8 | 21 | 39 | 2021-03-08 21:39:39.176842928 | 14.0 | False |
711381 | 24.80 | 782.67 | 30.52 | 501364 | 25.0 | 0 | 2021-03-08 21:45:12.908920765 | 2021 | 3 | 8 | 21 | 45 | 2021-03-08 21:44:50.610525370 | 22.0 | False |
711383 | 24.67 | 782.67 | 30.52 | 461738 | 25.0 | 0 | 2021-03-08 21:45:35.030911922 | 2021 | 3 | 8 | 21 | 45 | 2021-03-08 21:45:15.906122684 | 19.0 | False |
1225137 | 25.72 | 781.48 | 25.65 | 499500 | 25.0 | 0 | 2021-03-26 17:03:22.019438267 | 2021 | 3 | 26 | 17 | 3 | 2021-03-26 17:02:30.955729723 | 51.0 | False |
1816217 | 28.23 | 780.46 | 32.44 | 850727 | 25.0 | 0 | 2021-04-16 04:41:09.900250912 | 2021 | 4 | 16 | 4 | 41 | 2021-04-16 04:40:10.764022350 | 59.0 | False |
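Each gap in the table above corresponds to a number of skipped readings. A rough sketch of that arithmetic, assuming the nominal 3-second period stated earlier; the helper name `missing_readings` and the sample gap values are hypothetical, and the input is best computed as a float duration rather than a truncated whole-second value.

```python
import numpy as np

PERIOD_S = 3.0  # nominal sampling period stated in the text


def missing_readings(gap_seconds, period=PERIOD_S):
    """Estimate how many readings were skipped inside each gap.

    A gap of one period means nothing is missing; a gap of about
    n periods means n - 1 readings are missing.
    """
    gaps = np.asarray(gap_seconds, dtype=float)
    n = np.rint(gaps / period) - 1        # round to the nearest whole period
    return np.clip(n, 0, None).astype(int)


# Illustrative gap lengths in seconds (not taken from the dataset).
print(missing_readings([3.0, 6.0, 36.0, 392.5]))
```

Under these assumptions a gap of roughly 392 s would stand for about 130 missing readings.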
(
    ggplot(airdata[airdata["delta"] > 3],
           aes(x="delta")) +
    geom_histogram(bins=10) #+
    #scale_x_discrete(labels=scale, name="delta")
)
[Figure: histogram of `delta` for gaps longer than 3 seconds, 10 bins]
airdata.describe().round(2)
statistic | temperature | pressure | humidity | gasResistance | IAQ | iaqAccuracy | year | month | day | hour | minute | delta |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2068161.00 | 2068161.00 | 2068161.00 | 2068161.00 | 2068161.00 | 2068161.00 | 2068161.0 | 2068161.00 | 2068161.00 | 2068161.00 | 2068161.00 | 2068161.00 |
mean | 24.80 | 780.64 | 30.11 | 432603.16 | 161.24 | 2.64 | 2021.0 | 3.10 | 15.78 | 11.52 | 29.50 | 2.00 |
std | 2.77 | 2.39 | 5.94 | 251980.00 | 72.85 | 0.72 | 0.0 | 0.75 | 8.00 | 6.91 | 17.32 | 0.28 |
min | 16.67 | 773.78 | 7.63 | 76404.00 | 0.00 | 0.00 | 2021.0 | 2.00 | 1.00 | 0.00 | 0.00 | 2.00 |
25% | 22.99 | 778.99 | 26.09 | 240846.00 | 96.80 | 3.00 | 2021.0 | 3.00 | 9.00 | 6.00 | 14.00 | 2.00 |
50% | 24.91 | 780.58 | 30.52 | 396125.00 | 181.00 | 3.00 | 2021.0 | 3.00 | 16.00 | 12.00 | 29.00 | 2.00 |
75% | 26.91 | 782.31 | 34.08 | 534470.00 | 226.40 | 3.00 | 2021.0 | 4.00 | 22.00 | 18.00 | 44.00 | 2.00 |
max | 30.62 | 787.66 | 59.19 | 2434389.00 | 500.00 | 3.00 | 2021.0 | 4.00 | 31.00 | 23.00 | 59.00 | 392.00 |
airdata[airdata.datetime == "2021-02-18 23:30:02.871531487"].round(2)
index | temperature | pressure | humidity | gasResistance | IAQ | iaqAccuracy | datetime | year | month | day | hour | minute | datetime-1 | delta | imputated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
194038 | 25.24 | 777.47 | 25.27 | 188409 | 28.7 | 1 | 2021-02-18 23:30:02.871531487 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:23:30.376312494 | 392.0 | False |
airdata[(airdata.datetime >= "2021-02-18 23:22") & (airdata.datetime <= "2021-02-18 23:30:03")]
index | temperature | pressure | humidity | gasResistance | IAQ | iaqAccuracy | datetime | year | month | day | hour | minute | datetime-1 | delta | imputated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
194007 | 25.67 | 777.40 | 24.82 | 189056 | 35.0 | 3 | 2021-02-18 23:22:00.555237055 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:21:57.560976982 | 2.0 | False |
194008 | 25.66 | 777.43 | 24.83 | 187661 | 35.9 | 3 | 2021-02-18 23:22:03.552131891 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:00.555237055 | 2.0 | False |
194009 | 25.66 | 777.41 | 24.83 | 187661 | 36.4 | 3 | 2021-02-18 23:22:06.543661118 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:03.552131891 | 2.0 | False |
194010 | 25.64 | 777.41 | 24.82 | 188088 | 36.5 | 3 | 2021-02-18 23:22:09.544224024 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:06.543661118 | 3.0 | False |
194011 | 25.63 | 777.40 | 24.82 | 189925 | 34.8 | 3 | 2021-02-18 23:22:12.538087845 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:09.544224024 | 2.0 | False |
194012 | 25.64 | 777.41 | 24.81 | 189272 | 34.3 | 3 | 2021-02-18 23:22:15.531872988 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:12.538087845 | 2.0 | False |
194013 | 25.62 | 777.43 | 24.83 | 188409 | 34.8 | 3 | 2021-02-18 23:22:18.525767088 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:15.531872988 | 2.0 | False |
194014 | 25.61 | 777.38 | 24.90 | 187874 | 35.6 | 3 | 2021-02-18 23:22:21.519450902 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:18.525767088 | 2.0 | False |
194015 | 25.63 | 777.41 | 24.95 | 187342 | 36.6 | 3 | 2021-02-18 23:22:24.513052940 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:21.519450902 | 2.0 | False |
194016 | 25.64 | 777.41 | 24.97 | 187129 | 37.5 | 3 | 2021-02-18 23:22:27.506954193 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:24.513052940 | 2.0 | False |
194017 | 25.64 | 777.40 | 24.94 | 187448 | 37.8 | 3 | 2021-02-18 23:22:30.500771999 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:27.506954193 | 2.0 | False |
194018 | 25.64 | 777.43 | 24.90 | 189707 | 35.9 | 3 | 2021-02-18 23:22:33.488167286 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:30.500771999 | 2.0 | False |
194019 | 25.63 | 777.41 | 24.91 | 189489 | 34.7 | 3 | 2021-02-18 23:22:36.488412857 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:33.488167286 | 3.0 | False |
194020 | 25.63 | 777.41 | 24.90 | 187874 | 35.5 | 3 | 2021-02-18 23:22:39.481906414 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:36.488412857 | 2.0 | False |
194021 | 25.63 | 777.41 | 24.89 | 188302 | 35.7 | 3 | 2021-02-18 23:22:42.475703716 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:39.481906414 | 2.0 | False |
194022 | 25.61 | 777.40 | 24.91 | 188409 | 35.7 | 3 | 2021-02-18 23:22:45.463096142 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:42.475703716 | 2.0 | False |
194023 | 25.59 | 777.40 | 24.94 | 189381 | 34.8 | 3 | 2021-02-18 23:22:48.457271814 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:45.463096142 | 2.0 | False |
194024 | 25.60 | 777.41 | 25.01 | 188732 | 34.7 | 3 | 2021-02-18 23:22:51.451401234 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:48.457271814 | 2.0 | False |
194025 | 25.61 | 777.40 | 25.05 | 186496 | 36.8 | 3 | 2021-02-18 23:22:54.445565462 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:51.451401234 | 2.0 | False |
194026 | 25.64 | 777.43 | 25.01 | 186918 | 37.7 | 3 | 2021-02-18 23:22:57.442430258 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:54.445565462 | 2.0 | False |
194027 | 25.67 | 777.40 | 25.01 | 187024 | 38.2 | 3 | 2021-02-18 23:23:00.439792395 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:22:57.442430258 | 2.0 | False |
194028 | 25.68 | 777.41 | 24.95 | 187661 | 37.9 | 3 | 2021-02-18 23:23:03.433859587 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:00.439792395 | 2.0 | False |
194029 | 25.68 | 777.41 | 24.93 | 188409 | 36.9 | 3 | 2021-02-18 23:23:06.427474499 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:03.433859587 | 2.0 | False |
194030 | 25.69 | 777.40 | 24.90 | 187024 | 37.6 | 3 | 2021-02-18 23:23:09.421072960 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:06.427474499 | 2.0 | False |
194031 | 25.70 | 777.43 | 24.89 | 186812 | 38.3 | 3 | 2021-02-18 23:23:12.414887190 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:09.421072960 | 2.0 | False |
194032 | 25.70 | 777.43 | 24.91 | 188088 | 37.5 | 3 | 2021-02-18 23:23:15.408778191 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:12.414887190 | 2.0 | False |
194033 | 25.71 | 777.43 | 24.92 | 187767 | 37.2 | 3 | 2021-02-18 23:23:18.402513504 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:15.408778191 | 2.0 | False |
194034 | 25.69 | 777.43 | 24.95 | 186181 | 38.5 | 3 | 2021-02-18 23:23:21.390101194 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:18.402513504 | 2.0 | False |
194035 | 25.69 | 777.43 | 24.89 | 187235 | 38.4 | 3 | 2021-02-18 23:23:24.390413523 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:21.390101194 | 3.0 | False |
194036 | 25.69 | 777.43 | 24.85 | 189272 | 36.4 | 3 | 2021-02-18 23:23:27.384147644 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:24.390413523 | 2.0 | False |
194037 | 25.71 | 777.41 | 24.77 | 187661 | 36.7 | 3 | 2021-02-18 23:23:30.376312494 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:27.384147644 | 2.0 | False |
194038 | 25.24 | 777.47 | 25.27 | 188409 | 28.7 | 1 | 2021-02-18 23:30:02.871531487 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:23:30.376312494 | 392.0 | False |
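A time window like the one above can be selected with a single boolean mask, which avoids chaining two `[...]` selections. One possible way, sketched here on a small synthetic frame (the data and the `start`/`end` names are illustrative), is `Series.between`, which includes both endpoints by default.

```python
import pandas as pd

# Synthetic frame with one row per minute (illustrative only).
df = pd.DataFrame({
    "datetime": pd.date_range("2021-02-18 23:20:00", periods=10, freq="1min"),
    "temperature": range(10),
})

start = pd.Timestamp("2021-02-18 23:22")
end = pd.Timestamp("2021-02-18 23:25")

# One mask, applied once; both endpoints are included.
window = df[df["datetime"].between(start, end)]
print(window)
```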
airdata[((airdata.datetime <= "2021-02-18 23:30:03"))].iloc[-15:]
index | temperature | pressure | humidity | gasResistance | IAQ | iaqAccuracy | datetime | year | month | day | hour | minute | datetime-1 | delta | imputated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
194024 | 25.60 | 777.41 | 25.01 | 188732 | 34.7 | 3 | 2021-02-18 23:22:51.451401234 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:48.457271814 | 2.0 | False |
194025 | 25.61 | 777.40 | 25.05 | 186496 | 36.8 | 3 | 2021-02-18 23:22:54.445565462 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:51.451401234 | 2.0 | False |
194026 | 25.64 | 777.43 | 25.01 | 186918 | 37.7 | 3 | 2021-02-18 23:22:57.442430258 | 2021 | 2 | 18 | 23 | 22 | 2021-02-18 23:22:54.445565462 | 2.0 | False |
194027 | 25.67 | 777.40 | 25.01 | 187024 | 38.2 | 3 | 2021-02-18 23:23:00.439792395 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:22:57.442430258 | 2.0 | False |
194028 | 25.68 | 777.41 | 24.95 | 187661 | 37.9 | 3 | 2021-02-18 23:23:03.433859587 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:00.439792395 | 2.0 | False |
194029 | 25.68 | 777.41 | 24.93 | 188409 | 36.9 | 3 | 2021-02-18 23:23:06.427474499 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:03.433859587 | 2.0 | False |
194030 | 25.69 | 777.40 | 24.90 | 187024 | 37.6 | 3 | 2021-02-18 23:23:09.421072960 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:06.427474499 | 2.0 | False |
194031 | 25.70 | 777.43 | 24.89 | 186812 | 38.3 | 3 | 2021-02-18 23:23:12.414887190 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:09.421072960 | 2.0 | False |
194032 | 25.70 | 777.43 | 24.91 | 188088 | 37.5 | 3 | 2021-02-18 23:23:15.408778191 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:12.414887190 | 2.0 | False |
194033 | 25.71 | 777.43 | 24.92 | 187767 | 37.2 | 3 | 2021-02-18 23:23:18.402513504 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:15.408778191 | 2.0 | False |
194034 | 25.69 | 777.43 | 24.95 | 186181 | 38.5 | 3 | 2021-02-18 23:23:21.390101194 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:18.402513504 | 2.0 | False |
194035 | 25.69 | 777.43 | 24.89 | 187235 | 38.4 | 3 | 2021-02-18 23:23:24.390413523 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:21.390101194 | 3.0 | False |
194036 | 25.69 | 777.43 | 24.85 | 189272 | 36.4 | 3 | 2021-02-18 23:23:27.384147644 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:24.390413523 | 2.0 | False |
194037 | 25.71 | 777.41 | 24.77 | 187661 | 36.7 | 3 | 2021-02-18 23:23:30.376312494 | 2021 | 2 | 18 | 23 | 23 | 2021-02-18 23:23:27.384147644 | 2.0 | False |
194038 | 25.24 | 777.47 | 25.27 | 188409 | 28.7 | 1 | 2021-02-18 23:30:02.871531487 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:23:30.376312494 | 392.0 | False |
airdata[((airdata.datetime >= "2021-02-18 23:30:02"))].iloc[:15]
index | temperature | pressure | humidity | gasResistance | IAQ | iaqAccuracy | datetime | year | month | day | hour | minute | datetime-1 | delta | imputated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
194038 | 25.24 | 777.47 | 25.27 | 188409 | 28.7 | 1 | 2021-02-18 23:30:02.871531487 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:23:30.376312494 | 392.0 | False |
194039 | 25.23 | 777.47 | 25.21 | 187342 | 31.0 | 1 | 2021-02-18 23:30:05.860353708 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:02.871531487 | 2.0 | False |
194040 | 25.24 | 777.49 | 25.16 | 187342 | 32.7 | 1 | 2021-02-18 23:30:08.853756189 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:05.860353708 | 2.0 | False |
194041 | 25.26 | 777.49 | 25.11 | 188840 | 31.0 | 1 | 2021-02-18 23:30:11.847195148 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:08.853756189 | 2.0 | False |
194042 | 25.28 | 777.49 | 25.05 | 189056 | 29.5 | 1 | 2021-02-18 23:30:14.840431452 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:11.847195148 | 2.0 | False |
194043 | 25.31 | 777.47 | 25.03 | 188088 | 30.3 | 1 | 2021-02-18 23:30:17.833710194 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:14.840431452 | 2.0 | False |
194044 | 25.33 | 777.47 | 24.98 | 189707 | 27.8 | 1 | 2021-02-18 23:30:20.826932430 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:17.833710194 | 2.0 | False |
194045 | 25.36 | 777.47 | 24.94 | 188840 | 27.8 | 1 | 2021-02-18 23:30:23.820370197 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:20.826932430 | 2.0 | False |
194046 | 25.38 | 777.45 | 24.98 | 186391 | 32.3 | 1 | 2021-02-18 23:30:26.814028502 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:23.820370197 | 2.0 | False |
194047 | 25.39 | 777.47 | 25.00 | 187554 | 33.2 | 1 | 2021-02-18 23:30:29.807134628 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:26.814028502 | 2.0 | False |
194048 | 25.41 | 777.47 | 25.02 | 188195 | 32.5 | 1 | 2021-02-18 23:30:32.800416470 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:29.807134628 | 2.0 | False |
194049 | 25.43 | 777.47 | 24.97 | 188088 | 32.2 | 1 | 2021-02-18 23:30:35.794191837 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:32.800416470 | 2.0 | False |
194050 | 25.44 | 777.47 | 24.99 | 187235 | 33.5 | 1 | 2021-02-18 23:30:38.786831379 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:35.794191837 | 2.0 | False |
194051 | 25.45 | 777.45 | 25.01 | 188302 | 32.4 | 1 | 2021-02-18 23:30:41.780043602 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:38.786831379 | 2.0 | False |
194052 | 25.46 | 777.49 | 24.98 | 187767 | 32.5 | 1 | 2021-02-18 23:30:44.773188829 | 2021 | 2 | 18 | 23 | 30 | 2021-02-18 23:30:41.780043602 | 2.0 | False |
(
    ggplot(airdata[(airdata.datetime >= "2021-02-18 23:10") & (airdata.datetime <= "2021-02-18 23:35")]) +
    geom_point(aes(x="datetime", y="temperature")) +
    theme(axis_text_x=element_text(angle=45))
)
[Figure: `temperature` versus `datetime`, 2021-02-18 23:10 to 23:35]
(
    ggplot(airdata[(airdata.datetime >= "2021-02-18 23:10") & (airdata.datetime <= "2021-02-18 23:35")]) +
    geom_point(aes(x="datetime", y="pressure")) +
    theme(axis_text_x=element_text(angle=45))
)
[Figure: `pressure` versus `datetime`, 2021-02-18 23:10 to 23:35]
(
    ggplot(airdata[(airdata.datetime >= "2021-02-18 23:10") & (airdata.datetime <= "2021-02-18 23:35")]) +
    geom_point(aes(x="datetime", y="humidity")) +
    theme(axis_text_x=element_text(angle=45))
)
[Figure: `humidity` versus `datetime`, 2021-02-18 23:10 to 23:35]
(
    ggplot(airdata[(airdata.datetime >= "2021-02-18 23:10") & (airdata.datetime <= "2021-02-18 23:35")]) +
    geom_point(aes(x="datetime", y="gasResistance")) +
    theme(axis_text_x=element_text(angle=45))
)
[Figure: `gasResistance` versus `datetime`, 2021-02-18 23:10 to 23:35]
(
    ggplot(airdata[(airdata.datetime >= "2021-02-18 23:10") & (airdata.datetime <= "2021-02-18 23:35")]) +
    geom_point(aes(x="datetime", y="IAQ")) +
    theme(axis_text_x=element_text(angle=45))
)
[Figure: `IAQ` versus `datetime`, 2021-02-18 23:10 to 23:35]
Markdown(f"One option is to discard the earlier data and keep only \
{airdata[airdata.datetime >= '2021-02-18 23:35:02.871531487'].shape[0]:3,} \
observations out of the total ({airdata.shape[0]:3,}).")
One option is to discard the earlier data and keep only 1,874,022 observations out of the total (2,068,161).
Another option is to impute the data.
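As a point of comparison for the imputation option, a common baseline is to place the readings on a regular 3-second grid and fill the empty slots by linear interpolation in time. The sketch below is not the method used in this notebook; it runs on synthetic data and reuses the `imputated` column name only for consistency.

```python
import numpy as np
import pandas as pd

# Synthetic readings every 3 s (illustrative only, not sensor data).
times = pd.date_range("2021-01-01 00:00:00", periods=20, freq="3s")
df = pd.DataFrame({"datetime": times,
                   "temperature": np.linspace(20.0, 21.9, 20)})
df = df.drop(index=range(8, 13)).reset_index(drop=True)  # remove 5 readings

# Put the data on a regular 3 s grid; empty slots become NaN rows.
grid = df.set_index("datetime").resample("3s").mean()
grid["imputated"] = grid["temperature"].isna()            # flag rows to fill

# Fill the NaN rows by linear interpolation along the time index.
grid["temperature"] = grid["temperature"].interpolate(method="time")
print(grid.iloc[6:15])
```

With real timestamps, which jitter around the 3-second mark, the binning step would shift readings slightly onto the grid, so this is best read as a baseline rather than a drop-in replacement.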
%%time
def interpolate_missing(df, idx, seconds=3):
    """
    Fill the gap that ends at row `idx` with interpolated readings.

    Values are interpolated linearly between the reading before the
    gap and the reading at `idx`, with Gaussian noise added.

    Examples:
        airdata2 = interpolate_missing(airdata, idx = airdata[airdata.delta > 100].index[0])
        airdata2 = interpolate_missing(airdata, 194044)
    """
    np.random.seed(175904)
    #idx = df[df.delta > 100].index[0]
    #df_prev = df.iloc[idx-100:idx].reset_index(drop=True)
    # rows preceding the gap (plus row `idx`), used to estimate the noise level
    df_prev = df.loc[idx-int(df.loc[idx]["delta"]):idx].reset_index(drop=True)
    #df_after = df.loc[idx:idx+100].reset_index()
    #display(df_prev)
    a = df_prev.iloc[-2]  # last reading before the gap
    b = df_prev.iloc[-1]  # first reading after the gap (row `idx`)
    offset3s = pd.offsets.Second(seconds)    # first imputed timestamp: one period after `a`
    offset2s = pd.offsets.Second(seconds-2)  # end bound: (seconds - 2) s before `b`
    out = {}
    # `closed` was replaced by `inclusive` in pandas >= 1.4
    out["datetime"] = pd.date_range(a["datetime"] + offset3s,
                                    b["datetime"] - offset2s,
                                    freq='3s', closed='left')
    out["datetime"] = out["datetime"].set_names("datetime")
    for v in ["temperature", "pressure", "humidity", "gasResistance", "IAQ"]:
        i = [i+1 for i, d in enumerate(out["datetime"])]
        m = (b[v] - a[v])/len(out["datetime"])  # slope per imputed step
        sd = 0.7*np.std(df_prev[v])
        # zero-mean Gaussian noise; np.random.normal takes (loc, scale, size)
        rnds = np.random.normal(0.0, sd, len(out["datetime"]))
        #rnds = np.random.uniform(-2*np.pi, 2*np.pi, len(out["datetime"]))
        #rnds = np.cos(rnds) * sd
        out[v] = [m*j + a[v] + rnds[j-1] for j in i]
        #out[v] = [m*j b[v] + rnds[j-1] for j in i]
    out["iaqAccuracy"] = 1
    idf = pd.DataFrame(out)
    reorder_columns = [col for col in out.keys() if col != 'datetime']
    reorder_columns.append("datetime")
    idf = idf.reindex(columns=reorder_columns)
    #print(reorder_columns)
    idf["year"] = [dt.year for dt in idf["datetime"]]
    idf["month"] = [dt.month for dt in idf["datetime"]]
    idf["day"] = [dt.day for dt in idf["datetime"]]
    idf["hour"] = [dt.hour for dt in idf["datetime"]]
    idf["minute"] = [dt.minute for dt in idf["datetime"]]
    idf["imputated"] = True
    # original dataframe
    #df["imputated"] = False
    idf = pd.concat([df, idf])
    idf.sort_values("datetime", inplace=True)
    idf.reset_index(inplace=True, drop=True)
    # recompute the gaps now that the imputed rows are in place
    idf["datetime-1"] = idf["datetime"].shift(1)
    idf["delta"] = idf["datetime"] - idf["datetime-1"]
    idf["delta"] = idf["delta"].dt.seconds
    #airdata = airdata.assign(delta=lambda x: x["datetime"] - x["datetime-1"])
    #idf["delta"] = [dt.seconds for dt in idf.delta]
    #out["imputated"] = True)
    #display(idf)
    #display(df_prev.iloc[[0, -2, -1]])
    #display(df.loc[[idx]])
    return idf

# process the gaps from last to first, so that the indices of earlier
# gaps stay valid after rows are inserted
imputation_list = [x for x in reversed(airdata.delta[airdata.delta > 10].index)]
airdata2 = airdata.copy()
for x in imputation_list:
    airdata2 = interpolate_missing(airdata2, x)
display(Markdown("Table of remaining gaps:"))
display(airdata2[airdata2.delta > 3])
display(Markdown("Note that all of these gaps are very small (under 10 seconds)."))
Table of remaining gaps:
index | temperature | pressure | humidity | gasResistance | IAQ | iaqAccuracy | datetime | year | month | day | hour | minute | datetime-1 | delta | imputated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 21.54 | 777.41 | 43.73 | 153259.0 | 31.5 | 1 | 2021-02-12 06:05:47.812360048 | 2021 | 2 | 12 | 6 | 5 | 2021-02-12 06:05:38.837326527 | 8.0 | False |
11506 | 20.41 | 778.38 | 42.94 | 122095.0 | 243.2 | 1 | 2021-02-12 15:39:24.238069534 | 2021 | 2 | 12 | 15 | 39 | 2021-02-12 15:39:18.254687548 | 5.0 | False |
Note that all of these gaps are very small (under 10 seconds).
CPU times: user 4.02 s, sys: 1.83 s, total: 5.85 s Wall time: 5.85 s
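The core of the interpolation can be isolated as a linear ramp between the readings on either side of a gap, plus random noise. The stripped-down sketch below assumes zero-mean Gaussian noise; the function name `ramp_with_noise` is hypothetical, the two endpoint temperatures are the ones around the 392-second gap shown earlier, and `n=5` is chosen only to keep the output short.

```python
import numpy as np


def ramp_with_noise(a, b, n, sd, rng):
    """Return n values stepping linearly from a towards b, plus Gaussian noise."""
    steps = np.arange(1, n + 1)
    slope = (b - a) / n                       # change per imputed step
    noise = rng.normal(loc=0.0, scale=sd, size=n)
    return a + slope * steps + noise


rng = np.random.default_rng(175904)

# sd=0.0 switches the noise off so the underlying ramp is visible.
filled = ramp_with_noise(a=25.71, b=25.24, n=5, sd=0.0, rng=rng)
print(filled.round(3))
```

With this slope definition the last imputed value lands on `b` itself (before noise); dividing by `n + 1` instead would leave one step between the last imputed value and `b`.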
#airdata2[airdata2.imputated]
(
    ggplot(airdata2[(airdata2.datetime >= "2021-02-18 23:10") & (airdata2.datetime <= "2021-02-18 23:35")]) +
    geom_point(aes(x="datetime", y="temperature", color="imputated")) +
    theme(axis_text_x=element_text(angle=45))
)
[Figure: `temperature` versus `datetime`, 2021-02-18 23:10 to 23:35, colored by `imputated`]
#airdata2[airdata2.imputated]
(
    ggplot(airdata2[(airdata2.datetime >= "2021-02-18 23:10") & (airdata2.datetime <= "2021-02-18 23:35")]) +
    geom_point(aes(x="datetime", y="gasResistance", color="imputated")) +
    theme(axis_text_x=element_text(angle=45))
)
[Figure: `gasResistance` versus `datetime`, 2021-02-18 23:10 to 23:35, colored by `imputated`]
#airdata2[airdata2.imputated]
(
    ggplot(airdata2[(airdata2.datetime >= "2021-02-18 23:10") & (airdata2.datetime <= "2021-02-18 23:35")]) +
    geom_point(aes(x="datetime", y="IAQ", color="imputated")) +
    theme(axis_text_x=element_text(angle=45))
)
[Figure: `IAQ` versus `datetime`, 2021-02-18 23:10 to 23:35, colored by `imputated`]
#airdata2[airdata2.imputated]
(
    ggplot(airdata2[(airdata2.datetime >= "2021-02-18 23:10") & (airdata2.datetime <= "2021-02-18 23:35")]) +
    geom_point(aes(x="datetime", y="humidity", color="imputated")) +
    theme(axis_text_x=element_text(angle=45))
)
[Figure: `humidity` versus `datetime`, 2021-02-18 23:10 to 23:35, colored by `imputated`]
# save the dataframe for use
# in the sequential neural networks
airdata2.to_pickle("data/airdata/air-imputated.pickle.gz")
airdata2 = pd.read_pickle("data/airdata/air-imputated.pickle.gz")
def show_heatmap(data):
    # absolute correlations, so the sign of each relationship is not shown
    plt.matshow(data.corr().abs())
    plt.xticks(range(data.shape[1]), data.columns, fontsize=14, rotation=90)
    plt.gca().xaxis.tick_bottom()
    plt.yticks(range(data.shape[1]), data.columns, fontsize=14)
    cb = plt.colorbar()
    cb.ax.tick_params(labelsize=14)
    plt.title("Feature Correlation Heatmap", fontsize=14)
    plt.show()

airdata3 = airdata2[["temperature", "pressure", "humidity",
                     "gasResistance", "IAQ", "day", "hour", "minute"]]
show_heatmap(airdata3)
airdata3.corr().abs().round(2)
feature | temperature | pressure | humidity | gasResistance | IAQ | day | hour | minute |
---|---|---|---|---|---|---|---|---|
temperature | 1.00 | 0.35 | 0.59 | 0.45 | 0.11 | 0.11 | 0.11 | 0.0 |
pressure | 0.35 | 1.00 | 0.44 | 0.25 | 0.35 | 0.18 | 0.19 | 0.0 |
humidity | 0.59 | 0.44 | 1.00 | 0.51 | 0.33 | 0.31 | 0.08 | 0.0 |
gasResistance | 0.45 | 0.25 | 0.51 | 1.00 | 0.29 | 0.11 | 0.20 | 0.0 |
IAQ | 0.11 | 0.35 | 0.33 | 0.29 | 1.00 | 0.05 | 0.27 | 0.0 |
day | 0.11 | 0.18 | 0.31 | 0.11 | 0.05 | 1.00 | 0.00 | 0.0 |
hour | 0.11 | 0.19 | 0.08 | 0.20 | 0.27 | 0.00 | 1.00 | 0.0 |
minute | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.0 |