SINAICA Imputations

SINAICA Data.

Nearby Air Quality Monitoring Stations

The stations closest to "Camarones", the station nearest to our sensor, are shown here.

Map of Nearby Stations

Camarones Air Quality Monitoring Station

Imputation: Missing Data from the Air Quality Monitoring Stations.

Some of the missing observations are caused by maintenance on the monitoring systems, so we can try to fill in the missing data using nearby government sensors. We then propose to evaluate how well the imputations work.
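One common way to evaluate an imputation is to hide values that were actually observed, impute them, and compare the result with the truth. The sketch below illustrates that idea with mean imputation as a stand-in; the readings, the held-out positions, and all variable names are hypothetical, and this is not necessarily the procedure used in this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with no gaps (illustrative values only).
observed = pd.Series([40.0, 42.0, 38.0, 44.0, 41.0, 39.0])

# Hide a few known values so the imputation can be scored against them.
held_out = [1, 4]
masked = observed.copy()
masked.iloc[held_out] = np.nan

# Stand-in imputation: fill the hidden values with the mean of the rest.
imputed = masked.fillna(masked.mean())

# Root mean squared error computed on the hidden positions only.
errors = imputed.iloc[held_out] - observed.iloc[held_out]
rmse = float(np.sqrt((errors ** 2).mean()))
print(f"RMSE on held-out values: {rmse:.3f}")
```

The same masking scheme can score any other imputation method by swapping the `fillna` line.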

Missing Data in Camarones

"Camarones", the closest station, has missing data in every variable.
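A minimal sketch of how missing values per variable can be counted with pandas, and how the complete observations can be separated out. The dataframe name, column names, and values are assumptions made for illustration, not the actual SINAICA data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings for one station; NaN marks a missing value.
camarones = pd.DataFrame(
    {
        "PM10": [41.0, np.nan, 38.0, 45.0],
        "PM2.5": [19.0, 21.0, np.nan, np.nan],
    },
    index=pd.date_range("2021-01-01 00:00", periods=4, freq="h"),
)

# Number of missing observations in each variable.
missing_per_variable = camarones.isna().sum()

# Rows where every variable was reported (complete observations).
complete_rows = camarones.dropna()

print(missing_per_variable)
print(len(complete_rows), "complete observation(s)")
```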

Complete Observations in Camarones.

Train and Test Split

Data Distribution

PM10

These are the Air Quality Monitoring Stations that measure PM10 pollutant.

Comparing Stations

PM2.5

Linear Regression

We remove incomplete observations before fitting the regression.
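The sketch below shows one way this step could look: fit a straight line on the rows where both stations reported a value, then use it to fill the gaps. The pairing of Merced as predictor and Camarones as target, the column names, and the numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical paired PM10 readings; the values are illustrative only.
df = pd.DataFrame(
    {
        "merced": [30.0, 40.0, 50.0, 60.0, 70.0],
        "camarones": [32.0, np.nan, 52.0, 62.0, np.nan],
    }
)

# Fit only on rows where both stations reported a value.
complete = df.dropna()
slope, intercept = np.polyfit(complete["merced"], complete["camarones"], deg=1)

# Predict Camarones from Merced, then use the prediction only where
# the Camarones reading is missing; observed values stay untouched.
predicted = slope * df["merced"] + intercept
df["camarones_imputed"] = df["camarones"].fillna(predicted)

print(df)
```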

Lasso

Mean

Generalized Linear Models: GLM

K-Nearest Neighbors

Evaluation

Early Conclusions

Because all further processing must treat the data sequentially, i.e. as a time series, we found linear interpolation to be adequate.

The next section describes it in detail.

Interpolation

In the EDA and in the previous sections we found that Merced has data similar to that of Camarones and fewer incomplete observations (missing data).

Camarones has more incomplete observations:

Using those results we create a new dataframe with those imputations.

The dataframe has the following columns (variables):

We lagged the timestamps to locate the missing hours and found these gaps in the timeline:

These are the missing gaps:
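A lag-based gap search can be sketched as follows: shift the timestamps by one row, subtract, and keep the rows where the step exceeds one hour. The timestamps and the column names `gap_start`, `gap_end`, and `missing_hours` are hypothetical, and the data is assumed to be hourly.

```python
import pandas as pd

# Hypothetical timestamps of the rows present in the data (hourly expected).
timestamps = pd.Series(
    pd.to_datetime(
        [
            "2021-01-01 00:00",
            "2021-01-01 01:00",
            "2021-01-01 04:00",
            "2021-01-01 05:00",
        ]
    )
)

# Lag the timestamps by one row and measure the step between neighbors.
previous = timestamps.shift(1)
step = timestamps - previous

# A step longer than one hour means some hours are absent in between.
gaps = pd.DataFrame(
    {
        "gap_start": previous,
        "gap_end": timestamps,
        "missing_hours": step / pd.Timedelta(hours=1) - 1,
    }
)
gaps = gaps[step > pd.Timedelta(hours=1)]

print(gaps)
```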

We perform an interpolation, leaving the data as follows:
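As a rough illustration of linear interpolation on an hourly series, the sketch below first reindexes to a complete hourly timeline so that the absent hours appear as NaN, then fills them. The series name, timestamps, and values are assumptions; the actual dataframe may be structured differently.

```python
import pandas as pd

# Hypothetical PM10 series with the hours 02:00 and 03:00 absent.
index = pd.to_datetime(
    ["2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 04:00"]
)
pm10 = pd.Series([10.0, 20.0, 50.0], index=index)

# Rebuild the full hourly timeline; absent hours become NaN.
full_index = pd.date_range(pm10.index.min(), pm10.index.max(), freq="h")
hourly = pm10.reindex(full_index)

# Linear interpolation weighted by the time elapsed between observations.
filled = hourly.interpolate(method="time")

print(filled)
```

Interpolating on the reindexed series, rather than on the original rows, is what makes the missing hours show up as rows to be filled.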

We have successfully imputed the entire dataframe.

We recognize this might not be the best method; other imputation methods from time series modeling remain to be explored.

References