15 May, 2022

Ozone Data Analysis Using Ozone Dataset

1 Data exploration
Several variables in this dataset are skewed.

We also expect collinearity among the predictors, since they are related daily weather measurements.

Fitting a generic linear model to the data can also help reveal collinearity and skewness, which we discuss in more detail below. Note the collinearity in the Residuals vs Fitted plot and the skewness in the Residuals vs Leverage plot, both shown in the figure.

2 Analysis
In the ozone dataset, HourAverageMax and visibility are skewed to the right, and pressure500Height is also skewed. Some variables are collinear, such as tempSandburg and inversionBaseTemp. This is expected, since the data was collected daily and weather patterns tend to persist for a while.
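The two properties described above can be checked numerically. The following is a minimal sketch, assuming sample skewness (the third standardized moment) and the Pearson correlation as the measures; the post does not say which measures were used. The numbers are made up for illustration and are not taken from the ozone dataset.

```python
import math

def skewness(values):
    """Sample skewness (third standardized moment); > 0 means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def pearson(xs, ys):
    """Pearson correlation; values near +1 or -1 suggest collinearity."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up illustrative values, NOT from the ozone dataset.
visibility = [20, 25, 30, 40, 60, 120, 300]   # long right tail
temp_a = [40, 52, 61, 66, 70, 75, 83]         # two temperature-like series
temp_b = [36, 50, 58, 64, 71, 74, 85]

vis_skew = skewness(visibility)    # positive: right-skewed
temp_corr = pearson(temp_a, temp_b)  # close to 1: collinear pair
```

A positive skewness flags a right-skewed variable, and a correlation close to 1 between two predictors flags a collinear pair.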




Because of this, penalized regression and decision trees were considered as candidate models for predicting the response. When dealing with collinearity, penalized regression is an alternative to classical subset selection. Decision trees, in turn, are resistant to skewness. To decrease the correlation between trees, a random forest considers only a random subset of variables at each node, which may improve its predictions.
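To make the penalized regression idea concrete, here is a minimal sketch of the elastic net penalty in the parameterization documented for glmnet, lambda * ((1 - alpha) / 2 * sum(b^2) + alpha * sum(|b|)). The alpha and lambda values are the ones reported in this post; the coefficient vector is hypothetical and chosen only for illustration.

```python
def elastic_net_penalty(coefs, alpha, lam):
    """Elastic net penalty: alpha = 1 is the lasso, alpha = 0 is ridge."""
    l1 = sum(abs(b) for b in coefs)
    l2_sq = sum(b * b for b in coefs)
    return lam * ((1 - alpha) / 2 * l2_sq + alpha * l1)

coefs = [1.0, -2.0]            # hypothetical coefficients, for illustration only
alpha, lam = 0.06, 0.062662    # tuning values reported in this post

penalty = elastic_net_penalty(coefs, alpha, lam)

# Share of the penalty that comes from the ridge (squared) term.
ridge_share = lam * (1 - alpha) / 2 * sum(b * b for b in coefs) / penalty
```

With alpha this close to 0, most of the penalty comes from the ridge term, which shrinks correlated coefficients together rather than dropping one of them.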
The models were tuned using ten-fold cross-validation. The optimal settings for penalized regression were determined to be alpha = 0.06 and lambda = 0.062662.
The boosted trees and the bagged (random forest) trees were also tuned using ten-fold cross-validation. For the boosted models, shrinkage and interaction.depth were tuned. For the random forest model, the number of predictors considered at each split, mtry, was tuned.
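The tuning procedure described above can be sketched as a grid search scored by ten-fold cross-validation. This is an illustration only: a one-dimensional ridge slope stands in for the actual models, and the data, the grid, and all function names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_ridge_slope(xs, ys, lam):
    """1-D ridge through the origin: slope = sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def cv_mse(xs, ys, lam, k=10, seed=0):
    """Mean squared error over all held-out points of a k-fold split."""
    sq_err = 0.0
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope = fit_ridge_slope([xs[i] for i in train],
                                [ys[i] for i in train], lam)
        sq_err += sum((ys[i] - slope * xs[i]) ** 2 for i in held_out)
    return sq_err / len(xs)

def tune(xs, ys, grid, k=10):
    """Return the grid value with the lowest cross-validated error."""
    return min(grid, key=lambda lam: cv_mse(xs, ys, lam, k))

# Synthetic data, for illustration only.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = tune(xs, ys, grid)
```

The same loop applies to any tuning parameter, such as shrinkage, interaction.depth, or mtry: fit on nine folds, score on the tenth, and keep the setting with the lowest average error.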
Double cross-validation (double CV) was used to evaluate the following models:
1. Penalized regression using the glmnet function with alpha = 0.06 and lambda = 0.062662
2. Boosted decision trees using the gbm function with shrinkage = 0.001 and interaction.depth = 4
3. Random forest using the randomForest function with mtry = 2 (two predictors considered at each split)
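A double CV comparison of candidate models can be sketched as follows. This is a minimal illustration under stated assumptions: "double CV" is taken to mean nested cross-validation, where an inner loop picks the best candidate and an outer loop scores that choice on data the inner loop never saw. The two candidates (a mean predictor and a straight line) are hypothetical stand-ins for the three models listed above, and the data is synthetic.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Candidate 1: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Candidate 2: ordinary least squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

CANDIDATES = {"mean": fit_mean, "line": fit_line}  # hypothetical stand-ins

def cv_mse(fit, xs, ys, k, seed):
    """Cross-validated mean squared error of one candidate."""
    sq_err = 0.0
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((ys[i] - model(xs[i])) ** 2 for i in held_out)
    return sq_err / len(xs)

def double_cv(xs, ys, outer_k=10, inner_k=10):
    """Outer loop scores the model that the inner loop selects."""
    chosen, sq_err = [], 0.0
    for held_out in k_fold_indices(len(xs), outer_k, seed=0):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        # Inner CV sees only the outer training data.
        best = min(CANDIDATES,
                   key=lambda name: cv_mse(CANDIDATES[name], tx, ty, inner_k, seed=1))
        model = CANDIDATES[best](tx, ty)
        chosen.append(best)
        sq_err += sum((ys[i] - model(xs[i])) ** 2 for i in held_out)
    return chosen, sq_err / len(xs)

# Synthetic data, for illustration only.
rng = random.Random(2)
xs = [rng.uniform(0, 10) for _ in range(120)]
ys = [3.0 + 0.5 * x + rng.gauss(0, 0.5) for x in xs]

chosen, outer_mse = double_cv(xs, ys)
```

Because the outer folds play no part in selecting the model, the outer error is an honest estimate of how the whole selection procedure performs on new data.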
Double CV selected Model 3, the random forest model, with the following results.
Double CV evaluation on random forest model (3)
The selected model explained a respectable amount of the variance, accounting for around 75% of the variation in the test data.
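The "variation explained" figure can be read as a coefficient of determination (R squared) computed on held-out data. That reading is an assumption, since the post does not name the statistic. The sketch below uses made-up numbers chosen so that the result is 0.75; they are not the ozone results.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up values, for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 8.0, 8.0]

score = r_squared(actual, predicted)  # 0.75: ss_res = 5, ss_tot = 20
```

A value of 0.75 means the predictions leave a quarter of the variation in the held-out response unexplained.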
3 Random forest model results and conclusion


The variable importance plot reveals that temperature is the most important factor in determining the ozone concentration in the area of Upland, California, with pressure and humidity following closely behind. This is plausible, since ozone tends to condense under pressure, and when the air is very humid, ozone has a difficult time dissipating from the atmosphere.
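One common way to rank variables is permutation importance: shuffle one predictor and measure how much the prediction error grows. The post does not say which importance measure its plot shows, so the sketch below is an assumption. The toy model, the column names, and the data are hypothetical and are not the fitted random forest.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of a model over a list of row dictionaries."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, column, seed=0):
    """Increase in MSE after shuffling one column across rows."""
    baseline = mse(model, rows, targets)
    shuffled = [r[column] for r in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = [dict(r, **{column: v}) for r, v in zip(rows, shuffled)]
    return mse(model, permuted, targets) - baseline

def toy_model(row):
    """Hypothetical stand-in model that leans mostly on temperature."""
    return 0.2 * row["temperature"] + 0.01 * row["humidity"]

# Synthetic rows, for illustration only.
rng = random.Random(3)
rows = [{"temperature": rng.uniform(40, 95), "humidity": rng.uniform(20, 90)}
        for _ in range(200)]
targets = [toy_model(r) for r in rows]

importance = {c: permutation_importance(toy_model, rows, targets, c)
              for c in ("temperature", "humidity")}
```

The predictor whose shuffling hurts the error most ranks highest, which here is temperature by construction.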

Finally, I'd urge app developers who use our algorithm to convey the accuracy of the approach to their users appropriately. A forecasting program that explains around 75% of the variation in the test data is acceptable, but ozone levels above 0.1 parts per million (ppm) constitute a significant hazard to human health and should be avoided. To avoid widespread fear or a false sense of security, the app should display a warning and link to more up-to-date information if the algorithm is not updated in real time. Additional data must be collected to obtain a more accurate forecast of the response variable, which would let app developers provide the general public with a more accurate and precise ozone alert forecasting system.
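The recommendation above could look something like the following in an app. This is a minimal sketch: the 0.1 ppm threshold comes from this post, while the function name, the message wording, and the 24-hour staleness limit are hypothetical choices made only for illustration.

```python
from datetime import datetime, timedelta

OZONE_HAZARD_PPM = 0.1               # hazard threshold quoted in this post
MAX_MODEL_AGE = timedelta(hours=24)  # hypothetical staleness limit

def ozone_message(predicted_ppm, model_updated_at, now):
    """Build a user-facing message from one forecast (hypothetical helper)."""
    if predicted_ppm > OZONE_HAZARD_PPM:
        parts = ["ALERT: forecast ozone is above 0.1 ppm."]
    else:
        parts = ["Forecast ozone is at or below 0.1 ppm."]
    # Always state that the forecast is an estimate.
    parts.append("This is an estimate and can be wrong.")
    # Warn when the model has not been refreshed recently.
    if now - model_updated_at > MAX_MODEL_AGE:
        parts.append("Forecast may be out of date; check a current source.")
    return " ".join(parts)

now = datetime(2022, 5, 15, 12, 0)
fresh = ozone_message(0.12, now - timedelta(hours=1), now)
stale = ozone_message(0.05, now - timedelta(days=3), now)
```

Stating the uncertainty on every message, and not only on alerts, guards against both of the failure modes mentioned above: needless fear and a false sense of security.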
