Summary of Modeling Strategy

Introduction

The goal of the hackathon is to make predictions from a dataset of NYC COVID-19 case counts, together with death and hospitalization counts. We predict the hospitalization count for ten days using machine learning models. We aim to find the best-fitting model, one that gives accurate results, and to provide visualizations that support our predictions.

Analysis

Data set

For this XN HACKATHON project, we used COVID-19 data for New York City [2]. The data runs from 29th February 2020 to 27th May 2020. Our goal is to use a machine learning model in R to predict the hospitalized counts for a 10-day window, 7th April 2020 to 16th April 2020.

Input Dataset

Dependency Management

Running the code requires the R programming language and RStudio. Refer to the README.txt document below for detailed instructions.

README.txt

Data Correlation

To view the association between the response and independent variables, we generated the following graph:

The graph shows that the highest correlation is between hospitalized count and case count, at 95.7%: the more cases there are, the more people are hospitalized. Death count is also highly correlated with hospitalization count, but less so than case count. The density plots show an overall decreasing trend for all 3 attributes. The lowest correlation, 76.9%, is between case count and death count. This suggests that not all patients who tested positive for coronavirus died; some may have recovered.
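As a rough illustration of how such pairwise correlations can be computed, the sketch below uses base R's cor() on a small synthetic data frame. The column names (case_count, hospitalized_count, death_count) and all values are assumptions made for the example; they are not the real NYC data and will not reproduce the figures quoted above.

```r
# Synthetic stand-in for the NYC data; column names and values are assumptions.
set.seed(1)
n <- 38
case_count <- round(cumsum(runif(n, 0, 300)))
covid <- data.frame(
  case_count         = case_count,
  hospitalized_count = pmax(0, round(0.25 * case_count + rnorm(n, sd = 40))),
  death_count        = pmax(0, round(0.05 * case_count + rnorm(n, sd = 30)))
)

# Pairwise Pearson correlations between the three counts.
cor_matrix <- cor(covid, method = "pearson")
print(round(cor_matrix, 3))
```

Each off-diagonal entry of cor_matrix is the correlation between one pair of attributes. A scatterplot matrix of the same data frame could be drawn with the base graphics function pairs().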

Data Partitioning

To perform prediction analysis, we split the data into training and test datasets. The training dataset consists of the first 38 rows, covering 29th February to 6th April 2020, and the test dataset covers the following 10 days, for which we make predictions.
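A minimal sketch of this date-based split is shown below. The data frame, its column names (date, hospitalized_count), and the values are hypothetical placeholders; only the date boundaries come from the text.

```r
# Synthetic daily series over the report's date range; values are made up.
dates <- seq(as.Date("2020-02-29"), as.Date("2020-05-27"), by = "day")
set.seed(2)
covid <- data.frame(
  date               = dates,
  hospitalized_count = rpois(length(dates), lambda = 200)
)

train_end <- as.Date("2020-04-06")  # last training day
test_end  <- as.Date("2020-04-16")  # last day of the 10-day forecast window

# Split chronologically so that no future observation leaks into training.
train <- covid[covid$date <= train_end, ]
test  <- covid[covid$date > train_end & covid$date <= test_end, ]

nrow(train)  # 38 rows: 29 Feb to 6 Apr 2020
nrow(test)   # 10 rows: 7 Apr to 16 Apr 2020
```

Splitting by date rather than at random keeps the temporal order intact, which matters for time-series models.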

Data Modeling and Prediction

To analyze our data, we used 3 modeling strategies: ARIMA, Random Forest, and Generalized Linear Regression Model. To check the accuracy of the models, we used 3 measures: Root Mean Square Error (RMSE), R-squared (R2), and Mean Absolute Percentage Error (MAPE).
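The three accuracy measures can be sketched in base R as follows. The function names and the toy vectors are assumptions for illustration. R2 is computed here as the coefficient of determination (1 minus the ratio of residual to total sum of squares); some packages instead report the squared correlation between actual and predicted values, so the report's own figures may follow a different convention.

```r
# RMSE: square root of the mean squared prediction error.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R2 as the coefficient of determination.
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# MAPE in percent; undefined when any actual value is zero.
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

# Toy values, not taken from the NYC data.
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 310, 390)

rmse(actual, predicted)       # 10
r_squared(actual, predicted)  # 0.992
mape(actual, predicted)       # about 5.21
```

Lower RMSE and MAPE and higher R2 indicate a better fit. MAPE is scale-free, which makes it convenient for comparing models on count data of different magnitudes.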

1. ARIMA

To perform time-series analysis, we used the Auto-Regressive Integrated Moving Average (ARIMA) method. ARIMA is a class of models that captures the temporal structure of the data. Our data set is indexed by date, on which our prediction relies, and it is non-stationary, i.e. its statistical properties such as the mean change over time; the integrated (differencing) component of ARIMA is designed to handle this. The fitted model provides a statistical basis for forecasting.
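The sketch below shows one way to fit an ARIMA model and produce a 10-step forecast with base R's arima() and predict(). The series is a synthetic random walk of 38 points, and the order (0, 1, 1) is an arbitrary choice for illustration; neither is taken from the report, whose model order and fitting function are not stated here.

```r
set.seed(3)
# Synthetic non-stationary series: a random walk of 38 daily values.
y <- ts(500 + cumsum(rnorm(38, sd = 10)))

# ARIMA(p = 0, d = 1, q = 1): d = 1 differences the series once to
# remove the non-stationarity. The order is an illustrative assumption.
fit <- arima(y, order = c(0, 1, 1))

# Forecast the next 10 time steps with their standard errors.
fc <- predict(fit, n.ahead = 10)
fc$pred  # point forecasts
fc$se    # forecast standard errors
```

In practice, the order would be chosen by inspecting autocorrelation plots or by comparing candidate models on a criterion such as AIC, and the forecasts would then be scored against the 10 held-out days.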