Overview

As of April 28th, the number of confirmed COVID-19 cases world-wide had risen to about 3.08 million, with one third of those cases (1.03 million) reported within the United States 1. With the number of deaths exceeding that of the common flu 2 within a short period of time, and with some states beginning to reopen 3, it is important to look back at the effectiveness of social distancing and lockdowns by comparing them against a projection of infectives under no intervention measures.

While there are many epidemiological models available, the bread and butter of modeling any infectious disease is SIR (Susceptible, Infectives, Removed). For this project, the simple SIR model is modified to include a delay factor 4:

\[\frac{dS}{dt} = - \frac{\beta S_\tau I_\tau}{N}\]

\[\frac{dI}{dt} = \frac{\beta S_\tau I_\tau}{N} - \gamma I\]

\[\frac{dR}{dt} = \gamma I\]

\[R_0 = \frac{\beta}{\gamma}\]

Where \(S\), \(I\), and \(R\) are the numbers of susceptible, infective, and removed individuals, \(N\) is the total population, \(\beta\) is the transmission rate, \(\gamma\) is the removal rate, \(S_\tau\) and \(I_\tau\) are shorthand for \(S(t - \tau)\) and \(I(t - \tau)\) with delay \(\tau\), and \(R_0\) is the basic reproduction number.

One main assumption is that the total population remains constant, such that \(S + I + R = N\). Also notice that the initial number of infectives must be \(\ge 1\) for there to be any movement at all. For this model, Recovered and Deaths were combined into R to facilitate the MCMC process.
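Summing the three equations confirms the constant-population assumption:

\[\frac{dS}{dt} + \frac{dI}{dt} + \frac{dR}{dt} = -\frac{\beta S_\tau I_\tau}{N} + \frac{\beta S_\tau I_\tau}{N} - \gamma I + \gamma I = 0\]

so \(N\) is indeed conserved over time.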

From this model, the parameters \(\beta\) and \(\gamma\) were estimated by MCMC 5 against the available COVID-19 case counts. Delayed Rejection Adaptive Metropolis 6 was used for 10,000 simulations. The likelihood function was assumed to be normal, with the cost function being the sum of squared errors. None of this would have been possible without pymcmcstat 7.
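As a rough sketch of that setup (not the project’s exact script), pymcmcstat’s DRAM sampler can be configured along these lines; here days, confirmed, and run_sir are hypothetical stand-ins for the observation arrays and a wrapper around the ODE solve:

import numpy as np
from pymcmcstat.MCMC import MCMC

def ssfun(theta, data):
    #Sum-of-squares cost: simulate infectives for candidate (beta, gamma)
    #and compare against the observed confirmed cases
    beta, gamma = theta
    ydata = data.ydata[0]
    ymodel = run_sir(beta, gamma)        #hypothetical wrapper around the ODE solve
    return np.sum((ydata[:, 0] - ymodel) ** 2)

mcstat = MCMC()
mcstat.data.add_data_set(x = days, y = confirmed)    #hypothetical observation arrays
mcstat.parameters.add_model_parameter(name = 'beta', theta0 = 0.3, minimum = 0.0)
mcstat.parameters.add_model_parameter(name = 'gamma', theta0 = 0.1, minimum = 0.0)
mcstat.simulation_options.define_simulation_options(nsimu = 10000, method = 'dram', updatesigma = True)
mcstat.model_settings.define_model_settings(sos_function = ssfun)
mcstat.run_simulation()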

As for the assumptions and initial conditions of the projection: both the MCMC simulations and the projection assumed a 14-day incubation period, and the model was integrated using SciPy’s odeint. The resulting projection is discussed in the next section.
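A minimal sketch of that integration, assuming placeholder values for \(\beta\) and \(\gamma\) (the fitted ones come from the MCMC run). Note that odeint has no native delay support, so the delayed terms would additionally require interpolating \(S\) and \(I\) at \(t - \tau\) from the stored history:

import numpy as np
from scipy.integrate import odeint

def sir_rhs(y, t, beta, gamma, N):
    #Standard (non-delayed) SIR right-hand side
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return dS, dI, dR

N, I0 = 6.0e6, 1.0              #an "average state" of 6 million people, one initial infective
y0 = (N - I0, I0, 0.0)
t = np.linspace(0, 300, 301)    #days from March 1st
beta, gamma = 0.35, 0.1         #placeholder values; the fitted ones come from the MCMC run
S, I, R = odeint(sir_rhs, y0, t, args = (beta, gamma, N)).T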

Projection

Based on the projection, for an “average state” of 6 million people, the number of infectives would peak at about 200,000 roughly 150 days from March 1st. Due to the simple nature of the model, many factors are not considered, so the projection should only be taken as a representation of the worst-case scenario. However, the model indicates an inflection point around 110 days from March 1st, after which the rate of new infections slows down.
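Continuing the integration sketch above, both figures can be read off the simulated trajectory:

peak_day = t[np.argmax(I)]                          #day on which the number of infectives peaks
inflection_day = t[np.argmax(np.gradient(I, t))]    #day on which infectives grow fastest
print("Peak at day {:.0f}, inflection at day {:.0f}".format(peak_day, inflection_day))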

Data Description and Data Processing

The entire data acquisition and processing pipeline can be found here, with the automated script here.

The video above shows the number of COVID-19 cases (confirmed and deaths) by location over the past two months. Notice the shift in scatter plot locations from March 9th to March 10th, when recorded cases switch from a local-level to a state-level grouping. The state-level reporting reverts to the more granular form between March 21st and 22nd. Obvious hotspots of COVID-19 cases include New York, California, Wisconsin, and almost anywhere with a highly dense population.

Johns Hopkins CSSE offers detailed data about COVID-19 cases all over the world. One major flaw of this database is its inconsistent formatting, especially the “Recovered” column being all zeroes for regions within the US. This is a problem that the data collection team has to fix. Fortunately, the MCMC simulations rely only on the number of confirmed cases.

A class was created to extract and reorganize relevant information for mappings: dates, column filters, etc.

import glob
import os
import re
import pandas as pd
from itertools import product as prod

class preProcess():
    def __init__(self, path, country):
        self.__dates = sorted([re.search(r'(\d+-\d+-\d+)', d).group() for d in glob.glob(os.path.join(path, "*.csv"))])
        self.__df = {re.search(r'(\d+-\d+-\d+)',i).group(): pd.read_csv(i) for i in glob.glob(os.path.join(path, "*.csv"))}
        self.c_format = {'Country/Region', 'Country_Region'} #Upon looking at csv formats, I realized there are inconsistencies in keys. This should fix it.
        self.coords_format = {'Latitude', 'Longitude', 'Lat', 'Long_'}
        self.s_format = {'Province/State', 'Province_State'}
        self.__countries = sorted({c for d in self.__dates for c in self.__df[d][(self.c_format & set(self.__df[d].columns)).pop()]})

        #1. Dates on which the requested country appears and the coordinate columns exist
        self.__cDates = sorted({i for i in self.dates if country in set(self.df[i][(self.c_format & set(self.df[i].columns)).pop()])} &
                               {j for j in self.dates if self.coords_format & set(self.df[j].columns)})

        #2. Per-date frames for the requested country
        self.__country = self.countryBreakDown(country)

        #3. A single merged frame of time series indexed by state/province
        self.__states = self.statesBreakDown()...        

Reusable variables, such as the column-alias sets above, make the structure easier for users to identify:

#Break down stats by State/Province
def countryBreakDown(self, country):
    #Data processing is getting more difficult than I thought. Let's hope that they stop changing the data format as the days go on.
    #For each valid date: keep rows for the country, select the state, longitude, latitude, and case columns, and index by state
    pre_df = {d: self.df[d][self.df[d][(self.c_format & set(self.df[d].columns)).pop()] == country][[(self.s_format & set(self.df[d].columns)).pop(),
                                       sorted(self.coords_format & set(self.df[d].columns))[-1],
                                       sorted(self.coords_format & set(self.df[d].columns))[0],
                                       'Confirmed', 'Recovered', 'Deaths']].set_index((self.s_format & set(self.df[d].columns)).pop()) for d in self.cDates}
    #Add multi-index: coordinates go under 'Coordinates', case counts go under the date
    for k, v in pre_df.items():
        coords = list(prod(['Coordinates'], v.columns[:2]))
        coords.extend(list(prod([k], v.columns[2:])))
        v.columns = pd.MultiIndex.from_tuples(coords)
        v.index.names = ['State/Province'] #Rename index level for consistency
    return pre_df
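For a sense of the resulting structure, each per-date frame can be sliced by level; here df is a preProcess instance (as in the driver loop later) and the date key is illustrative:

frames = df.country
frames['03-25-2020']['Coordinates']              #longitude/latitude columns
frames['03-25-2020']['03-25-2020']['Confirmed']  #confirmed counts for that date, indexed by state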

#States break down. PASS IN self.country
def statesBreakDown(self):
    statesdict = {"D.C":"District of Columbia","AL":"Alabama","AK":"Alaska","AZ":"Arizona","AR":"Arkansas","CA":"California","CO":"Colorado","CT":"Connecticut","DE":"Delaware","FL":"Florida","GA":"Georgia","HI":"Hawaii","ID":"Idaho","IL":"Illinois","IN":"Indiana","IA":"Iowa","KS":"Kansas","KY":"Kentucky","LA":"Louisiana","ME":"Maine","MD":"Maryland","MA":"Massachusetts","MI":"Michigan","MN":"Minnesota","MS":"Mississippi","MO":"Missouri","MT":"Montana","NE":"Nebraska","NV":"Nevada","NH":"New Hampshire","NJ":"New Jersey","NM":"New Mexico","NY":"New York","NC":"North Carolina","ND":"North Dakota","OH":"Ohio","OK":"Oklahoma","OR":"Oregon","PA":"Pennsylvania","RI":"Rhode Island","SC":"South Carolina","SD":"South Dakota","TN":"Tennessee","TX":"Texas","UT":"Utah","VT":"Vermont","VA":"Virginia","WA":"Washington","WV":"West Virginia","WI":"Wisconsin","WY":"Wyoming"}
    DPfilter = {'Unassigned Location (From Diamond Princess)','Grand Princess Cruise Ship'}
    #Use the latest date as the scheme for the merge. This relies heavily on the latest date containing coords
    latest = self.cDates[-1]
    dff = self.country[latest].groupby(self.country[latest].index).sum()
    dff.drop(columns = dff.columns.levels[0][0], level = 0, inplace = True) #Drop the date columns, so all we have now are the states
    if len(dff.columns.levels) == 2:
        dff.drop(columns = dff.columns.levels[0][1], level = 0, inplace = True) #DROPPING COORDINATES TOO. COMMENT IF NEEDED
    #####################################################################
    for i in self.cDates:
        df = self.country[i].groupby(self.country[i].index).sum()
        #Some dates do not have the coordinates column, need to check length of level 0 multiIndex
        if len(df.columns.levels) == 2:
            df.drop(columns = df.columns.levels[0][1], level = 0, inplace = True)
        #Before merging, filter state/province rows left over from previous dates
        if len(set(df.index) & DPfilter):
            df.drop(index = list(set(df.index) & DPfilter), inplace = True)
        #Extract state abbreviations such as "WA" or "D.C" from entries like "King County, WA"
        filtered = [re.findall(r'([A-Z]{2}|[A-Z]\.[A-Z])', x)[0] for x in df.index if len(re.findall(r'([A-Z]{2}|[A-Z]\.[A-Z])', x)) == 1]
        if len(set(statesdict) & set(filtered)):
            df.index = [statesdict[s] for s in filtered]
            df = df.groupby(df.index).sum()
            df.index.names = ['State/Province']
        dff = pd.merge(dff, df, how = 'left', on = 'State/Province')
    #Depending on new locations, might need to add or remove from below
    dff.drop(index = ['Recovered','Virgin Islands', 'Northern Mariana Islands', 'Grand Princess', 'Diamond Princess', 'Puerto Rico'], inplace = True)
    dff.fillna(0, inplace = True) #Fill NaN for states whose values don't exist on previous dates
    return dff
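A hypothetical driver (the path is illustrative, and the states attribute is assumed to be a property exposing self.__states) ties the pieces together:

df = preProcess('csse_covid_19_daily_reports', 'US')  #hypothetical path to the JHU daily CSVs
frames = df.country                                   #per-date frames keyed by date
states = df.states                                    #merged per-state time series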

Overall, the script’s functionality is contingent on the data collection teams and their willingness to keep the formatting consistent.

Data Visualization

During this process, pandas 8 and numpy 9, along with the standard-library re 10 and itertools 11 modules, were helpful for data manipulation and multi-indexing during the reformat. With the spreadsheets organized, map creation was made easy with basemap 12, and pyplot 13 was then used to project COVID-19 statistics:

import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.basemap import Basemap

# Valid columns are: 'Confirmed', 'Recovered', 'Deaths'
def barGraphPerCountry(self, c_frame, column):
    fig, ax = plt.subplots(figsize = (12, 6))
    sns.barplot(x = list(c_frame.index),
                y = column,
                data = c_frame,
                orient = 'v',
                ax = ax) #Draw on the created axes instead of rebinding fig
    ax.set_xticklabels([]) #Too many locations to label every bar

def mapPlot(self, c_frame, column, save = 1):
    date = c_frame.columns.levels[0][0]
    cf = c_frame.copy(deep = True)
    cf.columns = cf.columns.droplevel(0)
    #Keep only mainland US coordinates (this filters out Alaska, Hawaii, and territories)
    cf = cf[(cf[cf.columns[0]].between(-130, -55) & cf[cf.columns[1]].between(26, 50))]

    '''Bookkeeping'''
    lon, lat = cf[cf.columns[0]].values, cf[cf.columns[1]].values
    confirmed, color = cf['Confirmed'].values, cf[column].values
    max_col = self.country[self.cDates[-1]][self.cDates[-1]][column].values

    '''Draw map'''
    fig = plt.figure(figsize = (20, 20))
    m = Basemap(projection = 'lcc', resolution = 'h',
        width = 5E6, height = 4E6,
        lat_1 = 26., lat_2 = 50, lat_0 = 35, lon_0 = -98)
    m.shadedrelief()
    m.drawcoastlines(color = 'gray')
    m.drawcountries(color = 'black')
    m.drawstates(color = 'gray')

    '''Scatter'''
    cNorm = plt.Normalize(vmin = 0, vmax = max_col.max())
    m.scatter(x = lon, y = lat, latlon = True,
          s = confirmed,
          c = color, cmap = plt.get_cmap('jet'),
          norm = cNorm,
          alpha = 0.5)

    cbar = plt.colorbar(label = column, shrink = 0.60)    

    plt.ylabel('Latitude', fontsize = 14)
    plt.xlabel('Longitude', fontsize = 14)
    plt.title('Confirmed Cases for {}: {}'.format(date, int(confirmed.sum())), fontsize = 20)

    #Size legend
    for a in [10, 100, 500]:
        plt.scatter([], [], c = 'k', alpha = 0.9, s = a,
                label = str(a) + ' Confirmed')
    plt.legend(loc = 3, fontsize = 16)

    if not save:
        plt.show()

    else:
        if not os.path.exists('Progress'):
            os.makedirs('Progress')
        plt.savefig("Progress/{}.png".format(date), bbox_inches = 'tight', dpi = 100)
        plt.clf()
        plt.close()

The COVID-19 cases within the US for each date were passed into the method above to produce each graph:

import matplotlib as mpl
mpl.use('Agg') #Non-interactive backend: figures are written to disk instead of displayed
for i, date in enumerate(df.country.keys()):
    print("Creating map for {}.........{}/{}".format(date, i + 1, len(df.country.keys())))
    df.mapPlot(df.country[date], 'Deaths', 1)

Then cv2 14 was used to stitch the graphs together. The “Agg” backend was invoked earlier to prevent matplotlib from displaying each graph it created:

    '''Stitching images together'''
    print('Creating video of maps...')
    path = sorted(glob.glob(os.path.join('./Progress/*.png')))
    files = [cv2.imread(file) for file in path]
    height, width, layers = files[0].shape
    size = (width, height)
    files = [cv2.resize(file, size) for file in files]

    out = cv2.VideoWriter('ConfirmedvsDeaths.avi', cv2.VideoWriter_fourcc(*'XVID'), 60, size)
    for file in files:
        for _ in range(15):
            out.write(file)
    out.release()
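At 60 frames per second with each map written 15 times, every date stays on screen for a quarter of a second.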

Author: Tinh Son. Last edited: 2020-05-04.


  1. “Coronavirus (COVID-19).” Google News, Google, news.google.com/covid19/map?hl=en-US&gl=US&ceid=US%3Aen.↩︎

  2. “Disease Burden of Influenza.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 17 Apr. 2020, www.cdc.gov/flu/about/burden/index.html.↩︎

  3. “2019-2020 U.S. Flu Season: Preliminary Burden Estimates.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 17 Apr. 2020, www.cdc.gov/flu/about/burden/preliminary-in-season-estimates.htm.↩︎

  4. Zhang, Zhi-yu, et al. “Analysis of a Delayed SIR Model with Nonlinear Incidence Rate.” Discrete Dynamics in Nature and Society, Hindawi, 27 Jan. 2009, www.hindawi.com/journals/ddns/2008/636153/.↩︎

  5. Shaver, Ben. “A Zero-Math Introduction to Markov Chain Monte Carlo Methods.” Medium, Towards Data Science, 24 Dec. 2017, towardsdatascience.com/a-zero-math-introduction-to-markov-chain-monte-carlo-methods-dcba889e0c50.↩︎

  6. Laine, M., et al. “Delayed Rejection Adaptive Metropolis.” DRAM, 2011, helios.fmi.fi/~lainema/dram/.↩︎

  7. https://pypi.org/project/pymcmcstat/↩︎

  8. https://pandas.pydata.org/↩︎

  9. https://numpy.org/↩︎

  10. https://docs.python.org/3/library/re.html↩︎

  11. https://docs.python.org/3/library/itertools.html↩︎

  12. https://matplotlib.org/basemap/↩︎

  13. https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.html↩︎

  14. https://pypi.org/project/opencv-python/↩︎