Is the adverse impact of COVID-19 more prevalent for a specific gender or age group?


COVID-19, the highly contagious disease that has been spreading through the world like wildfire, is a black box. There is limited visibility on how the virus behind this disease acts or who it impacts more. Robust data analysis to answer some of the unknowns associated with the disease can help save many lives.


The purpose of this analysis is to identify whether the death rate for men above 50 years of age is higher than for other age groups and genders. Once this is ascertained for confirmed cases of the disease around the world, the same can be determined specifically for Australia.


Publicly available demographic data on COVID-19 has been used for the analysis. The data primarily contains the country of each confirmed case, the number of cases and the outcome (death or recovery), along with demographic details such as gender and age.


Based on the analysis, male infections outnumber female infections. Further, men over 50 years of age have been more impacted than men under 50 years of age.


This category of the population is more at risk, and this finding can help plan resources and precautions better for them. The death rate is high at 5.8% as per the dataset. Also, the most commonly exhibited symptoms are fever and cough.



The world has been taken by storm by the spread of Novel Coronavirus, also called COVID-19. Most economic activities are on hold, with people around the world surviving on essentials. The virus is alleged to have jumped from bats to humans; it belongs to a class of coronaviruses common among wild animals. The less likely event of animal-to-human transmission occurred in China last year. The first case was reported late in 2019 in Wuhan, which became an epicentre of this highly contagious disease. Since then, as the movement of people in and out of Wuhan continued, more and more people travelled around the world and caused the disease to spread.


Very little is known about these viruses. Doctors, scientists and data scientists are putting in all efforts to gather more information on how the virus behaves, who it impacts, how much risk it poses to an individual and how to treat those affected. In the past month we have seen how this disease has severely strained the resources of countries like Italy and Spain. In such a scenario, it is important to know which part of the population is most at risk, so that the necessary preventive actions can be taken with an extra amount of caution. With the help of demographic data on COVID-19, we can assess whether a specific age group and gender is at more risk. Further, there have been several reports that elderly men are more likely to suffer from the adverse effects of the disease.


The role of data scientists in the current pandemic situation is huge. Almost 4 million people have been affected worldwide, and the medical fraternity is already working beyond what we could have considered humanly possible. The best way to defeat the virus is by finding out whatever we can about it. The phrase ‘knowledge is power’ makes perfect sense in this situation.


Knowledge of which gender and age group is at more risk can help with better planning of a country’s resources. Extra precautionary measures can be prescribed for the identified population sets. There have been several reports in the past which indicated that elderly men are at a higher risk; this could be attributed to their lower immunity. A confirmation of this hypothesis can also help prioritize vaccines, once available, on logical grounds.




COVID-19 data is publicly available. All related data has been deemed open-source so that data scientists can work on it for novel insights. The chosen dataset has been taken from the Johns Hopkins GitHub repository. A random sample of 1085 records with 19 variables has been chosen for observation.


This data is a compilation of the official reports published daily by nations around the world. Johns Hopkins publishes a dashboard on confirmed cases and outcomes, split by geographical area.


There are 19 variables in the dataset. List of the variables –


Figure 1 Dataset variables



From the list of column names, we can determine the nature of the data. It contains a unique identifier for every row, the number of cases in a country, the date when the case was reported, a summary of the type of case, and the location and country of the case. Further, there are demographic details, i.e. gender and age. There are details on the dates of onset and exposure, and on whether the case was from Wuhan or had visited Wuhan. The outcome of the case is also available. The credibility of a case can be ascertained by looking at the source of information.




Various steps have been performed for this analysis to arrive at the insights; these are described below. RStudio has been installed with the help of the manual from CRAN (R Installation and Administration, 2020). The RStudio version used for the analysis is ‘1.1.383’.


Data representation

A preliminary understanding of the data is important before we start with any analysis. The data has been imported into a data frame using the read.csv command. The str command has been used to identify data types and example values in the data frame (str function, 2020).
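The import and inspection step can be sketched as follows. This is a minimal, self-contained illustration: the three-row data frame `toy` and the temporary file are made up so the snippet runs without the real 1085-row file, and `stringsAsFactors = TRUE` is an assumption that reproduces the factor behaviour described in this report.

```r
# Hypothetical three-row extract standing in for the real dataset
toy <- data.frame(
  id = 1:3,
  country = c("China", "Japan", "Australia"),
  gender = c("male", "female", "male"),
  age = c(61, 44, 58)
)

# Write it to a temporary CSV so read.csv has something to import
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# stringsAsFactors = TRUE stores text columns as factors (assumed here)
mydata <- read.csv(path, stringsAsFactors = TRUE)

str(mydata)   # data type and example values for each column
dim(mydata)   # number of rows and columns
```

With the real file, `read.csv(file.choose())` opens a file picker instead of reading from a fixed path.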



Type conversion

Text data is stored as factors. As symptom data is required to understand the most common symptoms in patients, the type of this column has been changed to character using the lapply function (lapply function, 2020).


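Age data has been converted from factors to numeric using the as.numeric function (Statistic Globe, 2020). A factor must first be converted to character; otherwise as.numeric returns the internal level codes rather than the ages.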
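The two conversions can be sketched on a small made-up data frame. The values below are hypothetical and only the column names age and symptom follow the dataset. The sketch also shows why calling as.numeric directly on a factor is a common pitfall.

```r
# Hypothetical columns stored as factors
mydata <- data.frame(
  age = factor(c("61", "44", "58", "44")),
  symptom = factor(c("fever", "cough", "fever, cough", ""))
)

# Convert the symptom column from factor to character
mydata["symptom"] <- lapply(mydata["symptom"], as.character)

# Pitfall: as.numeric() on a factor returns the level codes, not the ages
codes <- as.numeric(mydata$age)

# Going through character first recovers the actual ages
mydata["age"] <- lapply(mydata["age"], function(x) as.numeric(as.character(x)))

print(codes)
print(mydata$age)
```

Comparing the two printed vectors makes the difference between level codes and real values visible.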


Unstructured to structured data

One of the columns, called ‘symptom’, has text data. This has been cleaned to remove slashes, spaces and other characters found in the text, for better analysis. The libraries dplyr, stringr, ggplot2 and tidytext have been loaded for this (Text processing in R, 2020).


Data cleaning

The death rate can be calculated from the data using the death column. However, a date is recorded in this column for cases where the outcome was death. The date itself is not required, so such values have been replaced with a 1 by applying the condition mydata$death != "0". The data, stored as character, is then converted to numeric.


Further, the number of deaths was found to be 63 out of 1085, i.e. a 5.8% death rate according to the chosen dataset.
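A minimal sketch of this cleaning step and of the death rate arithmetic is given below. The five values in the death column are made up for illustration. The last line repeats the calculation with the counts reported above (63 deaths out of 1085 records).

```r
# Hypothetical death column: "0" for survivors, "1" or a date for deaths
mydata <- data.frame(
  death = c("0", "1", "0", "2/21/2020", "0"),
  stringsAsFactors = FALSE
)

# Any value other than "0" marks a death, so recode it as "1"
mydata$death[mydata$death != "0"] <- "1"

# Convert the character column to numeric so it can be summed
mydata$death <- as.numeric(mydata$death)

deaths <- sum(mydata$death)
death_rate <- 100 * deaths / nrow(mydata)   # percentage for the toy data

# Same arithmetic with the counts reported for the full dataset
reported_rate <- round(100 * 63 / 1085, 1)
print(reported_rate)
```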


Group based data summarization

The data has been summarized by country to understand the distribution using the table function (Table function in R, 2020). This gives a fair idea of the representation of countries in the data set. As a result, it was observed that China has the highest representation in the data (197), followed by Japan (190), South Korea (114) and Hong Kong (94).
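The grouping can be sketched with the table function on a small made-up vector of countries. Sorting the result in decreasing order is an addition for readability and is not stated in the report.

```r
# Hypothetical country column
country <- c("China", "Japan", "China", "Australia", "Japan", "China")

# Count the records per country
counts <- table(country)

# Order the countries from most to least represented
sorted <- sort(counts, decreasing = TRUE)
print(sorted)
```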



Data visualization

After cleaning the data, we come to the main objective of finding the proportion of male and female cases. There have been more cases among men: 60% of those infected are male and 40% are female. This plot has been created using the barplot command (Barplot, 2020), showing proportions rather than counts for a clearer picture.


Figure 2 Gender split for all data
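A plot of this kind can be sketched as below. This is an illustration under stated assumptions rather than the exact call used for Figure 2: the five gender values are made up, and prop.table is one way to turn counts into proportions before plotting.

```r
# Hypothetical gender column
gender <- c("male", "female", "male", "male", "female")

# Turn counts into proportions so the bars add up to 1
gender_share <- prop.table(table(gender))
print(gender_share)

# Bar plot of the proportions
barplot(gender_share,
        main = "Gender split",
        ylab = "Proportion of cases",
        ylim = c(0, 1))
```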




Further, age has been plotted as a histogram, with breaks defined at intervals of 10, using the hist function (Hist function, 2020) (The subset function, 2020).


Figure 3 Age split for all data

The data is consistent with our hypothesis that those over the age of 50 are at a higher risk.
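The binning behind such a histogram can be inspected without drawing it. The sketch below uses made-up ages and `plot = FALSE` to return the bin counts. The share of cases over 50 is an extra summary added for illustration.

```r
# Hypothetical ages, including one missing value
age <- c(23, 35, 47, 52, 58, 61, 66, 74, NA)

# Breaks every 10 years from 0 to 100
breaks <- seq(0, 100, by = 10)

# plot = FALSE returns the histogram object instead of drawing it;
# non-finite values such as NA are dropped by hist()
h <- hist(age, breaks = breaks, plot = FALSE)
print(h$counts)

# Share of recorded cases older than 50
share_over_50 <- mean(age > 50, na.rm = TRUE)
print(share_over_50)
```

Dropping `plot = FALSE` draws the histogram with the same bins.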



Data subset selection

A data subset is created to understand whether the observations on gender and age for all nations apply to Australia as well. The subset function has been used for this (The subset function, 2020).


Figure 4 Gender split for Australia



Figure 5 Age split for Australia



The data visualizations show that the worldwide observations not only apply to Australia but are more pronounced there. However, no conclusion can be drawn, as the Australian sample contains fewer than 30 records.
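The subset step and the sample size check can be sketched as follows. The four rows are made up for illustration, and the threshold of 30 is the one mentioned above.

```r
# Hypothetical records from several countries
mydata <- data.frame(
  country = c("Australia", "China", "Australia", "Japan"),
  gender = c("male", "female", "male", "female"),
  age = c(63, 41, 55, 37)
)

# Keep only the Australian cases
ausdata <- subset(mydata, country == "Australia")

# Check the sample size before drawing conclusions
n_aus <- nrow(ausdata)
if (n_aus < 30) {
  message("Only ", n_aus, " Australian records: too few for a firm conclusion")
}
```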


Exploratory visualization

Further, after cleaning the symptoms data, it was identified that a few symptoms occurred more frequently than others. The ggplot2 package (ggplot2, 2020) has been used to plot the most common symptoms.


Figure 6 Most common symptoms using ggplot2
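One way to obtain such frequencies is to count, for each symptom keyword, the number of cases whose text mentions it. The sketch below is a base R illustration under stated assumptions, not necessarily the approach used for Figure 6: the symptom strings and the shortened keyword list are made up.

```r
# Hypothetical free-text symptom column
symptom <- c("fever, cough", "fever", "", "cough, sore throat", "fever, headache")

# Keywords to look for (shortened list for illustration)
name <- c("fever", "cough", "throat", "headache")

# For each keyword, count the cases whose text contains it
freq <- sapply(name, function(s) sum(grepl(s, symptom, fixed = TRUE)))

# Collect the result in a data frame, most frequent symptom first
symptomsfreq <- data.frame(name = name, freq = unname(freq))
symptomsfreq <- symptomsfreq[order(-symptomsfreq$freq), ]
print(symptomsfreq)
```

The resulting data frame has the name and freq columns that a bar chart of symptom frequencies needs.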


Results and Discussion

Through data exploration we identified the split of this random dataset between countries, gender and age. We also identified that if grouped by countries, a conclusive statement can be made only for a few countries due to lack of data. There are a few key points that emerged from the analysis about COVID-19 –

  • Population at higher risk – Male, above 50 years of age

  • Death rate – 5.8%

  • Most common symptoms – Fever and cough
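The risk comparison behind these points can be made explicit by computing the death rate per gender and age group. The following is a minimal sketch under stated assumptions and is not the analysis run for this report: the data frame `cases` and its values are made up, and only the column names gender, age and death follow the dataset.

```r
# Hypothetical cases with outcome coded as 0 (alive) or 1 (death)
cases <- data.frame(
  gender = c("male", "male", "male", "male",
             "female", "female", "female", "female"),
  age = c(34, 45, 62, 71, 29, 48, 55, 80),
  death = c(0, 0, 1, 0, 0, 0, 0, 1)
)

# Split the cases at 50 years of age
cases$age_group <- ifelse(cases$age > 50, "over 50", "50 or under")

# The mean of a 0/1 column is the death rate within each group
rates <- aggregate(death ~ gender + age_group, data = cases, FUN = mean)
print(rates)
```

On the real data, each group also needs enough records for the rates to be meaningful.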



Data understanding and cleaning, followed by data conversion, subset formation, grouping and visualizations, resulted in some actionable insights. These insights have wider implications for everyday life. We have determined that the death rate for the disease is high, which implies that without proper planning and preparation, we may lose many lives. Further, elderly men should be extra cautious, and governments can take measures to restrict the movement of this category of the population. There should be new services aimed at making their lives easier and limiting interaction with the outside world. Also, those exhibiting symptoms of COVID-19, i.e. fever and cough, should be tested and isolated to prevent further spread of the disease.



R Installation and Administration. (2020). Retrieved from CRAN R Project:

str function. (2020). Retrieved from R documentation:

lapply function. (2020). Retrieved from R documentation:

Statistic Globe. (2020). Retrieved from Factors to Numbers:

Text processing in R. (2020). Retrieved from

Table function in R. (2020). Retrieved from Data science made simple:

Barplot. (2020). Retrieved from R documentation:

Hist function. (2020). Retrieved from R documentation:

The subset function. (2020). Retrieved from R Bloggers:


ggplot2. (2020). Retrieved from Tidyverse:








Figure 7 str(mydata)



Figure 9 Type conversion- lapply





R Code

#check R version


#load csv file

mydata <- read.csv(file.choose())

#check data


#check data types

str(mydata)


#convert symptom column (column 18) to character

mydata[18] <- lapply(mydata[18], as.character)


#plot of gender (proportions; assumes the column is named 'gender')

barplot(prop.table(table(mydata$gender)))


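#convert age data to numeric (via character, so the ages are kept rather than the factor level codes)

mydata[8] <- lapply(mydata[8], function(x) as.numeric(as.character(x)))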

#plot of age

hist(  mydata$age,

  breaks = c(0,10,20,30,40,50,60,70,80,90,100))


#subset country data for Australia

ausdata<-subset(mydata, mydata$country=="Australia")


#plot of gender for Australia (proportions; assumes the column is named 'gender')

barplot(prop.table(table(ausdata$gender)))


#plot of age for Australia

hist(  ausdata$age,

       breaks = c(0,10,20,30,40,50,60,70,80,90,100))

#load libraries for converting unstructured text data to more structured form


library(dplyr) # Data wrangling & manipulation

library(stringr) # For managing text

library(ggplot2) # For data visualizations & graphs


#create new list of symptoms (assumes the column is named 'symptom', as described in the text)

symptomslist <- mydata$symptom



#clean text data

symptomslist <- paste(symptomslist, collapse = " ") # Collapse all entries into one string

symptomslist <- str_replace_all(symptomslist, pattern = '\"', replacement = "") # Remove double quotes

symptomslist <- str_replace_all(symptomslist, pattern = '\n', replacement = "") # Remove \n

symptomslist <- str_replace_all(symptomslist, pattern = '\u0092', replacement = "'") #Replace with quote

symptomslist <- str_replace_all(symptomslist, pattern = '\u0091', replacement = "'") #Replace with quote


#create new vectors with symptom name and frequency

name<-c('fever','cough','throat','pain','headache','diarrhea','chills','breath','runny nose')

#count how often each symptom appears in the collapsed text

freq<-str_count(symptomslist, name)


#create a dataframe of symptom frequencies

symptomsfreq <- data.frame(name, freq)



#ggplot to determine most common symptoms

ggplot(data=symptomsfreq, aes(x=freq,y=name))+geom_bar(stat="identity")

