1 Introduction to the data

To analyze the World Life Expectancy data, four data sets were downloaded and read into RStudio. The following data was used for the analysis:

  • Income per person
  • Life expectancy per year
  • Population Size
  • Country Regions

Each data set includes information that could affect life expectancy throughout the world. The public domain data ranges from 1800 to 2018 and includes variables that are both relevant and irrelevant to the analysis.

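library(tidyr)     # gather()
library(dplyr)     # %>%, group_by(), inner_join(), filter()
library(ggplot2)   # ggplot()
library(openxlsx)  # write.xlsx(); assumed, the xlsx package has a function of the same name
countries.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/countries_total.csv")
income.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/income_per_person.csv")
life.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/life_expectancy_years.csv")
popu.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/population_total.csv")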

Conserved across the four data sets is one common variable: country name. Additionally, three of the four utilized data sets share a secondary common variable, year.


2 Preparing the Data for a Proper Merge

Once the raw data was loaded into RStudio, each of the four data frames had to be reshaped or trimmed so that only the variables within the scope of interest remained.

2.1 Income Per Person Data

Starting with Income per person data, the data frame required reshaping to allow for proper merge of all four data sets and subsequent analysis.


The data frame of raw income data (named income.raw) was reshaped from wide to long format by gathering the year columns into a single Year variable. The country column (geo) was excluded from the gather so that it is kept as an identifier, which allows one row for each country in each year.


This process transformed the raw income data of 193 observations of 220 variables into the reshaped data frame of income data (named income) that includes 42267 observations of 3 variables.


Missing values (NA) were omitted from the reshaped income data frame by setting na.rm = TRUE.

income <- income.raw %>% 
  gather(key = "Year",
         value = "Income",
         - geo,
         na.rm = TRUE)
colnames(income) <- c("Country", "Year", "Income")
group_by(income, Income)
# A tibble: 42,267 × 3
# Groups:   Income [2,218]
   Country             Year  Income
   <chr>               <chr>  <int>
 1 Afghanistan         X1800    603
 2 Albania             X1800    667
 3 Algeria             X1800    715
 4 Andorra             X1800   1200
 5 Angola              X1800    618
 6 Antigua and Barbuda X1800    757
 7 Argentina           X1800   1510
 8 Armenia             X1800    514
 9 Australia           X1800    814
10 Austria             X1800   1850
# ℹ 42,257 more rows
str(income)
'data.frame':   42267 obs. of  3 variables:
 $ Country: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ Year   : chr  "X1800" "X1800" "X1800" "X1800" ...
 $ Income : int  603 667 715 1200 618 757 1510 514 814 1850 ...
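
The output above shows that Year is stored as text with the "X" prefix that read.csv adds to numeric column names. As a minimal sketch only (not the method used in this report), the same wide-to-long reshape can be written with pivot_longer(), which supersedes gather() in current tidyr, while converting Year to an integer in the same step. The toy data frame income_wide and its values are hypothetical stand-ins for income.raw.

```r
library(tidyr)

# Hypothetical toy data in the same wide layout as income.raw:
# one row per country, one column per year (read.csv prefixes them with "X").
income_wide <- data.frame(
  geo   = c("Afghanistan", "Albania", "Algeria"),
  X1800 = c(603L, 667L, 715L),
  X1801 = c(603L, NA, 716L),
  stringsAsFactors = FALSE
)

income_long <- pivot_longer(
  income_wide,
  cols            = -geo,                      # keep geo as the identifier column
  names_to        = "Year",
  names_prefix    = "X",                       # strip the "X" added by read.csv
  names_transform = list(Year = as.integer),   # store Year as an integer
  values_to       = "Income",
  values_drop_na  = TRUE                       # same role as na.rm = TRUE in gather()
)

# Rename the identifier column to match the naming used in this report.
names(income_long)[names(income_long) == "geo"] <- "Country"
```

With an integer Year, later steps could filter with Year == 2000 instead of Year == "X2000". Note that pivot_longer() orders rows by country first, whereas gather() orders them by year.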


2.2 Life Expectancy Data

Second, Life Expectancy data required reshaping to appropriately merge the data sets for analysis.


The data frame of raw life expectancy data (named life.raw) was reshaped from wide to long format by gathering the year columns into a single Year variable. The country column (geo) was excluded from the gather so that it is kept as an identifier, which allows one row for each country in each year.


This process transformed the raw life expectancy data of 187 observations of 220 variables into the reshaped data frame of life expectancy data (named life) that includes 40437 observations of 3 variables.


Missing values (NA) were omitted from the reshaped life expectancy data frame by setting na.rm = TRUE.

life <- life.raw %>% 
  gather(key = "Year",
         value = "Life_Exp",
         - geo,
         na.rm = TRUE)
colnames(life) <- c("Country", "Year", "Life_Exp")
group_by(life, Life_Exp)
# A tibble: 40,437 × 3
# Groups:   Life_Exp [739]
   Country             Year  Life_Exp
   <chr>               <chr>    <dbl>
 1 Afghanistan         X1800     28.2
 2 Albania             X1800     35.4
 3 Algeria             X1800     28.8
 4 Angola              X1800     27  
 5 Antigua and Barbuda X1800     33.5
 6 Argentina           X1800     33.2
 7 Armenia             X1800     34  
 8 Australia           X1800     34  
 9 Austria             X1800     34.4
10 Azerbaijan          X1800     29.2
# ℹ 40,427 more rows
str(life)
'data.frame':   40437 obs. of  3 variables:
 $ Country : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
 $ Year    : chr  "X1800" "X1800" "X1800" "X1800" ...
 $ Life_Exp: num  28.2 35.4 28.8 27 33.5 33.2 34 34 34.4 29.2 ...
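
The reported dimensions can be cross-checked with simple bookkeeping: a wide table with 187 rows and 219 year columns (220 variables minus geo) holds 187 × 219 = 40953 cells, so a long table of 40437 rows implies that 516 missing cells were dropped by na.rm = TRUE. The sketch below shows the same check on a hypothetical toy table (life_wide and its values are illustrative, not taken from life.raw) and then repeats the arithmetic with the numbers reported above.

```r
# Hypothetical wide table: 3 countries x 2 year columns, with one missing cell.
life_wide <- data.frame(
  geo   = c("Afghanistan", "Albania", "Andorra"),
  X1800 = c(28.2, 35.4, NA),
  X1801 = c(28.2, 35.4, 36.0),
  stringsAsFactors = FALSE
)

# Every column except the identifier is a year column.
year_cols <- setdiff(names(life_wide), "geo")

total_cells   <- nrow(life_wide) * length(year_cols)  # cells before reshaping
missing_cells <- sum(is.na(life_wide[year_cols]))     # cells dropped by na.rm = TRUE
expected_rows <- total_cells - missing_cells          # rows expected in the long table

# Same bookkeeping with the dimensions reported above for life.raw and life.
reported_total   <- 187 * (220 - 1)
reported_dropped <- reported_total - 40437
```

The same arithmetic applied to the income data gives 193 × 219 = 42267, which matches the reported row count exactly.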


2.3 Population Data

Third, the Population Data required reshaping as well to appropriately merge the data sets for analysis.


The data frame of raw population data (named popu.raw) was reshaped from wide to long format by gathering the year columns into a single Year variable. The country column (geo) was excluded from the gather so that it is kept as an identifier, which allows one row for each country in each year.


This process transformed the raw population data of 195 observations of 220 variables into the reshaped data frame of population data (named popu) that includes 42705 observations of 3 variables.


Missing values (NA) were omitted from the reshaped population data frame by setting na.rm = TRUE.

popu <- popu.raw %>% 
  gather(key = "Year",
         value = "Population",
         - geo,
         na.rm = TRUE)
colnames(popu) <- c("Country", "Year", "Population")
group_by(popu, Population)
# A tibble: 42,705 × 3
# Groups:   Population [4,599]
   Country             Year  Population
   <chr>               <chr>      <int>
 1 Afghanistan         X1800    3280000
 2 Albania             X1800     410000
 3 Algeria             X1800    2500000
 4 Andorra             X1800       2650
 5 Angola              X1800    1570000
 6 Antigua and Barbuda X1800      37000
 7 Argentina           X1800     534000
 8 Armenia             X1800     413000
 9 Australia           X1800     351000
10 Austria             X1800    3210000
# ℹ 42,695 more rows
str(popu)
'data.frame':   42705 obs. of  3 variables:
 $ Country   : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ Year      : chr  "X1800" "X1800" "X1800" "X1800" ...
 $ Population: int  3280000 410000 2500000 2650 1570000 37000 534000 413000 351000 3210000 ...


2.4 Country Data

Fourth, the Country Data required subsetting to appropriately merge the data sets for analysis.


A subset of the data frame of raw country data (named countries.raw) was taken to remove all variables except for the Country name and Continent.


This process transformed the raw country data of 248 observations of 11 variables into the subset data frame of country data (named countries) that includes 248 observations of 2 variables.

countries <- subset(countries.raw, select = c(name, region))

colnames(countries) <- c("Country", "Continent")
str(countries)
'data.frame':   248 obs. of  2 variables:
 $ Country  : chr  "Afghanistan" "\xea\xf3land Islands" "Albania" "Algeria" ...
 $ Continent: chr  "Asia" "Europe" "Europe" "Africa" ...



3 Merging the Data

With all four data sets (Income Per Person Data, Life Expectancy Data, Population Data, and Country Data) reshaped for a clean merge, the final data set was merged in multiple steps to ensure accuracy.


3.1 Merging Longitudinal Data Part 1

First, the following longitudinal data frames were merged into a single data frame (named LifeExpIncom):

  • Life Expectancy Data (reshaped)
  • Income Per Person Data (reshaped)

LifeExpIncom <- inner_join(life, income, by = c("Country", "Year"))
group_by(LifeExpIncom, Life_Exp)
# A tibble: 40,437 × 4
# Groups:   Life_Exp [739]
   Country             Year  Life_Exp Income
   <chr>               <chr>    <dbl>  <int>
 1 Afghanistan         X1800     28.2    603
 2 Albania             X1800     35.4    667
 3 Algeria             X1800     28.8    715
 4 Angola              X1800     27      618
 5 Antigua and Barbuda X1800     33.5    757
 6 Argentina           X1800     33.2   1510
 7 Armenia             X1800     34      514
 8 Australia           X1800     34      814
 9 Austria             X1800     34.4   1850
10 Azerbaijan          X1800     29.2    775
# ℹ 40,427 more rows

Because the merge uses both “Country” and “Year” as keys, the resultant data frame LifeExpIncom includes a single Country variable and a single Year variable, even though both merged data frames contain them.


Because inner_join() keeps only the Country and Year combinations present in both data frames, rows without a match in the other data frame were omitted from the merged data frame.

str(LifeExpIncom)
'data.frame':   40437 obs. of  4 variables:
 $ Country : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
 $ Year    : chr  "X1800" "X1800" "X1800" "X1800" ...
 $ Life_Exp: num  28.2 35.4 28.8 27 33.5 33.2 34 34 34.4 29.2 ...
 $ Income  : int  603 667 715 618 757 1510 514 814 1850 775 ...
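
The behavior of inner_join() described above can be illustrated on a small scale. The sketch below uses hypothetical toy data frames (life_toy and income_toy are not part of this analysis) to show that the key columns appear once in the result and that rows are dropped only when a key combination is missing from the other table. anti_join() lists the rows that would be dropped.

```r
library(dplyr)

# Hypothetical toy versions of the reshaped life and income data frames.
life_toy <- data.frame(
  Country  = c("Afghanistan", "Albania", "Azerbaijan"),
  Year     = c("X1800", "X1800", "X1800"),
  Life_Exp = c(28.2, 35.4, 29.2),
  stringsAsFactors = FALSE
)
income_toy <- data.frame(
  Country = c("Afghanistan", "Albania", "Andorra"),
  Year    = c("X1800", "X1800", "X1800"),
  Income  = c(603L, 667L, 1200L),
  stringsAsFactors = FALSE
)

# Rows present in both inputs; Country and Year each appear once in the result.
both <- inner_join(life_toy, income_toy, by = c("Country", "Year"))

# Rows of life_toy that inner_join() drops because income_toy has no match.
life_only <- anti_join(life_toy, income_toy, by = c("Country", "Year"))
```

In the actual merge, LifeExpIncom has the same 40437 rows as life, which indicates that every Country and Year combination in life found a match in income.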


3.2 Merging Longitudinal Data Part 2

Second, the following longitudinal data frames were merged into a single data frame (named LifeIncomPopu):

  • Life Expectancy and Income Data (merged data from above)
  • Population Data (reshaped)

LifeIncomPopu <- inner_join(LifeExpIncom, popu, by = c("Country", "Year"))

Because the merge uses both “Country” and “Year” as keys, the resultant data frame LifeIncomPopu includes a single Country variable and a single Year variable, even though both merged data frames contain them.


Because inner_join() keeps only the Country and Year combinations present in both data frames, rows without a match in the other data frame were omitted from the merged data frame.

str(LifeIncomPopu)
'data.frame':   40437 obs. of  5 variables:
 $ Country   : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
 $ Year      : chr  "X1800" "X1800" "X1800" "X1800" ...
 $ Life_Exp  : num  28.2 35.4 28.8 27 33.5 33.2 34 34 34.4 29.2 ...
 $ Income    : int  603 667 715 618 757 1510 514 814 1850 775 ...
 $ Population: int  3280000 410000 2500000 1570000 37000 534000 413000 351000 3210000 880000 ...


3.3 Merging Longitudinal Data with Categorical Data

Once all longitudinal data had been merged into a single data frame, the following longitudinal and categorical data frames were merged into a final data frame (named LifeExp_Clean):

  • Life Expectancy, Income, and Population Data (merged data from above)
  • Country Data (reshaped)

LifeExp_Clean <- inner_join(LifeIncomPopu, countries, by = "Country")

Because the merge uses “Country” as the key, the resultant data frame LifeExp_Clean includes a single Country variable, even though both merged data frames contain it.


This process transformed the merged Income, Life Expectancy, and Population data of 40437 observations of 5 variables into the merged data frame of Income, Life Expectancy, Population, and Country Data (named LifeExp_Clean) that includes 37590 observations of 6 variables. The net loss of 2847 observations reflects rows whose Country value has no exact match in the Country Data.


Because inner_join() keeps only the Country values present in both data frames, rows without a match in the other data frame were omitted from the merged data frame.

str(LifeExp_Clean)
'data.frame':   37590 obs. of  6 variables:
 $ Country   : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
 $ Year      : chr  "X1800" "X1800" "X1800" "X1800" ...
 $ Life_Exp  : num  28.2 35.4 28.8 27 33.5 33.2 34 34 34.4 29.2 ...
 $ Income    : int  603 667 715 618 757 1510 514 814 1850 775 ...
 $ Population: int  3280000 410000 2500000 1570000 37000 534000 413000 351000 3210000 880000 ...
 $ Continent : chr  "Asia" "Europe" "Africa" "Africa" ...
write.xlsx(LifeExp_Clean, file = "C:\\Users\\natal\\Downloads\\LifeExp_Clean.xlsx")

A copy of the data file printed above can be accessed at the following GitHub link.
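
The country merge above reduces the data from 40437 to 37590 rows, a difference of 2847. A likely cause is country names that are spelled differently in the two sources, although this report does not list which names failed to match. The following sketch, written in base R so it runs without extra packages, shows one way such mismatches could be diagnosed before merging. The toy data frames, their values, and the "Czech Republic" versus "Czechia" spelling difference are all hypothetical.

```r
# Hypothetical toy inputs; the two spellings stand in for the kind of
# naming difference that can occur between data sources.
merged_toy <- data.frame(
  Country  = c("Albania", "Albania", "Czech Republic", "Czech Republic"),
  Year     = c("X1800", "X1801", "X1800", "X1801"),
  Life_Exp = c(35.4, 35.4, 36.5, 36.5),
  stringsAsFactors = FALSE
)
countries_toy <- data.frame(
  Country   = c("Albania", "Czechia"),
  Continent = c("Europe", "Europe"),
  stringsAsFactors = FALSE
)

# Country names on the left side that have no exact match on the right.
unmatched <- setdiff(unique(merged_toy$Country), countries_toy$Country)

# Number of rows an inner join on Country would drop for those names.
rows_lost <- sum(merged_toy$Country %in% unmatched)

# merge() performs an inner join by default, so the row counts should agree.
joined <- merge(merged_toy, countries_toy, by = "Country")
```

Applied to the real data, setdiff(unique(LifeIncomPopu$Country), countries$Country) would list the names to reconcile before the merge.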



4 Pruning the Data for Visualization Purposes

To visualize data of this scale, a subset for the year 2000 was taken (named data2000).

data2000 <- filter(LifeExp_Clean, Year == "X2000")
str(data2000)
'data.frame':   174 obs. of  6 variables:
 $ Country   : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ Year      : chr  "X2000" "X2000" "X2000" "X2000" ...
 $ Life_Exp  : num  51.6 74.4 73.9 81.8 53.4 74.7 74.2 71.8 79.7 78.1 ...
 $ Income    : int  972 5470 10200 31700 3510 18800 14900 2930 35300 38800 ...
 $ Population: int  20100000 3120000 31200000 65400 16400000 83600 37100000 3070000 19100000 8070000 ...
 $ Continent : chr  "Asia" "Europe" "Africa" "Europe" ...
write.xlsx(data2000, file = "C:\\Users\\natal\\Downloads\\data2000.xlsx")

A copy of the data file printed above can be accessed at the following GitHub link.



5 Plotting the Data

The prepared and pruned data in data2000 is plotted below.

ggplot(data2000) +
  aes(x = Income, y = Life_Exp, color = Continent) +
  # alpha is a fixed value, so it is set outside aes() and needs no legend
  geom_point(aes(size = Population), alpha = 0.5) +
  geom_smooth(method = "lm", fill = NA) +
  scale_color_manual(values = c('#648FFF', '#785EF0', '#DC267F', '#FE6100', '#FFB000')) +
  labs(x = "Income Per Person ($/person/year)",
       y = "Life Expectancy (years)",
       title = "Impact of Income on Average Life Expectancy",
       subtitle = "Controlled for both region (Continent) and population size (Population)",
       color = "Continent",
       size = "Population")

As illustrated above, there is a positive correlation between Average Income Per Person and Average Life Expectancy for all continents. Continents with a greater range of average incomes per person demonstrate a greater positive correlation between income and life expectancy. The income range of each continent can be read from the length of its linear regression line: lines that extend to greater x values indicate a greater income range for that Continent. Population size is more difficult to observe in the above plot, but for Asia there appears to be a minor negative correlation between population size and life expectancy, as well as between population size and income per person. This is suggested by the clustering of larger points towards the bottom left of the plot (lower values for both income and life expectancy).
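
The visual reading above could be backed by numbers. As a rough sketch only, the code below computes the Pearson correlation between income and life expectancy within each continent, along with the width of each continent's income range. The data frame toy and its values are hypothetical and stand in for data2000, so the results say nothing about the real data.

```r
# Hypothetical toy stand-in for data2000 (values are illustrative only).
toy <- data.frame(
  Continent = rep(c("Africa", "Europe"), each = 4),
  Income    = c(900, 1500, 3500, 10000, 5500, 15000, 30000, 40000),
  Life_Exp  = c(51, 55, 53, 70, 72, 75, 79, 80),
  stringsAsFactors = FALSE
)

# One data frame per continent.
by_continent <- split(toy, toy$Continent)

# Pearson correlation between income and life expectancy within each continent.
cor_by_continent <- sapply(by_continent, function(d) cor(d$Income, d$Life_Exp))

# Width of the income range, which corresponds to the regression line length.
range_by_continent <- sapply(by_continent, function(d) diff(range(d$Income)))
```

Replacing toy with data2000 would give one correlation and one income range per continent, which can then be compared directly. Keep in mind that a regression line shows the slope of the fit, which is related to, but not the same as, the strength of the correlation.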


