Introduction to the Data
To analyze the World Life Expectancy data, four data sets were
downloaded and read into RStudio. The following data sets were
used for the analysis:
- Income per person
- Life expectancy per year
- Population Size
- Country Regions
Each data set includes information that could potentially impact life
expectancy throughout the world. The public domain data ranges from 1800
to 2018 and includes variables that are both relevant and irrelevant to
the analysis.
library(dplyr)    # %>%, group_by(), inner_join(), filter()
library(tidyr)    # gather()
library(ggplot2)  # ggplot()
# write.xlsx() additionally requires a package such as openxlsx
countries.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/countries_total.csv")
income.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/income_per_person.csv")
life.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/life_expectancy_years.csv")
popu.raw <- read.csv("https://nlepera.github.io/sta553/w05_ggplot/data/population_total.csv")
Conserved across the four data sets is one common variable:
country name. Additionally, three of the four utilized data sets
share a secondary common variable, year.
Preparing the Data for a Proper Merge
Once the raw data was loaded into RStudio, variables that were
outside of the scope of interest needed to be removed from each of the
four data frames.
Income Per Person Data
Starting with the Income Per Person data, the data frame required
reshaping to allow a proper merge of all four data sets and subsequent
analysis.
The data frame of raw income data (named
income.raw) was reshaped from wide to long format by gathering
the year columns into a single Year column. The country column (geo)
was excluded from the gather so that it is retained as an identifier,
which gives one row per country for each year.
This process transformed the raw income data of
193 observations of
220 variables into the reshaped data
frame of income data (named income)
that includes 42267 observations of
3 variables.
All blank values were omitted from the reshaped income data
frame by setting na.rm = TRUE.
income <- income.raw %>%
  gather(key = "Year",
         value = "Income",
         -geo,
         na.rm = TRUE)
colnames(income) <- c("Country", "Year", "Income")
group_by(income, Income)
# A tibble: 42,267 × 3
# Groups: Income [2,218]
Country Year Income
<chr> <chr> <int>
1 Afghanistan X1800 603
2 Albania X1800 667
3 Algeria X1800 715
4 Andorra X1800 1200
5 Angola X1800 618
6 Antigua and Barbuda X1800 757
7 Argentina X1800 1510
8 Armenia X1800 514
9 Australia X1800 814
10 Austria X1800 1850
# ℹ 42,257 more rows
str(income)
'data.frame': 42267 obs. of 3 variables:
$ Country: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ Year : chr "X1800" "X1800" "X1800" "X1800" ...
$ Income : int 603 667 715 1200 618 757 1510 514 814 1850 ...
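As the output above shows, Year is stored as text with an "X" prefix (for example "X1800"), because read.csv() prefixes column names that start with a digit. The following is a minimal sketch, not the method used in this report, of how the same reshape could be written with tidyr's pivot_longer(), which supersedes gather(), while also converting Year to an integer. The income_toy data frame and its values are made up for illustration, and the names_transform argument assumes tidyr 1.1.0 or later.

```r
library(tidyr)

# Toy stand-in for income.raw (values are illustrative, not real data):
# one row per country, one column per year.
income_toy <- data.frame(
  geo   = c("Afghanistan", "Albania"),
  X1800 = c(603L, NA),
  X1801 = c(603L, 667L)
)

income_long <- pivot_longer(
  income_toy,
  cols            = -geo,                      # keep geo as the identifier column
  names_to        = "Year",
  names_prefix    = "X",                       # strip the "X" added by read.csv()
  names_transform = list(Year = as.integer),   # store Year as an integer
  values_to       = "Income",
  values_drop_na  = TRUE                       # same role as na.rm = TRUE in gather()
)
names(income_long)[1] <- "Country"

print(income_long)
```

With an integer Year, a later subset could be written as Year == 2000 rather than Year == "X2000".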
Life Expectancy Data
Second, Life Expectancy data required reshaping to
appropriately merge the data sets for analysis.
The data frame of raw life expectancy data (named
life.raw) was reshaped from wide to long format by gathering
the year columns into a single Year column. The country column (geo)
was excluded from the gather so that it is retained as an identifier,
which gives one row per country for each year.
This process transformed the raw life expectancy data of
187 observations of
220 variables into the reshaped data
frame of life expectancy data (named
life) that includes
40437 observations of
3 variables.
All blank values were omitted from the reshaped life
expectancy data frame by setting na.rm = TRUE.
life <- life.raw %>%
  gather(key = "Year",
         value = "Life_Exp",
         -geo,
         na.rm = TRUE)
colnames(life) <- c("Country", "Year", "Life_Exp")
group_by(life, Life_Exp)
# A tibble: 40,437 × 3
# Groups: Life_Exp [739]
Country Year Life_Exp
<chr> <chr> <dbl>
1 Afghanistan X1800 28.2
2 Albania X1800 35.4
3 Algeria X1800 28.8
4 Angola X1800 27
5 Antigua and Barbuda X1800 33.5
6 Argentina X1800 33.2
7 Armenia X1800 34
8 Australia X1800 34
9 Austria X1800 34.4
10 Azerbaijan X1800 29.2
# ℹ 40,427 more rows
str(life)
'data.frame': 40437 obs. of 3 variables:
$ Country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
$ Year : chr "X1800" "X1800" "X1800" "X1800" ...
$ Life_Exp: num 28.2 35.4 28.8 27 33.5 33.2 34 34 34.4 29.2 ...
Population Data
Third, the Population Data required reshaping as well to
appropriately merge the data sets for analysis.
The data frame of raw population data (named
popu.raw) was reshaped from wide to long format by gathering
the year columns into a single Year column. The country column (geo)
was excluded from the gather so that it is retained as an identifier,
which gives one row per country for each year.
This process transformed the raw population data of
195 observations of
220 variables into the reshaped data
frame of population data (named
popu) that includes
42705 observations of
3 variables.
All blank values were omitted from the reshaped population
data frame by setting na.rm = TRUE.
popu <- popu.raw %>%
  gather(key = "Year",
         value = "Population",
         -geo,
         na.rm = TRUE)
colnames(popu) <- c("Country", "Year", "Population")
group_by(popu, Population)
# A tibble: 42,705 × 3
# Groups: Population [4,599]
Country Year Population
<chr> <chr> <int>
1 Afghanistan X1800 3280000
2 Albania X1800 410000
3 Algeria X1800 2500000
4 Andorra X1800 2650
5 Angola X1800 1570000
6 Antigua and Barbuda X1800 37000
7 Argentina X1800 534000
8 Armenia X1800 413000
9 Australia X1800 351000
10 Austria X1800 3210000
# ℹ 42,695 more rows
str(popu)
'data.frame': 42705 obs. of 3 variables:
$ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ Year : chr "X1800" "X1800" "X1800" "X1800" ...
$ Population: int 3280000 410000 2500000 2650 1570000 37000 534000 413000 351000 3210000 ...
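The income, life expectancy, and population data sets were all reshaped with the same three steps: gather the year columns, drop missing values, and rename the columns. As a sketch only, those steps could be wrapped in one helper so the logic is written once. The function name reshape_long and the popu_toy data frame are hypothetical and are not part of the original analysis.

```r
library(tidyr)

# Hypothetical helper: reshape one wide data frame (a geo column plus one
# column per year) into long form with columns Country, Year, <value_name>.
reshape_long <- function(raw, value_name) {
  long <- gather(raw, key = "Year", value = "Value", -geo, na.rm = TRUE)
  names(long) <- c("Country", "Year", value_name)
  long
}

# Toy stand-in for popu.raw (values are illustrative only).
popu_toy <- data.frame(
  geo   = c("Andorra", "Angola"),
  X1800 = c(2650L, 1570000L),
  X1801 = c(NA, 1570000L)
)

popu_long <- reshape_long(popu_toy, "Population")
str(popu_long)
```

Under this assumption the three reshapes would reduce to calls such as reshape_long(income.raw, "Income"), reshape_long(life.raw, "Life_Exp"), and reshape_long(popu.raw, "Population").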
Country Data
Fourth, the Country Data required subsetting to
appropriately merge the data sets for analysis.
A subset of the data frame of raw country data (named
countries.raw) was taken to remove all
variables except for the Country name and Continent.
This process transformed the raw country data of
248 observations of
11 variables into the subset data
frame of country data (named
countries) that includes
248 observations of
2 variables.
countries <- subset(countries.raw, select = c(name, region))
colnames(countries) <- c("Country", "Continent")
str(countries)
'data.frame': 248 obs. of 2 variables:
$ Country : chr "Afghanistan" "\xea\xf3land Islands" "Albania" "Algeria" ...
$ Continent: chr "Asia" "Europe" "Europe" "Africa" ...
Merging the Data
With all four data sets (Income Per Person Data, Life
Expectancy Data, Population Data, and Country Data)
reshaped for a clean merge, the final data set was merged in multiple
steps to ensure accuracy.
Merging Longitudinal Data Part 1
First, the following longitudinal data frames were merged into a
single data frame (named
LifeExpIncom):
- Life Expectancy Data (reshaped)
- Income Per Person Data (reshaped)
LifeExpIncom <- inner_join(life, income, by = c("Country", "Year"))
group_by(LifeExpIncom, Life_Exp)
# A tibble: 40,437 × 4
# Groups: Life_Exp [739]
Country Year Life_Exp Income
<chr> <chr> <dbl> <int>
1 Afghanistan X1800 28.2 603
2 Albania X1800 35.4 667
3 Algeria X1800 28.8 715
4 Angola X1800 27 618
5 Antigua and Barbuda X1800 33.5 757
6 Argentina X1800 33.2 1510
7 Armenia X1800 34 514
8 Australia X1800 34 814
9 Austria X1800 34.4 1850
10 Azerbaijan X1800 29.2 775
# ℹ 40,427 more rows
Because the merge was keyed on both the “Country” and “Year” variables,
the resultant data frame LifeExpIncom includes a
single Country variable and a single Year variable, even though both
merged data frames contain these variables.
Because inner_join() keeps only the Country and Year pairs that appear in
both data frames, the merge introduced no blank values.
str(LifeExpIncom)
'data.frame': 40437 obs. of 4 variables:
$ Country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
$ Year : chr "X1800" "X1800" "X1800" "X1800" ...
$ Life_Exp: num 28.2 35.4 28.8 27 33.5 33.2 34 34 34.4 29.2 ...
$ Income : int 603 667 715 618 757 1510 514 814 1850 775 ...
Merging Longitudinal Data Part 2
Second, the following longitudinal data frames were merged into a
single data frame (named
LifeIncomPopu):
- Life Expectancy and Income Data (merged data from above)
- Population Data (reshaped)
LifeIncomPopu <- inner_join( LifeExpIncom, popu, by = c("Country", "Year"))
Because the merge was keyed on both the “Country” and “Year” variables,
the resultant data frame LifeIncomPopu includes a
single Country variable and a single Year variable, even though both
merged data frames contain these variables.
Because inner_join() keeps only the Country and Year pairs that appear in
both data frames, the merge introduced no blank values.
str(LifeIncomPopu)
'data.frame': 40437 obs. of 5 variables:
$ Country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
$ Year : chr "X1800" "X1800" "X1800" "X1800" ...
$ Life_Exp : num 28.2 35.4 28.8 27 33.5 33.2 34 34 34.4 29.2 ...
$ Income : int 603 667 715 618 757 1510 514 814 1850 775 ...
$ Population: int 3280000 410000 2500000 1570000 37000 534000 413000 351000 3210000 880000 ...
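The two merge steps above apply the same inner_join() on the same keys. As an alternative sketch, the three longitudinal data frames could be joined in a single expression by folding inner_join() over a list with base R's Reduce(). The toy data frames below use made-up countries and values purely to show the mechanics.

```r
library(dplyr)

# Toy stand-ins for life, income, and popu (illustrative values only).
life_toy   <- data.frame(Country = c("A", "B"), Year = c("X2000", "X2000"),
                         Life_Exp = c(70.1, 65.2))
income_toy <- data.frame(Country = c("A", "B"), Year = c("X2000", "X2000"),
                         Income = c(5000L, 1200L))
popu_toy   <- data.frame(Country = c("A", "C"), Year = c("X2000", "X2000"),
                         Population = c(1000000L, 2000000L))

# Fold inner_join() over the list: ((life JOIN income) JOIN popu).
merged_toy <- Reduce(
  function(x, y) inner_join(x, y, by = c("Country", "Year")),
  list(life_toy, income_toy, popu_toy)
)

# Only country "A" appears in all three frames, so one row remains.
print(merged_toy)
```

The stepwise approach used in this report has the advantage that the row count can be inspected after each merge.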
Merging Longitudinal Data with Categorical Data
Once all longitudinal data had been merged into a single data frame,
the following longitudinal and categorical data
frames were merged into a final data frame (named
LifeExp_Clean):
- Life Expectancy, Income, and Population Data (merged data from
above)
- Country Data (reshaped)
LifeExp_Clean <- inner_join(LifeIncomPopu, countries, by = "Country")
Because the merge was keyed on the “Country” variable, the resultant data
frame LifeExp_Clean includes a single
variable for Country, even though both merged data frames contain this
variable.
This process transformed the merged Income, Life Expectancy, and
Population data of 40437 observations
of 5 variables into the final data
frame of merged Income, Life Expectancy, Population, and Country
data (named LifeExp_Clean) that
includes 37590 observations of
6 variables.
Because inner_join() keeps only rows whose Country appears in both data
frames, the 2847 observations whose country names had no match in the
country data were dropped from the merged data frame.
str(LifeExp_Clean)
'data.frame': 37590 obs. of 6 variables:
$ Country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
$ Year : chr "X1800" "X1800" "X1800" "X1800" ...
$ Life_Exp : num 28.2 35.4 28.8 27 33.5 33.2 34 34 34.4 29.2 ...
$ Income : int 603 667 715 618 757 1510 514 814 1850 775 ...
$ Population: int 3280000 410000 2500000 1570000 37000 534000 413000 351000 3210000 880000 ...
$ Continent : chr "Asia" "Europe" "Africa" "Africa" ...
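The final merge reduced the data from 40437 to 37590 observations, which suggests that some country names are spelled differently in the longitudinal data than in the country data. A minimal sketch of how such mismatches could be listed with dplyr's anti_join() is shown below. The country names "Examplia" and "Republic of Examplia" and all values are hypothetical and stand in for whichever names actually differ.

```r
library(dplyr)

# Toy stand-in for LifeIncomPopu (hypothetical names, illustrative values).
long_toy <- data.frame(
  Country  = c("Albania", "Examplia", "Examplia"),
  Year     = c("X1800", "X1800", "X1801"),
  Life_Exp = c(35.4, 30.0, 30.1)
)

# Toy stand-in for countries: the same country under a different spelling.
countries_toy <- data.frame(
  Country   = c("Albania", "Republic of Examplia"),
  Continent = c("Europe", "Europe")
)

# anti_join() returns the rows of long_toy with no match in countries_toy,
# i.e. exactly the rows an inner_join() would drop.
unmatched <- long_toy %>%
  anti_join(countries_toy, by = "Country") %>%
  count(Country, name = "rows_dropped")

print(unmatched)
```

Applied to the real data, the equivalent call would be anti_join(LifeIncomPopu, countries, by = "Country"); names listed there could be recoded before the merge if those countries should be retained.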
write.xlsx(LifeExp_Clean, file = "C:\\Users\\natal\\Downloads\\LifeExp_Clean.xlsx")
A copy of the data file printed above can be accessed at the
following GitHub link.
Pruning the Data for Visualization Purposes
To visualize data of this scale, a subset for the
year 2000 was taken (named
data2000).
data2000 <- filter(LifeExp_Clean, Year == "X2000")
str(data2000)
'data.frame': 174 obs. of 6 variables:
$ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ Year : chr "X2000" "X2000" "X2000" "X2000" ...
$ Life_Exp : num 51.6 74.4 73.9 81.8 53.4 74.7 74.2 71.8 79.7 78.1 ...
$ Income : int 972 5470 10200 31700 3510 18800 14900 2930 35300 38800 ...
$ Population: int 20100000 3120000 31200000 65400 16400000 83600 37100000 3070000 19100000 8070000 ...
$ Continent : chr "Asia" "Europe" "Africa" "Europe" ...
write.xlsx(data2000, file = "C:\\Users\\natal\\Downloads\\data2000.xlsx")
A copy of the data file printed above can be accessed at the
following GitHub link.
Plotting the Data
The prepared and pruned data found in
data2000 is plotted below.
ggplot(data2000) +
  aes(x = Income, y = Life_Exp, color = Continent) +
  # alpha is a fixed value, so it is set outside aes() and needs no legend
  geom_point(aes(size = Population), alpha = 0.5) +
  # se = FALSE hides the confidence band around each regression line
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_manual(values = c('#648FFF', '#785EF0', '#DC267F', '#FE6100', '#FFB000')) +
  labs(x = "Income Per Person ($/person/year)",
       y = "Life Expectancy (years)",
       title = "Impact of Income on Average Life Expectancy",
       subtitle = "Controlled for both region (Continent) and population size (Population)",
       color = "Continent", size = "Population")

As illustrated above, there is a positive correlation between Average
Income Per Person and Average Life Expectancy for all continents.
Continents with a greater range of average incomes per person appear to
demonstrate a stronger positive correlation between income and life
expectancy. The income per person range can be read from the length of
the linear regression lines added to the plot: regression lines that
extend to greater x values indicate a greater income range for that
Continent.
Population size is more difficult to observe in the above plot, but
there appears to be a minor negative correlation for Asia between
population size and life expectancy, as well as between population
size and income per person. This correlation is suggested by the
clustering of larger points towards the bottom left of the
plot (lower values for both income and life expectancy).
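The relationships described above are read from the plot by eye. As a sketch of how they could be checked numerically, the income range and Pearson correlations can be computed per continent with dplyr. The toy2000 data frame below uses made-up values with the same column names as data2000, so the numbers it produces say nothing about the real data.

```r
library(dplyr)

# Toy stand-in for data2000 (all values are illustrative, not real data).
toy2000 <- data.frame(
  Continent  = rep(c("Asia", "Europe"), each = 4),
  Income     = c(1000, 2000, 4000, 8000, 10000, 20000, 30000, 40000),
  Life_Exp   = c(55, 60, 66, 70, 72, 75, 78, 80),
  Population = c(5e7, 2e7, 1e7, 5e6, 1e7, 6e7, 5e6, 8e6)
)

summary_by_continent <- toy2000 %>%
  group_by(Continent) %>%
  summarise(
    income_range  = max(Income) - min(Income),    # spread along the x axis
    r_income_life = cor(Income, Life_Exp),        # income vs life expectancy
    r_pop_life    = cor(Population, Life_Exp),    # population vs life expectancy
    r_pop_income  = cor(Population, Income),      # population vs income
    .groups = "drop"
  )

print(summary_by_continent)
```

Replacing toy2000 with data2000 would give one row per continent, which could then be compared against the visual reading of the plot.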
