Mohammed Ashraf is a wildlife ecologist and the founder of Species Ecology. His academic interests and research focus lie broadly in three applied disciplines: 1. wildlife ecology, management and conservation; 2. mathematics, notably applied mathematics; and 3. high-level mathematical programming in Python, R and Java. Ashraf is interested in understanding the demographic variability of species that are ecologically enigmatic, endangered and facing a global extinction crisis in the face of anthropogenic disturbances across tropical and temperate ecosystems. The types of questions Ashraf is particularly interested in answering with regard to threatened species are:

1. What is the population size of the species, and how is the species distributed spatially and temporally across its ecological habitat in any particular ecosystem?

2. What proportion of the area is occupied by this species? Within this proportion, how is the species distributed (for example densely or sparsely)?

3. What is the distribution range (not home range) of this species, and where is this range increasing, shrinking or constant?

4. What is the relative abundance of this species, and how do we devise a statistically valid sampling survey to estimate indices of its relative abundance with confidence?

5. What is the absolute density and abundance of the species, and what sampling method would best serve us to estimate the population size and density with a degree of confidence?

In addition to the population ecology of vertebrate fauna, Ashraf is equally passionate about conservation education and outreach. This encompasses several facets of conservation, both in situ (in the field) and ex situ (in zoos and museums). For example, at the most fundamental level, animal population management and resource conservation in any ecologically important habitat require science-bound approaches, be it basic data collection on vegetation composition or on species distribution patterns across the ecological landscape. Often this information can serve as a benchmark for devising conservation management and monitoring programs in a cost-effective manner.

This website is not all about my biography, as you may have realized by now; hence my academic and other interests are neatly sandwiched between the ecological study of species and the associated branches of mathematics and computer programming. I believe that by presenting information this way, my audience will appreciate statistically valid ecological information alongside my academic and research interests without compromising either.

Wildlife Ecology: Animal Population Sampling & Estimation: Brief Introduction

Mohammed Ashraf

You can download the portable document format of this article by clicking PDF here.

Ecology is not the kind of science that takes people by storm, so I am not expecting it to be just what the doctor ordered. But we at Species Ecology are pretty ‘gung ho’ about it, and we have rolled up our sleeves and buckled down to do our part to ensure that science-bound ecological sustainability finds its niche in the face of anthropogenic development across the chessboard. I am not going to beat around the bush: one of the main purposes of reaching out to people is rooted in the fact that collaborative and collective action is fundamental to reinforcing the conservation pillars, of which wildlife science and ecology are basic ingredients. Therefore, I am at the crossroads, reaching out to potential academic scholars so that collectively we can go back to the drawing board and crank out the rudiments of a common language (mathematics) to preserve the mosaic of heterogeneous, pristine ecological units from Baluchistan in Pakistan to Yosemite in California and everything in between. I like to keep the ball rolling, and I am twisting arms to get scholars on board depending on what variety of fresh food (ecology) and ingredients (mathematical tools) they can bring to the table.

Many ecological inquiries can be modeled in order to find priority action measures and to predict scenarios. Take, for example, a fish population (denoted P) that can be modeled with an equation that is quadratic in \sqrt{t} to predict future population size. Here I have modeled the fish population P below and solved the equation to determine the time t (in days) at which the fish population will reach 500. This is just an example of some of the work I am pretty ‘gung ho’ about.

P = 3t + 10\sqrt{t} + 140

3(\sqrt{t})^{2} + 10\sqrt{t} + 140 = 500

3(\sqrt{t})^{2} + 10\sqrt{t} - 360 = 0

ax^{2} + bx + c = 0, \quad x = \sqrt{t}, \; a = 3, \; b = 10, \; c = -360

x = \frac{-10 \pm \sqrt{(10)^{2} - 4 \cdot 3 \cdot (-360)}}{2 \cdot 3}

x = \frac{-10 \pm \sqrt{100 + 4320}}{6}

x = \frac{-10 \pm \sqrt{4420}}{6}

x = \frac{-10 \pm 66.48}{6}

x = \frac{56.48}{6} \quad \text{(taking the positive root, since } x = \sqrt{t} \geq 0 \text{)}

\sqrt{t} = 9.41

(\sqrt{t})^{2} = (9.41)^{2}

t \approx 88.5

Fish population Model

Therefore, for the fish population to reach 500 it would take about 88.5 days, or roughly 12 to 13 weeks. Refer to the P = 3t + 10\sqrt{t} + 140 population model curve.

It is critically important to develop an algorithm so that we can generalize the quadratic model. In this example I have used the Python programming language to treat the equation 3t + 10\sqrt{t} + 140 = P as a quadratic in \sqrt{t} for the purpose of fish population prediction.
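To generalize the hand calculation, the substitution x = \sqrt{t} can be wrapped in a small function. What follows is only a minimal Python sketch, not the original script: the function names fish_population and days_to_reach are hypothetical, and the coefficients 3, 10 and 140 are simply taken from the model above. The result should land close to the hand-derived value of roughly 88.5 days; small differences come from rounding \sqrt{t} to 9.41.

```python
import math


def fish_population(t: float) -> float:
    """Population model from the text: P = 3t + 10*sqrt(t) + 140 (t in days)."""
    return 3 * t + 10 * math.sqrt(t) + 140


def days_to_reach(target: float, a: float = 3.0, b: float = 10.0, c: float = 140.0) -> float:
    """Solve a*t + b*sqrt(t) + c = target for t, assuming a > 0 and b >= 0.

    Substituting x = sqrt(t) gives the quadratic a*x^2 + b*x + (c - target) = 0.
    Only the non-negative root is meaningful because x = sqrt(t) cannot be negative.
    """
    shifted_c = c - target
    discriminant = b * b - 4 * a * shifted_c
    if discriminant < 0:
        raise ValueError("target is never reached: no real solution")
    x = (-b + math.sqrt(discriminant)) / (2 * a)  # the larger of the two roots
    if x < 0:
        raise ValueError("target lies below the starting population")
    return x * x  # undo the substitution: t = x^2


if __name__ == "__main__":
    t = days_to_reach(500)
    print(f"days needed: {t:.1f}")
    print(f"model check: P = {fish_population(t):.1f}")
```

Calling days_to_reach with a different target, or with other coefficients, reuses the same quadratic formula without repeating the algebra by hand.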

The once critically endangered Florida panther, a subspecies of mountain lion, has recovered from population decline thanks to dedicated science-bound conservation measures…

The other example I would like to draw attention to is sample size, and the determination of sample size based on simple (or stratified) random sampling. Animal population estimation is a function of two critical parameters: 1. occupancy and 2. detectability. The probability of animal occupancy is one of the statistical factors that needs to be taken into account before carrying out an animal population survey. In other words, a statistically valid survey design is of paramount importance. Generally speaking, at any given time, our chance of detecting an animal depends on whether our sampling units are a true representation of the population. For example, if I want to find out the Florida panther (a subspecies of mountain lion) population in a given area of 100 sq km, my primary objective is to design survey units that are a proportional and true representation of the whole area. It simply means that if I conduct an animal detection survey over roughly 2 sq km, which I can cover in a day on foot, then I need to ensure that each 2-sq-km unit I choose has an equal probability of selection among my fifty 2-sq-km panther survey units (50 times 2 equals our total of 100 sq km). Surely 100 sq km is a relatively big area for me to survey on my own, but I still need to conduct the survey, so I could survey 40 sq km out of my total 100 sq km potential survey area to estimate the Florida panther population. Since my survey units are 2 sq km each, 40 sq km translates to 20 blocks, which I need to select randomly out of the total of 50 blocks (100 sq km). Here I have used the R programming language to write a function that allows me to randomly select 20 blocks out of 50, or any number of blocks, depending on how many blocks you wish to include in your survey as a random sample. I have provided below the block matrix of 50 blocks in which twenty 2-sq-km blocks are randomly selected. The selected blocks are highlighted in bold for clarity.
Also note that this is not an algebraic matrix of the kind you may use to solve problems in linear algebra. It is just a block sample that some folks may simply present as a grid of blocks as opposed to a block matrix.
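The random selection of blocks can also be sketched in a few lines. The example below is written in Python rather than R, purely as an illustration: select_blocks and format_grid are hypothetical helper names, the seed value is arbitrary, and the 5-by-10 layout of 50 numbered blocks is an assumption that follows the fifty 2-sq-km units described above. Selected blocks are shown in square brackets instead of bold.

```python
import random


def select_blocks(total_blocks: int, sample_size: int, seed=None) -> list:
    """Pick `sample_size` distinct block numbers from 1..total_blocks.

    random.sample draws without replacement, so no block is chosen twice
    and every block has the same chance of ending up in the sample.
    """
    if not 0 <= sample_size <= total_blocks:
        raise ValueError("sample_size must be between 0 and total_blocks")
    rng = random.Random(seed)  # an optional seed makes the draw repeatable
    return sorted(rng.sample(range(1, total_blocks + 1), sample_size))


def format_grid(total_blocks: int, chosen: list, columns: int = 10) -> str:
    """Lay the blocks out row by row, wrapping selected blocks in [brackets]."""
    picked = set(chosen)
    cells = [
        f"[{n:>2}]" if n in picked else f" {n:>2} "
        for n in range(1, total_blocks + 1)
    ]
    rows = [" ".join(cells[i:i + columns]) for i in range(0, total_blocks, columns)]
    return "\n".join(rows)


if __name__ == "__main__":
    blocks = select_blocks(total_blocks=50, sample_size=20, seed=2024)
    print(blocks)
    print(format_grid(50, blocks))
```

Passing a different seed, or none at all, gives a different but equally valid random sample.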

Sampling blocks under conceptually unified statistically valid random sampling procedure undertaken through R programming language

These 20 blocks are a true representation of my survey area. If the survey is carried out in these blocks, then even if I detect only a few panthers, the sample will still be a true representation of the population, which allows me to estimate the panther population of the entire 100 sq km ecological unit once detection probability is taken into account. As an example, if I manage to detect only 3 panthers in my 40 sq km of surveyed blocks and my detection probability is 0.1, the count corresponds to an estimated 3 / 0.1 = 30 panthers in the surveyed blocks, of which about 27 went undetected, in that particular Everglades mangrove habitat.

This is just a short article providing a very brief introduction to ecological study, focusing on animal population survey design and estimation techniques. The article distills hard-core mathematical rigor and modeling techniques into a succinct, easy-to-understand ecological piece without compromising statistical rigor. The primary objective of this short essay is to publicize these rather mathematically challenging models in a simple, coherent format so that people from non-scientific backgrounds who are nonetheless avid conservationists can digest the rudiments of population ecology and its conservation implications.

This draft was prepared in \LaTeX, which builds on TeX, the brainchild of Donald Knuth, with mathematical extensions developed by the American Mathematical Society (AMS) and documented by George Grätzer of the University of Manitoba Department of Mathematics. I have also used both the Python and R programming languages to develop the quadratic population model and to design the random sampling matrix. No commercial software under capitalistic market share was used in the preparation of this draft. Debian GNU/Linux, a UNIX variant, was used throughout as the core platform to run all software packages.

It’s all in the Geese!

Mohammed Ashraf

You can download the portable document format of this essay by clicking PDF here


Wildlife ecologists are often interested in finding out which parameters influence species distribution and population size. These parameters can range from intrinsic ecological factors (for example density-dependent population regulation) to extrinsic anthropogenic disturbances (for example human-caused greenhouse gas emissions). Within this broad spectrum, wildlife ecologists often need to find the underlying trend or mechanism that influences the population parameters of the species of concern. Many recently graduated wildlife biologists find that the statistical tools they need to carry out ecological data analyses are out of reach, because various economic and social factors hinder their access to cutting-edge scientific tools and resources. This problem is more intense in developing nations, where technical and academic support is often few and far between due to weak economic and social structures. In both developed and developing nations, students are trained to carry out the necessary statistical tests under conceptually unified mathematical rigor, within the broad spectrum of ecology in general and mathematics in particular. They are trained to use a handful of statistical and mathematical software packages that they are introduced to during their undergraduate education. These usually include Minitab, SPSS and JMP for statistical analyses, and MATLAB and Maple for mathematical programming. These are commercially lucrative, easy-to-use, high-priced software packages based on a graphical user interface (GUI) that students, once they graduate, struggle to get hold of due to various economic and social factors beyond their control.
One of the main factors is that this software is expensive and also closed-source, meaning that one cannot access the source code to redevelop or reproduce the software, or to work with it in complete freedom. Hence students, once they finish their undergraduate study in wildlife science or ecology, often find it hard to keep up their academic and research work. This in turn affects the overall supply of a healthy pool of scientific scholars in the biodiversity conservation arena. On the other hand, human existence is deeply rooted in sustaining and conserving the remaining biodiversity of our planet. Simply put, if concerted and active (proactive as opposed to retrospective) measures are not taken to help conserve (if not preserve) the remaining ecological diversity within the next 25 to 50 years at most, our planet will face a doomsday scenario that will eliminate the human species in the blink of an eye. Considering that our planet is very old (about 4.5 billion years), the extinction of the human species, although apparent on an evolutionary time scale, will happen within the next 100 years or so if we fail to curb species extinction in the face of capitalistic free-market resource consumption and exploitation across the hemispheres.

Geese are the heartbeat of wetland ecosystems

Although many students from conservation biology and related disciplines possess the necessary scientific skills, it is unfortunate that they lack the technical tools to practice and apply their newly developed analytical skills in ecology. This is due to capitalistic, profit-driven market enterprise that only allows the wealthy section of society to access its products, in this case the scientific software discussed here. However, things have changed: a lot of folks have started boycotting these commercial enterprises and writing their own source code, developing open-source scientific software to conduct their ecological research. Therefore, in this article, I introduce the R programming language, which is open-source and totally free (free as in free beer, as the joke goes) scientific software that I find significantly more powerful and streamlined than the commercial software monopolized and controlled by capitalistic, profit-driven market tycoons. R can be accessed and installed pretty quickly, and if you are using a UNIX-variant operating system such as Linux or BSD, base R is usually available through the system's package manager. Here I will not discuss the steps required to access and install R, as this information is readily available on the net (use your favorite search engine to find out how to download and install R). Instead, I will dive into the implementation and use of the R programming language to address ecological datasets that are critically important for conserving ecological units from genetics to population, from community to landscape level, and from ecosystem to the entire planet (commonly known as the biosphere or ecosphere).

Mommy goose with her gosling

I have always been fascinated by ducks and geese, and I am often interested in knowing more about their populations and what influences them. Hence I am going to provide you with a simple and clean example of how R (the power of R and the things it can do for you are simply astounding) can help you develop an ecological model based on simple data that you can collect right after finishing your undergraduate study in wildlife ecology or any related discipline (ecology, conservation biology, landscape ecology and so on). Remember, lack of technical resources due to capitalism is not your fault, so do not let capitalists stop you from doing good things for yourself and, more importantly, for our planet, which needs more conservation scientists than MBAs. Recently, I went to the mangrove ecosystem of a particular tropical estuarine landscape, where I was interested in finding out how the goose population (greylag goose in particular) is influenced by the presence of crow and eagle nests in its vicinity. As we may know, although the goose is a relatively big bird, it has its fair share of enemies, and eagles and crows are often the predatory birds that either kill goslings or severely disturb the roosting and grazing habits of the goose population. This problem is more pronounced in tropical mangroves, which geese often visit by following their long-haul winter migratory routes from temperate and tundra ecosystems as far away as the Himalayas and Siberia. Here I am interested in seeing whether there is any possible relationship between the number of geese and the number of eagle or crow nests. I am also interested in counting both male and female geese and in how the male and female population sizes are influenced by crow nests near their roosting points.

I collected a sample of 49 observations of goose population size over two weeks of ecological survey in the mangrove ecosystem. My sampling sites were randomly selected, and no site was repeated during data collection. The total survey area was 100 sq km of estuarine mangrove. I first carried out the necessary feasibility study to find out how much area I could cover to count geese in one fell swoop. Based on my energy and logistic resources, I worked out that if I can cover 2 sq km a day, I can generate fifty 2-sq-km blocks (50 times 2 equals 100 sq km) for my sampling survey. The block design is critically important. It is a prerequisite for a random sampling design, in which I must ensure that each 2-sq-km block has an equal chance of being selected, so that I do not end up choosing any block based on favoritism (as in an ad hoc study, which has no ecological or scientific bearing). Hence each of my fifty 2-sq-km blocks has a probability of 1/50, or 0.02 (2 percent), of being selected on any single draw, and the selected blocks will form a valid representation of the entire area of 100 sq km. It is worth pointing out that random sampling is very robust, so it does not really matter how many sample blocks you choose for your survey (although one rule of thumb is no less than 10 percent of the total number of blocks). What does matter is whether you have selected your blocks randomly. Therefore, even if I choose to carry out my greylag goose survey on 10 sq km (10 percent of my total 100 sq km sampling area), which works out to five 2-sq-km blocks out of the fifty blocks that make up my potential survey area of 100 sq km, I can still come up with an ecologically valid dataset on goose population and eagle nests to infer or generalize how goose numbers are influenced by eagle nests (you could class this as my working ecological question at this point).

Sampling blocks are numbered

Let's do some work in R to start with. I need to choose five random blocks of 2 sq km each from the total of 50 blocks that make up my 100 sq km survey area. Please note that I created a cell block (see the matrix diagram) in which each block is assigned a serial number starting at 1 and ending at 50.

I will now ask R to randomly select 5 blocks out of 50 and present me with a set of five random numbers, which will be my sampling blocks. I can write a simple line of code that R will use to pick five random blocks out of the 50 in my five-by-ten (5 rows and 10 columns) matrix. The R code is provided below.

sample(x = 1:50, size = 5, replace = FALSE)

{40 7 5 9 2}

Random blocks generated by R are bold highlighted

I wrote the simple code above and asked R to generate five random numbers out of 50, so I can write my sampling blocks in set-builder notation as S = \{ x \in \mathbb{Z} \mid 1 \leq x \leq 50, \; x \in \{40, 7, 5, 9, 2\} \} (pronounced as the set of all integers x between 1 and 50 such that x is one of 40, 7, 5, 9, 2). Hence this set of five numbers (40, 7, 5, 9, 2) is my ecological survey unit, comprising a total of 10 sq km out of the 100 sq km potential survey area. I have generated the matrix again, but this time I have highlighted the random blocks in which I will investigate the greylag goose population size and how it relates to eagle and crow nests in or around their roosting, grazing and resting sites.

Before we go further, just a quick note on my simple R code. As you can see, I asked R to randomly select numbers between 1 and 50 by passing that range as the argument x (x = 1, 2, \ldots, 50). I then gave R my sample size, which is 5, meaning that R will randomly select five blocks out of the 50 in my sample matrix. Finally, I asked R not to replace the selected blocks by writing replace = FALSE. When sampling with replacement (replace = TRUE), R picks a number between 1 and 50 at random and then puts it back into the pool, so the same number can be drawn again. We do not want to select the same block twice, so we sample without replacement. This is in fact the default behavior of sample(), but writing replace = FALSE makes the intention explicit.

Waterfowl are the heartbeat of wetland ecosystems

Now that I have my blocks randomly selected, I can begin my survey work (the fun part). I visited the blocks every morning and every evening for two weeks and collected data on greylag goose population size. I also recorded the distribution of greylag geese by gender (male and female). I then carried out line transect surveys in the same blocks every morning and evening to count eagle and crow nests in or around the grazing, roosting and resting sites of the goose population. My line transects were roughly half a km long, although some were a km long due to the high density of crow nests in the vicinity of the greylag goose population. My dataset is presented below. Can we make anything out of this data? Can we answer a few statistically valid ecological questions from this dataset? Possibly not, because a dataset is often useless on its own unless we make it meaningful. How are we going to gain a high-level understanding from this freshly collected ecological data on the greylag goose population in relation to crow and eagle nests? The answer lies in a solid command of statistics and in harnessing the power of statistical tools and modeling, which we will do through R programming. Hence the remaining part of the essay will focus on R coding to gain a high-level understanding of our dataset.

When we are presented with a dataset of two numerical variables, as is the case with my data, we are often interested in finding out whether these variables are related to each other. Here I am interested in finding out whether there is any relationship between the number of geese and the number of eagle nests. I am also interested in whether there is any relationship between male and female goose distribution and eagle/crow nests. Furthermore, I am ecologically motivated to develop a model that will provide a statistically valid summary, which we can use to generalize and make predictions about goose population and eagle nests. Does any of this make sense so far, and if so, how do we go about it? It’s simple: we let R answer all these interesting questions. All we have to do is ask R by writing code (a language) that R can understand. It’s as simple as that.

Geese by the wetland

As I was saying before, when we have two numerical variables (fashionably known as bivariate data), the first thing we want to do is create a scatter plot to see at a glance what our data look like graphically. This is our first step towards gaining a high-level understanding of ecological data. I am going to write fairly simple code to ask R to generate a scatter plot for me. But before we do anything, let me give you some brief background on how exactly R plots graphs. Firstly, R is a highly powerful and sophisticated mathematical programming language that hosts over 5000 packages. These packages are developed by scientists from various backgrounds, ranging from mathematicians to wildlife ecologists and from academic scholars to computer programmers. In R, packages are like different restaurants. For example, you can choose to go to a Pakistani restaurant to enjoy Pakistani cuisine or a Bangladeshi restaurant to enjoy Bangladeshi cuisine. You can download and install a package that will generate highly sophisticated, data-rich and powerful graphs for your analytical modeling. You can also install a package that will do algebraic calculations or solve advanced problems in calculus, a package that will do cladistic and principal component analysis, or a package that will carry out geographic (GIS) analysis for you. So it is like going to different restaurants, and there are over 5000 different restaurants (packages) in R town. Making any sense so far? Now, I also mentioned going to a restaurant and ordering your favorite dish from the menu. Surely there are many dishes to choose from.
In R, we call them functions: each package comes with lots of functions that we use to write our programs, or to instruct R to carry out a set of specific numerical and statistical tasks. Hence a package is like a restaurant and its functions are like the dishes on the menu. If you choose to go to a Pakistani restaurant, you are not expecting to order from a Vietnamese menu, right? Likewise, if you are working with the ggplot2 package, you are not expecting to carry out matrix algebra or principal component analysis (PCA), right? What this means is that sets of R functions are grouped together into specific packages. Although you may come across functions (dishes) that overlap between one package (restaurant) and another, packages generally host a set of functions for specific mathematical and statistical tasks. I have already named one of the packages that we are going to use to analyze our geese data. This package is called ggplot2, and it hosts a set of functions that we will use to derive a high-level understanding of our data through insightful graphs. Now that you have some basic background on how R packages work alongside the functions that come with them, we can start the analytical part of our ecological study. It’s really a fun part when you learn to harness the power of R coding to gain a high-level understanding of your hard-earned ecological field data.

As mentioned earlier, the first thing I would like to do is generate a scatter plot to see how my two variables are laid out. I am interested in seeing how the number of geese is influenced by eagle/crow nests. A scatter plot is really a point graph in which we have our eagle/crow nests on the x-axis and goose numbers on the y-axis. ggplot will do all the work of drawing a point at the corresponding x and y position for each observation. First I load the ggplot2 package and then develop a framework to which I will simply add the necessary layers to enrich the plot as we go along. It’s pretty simple. It’s like baking a cake. You make a plain cake and then add toppings, from strawberry cream to different flavors of vanilla or chocolate, maybe even ice cream too…so the options are unlimited. It’s the same with ggplot. We first ask R to develop the framework and then simply add layers to enrich our graph and gain high-level statistical insight.


library(ggplot2)  # load the plotting package first

geese_plot <- ggplot(geese, aes(eagleCrowNest, numberOfGeese))

geese_plot + geom_point()

Basic Scatter Plot showing distribution of geese population under influence of predatory birds presence.

Now, this is our very basic scatter plot. At a quick glance, we can see that almost all our goose counts fall where the number of crow nests ranges from 0 to 2.5. In other words, simply by glancing at our scatter plot, we have already gained valuable information about how our goose numbers are influenced by crow/eagle nests. We can also see that there is considerable variation in goose numbers, ranging from 0 to 20, within the eagle/crow nesting range of 0 to 2.5. But have you spotted one or two things yet? Have you noticed that our scatter plot does not reveal any information about gender? Remember, I collected counts of both male and female geese. So the question I am now curious about is how the male and female goose populations are distributed within the crow/eagle nesting range of 0 to 2.5. All I am going to do now is add a simple piece of code to my original framework to instruct R to show me the gender-wise population distribution. The code is as follows:

geese_plot <- ggplot(geese, aes(eagleCrowNest, numberOfGeese, color = Gender))

geese_plot + geom_point()

Scatter plot refined to reveal further ecological information

I just typed color equal to Gender and ggplot did the rest. It pulled my data variables together and matched them to generate points, then color coded the points by gender so that I can see how goose numbers differ between males and females. How nice and powerful is that? Now my data is making more sense. Not only do I now know that goose numbers do well when eagle nests are fewer than 2.5, but I can also pinpoint that female numbers are more sensitive to crow nests than male numbers. More particularly, most of our female geese do well when there is no crow nest at all. Take a good look at the first column of our scatter plot, where you will find more pink dots (female geese) lined up vertically where the crow nest count on our x-axis is 0. Interestingly, we found only one male goose where the crow nest count is 0; the rest are all females. In our second column we see considerable variation in goose numbers, ranging from as low as 0 to as high as 20. We do not really know why there is such high variation in goose numbers, but we do know that there are more males than females within this variation. Now, have you spotted something else? Have you counted my total observations? I collected a total of 49 samples of goose population from my two-week field survey. But if you count the points, the numbers do not match up. Can you see why not? It’s because some points overlap with other points, meaning they have the same number of nests and the same number of geese. Therefore, we need to ask R to disentangle the overlapping data points so that all of them are revealed in the graph. This will hopefully give us a clearer perspective on how the population is actually influenced by eagle nests. Because our data points overlap, all I will do is write position equal to jitter.
What jitter does is disentangle any observation that overlaps with another. The procedure that R follows is pretty simple: it adds a small random amount to the plotted position of each observation, so that points sharing the same coordinates are nudged slightly apart. The code and the output are provided below:

geese_plot <- ggplot(geese, aes(eagleCrowNest, numberOfGeese, color = Gender))

geese_plot + geom_point(position = "jitter")

Scatter plot with small random offsets (jitter) applied to disentangle overlapping data points
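The idea behind jittering can be illustrated outside ggplot with a few lines of Python. This is only a sketch of the general technique, under the assumption that each point is shifted by a small uniform random offset; it is not ggplot2's internal code, the function name jitter and the offset widths of 0.4 are hypothetical choices, and the observations are made up for illustration.

```python
import random


def jitter(points, width=0.4, height=0.4, seed=None):
    """Return a copy of `points` with a small random offset added to each one.

    Each (x, y) pair is shifted by up to +/- width/2 horizontally and
    +/- height/2 vertically, so identical pairs stop landing on the same spot.
    """
    rng = random.Random(seed)  # an optional seed makes the offsets repeatable
    return [
        (x + rng.uniform(-width / 2, width / 2),
         y + rng.uniform(-height / 2, height / 2))
        for x, y in points
    ]


if __name__ == "__main__":
    # Hypothetical observations: three surveys recorded the same (nests, geese) pair.
    observed = [(1, 5), (1, 5), (1, 5), (2, 0)]
    print(len(set(observed)), "distinct positions before jitter")
    print(len(set(jitter(observed, seed=1))), "distinct positions after jitter")
```

The offsets are kept small relative to the spacing of the data so that each point still sits visibly close to its true value.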

Now this shows all our data points. If you go ahead and count the points, they should match up with my 49 observations. This also reveals that almost half of our observations were overlapping, which is why we did not see them in the previous graph. This non-overlapping jitter plot now reveals the full picture of our goose population distribution. We can almost confidently say that the female goose population is very sensitive to even a small increase in eagle or crow nests. As you can see from the graph, there is a distinct separation in female population size by eagle/crow nest numbers. A lot of females are almost absent (see the base of the x-axis) even when the crow nest count is less than 4. I still think the data are clumped together. Although the plot has revealed all our data points, at a quick glance we see that some points are still touching one another, and that is due to the size of the point (the circle). What I would like to do now is change the size of the circle to give us a slightly improved version of our jitter plot. The code and the output are provided below:

geese_plot <- ggplot (geese, aes(eagleCrowNest, numberOfGeese, color = Gender))

geese_plot + geom_point(size = 1.5, position = "jitter")

Scatter plot refined by changing the point size to better reflect the distribution pattern of the geese population sample

Now this looks a lot better, so this will be our standard scatter plot for further data exploration and analysis. By now you have probably started to realize the power of R coding, and more precisely the flexibility, freedom and options you gain by writing your own code to explore, analyze and manipulate data with mathematical and statistical rigor. ggplot is extremely flexible and powerful, and if you are planning on becoming a full-blown scientist or academic, regardless of which discipline your study and research focus on, you would be a million times better off harnessing the power of the R programming language as opposed to the commercial, profit-driven products that you probably used in statistics courses at your undergraduate or graduate school.

Before we go ahead with further data analysis based on the scatter plot that ggplot has enabled us to create, did you notice something that we could change at this point? If we look at the labels on the x and y axes of the graph, we could improve them by adding a layer. As mentioned earlier, once you have developed the skeleton of the plot using the ggplot command, all you have to do afterwards is keep adding layers to improve the plot. So let's improve the labels of our plot by simply adding a layer called labs. The command and the improved output of the plot are provided below:

geese_plot <- ggplot (geese, aes(eagleCrowNest, numberOfGeese, color = Gender))

geese_plot + geom_point(size = 1.5, position = "jitter") + labs(x = "Number of Eagle/Crow Nests", y = "Number of Geese")

Scatter plot refined with clearer axis labels

Now this improved version obviously gives a clearer understanding of what our x axis and y axis represent in terms of our bivariate data variables. You may have noticed that I keep typing the backbone code which developed our skeleton:

geese_plot <- ggplot (geese, aes(eagleCrowNest, numberOfGeese, color = Gender))

However, when you are actually writing R code, you only have to do it once. After that, you just work on adding layers, just the way I added labels as one of the layers to our original skeleton, which R saved as an R object named geese_plot.
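As a minimal sketch of this build-once, add-layers workflow, the snippet below stores each stage under its own name. The data frame toy and the object names base_plot, with_points and with_labels are made up for illustration. Only the column names and the layer calls come from the code above.

```r
library(ggplot2)

# Hypothetical toy data: same column names as the geese data, invented values
toy <- data.frame(
  eagleCrowNest = c(0, 0, 1, 2, 3, 5),
  numberOfGeese = c(7, 8, 3, 10, 0, 25),
  Gender        = c("Female", "Female", "Male", "Male", "Female", "Male")
)

# The skeleton is written once: data and aesthetic mappings, no layers yet
base_plot <- ggplot(toy, aes(eagleCrowNest, numberOfGeese, color = Gender))

# Each later step reuses the saved object and adds to it
with_points <- base_plot + geom_point(size = 1.5, position = "jitter")
with_labels <- with_points +
  labs(x = "Number of Eagle/Crow Nests", y = "Number of Geese")

# Printing an object draws it, e.g. print(with_labels)
```

Nothing is drawn until an object is printed, so intermediate stages can be kept around and compared side by side.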

Now that we have covered quite a bit, from collecting and pre-processing data to organizing it and generating a scatter plot with ggplot, we will step back a bit and focus on the statistical method underpinning our data variables. It is worth emphasizing that when we work on bivariate datasets, as in our geese dataset, we are often interested in three aspects of our data variables: 1. A scatter plot, to get a first-hand glance at how our data are behaving and to form a first impression of our ecological variables. 2. Whether our data are linearly related, that is, whether our scatter plot looks like it can be fitted with a straight line. The statistical technique for this is known as regression, so our next job with the scatter plot is to conduct a regression analysis. 3. What proportion of the variation in our data can be explained by that fitted line, also known as the regression line.

At first glance, it is pretty evident that our dataset does not form a straight line, as most of our data points are clustered between 0 and 4 on our x axis. Nevertheless, a regression line is simply a straight line that helps us predict values within the range of our original data, so it is pretty helpful for making predictions. For example, if I had to predict geese numbers from the number of crow nests along the x axis, I would want to know what percentage of the variation in our data can be explained or predicted by the regression line. Does all this make sense? I am not going to go into the details of the statistical mechanism here, as I intend to give regression analysis a separate treatment in my other articles.
But for now, we will simply ask R to fit a regression line to our scatter plot. Again, the procedure is pretty simple: we add another layer. In ggplot, the layer that draws a fitted line is called smooth (geom_smooth), and setting method to "lm" (linear model) asks for a straight regression line. The rationale behind the name is that it smooths over the scatter in our data by finding the best-fitting line through all the data points in our scatter diagram. Of course, R does not pull this out of thin air. The mathematical procedure R uses is rooted in statistical rigor: R finds the best-fitting line based on the least squares criterion, a statistical and algebraic procedure that chooses the line with the smallest sum of squared vertical distances between the points and the line. For now, you do not need to focus on how this line is derived mathematically, as this article is more about appreciating R programming and its implications for ecological study. Hence, I am going to add another layer, smooth, and the code and the output are as follows:

geese_plot <- ggplot (geese, aes(eagleCrowNest, numberOfGeese, color = Gender))

geese_plot + geom_point(size = 1.5, position = "jitter") + labs(x = "Number of Eagle/Crow Nests", y = "Number of Geese") + geom_smooth(method = "lm")

Scatter plot with linear models of both male and female geese distribution pattern

Here we have our linear model, and as you can see R has found the best-fitting lines for both our male and female geese. As suspected, most of our data points do not lie close to either line, which intuitively answers my question: only a small proportion of the variation in our data can be explained or predicted by our best-fitting linear model, or regression line. Nevertheless, it still provides a lot of solid insight. For example, as I was telling you before, our female geese are really sensitive to eagle or crow nests, and a quick glance at our regression lines (the red line for the female data points) confirms that. As you may know from elementary coordinate geometry, the slope of any straight line is defined as the ratio of rise to run, where rise is the difference between two points along the y axis and run is the difference between the same two points along the x axis. If you look at our female regression line (the red line), you see that it is a lot steeper than the blue line representing our male geese. In other words, even though our regression line does not provide a robust linear model for making ecological predictions, its steepness tells us that female numbers change faster with each additional nest, which in turn means our female geese are extremely sensitive to eagle nests in or around their vicinity. Of course this is expected, as females exhibit brooding behavior and a strong motherly instinct to protect their eggs and subsequent goslings. The ecological and conservation management implication is therefore to ensure crow nests are removed if our conservation management goal is to help safeguard the migratory female geese population in, for example, a specific estuarine mangrove ecosystem or freshwater wetland.
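The steepness comparison can also be read off as numbers rather than by eye. The sketch below fits one least squares line per gender with base R's lm(), which is the same kind of fit that geom_smooth(method = "lm") draws for each color group. The data frame toy is invented so that the true slopes are known in advance (females lose 2 geese per extra nest, males 0.5). It is not my survey data, and the object names are hypothetical.

```r
# Hypothetical toy data built so that the true slopes are known:
# females lose 2 geese per extra nest, males lose 0.5
nests <- 0:5
toy <- data.frame(
  eagleCrowNest = rep(nests, times = 2),
  numberOfGeese = c(10 - 2 * nests, 12 - 0.5 * nests),
  Gender        = rep(c("Female", "Male"), each = length(nests))
)

# One least squares line per gender
fit_female <- lm(numberOfGeese ~ eagleCrowNest,
                 data = subset(toy, Gender == "Female"))
fit_male   <- lm(numberOfGeese ~ eagleCrowNest,
                 data = subset(toy, Gender == "Male"))

# The slope is the coefficient on the x variable (rise over run)
slope_female <- unname(coef(fit_female)["eagleCrowNest"])
slope_male   <- unname(coef(fit_male)["eagleCrowNest"])

# The steeper line is the one with the larger absolute slope
steeper <- if (abs(slope_female) > abs(slope_male)) "Female" else "Male"
```

Comparing absolute slopes matters here because both lines fall as nests increase, so the steeper line has the more negative slope.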

Now, let's ask R to improve our regression lines further. As you may have noticed, our regression lines also have shaded areas. What are these shaded areas? They are 95% confidence intervals. A 95% confidence interval here is a statistical measure of how precisely the regression line itself has been estimated: the band shows where the fitted line plausibly lies, not where individual data points must fall. Not surprisingly, as mentioned earlier, our regression models are pretty weak (only a small proportion of our data points are close to the lines), and as you can see, most of our data points are actually outside the shaded area. However, the male and female confidence intervals overlap. The middle portion (slightly darker) is where the male and female bands overlap, and this has serious conservation and ecological significance. Before we do any further analysis, though, we would like to replace the common gray shading with something easier to read. Hence we will ask R to assign separate colors to our female and male confidence intervals. This will let us appreciate the overlapping part better and help us gain a deeper understanding of the overlapped confidence intervals when making predictions.
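To make the meaning of the shaded band concrete, the sketch below computes two different 95% intervals from the same fitted line using base R's predict(). The confidence interval is for the fitted mean (the kind of band geom_smooth shades by default), while the prediction interval is for a single new observation and is always wider. The data frame toy holds invented values, not my survey data, and all object names are hypothetical.

```r
# Hypothetical toy data: invented nest counts and geese counts
toy <- data.frame(
  eagleCrowNest = c(0, 0, 1, 1, 2, 3, 4, 5, 6),
  numberOfGeese = c(9, 7, 8, 5, 6, 3, 4, 1, 2)
)

fit <- lm(numberOfGeese ~ eagleCrowNest, data = toy)

# Nest counts at which to evaluate the line
new_x <- data.frame(eagleCrowNest = 0:6)

# 95% interval for the mean number of geese at each nest count
conf_band <- predict(fit, newdata = new_x,
                     interval = "confidence", level = 0.95)

# 95% interval for one new observation at each nest count
pred_band <- predict(fit, newdata = new_x,
                     interval = "prediction", level = 0.95)

# Both are matrices with columns fit, lwr and upr
conf_width <- conf_band[, "upr"] - conf_band[, "lwr"]
pred_width <- pred_band[, "upr"] - pred_band[, "lwr"]
```

Because the confidence band only describes uncertainty in the line, it is normal for many individual points to fall outside it even when the model is reasonable.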

Did you notice this is the first time I have actually brought up the option of making predictions based on our weak linear models? Can you tell me why? It is because even though we are only dealing with two variables, geese numbers and crow nests, we in fact have two groups or levels in our geese data: male and female geese. Hence we have this overlapped confidence interval, with a decent proportion of male and female data points within our data range. As you can see, even a weak regression model can give us valuable insights into our data, provided we have a grouped (male and female) scatter plot. Before we gain ecological insight from our grouped, overlapping linear models, let's just write a simple piece of code that will eliminate the gray color and separate the confidence intervals of our male and female geese population samples. The code and the output are provided below:

geese_plot <- ggplot (geese, aes(eagleCrowNest, numberOfGeese, color = Gender))

geese_plot + geom_point(size = 1.5, position = "jitter") + labs(x = "Number of Eagle/Crow Nests", y = "Number of Geese") + geom_smooth(method = "lm", aes(fill = Gender))

Scatter plot with distinctive color codes for male and female geese distribution within 95% confidence intervals.

Now this is a lot better. R has got rid of the gray shading and added distinct colors for both the female (pink) and male (light blue) 95% confidence intervals. The middle portion, which is the common overlapped area, is also a lot clearer, and it reveals that a significant number of data points sit in the overlap. However, to ensure no data points are hiding under the overlapped colors, we can do better by asking R to lighten the colors so that any hidden data points are revealed. The code is simple. Under geom_smooth, which is our regression line, we simply add alpha with a numeric value to lighten the shaded area in our plot. Alpha takes a value between 0 (fully transparent) and 1 (fully opaque), and the value you choose determines how light or dark your shaded area will be. I usually stick to values between 0.1 and 0.3 to lighten the shaded area and reveal any data points that may previously have been hidden behind dark shading. The code and the output are provided below:

geese_plot <- ggplot (geese, aes(eagleCrowNest, numberOfGeese, color = Gender))

geese_plot + geom_point(size = 1.5, position = "jitter") + labs(x = "Number of Eagle/Crow Nests", y = "Number of Geese") + geom_smooth(method = "lm", aes(fill = Gender), alpha = 0.2)

Final refined version of distribution model of our geese population under predatory influence in mangrove delta

Now, surely, this has greatly improved our regression graph, and we can clearly see that a good deal of our data points fall within the overlapped 95% confidence intervals. At first glance, the female band spans roughly 6 to 9 geese where crow or eagle nests are totally absent. In other words, we are 95% confident that, in an estuarine mangrove ecosystem like this one, the average number of female geese lies between 6 and 9 when the nests of predatory birds such as eagles or crows are completely absent. As for the male geese population, generally speaking, male geese exhibit less sensitivity towards crow nests. Our male data points show considerable variation, ranging from 1 to 25, and most importantly, within this range almost all male geese tolerate a predatory bird (crow/eagle) presence ranging from 0 to 6 nests. Our data also reveal two extreme observations among the males. On the y axis we have an extreme observation of 25 male geese at 5 predatory nests. Although unusual, it suggests intuitively that large numbers of males in a flock are brave enough to tolerate a predatory presence ranging from 0 to almost 7 nests. On the other hand, although a good number of our female observations show no geese when crow nests vary from 1 to 12, on the x axis we see just one extreme male observation, the only one in which males are absent across the range of 0 to 12 predatory nests. In other words, all our male observations recorded between 1 and 25 geese within the predatory range of 1 to 12 nests, except one, which is our outlier.

As you may realize, the ecological study of any species is rooted in a statistically valid sampling design, followed by data collection according to that design, leading to data analysis that harnesses the sophisticated and powerful statistical packages at our disposal. In this essay, I demonstrated the power of the R programming language through a basic ecological study of how a greylag geese population is influenced by predatory bird populations in an estuarine mangrove ecosystem. The study demonstrates the power of R programming by harnessing statistical tools such as the regression model, and shows its implications for ecological and conservation management.

Finally, in this article I did not attempt to cover the statistical procedures for developing the regression line, nor did I attempt to explain the statistical mechanisms that underpin this study. More precisely, this study is rooted in developing the regression equation $\hat{y} = b_{0} + b_{1}x$, followed by estimating the coefficient of determination, which reflects what proportion of the variation in our data is explained by the regression line, and finally calculating the correlation coefficient, also known as the Pearson coefficient (named in honor of Karl Pearson, who originally developed the method). These three statistical procedures underlie the study of my geese population and provide the conceptual framework of the essay. In my next essay, I intend to present these statistical methods and the full treatment of their analytic procedures using the same geese population dataset. This essay is primarily intended to serve two purposes: 1. To show the power of the R programming language. 2. To understand and appreciate ecological study and its close association with statistics and with R as a powerful and sophisticated mathematical package for answering simple but interesting ecological questions about animal population sampling and estimation methods.

This essay is prepared in \LaTeX, built on TeX (the brainchild of Donald Knuth), developed by the American Mathematical Society (AMS) and created by George Grätzer of the University of Manitoba Department of Mathematics. I have also used both Python and the R programming language to develop the population model and to design the random sampling matrix. No commercial software was used in the preparation of this draft. The UNIX variant Debian GNU/Linux was used throughout as the core to run all software packages.
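As a small preview of those three quantities, the sketch below computes them in base R on invented data (the data frame toy is hypothetical and is not my geese dataset). It uses lm() for the coefficients $b_{0}$ and $b_{1}$, summary() for the coefficient of determination, and cor() for the Pearson correlation. For a straight-line fit with one predictor, the coefficient of determination equals the square of the correlation coefficient.

```r
# Hypothetical toy data: invented nest counts and geese counts
toy <- data.frame(
  eagleCrowNest = c(0, 0, 1, 1, 2, 3, 4, 5, 6),
  numberOfGeese = c(9, 7, 8, 5, 6, 3, 4, 1, 2)
)

# Regression equation: y-hat = b0 + b1 * x
fit <- lm(numberOfGeese ~ eagleCrowNest, data = toy)
b0 <- unname(coef(fit)["(Intercept)"])
b1 <- unname(coef(fit)["eagleCrowNest"])

# Coefficient of determination: share of variation explained by the line
r_squared <- summary(fit)$r.squared

# Pearson correlation coefficient between the two variables
r <- cor(toy$eagleCrowNest, toy$numberOfGeese, method = "pearson")

# The least squares slope can also be written as cov(x, y) / var(x)
b1_by_hand <- cov(toy$eagleCrowNest, toy$numberOfGeese) /
  var(toy$eagleCrowNest)
```

The sign of r matches the sign of the slope, so a falling line such as the ones in our geese plot goes with a negative correlation.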

