Statistics deals with the analysis of data; statistical methods are developed to analyze large volumes of data and their properties. Organizations and governments use statistical methods to calculate a collective property of their employees or citizens, and such properties then influence the decisions they take. For example, a government may want to know how many children below the age of 12 in the country are malnourished; in the same way, an organization may want to know how many of its employees have stress disorders. Depending upon the number of malnourished children, the government may come up with a new policy to deal with the malnutrition problem. Statistics also has many practical uses in research, as you can learn in our course on this topic.
If a country has 1000 citizens, it may keep a full database of all its inhabitants and the statistical calculations may be easy. When the number is in the billions, however, gathering complete data becomes tedious and often impossible. In such cases, a random sample of the population is selected to represent the whole population, and the statistical calculations are carried out on that sample.
In this article we will explore some of the basic concepts of statistical analysis that form the basis of all complex statistics problems. The basic concepts of mean, median, mode, variance and standard deviation are the stepping stones to almost all statistical calculations. So let’s explore them one by one.
Mean or Average
The mean, or average, is the sum of all the elements of a set divided by the number of elements in the set. The mean can be treated as a collective property of the whole set of values: calculating it gives you a fairly good idea of the whole set of data. The formula for the mean is:
Mean = Sum of all the set elements / Number of elements
The importance of mean lies in its ability to summarize the whole dataset with a single value. For example, you may want to compare the average household income of County 1 to County 2. To compare the household incomes between the two counties you cannot compare each and every household income of one county to the other. The best solution would be to find the average household incomes of the two counties and then compare them with each other. By comparing the two means, we may make an assumption as to which county is more prosperous than the other.
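As a minimal sketch of the county comparison, the snippet below applies the mean formula to two small lists. The income figures and the names `county_1` and `county_2` are made up for illustration only; they are not real data.

```python
from statistics import mean

# Hypothetical household incomes (illustrative numbers, not real data)
county_1 = [42000, 51000, 38000, 60000, 49000]
county_2 = [35000, 39000, 41000, 37000, 43000]

# Mean = sum of all the set elements / number of elements
mean_1 = sum(county_1) / len(county_1)

# The standard library performs the same calculation
mean_2 = mean(county_2)

print("County 1 mean income:", mean_1)
print("County 2 mean income:", mean_2)
print("County 1 has the higher mean:", mean_1 > mean_2)
```

With these made-up numbers, County 1 has the higher mean income, which is the single-value comparison described above.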
Median
Simply put, the median is the middle value of a set. If a set consists of an odd number of elements, the middle value is the median of the set; if the set consists of an even number of elements, the median is the average of the two middle values. The median may be used to separate a set of data into two halves.
To find the median of a set, write the elements of the set in increasing order, count them, and pick the middle value (or average the two middle values). The median is especially useful when the dataset contains outliers. An outlier is a value that lies far away from the other values in the set. For example, if a set consists of the values 1, 2, 3, 4, 10000, then the value 10000 is an outlier. Outliers can make the mean misleading. For example, the mean of the above set is 10010/5 = 2002, while the median is 3. Here the median clearly summarizes the set better than the mean. You can learn some more about the various statistics formulas and become well acquainted with the topic.
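The procedure for finding the median can be sketched in a few lines. The helper name `median_by_hand` is our own, chosen for illustration; Python's standard library offers `statistics.median` for the same purpose.

```python
from statistics import mean, median

def median_by_hand(values):
    """Return the middle value of the sorted data.

    For an even number of elements, return the average
    of the two middle values.
    """
    ordered = sorted(values)          # write the elements in increasing order
    n = len(ordered)                  # count the elements
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two

data = [1, 2, 3, 4, 10000]            # 10000 is the outlier from the text

print("mean:", mean(data))            # pulled far upward by the outlier
print("median:", median_by_hand(data))
print("library median:", median(data))
```

For this set, the mean works out to 2002 and the median to 3, which matches the hand calculation in the text.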
Mode
The mode of a dataset is the value that occurs most frequently in it. Like the mean and the median, the mode is used to summarize a set with a single piece of information. For example, the mode of the dataset S = 1,2,3,3,3,3,3,4,4,4,5,5,6,7 is 3, since it occurs the maximum number of times in the set S.
An important property of the mode is that it is equal to the mean and the median in the case of a normal distribution. In skewed or other non-symmetric distributions, the mode may differ from the other two. In a normal distribution the data is symmetrical about a central value, so the normal distribution curve is symmetrical about a vertical axis through that value. Another important property of normal distributions is that half of the values are larger than the mean and half are smaller.
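One simple way to find the mode is to count how often each value occurs and pick the most frequent one. The sketch below does this for the set S from the text using `collections.Counter`; it assumes there is a single most frequent value.

```python
from collections import Counter

S = [1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 7]

# Count the occurrences of every value in the set
counts = Counter(S)

# most_common(1) returns a list with the single (value, count) pair
# that has the highest count
value, frequency = counts.most_common(1)[0]

print("mode:", value)
print("occurrences:", frequency)
```

If several values tie for the highest count, a dataset has more than one mode; `statistics.multimode` (Python 3.8 and later) returns all of them.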
Variance
You may want to measure how far a set of data spreads out from its mean value. For example, a large variance in the household income data of a country may be interpreted as a sign of an economy with high inequality. Many useful interpretations can be made by analyzing the variance in data. The variance is obtained by:
- Finding the difference between the mean value and each value in the set.
- Squaring those differences.
- Adding the squared differences and dividing the sum by the number of values.
Thus, the variance of a dataset is never negative, and it is zero only when all the values are equal. One of the most important uses of variance is in the calculation of standard deviation, which is one of the most important concepts of statistics. Also, the calculation of variance can be lengthy; you may want to take up a course on Vedic Mathematics, which teaches techniques for doing such calculations faster.
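The steps for computing the variance can be sketched as follows. This sketch assumes the population variance, where the sum of the squared differences is divided by the number of values; the function name `population_variance` and the sample data are our own, for illustration only.

```python
from statistics import pvariance

def population_variance(values):
    """Average of the squared differences from the mean."""
    m = sum(values) / len(values)                     # mean of the set
    squared_diffs = [(x - m) ** 2 for x in values]    # difference, then square
    return sum(squared_diffs) / len(values)           # add up and divide by count

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values with mean 5

print("variance by hand:", population_variance(data))
print("library variance:", pvariance(data))
```

When the data is only a sample of a larger population, it is common to divide by one less than the number of values instead; Python's `statistics.variance` uses that convention.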
Standard Deviation
The standard deviation is calculated by taking the square root of the variance of the data. Because the variance is obtained from squared differences, it is expressed in squared units (for example, dollars squared), which makes it hard to interpret directly in real-world terms. The standard deviation, being the square root of the variance, is in the same unit as the elements of the set, so it gives a more readable account of the dispersion of values in a dataset. Hence, standard deviation is widely used as a dependable measure of spread. Standard deviation is also related to probability in many ways, so you may like to take a workshop on probability and statistics to explore more about the relation between the two topics.
A common use of standard deviation is finding out how much the values of the dataset differ from the mean. Let’s understand standard deviation with the help of an example:
Suppose a particular country claims that the average salary of its people is 5000 dollars per month, and that the country is therefore very prosperous. This is a classic problem for a statistician, who may ask the claimant for the standard deviation of the salary distribution in that country. If the standard deviation is very large, the statistician may conclude that salaries are widely dispersed and that the prosperity claim should be viewed with suspicion. If the standard deviation is small, the claim may well be credible, because individual salaries differ little from the mean salary.
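As a small sketch of the relation between the two quantities, the snippet below computes the population variance of some illustrative values and then takes its square root. The data is made up for the example, and the population form (dividing by the number of values) is an assumption.

```python
import math
from statistics import pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values with mean 5

m = sum(data) / len(data)

# Variance: average squared difference from the mean (in squared units)
variance = sum((x - m) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance (same unit as the data)
std_dev = math.sqrt(variance)

print("variance:", variance)
print("standard deviation:", std_dev)
print("library standard deviation:", pstdev(data))
```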
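To make the salary example concrete, here is a rough sketch with two invented sets of monthly salaries that share the same mean of 5000 but have very different spreads. The figures and the names `country_a` and `country_b` are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical monthly salaries in dollars (not real data)
country_a = [4800, 4900, 5000, 5100, 5200]   # salaries close to the mean
country_b = [500, 800, 1000, 1200, 21500]    # one very high earner

mean_a, mean_b = mean(country_a), mean(country_b)
sd_a, sd_b = pstdev(country_a), pstdev(country_b)

print("Country A: mean", mean_a, "standard deviation", round(sd_a, 1))
print("Country B: mean", mean_b, "standard deviation", round(sd_b, 1))
```

Both lists have the same mean, but the much larger standard deviation of the second list shows that most of its salaries sit far from 5000, which is the kind of evidence the statistician in the example is asking for.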
A rule of thumb for standard deviation is that, for data that is approximately normally distributed, about 68% of the values lie within one standard deviation of the mean, about 95% within two standard deviations and about 99.7% within three standard deviations. Thus, if somebody says that 95% of the state’s population is aged between 4 and 84 and asks you to find the mean, you can estimate the mean age as the midpoint of that range: (4+84)/2 = 44. Since the range covers two standard deviations on either side of the mean, the standard deviation is about (84-44)/2 = 20 years.
In the salary example above, the mean alone cannot tell us whether the country’s claim is credible, and so the standard deviation steps in to save the day for the statistician. You should learn some more about introductory statistics to gain in-depth knowledge of these topics.
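The age calculation can be checked with a short sketch. It assumes that ages are roughly normally distributed and that the interval from 4 to 84 is centered on the mean and spans two standard deviations on each side; both are simplifying assumptions made for the example.

```python
from statistics import NormalDist

low, high = 4, 84   # interval said to contain about 95% of the population

# The mean is the midpoint of the interval. Note the parentheses:
# 4 + 84 / 2 would divide only the 84.
mean_age = (low + high) / 2

# The interval covers two standard deviations on each side, four in total
std_dev_age = (high - low) / 4

# Check the coverage under the assumed normal distribution
ages = NormalDist(mu=mean_age, sigma=std_dev_age)
coverage = ages.cdf(high) - ages.cdf(low)

print("mean age:", mean_age)
print("standard deviation:", std_dev_age)
print("share of population between 4 and 84:", round(coverage, 3))
```

Under these assumptions the interval holds a little over 95% of the distribution, in line with the rule of thumb.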