STATISTICAL TERMS
Statistics - the practice or science of collecting and analyzing numerical data in large
quantities, especially for the purpose of inferring proportions in a whole from
those in a representative sample.
Primary data - information that you collect specifically for
the purpose of your research project. An advantage of primary data is that it
is specifically tailored to your research needs. A disadvantage is that it is
often expensive to obtain. Primary data are collected by the investigator conducting
the research.
Secondary data - data originally collected by someone else, usually for a
different purpose. Common sources of secondary data for social science include censuses,
information collected by government departments, organisational records and
data that were originally collected for other research purposes.
Secondary data offer the following advantages:
(1) It is highly convenient to use information which
someone else has compiled. There is no need for printing data collection forms,
hiring enumerators, editing and tabulating the results, etc. Researchers alone
or with some clerical assistance may obtain information from published records
compiled by somebody else.
(2) If secondary data are available they are much
quicker to obtain than primary data.
(3) Secondary data may be available on some subjects
where it would be impossible to collect primary data. For example, census data
cannot be collected by an individual or research organization, but can only be
obtained from Government publications.
However, two major problems are encountered in using secondary data:
(1) The first is the difficulty of finding data
which exactly fit the needs of the present study.
(2) The second problem is finding data which are
sufficiently accurate.
Even if the utmost care has been taken in selecting a
sample, the results derived from a sample study may not be exactly equal to the
true value in the population. The reason is that estimates are based on a
part of the population and not on the whole, and samples are seldom, if ever,
a perfect miniature of the population. Hence sampling gives rise to certain
errors known as sampling errors (or sampling fluctuations). These errors would
not be present in a complete enumeration survey. However, the errors can be
controlled. Modern sampling theory helps in designing the survey in such a manner
that the sampling errors can be made small.
Sampling errors are of two types: biased and unbiased.
1. Biased errors. These
errors arise from any bias in selection, estimation, etc. For example, if
deliberate sampling has been used in place of simple random sampling in a
particular case, some bias is introduced in the result, and hence such errors are
called biased sampling errors.
2. Unbiased errors. These
errors arise due to chance differences between the members of the population
included in the sample and those not included. An error in statistics is the
difference between the value of a statistic and that of the corresponding
parameter.
Thus the total sampling error is made up of errors
due to bias, if any, and the random sampling error. The essence of bias is that
it forms a constant component of error that does not decrease, even in a large
population, as the number in the sample increases. Such error is, therefore,
also known as cumulative or non-compensating error. The random sampling error,
on the other hand, decreases on average as the size of the sample increases.
Such error is, therefore, also known as non-cumulative or compensating error.
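The contrast between the two components can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is made up (10,000 values drawn from a normal distribution), the bias is modelled as a hypothetical constant of 2.0 added to every estimate, and the function name mean_abs_error is an illustrative choice.

```python
import random
import statistics

def mean_abs_error(population, sample_size, trials, rng, bias=0.0):
    """Average absolute gap between the estimate and the true population mean."""
    true_mean = statistics.fmean(population)
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # simple random sample
        estimate = statistics.fmean(sample) + bias    # bias is a constant shift
        errors.append(abs(estimate - true_mean))
    return statistics.fmean(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.gauss(50, 10) for _ in range(10_000)]  # made-up population

for n in (10, 100, 1000):
    random_only = mean_abs_error(population, n, 200, rng)
    with_bias = mean_abs_error(population, n, 200, rng, bias=2.0)
    print(n, round(random_only, 2), round(with_bias, 2))
```

In runs of this kind the random-only error should shrink as n grows, while the error with the constant bias should level off near the size of the bias rather than near zero.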
Bias may arise due to:
(i) an inappropriate process of selection;
(ii) inappropriate work during the
collection; and
(iii) inappropriate methods of analysis.
(i) Faulty selection. Faulty selection of the sample
may give rise to bias in a number of ways, such as:
(a) Deliberate selection of a ‘representative’
sample.
(b) Conscious or unconscious bias in the selection
of a ‘random’ sample. The randomness of selection may not really exist, even
though the investigator claims to have a random sample, if the desire to
obtain a certain result is allowed to influence the selection.
(c) Substitution. Substitution
of an item in place of one chosen in a random sample sometimes leads to bias.
Thus, if it is decided to interview every 50th householder in
the street, it would be inappropriate to interview the 51st or any
other householder instead, as the characteristics of the substitute may differ from
those of the householder originally to be included in the sample.
(d) Non-response. If
not all the items to be included in the sample are covered, there will be
bias even though no substitution has been attempted. This fault occurs
particularly with mailed questionnaires, which are often returned
incomplete. Moreover, the information supplied by the informants may also be
biased.
(e) An appeal to the vanity of the person questioned
may give rise to yet another kind of bias. For example, the question ‘Are you a
good student?’ is such that most students would succumb to vanity and
answer ‘Yes’.
(ii) Bias due to Faulty Collection
of Data. Any consistent error in measurement will give rise
to bias whether the measurements are carried out on a sample or on all the units
of the population. The danger of error is, however, likely to be greater in
sampling work, since the units measured are often smaller. Bias may arise due
to improper formulation of the decision problem, wrongly defining the
population, specifying the wrong decision, securing an inadequate frame, and so
on. Biased observations may result from a poorly designed questionnaire, an
ill-trained interviewer, failure of a respondent’s memory, etc. Bias in the
flow of data may be due to an unorganized collection procedure or faulty editing or
coding of responses.
(iii) Bias in Analysis. In
addition to bias which arises from faulty process of selection and faulty
collection of information, faulty methods of analysis may also introduce bias.
Such bias can be avoided by adopting the proper methods of analysis.
If possibilities of bias exist, fully objective conclusions
cannot be drawn. The first essential of any sampling or census procedure must,
therefore, be the elimination of all sources of bias. The simplest and the only
certain way of avoiding bias in the selection process is for the sample to be
drawn either entirely at random, or at random subject to restrictions which,
while improving the accuracy, are of such a nature that they do not introduce
bias in the results. In certain cases, systematic selection may also be
permissible.
- Statistics - a set of concepts, rules, and procedures that help
us to:
- organize numerical information in the form of tables, graphs,
and charts;
- understand statistical techniques underlying decisions that
affect our lives and well-being; and
- make informed decisions.
- Data - facts, observations, and information that come from
investigations.
- Measurement data, sometimes called quantitative data - the result of
using some instrument to measure something (e.g., test score, weight);
- Categorical data, also referred to as frequency or qualitative
data. Things are grouped according to some common property or properties, and
the number of members of each group is recorded (e.g., males/females,
vehicle type).
- Variable - property of an object or event that can take on
different values. For example, college major is a variable that
takes on values like mathematics, computer science, English, psychology,
etc.
- Discrete Variable - a variable with a limited number of values, e.g.,
gender (male/female) or college class (freshman/sophomore/junior/senior).
- Continuous Variable - a variable that can take on many different values,
in theory, any value between the lowest and highest points on the
measurement scale.
- Independent Variable - a variable that is manipulated, measured, or
selected by the researcher as an antecedent condition to an observed
behavior. In a hypothesized cause-and-effect relationship, the
independent variable is the cause and the dependent variable is the
outcome or effect.
- Dependent Variable - a variable that is not under the experimenter's
control -- the data. It is the variable that is observed and
measured in response to the independent variable.
- Qualitative Variable - a variable based on categorical data.
- Quantitative Variable - a variable based on quantitative data.
- Graphs - visual display of data used to present frequency
distributions so that the shape of the distribution can easily be seen.
- Bar graph - a form of graph that uses bars separated by an
arbitrary amount of space to represent how often elements within a
category occur. The higher the bar, the higher the frequency of
occurrence. The underlying measurement scale is discrete (nominal
or ordinal-scale data), not continuous.
- Histogram - a form of a bar graph used with interval or
ratio-scaled data. Unlike the bar graph, bars in a histogram touch
with the width of the bars defined by the upper and lower limits of the
interval. The measurement scale is continuous, so the lower limit
of any one interval is also the upper limit of the previous interval.
- Measures of Center - Plotting data in a frequency distribution shows the
general shape of the distribution and gives a general sense of how the numbers
are bunched. Several statistics can be used to represent the
"center" of the distribution. These statistics are
commonly referred to as measures of central tendency.
- Mode - The mode of a distribution is simply defined as the
most frequent or common score in the distribution. The mode is the
point or value of X that corresponds to the highest point on the
distribution. If the highest frequency is shared by more than one
value, the distribution is said to be multimodal. It is not
uncommon to see distributions that are bimodal reflecting peaks in
scoring at two different points in the distribution.
- Median - The median is the score that divides the
distribution into halves; half of the scores are above the median and
half are below it when the data are arranged in numerical order.
The median is also referred to as the score at the 50th
percentile in the distribution. The median location of N
numbers can be found by the formula (N + 1) / 2. When N
is an odd number, the formula yields an integer that gives the position
of the median in the numerically ordered distribution.
For example, in the distribution of numbers (3 1 5 4 9 9
8) the median location is (7 + 1) / 2 = 4. When applied to the
ordered distribution (1 3 4 5 8 9 9), the value 5 is the median; three
scores are above 5 and three are below 5. If there were only 6
values (1 3 4 5 8 9), the median location is (6 + 1) / 2 = 3.5. In
this case the median is half-way between the 3rd and 4th
scores (4 and 5), or 4.5.
- Mean - The mean is the most common measure of central
tendency and the one that can be mathematically manipulated. It is
defined as the average of a distribution and is equal to $\sum X / N$.
Simply, the mean is computed by summing all the scores in the distribution
($\sum X$)
and dividing that sum by the total number of scores (N). The
mean is the balance point in a distribution such that if you subtract
each value in the distribution from the mean and sum all of these deviation
scores, the result will be zero.
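The three measures of center can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are illustrative choices, the scores are the ones used in the median example above, and the functions assume a non-empty list.

```python
from collections import Counter

def modes(scores):
    """Return every value that shares the highest frequency, in sorted order."""
    counts = Counter(scores)
    top = max(counts.values())
    return sorted(value for value, freq in counts.items() if freq == top)

def median_by_location(scores):
    """Find the median using the (N + 1) / 2 location rule."""
    ordered = sorted(scores)
    location = (len(ordered) + 1) / 2   # 1-based position in the ordered list
    lower = ordered[int(location) - 1]
    if location.is_integer():           # odd N: the location is a whole number
        return lower
    upper = ordered[int(location)]      # even N: average the two neighbours
    return (lower + upper) / 2

def mean(scores):
    """Sum of the scores divided by the number of scores."""
    return sum(scores) / len(scores)

scores = [3, 1, 5, 4, 9, 9, 8]
print(modes(scores), median_by_location(scores), mean(scores))
print(median_by_location([1, 3, 4, 5, 8, 9]))   # the six-value example
print(modes([1, 2, 2, 4, 6, 6, 9]))             # a bimodal list
```

A list is returned for the mode so that bimodal and multimodal distributions are reported in full instead of being reduced to a single value.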
- Measures of Spread - Although the average value in a distribution is
informative about how scores are centered in the distribution, the mean,
median, and mode lack context for interpreting those statistics.
Measures of variability provide information about the degree to
which individual scores are clustered about or deviate from the average
value in a distribution.
- Range - The simplest measure of variability to compute and
understand is the range. The range is the difference between the
highest and lowest score in a distribution. Although it is easy to
compute, it is not often used as the sole measure of variability due to
its instability. Because it is based solely on the most extreme
scores in the distribution and does not fully reflect the pattern of
variation within a distribution, the range is a very limited measure of
variability.
- Interquartile Range (IQR) - Provides a measure of the spread of the middle 50%
of the scores. The IQR is defined as the 75th percentile
minus the 25th percentile. The interquartile range plays an
important role in the graphical method known as the boxplot.
The advantage of using the IQR is that it is easy to compute and extreme
scores in the distribution have much less impact. Its strength is also
a weakness, however: it suffers as a measure of variability because it
discards too much data. Researchers want to study variability while
eliminating scores that are likely to be accidents. The boxplot
allows for this distinction and is an important tool for exploring
data.
- Variance - The variance is a measure based on the deviations
of individual scores from the mean. As noted in the definition of
the mean, however, simply summing the deviations will result in a value
of 0. To get around this problem the variance is based on squared
deviations of scores about the mean. When the deviations are
squared, the rank order and relative distance of scores in the
distribution are preserved while negative values are eliminated.
Then to control for the number of subjects in the distribution, the sum
of the squared deviations, $\sum (X - \bar{X})^2$,
is divided by N (population) or by N - 1 (sample).
The result is the average of the squared deviations and it is
called the variance.
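The two divisors can be compared directly in code. The sketch below is illustrative: the function name variance and its sample flag are assumptions made for this example, and the scores are reused from the median entry. Python's standard statistics module provides pvariance (divide by N) and variance (divide by N - 1) for cross-checking.

```python
import statistics

def variance(scores, sample=True):
    """Sum of squared deviations from the mean, divided by N - 1 or by N."""
    mean = sum(scores) / len(scores)
    squared_deviations = [(x - mean) ** 2 for x in scores]
    divisor = len(scores) - 1 if sample else len(scores)
    return sum(squared_deviations) / divisor

scores = [1, 3, 4, 5, 8, 9, 9]
print(variance(scores, sample=False), statistics.pvariance(scores))  # population
print(variance(scores, sample=True), statistics.variance(scores))    # sample
```

The sample version is always the larger of the two, because the same sum is divided by a smaller number.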
Bias.
A measurement procedure or estimator is said to be biased if,
on the average, it gives an answer that differs from the truth. The bias is the
average (expected) difference between the
measurement and the truth. For example, if you get on the scale with clothes
on, that biases the measurement to be larger than your true weight (this would
be a positive bias). The design of an experiment or of a survey can also lead
to bias. Bias can be deliberate, but it is not necessarily so.
Class Boundary.
A point that is the left endpoint of one class interval, and the right endpoint of
another class interval.
Class Interval.
In plotting a histogram, one starts by dividing
the range of values into a set of non-overlapping intervals, called class
intervals, in such a way that every datum is contained in some class
interval. See the related entries class boundary and endpoint convention.
Cluster Sample.
In a cluster sample, the sampling unit is a collection of population
units, not single population units. For example, techniques for adjusting the
U.S. census start with a sample of geographic blocks, then (try to) enumerate
all inhabitants of the blocks in the sample to obtain a sample of people. This
is an example of a cluster sample.
Discrete Variable.
A quantitative variable whose set of possible
values is countable. Typical examples of
discrete variables are variables whose possible values are a subset of the
integers, such as Social Security numbers, the number of people in a family,
ages rounded to the nearest year, etc. Discrete variables are
"chunky." Cf. continuous variable. A discrete random variable is one whose set of
possible values is countable. A random variable is
discrete if and only if its cumulative probability
distribution function is
a stair-step function; i.e., if it is piecewise constant and only
increases by jumps.
Median.
"Middle value" of a list. The smallest number such
that at least half the numbers in the list are no greater than it. If the list
has an odd number of entries, the median is the middle entry in the list after
sorting the list into increasing order. If the list has an even number of
entries, the median is the smaller of the two middle numbers after sorting. The
median can be estimated from a histogram by finding the smallest number such
that the area under the histogram to the left of that number is 50%.
numbers does not make sense. Quantitative variables
typically have units of measurement, such as inches, people, or pounds.
Quartiles.
There are three quartiles. The first or lower quartile (LQ)
of a list is a number (not necessarily a number in the list) such that at least
1/4 of the numbers in the list are no larger than it, and at least 3/4 of the
numbers in the list are no smaller than it. The second quartile is the median. The third or upper quartile (UQ) is a number such that at least 3/4 of
the entries in the list are no larger than it, and at least 1/4 of the numbers
in the list are no smaller than it. To find the quartiles, first sort the list
into increasing order. Find the smallest integer that is at least as big as the
number of entries in the list divided by four. Call that integer k. The kth
element of the sorted list is the lower quartile. Find the smallest integer
that is at least as big as the number of entries in the list divided by two.
Call that integer l. The lth element of the sorted list is the
median. Find the smallest integer that is at least as large as the number of
entries in the list times 3/4. Call that integer m. The mth
element of the sorted list is the upper quartile.
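The counting rule above translates directly into code. This is a sketch of that particular rule only (other quartile conventions give different answers); the function name quartiles is an illustrative choice and a non-empty list is assumed.

```python
import math

def quartiles(values):
    """Lower quartile, median and upper quartile by the 'smallest integer
    at least as big as' rule described above (positions are 1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    k = math.ceil(n / 4)          # position of the lower quartile
    mid = math.ceil(n / 2)        # position of the median
    m = math.ceil(3 * n / 4)      # position of the upper quartile
    return ordered[k - 1], ordered[mid - 1], ordered[m - 1]

print(quartiles([3, 1, 5, 4, 9, 9, 8]))
print(quartiles([1, 3, 4, 5, 8, 9]))
```

For a list with an even number of entries this rule returns the smaller of the two middle numbers as the median, which matches the Median entry in this glossary but differs from the convention of averaging the two middle values.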
Random Sample.
A random sample is a sample whose members are chosen
at random from a given population in such a way that the
chance of obtaining any particular sample can be computed. The number of units
in the sample is called the sample size, often denoted n. The
number of units in the population often is denoted N. Random samples can
be drawn with or without replacing objects between draws; that is, drawing all n
objects in the sample at once (a random sample without replacement), or drawing
the objects one at a time, replacing them in the population between draws (a
random sample with replacement). In a random sample with replacement, any given
member of the population can occur in the sample more than once. In a random
sample without replacement, any given member of the population can be in the
sample at most once. A random sample without replacement in which every subset
of n of the N units in the population is equally likely is also
called a simple random sample. The term random sample
with replacement denotes a random sample drawn in such a way that every n-tuple
of units in the population is equally likely.
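The difference between the two kinds of draw can be seen with Python's standard random module. In this sketch the population is a hypothetical list of 20 unit labels and the seed is arbitrary; random.sample draws without replacement and random.choices draws with replacement.

```python
import random

rng = random.Random(42)           # fixed seed so the run is repeatable
population = list(range(1, 21))   # N = 20 hypothetical unit labels
n = 5                             # sample size

without_replacement = rng.sample(population, n)   # each unit at most once
with_replacement = rng.choices(population, k=n)   # a unit may appear again

print(without_replacement)
print(with_replacement)
```

With random.sample every subset of n units is equally likely, so the first draw is a simple random sample in the sense defined above.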
Range.
The range of a set of numbers is the largest value in the
set minus the smallest value in the set. Note that as a statistical term, the
range is a single number, not a range of numbers.
A cumulative frequency (cumulative relative frequency) is
obtained by summing the frequencies (relative frequencies) of all classes up to
the specific class. In the case of qualitative variables, cumulative frequencies
make sense only for ordinal variables, not for nominal variables.
Qualitative data are presented
graphically either as a pie chart or as a horizontal or vertical bar graph.
A pie chart is a disk divided into pie-shaped
pieces proportional to the relative frequencies of the classes. To obtain the angle
for any class, we multiply its relative frequency by 360 degrees, which corresponds
to the complete circle.
A vertical bar graph displays the classes on the
horizontal axis and the frequencies (or relative frequencies) of the classes on
the vertical axis. The frequency (or relative frequency) of each class is
represented by a vertical bar whose height is equal to the frequency (or relative
frequency) of the class.
In a bar graph, the bars do not touch each other.
In a horizontal bar graph, the classes are displayed on the vertical axis and the
frequencies of the classes on the horizontal axis.
IMPORTANCE OF STATISTICS IN BUSINESS
There are three major functions
in any business enterprise in which statistical methods are useful. These
are as follows:
(i) The planning of operations: This
may relate to either special projects or to the recurring activities of a firm
over a specified period.
(ii) The setting up of standards: This
may relate to the size of employment, volume of sales, fixation of quality norms
for the manufactured product, norms for the daily output, and so forth.
(iii) The function of control: This
involves comparison of actual production achieved against the norm or target
set earlier. In case the production has fallen short of the target, it suggests
remedial measures so that such a deficiency does not occur again.
It is worth noting that
although these three functions (planning of operations, setting standards, and
control) are separate, in practice they are very much interrelated.
Different authors have
highlighted the importance of Statistics in business. For instance, Croxton and
Cowden give numerous uses of Statistics in business such as project planning,
budgetary planning and control, inventory planning and control, quality
control, marketing, production and personnel administration. Within these they
have also specified certain areas where Statistics is very relevant. Another author,
Irving W. Burr, dealing with the place of statistics in an industrial
organisation, specifies a number of areas where statistics is extremely useful.
These are: customer wants and market research, development design and
specification, purchasing, production, inspection, packaging and shipping, sales
and complaints, inventory and maintenance, costs, management control,
industrial engineering and research.
Statistical problems arising in
the course of business operations are multitudinous. As such, one may do no
more than highlight some of the more important ones to emphasise the relevance
of statistics to the business world. In the sphere of production, for example,
statistics can be useful in various ways.
Statistical quality control
methods are used to ensure the production of quality goods.
This is achieved by identifying and rejecting
defective or substandard goods. Sales targets can be fixed on
the basis of sales forecasts, which are made using various methods of
forecasting. Analysis of sales effected against the targets set earlier would
indicate any deficiency in achievement, which may be on account of several
causes: (i) targets were too high and unrealistic, (ii) salesmen's performance
has been poor, (iii) competition has emerged or increased, (iv) the quality of the
company's product is poor, and so on. These factors can be further investigated.
Another sphere in business where
statistical methods can be used is personnel management. Here, one is concerned
with the fixation of wage rates, incentive norms and performance appraisal of
individual employees. The concept of productivity is very relevant here. On the
basis of measurement of productivity, the productivity bonus is awarded to the
workers. Comparisons of wages and productivity are undertaken in order to
ensure increases in industrial productivity.
Statistical methods could also be
used to ascertain the efficacy of a certain product, say, a medicine. For
example, suppose a pharmaceutical company has developed a new medicine for the treatment
of bronchial asthma. Before launching it on a commercial basis, it wants to
ascertain the effectiveness of this medicine. It undertakes an experiment
involving the formation of two comparable groups of asthma patients. One group
is given the new medicine for a specified period and the other is treated
with the usual medicines. Records are maintained for the two groups for the
specified period. These records are then analysed to ascertain whether there is any significant
difference in the recovery of the two groups. If the difference is
statistically significant, the new medicine is commercially launched.
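One way to check whether two groups differ by more than chance would suggest is a permutation test. The text does not name a specific test, so this is a swapped-in illustration, not the method a company would necessarily use. The recovery records below are made up (1 = recovered, 0 = not recovered), and the function name permutation_p_value is hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, trials=10_000, seed=0):
    """Share of random relabelings whose gap in means is at least the observed gap."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)                      # reassign patients at random
        new_a = pooled[:len(group_a)]
        new_b = pooled[len(group_a):]
        gap = abs(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))
        if gap >= observed:
            hits += 1
    return hits / trials

# Made-up outcomes for two groups of 20 patients each.
new_medicine = [1] * 16 + [0] * 4      # 16 of 20 recovered
usual_medicine = [1] * 9 + [0] * 11    # 9 of 20 recovered
p = permutation_p_value(new_medicine, usual_medicine)
print(p)
```

A small value of p means that a gap as large as the observed one rarely arises when group labels are assigned at random; it would be compared with a significance level chosen before the experiment.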
LIMITATIONS OF STATISTICS
Statistics has a number of
limitations; the pertinent ones are as follows:
(i) There are
certain phenomena or concepts where statistics cannot be used. This is because
these phenomena or concepts are not amenable to measurement.
For example, beauty,
intelligence and courage cannot be quantified. Statistics has no place in
cases where quantification is not possible.
(ii) Statistics
reveal the average behaviour, the normal or the general trend. Applying
the 'average' concept to an individual or a particular situation
may lead to a wrong conclusion and sometimes may be disastrous.
For example, one may be misguided
when told that the average depth of a river from one bank to the other is four
feet, when there may be some points in between where its depth is far more than
four feet. On this understanding, one may enter the river at points of greater
depth, which may be hazardous.
(iii) Since statistics
are collected for a particular purpose, such data may not be relevant or useful
in other situations or cases. For example, secondary data
(i.e., data originally collected
by someone else) may not be useful to another person.
(iv) Statistics are
not 100 per cent precise, as Mathematics or Accountancy is.
Those who use statistics should
be aware of this limitation.
CHARACTERISTICS OF THE ARITHMETIC MEAN
Some of the important
characteristics of the arithmetic mean are:
1. The sum of the deviations of
the individual items from the arithmetic mean is always zero. This means
$\sum (x - \bar{x}) = 0$, where $x$ is the value of an item and $\bar{x}$ is
the arithmetic mean. Since the sum of the deviations in the positive direction
is equal to the sum of the deviations in the negative direction, the arithmetic
mean is regarded as a measure of central tendency.
2. The sum of the squared deviations
of the individual items from the arithmetic mean is always minimum. In other
words, the sum of the squared deviations taken from any value other than the
arithmetic mean will be higher.
3. As the arithmetic mean is
based on all the items in a series, a change in the value of any item will lead
to a change in the value of the arithmetic mean.
4. In the case of highly skewed
distribution, the arithmetic mean may get distorted on account of a few items
with extreme values. In such a case, it may cease to be the representative
characteristic of the distribution.
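The first two characteristics can be checked numerically. The sketch below uses an illustrative nine-item series whose mean is a whole number (17), so the arithmetic is exact; the function names are illustrative choices.

```python
def sum_of_deviations(values, center):
    """Sum of (x - center) over all items."""
    return sum(x - center for x in values)

def sum_of_squared_deviations(values, center):
    """Sum of (x - center) squared over all items."""
    return sum((x - center) ** 2 for x in values)

values = [5, 7, 10, 15, 18, 19, 21, 25, 33]   # illustrative series
mean = sum(values) / len(values)               # 153 / 9

print(sum_of_deviations(values, mean))         # characteristic 1
for center in (mean - 1, mean, mean + 1):      # characteristic 2
    print(center, sum_of_squared_deviations(values, center))
```

Moving the reference value away from the mean in either direction increases the sum of squared deviations, which is what characteristic 2 states.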
MEDIAN
Median is defined as the value of
the middle item (or the mean of the values of the two middle items) when the
data are arranged in an ascending or descending order of magnitude. Thus, in an
ungrouped frequency distribution if the n values are arranged in
ascending or descending order of magnitude, the median is the middle value if n
is odd. When n is even, the median is the mean of the two middle
values.
Suppose we have the following
series:
15, 19, 21, 7, 10, 33, 25, 18 and 5
We have to first arrange it in
either ascending or descending order. These figures are arranged in
ascending order as follows:
5, 7, 10, 15, 18, 19, 21, 25, 33
Now as the series consists of an odd
number of items, to find the position of the middle item, we use the formula
$\frac{n + 1}{2}$
where n is the number of
items. In this case, n is 9, so
$\frac{n + 1}{2} = 5$
that is, the size of the 5th item
is the median. This happens to be 18.
Suppose the series includes
one more item, 23. We then have to insert
23 in the above series at the
appropriate place, that is, between 21 and 25. Thus, the series is now 5, 7,
10, 15, 18, 19, 21, 23, 25, 33. Applying the above formula, the
median is the size of the 5.5th item.
Here, we have to take the average of the values of the 5th
and 6th items. This means an
average of 18 and 19, which gives the median as 18.5.
It may be noted that the formula
$\frac{n + 1}{2}$
itself is not the formula for the median; it merely indicates the
position of the median, namely, the number of items we have to count until we
arrive at the item whose value is the median. In the case of an even
number of items in the series, we
identify the two items whose values have to be
averaged to obtain the median. In
the case of a grouped series, the median is
calculated by linear
interpolation with the help of the following formula:
$$M = l_1 + \frac{l_2 - l_1}{f}(m - c)$$
where M = the median
$l_1$ = the lower
limit of the class in which the median lies
$l_2$ = the upper
limit of the class in which the median lies
f = the frequency
of the class in which the median lies
m = the middle
item or (n + 1)/2th item, where n stands for the total number of
items
c = the cumulative frequency of
the class preceding the one in which the median lies.
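The interpolation formula can be sketched as a short function. The class limits and frequencies below are hypothetical, chosen only to exercise the formula, and the name grouped_median is an illustrative choice; the variables mirror the symbols defined above.

```python
def grouped_median(classes):
    """classes: list of (lower_limit, upper_limit, frequency) in ascending order.
    Applies M = l1 + (l2 - l1) / f * (m - c)."""
    n = sum(freq for _, _, freq in classes)
    m = (n + 1) / 2                  # position of the middle item
    c = 0                            # cumulative frequency before the current class
    for l1, l2, f in classes:
        if f > 0 and c + f >= m:     # the median lies in this class
            return l1 + (l2 - l1) / f * (m - c)
        c += f
    raise ValueError("classes must contain at least one observation")

# Hypothetical grouped series: 29 items in four classes of width 10.
classes = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 4)]
print(grouped_median(classes))
```

Here n = 29, so m = 15; the first two classes hold 13 items, so the median falls in the 20 to 30 class with c = 13 and f = 12.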
The frequency of a
particular data value is the number of times the data value occurs. For
example, if four students have a score of 80 in mathematics, then the score
of 80 is said to have a frequency of 4. The frequency of a data
value is often represented by f.
Frequency Distribution: values and their frequency (how often each value occurs).
Here is another example:
Example: Newspapers
These are the numbers of newspapers
sold at a local shop over the last 10 days:
22, 20, 18, 23, 20, 25, 22, 20, 18, 20
Let us count how many of each number
there are:
Papers Sold | Frequency
18 | 2
19 | 0
20 | 4
21 | 0
22 | 2
23 | 1
24 | 0
25 | 1
It is also possible to group
the values. Here they are grouped in 5s:
Papers Sold | Frequency
15-19 | 2
20-24 | 7
25-29 | 1
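Both tables can be reproduced by counting. The sketch below uses collections.Counter from the Python standard library on the newspaper figures given above; the variable names and the grouping width of 5 are taken from the example, and the grouping trick relies on each class starting at a multiple of 5.

```python
from collections import Counter

papers_sold = [22, 20, 18, 23, 20, 25, 22, 20, 18, 20]

# Ungrouped frequency distribution, including values that never occur.
counts = Counter(papers_sold)
frequency = {value: counts[value]
             for value in range(min(papers_sold), max(papers_sold) + 1)}
for value, freq in frequency.items():
    print(value, freq)

# Grouped in 5s: each value is mapped to the start of its class (15, 20, 25).
width = 5
grouped = Counter((value // width) * width for value in papers_sold)
for start in sorted(grouped):
    print(f"{start}-{start + width - 1}", grouped[start])
```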
HISTOGRAMS VS BAR GRAPHS
Bar Graphs are good when your data is in categories (such as
"Comedy", "Drama", etc).
It is best to leave gaps between the bars of a Bar Graph, so it doesn't look like
a Histogram.
Pie Chart: a special
chart that uses "pie slices" to show relative sizes of data.
Imagine you survey your friends
to find the kind of movie they like best:
Table: Favorite Type of Movie
Comedy | Action | Romance | Drama | SciFi
4 | 5 | 6 | 1 | 4
You can show the data with a pie chart:
[Figure: pie chart of favorite type of movie]
It is a really good way to show
relative sizes: it is easy to see which movie types are most liked, and which
are least liked, at a glance.
How to Make Them Yourself
First, put your data into a table
(like above), then add up all the values to get a total:
Table: Favorite Type of Movie
Comedy | Action | Romance | Drama | SciFi | TOTAL
4 | 5 | 6 | 1 | 4 | 20
Next, divide each value by the
total and multiply by 100 to get a percent:
Comedy | Action | Romance | Drama | SciFi | TOTAL
4 | 5 | 6 | 1 | 4 | 20
4/20 = 20% | 5/20 = 25% | 6/20 = 30% | 1/20 = 5% | 4/20 = 20% | 100%
A full circle has 360 degrees,
so multiply each fraction by 360° to get the angle of each sector:
Comedy | Action | Romance | Drama | SciFi | TOTAL
4 | 5 | 6 | 1 | 4 | 20
20% | 25% | 30% | 5% | 20% | 100%
4/20 × 360° = 72° | 5/20 × 360° = 90° | 6/20 × 360° = 108° | 1/20 × 360° = 18° | 4/20 × 360° = 72° | 360°
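The same arithmetic can be done for any table of counts. This is a small sketch using the movie survey figures above; the function name pie_sectors is an illustrative choice.

```python
def pie_sectors(counts):
    """Map each label to (percent of total, sector angle in degrees)."""
    total = sum(counts.values())
    return {label: (100 * count / total, 360 * count / total)
            for label, count in counts.items()}

votes = {"Comedy": 4, "Action": 5, "Romance": 6, "Drama": 1, "SciFi": 4}
sectors = pie_sectors(votes)
for label, (percent, degrees) in sectors.items():
    print(f"{label}: {percent:g}% -> {degrees:g} degrees")
```

The percentages should add up to 100 and the angles to 360, which is a quick check that no category has been missed.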
Now you are ready to start drawing!
Draw a circle.
[Figure: circle with the first sector drawn]
Finish up by coloring each sector and giving it a label like "Comedy: 4 (20%)",
etc.
(And don't forget a title!)
Another Example
You can use pie charts to show
the relative sizes of many things, such as:
- what type of car people have,
- how many customers a shop has on different days,
- how popular different breeds of dogs are, and so on.
Example: Student Grades
Here is how many students got
each grade in the recent test:
A | B | C | D
4 | 12 | 10 | 2
[Figure: pie chart of student grades]
Pie diagrams are at times less
effective than bar diagrams for accurate reading and interpretation,
particularly when the series is divided into a large number of components or
the differences among the components are very small. It is generally inadvisable
to attempt to portray a series of more than five or six categories by means of
a pie chart. If, for example, there are eight, ten or more categories, it may be
very confusing to differentiate the relative values portrayed, especially
when several small sectors are of approximately the same size. This type
of diagram, although frequently used, appears on comparison inferior to the simple
bar diagram, the divided bar diagram or a group of curves.
SIGNIFICANCE OF DIAGRAMS & GRAPHS
Diagrams and
graphs are extremely useful for the following reasons:
1. Diagrams and graphs are attractive, impressive and save time.
2. They make data representation simple and have universal utility.
3. They make comparison possible and give more information.
GENERAL RULES FOR CONSTRUCTING DIAGRAMS
The construction of diagrams is
an art, which can be acquired through practice. However, observance of some
general guidelines can help in making them more attractive and effective. The
diagrammatic presentation of statistical facts will be advantageous provided
the following rules are observed in drawing diagrams.
- A diagram should be neatly drawn and attractive.
- The measurements of geometrical figures used in a diagram should be accurate and
proportional.
- The size of the diagram should match the size of the paper.
- Every diagram must have a suitable but short heading.
- The scale should always be mentioned in the diagram.
- Diagrams should be neatly as well as accurately drawn with the help of drawing
instruments.
- An index must be given for identification so that the reader can easily make out
the meaning of the diagram.
- Economy in cost and energy should be exercised in drawing diagrams.
Limitations of Diagrammatic Presentation
- Diagrams do not present small differences properly.
- They can easily be misused.
- Only an artist can draw multi-dimensional diagrams.
- In statistical analysis, diagrams are of no use.
- Diagrams are just a supplement to tabulation.
- Only a limited set of data can be presented in the form of a diagram.
- Diagrammatic presentation of data is a more time-consuming process.
- Diagrams present only preliminary conclusions.
- Diagrammatic presentation of data shows only an estimate of the actual behavior of the variables.
TYPES OF DIAGRAMS
(a) Line Diagrams
In these diagrams only a line is drawn to represent one variable. These lines may be vertical or horizontal. The lines are drawn such that their length is proportional to the value of the terms or items, so that comparison may be done easily.
(b) Simple Bar Diagram
Like line diagrams, these figures are also used where a single dimension, i.e. length, can present the data. The procedure is almost the same, except that the lines are given thickness. These can also be drawn either vertically or horizontally. The breadth of these bars should be equal. Similarly, the distance between these bars should be equal. The breadth and the distance between them should be chosen according to the space available on the paper.
(c) Multiple Bar Diagrams
This diagram is used when we have to make a comparison between two or more variables. The number of variables may be 2, 3, 4 or more. In the case of 2 variables, a pair of bars is drawn. Similarly, in the case of 3 variables, we draw triple bars. The bars are drawn on the same proportionate basis as in the case of simple bars. The same shade is given to the same item. The distance between pairs is kept constant.
(d) Sub-divided Bar Diagram
The data presented by a multiple bar diagram can also be presented by this diagram. In this case we add the different variables for a period and draw the total on a single bar, as shown in the following examples. The components must be kept in the same order in each bar. This diagram is more effective if the number of components is small, i.e. 3 to 5.
(e) Percentage Bar Diagram
Like the sub-divided bar diagram, in this case also the data of one particular period or variable is put on a single bar, but in terms of percentages. Components are kept in the same order in each bar for easy comparison.
(f) Duo-directional Bar Diagram
In this case the diagram extends on both sides of the base line, i.e. to the left and right, or above and below.
(g) Broken Bar Diagram
This diagram is used when the value of some variable is very high or low compared to the others. In this case the bars with the bigger terms or items may be shown broken.
MERITS OF THE MEAN:
Merits
i) It is rigidly defined.
ii) It is easy to understand and easy to calculate.
iii) It is based upon all the observations.
iv) It is amenable to algebraic treatment. The mean of the composite series in terms of the means and sizes of the component series is given by
x̄ = (n1 x̄1 + n2 x̄2 + ... + nk x̄k) / (n1 + n2 + ... + nk)
where x̄i and ni are the mean and the size of the i-th component series.
v) Of all the averages, the arithmetic mean is affected least by fluctuations of sampling. This property is sometimes described by saying that the mean is a stable average.
Thus, we see that the arithmetic mean satisfies all the properties laid down by Prof. Yule for an ideal average.
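Property (iv) can be checked numerically. The sketch below is illustrative only: the function name combined_mean and the two small series are invented for the example and do not come from the text.

```python
from statistics import mean

# Hypothetical helper: mean of a composite series from the means and
# sizes of its component series (a size-weighted average of the means).
def combined_mean(means, sizes):
    weighted_total = sum(m * n for m, n in zip(means, sizes))
    return weighted_total / sum(sizes)

# Illustrative data (assumed, not from the text).
series_a = [50, 60, 70]
series_b = [10, 20]

from_parts = combined_mean(
    [mean(series_a), mean(series_b)],
    [len(series_a), len(series_b)],
)
from_pooled_data = mean(series_a + series_b)

print(from_parts, from_pooled_data)
```

Both routes give the same value, which is why the mean of a combined group can be found without going back to the raw observations.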
Demerits
i) Arithmetic mean is affected very much by extreme values. In case of extreme items, arithmetic mean gives a distorted picture of the distribution and no longer remains representative of the distribution.
ii) Arithmetic mean may lead to wrong conclusions if the details of the data from which it is computed are not given. Let us consider the following marks obtained by two students A and B in three tests, viz., terminal test, half-yearly examination and annual examination respectively.
Marks in: | I Test | II Test | III Test | Average marks
A | 50% | 60% | 70% | 60%
B | 70% | 60% | 50% | 60%
Thus the average marks obtained by each of the two students at the end of the year are 60%. If we are given the average marks alone, we conclude that the level of intelligence of both the students at the end of the year is the same. This is a fallacious conclusion, since we find from the data that student A has improved consistently while student B has deteriorated consistently.
iii) Arithmetic mean cannot be calculated if the extreme class is open, e.g. below 10 or above 70. Moreover, even if a single observation is missing, the mean cannot be calculated.
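The first two demerits can be seen in a short Python sketch. The marks of students A and B are the ones given above; the list called typical and the extreme value 500 are assumptions made up purely for illustration.

```python
from statistics import mean, median

# Demerit (ii): identical means can hide opposite trends.
marks_a = [50, 60, 70]   # student A improves steadily
marks_b = [70, 60, 50]   # student B declines steadily
print(mean(marks_a), mean(marks_b))

# Demerit (i): one extreme value pulls the mean far away from the bulk
# of the data (illustrative numbers, not from the text).
typical = [20, 22, 24, 26, 28]
with_extreme = [20, 22, 24, 26, 500]
print(mean(typical), mean(with_extreme))
print(median(typical), median(with_extreme))
```

The median of the second list stays where most of the observations are, while the mean no longer represents them.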
MERITS OF THE MEDIAN:
i) It is rigidly defined.
ii) It is easily understood and is easy to calculate. In some cases it can be located merely by inspection.
iii) It is not at all affected by extreme values.
iv) It can be calculated for distributions with open-end classes.
Demerits:
i) In the case of an even number of observations, the median cannot be determined exactly. We merely estimate it by taking the mean of the two middle-most terms.
ii) It is not based on all the observations. For example, the median of 10, 25, 50, 60 and 65 is 50. We can replace the observations 10 and 25 by any two values which are smaller than 50 and the observations 60 and 65 by any two values greater than 50, without affecting the value of the median. This property is sometimes described by saying that the median is insensitive.
iii) It is not amenable to algebraic treatment.
iv) As compared with the mean, it is affected more by fluctuations of sampling.
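Demerits (i) and (ii) can be illustrated with Python's standard statistics module. The first list is the example from the text; the replacement values in the second list and the four-value list are arbitrary choices made for this sketch.

```python
from statistics import median

# The example from the text: the middle value is 50.
original = [10, 25, 50, 60, 65]

# Replace the two smaller and two larger values with arbitrary ones
# (assumed for illustration); the middle value is untouched.
altered = [1, 2, 50, 1000, 2000]
print(median(original), median(altered))

# With an even number of observations there is no single middle value,
# so the mean of the two middle-most terms (25 and 50) is used.
even_count = [10, 25, 50, 60]
print(median(even_count))
```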
Uses:
i) Median is the only average to be used while dealing with qualitative data which cannot be measured quantitatively but still can be arranged in ascending or descending order of magnitude, e.g. to find the average intelligence or average honesty among a group of people.
ii) It is to be used for determining the typical value in problems concerning wages, distribution of wealth, etc.
MERITS AND DEMERITS OF THE MODE:
Merits
i) Mode is readily comprehensible and easy to calculate. Like the median, the mode can be found in some cases merely by inspection.
ii) Mode is not at all affected by extreme values.
iii) Mode can be conveniently located even if the frequency distribution has classes of unequal magnitude, provided the modal class and the classes preceding and succeeding it are of the same magnitude. Open-end classes also do not pose any problem in the determination of the mode.
Demerits
i) Mode is ill-defined. It is not always possible to find a clearly defined mode. In some cases, we may come across distributions with two modes. Such distributions are called bi-modal. If a distribution has more than two modes, it is said to be multimodal.
ii) It is not based upon all the observations.
iii) It is not capable of further mathematical treatment.
iv) As compared with the mean, the mode is affected to a greater extent by fluctuations of sampling.
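The ill-defined nature of the mode shows up quickly in code. The sketch below uses statistics.multimode from the Python standard library (available from Python 3.8); the shoe-size lists are invented for illustration.

```python
from statistics import multimode

# One clear mode: size 8 occurs most often (illustrative data).
sizes_unimodal = [7, 8, 8, 8, 9]
print(multimode(sizes_unimodal))

# Two values tie for the highest frequency, so the data are bi-modal.
sizes_bimodal = [7, 8, 8, 9, 9, 10]
print(multimode(sizes_bimodal))
```

multimode returns a list of every value that ties for the highest frequency, so a result with more than one entry signals that no single mode exists.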
Uses
Mode is the average to be used to find the ideal size, e.g. in business forecasting, in the manufacture of ready-made garments, shoes, etc.
MERITS AND DEMERITS OF GEOMETRIC MEAN
Merits:
i) It is rigidly defined.
ii) It is based upon all the observations.
iii) It is suitable for further mathematical treatment.
iv) It is not affected much by fluctuations of sampling.
v) It gives comparatively more weight to small items.
Demerits:
i) Because of its abstract mathematical character, geometric mean is not easy to understand and to calculate for a non-mathematics student.
ii) If any one of the observations is zero, geometric mean becomes zero, and if any one of the observations is negative, geometric mean can become imaginary, regardless of the magnitude of the other items.
Uses
Geometric mean is used
i) To find the rate of population growth and the rate of interest
ii) In the construction of index numbers
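A small Python sketch can make the geometric mean and its demerits concrete. The helper geometric_mean below is written for this illustration (it is a plain n-th root of the product, not a numerically robust routine), and all the input values are made up. math.prod requires Python 3.8 or later.

```python
import math

# Hypothetical helper: n-th root of the product of n values.
def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))

# Two values whose product is 16, so the square root is taken.
print(geometric_mean([2, 8]))

# Use (i): average yearly growth factor over three years
# (illustrative factors for 5%, 10% and 15% growth).
growth_factors = [1.05, 1.10, 1.15]
average_factor = geometric_mean(growth_factors)
# Compounding the average factor three times reproduces the total growth.
print(math.isclose(average_factor ** 3, math.prod(growth_factors)))

# Demerit (ii): a single zero forces the result to zero ...
print(geometric_mean([0, 5, 10]))
# ... and a negative product under an even root is no longer a real number.
print(type(geometric_mean([-2, 4])))
```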
The following are the measures of dispersion:
i) Range
ii) Quartile deviation or semi-interquartile range
iii) Mean deviation
iv) Variance and standard deviation
The range is the difference between the two extreme observations of the data. If A and B are the greatest and the smallest observations respectively in a data set, then its range is A - B.
Range is the simplest but a crude measure of dispersion. Since it is based on two extreme observations which themselves are subject to chance fluctuations, it is not at all a reliable measure of dispersion.
Quartile deviation or semi-interquartile range Q is given by
Q = ½ (Q3 - Q1)
where Q1 and Q3 are the first and third quartiles of the distribution respectively.
Quartile deviation is definitely a better measure than the range as it makes use of 50% of the data. But since it ignores the other 50% of the data, it cannot be regarded as a reliable measure.
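Both measures are easy to compute. The sketch below is one possible illustration: the function names and the data are invented, and the quartiles come from statistics.quantiles with its default method. Textbooks use slightly different quartile conventions, so Q1 and Q3 may differ a little from hand calculations on other data.

```python
from statistics import quantiles

# Hypothetical helper: greatest observation minus smallest observation.
def value_range(data):
    return max(data) - min(data)

# Hypothetical helper: Q = (Q3 - Q1) / 2, with quartiles taken from
# statistics.quantiles (default "exclusive" method; an assumption here).
def quartile_deviation(data):
    q1, _, q3 = quantiles(data, n=4)
    return (q3 - q1) / 2

observations = [2, 4, 6, 8, 10, 12, 14]   # illustrative data
print(value_range(observations))
print(quartile_deviation(observations))
```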
INDEX NUMBERS
Index numbers are a commonly used statistical device for measuring the combined fluctuations in a group of related variables. If we wish to compare the price level of consumer items today with that prevalent ten years ago, we are not interested in comparing the prices of only one item, but in comparing some sort of average price levels. We may wish to compare the present agricultural production or industrial production with that at the time of independence. Here again, we have to consider all items of production, and each item may have undergone a different fractional increase (or even a decrease). How do we obtain a composite measure? This composite measure is provided by index numbers, which may be defined as a device for combining the variations that have occurred in a group of related variables over a period of time, with a view to obtaining a figure that represents the ‘net’ result of the change in the constituent variables.
Index numbers may be classified in terms of the variables that they are intended to measure. In business, the groups of variables for which index number techniques are commonly used are (i) price, (ii) quantity, (iii) value and (iv) business activity. Thus, we have the index of wholesale prices, index of consumer prices, index of industrial output, index of value of exports and index of business activity, etc. Here we shall be mainly interested in index numbers of prices showing changes with respect to time, although the methods described can be applied to other cases. In general, the present level of prices is compared with the level of prices in the past. The present period is called the current period and some period in the past is called the base period.
Index Numbers:
Index numbers are statistical measures designed to show changes in a variable or group of related variables with respect to time, geographic location or other characteristics such as income, profession, etc. A collection of index numbers for different years, locations, etc., is sometimes called an index series.
Simple Index Number:
A simple index number is a number that measures a relative change in a single variable with respect to a base.
Composite Index Number:
A composite index number is a number that measures the average relative change in a group of related variables with respect to a base.
Types of Index Numbers:
The following types of index numbers are usually used:
Price Index Numbers:
Price index numbers measure the relative changes in the prices of commodities between two periods. Prices can be either retail or wholesale.
Quantity Index Numbers:
These index numbers measure changes in the physical quantity of goods produced, consumed or sold for an item or a group of items.
INDEX NUMBERS are a statistician's way of expressing the difference between two
measurements by designating one number as the "base", giving it the
value 100 and then expressing the second number as a percentage of the first.
Example:
If the population of a town increased from 20,000 in 1988 to 21,000 in 1991,
the population in 1991 was 105% of the population in 1988. Therefore, on a 1988
= 100 base, the population index for the town was 105 in 1991.
An "index", as the term is generally used when referring to statistics, is a series of index numbers expressing a series of numbers as percentages of a single number.
Example: the numbers
50 75 90 110
expressed as an index, with the first number as a base, would be
100 150 180 220
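The conversion to an index is a single division per value. The following Python sketch is illustrative; the function name to_index and its base_position parameter are assumptions made for the example, while the numbers are the ones used in the examples above.

```python
# Hypothetical helper: express every value as a percentage of the value
# at base_position, so the base itself becomes 100.
def to_index(values, base_position=0):
    base = values[base_position]
    return [value * 100 / base for value in values]

# The series from the text, with the first number as the base.
print(to_index([50, 75, 90, 110]))

# The town population example: 20,000 in 1988 (base) and 21,000 in 1991.
print(to_index([20000, 21000]))
```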
Indexes can be used to express comparisons between places, industries, etc. but the most common use is to express changes over a period of time, in which case the index is also a time series or "series". One point in time is designated the base period (it may be a year, month, or any other period) and given the value 100. The index numbers for the measurement (price, quantity, value, etc.) at all other points in time indicate the percentage change from the base period.
If the price, quantity or value has increased by 15% since the base period, the index is 115; if it has fallen 5%, the index is 95. It is important to note that indexes reflect percentage differences relative to the base year and not absolute levels. If the price index for one item is 110 and for another is 105, it means the price of the first has increased twice as much, in percentage terms, as the price of the second (10% against 5%). It does not mean that the first item is more expensive than the second.
Each index number in a series reflects the percentage change from the base period. It is important not to confuse an index point change and a percentage change between two index numbers in a series.
Example: if the price index for butter was 130 one year and 143 the next year, the index point change would be:
143 – 130 = 13
but the percentage change for the index would be:
((143 – 130) × 100) ÷ 130 = 10%
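The two kinds of change can be placed side by side in code. This is a minimal sketch; the function names are made up for the illustration, and the index values are the butter figures from the example.

```python
# Hypothetical helper: difference between two index numbers, in index points.
def index_point_change(old_index, new_index):
    return new_index - old_index

# Hypothetical helper: the same difference as a percentage of the earlier index.
def percentage_change(old_index, new_index):
    return (new_index - old_index) * 100 / old_index

butter_last_year = 130
butter_this_year = 143
print(index_point_change(butter_last_year, butter_this_year))
print(percentage_change(butter_last_year, butter_this_year))
```

The two figures coincide only when the earlier index is exactly 100, that is, when the comparison starts from the base period.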