Introduction to statistics
Statistics plays a vitally important role in research. Much scientific information is
explained in statistical terms, with many decisions in the health sciences being made
through statistical studies.
Statistics enables you:
o to read and evaluate reports and other literature
o to undertake independent research investigations
o to describe the data in meaningful terms
Definitions
Statistics: is the study of how to collect, organize, analyze, and interpret data.
Data: the values recorded in an experiment or observation.
Population: refers to any collection of individual items or units that are the subject of
investigation.
Sample: a small representative part of a population is called a sample.
Observation: each unit in the sample provides a record, such as a measurement, which is called
an observation.
Sampling: the process of selecting a sample from a population.
Variable: a characteristic of an item or individual that can take different values.
Raw Data: Data collected in original form.
Frequency: The number of times a certain value or class of values occurs.
Tabulation: can be defined as the logical and systematic arrangement of statistical data in rows
and columns.
Frequency Distribution: The organization of raw data in table form with classes and frequencies.
Class Limits: Separate one class in a grouped frequency distribution from another. The limits
actually appear in the data, and there are gaps between the upper limit of one class and the
lower limit of the next.
Class Boundaries: Separate one class in a grouped frequency distribution from another, with no
gaps between them.
Cumulative Frequency: The number of values less than the upper class boundary for the
current class. This is a running total of the frequencies.
Histogram: A graph which displays the data by using vertical bars of various heights to represent
frequencies.
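Several of these terms (raw data, frequency, cumulative frequency) can be illustrated with a short
Python sketch; the data values below are hypothetical:

```python
from collections import Counter

# Hypothetical raw data: number of clinic visits recorded for 10 patients
raw_data = [2, 3, 2, 5, 3, 3, 2, 4, 5, 2]

# Frequency: the number of times each value occurs
freq = Counter(raw_data)

# Cumulative frequency: a running total of the frequencies, in value order
cumulative = []
running = 0
for value in sorted(freq):
    running += freq[value]
    cumulative.append((value, freq[value], running))

for value, f, cf in cumulative:
    print(value, f, cf)
```

The last running total always equals the total number of observations.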
Variables
• A variable is a characteristic of an item or individual that can take different values.
• Variables are of two types:
o Quantitative: a variable with a numeric value. E.g. age, weight.
o Qualitative: a variable with a category or group value. E.g. Gender (M/F),
Religion (H/M/C), Qualification (degree/PG)
• Quantitative variables are of two types:
o Discrete variables
o Continuous variables
• Variables can be
o Independent
Are not influenced by other variables.
Are not influenced by the event, but could influence the event.
o Dependent
The variable which is influenced by the others is often referred to as the
dependent variable.
SBL 321: Biostatistics J. C. Korir
E.g. In an experimental study of a relaxation intervention for reducing hypertension, blood
pressure is the dependent variable, and relaxation training, age, and gender are independent
variables.
Sampling
• Sampling is the process of getting a representative fraction of a population.
• Analysis of the sample gives an idea of the population.
Methods of sampling
1. Random Sampling or Probability sampling
Simple random sampling
Stratified random Sampling
Systematic sampling
Cluster sampling
Proportionate sampling
Multistage sampling
2. Non-random sampling
Haphazard Sampling
Convenience Sampling
Purposive Sampling
Quota Sampling
Simple Random sampling
Each individual of the population has an equal chance of being included in the sample. Two
methods are used in simple random sampling:
• Random Numbers method
• Lottery method
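As a minimal sketch, the random numbers method can be carried out with a computer's
random-number generator; the population of patient IDs below is hypothetical:

```python
import random

# Hypothetical population: 100 patient IDs
population = list(range(1, 101))

# Random numbers method: draw 10 units without replacement, so each
# individual has an equal chance of being included in the sample
random.seed(42)  # fixed seed only so the draw is repeatable
sample = random.sample(population, 10)
print(sample)
```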
Stratified random sampling
Stratified random sampling is used when we have subgroups in our population that are likely to
differ substantially in their responses or behavior. This sampling technique treats the population
as though it were two or more separate populations and then randomly samples within each.
For example, you are interested in visual-spatial reasoning and previous research suggests that
men and women will perform differently on these types of task. So, you divide your sample into
male and female members and randomly select equal numbers within each subgroup (or
"stratum"). With this technique, you are guaranteed to have enough of each subgroup for
meaningful analysis.
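The visual-spatial reasoning example can be sketched as follows, assuming a hypothetical
population with two strata of unequal size:

```python
import random

# Hypothetical population with two strata that may respond differently
males = [f"M{i}" for i in range(60)]
females = [f"F{i}" for i in range(40)]

random.seed(1)
n_per_stratum = 10  # equal numbers from each subgroup (stratum)
sample = random.sample(males, n_per_stratum) + random.sample(females, n_per_stratum)
print(sample)
```

Because we sample within each stratum separately, both subgroups are guaranteed
to be represented.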
Systematic sampling
Systematic sampling yields a probability sample but it is not a random sampling strategy.
Systematic sampling strategies take every nth person from the sampling frame. For example,
you choose a random starting point and take every 45th name in the directory until you have the
desired sample size. Its major advantage is that it is much less cumbersome to use than the
procedures outlined for simple random sampling.
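The every-nth-person rule can be sketched in a few lines; the alphabetical directory of
450 names is hypothetical:

```python
import random

# Hypothetical sampling frame: an alphabetical directory of 450 names
frame = [f"name_{i:03d}" for i in range(450)]

k = 45                            # take every 45th name
random.seed(7)
start = random.randrange(k)       # random starting point within the first interval
sample = frame[start::k]          # every kth person thereafter
print(len(sample), sample[:3])
```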
Cluster sampling
Cluster sampling is useful when it would be impossible or impractical to identify every person in
the sample. Suppose a college does not print a student directory. It would be most practical in
this instance to sample students from classes. Rather than randomly sample 10% of students
from each class, which would be a difficult task, randomly sampling every student in 10% of the
classes would be easier.
Sampling every student in a class is not a random procedure. However, by randomly selecting
the classes, you have a greater probability of capturing a representative sample of the
population. Many students believe that it is not possible to gather a representative sample for a
class project or a thesis. However, this type of cluster sampling is easily done, especially since all
colleges publish lists of classes for registration.
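A sketch of this class-based cluster strategy, assuming a hypothetical college with 20
classes of 30 students each:

```python
import random

# Hypothetical college: 20 classes, each a list of student IDs
classes = {f"class_{c}": [f"s{c}_{i}" for i in range(30)] for c in range(20)}

random.seed(3)
# Randomly select 10% of the classes (the clusters) ...
chosen = random.sample(list(classes), k=2)

# ... then take EVERY student in each chosen class
sample = [student for cls in chosen for student in classes[cls]]
print(chosen, len(sample))
```

Only the choice of classes is random; once a class is chosen, all of its students enter
the sample.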
Proportionate sampling
Proportionate sampling is a variation of stratified random sampling. We use this technique when
our subgroups vary dramatically in size in our population. For example, we are interested in risk
taking among college students and suspect that risk taking might differ between smokers and
nonsmokers. Given increasing societal pressures against smoking, there are many fewer
smokers on campus than nonsmokers. Rather than take equal numbers of smokers and
nonsmokers, we want each group represented in their proportions in the population.
Proportionate sampling strategies begin by stratifying the population into relevant subgroups
and then random sampling within each subgroup. The number of participants that we recruit
from each subgroup is equal to their proportion in the population.
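A sketch under the hypothetical assumption of 100 smokers and 900 nonsmokers on campus,
with a total sample size of 50:

```python
import random

# Hypothetical campus population: far fewer smokers than nonsmokers
smokers = [f"sm{i}" for i in range(100)]
nonsmokers = [f"ns{i}" for i in range(900)]
population_size = len(smokers) + len(nonsmokers)

total_sample = 50
random.seed(5)
# Recruit from each stratum in proportion to its share of the population
n_smokers = round(total_sample * len(smokers) / population_size)
n_nonsmokers = round(total_sample * len(nonsmokers) / population_size)

sample = random.sample(smokers, n_smokers) + random.sample(nonsmokers, n_nonsmokers)
print(n_smokers, n_nonsmokers, len(sample))
```

Smokers make up 10% of this hypothetical population, so they make up 10% of the sample.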
Multistage sampling
This is the most sophisticated sampling strategy and it is often used in large epidemiological
studies. To obtain a representative national sample, researchers may select zip codes at random
from each state. Within these zip codes, streets are randomly selected. Within each street,
addresses are randomly selected. While each zip code constitutes a cluster, so the sample may
not be as accurate as with other probability sampling strategies, it still can be very accurate.
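The nested zip code → street → address selection can be sketched with a small hypothetical
frame (8 zip codes, 5 streets each, 10 addresses per street):

```python
import random

# Hypothetical nested sampling frame: zip codes -> streets -> addresses
frame = {
    f"zip{z}": {f"street{s}": [f"addr{z}_{s}_{a}" for a in range(10)]
                for s in range(5)}
    for z in range(8)
}

random.seed(11)
sample = []
for zipcode in random.sample(list(frame), 2):        # stage 1: pick zip codes
    streets = frame[zipcode]
    for street in random.sample(list(streets), 2):   # stage 2: pick streets
        sample += random.sample(streets[street], 3)  # stage 3: pick addresses
print(len(sample))
```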
Non-random sampling
Non-probability sampling strategies are used when it is practically impossible to use probability
sampling strategies. This typically occurs because of time and expense constraints and the lack
of an adequate sampling frame. Nonprobability sampling is also used when the frequency of the
behavior or characteristic of interest is so low in the population that a more targeted strategy is
needed to find sufficient numbers of participants for the research.
Haphazard Sampling
Haphazard sampling is a strategy that is almost guaranteed to introduce bias into your study. It
should be avoided at all costs. A typical haphazard strategy uses a "man-on-the-street"
technique to recruit those who wander by or selects a sampling frame that does not accurately
reflect the population.
Convenience sampling
This is a type of non-probability sampling which involves the sample being drawn from that part
of the population which is selected because it is readily available and convenient.
Purposive sampling
Purposive sampling targets a particular group of people. When the desired population for the
study is rare or very difficult to locate and recruit for a study, purposive sampling may be the
only option. For example, you are interested in studying cognitive processing speed of young
adults who have suffered closed head brain injuries in automobile accidents. This would be a
difficult population to find.
Quota sampling
In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as
in stratified sampling. Then judgment is used to select the subjects or units from each segment
based on a specified proportion. For example, an interviewer may be told to sample 200 females
and 300 males between the ages of 45 and 60. This means that the researcher can specify in
advance whom to sample (targeting).
It is this second step which makes the technique one of non-probability sampling. In quota
sampling, the selection of the sample is non-random unlike random sampling and can often be
found unreliable. For example, interviewers might be tempted to interview those people in the
street who look most helpful, or may choose to use accidental sampling to question those who
are closest to them, for time-keeping's sake. The problem is that these samples may be biased
because not everyone gets a chance of selection. This non-random element is its greatest
weakness and quota versus probability has been a matter of controversy for many years.
Quota sampling is useful when time is limited, a sampling frame is not available, the research
budget is very tight, or when detailed accuracy is not important. You can also choose how many
units of each category are selected.
Scales of measurement
Five measurement scales are used:
• Nominal Data
• Ordinal Data
• Rank Data
• Interval Data
• Ratio Data
Nominal data
Nominal variables name categories of people, events, and other phenomena. Often
we do not need the full power of numbers for every application. To make this point clear we
classify our use of numbers into different classes. For example, one kind of data is what we call a
nominal data; when we label males as 0, females as 1, then that’s nominal data. Another
example of nominal data is if we use 0 to denote who's alive and 1 for denoting people who are
dead. In both these examples, they are nominally numbers, just 0 or 1.
The only property of the number system we're making use of here is that 0 is different from 1.
We're not saying 1 is bigger than 0. We're not saying that 1 is one unit away from 0. Simply that
0 and 1 are different. This is the simplest example we have of nominal data. This is sometimes
called binary data or dichotomous data, depending upon whether you prefer the Greek or the
Latin root for two.
But it doesn't just have to have two values. For example, if we're looking at blood groups, here
we would need four values: one each for blood groups A, B, AB and O.
They are exhaustive in nature and mutually exclusive. These categories are discrete and
non-continuous. The statistical operations permissible are: counting of frequencies, percentage,
proportion, mode, and the coefficient of contingency.
Ordinal data
It is second in terms of its refinement as a means of classifying information. It incorporates the
functions of the nominal scale. The ordinal scale is used to arrange (or rank) individuals into a
sequence ranging from the highest to the lowest. For example, we might classify a disease as
mild, moderate, or severe, where we might label mild as a 1, moderate as a 2, and severe a 3.
We use the order of the data because 2 is a little bit more severe than 1, and 3 is a little bit more
severe than 2. So the order is important.
Rank data
Rank data is like what we see at the Olympics: the person who finishes first gets the gold
medal, and the person who finishes second gets the silver. It doesn't matter how far behind
the second is from the first; it only matters that the second one finished second. The gap
could be a fraction of a second or a few minutes. What counts is the rank, the order in which
the data fall.
Interval data
Interval scale refers to the third level of measurement in relation to complexity of statistical
techniques used to analyze data. It is quantitative in nature. The individual units are equidistant
from one point to the other. Interval data do not have an absolute zero, e.g. temperature
measured in Celsius or Fahrenheit.
Ratio data
Have equal distances between the increments. This scale has an absolute zero. Ratio variables
exhibit the characteristics of ordinal and interval measurement. E.g. variables like time,
length, and weight are measured on ratio scales.
Processing of data
The first step in processing of data is classification and tabulation. Classification is the process of
arranging data on the basis of some common characteristics possessed by them.
Two approaches in analyzing data are:
o Descriptive statistics
o Inferential statistics
Descriptive statistics are concerned with describing the characteristics of frequency
distributions. The common methods in descriptive analyses are:
o Measures of central tendency
o Measures of dispersion
o Tabulation, cross-tab, contingency table
o Line diagram, bar diagram, pie diagram.
o Histogram, frequency polygon, frequency curve
o Quantile, Q-Q plot
o Scatterplot
Inferential statistics help to decide whether the outcome of the study is a result of factors
planned within the design of the study or determined by chance. Common inferential statistical
tests are: t-tests, the chi-square test, and Pearson correlation.
Methods of summarizing data
Descriptive Statistics describe basic features of the data gathered from an experimental study in
various ways. They provide simple summaries about the sample via graphs and numbers, mainly
measures of center and variation. Together with graphical analysis (histograms, bar plots,
pie-charts), they are the cornerstone of quantitative data analysis.
- Tables (frequency distributions, stem-and-leaf plots) that summarize the data.
- Graphical representations of the data (histograms, bar plots, pie-charts).
- Summary statistics (numbers) which summarize the data.
Tables
The most common ways of summarizing data into tables are frequency distribution, relative
frequency distribution, and cumulative frequency distribution tables. Another common format is
the stem-and-leaf plot.
Frequency distribution table
A frequency distribution summarizes the data into a table containing the ranges into which the
data fall, and the frequency (or count) of data points falling in each range.
• Simple depiction of all the data
• Frequency distribution is a statistical table containing “groups of values according to the
number of times a value occurs.”
• The data collected by an investigator is called raw data.
• Raw data is ungrouped data.
• It is not in order.
• Raw data arranged in order is called an array.
• The data are arranged in ascending or descending order.
Frequency Distribution with Classes
• It is constructed with class intervals.
• It is a frequency distribution of continuous series.
• Raw data is arranged as an array.
• Then the data is divided into groups called classes.
• The first class and the last class are fixed by seeing the lowest and highest values.
• Lowest and highest numbers of each class are called class limits (upper & lower).
• The class limit may be made in two methods:
1. Inclusive methods
2. Exclusive method
Sturges’ Rule: k = 1 + 3.322 log10(n), where k is the number of classes and n is the size of the data set.
Class Interval = Range / No. of Classes
Hence, No. of Classes = Range / Class Interval
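A quick worked example of Sturges' rule and the class-interval formula, assuming a
hypothetical data set of 100 observations ranging from 12 to 92:

```python
import math

# Sturges' rule for a hypothetical data set of n = 100 observations
n = 100
k = 1 + 3.322 * math.log10(n)   # 1 + 3.322 * 2 = 7.644
num_classes = round(k)          # round to a whole number of classes

# Class interval = range / number of classes
lowest, highest = 12, 92        # hypothetical smallest and largest values
class_interval = (highest - lowest) / num_classes
print(num_classes, class_interval)
```

So this hypothetical data set would be grouped into 8 classes of width 10.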
Stem-and-Leaf Plots
Another way of summarizing data into tables is the stem-and-leaf plot, which works best when
the data can be subdivided into tens and units. Suppose, for example, that the GDP per capita data are:
81 76 70 57 55 55 89 46 46 45 44 44 42 40 39 39 56 39 39 39, given in thousands of
dollars. In other words, one country has a GDP per capita of 81,000 dollars (namely Luxemburg),
Qatar has a GDP per capita of 76,000, and the USA has a GDP per capita of 46,000
dollars. This data can be nicely summarized in the following stem-and-leaf plot:

3 | 99999
4 | 0244566
5 | 5567
6 |
7 | 06
8 | 19
In the stem-and-leaf plot above, 8|1 means that there is one country with GDP per person in the
80,000s, and that that country has a GDP of 81,000 per capita.
7|06 means that there are 2 countries with GDP between $70,000 and $79,999, and that those
countries have GDP per capita of 70,000 and 76,000. That is why there are a 0 and a 6 in the
units place, to the right of the vertical bar.
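The plot can be built mechanically by splitting each value into a tens stem and a units leaf;
a minimal sketch for the GDP-per-capita data above:

```python
# Build a stem-and-leaf plot: stem = tens digit, leaf = units digit
data = [81, 76, 70, 57, 55, 55, 89, 46, 46, 45, 44, 44,
        42, 40, 39, 39, 56, 39, 39, 39]

stems = {}
for value in sorted(data):                 # sorting keeps leaves in order
    stems.setdefault(value // 10, []).append(value % 10)

for stem in sorted(stems):
    print(f"{stem} | {''.join(str(leaf) for leaf in stems[stem])}")
```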
Diagrammatic Presentation of data
It is a visual form of presentation of statistical data in which data are presented in the form of
diagrams such as bars, lines, circles, maps
Advantages of diagrammatic presentation of data:
1. It is more attractive.
2. It simplifies complex information.
3. It saves time.
4. It helps to make comparisons.
Rules for drawing diagrams
- It should have a title
- Proper scaling should be used.
- Index must be given for better understanding of diagrams
Common Types of diagrams
- Line Diagram
- Pie diagram
- Bar diagram
Graphical Presentation of data
- Presenting data in the form of graphs prepared on graph paper.
- The graph has two axes: X & Y
- Usually, the independent variable is marked on the X-axis and the dependent variable on the Y-axis.
Common Types:
1. Histogram
2. Frequency Polygon
3. Frequency curve
Histogram
- Histogram is a graph containing frequencies in the form of vertical rectangles.
- It is an area diagram
- It is the graphical presentation of frequency distribution.
- X-axis is marked with class intervals
- Y-axis is marked with frequencies
- Histogram differs from bar diagram. The bar diagram is one dimensional, whereas
histogram is two-dimensional.
Uses of histogram
1. It gives a clear picture of entire data
2. It simplifies complex data
3. Median and mode can be calculated.
4. It facilitates comparison of two or more frequency distributions on the same
graph.
Category   Systolic BP (mmHg)   Number of Persons
1          100-109                7
2          110-119               16
3          120-129               19
4          130-139               31
5          140-149               41
6          150-159               23
7          160-169               10
8          170-179                3
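The blood-pressure table above can be turned into a rough text histogram, with the height of
each bar drawn as a row of # characters:

```python
# Text sketch of a histogram for the blood-pressure frequency table above
classes = ["100-109", "110-119", "120-129", "130-139",
           "140-149", "150-159", "160-169", "170-179"]
frequencies = [7, 16, 19, 31, 41, 23, 10, 3]

for interval, freq in zip(classes, frequencies):
    print(f"{interval} | {'#' * freq}")   # one bar per class interval

total = sum(frequencies)
print("Total persons:", total)
```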
Frequency Polygon
A frequency polygon is another way to show the information in a frequency table. It looks a
little bit like a line graph. To make a frequency polygon, you just need to plot the mid-points
of each class against their frequencies and then join the points by straight lines.
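For the blood-pressure table above, the points of the polygon are the class mid-points plotted
against their frequencies; a minimal sketch of computing them:

```python
# Mid-points for the blood-pressure classes: plot (midpoint, frequency)
# and join successive points with straight lines to get the polygon
class_limits = [(100, 109), (110, 119), (120, 129), (130, 139),
                (140, 149), (150, 159), (160, 169), (170, 179)]
frequencies = [7, 16, 19, 31, 41, 23, 10, 3]

midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
points = list(zip(midpoints, frequencies))
print(points)
```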
Frequency Curve
Frequency curve is obtained by joining the points of a frequency polygon by a freehand smoothed
curve. Unlike the frequency polygon, where the points were joined by straight lines, we make use
of freehand joining of those points in order to get a smoothed frequency curve. It is used to
remove the ruggedness of the polygon and to present it in a good form or shape. We smooth the
angularities of the polygon only, without making any basic change in the shape of the curve. In
this case also the curve begins and ends at the base line, as in the case of the polygon. The
area under the curve must remain almost the same as in the case of the polygon.
Measures of central tendency
Measures of central tendency are sometimes needed to make meaningful interpretation of
data. Generally, it is found that in any distribution the values of the variable tend to congregate
around a central value of the distribution. This tendency of the distribution is known as central
tendency, and the measures devised to describe this tendency are known as measures of central
tendency. One of the most important objectives of statistical analysis is to get a single value
that describes the characteristic of the entire mass of data.
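As a quick illustration, the three common measures of central tendency (mean, median, and mode)
can be computed for a hypothetical data set with Python's statistics module:

```python
import statistics

# Hypothetical data set: ages of 9 study participants
ages = [23, 25, 25, 27, 29, 31, 34, 34, 34]

mean = statistics.mean(ages)      # arithmetic average
median = statistics.median(ages)  # middle value of the ordered data
mode = statistics.mode(ages)      # most frequent value
print(mean, median, mode)
```

Each statistic condenses the whole data set into a single representative value.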