Introduction to statistics
Statistics plays a vitally important role in research. Much scientific information is
explained in statistical terms, with many decisions in the health sciences being made
through statistical studies.
Statistics enables you:
o to read and evaluate reports and other literature
o to undertake independent research investigations
o to describe the data in meaningful terms
Definitions
Statistics: is the study of how to collect, organize, analyze, and interpret data.
Data: the values recorded in an experiment or observation.
Population: refers to any collection of individual items or units that are the subject of
investigation.
Sample: a small representative part of a population is called a sample.
Observation: each unit in the sample provides a record, such as a measurement, which is called
an observation.
Sampling: the process of selecting a sample from a population.
Variable: a characteristic of an item or individual that can take different values.
Raw Data: Data collected in original form.
Frequency: The number of times a certain value or class of values occurs.
Tabulation: can be defined as the logical and systematic arrangement of statistical data in rows
and columns.
Frequency Distribution: The organization of raw data in table form with classes and frequencies.
Class Limits: Separate one class in a grouped frequency distribution from another. The limits
actually appear in the data, and there are gaps between the upper limit of one class and the
lower limit of the next.
Class Boundaries: Separate one class in a grouped frequency distribution from another, with no
gaps between them.
Cumulative Frequency: The number of values less than the upper class boundary for the
current class. This is a running total of the frequencies.
Histogram: A graph which displays the data by using vertical bars of various heights to represent
frequencies.
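Several of these terms (raw data, frequency, cumulative frequency) can be illustrated with a short
Python sketch; the data values below are hypothetical:

```python
from collections import Counter

# Hypothetical raw data: number of clinic visits recorded for 10 patients
raw_data = [2, 3, 2, 5, 3, 3, 2, 4, 5, 2]

# Frequency: the number of times each value occurs
freq = Counter(raw_data)

# Cumulative frequency: a running total of the frequencies, in value order
cumulative = []
running = 0
for value in sorted(freq):
    running += freq[value]
    cumulative.append((value, freq[value], running))

for value, f, cf in cumulative:
    print(value, f, cf)
```

The last running total always equals the total number of observations.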
Variables
• A variable is a characteristic of an item or individual that can take different values.
• Variables are of two types:
o Quantitative: a variable with a numeric value. E.g. age, weight.
o Qualitative: a variable with a category or group value. E.g. Gender (M/F),
Religion (H/M/C), Qualification (degree/PG)
• Quantitative variables are of two types:
o Discrete variables
o Continuous variables
• Variables can be
o Independent
Are not influenced by other variables.
Are not influenced by the event, but could influence the event.
o Dependent
The variable which is influenced by the others is often referred to as the
dependent variable.
SBL 321: Biostatistics J. C. Korir
E.g. In an experimental study of a relaxation intervention for reducing hypertension, blood
pressure is the dependent variable, and relaxation training, age, and gender are independent
variables.
Sampling
• Sampling is the process of getting a representative fraction of a population.
• Analysis of the sample gives an idea of the population.
Methods of sampling
1. Random Sampling or Probability sampling
Simple random sampling
Stratified random Sampling
Systematic sampling
Cluster sampling
Proportionate sampling
Multistage sampling
2. Non-random sampling
Haphazard Sampling
Convenience Sampling
Purposive Sampling
Quota Sampling
Simple Random sampling
Each individual of the population has an equal chance of being included in the sample. Two
methods are used in simple random sampling:
• Random Numbers method
• Lottery method
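As a minimal sketch, the random numbers method can be carried out with a computer's
random-number generator; the population of patient IDs below is hypothetical:

```python
import random

# Hypothetical population: 100 patient IDs
population = list(range(1, 101))

# Random numbers method: draw 10 units without replacement, so each
# individual has an equal chance of being included in the sample
random.seed(42)  # fixed seed only so the draw is repeatable
sample = random.sample(population, 10)
print(sample)
```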
Stratified random sampling
Stratified random sampling is used when we have subgroups in our population that are likely to
differ substantially in their responses or behavior. This sampling technique treats the population
as though it were two or more separate populations and then randomly samples within each.
For example, you are interested in visual-spatial reasoning and previous research suggests that
men and women will perform differently on these types of task. So, you divide your sample into
male and female members and randomly select equal numbers within each subgroup (or
"stratum"). With this technique, you are guaranteed to have enough of each subgroup for
meaningful analysis.
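The visual-spatial reasoning example can be sketched as follows, assuming a hypothetical
population with two strata of unequal size:

```python
import random

# Hypothetical population with two strata that may respond differently
males = [f"M{i}" for i in range(60)]
females = [f"F{i}" for i in range(40)]

random.seed(1)
n_per_stratum = 10  # equal numbers from each subgroup (stratum)
sample = random.sample(males, n_per_stratum) + random.sample(females, n_per_stratum)
print(sample)
```

Because we sample within each stratum separately, both subgroups are guaranteed
to be represented.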
Systematic sampling
Systematic sampling yields a probability sample but it is not a random sampling strategy.
Systematic sampling strategies take every nth person from the sampling frame. For example,
you choose a random starting point and take every 45th name in the directory until you have the
desired sample size. Its major advantage is that it is much less cumbersome to use than the
procedures outlined for simple random sampling.
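The every-nth-person rule can be sketched in a few lines; the alphabetical directory of
450 names is hypothetical:

```python
import random

# Hypothetical sampling frame: an alphabetical directory of 450 names
frame = [f"name_{i:03d}" for i in range(450)]

k = 45                            # take every 45th name
random.seed(7)
start = random.randrange(k)       # random starting point within the first interval
sample = frame[start::k]          # every kth person thereafter
print(len(sample), sample[:3])
```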
Cluster sampling
Cluster sampling is useful when it would be impossible or impractical to identify every person in
the sample. Suppose a college does not print a student directory. It would be most practical in
this instance to sample students from classes. Rather than randomly sample 10% of students
from each class, which would be a difficult task, randomly sampling every student in 10% of the
classes would be easier.
Sampling every student in a class is not a random procedure. However, by randomly selecting
the classes, you have a greater probability of capturing a representative sample of the
population. Many students believe that it is not possible to gather a representative sample for a
class project or a thesis. However, this type of cluster sampling is easily done, especially since all
colleges publish lists of classes for registration.
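A sketch of this class-based cluster strategy, assuming a hypothetical college with 20
classes of 30 students each:

```python
import random

# Hypothetical college: 20 classes, each a list of student IDs
classes = {f"class_{c}": [f"s{c}_{i}" for i in range(30)] for c in range(20)}

random.seed(3)
# Randomly select 10% of the classes (the clusters) ...
chosen = random.sample(list(classes), k=2)

# ... then take EVERY student in each chosen class
sample = [student for cls in chosen for student in classes[cls]]
print(chosen, len(sample))
```

Only the choice of classes is random; once a class is chosen, all of its students enter
the sample.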
Proportionate sampling
Proportionate sampling is a variation of stratified random sampling. We use this technique when
our subgroups vary dramatically in size in our population. For example, we are interested in risk
taking among college students and suspect that risk taking might differ between smokers and
nonsmokers. Given increasing societal pressures against smoking, there are many fewer
smokers on campus than nonsmokers. Rather than take equal numbers of smokers and
nonsmokers, we want each group represented in their proportions in the population.
Proportionate sampling strategies begin by stratifying the population into relevant subgroups
and then random sampling within each subgroup. The number of participants that we recruit
from each subgroup is equal to their proportion in the population.
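A sketch under the hypothetical assumption of 100 smokers and 900 nonsmokers on campus,
with a total sample size of 50:

```python
import random

# Hypothetical campus population: far fewer smokers than nonsmokers
smokers = [f"sm{i}" for i in range(100)]
nonsmokers = [f"ns{i}" for i in range(900)]
population_size = len(smokers) + len(nonsmokers)

total_sample = 50
random.seed(5)
# Recruit from each stratum in proportion to its share of the population
n_smokers = round(total_sample * len(smokers) / population_size)
n_nonsmokers = round(total_sample * len(nonsmokers) / population_size)

sample = random.sample(smokers, n_smokers) + random.sample(nonsmokers, n_nonsmokers)
print(n_smokers, n_nonsmokers, len(sample))
```

Smokers make up 10% of this hypothetical population, so they make up 10% of the sample.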
Multistage sampling
This is the most sophisticated sampling strategy and it is often used in large epidemiological
studies. To obtain a representative national sample, researchers may select zip codes at random
from each state. Within these zip codes, streets are randomly selected. Within each street,
addresses are randomly selected. While each zip code constitutes a cluster, so the sample may
not be as accurate as with other probability sampling strategies, it still can be very accurate.
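The nested zip code → street → address selection can be sketched with a small hypothetical
frame (8 zip codes, 5 streets each, 10 addresses per street):

```python
import random

# Hypothetical nested sampling frame: zip codes -> streets -> addresses
frame = {
    f"zip{z}": {f"street{s}": [f"addr{z}_{s}_{a}" for a in range(10)]
                for s in range(5)}
    for z in range(8)
}

random.seed(11)
sample = []
for zipcode in random.sample(list(frame), 2):        # stage 1: pick zip codes
    streets = frame[zipcode]
    for street in random.sample(list(streets), 2):   # stage 2: pick streets
        sample += random.sample(streets[street], 3)  # stage 3: pick addresses
print(len(sample))
```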
Non-random sampling
Non-probability sampling strategies are used when it is practically impossible to use probability
sampling strategies. This typically occurs because of time and expense constraints and the lack
of an adequate sampling frame. Nonprobability sampling is also used when the frequency of the
behavior or characteristic of interest is so low in the population that a more targeted strategy is
needed to find sufficient numbers of participants for the research.
Haphazard Sampling
Haphazard sampling is a strategy that is almost guaranteed to introduce bias into your study. It
should be avoided at all costs. A typical haphazard strategy uses a "man-on-the-street"
technique to recruit those who wander by or selects a sampling frame that does not accurately
reflect the population.
Convenience sampling
This is a type of non-probability sampling which involves the sample being drawn from that part
of the population which is selected because it is readily available and convenient.
Purposive sampling
Purposive sampling targets a particular group of people. When the desired population for the
study is rare or very difficult to locate and recruit for a study, purposive sampling may be the
only option. For example, you are interested in studying cognitive processing speed of young
adults who have suffered closed head brain injuries in automobile accidents. This would be a
difficult population to find.
Quota sampling
In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as
in stratified sampling. Then judgment is used to select the subjects or units from each segment
based on a specified proportion. For example, an interviewer may be told to sample 200 females
and 300 males between the ages of 45 and 60. This means that the researcher can specify in
advance whom to sample (targeting).
It is this second step which makes the technique one of non-probability sampling. In quota
sampling, the selection of the sample is non-random unlike random sampling and can often be
found unreliable. For example, interviewers might be tempted to interview those people in the
street who look most helpful, or may choose to use accidental sampling to question those who
are closest to them, for time-keeping's sake. The problem is that these samples may be biased
because not everyone gets a chance of selection. This non-random element is its greatest
weakness and quota versus probability has been a matter of controversy for many years.
Quota sampling is useful when time is limited, a sampling frame is not available, the research
budget is very tight, or when detailed accuracy is not important. You can also choose how many
units of each category are selected.
Scales of measurement
Five measurement scales are used:
• Nominal Data
• Ordinal Data
• Rank Data
• Interval Data
• Ratio Data
Nominal data
Nominal variables name categories of people, events, and other phenomena. Often
we do not need the full power of numbers for every application. To make this point clear we
classify our use of numbers into different classes. For example, one kind of data is what we call a
nominal data; when we label males as 0, females as 1, then that’s nominal data. Another
example of nominal data is if we use 0 to denote who's alive and 1 for denoting people who are
dead. In both these examples, they are nominally numbers, just 0 or 1.
The only property of the number system we're making use of here is that 0 is different from 1.
We're not saying 1 is bigger than 0. We're not saying that 1 is one unit away from 0. Simply that
0 and 1 are different. This is the simplest example we have of nominal data. This is sometimes
called binary data or dichotomous data, depending upon whether you prefer the Greek or the
Latin root for two.
But it doesn't just have to have two values. For example, if we're looking at blood groups, here
we would need four values: one each for blood groups A, B, AB and O.
They are exhaustive in nature and mutually exclusive. These categories are discrete and
non-continuous. The statistical operations permissible are: counting of frequencies, percentage,
proportion, mode, and the coefficient of contingency.
Ordinal data
It is second in terms of its refinement as a means of classifying information. It incorporates the
functions of the nominal scale. The ordinal scale is used to arrange (or rank) individuals into a
sequence ranging from the highest to the lowest. For example, we might classify a disease as
mild, moderate, or severe, where we might label mild as a 1, moderate as a 2, and severe a 3.
We use the order of the data because 2 is a little bit more severe than 1, and 3 is a little bit more
severe than 2. So the order is important.
Rank data
Rank data is like what we see at the Olympics: the person who finishes first gets the gold
medal, and the person who finishes second gets the silver. It doesn't matter how far behind
the second is from the first; it only matters that the second one finished second. The gap
could be a fraction of a second or a few minutes. What counts is the rank, the order in which
the data fall.
Interval data
Interval scale refers to the third level of measurement in relation to complexity of statistical
techniques used to analyze data. It is quantitative in nature. The individual units are equidistant
from one point to the other. Interval data do not have an absolute zero, e.g. temperature
measured in Celsius or Fahrenheit.
Ratio data
Have equal distances between the increments. This scale has an absolute zero. Ratio variables
exhibit the characteristics of ordinal and interval measurement. E.g. variables like time,
length, and weight are measured on ratio scales.
Processing of data
The first step in processing of data is classification and tabulation. Classification is the process of
arranging data on the basis of some common characteristics possessed by them.
Two approaches in analyzing data are:
o Descriptive statistics
o Inferential statistics
Descriptive statistics are concerned with describing the characteristics of frequency
distributions. The common methods in descriptive analyses are:
o Measures of central tendency
o Measures of dispersion
o Tabulation, cross-tab, contingency table
o Line diagram, bar diagram, pie diagram.
o Histogram, frequency polygon, frequency curve
o Quantile, Q-Q plot
o Scatterplot
Inferential statistics help to decide whether the outcome of the study is a result of factors
planned within the design of the study or determined by chance. Common inferential statistical
tests are: t-tests, the chi-square test, and Pearson correlation.
Methods of summarizing data
Descriptive Statistics describe basic features of the data gathered from an experimental study in
various ways. They provide simple summaries about the sample via graphs and numbers, mainly
measures of center and variation. Together with graphical analysis (histograms, bar plots,
pie-charts), they are the cornerstone of quantitative data analysis.
- Tables (frequency distributions, stem-and-leaf plots) that summarize the data.
- Graphical representations of the data (histograms, bar plots, pie-charts).
- Summary statistics (numbers) which summarize the data.
Tables
The most common ways of summarizing data into tables are frequency distribution, relative
frequency distribution, and cumulative frequency distribution tables. Another common format is
the stem-and-leaf plot.
Frequency distribution table
A frequency distribution summarizes the data into a table containing the ranges into which the
data fall, and the frequency (or count) of data points falling in each range.
• Simple depiction of all the data
• Frequency distribution is a statistical table containing “groups of values according to the
number of times a value occurs.”
• The data collected by an investigator is called raw data.
• Raw data is ungrouped data.
• It is not in order.
• Raw data arranged in order is called an array.
• The data are arranged in ascending or descending order.
Frequency Distribution with Classes
• It is constructed with class intervals.
• It is a frequency distribution of continuous series.
• Raw data is arranged as an array.
• Then the data is divided into groups called classes.
• The first class and the last class are fixed by seeing the lowest and highest values.
• Lowest and highest numbers of each class are called class limits (upper & lower).
• The class limit may be made in two methods:
1. Inclusive methods
2. Exclusive method
Sturges’ Rule: k = 1 + 3.322 log10(n), where k is the number of classes and n is the size of the data set.
Class Interval = Range / No. of Classes
Hence, No. of Classes = Range / Class Interval
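A quick worked example of Sturges' rule and the class-interval formula, assuming a
hypothetical data set of 100 observations ranging from 12 to 92:

```python
import math

# Sturges' rule for a hypothetical data set of n = 100 observations
n = 100
k = 1 + 3.322 * math.log10(n)   # 1 + 3.322 * 2 = 7.644
num_classes = round(k)          # round to a whole number of classes

# Class interval = range / number of classes
lowest, highest = 12, 92        # hypothetical smallest and largest values
class_interval = (highest - lowest) / num_classes
print(num_classes, class_interval)
```

So this hypothetical data set would be grouped into 8 classes of width 10.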
Stem-and-Leaf Plots
Another way of summarizing data into tables is the stem-and-leaf plot, which works best when
the data can be subdivided into tens and units. Suppose, for example, that the GDP per capita data are:
81 76 70 57 55 55 89 46 46 45 44 44 42 40 39 39 56 39 39 39, given in thousands of
dollars. In other words, one country has a GDP per capita of 81,000 dollars (namely Luxemburg),
Qatar has a GDP per capita of 76,000, and the USA has a GDP per capita of 46,000
dollars. This data can be nicely summarized in the following stem-and-leaf plot:

3 | 99999
4 | 0244566
5 | 5567
6 |
7 | 06
8 | 19
In the stem-and-leaf plot above, 8|1 means that there is one country with GDP per person in the
80,000s, and that that country has a GDP of 81,000 per capita.
7|06 means that there are 2 countries with GDP between $70,000 and $79,999, and that those
countries have GDP per capita of 70,000 and 76,000. That is why there are a 0 and a 6 in the
units place, to the right of the vertical bar.
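The plot can be built mechanically by splitting each value into a tens stem and a units leaf;
a minimal sketch for the GDP-per-capita data above:

```python
# Build a stem-and-leaf plot: stem = tens digit, leaf = units digit
data = [81, 76, 70, 57, 55, 55, 89, 46, 46, 45, 44, 44,
        42, 40, 39, 39, 56, 39, 39, 39]

stems = {}
for value in sorted(data):                 # sorting keeps leaves in order
    stems.setdefault(value // 10, []).append(value % 10)

for stem in sorted(stems):
    print(f"{stem} | {''.join(str(leaf) for leaf in stems[stem])}")
```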
Diagrammatic Presentation of data
It is a visual form of presentation of statistical data in which data are presented in the form of
diagrams such as bars, lines, circles, maps
Advantages of diagrammatic presentation of data:
1. It is more attractive.
2. It simplifies complex information.
3. It saves time.
4. It helps to make comparisons.
Rules for drawing diagrams
- It should have a title
- Proper scaling should be used.
- Index must be given for better understanding of diagrams
Common Types of diagrams
- Line Diagram
- Pie diagram
- Bar diagram
Graphical Presentation of data
- Presenting data in the form of graphs prepared on graph paper.
- The graph has two axes: X & Y
- Usually, the independent variable is marked on the X-axis and the dependent variable on the Y-axis.
Common Types:
1. Histogram
2. Frequency Polygon
3. Frequency curve
Histogram
- Histogram is a graph containing frequencies in the form of vertical rectangles.
- It is an area diagram
- It is the graphical presentation of frequency distribution.
- X-axis is marked with class intervals
- Y-axis is marked with frequencies
- Histogram differs from bar diagram. The bar diagram is one dimensional, whereas
histogram is two-dimensional.
Uses of histogram
1. It gives a clear picture of entire data
2. It simplifies complex data
3. Median and mode can be calculated.
4. It facilitates comparison of two or more frequency distributions on the same
graph.
Category   Systolic BP (mmHg)   Number of Persons
1          100-109                7
2          110-119               16
3          120-129               19
4          130-139               31
5          140-149               41
6          150-159               23
7          160-169               10
8          170-179                3
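The blood-pressure table above can be turned into a rough text histogram, with the height of
each bar drawn as a row of # characters:

```python
# Text sketch of a histogram for the blood-pressure frequency table above
classes = ["100-109", "110-119", "120-129", "130-139",
           "140-149", "150-159", "160-169", "170-179"]
frequencies = [7, 16, 19, 31, 41, 23, 10, 3]

for interval, freq in zip(classes, frequencies):
    print(f"{interval} | {'#' * freq}")   # one bar per class interval

total = sum(frequencies)
print("Total persons:", total)
```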
Frequency Polygon
A frequency polygon is another way to show the information in a frequency table. It looks a
little bit like a line graph. To make a frequency polygon, you just need to plot the mid-points
of each class against their frequencies and then join the points by straight lines.
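For the blood-pressure table above, the points of the polygon are the class mid-points plotted
against their frequencies; a minimal sketch of computing them:

```python
# Mid-points for the blood-pressure classes: plot (midpoint, frequency)
# and join successive points with straight lines to get the polygon
class_limits = [(100, 109), (110, 119), (120, 129), (130, 139),
                (140, 149), (150, 159), (160, 169), (170, 179)]
frequencies = [7, 16, 19, 31, 41, 23, 10, 3]

midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
points = list(zip(midpoints, frequencies))
print(points)
```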
Frequency Curve
Frequency curve is obtained by joining the points of a frequency polygon by a freehand smoothed
curve. Unlike the frequency polygon, where the points were joined by straight lines, we make use
of freehand joining of those points in order to get a smoothed frequency curve. It is used to
remove the ruggedness of the polygon and to present it in a good form or shape. We smooth the
angularities of the polygon only, without making any basic change in the shape of the curve. In
this case also the curve begins and ends at the base line, as in the case of the polygon. The
area under the curve must remain almost the same as in the case of the polygon.
Measures of central tendency
Measures of central tendency are sometimes needed to make meaningful interpretation of
data. Generally, it is found that in any distribution the values of the variable tend to congregate
around a central value of the distribution. This tendency of the distribution is known as central
tendency, and the measures devised to describe this tendency are known as measures of central
tendency. One of the most important objectives of statistical analysis is to get a single value
that describes the characteristic of the entire mass of data.
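As a quick illustration, the three common measures of central tendency (mean, median, and mode)
can be computed for a hypothetical data set with Python's statistics module:

```python
import statistics

# Hypothetical data set: ages of 9 study participants
ages = [23, 25, 25, 27, 29, 31, 34, 34, 34]

mean = statistics.mean(ages)      # arithmetic average
median = statistics.median(ages)  # middle value of the ordered data
mode = statistics.mode(ages)      # most frequent value
print(mean, median, mode)
```

Each statistic condenses the whole data set into a single representative value.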