## College of Micronesia-FSM: Dana Lee Ling's Introduction to Statistics Using OpenOffice.org, LibreOffice.org Calc, 4th edition: "05 Probability"

Read this section to learn about the intuition behind ways to compute probabilities.

## 5.1 Ways to determine a probability

A probability is the likelihood of an event or outcome. A probability is specified mathematically by a number between 0 and 1, inclusive.

- **0** is no likelihood an event will occur.
- **1** is absolute certainty an event will occur.
- **0.5** is an equal likelihood of occurrence or non-occurrence.
- Any value between 0 and 1 can occur.

We use the notation *P(eventLabel) = probability* to report a probability.

There are three ways to assign probabilities.

- Intuition or subjective estimate
- Equally likely outcomes
- Relative Frequencies

### Intuition

An intuitive or subjective probability is an educated best guess: available information is used to make a best estimate of a probability. The estimate could be anything from a wild guess to an educated and informed estimate by experts in the field.

### Equally Likely Events or Outcomes

Equally Likely Events: Probabilities from mathematical formulas

In the following the word "event" and the word "outcome" are taken to have the same meaning.

#### Probabilities versus Statistics

The study of problems with equally likely outcomes is termed the study of probabilities. This is the realm of the mathematics of probability. Using the mathematics of probability, the outcomes can be determined ahead of time. Mathematical formulas determine the probability of a particular outcome. All measures are population parameters. The mathematics of probability determines the probabilities for coin tosses, dice, cards, lotteries, bingo, and other games of chance.

This course focuses not on probability but rather on statistics. In statistics, measurements are made on a sample taken from the population and used to estimate the population's parameters. The full set of possible outcomes is usually not known and might not be knowable. Relative frequencies will be used to estimate population parameters.

### Calculating Probabilities

Where each and every event is equally likely, the probability of an event occurring can be determined from

probability = ways to get the desired event/total possible events

or

probability = ways to get the particular outcome/total possible outcomes
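As a minimal sketch, the formula above can be written as a small Python function. The function name `equally_likely_probability` is an illustrative choice, not something from the text, and the sketch assumes every outcome really is equally likely.

```python
from fractions import Fraction


def equally_likely_probability(ways: int, total: int) -> Fraction:
    """Probability of an event when every outcome is equally likely."""
    if total <= 0 or not 0 <= ways <= total:
        raise ValueError("need total > 0 and 0 <= ways <= total")
    return Fraction(ways, total)


# One way to get a head on a coin with two sides.
p_head = equally_likely_probability(1, 2)
print(p_head, float(p_head))  # 1/2 0.5
```

Keeping the result as a fraction preserves the "ways/total" form of the formula; converting to a decimal is only needed when reporting the probability.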

### Dice and Coins

#### Binary probabilities: yes or no, up or down, heads or tails

##### A penny

P(head on a penny) = one way to get a head/two sides = 1/2 = 0.5 or 50%

That probability, 0.5, is the probability of getting a head (or, equally, a tail) **prior** to the toss. Once the toss is done, the coin is either a head or a tail, 1 or 0, all or nothing. There is no 0.5 probability anymore.

Over any ten tosses there is no guarantee of five heads and five tails: probability does not work like that. Over any small sample the ratios of expected outcomes can differ from the mathematically calculated ratios.

Over thousands of tosses, however, the ratio of outcomes, such as the proportion of tosses that come up heads, will approach the mathematically predicted amount. We refer to this as the *law of large numbers*.
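A simulation can make this behaviour visible. The sketch below tosses a simulated fair penny using Python's pseudo-random number generator. The function name and the seed value are arbitrary choices for illustration, and a simulated coin is only a stand-in for a real one.

```python
import random


def proportion_of_heads(tosses: int, rng: random.Random) -> float:
    """Toss a simulated fair coin `tosses` times; return the share of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses


rng = random.Random(2024)  # fixed seed (arbitrary) so the run can be repeated
for n in (10, 100, 10_000, 100_000):
    # Small n can stray far from 0.5; large n should sit close to it.
    print(n, proportion_of_heads(n, rng))
```

With only ten tosses the printed proportion may be well away from 0.5, while the proportions for the larger runs should settle near 0.5.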

In effect, a few tosses is a sample from a population that consists, theoretically, of an infinite number of tosses. Thus we can speak about a population mean μ for an infinite number of tosses. That population mean μ is the mathematically predicted probability.

Population mean μ = (number of ways to get a desired outcome)/(total possible outcomes)

#### Dice: Six-sided

A six-sided die. Six sides. Each side equally likely to appear. Six total possible outcomes. Only one way to roll a one: the side with a single pip must face up. 1 way to get a one/6 possible outcomes = 0.1667 or 17%

P(1) = 0.17

#### Dice: Four, eight, twelve, and twenty sided

The formula remains the same: the number of possible ways to get a particular roll divided by the number of possible outcomes (that is, the number of sides!).

Think about this: what would a three sided die look like? How about a two-sided die? What about a one sided die? What shape would that be? Is there such a thing?

#### Two dice

Ways to get a five on two dice: 1 + 4 = 5, 2 + 3 = 5, 3 + 2 = 5, 4 + 1 = 5 (each die is unique). Four ways to get a five/36 total possibilities = 4/36 = 0.11 or 11%
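The count of four ways out of 36 can be checked by listing every possible pair of rolls. This is a sketch only; the function name `probability_of_sum` is an illustrative assumption.

```python
from fractions import Fraction
from itertools import product


def probability_of_sum(target: int, sides: int = 6) -> Fraction:
    """Probability that two fair dice with `sides` sides add up to `target`."""
    # Each die is unique, so (1, 4) and (4, 1) are separate outcomes.
    rolls = list(product(range(1, sides + 1), repeat=2))
    ways = [roll for roll in rolls if sum(roll) == target]
    return Fraction(len(ways), len(rolls))


p_five = probability_of_sum(5)
print(p_five, round(float(p_five), 2))  # 1/9 0.11
```

The fraction prints as 1/9 because 4/36 is reduced to lowest terms.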

Homework:

- What is the probability of rolling a three on...
  - A four sided die?
  - A six sided die?
  - An eight sided die?
  - A twelve sided die?
  - A twenty sided die labeled 0-9 twice?
- What is the probability of throwing two pennies and having both come up heads?

## 5.2 Sample space

The sample space is the set of all possible outcomes in an experiment or system.

Bear in mind that the following is an oversimplification of the complex biogenetics of achromatopsia for the sake of a statistics example. Achromatopsia is controlled by a pair of genes, one from the mother and one from the father. A child is born an achromat when the child inherits a recessive gene from both the mother and father.

A is the dominant gene

a is the recessive gene

A person with the combination AA is "double dominant" and has "normal" vision.

A person with the combination Aa is termed a carrier and has "normal" vision.

A person with the combination aa has achromatopsia.

Suppose two carriers, Aa, marry and have children. The sample space for this situation is as follows:

father \ mother | A | a |
---|---|---|
A | AA | Aa |
a | Aa | aa |

The above diagram of all four possible outcomes represents the sample space for this exercise. Note that for each and every child there is only one possible outcome. The outcomes are said to be mutually exclusive and independent. Each outcome is as likely as any other individual outcome. All possible outcomes can be calculated: the sample space is completely known. Therefore the above involves probability and not statistics.

The probability of these two parents bearing a child with achromatopsia is:

P(achromat) = one way for the child to inherit aa/four possible combinations = 1/4 = 0.25 or 25%

This does NOT mean one in every four children will necessarily be an achromat. Suppose they have eight children. While it could turn out that exactly two children (25%) would have achromatopsia, other likely results are a single child with achromatopsia or three children with achromatopsia. Less likely, but possible, would be results of no achromat children or four achromat children. If we decide to work from actual results and build a frequency table, then we would be dealing with statistics.
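The claim about eight children can be checked with a short calculation. The sketch below uses the binomial formula, which this section does not introduce. It assumes each birth is independent and that every child has the same 0.25 probability of achromatopsia. The function name is illustrative.

```python
from math import comb


def probability_of_k_achromats(k: int, children: int = 8, p: float = 0.25) -> float:
    """Probability that exactly k of `children` children are achromats.

    Assumes independent births, each with the same probability p.
    """
    return comb(children, k) * p**k * (1 - p) ** (children - k)


for k in range(5):
    print(k, round(probability_of_k_achromats(k), 3))
```

Under these assumptions the probability is about 0.31 for exactly two achromat children, 0.27 for one, 0.21 for three, 0.10 for none, and 0.09 for four, which matches the ordering described above.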

The probability of bearing a carrier is:

P(carrier) = two ways for the child to inherit Aa/four possible combinations = 2/4 = 0.50

Note that while each outcome is equally likely, there are TWO ways to get a carrier, which results in a 50% probability of a child being a carrier.
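The sample space for two carriers can also be built by listing every pairing of one gene from each parent. This is a sketch of the simplified two-gene model used in this section, not of real genetics, and the function name `offspring_probabilities` is an illustrative assumption.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_probabilities(father: str, mother: str) -> dict:
    """Probability of each gene combination for a child of the two parents.

    Each parent is a two-letter string such as "Aa" and passes on one letter.
    """
    # Sort the letters so that "aA" and "Aa" count as the same combination.
    combos = ["".join(sorted(f + m)) for f, m in product(father, mother)]
    counts = Counter(combos)
    return {combo: Fraction(n, len(combos)) for combo, n in counts.items()}


carriers = offspring_probabilities("Aa", "Aa")
print(carriers["aa"], carriers["Aa"])  # 1/4 1/2
```

Counting the combinations rather than hard-coding the table shows why a carrier is twice as likely as an achromat: Aa appears in two of the four cells.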

At your desk: mate an achromat aa father and carrier mother Aa.

- What is the probability a child will be born an achromat? P(achromat) = ________
- What is the probability a child will be born with "normal" vision? P("normal") = ______

Homework: Mate an AA father and an achromat aa mother.

- What is the probability a child will be born an achromat? P(achromat) = ________
- What is the probability a child will be born with "normal" vision? P("normal") = ______

See: http://www.achromat.org/ for more information on achromatopsia.

Genetically linked schizophrenia is another genetic example:

Mol Psychiatry. 2003 Jul;8(7):695-705, 643.

Genome-wide scan in a large complex pedigree with predominantly male schizophrenics from the island of Kosrae: evidence for linkage to chromosome 2q.

Wijsman EM, Rosenthal EA, Hall D, Blundell ML, Sobin C, Heath SC, Williams R, Brownstein MJ, Gogos JA, Karayiorgou M.

Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA.

It is widely accepted that founder populations hold promise for mapping loci for complex traits. However, the outcome of these mapping efforts will most likely depend on the individual demographic characteristics and historical circumstances surrounding the founding of a given genetic isolate. The 'ideal' features of a founder population are currently unknown. The Micronesian islandic population of Kosrae, one of the four islands comprising the Federated States of Micronesia (FSM), was founded by a small number of settlers and went through a secondary genetic 'bottleneck' in the mid-19th century. The potential for reduced etiological (genetic and environmental) heterogeneity, as well as the opportunity to ascertain extended and statistically powerful pedigrees makes the Kosraen population attractive for mapping schizophrenia susceptibility genes. Our exhaustive case ascertainment from this islandic population identified 32 patients who met DSM-IV criteria for schizophrenia or schizoaffective disorder. Three of these were siblings in one nuclear family, and 27 were from a single large and complex schizophrenia kindred that includes a total of 251 individuals. One of the most startling findings in our ascertained sample was the great difference in male and female disease rates. A genome-wide scan provided initial suggestive evidence for linkage to markers on chromosomes 1, 2, 3, 7, 13, 15, 19, and X. Follow-up multipoint analyses gave additional support for a region on 2q37 that includes a schizophrenia locus previously identified in another small genetic isolate, with a well-established recent genealogical history and a small number of founders, located on the eastern border of Finland. In addition to providing further support for a schizophrenia susceptibility locus at 2q37, our results highlight the analytic challenges associated with extremely large and complex pedigrees, as well as the limitations associated with genetic studies of complex traits in small islandic populations.

PMID: 12874606 [PubMed - indexed for MEDLINE]

The above article is fascinating and, at the same time, raises privacy issues. On the small island of Kosrae "three siblings from one nuclear family" are identifiable people.

## 5.3 Relative Frequency

The third way to assign probabilities is from relative frequencies. Each relative frequency represents a probability of that event occurring for that sample space. Body fat percentage data was gathered from 59 females here at the College since summer 2001. The data had the following characteristics:

Statistic | Value |
---|---|
count | 59 |
mean | 28.7 |
sx | 7.1 |
min | 15.6 |
max | 50.1 |

A five class frequency and relative frequency table has the following results:

BFI = Body Fat Index (percentage*100)

CLL = Class (bin) Lower Limit

CUL = Class (bin) Upper Limit (Excel uses)

Note that the classes are not equal width in this example.

Medical Category | BFI fem CUL x | Frequency f | Relative Frequency f/n or P(x) |
---|---|---|---|
Athletically fit* | 20 | 3 | 0.05 |
Physically fit | 24 | 15 | 0.25 |
Acceptable | 31 | 24 | 0.41 |
Borderline obese (overfat) | 39 | 12 | 0.20 |
Medically obese | 51 | 5 | 0.08 |
Sample size n: | | 59 | 1.00 |

* body fat percentage category

This means there is a...

- 0.05 (5%) probability of a female student in the sample having a body fat percentage between 12 and 20 (athletically fit)
- 0.25 (25%) probability of a female student in the sample having a body fat percentage between 20.1 (the Tanita unit only measured to the nearest tenth) and 24 (physically fit)
- 0.41 (41%) probability of a female student in the sample having a body fat percentage between 24.1 and 31 (acceptable but not fit level of fat)
- 0.20 (20%) probability of a female student in the sample having a body fat percentage between 31.1 and 39 (on the borderline between acceptable and obese)
- 0.08 (8%) probability of a female student in the sample having a body fat percentage between 39.1 and 51 (medically obese)

The most probable (most likely) result is a body fat measurement between 24.1 and 31, with a 41% probability of a student being in that interval.
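The relative frequency column is each class frequency divided by the sample size. A minimal sketch of that arithmetic, using the frequencies for the female students from the table above (the variable and function names are illustrative):

```python
# Class frequencies for the female students, copied from the table above.
female_frequencies = {
    "Athletically fit": 3,
    "Physically fit": 15,
    "Acceptable": 24,
    "Borderline obese (overfat)": 12,
    "Medically obese": 5,
}


def relative_frequencies(frequencies: dict) -> dict:
    """Divide each frequency f by the sample size n to get f/n."""
    n = sum(frequencies.values())
    return {category: f / n for category, f in frequencies.items()}


rel = relative_frequencies(female_frequencies)
for category, p in rel.items():
    print(f"{category}: {p:.2f}")

# The most probable category is the one with the largest relative frequency.
most_likely = max(rel, key=rel.get)
print(most_likely)  # Acceptable
```

The printed values round to the same two-decimal relative frequencies shown in the table, and they sum to 1 because every student falls in exactly one class.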

The same table, but for male students:

Medical Category | BFI male CUL x | Frequency f | Relative Frequency f/n or P(x) |
---|---|---|---|
Athletically fit* | 13 | 9 | 0.18 |
Physically fit | 17 | 11 | 0.22 |
Acceptable | 20 | 10 | 0.20 |
Borderline obese (overfat) | 25 | 9 | 0.18 |
Medically obese | 50 | 12 | 0.24 |
Sample size n: | | 51 | 1.00 |

In these samples, the male students have a higher probability of being medically obese (0.24) than the female students (0.08)!

#### Kosraens abroad: Another example

What is the probability that a Kosraen lives outside of Kosrae? An informal survey done on the 25th of December 2007 produced the following data. The table also includes data gathered Christmas 2003.

*Kosraen population estimates*

Location | 2003 Conservative | 2003 Possible | 2007 | Growth |
---|---|---|---|---|
Ebeye | - | - | 30 | - |
Guam | 200 | 300 | 300 | 50% |
Honolulu | 600 | 1000 | 1000 | 67% |
Kona | 200 | 200 | 800 | 300% |
Maui | 100 | 100 | 60 | -40% |
Pohnpei | 200 | 200 | 300 | 50% |
Seattle | 200 | 200 | 600 | 200% |
Texas | 200 | 200 | N/A | - |
Virginia Beach | 200 | 200 | N/A | - |
USA Other | - | 200 | N/A | - |
Diaspora sums: | 1700 | 2400 | 3090 | - |
Kosrae | 7663 | 7663 | 8183 | - |
Est. Total Pop.: | 9363 | 10063 | 11273 | - |
Percentage abroad: | 18.2% | 23.8% | 27% | 48% |

The relative frequency of 27% is a point estimate for the probability that a Kosraen lives outside of Kosrae.
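The 27% figure is the 2007 diaspora sum divided by the estimated total population. A short sketch of that calculation, using the 2007 column of the table above (variable names are illustrative, and locations marked N/A for 2007 are left out):

```python
# 2007 estimates of Kosraens living outside Kosrae, from the table above.
diaspora_2007 = {
    "Ebeye": 30,
    "Guam": 300,
    "Honolulu": 1000,
    "Kona": 800,
    "Maui": 60,
    "Pohnpei": 300,
    "Seattle": 600,
}
kosrae_2007 = 8183  # population on Kosrae itself

abroad = sum(diaspora_2007.values())
total = abroad + kosrae_2007

# Relative frequency = point estimate of P(a Kosraen lives outside Kosrae).
p_abroad = abroad / total
print(abroad, total, round(p_abroad, 2))  # 3090 11273 0.27
```

Because the survey was informal, the result is only as good as the location estimates that feed it.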

### Law of Large Numbers

For relative frequency probability calculations, as the sample size increases the probabilities tend to get closer and closer to the true population parameter (the actual probability for the population). Larger samples tend to give more accurate estimates.