Thursday, November 21, 2019

Skittles Statistic Project

                This is my reflection on my Skittles project for my Math 1040 Statistics class.  This project has been fascinating for several reasons.  Statistics has real-world implications.  The purpose of statistics is to use data from a sample to answer questions about a larger population.  For this project, two questions were ultimately asked: “What is the expected mean number of Skittles for the population?” and “What is the expected mean proportion of yellow Skittles for the population?”
                The process wasn’t at all what I expected, and that’s a good thing.  Probably the most useful thing I learned is what confidence intervals imply: no statistical question about a population is ever answered with perfect accuracy, so the answer is never given as an absolute.  Even though a sample can be measured exactly, it is impossible to find the actual answers for an entire population, especially since the population in this case is the world.  That is why population answers always begin with “we expect,” and why we use a confidence interval.  Attaching a confidence percentage to a population prediction shows the humanity in these studies, while the margin of error lets the prediction stay reasonably accurate.
                These statistics skills can be incredibly useful for other classes that require math skills.  It was also incredibly useful to learn how to create the different kinds of charts needed to visually convey different data sets.  I think another lesson the instructor wanted us to learn is that, when working with statistics, you shouldn’t go through the entire process all by yourself, which is why he assigned us to groups: to get different outlooks and opinions on statistical definitions, as well as on how the data was turning out.
                With all of that being said, here’s how my project report turned out.  Enjoy!
          


          We were asked to buy a 2.17-ounce bag of Original Skittles.  The purpose was to collect a sample from each classmate in order to estimate the population proportion of yellow Skittles found in these bags.
          Naturally, the first thing I did was buy a bag of Skittles that met those requirements.  Here is the data from my bag:


          Count Red   Count Orange   Count Yellow   Count Green   Count Purple   Total
My Bag    10          8              17             17            7              59













          The next thing we were asked to do was to make a prediction about what the overall proportions of each color would be.  Here were my predictions:


                                      Red    Orange   Yellow   Green   Purple
Predicted proportion for each color   0.15   0.14     0.30     0.30    0.11


          I tried to keep my prediction as close to the count in my personal bag as possible.  I figured that would be a great place to start hypothesizing.  After receiving the data from the other students, here's how the results turned out:


                                                 Red    Orange   Yellow   Green   Purple   Total Count
Counts for my bag                                10     8        17       17      7        59
Counts for the entire class sample               1206   1171     1196     1125    1094     5792
Actual Proportions for my bag                    0.17   0.14     0.29     0.29    0.12     1
Actual Proportions for the entire class sample   0.21   0.20     0.21     0.19    0.19     1
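
The proportions in the table are each color's count divided by the total count.  Here is a minimal Python sketch of that arithmetic, using the counts from the table.  The names (`proportions`, `my_bag`, `class_sample`) are just illustrative choices, not part of the original assignment:

```python
# Color counts copied from the table in this report.
COLORS = ["Red", "Orange", "Yellow", "Green", "Purple"]
my_bag = [10, 8, 17, 17, 7]
class_sample = [1206, 1171, 1196, 1125, 1094]


def proportions(counts):
    """Return each count divided by the total, rounded to two decimals."""
    total = sum(counts)
    return [round(count / total, 2) for count in counts]


my_bag_props = proportions(my_bag)
class_props = proportions(class_sample)

# Print the two sets of proportions side by side, one row per color.
for color, mine, everyone in zip(COLORS, my_bag_props, class_props):
    print(f"{color:<7} my bag: {mine:.2f}   class: {everyone:.2f}")
```

Rounded to two decimals, the results match the "Actual Proportions" rows of the table.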


          Then I turned the data into a pie chart and a Pareto chart.  Here's what those look like:







          And then I made my statement on the project up to this point:

          "This study so far has been surprisingly informative.  I wasn’t very surprised by how random the colors of skittles were in my personal bag.  I figured it would be like that.  And looking at other class members’ personal skittles confirmed to me just how random theirs were as well.  That didn’t surprise me either.  What did surprise me, however, was how close to equal the color proportions became once everyone’s bags of skittles came together.  With the entire class sample, every color turned up somewhere around 20%.  I didn’t think those results would be so uniform.  I thought the overall numbers would be as random as any ordinary bag of skittles.  I’m looking forward to learning more about these statistics."


          As part of this project, we were also assigned to different groups.  The instructor asked us specific questions throughout the semester, and we as a group would determine our group answer and respond.  We were asked, "Does the class data represent a random sample? Why or why not? What population are we sampling from?"  Here was our response:

"As a group we discussed whether the class data was a random sample or not. We concluded that it is a random sample because each bag of skittles was purchased by individual people at different places and at different times.
We were unsure of what defined the population but concluded that it could be all of the 2.17oz bags of skittles in the US."


          The next thing I did was calculate the mean, standard deviation, and 5-number summary of the number of candies per bag for the data thus far.  Here are those results:

-    Mean Number of Candies Per Bag:  63.6

-    Standard Deviation of the Number of Candies Per Bag:  34.5

-    5-Number Summary for the Number of Candies Per Bag:  47, 57, 59, 60, 382
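
The 5-number summary above is enough to work out the fences that a boxplot uses to flag outliers.  Here is a minimal Python sketch, assuming the usual 1.5 × IQR rule (the report does not say which rule was used, so that part is an assumption).  The function name `is_outlier` is just illustrative:

```python
# Five-number summary from the report: min, Q1, median, Q3, max.
minimum, q1, median, q3, maximum = 47, 57, 59, 60, 382

# ASSUMPTION: outliers are flagged with the common 1.5 * IQR rule.
iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # counts below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr     # counts above this are flagged as outliers


def is_outlier(count):
    """True when a bag count falls outside the 1.5 * IQR fences."""
    return count < lower_fence or count > upper_fence


print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("382-candy bag is an outlier:", is_outlier(maximum))
```

Under this rule the fences land at 52.5 and 64.5, so any bag with more than 64 candies (or fewer than 53) would be flagged.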


          Then I turned the bag count into a histogram and into a boxplot.  Here's what those look like:




          If this data looks incredibly skewed, it's because there was a student who, for some reason, didn't stick to the rules and decided to buy a Skittles bag with 382 Skittles in it.  Wanting to be as fiercely loyal to the instructor's instructions as possible (and given that he gave us the data for this enormous bag), I put the data in.  Here was my summary after making these graphs:

          "The difficult part about these graphs is the random skittles bag filled with 382 skittles.  It sort of messes up the overall look of both graphs.  But it was in the class sample, so it’s here.  As far as the shape of the distribution is concerned, it’s sort of bizarre.  It looks skewed to the right, but it doesn’t taper off gradually on either side.  Most bags are in the 55-60 range, and then the counts drop significantly on both sides.  But there are many more bag counts above the 60 range, which leads me to conclude that the shape is skewed right.  I expected this to happen, considering that most of the numbers upon initial viewing were clustered around 55-60.  As far as outliers are concerned, there are quite a few that exceed the upper fence.  Those outliers are 65, 66, 86, 89, 91, both 95’s, and, of course, 382.  Again, everything would look fine and be presented in an orderly manner, but the 382 bag really jacks things up."


In our next group discussion, we were asked three questions.  Here are our responses to each question:
What is the difference between qualitative and quantitative data?  "Quantitative data is data that can be counted or measured and expressed in numbers. Qualitative data is descriptive and based on traits and characteristics."
What types of graphs make sense and what types of graphs do not make sense for qualitative data? For quantitative data? Explain why.  "Pie and bar charts are best for qualitative data because they split the data into specific categories effectively. For quantitative data, histograms and dot plots are best because they show the numerical values."
What types of calculations (e.g., summary statistics) make sense and what types do not make sense for qualitative data? For quantitative data? Explain why. "The relative frequency calculation is good for both qualitative and quantitative data because it expresses the data as proportions, which makes it easier to read."


          The last thing I did was construct a 99% confidence interval for the population proportion of yellow Skittles, as well as a 95% confidence interval for the population mean number of Skittles per bag.  The equations didn't transfer over as well from Microsoft Word as I would have liked them to, but they did well enough for these purposes.  Here's what that looks like:

          1) According to the equation below (using both plus and minus versions and including the misplaced giant bag of skittles), we can be 99% confident that the population proportion of yellow skittles is between 0.196 and 0.224 (19.6% and 22.4%).
0.21 ± 2.58 × √( 0.21 × (1 − 0.21) / 5792 )

          2) According to the equation below (using both plus and minus versions and including the misplaced giant bag of candy), we can be 95% confident that the population mean number of skittles per bag is between 62 and 65 skittles.
63.6 ± 1.962 × ( 34.5 / √5792 )
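
Both intervals can be reproduced with a few lines of Python.  This is a minimal sketch using the numbers exactly as they appear in the two equations above; the function names are illustrative.  One thing worth flagging: the mean formula as written uses 5792, the total number of candies, as the sample size.  If the sample size is instead taken to be the number of bags, the interval changes.  The bag count is not stated in this report, so the last calculation assumes about 91 bags (inferred from 5792 / 63.6) and an approximate t critical value of 1.987 for 90 degrees of freedom:

```python
import math


def proportion_interval(p_hat, n, z):
    """Interval for a proportion: p_hat ± z * sqrt(p_hat * (1 - p_hat) / n)."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def mean_interval(mean, sd, n, critical):
    """Interval for a mean: mean ± critical * sd / sqrt(n)."""
    margin = critical * sd / math.sqrt(n)
    return mean - margin, mean + margin


# 1) 99% interval for the proportion of yellow candies (values from the report).
yellow_low, yellow_high = proportion_interval(p_hat=0.21, n=5792, z=2.58)
print(f"Yellow proportion: {yellow_low:.3f} to {yellow_high:.3f}")

# 2) 95% interval for the mean, exactly as written in the report (n = 5792).
mean_low, mean_high = mean_interval(mean=63.6, sd=34.5, n=5792, critical=1.962)
print(f"Mean candies per bag (n = 5792): {mean_low:.1f} to {mean_high:.1f}")

# ASSUMPTION: about 91 bags, inferred from 5792 / 63.6; not stated in the report.
# ASSUMPTION: 1.987 approximates the t critical value for 90 degrees of freedom.
bags_low, bags_high = mean_interval(mean=63.6, sd=34.5, n=91, critical=1.987)
print(f"Mean candies per bag (n = 91 bags): {bags_low:.1f} to {bags_high:.1f}")
```

The first two results match the intervals reported above (about 0.196 to 0.224, and about 62.7 to 64.5).  Under the bag-count assumption the mean interval comes out noticeably wider, roughly 56 to 71, mostly because the giant bag inflates the standard deviation.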


          For our last group discussion, we were asked two questions.  Here they are, as well as our responses to them:
Explain in general the purpose and meaning of a confidence interval.  "How well a statistical model estimates the underlying population can have issues. Confidence intervals address these issues by giving a range of values which are likely to contain the population parameter in question."
What factors affect the width of a confidence interval? Why?  "A factor that can affect the width of a confidence interval is whether the standard deviation is known or needs to be estimated. In most real-world situations the standard deviation is unknown, so the confidence interval width would need to be adjusted to accommodate the estimation.
Another factor would be sample size. A larger sample size means the data will be closer to the population parameter, so the confidence interval width can be smaller."
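
The sample-size effect described in that answer is easy to see with numbers.  Here is a minimal Python sketch, using the standard deviation from this report (34.5) and a few made-up sample sizes.  The sample sizes and the function name `margin_of_error` are hypothetical, chosen only for illustration.  It also shows a third factor, the confidence level, since a higher level uses a larger critical value:

```python
import math


def margin_of_error(sd, n, critical=1.96):
    """Half-width of a confidence interval for a mean: critical * sd / sqrt(n)."""
    return critical * sd / math.sqrt(n)


# Standard deviation from the report; the sample sizes are hypothetical.
sd = 34.5
sample_sizes = [25, 100, 400, 1600]
margins = [margin_of_error(sd, n) for n in sample_sizes]

for n, margin in zip(sample_sizes, margins):
    print(f"n = {n:>4}: margin of error = {margin:.2f}")

# A higher confidence level uses a larger critical value, so the interval widens.
margin_95 = margin_of_error(sd, 100, critical=1.96)
margin_99 = margin_of_error(sd, 100, critical=2.58)
print(f"n = 100 at 95%: {margin_95:.2f}, at 99%: {margin_99:.2f}")
```

Because the sample size sits under a square root, quadrupling it only cuts the margin of error in half.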
 

          This was a very fascinating study.  This was a good project to really teach us how to take sample statistics and construct a confidence interval to try and figure out a desired, specific proportion.  I feel much more confident that if I wanted to conduct another study of attempting to find a mean or proportion of something, I have the necessary tools and knowledge to do so.
