




Stochastic Modeling and Mathematical Statistics: A Text for Statisticians and Quantitative Scientists

Francisco J. Samaniego


Provides a Solid Foundation for Statistical Modeling and Inference and Demonstrates Its Breadth of Applicability

Stochastic Modeling and Mathematical Statistics: A Text for Statisticians and Quantitative Scientists addresses core issues in post-calculus probability and statistics in a way that is useful for statistics and mathematics majors as well as students in the quantitative sciences. The book's conversational tone, which conveys the mathematical justification behind widely used statistical methods in a reader-friendly manner, together with its many examples, tutorials, exercises, and problems for solution, makes it an effective resource that students can read and learn from and that instructors can count on as a worthy complement to their lectures.

Using classroom-tested approaches that engage students in active learning, the text offers instructors the flexibility to control the mathematical level of their course. It contains the mathematical detail that is expected in a course for "majors" but is written in a way that emphasizes the intuitive content in statistical theory and the way theoretical results are used in practice. More than 1000 exercises and problems at varying levels of difficulty and with a broad range of topical focus give instructors many options in assigning homework and provide students with many problems on which to practice and from which to learn.

Publisher:

Chapman and Hall/CRC

Series:

Chapman & Hall/CRC Texts in Statistical Science



CHAPMAN & HALL/CRC Texts in Statistical Science Series

Series Editors: Francesca Dominici (Harvard School of Public Health, USA), Julian J. Faraway (University of Bath, UK), Martin Tanner (Northwestern University, USA), Jim Zidek (University of British Columbia, Canada)

Stochastic Modeling and Mathematical Statistics: A Text for Statisticians and Quantitative Scientists

Francisco J. Samaniego
University of California, Davis, USA

CRC Press, Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2014 by Taylor & Francis Group, LLC. CRC Press is an imprint of Taylor & Francis Group, an Informa business.

No claim to original U.S. Government works
Version Date: 20131126
International Standard Book Number-13: 978-1-4665-6047-5 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify it in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Dedication and Acknowledgments

It seems fitting that I dedicate this book to my students. I've had thousands of them over my long career, and I've learned a lot from them: from the genuine curiosity behind many of their questions and from their growth as they confronted the intellectual challenges I set before them. The challenge of challenging them to think hard about new ideas has shaped who I am as a teacher. In particular, the many students I have taught in my probability and mathematical statistics courses over the years contributed mightily to the development of this book. I am grateful for their trust as I guided them through what occasionally looked to them like a minefield. Teaching these students has been a joy for me, and I will always appreciate the gracious reception they have given me as their teacher.
The students whom I taught in the fall and winter quarters of this academic year were especially helpful to me as I "classroom tested" the penultimate version of the book. Thanks to their watchful eyes, readers have been spared hundreds of typos. I would like to thank a number of individuals who provided significant assistance to me in the writing of this text. I consider the problems to be the most important part of the book, and I am grateful to Apratim Ganguly, Kimi Noguchi, Anzhi Gu, and Zhijie Zheng for patiently working through hundreds of problems and providing solutions that could be made available to students and/or instructors who use the text. The steady and sage advice of my editor John Kimmel is much appreciated. A special benefit of John's stewardship was the high quality of the reviewers he recruited to comment on early versions of the text. They all contributed to making this a better book. My sincere thanks to Adam Bowers (UC San Diego), James Gentle (George Mason University), Solomon Harrar (University of Montana), Wesley Johnson (UC Irvine), Lawrence Leemis (College of William & Mary), Elena Rantou (George Mason University), Ralph P. Russo (University of Iowa), and Gang Wang (DePaul University). I am also most grateful to Gail Gong for her helpful advice on Chapter 12 and to Ethan Anderes for his help with the graphics on the book's cover. Christopher Aden took my somewhat primitive version of the text and made it sing on Chapman and Hall's LaTeX template. Thanks, Chris, for your timely and high-quality work! Finally, I thank my wife, Mary O'Meara Samaniego, for her patience with this project. I am especially grateful for her plentiful corporal and moral support. And a special thanks to Elena, Moni, Keb, Jack, and Will. It's hard for me to imagine a more supportive and loving family.

F.J.
Samaniego
June 2013

Contents

Preface for Students
Preface for Instructors

1 The Calculus of Probability
  1.1 A Bit of Background
  1.2 Approaches to Modeling Randomness
  1.3 The Axioms of Probability
  1.4 Conditional Probability
  1.5 Bayes' Theorem
  1.6 Independence
  1.7 Counting
  1.8 Chapter Problems

2 Discrete Probability Models
  2.1 Random Variables
  2.2 Mathematical Expectation
  2.3 The Hypergeometric Model
  2.4 A Brief Tutorial on Mathematical Induction (Optional)
  2.5 The Binomial Model
  2.6 The Geometric and Negative Binomial Models
  2.7 The Poisson Model
  2.8 Moment-Generating Functions
  2.9 Chapter Problems

3 Continuous Probability Models
  3.1 Continuous Random Variables
  3.2 Mathematical Expectation for Continuous Random Variables
  3.3 Cumulative Distribution Functions
  3.4 The Gamma Model
  3.5 The Normal Model
  3.6 Other Continuous Models
    3.6.1 The Beta Model
    3.6.2 The Double Exponential Distribution
    3.6.3 The Lognormal Model
    3.6.4 The Pareto Distribution
    3.6.5 The Weibull Distribution
    3.6.6 The Cauchy Distribution
    3.6.7 The Logistic Model
  3.7 Chapter Problems

4 Multivariate Models
  4.1 Bivariate Distributions
  4.2 More on Mathematical Expectation
  4.3 Independence
  4.4 The Multinomial Distribution (Optional)
  4.5 The Multivariate Normal Distribution
  4.6 Transformation Theory
    4.6.1 The Method of Moment-Generating Functions
    4.6.2 The Method of Distribution Functions
    4.6.3 The Change-of-Variable Technique
  4.7 Order Statistics
  4.8 Chapter Problems

5 Limit Theorems and Related Topics
  5.1 Chebyshev's Inequality and Its Applications
  5.2 Convergence of Distribution Functions
  5.3 The Central Limit Theorem
  5.4 The Delta Method Theorem
  5.5 Chapter Problems

6 Statistical Estimation: Fixed Sample Size Theory
  6.1 Basic Principles
  6.2 Further Insights into Unbiasedness
  6.3 Fisher Information, the Cramér-Rao Inequality, and Best Unbiased Estimators
  6.4 Sufficiency, Completeness, and Related Ideas
  6.5 Optimality within the Class of Linear Unbiased Estimators
  6.6 Beyond Unbiasedness
  6.7 Chapter Problems

7 Statistical Estimation: Asymptotic Theory
  7.1 Basic Principles
  7.2 The Method of Moments
  7.3 Maximum Likelihood Estimation
  7.4 A Featured Example: Maximum Likelihood Estimation of the Risk of Disease Based on Data from a Prospective Study of Disease
  7.5 The Newton-Raphson Algorithm
  7.6 A Featured Example: Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm
  7.7 Chapter Problems

8 Interval Estimation
  8.1 Exact Confidence Intervals
  8.2 Approximate Confidence Intervals
  8.3 Sample Size Calculations
  8.4 Tolerance Intervals (Optional)
  8.5 Chapter Problems

9 The Bayesian Approach to Estimation
  9.1 The Bayesian Paradigm
  9.2 Deriving Bayes Estimators
  9.3 Exploring the Relative Performance of Bayes and Frequentist Estimators
  9.4 A Theoretical Framework for Comparing Bayes vs. Frequentist Estimators
  9.5 Bayesian Interval Estimation
  9.6 Chapter Problems

10 Hypothesis Testing
  10.1 Basic Principles
  10.2 Standard Tests for Means and Proportions
  10.3 Sample Size Requirements for Achieving Pre-specified Power
  10.4 Optimal Tests: The Neyman-Pearson Lemma
  10.5 Likelihood Ratio Tests
  10.6 Testing the Goodness of Fit of a Probability Model
  10.7 Fatherly Advice about the Perils of Hypothesis Testing (Optional)
  10.8 Chapter Problems

11 Estimation and Testing for Linear Models
  11.1 Simple Linear Regression
  11.2 Some Distribution Theory for Simple Linear Regression
  11.3 Theoretical Properties of Estimators and Tests under the SLR Model
  11.4 One-Way Analysis of Variance
  11.5 The Likelihood Ratio Test in One-Way ANOVA
  11.6 Chapter Problems

12 Nonparametric Statistical Methods
  12.1 Nonparametric Estimation
  12.2 The Nonparametric Bootstrap
  12.3 The Sign Test
  12.4 The Runs Test
  12.5 The Rank Sum Test
  12.6 Chapter Problems

Appendix: Tables
Bibliography
Index

Preface for Students

Let me begin with a sincere welcome. This book was written with you in mind! As you probably know, new textbook projects are generally reviewed pretty carefully. The reviewers either tell the publisher to tank the project or they give the author their best advice about possible improvements. I'm indebted to the reviewers of this book for providing me with (a) much constructive criticism and (b) a good deal of encouragement. I especially appreciated one particular encouraging word. Early on, a reviewer commented about the style of the book, saying that he or she liked its conversational tone; it read, the reviewer said, as if I were just talking to some students sitting around my desk during office hours. I liked this comment because it sort of validated what I had set out to do. Reading a book that uses mathematical tools and reasoning doesn't have to be a painful experience. It can be, instead, stimulating and enjoyable; discovering a new insight or a deeper understanding of something can be immensely satisfying. Of course, it will take some work on your part. But you know that already.
Just like acquiring anything of value, learning about the mathematical foundations of probability and statistics will require the usual ingredients needed for success: commitment, practice, and persistence. Talent doesn't hurt either, but you wouldn't be where you are today if you didn't have that. If you concentrate on the first three attributes, things should fall into place for you. In this brief preface, my aim is to give you some advice about how to approach this textbook and a course in which it is used. First, I'd recommend that you review your old calculus book. It's not that calculus permeates every topic taken up in this book, but the tools of differential and integral calculus are directly relevant to many of the ideas and methods we will study: differentiating a moment-generating function, integrating a density function, minimizing a variance, maximizing a likelihood. But calculus is not even mentioned until the last couple of sections of Chapter 2, so you have time for a leisurely yet careful review. That review is an investment you won't regret. Most students will take a traditional-style course in this subject; that is, you will attend a series of lectures, have the benefit of some direct interaction with the instructor and with graduate teaching assistants, and work on assigned problem sets or on problems just for practice. While there is no unique strategy that guarantees success in this course, my prescription for success would certainly include the following: (1) Read ahead, so that when you attend a lecture on a given topic you already know what you don't yet understand. If you do, you'll be in a good position to focus on the particular elements of the day's topic that you need more information on, and you'll be prepared to ask questions that should clarify whatever seemed fuzzy on first reading. (2) Work as many problems as you have time for.
"Practice makes perfect" is more than a worn-out platitude. It's the truth! That's what distinguishes the platitudes that stick around from those that disappear. (3) Try to do problems by yourself first. The skill you are hoping to develop has to do with using the ideas and tools you are learning to solve new problems. You learn something from attempts that didn't work, but you learn the most from attempts that do. Too much discussion or collaboration (where the answers are revealed to you before you've given a problem your best effort) can interfere with your learning. (4) While it's true that mastering a new skill generally involves some suffering, you should not hesitate to seek help after giving a problem or a topic your honest effort. Receiving helpful hints from an instructor, TA, or tutor is generally more beneficial than just having the solution explained to you. It's also better than total frustration. So put in a decent effort and, if a problem seems resistant to being solved, go have a chat with your instructor or his/her surrogate. (5) Mathematical ideas and tools do not lend themselves to quick digestion, so give yourself some time to absorb the material in this course. Spreading out a homework assignment over several days, and studying for a test well before the eve of the exam, are both time-honored study habits that do help. To help make learning this subject less painful for you, I've included many reader-friendly explanations, hints, tutorials, discussions, and occasional revelations of the "tricks of the trade" in the text. I'd like to give you a few tips about how to "attack" this book. First, the book has lots of "problems" to be solved. I've placed exercises at the end of every section. I encourage you to work all of them after you've read a section, as they represent an immediate opportunity to test your understanding of the material.
For your convenience, and because I am, regardless of what you may have heard, a compassionate person who wants to be helpful, I've included in an Appendix the answers (or helpful comments) for all the exercises. So you can check your answer to confirm whether or not you've nailed the exercise. Some of these exercises may be assigned as homework by your instructor. It's OK, while you are first learning the subject, to be working toward a particular answer, although you will usually get the most benefit from looking up the answer only after you've solved, or at least seriously attempted to solve, the exercise. I should add that not all the exercises are simple applications of the textual material. Some exercises address rather subtle notions within a given topic. If you are able to do all the exercises, you can be confident that you've understood the material at a reasonably high level. Now, let me give you a heads-up about the sections in the book that are the most challenging. These sections will require special concentration on your part and may benefit from some collateral reading. I recommend that you spend more time than average reading and digesting Section 1.7 on combinatorics, Section 2.8 on moment-generating functions, Section 3.6 on "other" continuous distributions (since you may need to study and learn about these on your own), Section 4.6 on transformation theory, Sections 5.3 and 5.4 on the Central Limit Theorem and the delta method, Section 6.3 on Fisher information and the Cramér-Rao inequality, Section 6.4 on sufficiency, completeness, and minimum variance unbiased estimators, Section 10.4 on optimality in hypothesis testing, Section 11.3 on properties of estimators in regression, Section 12.1 on nonparametric estimation, and Section 12.2 on the nonparametric bootstrap.
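If you'd like a concrete preview of one of these topics, the nonparametric bootstrap of Section 12.2 lends itself to a few lines of simulation. The sketch below is not taken from the book; the data set and the number of resamples are invented purely for illustration. The idea is to approximate the standard error of the sample mean by repeatedly resampling the observed data with replacement:

```python
# Minimal nonparametric bootstrap sketch (illustrative only):
# estimate the standard error of the sample mean by resampling
# the observed data with replacement.
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

data = [2.1, 3.4, 1.9, 4.0, 2.8, 3.1, 2.5, 3.7]  # hypothetical sample
B = 2000  # number of bootstrap resamples (an arbitrary choice)

boot_means = []
for _ in range(B):
    # draw a resample of the same size as the data, with replacement
    resample = [random.choice(data) for _ in data]
    boot_means.append(statistics.mean(resample))

# The spread of the bootstrap means approximates the standard error
# of the sample mean.
se_hat = statistics.stdev(boot_means)
print("bootstrap estimate of the standard error of the mean:", round(se_hat, 3))
```

For this small sample the bootstrap estimate should land close to the textbook formula s/√n, which is one way to sanity-check the method before applying it to statistics whose standard errors have no closed form.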
I mention these sections specifically because they will require your careful attention and, perhaps, more "practice" than usual before you feel you have a good grasp of the material. You may wish to read more, and work additional problems, on these topics. Supporting material can be found in books that I've marked with an asterisk (*) in the bibliography. I wish you the best as you begin this exploration into stochastic modeling and mathematical statistics. I'll be very interested in hearing about your experience and also in having your feedback on the book. Feel free to contact me with your comments.

Francisco J. Samaniego
University of California, Davis
fjsamaniego@ucdavis.edu

Preface for Instructors

There are quite a few textbooks that treat probability and mathematical statistics at the advanced undergraduate level. They tend to fall into one of two categories: some cover the subject matter with the mathematical rigor that a graduate-school-bound mathematics or statistics major should see, while the rest cover the same topics with much less emphasis on mathematical development and with more attention to applications of the models and statistical ideas they present. But isn't it desirable for students in a "theoretical" course to be exposed to serious statistical applications, and for students in an "applications-oriented" course to be exposed to at least some of the mathematics that justifies the application of statistical modeling and inference in practice? This book offers instructors the flexibility to control the mathematical level of the course they teach by determining the mathematical content they choose to cover. It contains the mathematical detail that is expected in a course for "majors," but it is written in a way that facilitates its use in teaching a course that emphasizes the intuitive content in statistical theory and the way theoretical results are used in practice.
This book is based on notes that I have used to teach both types of courses over the years. From this experience, I've reached the following conclusions: (1) the core material for both courses is essentially the same; (2) the ideas and methods used in mathematical proofs of propositions of interest and importance in the field are useful to both audiences, being essential for the first and helpful to the second; (3) both audiences need to understand what the main theorems of the field say, and they especially need to know how these theorems are applied in practice; (4) it is possible, and even healthy, to have theory and application intertwined in one text. An appealing byproduct of this commingling of mathematical and applied thinking is that, through assigned, recommended, or even optional reading of sections of the text not formally covered, an instructor can effectively facilitate the desired "exposure" of students to additional theoretical and applied aspects of the subject. Having often been disappointed with the quantity and range of the problems offered in textbooks I've used in the past, I embarked on the writing of this book with the goal of including tons of good problems from which instructors could choose. That is not to say that you won't have the inclination to add problems of your own in the course that you teach. What I'm really saying is that you may not have to work as hard as usual in supplementing this book with additional problems. Every section ends with a small collection of "exercises" meant to enable students to test their own understanding of a section immediately after reading it. Answers to (or helpful comments on) all the exercises are given at the end of the book. A sizable collection of problems is gathered at the end of each chapter. For instructors who adopt this book as a text for a course, a Solutions Manual containing detailed solutions to all the even-numbered problems in the text is available from Chapman and Hall.
This book is intended as a text for a first course in probability and statistics. Some students will have had a previous "pre-calculus" introduction to statistics, and while that can be helpful in various ways, it is by no means assumed in this text. Every new idea in the text is treated from scratch. What I expect is that a course from this book would be a student's first calculus-based statistics course and their first course emphasizing WHY (rather than HOW) probability and statistics work. The mathematical prerequisite for this text is a course on differential and integral calculus. While the stronger the mathematical background of the student, the better, students who have taken a calculus sequence for majors in the sciences (i.e., non-math majors) will do just fine in the course. Since it's not uncommon for a student's calculus skills to get rusty, an early review of one's old calculus text is recommended. Occasional calculus tutorials in the text (e.g., on integration by parts, on changing variables of integration, and on setting limits of double integrals) are aimed at assisting students in their ongoing review. In the Statistics Department at the University of California, Davis, separate courses are offered on probability and mathematical statistics at the upper division level. The yearlong, more mathematical course is intended for Statistics and Mathematics majors, but is also taken by a fair number of students in computer science and engineering and by other "non-majors" with strong mathematical backgrounds and interests. The alternative course is a two-quarter sequence (known as the "brief course") which is taken by applied statistics and applied mathematics majors, by students working on a minor in statistics, by graduate students in quantitative disciplines ranging from engineering to genetics to quantitative social science, and by a few ambitious undergraduates.
The first group is, typically, already familiar with mathematical argumentation, and although the second group is capable of digesting a logical mathematical argument, they will need some careful guidance and encouragement before they get comfortable. If your course is mostly taken by stat and math majors, it can be thought of as the first course above. If such students are in the minority, your course may be thought of as the second. Both groups will get a solid grounding in the core ideas in probability and statistics. If this book is used in the course for majors, then most of the theorems treated in the book can be proven in the classroom or assigned as homework when not given in the text. The notions of combinatorial proofs and mathematical induction, which would typically be skipped in the brief course, can be treated and applied as in the text. When they are skipped, it is useful to state certain results proven by these methods that arise later in the text. In the first course, the instructor may wish to include problems involving proofs in both homework and exams. In the second course, I generally assign some "doable" proofs for homework, but exams don't ask for proofs. In both courses, I like to give open-book, problem-solving exams. After all, life itself is an open-book problem-solving exam. The present book retains the main topics, tools, and rigor of traditional math-stat books, and is thus suitable for a course for majors. But the book also contains careful intuitive explanations of theoretical results that are intended to provide students with the ability to apply these results with confidence, even when they have not studied or fully digested their proofs. I have found that this latter goal, while ambitious, is achievable in the classroom. This text is aimed at replicating successful classroom strategies in a text having the academic goals described above. Several sections of the book are labeled as "optional." 
I believe that the entire book is appropriate for the audience in the first course mentioned above, the course for "majors," and a two-semester or three-quarter course can very comfortably accommodate all twelve chapters of the text and still leave room for additional topics favored by the instructor. For the course aimed at non-majors, I include below a chapter-by-chapter discussion of how the text might be used. I have always contended that our main mission as teachers of a mathematical topic is (1) to make sure that the topic (the idea, the method, or the theorem statement) is clear and well understood and (2) to make sure that students understand why, when, and how the result may be applied. These goals can often be accomplished without a formal proof, although a proof does have a nice way of convincing a reader that a proposition is unquestionably true. What are this text's "special features"? Here is a "top-10 list," in the order in which various topics arise. Some of the topics mentioned would be needed in a course for majors but can be trimmed or skipped in a brief course.
The text (1) emphasizes "probability models" rather than "probability theory" per se, (2) presents the "key" stochastic tools: a careful treatment of moment-generating functions, bivariate models, conditioning, transformation theory, computer simulation of random outcomes, and the limit theorems that statisticians need (various modes of convergence, the central limit theorem, the delta method), (3) presents a full treatment of optimality theory for unbiased estimators, (4) presents the asymptotic theory for method of moments estimators (with proof) and for maximum likelihood estimators (without proof, but with numerous examples and a formal treatment of the Newton-Raphson and EM algorithms), (5) devotes a full chapter to the Bayesian approach to estimation, including a section on comparative statistical inference, (6) provides a careful treatment of the theory and applications of hypothesis testing, including the Neyman-Pearson Lemma and Likelihood Ratio Tests, (7) covers the special features of regression analysis and analysis of variance which utilize the theory developed in the core chapters, (8) devotes a separate chapter to nonparametric estimation and testing which includes an introduction to the bootstrap, (9) features serious scientific applications of the theory presented (including, for example, problems from fields such as conservation, engineering reliability, epidemiology, genetics, medicine, and wildlife biology), and (10) includes well over 1000 exercises and problems at varying levels of difficulty and with a broad range of topical focus. When used in ways that soften the mathematical level of the text (as I have done in teaching the brief course some 20 times in my career), it provides students in the quantitative sciences with a useful overview of the mathematical ideas and developments that justify the use of many applied statistical techniques. What advice do I have for instructors who use this book?
My answer for instructors teaching the first course described above, the course for "majors," is fairly straightforward. I believe that the text could be used pretty much as is. If an instructor wishes to enhance the mathematical level of the course to include topics like characteristic functions, a broader array of limit theorems, and statistical topics like robustness, such topics could be logically introduced in the context of Chapters 2, 5, 6, and 7. It seems likely that a year-long course will allow time for such augmentations. Regarding the augmentation of topics covered in the text, I believe that the most obvious and beneficial addition would be a broader discussion of linear model theory. The goal of Chapter 11 is to illustrate certain ideas and methods arising in earlier chapters (such as best linear unbiased estimators and likelihood ratio tests), and this goal is accomplished within the framework of simple linear regression and one-way analysis of variance. An expansion of my treatment of nonparametric testing in Chapter 12 is another reasonable possibility. My own choices for additional topics would be the Wilcoxon signed-rank test and the Kolmogorov-Smirnov tests for goodness of fit. Another topic that is often touched on in a course for majors is "decision theory." This is briefly introduced in Chapter 9 in the context of Bayesian inference, but the topic could easily have constituted a chapter of its own, and a broader treatment of decision theory could reasonably be added to a course based on the present text. My advice to instructors who use this text in teaching something resembling the brief course described above is necessarily more detailed. It largely consists of comments regarding what material might be trimmed or skipped without sacrificing the overall aims of the course.
My advice takes the form of a discussion, chapter by chapter, of what I suggest as essential, optional, or somewhere in between.

Chapter 1. Cover most of this chapter, both because it is foundational and because the mathematical proofs to which the students are introduced here are relatively easy to grasp. They provide a good training ground for learning how proofs are constructed. The first five theorems in Section 1.3 are particularly suitable for this purpose. Section 1.4 is essential. Encourage students to draw probability trees whenever feasible. Teach Bayes' Theorem as a simple application of the notion of conditional probability. The independence section is straightforward. Students find the subject of combinatorics the most difficult of the chapter. I recommend doing the poker examples, as students like thinking through these problems. I recommend skipping the final two topics of Section 1.8. The material on multinomial coefficients and combinatorial proofs can be recommended as optional reading.

Chapter 2. I recommend teaching all of Chapter 2, with the exception of Section 2.4 (on mathematical induction) even though Professor Beckenbach's joke (Theorem 2.4.3) offers some welcome comic relief for those who read the section. If you skip Section 2.4, I suggest you state Theorem 2.4.1, Theorem 2.4.2, and the result in Exercise 2.4.2 without proof, as these three facts are used later. Students find Section 2.8 on moment-generating functions to be the most challenging section in this chapter. It's true that mgfs have no inherent meaning or interpretation. The long section on them is necessary, I think, to get students to appreciate mgfs on the basis of the host of applications in which they can serve as useful tools.

Chapter 3. The first two sections are fundamental. Section 3.3 can be treated very lightly, perhaps with just a definition and an example.
In Section 3.4, the material on the Poisson process and the gamma distribution (following Theorem 3.4.4) may be omitted without great loss. It is an interesting connection which provides, as a byproduct, that the distribution function of a gamma model whose shape parameter α is an integer may be computed in closed form. But the time it takes to establish this may be better spent on other matters. (Some assigned reading here might be appropriate.) Section 3.5 is essential. Section 3.6 can be left as required reading. The models in this latter section will occur in subsequent examples, exercises, and problems (and probably also exams), so students would be well advised to look at these models carefully and make note of their basic properties.

Chapter 4. This chapter contains a treasure trove of results that statisticians need to know, and know well. I try to teach the entire chapter, with the exception of Section 4.4 on the multinomial model, which is labeled as optional. Included are definitions and examples of bivariate (joint) densities (or pmfs), marginal and conditional densities, expectations in a bivariate and multivariate setting, covariance, correlation, the mean and variance of a linear combination of random variables, results on iterated expectations (results I refer to as Adam's rule and Eve's rule), the ubiquitous bivariate normal model, and a comprehensive treatment of methods for obtaining the distribution of a transformed random variable Y = g(X) when the distribution of X is known. Section 4.7 can be covered briefly. Students need to see the basic formulae here, as many examples in the sequel employ order statistics.

Chapter 5. A light treatment is possible. In such a treatment, I recommend establishing Chebyshev's inequality and the weak law of large numbers, the definition of convergence in distribution, a statement of the Central Limit Theorem, and a full coverage of the delta method theorem.

Chapter 6.
I recommend treating the first two sections of this chapter in detail. In Section 6.3, I suggest omitting discussion of the Cauchy-Schwarz inequality and stating the Cramér-Rao Theorem without proof. From Section 6.4, I would suggest covering "sufficiency" and the Rao-Blackwell theorem and omitting the rest (or relegating it to assigned or optional reading). Section 6.5 on BLUEs is short and useful. Section 6.6 has important ideas and some lessons worth learning, and it should be covered in detail.

Chapter 7. Section 7.1 may be covered lightly. The discussion up to and including Example 7.1.1 is useful. The "big o" and the "small o" notation are used sparingly in the text and can be skipped when they are encountered. Sections 7.2 and 7.3 are the heart of this chapter and should be done in detail. Section 7.4 treats an important problem arising in statistical studies in epidemiology and provides an excellent example of the skillful use of the delta method. I recommend doing this section if time permits. Otherwise, it should be required reading. Section 7.5 should be done in some form, as students need to be familiar with at least one numerical method which can approximate optimum solutions when they can't be obtained analytically. Section 7.6 on the EM algorithm covers a technique that is widely used in applied work. It should be either covered formally or assigned as required reading.

Chapter 8. I suggest covering Sections 8.1 and 8.2, as they contain the core ideas. Sample size calculations, covered in Section 8.3, rank highly among the applied statistician's "most frequently asked questions," and are a "must do." Section 8.4 on tolerance intervals is optional and may be skipped.

Chapter 9. This chapter presents the Bayesian approach to estimation, pointing out potential gains in estimation efficiency afforded by the approach. The potential risks involved in Bayesian inference are also treated seriously.
The basic idea of the approach is covered in Section 9.1 and the mechanics of Bayesian estimation are treated in Section 9.2. Sections 9.3 and 9.4 provide fairly compelling evidence, both empirical and theoretical, that the Bayesian approach can be quite effective, even under seemingly poor prior assumptions. These two sections represent an uncommon entry in courses at this level, a treatment of "comparative inference" where the Bayesian and classical approaches to estimation are compared side by side. Section 9.5 treats Bayesian interval estimation. If an instructor is primarily interested in acquainting students with the Bayesian approach, a trimmed-down coverage of this chapter that would accomplish this would restrict attention to Sections 9.2 and 9.5.

Chapter 10. This is the bread and butter chapter on hypothesis testing. Section 10.1 contains the needed concepts and definitions as well as much of the intuition. Section 10.2 treats the standard tests for means and proportions. The section is useful in tying the general framework of the previous section to problems that many students have seen in a previous course. Section 10.3 on sample size calculations for obtaining the desired power at a fixed alternative is an important notion in applied work and should be covered, even if only briefly. Section 10.4 presents the Neyman-Pearson Lemma, with proof. The proof can be skipped and the intuition of the lemma, found in the paragraphs that follow the proof, can be emphasized instead. The examples that complete the section are sufficient to make students comfortable with how the lemma may be used. Sections 10.5 and 10.6 cover important special topics that students need to see.

Chapter 11. Sections 11.1, 11.3, and 11.4 carry the main messages of the chapter. Section 11.2, which presents the standard tests and confidence intervals of interest in simple linear regression, can be skipped if time is precious.

Chapter 12.
Section 12.1 treats nonparametric estimation of an underlying distribution F for either complete or censored data. In a brief presentation of this material, one might present expressions for the two resulting nonparametric MLEs and an example of each. The nonparametric bootstrap is described in Section 12.2. The widespread use of the bootstrap in applied work suggests that this section should be covered. Of the three remaining sections (which are each free-standing units), the most important topic is the Wilcoxon Rank-Sum Test treated in Section 12.5. Sections 12.3 and 12.4 may be treated as assigned or recommended reading.

Instructors who have comments, questions, or suggestions about the discussion above, or about the text in general, should feel free to contact me. Your feedback would be most welcome.

Francisco J. Samaniego
University of California, Davis
fjsamaniego@ucdavis.edu

1  The Calculus of Probability

1.1  A Bit of Background

Most scientific theories evolve from attempts to unify and explain a large collection of individual problems or observations. So it is with the theory of probability, which has been developing at a lively pace over the past 350 years. In virtually every field of inquiry, researchers have encountered the need to understand the nature of variability and to quantify their uncertainty about the processes they study. From fields as varied as astronomy, biology, and economics, individual questions were asked and answered regarding the chances of observing particular kinds of experimental outcomes. Following the axiomatic treatment of probability provided by A. N. Kolmogorov in the 1930s, the theory has blossomed into a separate branch of mathematics, and it is still under vigorous development today. Many of the early problems in mathematical probability dealt with the determination of odds in various games of chance.
A rather famous example of this sort is the problem posed by the French nobleman, Antoine Gombaud, Chevalier de Méré, to one of the outstanding mathematicians of his day, Blaise Pascal: which is the more likely outcome, obtaining at least one six in four rolls of a single die or obtaining at least one double six in twenty-four rolls of a pair of dice? The question itself, dating back to the mid-seventeenth century, seems unimportant today, but it mattered greatly to Gombaud, whose successful wagering depended on having a reliable answer. Through the analysis of questions such as this (which, at this point in time, aptly constitutes a rather straightforward problem at the end of this chapter), the tools of probability computation were discovered. The body of knowledge we refer to as the probability calculus covers the general rules and methods we employ in calculating probabilities of interest. I am going to assume that you have no previous knowledge of probability and statistics. If you've had the benefit of an earlier introduction, then you're in the enviable position of being able to draw on that background for some occasionally helpful intuition. In our efficiency-minded society, redundancy has come to be thought of as a four-letter word. This is unfortunate, because its important role as a tool for learning has been simultaneously devalued. If some of the ideas we cover are familiar to you, I invite you to treat it as an opportunity to rethink them, perhaps gaining a deeper understanding of them. I propose that we begin by giving some thought here to a particularly simple game of chance. I want you to think about the problem of betting on the outcome of tossing two newly minted coins. In this particular game, it is clear that only three things can happen: the number of heads obtained will be zero, one, or two. 
Could I interest you in a wager in which you pay me three dollars if the outcome consists of exactly one head, and I pay you two dollars if either no heads or two heads occur? Before investing too much of your fortune on this gamble, it would behoove you to determine just how likely each of the three outcomes is. If, for example, the three outcomes are equally likely, this bet is a real bargain, since the 2-to-1 odds in favor of your winning will outweigh the 2-to-3 differential in the payoffs. (This follows from one of the "rules" we'll justify later.) If you're not sure about the relative likelihood of the three outcomes, you might pause here to flip two coins thirty times and record the proportion of times you got 0, 1, or 2 heads. That mini-experiment should be enough to convince you that the wager I've proposed is not an altogether altruistic venture. The calculus of probability can be thought of as the methodology which leads to the assignment of numerical "likelihoods" to sets of potential outcomes of an experiment.

In our study of probability, we will find it convenient to use the ideas and notation of elementary set theory. The clarity gained by employing these simple tools amply justifies the following short digression. In the mathematical treatment of set theory, the term "set" is undefined. It is assumed that we all have an intuitive appreciation for the word and what it might mean. Informally, we will have in the back of our minds the idea that there are some objects we are interested in, and that we may group these objects together in various ways to form sets. We will use capital letters A, B, C to represent sets and lower case letters a, b, c to represent the objects, or "elements," they may contain. The various ways in which the notion of "containment" arises in set theory is the first issue we face. We have the following possibilities to consider:

Definition 1.1.1.
(Containment)
(i) a ∈ A: a is an element in the set A; a ∉ A is taken to mean that a is not a member of the set A.
(ii) A ⊆ B: the set A is contained in the set B, that is, every element of A is also an element of B. A is then said to be a "subset" of B.
(iii) A ⊂ B: A is a "proper subset" of B, that is, A is contained in B (A ⊆ B) but B contains at least one element that is not in A.
(iv) A = B: A ⊆ B and B ⊆ A, that is, A and B contain precisely the same elements.

In typical applications of set theory, one encounters the need to define the boundaries of the problem of interest. It is customary to specify a "universal set" U, a set which represents the collection of all elements that are relevant to the problem at hand. It is then understood that any set that subsequently comes under discussion is a subset of U. For logical reasons, it is also necessary to acknowledge the existence of the "empty set" ∅, the set with no elements. This set constitutes the logical complement of the notion of "everything" embodied in U, but also plays the important role of zero in the set arithmetic we are about to discuss. We use arithmetic operations like addition and multiplication to create new numbers from numbers we have in hand. Similarly, the arithmetic of sets centers on some natural ways of creating new sets. The set operations we will use in this text appear in the listing below:

Definition 1.1.2. (Set Operations)
(i) Aᶜ: "the complement of A," that is, the set of elements of U that are not in A.
(ii) A ∪ B: "A union B," the set of all elements of U that are in A or in B or in both A and B.
(iii) A ∩ B: "A intersect B," the set of all elements of U that are in both A and B.
(iv) A − B: "A minus B," the set of all elements of U that are in A but are not in B.

We often use pictorial displays called Venn diagrams to get an idea of what a set looks like.
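The four operations in Definition 1.1.2 correspond directly to operators on Python's built-in set type. The following sketch illustrates each one; the sets U, A, and B are arbitrary choices for illustration, not examples from the text:

```python
# Illustrating the set operations of Definition 1.1.2 with Python sets.
# U plays the role of the universal set; A and B are hypothetical examples.
U = {1, 2, 3, 4, 5, 6, 7, 8}
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

complement_A = U - A   # Aᶜ: the elements of U that are not in A
union = A | B          # A ∪ B: elements in A, in B, or in both
intersection = A & B   # A ∩ B: elements in both A and B
difference = A - B     # A − B: elements in A but not in B

print(complement_A)    # {5, 6, 7, 8}
print(union)           # {1, 2, 3, 4, 5, 6}
print(intersection)    # {3, 4}
print(difference)      # {1, 2}
```

Any claimed identity between two set expressions can be spot-checked the same way on concrete sets, although, as this section goes on to emphasize, such checks do not replace a proof.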
The Venn diagrams for the sets in Definition 1.1.2 are shown below:

Figure 1.1.1. Venn diagrams of Aᶜ, A ∪ B, A ∩ B, and A − B.

Venn diagrams are often helpful in determining whether or not two sets are equal. You might think "I know how to check set equality; just see if the sets contain the same elements. So what's the big deal?" This reaction is entirely valid in a particular application where the elements of the two sets can actually be listed and compared. On the other hand, we often need to know whether two different methods of creating a new set amount to the same thing. To derive a general truth that holds in all potential applications, we need tools geared toward handling abstract representations of the sets involved. Consider, for example, the two sets A ∩ (B ∪ C) and (A ∩ B) ∪ (A ∩ C). The equality of these two sets (a fact which is called the distributive property of intersection with respect to union) can be easily gleaned from a comparison of the Venn diagrams of each.

Figure 1.1.2. The Venn diagram of either A ∩ (B ∪ C) or (A ∩ B) ∪ (A ∩ C).

Often, general properties of set operations can be discerned from Venn diagrams. Among such properties, certain "decompositions" are especially important in our later work. First, it is clear from the Venn diagram of A ∪ B that the set can be decomposed into three non-overlapping sets. Specifically, we may write

A ∪ B = (A − B) ∪ (A ∩ B) ∪ (B − A).    (1.1)

Another important decomposition provides an alternative representation of a given set A. For any two sets A and B, we may always represent A in terms of what it has (or hasn't) in common with B:

A = (A ∩ B) ∪ (A ∩ Bᶜ).    (1.2)

In a Venn diagram showing two overlapping sets A and B, combining the sets (A ∩ B) and (A ∩ Bᶜ) clearly accounts for all of the set A, confirming the validity of the identity (1.2). The identity also holds, of course, if A and B do not overlap. Why?
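The two decompositions above are easy to confirm on concrete sets. The particular A, B, and universal set U below are arbitrary examples:

```python
# Spot-checking the decompositions (1.1) and (1.2) on hypothetical sets.
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7}
U = set(range(10))     # a universal set containing A and B
Bc = U - B             # Bᶜ relative to U

# (1.1): A ∪ B is the union of three non-overlapping pieces
assert A | B == (A - B) | (A & B) | (B - A)

# ...and those pieces really are pairwise disjoint
assert (A - B) & (A & B) == set()
assert (A & B) & (B - A) == set()
assert (A - B) & (B - A) == set()

# (1.2): A is what it shares with B plus what it shares with Bᶜ
assert A == (A & B) | (A & Bc)

print("decompositions (1.1) and (1.2) hold for these sets")
```

Of course, a check on one pair of sets only confirms the identities for that pair; the general claim requires the kind of argument discussed next.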
There is something special about the sets involved in the unions which appear in (1.1) and (1.2). Sets D and E are said to be "disjoint" if D ∩ E = ∅, that is, if they have no elements in common. Both of the identities above represent a given set as the union of disjoint sets. As we will see, such representations are often useful in calculating the probability of a given set of possible experimental outcomes.

As helpful as Venn diagrams are in visualizing certain sets, or in making the equality of two sets believable, they lack something that mathematicians consider to be of paramount importance: rigor. A rigorous mathematical argument consists of a logical sequence of steps which lead you from a given statement or starting point to the desired conclusion. Each step in the argument must be fully defensible, being based solely on basic definitions or axioms and their known consequences. As an example of a "set-theoretic proof," we will provide a rigorous argument establishing a very important identity known as De Morgan's Law.

Theorem 1.1.1. (De Morgan's Law for two sets) For two arbitrary sets A and B,

(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ.    (1.3)

Proof. The two sets in (1.3) can be shown to be equal by showing that each contains the other. Let's first show that (A ∪ B)ᶜ ⊆ Aᶜ ∩ Bᶜ. Assume that x is an arbitrary element of the set (A ∪ B)ᶜ, that is, assume that x ∈ (A ∪ B)ᶜ. This implies that x ∉ (A ∪ B), which implies that x ∉ A and x ∉ B, which implies that x ∈ Aᶜ and x ∈ Bᶜ, which implies that x ∈ Aᶜ ∩ Bᶜ. The same sequence of steps, in reverse, proves that Aᶜ ∩ Bᶜ ⊆ (A ∪ B)ᶜ. The most efficient proof of (1.3) results from replacing the phrase "implies that" with the phrase "is equivalent to." This takes care of both directions at once. □

It is easy to sketch the Venn diagrams for the two sets in De Morgan's Law.
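Besides a proof or a picture, a brute-force check on small finite sets can also build confidence in (1.3). The sets here are arbitrary examples:

```python
# A brute-force check of De Morgan's Law (1.3) on hypothetical finite sets.
U = set(range(12))
A = {0, 1, 2, 3, 4}
B = {3, 4, 5, 6}

lhs = U - (A | B)          # (A ∪ B)ᶜ
rhs = (U - A) & (U - B)    # Aᶜ ∩ Bᶜ
assert lhs == rhs
print(lhs)                 # {7, 8, 9, 10, 11}
```

Such a check verifies the law only for the particular sets tested, which is exactly the limitation of informal methods that the discussion below addresses.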
You might decide that the two ways of convincing yourself of the law's validity (namely, the proof and the picture) are of roughly the same complexity, and that the visual method is preferable because of its concreteness. You should keep in mind, however, that convincing yourself that something is true isn't the same as formulating an air-tight argument that it's true. But there's an additional virtue in developing a facility with the kind of mathematical reasoning employed in the proof above. The bigger the problem gets, the harder it is to use visual aids and other informal tools. A case in point is the general version of De Morgan's Law, stating that, for n arbitrary sets A₁, A₂, . . . , Aₙ, the complement of the union of the sets is equal to the intersection of their complements.

Theorem 1.1.2. (General De Morgan's Law) For n arbitrary sets A₁, A₂, . . . , Aₙ,

(A₁ ∪ A₂ ∪ · · · ∪ Aₙ)ᶜ = A₁ᶜ ∩ A₂ᶜ ∩ · · · ∩ Aₙᶜ.    (1.4)

The general statement in (1.4) cannot be investigated via Venn diagrams, but it is easy to prove it rigorously using the same logic that we used in proving De Morgan's Law for two sets. Give it a try and you'll see!

Exercises 1.1.

1. Suppose U = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and the sets A, B, and C are given by A = {2, 4, 6, 8, 10}, B = {2, 5, 6, 7, 10}, and C = {1, 6, 9}. Identify each of the following sets: (a) A ∪ B, (b) A ∩ B, (c) A − B, (d) A ∪ Bᶜ, (e) A ∩ B ∩ C, (f) B ∩ (A ∪ C)ᶜ, (g) (A ∩ C) ∪ (B ∩ C), (h) (A − C) ∪ (C − A), (i) Aᶜ ∩ B ∩ Cᶜ.

2. Let A, B, and C be three overlapping sets. Draw the Venn diagram for A ∪ B ∪ C, and note that this union can alternatively be viewed as the union of seven disjoint sets. Using the set operations defined in Section 1.1, give each of these seven sets a name. (And I don't mean Jane, Harry, Gustavo, . . . )

3. Construct two Venn diagrams which show that the set (A ∩ B) − C is equal to the set (A − C) ∩ (B − C).

4. For arbitrary sets A and B, give a set-theoretic proof that A ∩ Bᶜ = A − B.

5.
For arbitrary sets A and B, prove that A ∪ B = A ∪ (B − A).

6. Let A₁, . . . , Aₙ be a "partition" of the universal set U, i.e., suppose that Aᵢ ∩ Aⱼ = ∅ for i ≠ j (or, simply, the As are disjoint sets), and that A₁ ∪ A₂ ∪ · · · ∪ Aₙ = U. Let B be an arbitrary subset of U. Prove that

B = (B ∩ A₁) ∪ (B ∩ A₂) ∪ · · · ∪ (B ∩ Aₙ).    (1.5)

(Note that the identity in (1.5) is still true if the As are not disjoint, as long as the condition A₁ ∪ A₂ ∪ · · · ∪ Aₙ = U holds.)

1.2  Approaches to Modeling Randomness

Randomness is a slippery idea. Even the randomness involved in a simple phenomenon like flipping a coin is difficult to pin down. We would like to think of a typical coin toss as yielding one of two equally likely outcomes: heads or tails. We must recognize, however, that if we were able to toss a coin repeatedly under precisely the same physical conditions, we would get the same result each time. (Compare it, for example, to physical processes that you can come close to mastering, like that of throwing a crumpled piece of paper into a waste basket two feet away from you.) So why do we get about as many heads as tails when we toss a coin repeatedly? It's because the physical processes involved are well beyond our control. Indeed, they are even beyond our ability to understand fully. Consecutive coin tosses are generally different enough from each other that it is reasonable to assume that they are unrelated. Moreover, the physical factors (like force, point of contact, direction of toss, and atmospheric conditions, as when someone right in front of you is breathing heavily) are sufficiently complex and uncontrollable that it is reasonable to assume that the outcome in a given toss is unpredictable, with heads neither more nor less likely than tails. Thus, in modeling the outcomes of successive coin tosses, a full understanding of the mechanics of coin tossing is generally impossible, but also quite unnecessary.
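The claim that repeated tosses of a fair coin produce roughly equal frequencies of heads and tails is easy to examine by simulation. The sketch below uses Python's standard library; the number of tosses and the seed are arbitrary choices:

```python
import random

random.seed(2024)  # fixed seed so the run is reproducible

n = 100_000        # an arbitrary, reasonably large number of tosses
heads = sum(random.choice("HT") == "H" for _ in range(n))
print(f"proportion of heads in {n} tosses: {heads / n:.3f}")
```

The printed proportion will be close to, but almost never exactly, 0.5; quantifying how close is exactly the kind of question the chapters ahead take up.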
Our models should reflect, of course, the information we happen to have about the phenomenon we are observing. If we were able, for example, to ensure that our coin tosses were so precise that the coin made exactly eight full revolutions in the air at least 70 percent of the time, then we would wish our model for the outcomes of such tosses to reflect the fact that there was at least a 70 percent chance that the initial and final states of the coin agreed. What is important about the models we will employ to describe what might happen in particular experiments is that they constitute reasonable approximations of reality. Under normal circumstances, we'll find that the coin tossing we do is quite well described by the assumption that heads and tails are equally likely. Our criterion for the validity of a model will be the level of closeness achieved between real experimental outcomes and the array of outcomes that our model would have predicted. Our discussion of randomness will require a bit of jargon. We collect some key phrases in the following: Definition 1.2.1. (Random Experiments) A random experiment is an experiment whose outcome cannot be predicted with certainty. All other experiments are said to be deterministic. The sample space of a random experiment is the set of all possible outcomes. The sample space plays the role of the universal set in problems involving the corresponding experiment, and it will be denoted by S. A single outcome of the experiment, that is, a single element of S, is called a simple event. A compound event is simply a subset of S. While simple events can be viewed as compound events of size one, we will typically reserve the phrase "compound event" for subsets of S with more than one element. Developing a precise description of the sample space of a random experiment is always the first step in formulating a probability model for that experiment. 
In this chapter and the next, we will deal exclusively with "discrete problems," that is, with problems in which the sample space is finite or, at most, countably infinite. (A countably infinite set is one that is infinite, but can be put into one-to-one correspondence with the set of positive integers. For example, the set {2n : n = 1, 2, 3, . . . } is countably infinite.) We postpone our discussion of "continuous" random experiments until Chapter 3. We turn now to an examination of four random experiments from which, in spite of their simplicity, much can be learned.

Example 1.2.1. Suppose that you toss a single coin once. Since you catch it in the palm of one hand, and quickly turn it over onto the back of your other hand, we can discount the possibility of the coin landing and remaining on its edge. The sample space thus consists of the two events "heads" and "tails" and may be represented as S1 = {H, T}.

Example 1.2.2. Suppose that you toss a penny and then a nickel, using the same routine as in Example 1.2.1. The sample space here is S2 = {HH, HT, TH, TT}. You will recall that in the preceding section, I mentioned that only three things could happen in this experiment, namely, you could get either 0, 1, or 2 heads. While this is true, it is important to recognize that this summary does not correspond to the most elementary description of the experiment. In experiments like this, which involve more than one stage, it is often helpful to draw a "tree" picturing the ways in which the experiment might play out.

Figure 1.2.1. A tree displaying the sample space for tossing two coins.

Example 1.2.3. A single die is rolled. The sample space describing the potential values of the number facing upwards is S3 = {1, 2, 3, 4, 5, 6}.

Example 1.2.4. A pair of dice, one red, one green, are rolled.
The sample space for this experiment may be represented as a tree with six branches at the first stage and six branches at the second stage, so that the two-stage experiment is represented by a tree with 36 paths or possible outcomes. The rectangular array below is an equivalent representation in which the first digit represents the outcome of the red die and the second digit corresponds to the outcome for the green die. The sample space S4 is shown below.

    S4 = { 1,1  1,2  1,3  1,4  1,5  1,6
           2,1  2,2  2,3  2,4  2,5  2,6
           3,1  3,2  3,3  3,4  3,5  3,6
           4,1  4,2  4,3  4,4  4,5  4,6
           5,1  5,2  5,3  5,4  5,5  5,6
           6,1  6,2  6,3  6,4  6,5  6,6 }

Figure 1.2.2. The sample space S4 for rolling two dice.

We now turn our attention to the problem of assigning "probabilities" to the elements of S. We will use the word "stochastic" as the adjective form of the noun "probability." A stochastic model is a model for a random experiment. It provides the basis for attaching numerical probabilities or likelihoods to sets of outcomes in which we might be interested. There are various schools of thought concerning the construction of stochastic models. Here, we will follow a route that is generally called the "classical" or the "frequentist" approach. Prominent among alternatives is the "subjective" school of probability, an approach that we will explore a bit in Section 1.5 and take up more seriously in Chapter 9. The classical approach to probability is based on the notion that the likelihood of an event should reveal itself in many repetitions of the underlying experiment. Thus, if we were able to repeat a random experiment indefinitely, the relative frequency of occurrence of the event A would converge to a number, that number being the true probability of A. It is helpful to begin with simpler, more concrete ideas. After all, that is how the field itself began.
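Sample spaces for multi-stage experiments like these can be generated mechanically. The following sketch (in Python, and not part of the text) enumerates S2 and S4 by taking every path through the corresponding tree:

```python
from itertools import product

# Every path through the two-stage tree of Example 1.2.2:
# a penny toss followed by a nickel toss.
S2 = [a + b for a, b in product("HT", repeat=2)]
print(S2)        # ['HH', 'HT', 'TH', 'TT']

# The same idea yields the 36 outcomes of Example 1.2.4, with the first
# coordinate for the red die and the second for the green die.
S4 = list(product(range(1, 7), repeat=2))
print(len(S4))   # 36
```

The `product` call plays the role of the tree: one factor per stage, and one outcome per path.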
When the sample space is finite, as it is in the examples considered above, the specification of a stochastic model consists of assigning a number, or probability, to each outcome, that number representing, in the modeler's mind, the chance that this outcome will occur in a given trial of the experiment. Let's reexamine the four examples with which we began. If you had to specify stochastic models for these experiments, what would you do? It's not uncommon in these circumstances to encounter an irresistible urge to assign equal probabilities to all the simple events in each experiment. Pierre Simon Laplace (1749–1827), a brilliant mathematician who was eulogized as the Isaac Newton of France, had enough respect for this urge to have elevated it to the lofty status of a principle. Roughly speaking, his Principle of Insufficient Reason says that if you don't have a good reason to do otherwise, a uniform assignment of probabilities to the sample space is appropriate. In simple examples such as the ones before us, the principle does indeed seem compelling. Thus, in situations in which we are reasonably sure we are dealing with "fair" coins and "balanced" dice, we would consider the uniform stochastic model to be appropriate. More generally, we will call this first approach to the assignment of probabilities to simple events the "intuitive" approach, since it generally relies on intuitive judgments about symmetry or similarity.

Suppose we assign each of the two simple events in S1 the probability 1/2, recognizing that, under normal circumstances, we believe that they will occur with equal likelihood. Similarly, suppose each simple event in S2 is assigned probability 1/4, each simple event in S3 is assigned probability 1/6, and each simple event in S4 is assigned probability 1/36. We then have, at least implicitly, a basis for thinking about the probability of any compound event that may interest us.
To make the transition to this more complex problem, we need the link provided by the natural and intuitive rule below.

Computation Rule. For any compound event A in the finite sample space S, the probability of A is given by

$$P(A) = \sum_{a \in A} P(a), \tag{1.6}$$

where ∑ is the standard symbol for addition. The number P(A) should be viewed as an indicator of the chances that the event A will occur in a single trial of the experiment.

The intuition behind this computation rule is fairly basic. If the two simple events a1 and a2 have probabilities p1 and p2 respectively, then, in many trials, the proportion of trials in which a1 occurs should be very close to p1, and the proportion of trials in which a2 occurs should be very close to p2. It follows that the proportion of trials in which the outcome is either a1 or a2 should be close to p1 + p2. Since the probability we assign to each simple event represents our best guess at its true probability, we apply to them the same intuition regarding additivity. As we shall see, the assumption that probabilities behave this way is a version of one of the basic axioms upon which the theory of probability is based.

Let's apply our computation rule to the experiment of rolling a pair of balanced dice. One of the outcomes of special interest in the game of Craps (in which two dice, assumed to be balanced, are rolled in each play) is the event "the sum of the digits (or dots) facing up is seven." When this happens on a player's first roll, the player is an "instant winner." The computation rule enables us to obtain the probability of this event:

P(sum is seven) = P(1,6) + P(2,5) + P(3,4) + P(4,3) + P(5,2) + P(6,1) = 6/36 = 1/6.

The probabilities of other compound events can be computed similarly:

P(At Least One Digit is a 4) = 11/36
P(First Digit = Second Digit) = 1/6
P(First Digit > Second Digit) = 5/12.

The latter computation can be done easily enough by identifying the 15 simple events, among the 36, which satisfy the stated condition.
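The computations above are easy to check by brute force. Here is a minimal sketch (not from the text) that enumerates the 36 equally likely outcomes and applies the computation rule directly:

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes for a pair of balanced dice.
S = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(A) = sum of P(a) over a in A; each simple event has probability 1/36."""
    return Fraction(sum(1 for o in S if event(o)), len(S))

assert prob(lambda o: o[0] + o[1] == 7) == Fraction(1, 6)   # instant winner in Craps
assert prob(lambda o: 4 in o) == Fraction(11, 36)           # at least one digit is a 4
assert prob(lambda o: o[0] == o[1]) == Fraction(1, 6)       # first digit = second digit
assert prob(lambda o: o[0] > o[1]) == Fraction(5, 12)       # first digit > second digit
```

Using `Fraction` keeps the answers exact, so they can be compared to the fractions in the text rather than to decimal approximations.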
It is useful, however, to get accustomed to looking for shortcuts. Could you have obtained the answer 5/12 just by reflecting on the symmetry of the situation? Another type of shortcut reveals itself when we try to compute the probability of an event like "the sum of the digits is at least five." Again, we could collect the simple events for which this happens, and add their probabilities together. Even in the simple problem we are dealing with here, this seems like too much work. It's appealing to exploit the fact that the complementary event "the sum is less than five" is a lot smaller. Indeed, there are just six simple events in the latter compound event, leading to the computation

P(Sum is at Least 5) = 1 − P(Sum is Less Than 5) = 1 − 1/6 = 5/6.

The alternative computation is based on the fact that the probabilities assigned to the 36 sample events in this experiment add up to 1. You will have the opportunity to gather additional experience with the computation rule in the exercises at the end of this section.

So far, we have dealt only with experiments in which the assignment of probabilities to simple events can be done intuitively from our assumptions of symmetry, balance, or similarity. We believe that the chances that a card drawn randomly from a standard 52-card deck will be a spade is 1/4 simply because we see no reason that it should be more or less likely than drawing a heart. You will notice, however, that the applicability of the computation rule is not limited to uniform cases; it simply says that once the probabilities of simple events are specified, we can obtain probabilities for compound events by appropriate addition. The theory of probability, just like that of physics, economics, or biology, had better give answers that are in general agreement with what we actually observe.
If the foundations of these subjects were radically inconsistent with what we see in the world around us, we would discard them and keep looking for a good explanation of our individual and collective experience. Your "intuition" about a random experiment is only trustworthy if it is compatible with your past experience and if it helps you predict the outcomes of future replications of the same kind of experiment. The real basis for modeling a random experiment is past experience. Your intuition represents a potential shortcut, and whenever you use it, you need to entertain, at least momentarily, the idea that your intuition may be off the mark. If you saw someone roll five consecutive sevens in the game of Craps, wouldn't the suspicion that the dice are unbalanced creep into your consciousness? But how do we approach the assignment of probabilities to simple events when we are unable to rely upon our intuition? Since most of the situations in which we wish to model randomness are of this sort, it is essential that we have a trustworthy mechanism for treating them. Fortunately, we do, and we now turn our attention to its consideration. For reasons that will be readily apparent, we refer to this alternative approach to stochastic modeling as the "empirical" approach. Consider now how you might approach stochastic modeling in an experiment in which a thumb tack, instead of a fair coin, was going to be flipped repeatedly. Suppose a "friend" offers you the opportunity to bet on the event that the tack lands with the point facing downwards (D), and proposes that he pay you $3 each time that happens, with you paying him $2 whenever the point faces upwards (U). In order to approach this bet intelligently, you need a stochastic model for the two possible outcomes D and U, down and up. But where does such a model come from? Would you trust your intuition in this instance? 
Since your intuition about this experiment is probably a bit hazy, and since your financial resources are not unlimited, that seems pretty risky. What is your best course of action? "I'll let you know tomorrow" seems like a prudent response. What you need is some experience in flipping thumb tacks. In a twenty-four hour period, you could gather a great deal of experience. Suppose you find that in one thousand flips, your thumb tack faced downwards only 391 times. What you have actually done is manufactured a stochastic model for flipping a thumb tack. Based on your experience, it is reasonable for you to assign the probabilities P(D) = .391 and P(U) = .609 to the simple events in your experiment. Is this model a good one? It is probably quite good in terms of predicting how many times you would observe the event "D" in a few future flips. It is a model that is undoubtedly better than the one you would come up with based on raw intuition, but it is probably not as reliable as the model you could construct if you'd had the patience to flip a thumb tack ten thousand times. On the basis of your model for this experiment, you should turn down the wager you have been offered. In Exercise 2.2.5, you will be asked to confirm, by making a suitable computation, that this bet is stacked against you.

Most instances in which we confront the need for stochastic modeling don't actually involve wagering, at least not explicitly. Probability models are used in predicting the chances that it will rain tomorrow, in determining the appropriate premium for your automobile insurance (a touchy subject, I know), and in predicting how well an undergraduate student might do in law school. These are problems in which intuition is hard to come by, but they are also problems that arise in real life and require your, or at least someone's, occasional attention.
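Under the empirical model just constructed, the wager can be evaluated with a quick average-payoff computation (a sketch anticipating the idea of expected value, which the book treats later; the arithmetic itself uses only the numbers above):

```python
# Average payoff per flip of the proposed wager under the empirical model:
# you win $3 when the tack lands point down (D), lose $2 when point up (U).
p_down, p_up = 0.391, 0.609
average_payoff = 3 * p_down - 2 * p_up
print(round(average_payoff, 3))   # -0.045: you lose about 4.5 cents per flip, on average
```

A negative average payoff is what "the bet is stacked against you" means in the long run.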
How does an insurance company go about determining their costs in offering life insurance to your Aunt Mable? Suppose, for simplicity, that Aunt Mable wants a straightforward policy that will pay her heirs $1,000,000 if she dies before age sixty-five, and pays them nothing otherwise. An insurance company must consider a number of relevant issues, among them the chances that a thirty-two-year-old woman in good health will live to age sixty-five. You might think that, since there is only one Aunt Mable, one can't experiment around with her as easily as one can flip a thumb tack. Indeed, the thought of flipping Aunt Mable even a couple of times is enough to give one pause. But insurance companies do just that; they treat Aunt Mable as if she were a typical member of the population of all thirty-two-year-old women in good health. Under that assumption, they determine, from an appropriate volume of the mortality tables they maintain very diligently, what proportion of that population survives to age sixty-five. In fact, in order to set an annual premium, a company will need to have a stochastic model for Aunt Mable surviving one year, two years, . . . , thirty-three years. The number of insurance companies seeking Aunt Mable's business, and the size of the buildings in which they operate, attest to the success this industry tends to have in constructing stochastic models as needed. Actuarial science, as the mathematical theory of insurance is called, is an area of inquiry that is centuries old. Halley's life tables, published in 1693, estimated the probability that a thirty-two-year-old female would survive to age sixty-five to be 177/403. The art world has been rocked by recent reassessments of the authenticity of certain paintings that have traditionally been attributed to Rembrandt. The "connoisseurship" movement has, as you might imagine, made more than one museum curator sweat. 
"Authenticity studies" often involve the development of stochastic models in an attempt to determine the chances that a particular body of work could have been done by a given individual. A statistically based comparison of work that may or may not have been performed by the same individual has become a standard tool in such studies. One of the best-known examples involves the use of statistical techniques in settling a long-standing historical debate involving the authorship of the Federalist Papers. This rather famous piece of detective work is described below.

In 1787–88, a series of anonymous essays were published and distributed in the state of New York. Their clear purpose was to persuade New Yorkers to ratify the Constitution. In total, eighty-five essays were produced, but serious interest in precisely who wrote which ones arose only after Alexander Hamilton's death in a duel in 1804. By a variety of means, definitive attribution appeared to be possible for seventy of the papers, with Hamilton identified as the author of fifty-one, and James Madison being credited with writing fourteen of the others. Of the fifteen papers whose authorship was uncertain, three were determined to have been written by a third party, and each of a group of twelve was variously attributed to Hamilton or to Madison, with scholars of high repute stacking up on both sides in a vigorous game of academic tug of war. Enter Mosteller and Wallace (1964), two statisticians who were convinced that the appropriate classification could be made through a careful analysis of word usage. In essence, it could be determined from writings whose authorship was certain that there were words that Hamilton used a lot more than Madison, and vice versa. For example, Hamilton used the words "on" and "upon" interchangeably, while Madison used "on" almost exclusively.
The table below, drawn from the Mosteller and Wallace study, can be viewed as a specification of a stochastic model for the frequency of occurrence of the word "upon" in arbitrary essays written by each of these two authors. The collection of written works examined consisted of forty-eight works known to have been authored by Hamilton, fifty works known to have been authored by Madison, and twelve Federalist Papers whose authorship was in dispute. By itself, this analysis presents a strong argument for classifying at least eleven of the twelve disputed papers as Madisonian. In combination with a similar treatment of other "non-contextual" words in these writings, this approach provided strong evidence that Madison was the author of all twelve of the disputed papers, essentially settling the authorship debate.

    Rate/1000 Words   Authored by Hamilton   Authored by Madison   12 Disputed Papers
    Exactly 0                  0                     41                    11
    (0.0, 0.4)                 0                      2                     0
    [0.4, 0.8)                 0                      4                     0
    [0.8, 1.2)                 2                      1                     1
    [1.2, 1.6)                 3                      2                     0
    [1.6, 2.0)                 6                      0                     0
    [2.0, 3.0)                11                      0                     0
    [3.0, 4.0)                11                      0                     0
    [4.0, 5.0)                10                      0                     0
    [5.0, 6.0)                 3                      0                     0
    [6.0, 7.0)                 1                      0                     0
    [7.0, 8.0)                 1                      0                     0
    Totals:                   48                     50                    12

Table 1.2.1. Frequency distribution of the word "upon" in 110 essays.

Exercises 1.2.

1. Specify the sample space for the experiment consisting of three consecutive tosses of a fair coin, and specify a stochastic model for this experiment. Using that model, compute the probability that you (a) obtain exactly one head, (b) obtain more heads than tails, (c) obtain the same outcome each time.

2. Suppose a pair of balanced dice is rolled and the number of dots that are facing upwards are noted. Compute the probability that (a) both numbers are odd, (b) the sum is odd, (c) one number is twice as large as the other number, (d) the larger number exceeds the smaller number by 1, and (e) the outcome is in the "field" (a term used in the game of Craps), that is, the sum is among the numbers 5, 6, 7, and 8.

3.
A standard deck of playing cards consists of 52 cards, with each of the 13 "values" ace, 2, 3, . . . , 10, jack, queen, king appearing in four different suits—spades, hearts, diamonds, and clubs. Suppose you draw a single card from a well-shuffled deck. What is the probability that you (a) draw a spade, (b) draw a face card, that is, a jack, queen, or king, and (c) draw a card that is either a face card or a spade?

4. Refer to the frequency distributions in Table 1.2.1. Suppose you discover a dusty, tattered manuscript in an old house in Williamsburg, Virginia, and note that the house displays two plaques, one asserting that Hamilton slept there and the other asserting that Madison slept there. Your attention immediately turns to the frequency with which the author of this manuscript used the word "upon." Suppose that you find that the rate of usage of the word "upon" in this manuscript was 2.99 times per 1000 words. If Hamilton wrote the manuscript, what probability would you give to the event that the manuscript contains fewer than 3 uses of the word "upon" per 1000 words? If Madison wrote the manuscript, what probability would you give to the event that the manuscript contains more than 2 uses of the word "upon" per 1000 words? What's your guess about who wrote the manuscript?

1.3  The Axioms of Probability

Mathematical theories always begin with a set of assumptions. Euclid built his science of plane geometry by assuming, among other things, the "parallel postulate," that is, the assertion that through any point outside line A, there exists exactly one line parallel to line A. The mathematical theory that follows from a set of assumptions will have meaning and utility in real-world situations only if the assumptions made are themselves realistic. That doesn't mean that each axiom system we entertain must describe our immediate surroundings.
For example, Einstein's Theory of Relativity is based on a type of non-Euclidean geometry which discards the parallel postulate, thereby allowing for the development of the richer theory required to explain the geometry of time and space. Since the purpose of any mathematical theory is to yield useful insights into real-world applications, we want the axioms used as the theory's starting point to agree with our intuition regarding these applications. Modern probability theory is based on three fundamental assumptions. In this section, we will present these assumptions, argue that they constitute a reasonable foundation for the calculus of probability, and derive a number of their important implications.

When you go about the business of assigning probabilities to the simple events in a particular random experiment, you have a considerable amount of freedom. Your stochastic model for flipping a thumb tack is not likely to be the same as someone else's. Your model will be judged to be better than another if it turns out to be more successful in predicting the frequency of the possible outcomes in future trials of the experiment. It still may be true that two different models can justifiably be considered "good enough for government work," as the old saying goes. We will not ask the impossible—that our model describe reality exactly—but rather seek to specify models that are close enough to provide reasonable approximations in future applications. In the words of statistician George E. P. Box, "all models are wrong, but some are useful." While two models may assign quite different probabilities to simple events, they must have at least a few key features in common. Because of the way we interpret probability, for example, we would not want to assign a negative number to represent the chances that a particular event will occur.
We thus would want to place some intuitively reasonable constraints on our probability assignments. The following axiom system is generally accepted as the natural and appropriate foundation for the theory of probability. Let S be the sample space of a random experiment. Then the probabilities assigned to events in S must satisfy

Axiom 1. For any event A ⊆ S, P(A) ≥ 0.

Axiom 2. P(S) = 1.

Axiom 3. For any collection of events A1, A2, . . . , An, . . . satisfying the conditions

$$A_i \cap A_j = \emptyset \text{ for all } i \neq j, \tag{1.7}$$

the probability that at least one of the events among the collection {A_i, i = 1, 2, 3, . . . } occurs may be computed as

$$P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i). \tag{1.8}$$

The first two of these axioms should cause no anxiety, since what they say is so simple and intuitive. Because we generally think of probabilities as long-run relative frequencies of occurrence, we would not want them to be negative. Further, we know when we define the sample space of a random experiment that it encompasses everything that can happen. It thus necessarily has probability one. Axiom 3 is the new and somewhat exotic assumption that has been made here. Let's carefully examine what it says. In Section 1.1, we referred to non-overlapping sets as "disjoint." In probability theory, it is customary to use the alternative phrase "mutually exclusive" to describe events which do not overlap. The message of Axiom 3 may be restated as: For any collection of mutually exclusive events in a discrete random experiment, the probability of their union is equal to the sum of their individual probabilities. The assertion made in Axiom 3 is intended to hold for a collection of arbitrary size, including collections that are countably infinite. When Axiom 3 is restricted to apply only to finite collections of mutually exclusive events, it is called "the axiom of finite additivity." Otherwise, we call it "the axiom of countable additivity."
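For a finite sample space, checking a proposed model against the axioms is mechanical. The following is a small illustrative sketch (the function name and example models are mine, not the text's):

```python
from fractions import Fraction

def satisfies_axioms(model):
    """Check Axioms 1 and 2 for a probability assignment on a finite sample
    space (a dict mapping simple events to probabilities). Axiom 3 then holds
    automatically when compound-event probabilities are computed by summing
    over simple events."""
    return all(p >= 0 for p in model.values()) and sum(model.values()) == 1

# The uniform model for a balanced die passes...
die = {face: Fraction(1, 6) for face in range(1, 7)}
assert satisfies_axioms(die)

# ...while an assignment with a negative "probability" fails Axiom 1,
# even though its values sum to one.
bad = {"H": Fraction(3, 2), "T": Fraction(-1, 2)}
assert not satisfies_axioms(bad)
```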
The argument over which of the two is more appropriate as part of the axiomatic foundation of probability theory is not fully settled, though the proponents of the more general form of the axiom far outnumber the champions of finite additivity. The latter group argues that our intuition regarding additivity is based on our experience, which can only involve finite collections of events. Thus, we should not assert, as a fundamental truth, a rule for combining infinitely many probabilities. The other side argues that countable additivity does agree with our intuition and experience, and, in fact, enables us to obtain intuitively correct answers to problems that simply can't be treated otherwise. The following example illustrates this point.

Example 1.3.1. Suppose you toss a fair coin until you get a head. Let X represent the number of tosses it takes you. What probability would you attach to the event that X will be an odd number? It is clear that P(X = 1) = 1/2, since the very first toss will yield a head with probability 1/2. If a fair coin is tossed twice, the probability that you get the outcome TH, that is, a tail followed by a head, is 1/4. In your sequence of tosses, this is the one and only way in which the event {X = 2} can happen. It follows that P(X = 2) = 1/4. This same logic extends to arbitrary values of X, yielding, for n = 1, 2, 3, 4, . . . , the result that

P(X = n) = 1/2^n.  (1.9)

With the axiom of countable additivity, we can represent the desired probability as

P(X is odd) = P(X = 1) + P(X = 3) + P(X = 5) + · · · = 1/2 + 1/8 + 1/32 + · · · .  (1.10)

If you are acquainted with geometric series, you might recognize the infinite sum in (1.10) as something you can evaluate using an old familiar formula. If not, you needn't worry, since the appropriate formula, which will turn out to be useful to us in a number of different problems, will be introduced from scratch in Section 2.6.
For now, we will take a handy shortcut in evaluating this sum. Since the sample space in this experiment can be thought of as the set of positive integers {1, 2, 3, . . . }, the sum of all their probabilities is 1. Think of each of the terms you wish to add together as being paired with the probability of the even number that follows it. Thus, 1/2 goes with 1/4, 1/8 goes with 1/16, etc. For each term that is included in your series, a term with half its value is left out. If we denote the sum of the series in (1.10) by p, you can reason that, since the sum of all the terms P(X = k) for k ≥ 1 is 1, you must have p + (1/2)p = 1. From this, you may conclude that p = P(X is odd) = 2/3.

What does your own intuition have to say about the probability we have just calculated? Perhaps not very much. The two-to-one ratio between the probabilities of consecutive integer outcomes for X is itself an intuitively accessible idea, but it is not an idea that you would be expected to come up with yourself at this early stage in your study of the subject. What should you do when your intuition needs a jump-start? You proceed empirically. You will find that it's not that difficult to convince yourself of the validity of the answer obtained above. Take a coin out of your pocket and perform this experiment a few times. You will see that the event {X is odd} does indeed occur more often than the event {X is even}. If you repeat the experiment enough, you will end up believing that the fraction 2/3 provides an excellent forecast for the relative frequency of occurrence of an odd value of X. This, and many examples like it, give us confidence in assuming and using the axiom of countable additivity.

Let us now suppose that we have specified a stochastic model for the sample space S of a random experiment, and that the probabilities assigned by our model obey the three axioms above.
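Before deriving consequences of the axioms, note that the value p = 2/3 from Example 1.3.1 is also easy to corroborate numerically; a quick sketch:

```python
# Partial sums of the series in (1.10): P(X is odd) = 1/2 + 1/8 + 1/32 + ...
# Successive terms shrink by a factor of 4, so 30 terms are far more than
# enough for double precision.
p = sum(0.5 ** n for n in range(1, 60, 2))
print(round(p, 12))   # 0.666666666667

# This matches the pairing argument: p + (1/2)p = 1 gives p = 2/3.
assert abs(p - 2 / 3) < 1e-12
```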
From this rather modest beginning, we can derive other rules which apply to and facilitate the calculation of probabilities. Some of these "derived" results will seem so obvious that you will want to react to them with the question "What's to prove here?" This reaction is especially understandable when the claim we wish to prove seems to be nothing more than an alternative way to state the contents of one or more axioms. Here is the attitude I suggest that you adopt in our initial mathematical developments. Erase everything from your mind except the three axioms. Suppose that this constitutes the totality of your knowledge. You now want to identify some of the logical implications of what you know. If some of the early results we discuss seem obvious or trivial, keep in mind that one must learn to walk before one learns to run. Each of the conclusions discussed below will be stated as a theorem to be proven.

Our first result is an idea that we used, on intuitive grounds, in some of our probability computations in the last section. We now consider its formal justification.

Theorem 1.3.1. For any event A ⊆ S, the probability of A^c, the complement of A, may be calculated as

P(A^c) = 1 − P(A).  (1.11)

Proof. Note that the sample space S may be represented as the union of two mutually exclusive events. Specifically, S = A ∪ A^c, where A ∩ A^c = ∅. It follows from Axioms 2 and 3 that

1 = P(S) = P(A ∪ A^c) = P(A) + P(A^c),

an equation that immediately implies (1.11). □

Theorem 1.3.2. The empty event ∅ has probability zero.

Proof. Since ∅ = S^c, we have, by Theorem 1.3.1 and Axiom 2, P(∅) = 1 − P(S) = 0. □

One simple consequence of Theorem 1.3.2 is that it makes it apparent that the finite additivity of the probabilities assigned to the simple events of a given sample space is a special case of the countable additivity property stated in Axiom 3.
To apply Axiom 3 to a finite collection of mutually exclusive events A1, A2, . . . , An, one simply needs to define Ai = ∅ for i > n. With this stipulation, we have that

⋃_{i=1}^∞ Ai = ⋃_{i=1}^n Ai,

so that

P(⋃_{i=1}^n Ai) = P(⋃_{i=1}^∞ Ai) = ∑_{i=1}^n P(Ai) + ∑_{i=n+1}^∞ P(∅) = ∑_{i=1}^n P(Ai).

Another consequence of Axiom 3 is the important

Theorem 1.3.3. (The Monotonicity Property) If A ⊆ B, then P(A) ≤ P(B).

Proof. Since A ⊆ B, we may write B = A ∪ (B − A). Moreover, the events A and B − A are mutually exclusive. Thus, we have, by Axiom 3,

P(B) = P(A) + P(B − A).    (1.12)

Since, by Axiom 1, P(B − A) ≥ 0, we see from (1.12) that P(B) ≥ P(A), as claimed. □

You will notice that the first axiom states only that probabilities must be nonnegative; while it seems obvious that probabilities should not exceed one, this additional fact remains unstated, since it is implicit in the axioms, being an immediate consequence of the monotonicity property above. As is typical of axiom systems in mathematics, Axioms 1–3 above represent a "lean" collection of axioms containing no redundancy. In the present context, we derive from our axioms, and their three known consequences proven above, that no event A can have a probability greater than 1.

Theorem 1.3.4. For any event A ⊆ S, P(A) ≤ 1.

Proof. By Theorem 1.3.3 and Axiom 2, we have that P(A) ≤ P(S) = 1. □

The axiom of countable additivity gives instructions for computing the probability of a union of events only when these events are mutually exclusive. How might this axiom give us some leverage when we are dealing with two events A and B that have a non-empty intersection? One possibility: we could capitalize on the identity (1.1.1) in Section 1.1, which expresses A ∪ B as a union of disjoint sets. By Axiom 3, we could then conclude that

P(A ∪ B) = P(A − B) + P(A ∩ B) + P(B − A).    (1.13)

This formula, while perfectly correct, is not as convenient as the "addition rule" to which we now turn. In most applications in which the probability of a union of events is required, the probabilities on the right-hand side of (1.13) are not given or known and must themselves be derived. A more convenient formula for P(A ∪ B) would give this probability in terms of the probabilities of simpler, more basic events. Such a formula is provided in

Theorem 1.3.5. (The Addition Rule) For arbitrary events A and B,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).    (1.14)

Remark 1.3.1. If you consider the Venn diagram for A ∪ B, you will see immediately why merely adding the probabilities P(A) and P(B) together tends to give the wrong answer. The problem is that this sum will count the event A ∩ B twice. The obvious fix is to subtract P(A ∩ B) once. We now show that this is precisely the right thing to do.

Proof. Note that A ∪ B may be written as A ∪ (B − A); since A ∩ (B − A) = ∅, Axiom 3 implies that

P(A ∪ B) = P(A) + P(B − A).    (1.15)

Similarly, since B = (A ∩ B) ∪ (B − A), and (A ∩ B) ∩ (B − A) = ∅, we also have that

P(B) = P(A ∩ B) + P(B − A).    (1.16)

We may rewrite (1.16) as

P(B − A) = P(B) − P(A ∩ B).    (1.17)

Substituting (1.17) into (1.15) yields the addition rule in (1.14). □

Proving the extension of the addition rule below is left as an exercise.

Theorem 1.3.6. For any events A, B, and C,

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).    (1.18)

A full generalization of the addition rule will be a useful formula to have in your toolbox. The formula is known as the Inclusion-Exclusion Rule and is stated below. The theorem is true for any collection of events in an arbitrary sample space.
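Both the two-event addition rule (1.14) and its three-event extension (1.18) are easy to confirm by direct enumeration in a small, equally-likely sample space. The snippet below (Python; the particular events chosen are my own illustration, not the book's) uses a single roll of a fair die:

```python
from fractions import Fraction

# Sample space for one roll of a fair die; all six outcomes equally likely.
S = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event under the equally-likely model on S."""
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}   # "the roll is even"
B = {4, 5, 6}   # "the roll is at least 4"
C = {1, 2, 3}   # "the roll is at most 3"

# Theorem 1.3.5: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
lhs2 = P(A | B)
rhs2 = P(A) + P(B) - P(A & B)

# Theorem 1.3.6: the three-event extension (1.18)
lhs3 = P(A | B | C)
rhs3 = (P(A) + P(B) + P(C)
        - P(A & B) - P(A & C) - P(B & C)
        + P(A & B & C))
```

Using exact rational arithmetic (`Fraction`) rather than floating point means the two sides of each identity agree exactly, not merely to rounding error.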
A proof for the discrete case, that is, for the case in which the sample space is finite or countably infinite, uses Theorem 2.4.2 (the binomial theorem), which we have yet to discuss, so we will state the result at this point without proof. Notice that when n = 3, the result reduces to Theorem 1.3.6. (See Problem 1.8.15 for a proof when S is finite and n = 4.)

Theorem 1.3.7. (The Inclusion-Exclusion Rule) Consider the events A1, A2, . . . , An in a given random experiment. Then

P(⋃_{i=1}^n Ai) = ∑_{i=1}^n P(Ai) − ∑_{1≤i<j≤n} P(Ai ∩ Aj) + ∑_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(⋂_{i=1}^n Ai).    (1.19)

We now come to a somewhat more complex claim, a property of probabilities called "countable subadditivity." It is a result which is useful in providing a lower bound for probabilities arising in certain statistical applications (for example, when you want to be at least 95% sure of something). In addition to its utility, it provides us with a vehicle for garnering a little more experience in "arguing from first principles." Before you read the proof (or any proof, for that matter), it is a good idea to ask yourself these questions: (i) Does the theorem statement make sense intuitively? (ii) Do I know it to be true in any special cases? and (iii) Can I see how to get from what I know (that is, definitions, axioms, proven theorems) to what I want to prove? The last of these questions is the toughest. Unless you have a mathematical background that includes a good deal of practice in proving things, you will tend to have problems getting started on a new proof. At the beginning, your primary role is that of an observer, learning how to prove things by watching how others do it. In the words of the great American philosopher Yogi Berra, you can observe a lot by watching.
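Although the proof of Theorem 1.3.7 is deferred, the rule itself can be checked numerically in any small discrete example by comparing the probability of the union, computed directly, against the alternating sum on the right-hand side of (1.19). The sketch below (Python; the helper names and the divisibility events are my own, not the book's) does exactly that:

```python
from fractions import Fraction
from itertools import combinations

def prob(event, S):
    """Probability of an event in a finite, equally-likely sample space S."""
    return Fraction(len(event & S), len(S))

def inclusion_exclusion(events, S):
    """Right-hand side of (1.19): an alternating sum of intersection
    probabilities over every non-empty subcollection of the events."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)          # +, −, +, ... as r = 1, 2, 3, ...
        for sub in combinations(events, r):
            inter = set.intersection(*sub)
            total += sign * prob(inter, S)
    return total

# Example: draw one number from {1, ..., 12}; A_d = "divisible by d".
S = set(range(1, 13))
events = [{x for x in S if x % d == 0} for d in (2, 3, 5)]

direct = prob(set.union(*events), S)      # P(A1 ∪ A2 ∪ A3) by enumeration
formula = inclusion_exclusion(events, S)  # the alternating sum (1.19)
```

The two computations agree exactly, and the same brute-force check works for any number of events, which is a handy way to build confidence in the rule before its proof arrives in Chapter 2.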
Eventually, you learn some of the standard approaches, just as one can learn some of the standard openings in the game of chess by watching the game played, and you find yourself able to construct some proofs on your own. You should be able to do well in this course without perfecting the skill of mathematical argumentation, but you will have difficulty understanding the logic of probability and statistics without gaining some facility with it. The proof of countable subadditivity is our first serious opportunity to exercise our mathematical reasoning skills. The inequality below is generally attributed to the English mathematician George Boole.

Theorem 1.3.8. (Countable Subadditivity) For any collection of events A1, A2, . . . , An, . . . ,

P(⋃_{i=1}^∞ Ai) ≤ ∑_{i=1}^∞ P(Ai).    (1.20)

Remark 1.3.2. While the inequality in (1.20) is written in a way that stresses its applicability to countably infinite collections of events, it applies as well to the special case of finite collections of events. To obtain
