User:IssaRice/Scoring rule
how can we formalize the idea of a rule for scoring predictions?
first pass: statements and probabilities
we can start with a list of statements $s_1, \ldots, s_n$. each statement makes a yes/no prediction about the future, like "the die will show 3 when rolled". then we have a list of probabilities $p_1, \ldots, p_n$, where $p_i$ is the probability someone assigns to $s_i$ being true. now, reality evaluates each statement, giving us a yes/no answer $\omega_i \in \{0,1\}$ for each $s_i$. our probabilities are scored against this response from reality. so a scoring rule $S$ can be some function of $(p_1, \ldots, p_n, \omega_1, \ldots, \omega_n)$. so the type can be $S \colon [0,1]^n \times \{0,1\}^n \to \mathbf{R}$.
if we are an ordinary statistician [1], we might pick a rule like $S(p_1, \ldots, p_n, \omega_1, \ldots, \omega_n) = \sum_{i=1}^n (p_i - \omega_i)^2$. (this is actually almost the brier score)
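the first-pass rule above is a one-liner in code. this is just a sketch; the function name is mine, and i'm using the convention that reality's answers are encoded as 1 (true) and 0 (false):

```python
def quadratic_score(probs, outcomes):
    """sum of squared differences between the submitted probabilities
    p_1..p_n and reality's yes/no answers (1 = true, 0 = false).
    under this sign convention, lower is better."""
    assert len(probs) == len(outcomes)
    return sum((p - w) ** 2 for p, w in zip(probs, outcomes))

# one statement: "the die will show 3", assigned 1/6, and it doesn't happen
print(quadratic_score([1/6], [0]))  # (1/6 - 0)^2 = 1/36 ≈ 0.0278
```

note that a perfectly confident, perfectly correct forecaster gets a score of 0, the best possible value under this convention.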
second pass: events
in probability theory, we are used to dealing with events and random variables. in the previous section, we naively stated scoring rules in terms of statements and probabilities. but we might try now to phrase things in terms of events.
instead of statements $s_1, \ldots, s_n$, we could have a list of events $E_1, \ldots, E_n$. here, $E_i$ is an event expressing the fact that $s_i$ is true. then $p_i = P(E_i)$, where $P$ is the probability measure which encodes our knowledge of what events are likely. $\omega$ is the outcome in some possible world, so $\omega \in \Omega$. the idea here is we have some implicit sample space $\Omega$ of all "possible worlds". then each $\omega \in \Omega$ is a possible world, and the yes/no answer for $s_i$ is just whether $\omega \in E_i$. but this is exactly the idea expressed by our reality function -- we could have had some other reality $\omega'$ in which our same probabilities would perform differently.
so our second pass is that we can define a scoring rule as something that takes a list of events $E_1, \ldots, E_n$, a probability measure $P$ (which encodes the numbers $p_i = P(E_i)$, assuming we have access to the events), and a world $\omega \in \Omega$. so the rule computes something like $S(E_1, \ldots, E_n, P, \omega) \in \mathbf{R}$.
you might complain that in the previous section, we didn't tell the scoring function anything about what the statements were that were being predicted, whereas here we do (since we pass in the events). this is right, but the point is that here, if we don't pass in the events, then given just $P$ and $\omega$ we have no way to compute the probabilities $p_i = P(E_i)$.
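the second pass can be sketched concretely. everything here is a made-up toy (the names `prob` and `score` are mine): the sample space is the six die outcomes, events are subsets of it, and the measure is a dict of world-probabilities:

```python
# sample space of "possible worlds" for one die roll
omega_space = {1, 2, 3, 4, 5, 6}
P_world = {w: 1/6 for w in omega_space}  # uniform measure over worlds

def prob(event):
    """p_i = P(E_i): the probability of an event under the measure."""
    return sum(P_world[w] for w in event)

def score(events, actual_world):
    """the quadratic score from the first pass, now computed from events:
    the yes/no answer for E_i is whether the actual world lies in E_i."""
    return sum((prob(E) - (1 if actual_world in E else 0)) ** 2
               for E in events)

events = [frozenset({3}), frozenset({1, 3, 5})]  # "it's a 3", "it's odd"
print(score(events, 3))  # (1/6 - 1)^2 + (1/2 - 1)^2 = 34/36
```

notice that the scoring function never sees the probabilities $p_i$ directly; it recovers them from the events and the measure, which is exactly the complaint addressed above.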
third pass: non-binary predictions
instead of having all our predictions be about yes/no questions, we could allow more kinds of responses. for instance, instead of six separate predictions "the next roll of the die will be $i$" (yes: 1/6, no: 5/6) for $i = 1, \ldots, 6$, we could have a single prediction "the next roll will be ..." with 1: 1/6, 2: 1/6, ..., 6: 1/6.
given one prediction with $k$ options (mutually exclusive and collectively exhaustive), and a second prediction with $m$ options (again mutually exclusive and collectively exhaustive), we could roll them into a single prediction with $km$ options, one for each pair of options. so in a way, we could work as if there were only one thing to be predicted.
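a quick sketch of rolling two predictions into one. the die/coin setup is my own toy example, and i'm assuming the two questions are independent so that the joint probabilities just multiply (in general you'd need the full joint distribution):

```python
from itertools import product

die = {i: 1/6 for i in range(1, 7)}   # k = 6 options
coin = {"H": 1/2, "T": 1/2}           # m = 2 options

# one combined prediction with k*m options, assuming independence
joint = {(d, c): die[d] * coin[c] for d, c in product(die, coin)}

print(len(joint))           # 12 options
print(sum(joint.values()))  # still sums to 1
```

the combined options are again mutually exclusive and collectively exhaustive, which is what lets us treat them as a single prediction.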
so given probabilities $p_1, \ldots, p_k$ of the $k$ options for a single prediction, a scoring function could take those probabilities, plus information about which option was correct, so some number $i \in \{1, \ldots, k\}$. so $S \colon [0,1]^k \times \{1, \ldots, k\} \to \mathbf{R}$.
here we get to one of the confusions i have about the explanation of scoring rules in [2]. there, we score each probability separately, i.e. the score is just $\log p_i$, where $i$ is the option that actually occurred. and we assume that $\sum_{j=1}^k p_j = 1$. i think it's better to write this as $S(p_1, \ldots, p_k, i) = \sum_{j=1}^k \delta_{ij} \log p_j$, where $\delta_{ij}$ is 1 if $j=i$ and 0 otherwise. this makes the dependence on all the probabilities clear.
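both ways of writing the logarithmic score compute the same number, which a short sketch can check (the function names are mine; options are indexed from 0 here for convenience):

```python
import math

def log_score(probs, i):
    """logarithmic score: log of the probability assigned to the
    option that actually occurred."""
    return math.log(probs[i])

def log_score_delta(probs, i):
    """the same score written with an explicit kronecker delta, making
    the (trivial) dependence on all the probabilities visible."""
    return sum((1 if j == i else 0) * math.log(p)
               for j, p in enumerate(probs))

probs = [1/6] * 6
print(log_score(probs, 2))        # log(1/6) ≈ -1.7918
print(log_score_delta(probs, 2))  # same value
```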
how do we make sense of stuff like the invariance from Technical Explanation? we first need two partitions $\{A_1, \ldots, A_m\}$ and $\{B_1, \ldots, B_n\}$ of the sample space. these two partitions induce a third partition where each element of the new partition is the intersection of one element each from the As and Bs, e.g. $A_i \cap B_j$. call this new partition $\{C_{ij}\}$. then we want something like $\log P(C_{ij}) = \log P(A_i) + \log P(B_j \mid A_i)$ for every $i, j$ (need better notation?) -- i.e. scoring the refined prediction in one step gives the same total as scoring the A-question first and then the B-question conditional on the answer.
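the identity behind this invariance is just $P(A_i \cap B_j) = P(A_i) \, P(B_j \mid A_i)$ with logs taken on both sides, which we can check numerically. the worlds and probabilities below are made up for illustration:

```python
import math

# a joint distribution over a 2x2 refined partition {C_ij} = {A_i ∩ B_j}
P = {("A1", "B1"): 0.2, ("A1", "B2"): 0.3,
     ("A2", "B1"): 0.1, ("A2", "B2"): 0.4}

def P_A(a):
    """marginal probability of a cell of the A-partition."""
    return sum(p for (x, _), p in P.items() if x == a)

def P_B_given_A(b, a):
    """conditional probability of a B-cell given an A-cell."""
    return P[(a, b)] / P_A(a)

a, b = "A1", "B2"
one_step = math.log(P[(a, b)])                          # score C_ij directly
two_steps = math.log(P_A(a)) + math.log(P_B_given_A(b, a))  # score A, then B|A
print(one_step, two_steps)  # equal
```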
proving logarithmic scoring is proper
from [3], if $q$ is the actual beliefs of the forecaster, and $p$ is the submitted probabilities, then to be proper we want $q \in \arg\max_p \mathbf{E}_q[S(p, i)]$, where the expectation is over the outcome $i$ distributed according to $q$. to make sense of this, first just pay attention to the bare $p$'s. we're looping through all possible $p$'s we can submit to the scoring process, and we want to find the $p$ that maximizes the score. but we can't just do this raw kind of maximization, because then we'd obviously get that the best probability is to assign 1 to something if it will happen, and 0 if it won't happen -- but we can't predict the future so well, so this is useless! so instead of maximizing the raw score, we want to maximize the expected score, given our current beliefs $q$. so this maximization process spits out some probabilities $p$ that we should submit that will get us a high score, given our current beliefs. now, a score is proper if one way to do good is to just submit your actual beliefs, i.e. if $p = q$ attains the maximum.
for logarithmic scoring the expectation is $\sum_{i=1}^k q_i \log p_i$. differentiating this, setting it to 0, and solving for $p$ subject to the constraint $\sum_i p_i = 1$ (e.g. with a lagrange multiplier) [is this even valid here? we need to check for convexity/increasing/whatever optimization stuff but idk how] gets us $p_i = q_i$ for every $i$.
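short of the full optimization argument, we can at least sanity-check the conclusion numerically: with fixed beliefs $q$, no randomly chosen submission $p$ should achieve a higher expected log score than submitting $q$ itself. the setup below is my own toy example:

```python
import math
import random

q = [0.5, 0.3, 0.2]  # the forecaster's actual beliefs

def expected_log_score(p):
    """expected score sum_i q_i * log(p_i) of submitting p,
    when outcomes are really distributed according to q."""
    return sum(qi * math.log(pi) for qi, pi in zip(q, p))

best = expected_log_score(q)  # score for honest reporting
random.seed(0)
for _ in range(1000):
    raw = [random.random() for _ in q]
    p = [r / sum(raw) for r in raw]  # a random point on the simplex
    assert expected_log_score(p) <= best + 1e-12

print("no random submission beats reporting q itself")
```

this doesn't prove properness (a proof needs the convexity-type argument flagged above, or gibbs' inequality), but it's consistent with $p = q$ being the maximizer.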