
You might consider Unicode icons (e.g. ✔, ✘ or ☑, ☒) rather than links-to-nowhere to mark things that did/didn't happen.


Well, to get the conversation about 50% predictions started... I think they're fine. The actual phrasing of each individual prediction could be flipped, sure, but what's important is that you actually publish a particular framing of such positive/negative predictions, and then grade that set's outcomes. The results of each set of predictions at a particular confidence level should be drawn from a binomial distribution with theta at that particular confidence level. With theta at 50% and a large N of predictions, there are plenty of different ways to end up near N*0.5. What you don't want is for your particular set of outcomes to end up at, say, N*0.22 :P
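
To put rough numbers on that spread, here's a quick sketch (Python with scipy; the N values are just illustrative, not Scott's actual counts):

```python
from scipy.stats import binom

# How many "correct" outcomes are unremarkable for N predictions at confidence theta?
# With theta = 0.5, anything inside the central 95% range is fine; landing far
# outside it (e.g. near 0.22 * N for large N) is the warning sign.
for n, theta in [(20, 0.5), (50, 0.5), (100, 0.5)]:
    lo, hi = binom.interval(0.95, n, theta)  # central 95% range of correct counts
    print(f"N={n}, theta={theta}: roughly {int(lo)}-{int(hi)} correct "
          f"({lo / n:.0%} to {hi / n:.0%})")
```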


Intellectually uninteresting but clearly phrased question: how much does the desire to see your predictions be accurate influence the ones you have control over?

Intellectually interesting but vaguely worded question: something something desire to see predictions be accurate something something general model of brain something something actions one has control over


So uh... when can we purchase the Unsong revision? Asking for a friend


50% predictions don't help to see if you are calibrated, but other than that, sure, they mean something, assuming the 50% results from a genuine best attempt.


What biohacking projects did you try?


So Scott believes that the balance of evidence supports Tara Reade's accusation of sexual assault? I haven't followed the case closely, but I wasn't aware of any convincing evidence that it was credible.


Would it be possible to see Scott's calibration aggregated over all years? Or maybe some error bars? I always find myself wondering if the deviations are due to small sample size rather than miscalibration overall.


so basically 2020 is the lizard man's constant for years. makes sense.


Wait, so does 58 mean that Scott's been sitting on a revised Unsong since the start of the year? Or did it get published somewhere and I just didn't notice?


"At the beginning of every year, I make predictions. At the end of every year, I score them (this year I’m very late). Here are 2014, 2015, 2016, 2017, 2018, and 2019. And here are the predictions I made for 2020."

Wait, hold up, you made all these Coronavirus predictions at the *beginning* of 2020? When it was still only in Wuhan? I feel like I'm missing something here.


Doesn't a 50% prediction mean that you would have predicted the inverse at 50% as well? How did you decide which side of that is "right" and which side is "wrong?"


Are we still going to get to read [redacted]? Particularly those [redacted] that you were relatively confident you would end up writing...

Won't lie that I got kind of excited when I saw those were in blue; it's nice to hear that you have ideas you consider significant enough to predict in this way for future blogposts.


My mind treats a low prediction (10%, 20%, 30%) which happens as a success, rather than a failure, even though Scott's math treats, for instance, a 10% prediction of X as a 90% prediction of not-X.

I think this is because Scott picked the statements. Imagine if he predicted, with 10% certainty, something so unlikely that most of us aren't even thinking about it. For instance, that the events described in the Book of Revelation would happen, exactly as written. Then imagine that it actually happened. Making that particular 10% prediction, rather than any other, would make him seem amazingly smart, next to all the rest of us who weren't even talking about it! We wouldn't care that he had (technically) made a 90% prediction that it wouldn't happen. He wouldn't be sad that his 90% not-the-eschaton-this-year prediction failed -- he'd feel vindicated.

But I'm wrong to think of it that way, because the topics he's predicting are things we're all talking about all the time. He's not adding any new information just by bringing up that these things are possibilities. I ought to mentally flip low-probability predictions into high-probability predictions of the opposite, the way he does for the graph.


Black Swan:

The problem with calibration is that it only makes sense if your predictions are independent. If a black swan appears and affects everything, they are highly correlated and you will probably look overconfident that year. But, yes, that's OK if you average over other years when you were underconfident. But covid wasn't a black swan: you knew about it before making the predictions. It should have been obvious that it would mess everything up. But you can still have the problem that the predictions were all correlated for other reasons, in particular that they depended on the single variable of the strength of lockdowns.
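
A tiny simulation sketch of that last point (numpy; the shock size and counts are made-up parameters, not an estimate of anything real): a shared driver like lockdown strength makes a single year's hit rate swing far more than independent predictions would.

```python
import numpy as np

rng = np.random.default_rng(0)
n_preds, n_years, conf = 50, 10_000, 0.8

# Independent case: each prediction resolves true with probability 0.8 on its own.
indep = rng.random((n_years, n_preds)) < conf

# Correlated case: one yearly shock (e.g. lockdown strength) shifts every
# prediction's true probability up or down together.
shock = rng.normal(0.0, 0.15, size=(n_years, 1))
corr = rng.random((n_years, n_preds)) < np.clip(conf + shock, 0, 1)

print("std of yearly hit rate, independent:", indep.mean(axis=1).std().round(3))
print("std of yearly hit rate, correlated: ", corr.mean(axis=1).std().round(3))
```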


As I pointed out last year: Yes, of course 50% predictions are meaningful, it's just assessing them for *calibration* in this way that's meaningless. Assess them for something other than calibration and that they're meaningful will be obvious.


Surprised you rated the Tara Reade evidence like you did. I personally also rated it highly till I found information about her Russian links that implied a covert-ops angle. I think that came out well in advance of the election as well. Did that influence your rating at all?


I think your conversion 5%->95% etc. is flawed. Say you believe an event happens with 20% probability, but it actually happens 40% of the time; then you are underconfident in that prediction. But you guessed the negation would happen with 80% probability, and it actually happens 60% of the time, so you are overconfident in the negation. That means that once you combine your data you can no longer talk about over/underconfidence, but only about distance from the green line, because an original data point has a different "direction of improvement" than a flipped data point. (Also, 50% predictions are suddenly a lot less mysterious; they were only confusing because we assumed you could flip the direction of your prediction at will, which you can't.)


Disappointing that Substack doesn't allow strikethrough, as you've used it for expressive effect in some of your previous posts. Please add it to the list of things you're asking Substack to implement, if you haven't already.


In this scoring system, I'm not certain "overconfident" and "underconfident" are different.

If you're 40% confident of some set of things happening, and 30% of them do, you were "overconfident". But if you'd stated each of these inverted (a 60% chance of each thing not happening), then 70% of them did happen, and you were "underconfident". For the exact same predictions and results.

There isn't over or under confidence, just accuracy.

This also resolves the 50% issue: direction is arbitrary anyway, it's the distance from consistent accuracy that matters, for 50% and every other number.
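
A minimal sketch of that in code (plain Python, using the made-up 40%/30% numbers from above): the same results scored as stated or inverted sit the same distance from perfect calibration, only the over/under label flips.

```python
# Ten hypothetical predictions, each stated at 40% confidence; 3 came true.
stated_conf = 0.40
n, hits = 10, 3

as_stated = hits / n        # 30% came true vs. 40% stated  -> "overconfident"
inverted = (n - hits) / n   # 70% came true vs. 60% stated  -> "underconfident"

print(f"stated at 40%:   {as_stated:.0%} true, gap {abs(as_stated - stated_conf):.2f}")
print(f"inverted at 60%: {inverted:.0%} true, gap {abs(inverted - (1 - stated_conf)):.2f}")
# The gap from perfect calibration (0.10) is identical; only the label flips.
```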


"35. UK, EU extend “transition” trade deal: 80%"

I think this one should be false - the UK and EU signed a new trade deal and did not extend the "transition" phase beyond 31st Dec. Concretely, this is when the UK stopped being part of the EU single market and customs union.


I'm not mathy enough to weigh in on whether 50% predictions are meaningful for calibration, but I do have the unrelated concern that they're kind of... unnatural? I feel like if I were doing a project like this I would use them either extremely sparingly or not at all. Obviously there are cases where it's clearly correct, like predicting a literal coin flip, or predicting election results in a country you've barely heard of between two candidates you know nothing about, but I don't think it's appropriate when we're talking about any reasonably complex real-life scenario that you're even somewhat well-informed about.

I think my objection is that 50% seems suspiciously precise to me, in the same way that a 71.18522% prediction would seem suspiciously precise. Because it implies that you're holding both possible outcomes as being not only close but exactly equal. Like, when you said Trump had a 50% chance of being re-elected, you're saying that all the evidence you have in favour of his re-election just so happens to be exactly as convincing as all the evidence against? Really? Isn't that kind of a huge coincidence?

And you might reply, "well, predicting 80% would mean claiming that the evidence in favour is exactly four times as convincing as the evidence against, and isn't that a big coincidence too?" But I don't think that's right. If you predict something at 80% you've told us which outcome you think is more likely to happen, and we understand implicitly that the exact percentage is at best a rough indication of how confident you are that it will happen. 50% is unique in that you're declaring that you're completely unwilling to order the possible outcomes by likelihood, even at the lowest possible levels of confidence! It feels more like skipping the question rather than registering a genuine prediction. I think even in cases where I was really really unsure what would happen, my thoughts would be better represented by an ugly-looking prediction like 51% or 49.5% rather than an artificially clean-looking 50%.

I'd be interested to hear if other people's intuitions about this accord with mine. Also interested to hear whether I'm just being an idiot, because while I stand by the above reasoning, it did feel more solid in my head than it looks now I've typed it out.


Long live Brazil! Seize the sacred banner of national liberation!


I'm a little new here (well, not really, but I've commented only a few times and don't usually read the predictions posts), so I may have missed something, but....

....when did you make the coronavirus predictions? At the beginning of 2020, or was it after March or so, or later? To even know that hydroxychloroquine would be an issue (regardless of whether it turns out to be effective or not) would be prescient indeed if the predictions were made at the beginning of 2020.


I think you're hurting yourself a bit trying to fit everything into the paradigm of binary classification. Some of these predictions are naturally multiclass and some of them are really regression problems, notably the "how many people will die of Covid in the U.S." question. It's not really continuous, obviously, as half a person can't die and the number is also bounded between 0 and <population of the United States>, but I don't see how squeezing it into an ad hoc multiclass setup turned into several binaries helps. Just predict the actual value and give 95% confidence interval error bars and score yourself using some standard loss function for regression. Level of confidence in this case isn't calibrated by assigning some percentage value to the point estimate, but by how tightly you bound your confidence interval.
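
For what it's worth, one standard loss for interval forecasts is the interval score (width plus a penalty for missing). A minimal sketch in Python, with made-up numbers rather than Scott's actual predictions:

```python
def interval_score(lower, upper, outcome, alpha=0.05):
    """Interval (Winkler) score for a central (1 - alpha) prediction interval.
    Lower is better: you always pay the interval's width, plus a penalty
    scaled by 2/alpha for every unit by which the outcome falls outside it."""
    score = upper - lower
    if outcome < lower:
        score += (2 / alpha) * (lower - outcome)
    elif outcome > upper:
        score += (2 / alpha) * (outcome - upper)
    return score

# Hypothetical 95% intervals for a death-toll-style number, against a made-up outcome.
print(interval_score(100_000, 250_000, 350_000))  # narrow but misses -> heavily penalized
print(interval_score( 50_000, 400_000, 350_000))  # wide but contains the outcome -> pays only width
```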


Abolish the entire ruling clique. Throw off the chains of oppression. Raise high the banner of international solidarity.


"8. NYC widely considered worst-hit US city: 90%"

I'm not so sure about this one - or at least, I'd sure love to see some polling data on it. Depends on your definition of "widely considered" to be sure, but I'd bet at least 1/3 of Americans believe that Florida and/or Texas were "worst-hit" (because they didn't do what the experts demanded, therefore they *must* have worse outcomes), and that California would get quite a few votes too (because length/severity of lockdown *must* surely correlate with severity of infection)...


What character do you play in D&D, Scott?


Regarding number 35: the Brexit transition period (during which time the UK remained part of the EU single market and customs union) was not extended, instead the UK and EU agreed a new trade deal. So that prediction should be in blue.


Even a well-calibrated predictor will have situations of appearing to be systematically off, just because of small sample size. We should put some confidence intervals on that graph to see if you were overconfident to a statistically significant degree.


"US has highest death toll as per expert guesses of real numbers: 70%"

Shouldn't this be in blue?


There is a reason “Acts of God” clauses exist in all legal contracts. So now we know that you are human...


36. Kim Jong-Un alive and in power: 60%

What? Why? North Korea hasn't had a violent transition of power for longer than the USSR was around. Nothing Un does suggests he's significantly worse at maintaining power internally than his father and grandfather. Nor is he doing particularly badly internationally. I would understand such a prediction made in 2017, when Trump promised fire and fury (and even then, it's giving Trump way too much credit). ...Perhaps it was indeed made earlier than 2019?


Out of curiosity, why did you predict you'd get a Surface Book 3, and why didn't you get one?


In expectation we missed out on at least one [redacted] being published last year.


I'm interested in your analysis of the balance of evidence favoring Tara Reade as accuser, as someone with no particular opinion on Tara Reade, a general skepticism of accusations, and the impression that the vast, vast majority of those normally politically inclined to believe all accusers make an explicit exception for Reade.


It seems to me the 50% problem is less about the percentage itself and more about self-selecting the questions, which makes grading systems less meaningful in general.

I think 50% answers make more sense in cases where a third party supplies a list of questions to fill out and you compare answers to others.


Of course 50% predictions can be meaningful. 1) It depends on the error bars, and 2) it depends on what other people *expect* the outcome to be.

For instance if everyone is expecting event X to have a 90% chance of occurring and you give it 50% chance, you're signaling your belief that the data or the argument is weaker than the consensus.


Scott's scored 577 predictions since 2014. Here are the results:

50% Level: 43% correct (32 of 74)

60% Level: 60% correct (68 of 113)

70% Level: 73% correct (72 of 98)

80% Level: 80% correct (101 of 127)

90% Level: 93% correct (100 of 107)

95% Level: 91% correct (41 of 45)

99% Level: 100% correct (13 of 13)
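
For anyone who wants the error bars asked for upthread, here's a small sketch that puts rough 95% bars on each bucket (plain Python, Wilson score intervals, counts copied from the list above):

```python
from math import sqrt

def wilson_ci(hits, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Buckets above: (confidence level, correct, total)
buckets = [(0.50, 32, 74), (0.60, 68, 113), (0.70, 72, 98),
           (0.80, 101, 127), (0.90, 100, 107), (0.95, 41, 45), (0.99, 13, 13)]
for level, hits, n in buckets:
    lo, hi = wilson_ci(hits, n)
    flag = "" if lo <= level <= hi else "  <- outside the interval"
    print(f"{level:.0%}: {hits}/{n} correct, 95% CI {lo:.0%}-{hi:.0%}{flag}")
```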


50% predictions are clearly useful. You should get 50% of your 50% predictions correct; if you get 75% of them correct, you're underconfident, and if you get 25% correct, you're overconfident.

A single 50% prediction isn't very useful but a large number of them aggregated together clearly is.
