Woohoo, I beat putting 50% on every single answer, Vox Future Perfect, and roughly one out of four other participants!

(Sigh. I knew from GJ Open that I'm not great at predictions, but I thought that long experience over there would have actually helped. Apparently not.)

Just out of curiosity, how do you score answers that are not "yes" or "no" but rather Bayesian-style percentages? One take is that if someone guessed Gorg will do Blah with probability 10%, and Gorg does do Blah, that person should get one tenth of a point. Another take is that it shouldn't be linear, and someone who guessed 0% should be executed immediately.

I was going to ask on the next open thread why people were not all that worried about a nuclear war - but since we are on a thread about forecasting, does anybody want to chime in? Or was it covered in the contest (which I didn't get involved in)?

I'm talking about the Ukraine war here, not nuclear war in general. In particular, if the Russians start to lose badly, will a Ukrainian army that pushes the Russians back to the border stop at the border? Stopping there seems to be what the western world demands, but if the Ukrainian army is routing the Russians, it may have no particular desire to stop.

That said, of course, it’s unlikely that the west would continue to supply arms to Ukraine if they cross the border, but you never know. Everybody’s backs are up.

Despite my pessimism I would put the odds at 10-20% but that’s still far too high for comfort.

Well, one consequence of the prediction markets' overperformance is that, at least if you only have a few minutes to spare, you should look things up on prediction markets rather than trying to do your own research.

Huh, I scored much higher than I expected (on par with the "median superforecaster" apparently).

Now I'm curious to see what specific answers I put. But the Google doc is confusing. Is there some way of finding this out? (I can see my "score" for each question, but I don't know what that means.)

I think it would help vividly illustrate the mystical-ness if you turned the log-loss numbers back into something intuitively interpretable. For the best forecasters, when they say there's an 80% chance something happens, how often does it happen? Can you draw the calibration graphs that you did when scoring your own predictions?
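For anyone who wants to try this on their own answers, a calibration table is easy to compute from (probability, outcome) pairs. A minimal sketch, using made-up forecasts rather than the contest's actual data:

```python
# Bucket forecasts by stated probability (deciles) and compare the
# stated confidence to the empirical frequency of the event happening.
# These (probability, outcome) pairs are made-up illustrative data.
forecasts = [(0.1, 0), (0.15, 0), (0.3, 1), (0.45, 0), (0.5, 1),
             (0.65, 1), (0.7, 0), (0.8, 1), (0.85, 1), (0.9, 1)]

buckets = {}  # decile index -> (hits, total)
for p, outcome in forecasts:
    d = min(int(round(p * 100)) // 10, 9)  # 0..9, so 1.0 joins the 90s bucket
    hits, total = buckets.get(d, (0, 0))
    buckets[d] = (hits + outcome, total + 1)

for d in sorted(buckets):
    hits, total = buckets[d]
    print(f"{d * 10}-{d * 10 + 10}%: said ~{d * 10 + 5}%, "
          f"happened {hits}/{total} = {hits / total:.0%}")
```

A well-calibrated forecaster's 80% bucket should come out near 80% happened; with only a few dozen questions per bucket, though, expect a lot of noise in each row.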

The median score in the table of this post is 34.63, but I was given a 68th percentile for a score of 36.96.

How do I reconcile this? Am I missing something obvious? The results page says higher is better for percentiles so I don't think I'm misinterpreting the ordering of the percentiles.

Side note: Would really love a calibration plot on the personal results page as well!

I've never understood what people mean when they individually (rather than in aggregate) assign a probability to a macro event like a political outcome. Under what fundamental understanding of probability does it make sense to say that there's a 40% chance of a ceasefire? That in 40% of the universes that split off from the present moment there is a ceasefire but in the other 60% there isn't?

One could say that probability is a high-level description of events that are not probabilistic intrinsically - e.g. the outcome of a coin toss is always deterministically 100% one way or the other, but the chaos of the physical system leads us to say that our predictive potential is only 50%.

By that analogy, the figures people assign would have to be some sort of confidence rating grounded in the parity of background conditions. That much I can accept but I suppose it's the resolution that gets to me. One might say "I'm confident that the outcome is more likely than not" - and that's more a statement of personal conviction, or the setting of betting odds, rather than historical analysis - but what sort of quantitative reasoning could lead someone to think that their own individual confidence can be rated on a percentage scale?

It seems like it has to say something behavioural about the person rather than something mathematical about the proposition but I can't put my finger on it.

1. It is really easy to apply a linear transformation to the scores to make them more interpretable; for example, a common transformation sets "guessing 50% for everything" at a score of 0, and "giving the correct answer to everything with perfect confidence" at a score of k (where k is the number of questions). Note that this linear transformation also makes larger scores better than smaller ones, which is more intuitive.

Why not apply such a linear transformation? The current scores seem strictly worse.

2. Aside from the linear transformation, another degree of freedom is the choice of scoring rule. I understand that you want a proper scoring rule to incentivize people to reveal their true Bayesian estimates. But there are a LOT of proper scoring rules, and the log-score or Brier are not the only ones.

If you want to disincentivize the strategy of "be overconfident on purpose", you can do this (at least partially, not perfectly) by picking a proper scoring rule that harshly penalizes overconfidence -- even more harshly than the log score. There are several of those! One choice is

sqrt(1/p-1)

where p is the probability assigned to the correct outcome. (For comparison, the log score is log(1/p) and the Brier score is (1-p)^2, up to linear transformations.)

Note that as p->0 (that is, if you very confidently predict something wrong), the log score goes to infinity, but the score I proposed above goes to infinity *much faster*, at a rate of around sqrt(1/p) instead of log(1/p). So being overconfident is much more expensive! But it is still a proper scoring rule.
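Both points are easy to check numerically. A quick sketch (not tied to the contest's actual scoring; the rescaling in point 1 uses base-2 logs so that a 50% guess costs exactly 1 point):

```python
import math

# Point 1: a linear rescaling. With base-2 log loss, guessing 50% costs
# exactly 1 point per question, so with k questions
#     rescaled_score = k - total_log2_loss
# puts "50% on everything" at 0 and a perfect forecaster at k.
k = 10
assert k - k * math.log2(1 / 0.5) == 0   # coin-flipper -> 0
assert k - k * math.log2(1 / 1.0) == k   # perfect forecaster -> k

# Point 2: compare the log score with the proposed harsher rule.
def log_loss(p):    # loss when the outcome you gave probability p happened
    return math.log(1 / p)

def sqrt_loss(p):   # the proposed rule: sqrt(1/p - 1)
    return math.sqrt(1 / p - 1)

def expected_loss(loss, q, p):
    # binary event with true probability q, reported probability p
    return q * loss(p) + (1 - q) * loss(1 - p)

# Numerical check of propriety: expected loss is minimized at p = q,
# so honest reporting is optimal under both rules.
grid = [i / 1000 for i in range(1, 1000)]
for loss in (log_loss, sqrt_loss):
    for q in (0.2, 0.5, 0.7):
        best = min(grid, key=lambda p: expected_loss(loss, q, p))
        assert abs(best - q) < 0.002

# Penalty for a confidently wrong forecast: the sqrt rule explodes faster.
for p in (0.1, 0.01, 0.001):
    print(f"p={p}: log score {log_loss(p):.2f}, sqrt rule {sqrt_loss(p):.2f}")
```

For instance, a wrong answer assigned p = 0.001 costs about 6.9 under the log score but about 31.6 under the sqrt rule, so the overconfidence tax grows much faster.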

I guess I don't find forecasting very interesting because it's all bounded by the fact that you have to know which questions to ask to begin with, meaning you can't predict something you didn't think to ask.

The bar chart says the _median_ superforecaster scored in the 84th percentile; the text 2 paragraphs below it says the _average_ superforecaster. Are both true, or is one of those a mistake?

Wow what can I say...I mean I would like to thank Scott but I am not sure if it's appropriate :D... but still, a fantastic contest and I am flabbergasted that I am in the top 5...

Has anyone ever looked at super-antiforecasters (people who reliably forecast worse than random betting) in aggregate, like they do superforecasters? I'm curious how effective a strategy it would be to bet strictly *against* the most reliably terrible forecasters.

Naively, one would expect that the worst forecasters would aggregate to the equivalent of "bet 50% on everything", but perhaps in reality the worst forecasters actually have a strategy that is just upside down, like they apply a NOT to what otherwise would have been a correct result or something.

I am glad I submitted for 2023 and regret not submitting for 2022. If all my predictions for the year are worse than chance, I am going to have some hard conversations with myself. I think at that point you have to employ the famous Costanza strategy of just doing the exact opposite of what you think is a good idea and figure out why later.


>If you analyze raw scores, IQ correlates with score pretty well. But when you analyze percentile ranks, you need to group people into <150 and >150 to see any effect.

How is this possible? I would expect raw scores and percentile ranks to correlate very well.

I'm guessing Scott just means that the effect was only statistically significant with raw scores?

I wonder how I would've done at predicting what percentile I'd end up taking. Feels like I'd have been about right, but that's easy to say now that I know.

Can you give the percentiles for people with >130, 140 and 160 IQ (self reported) to determine that the 150 IQ cutoff is not cherrypicked? Does it work like that?

If I'm not wrong, almost 1 in 5 people predicted probabilities of Lula being elected and Bolsonaro being re-elected that add up to 110% or more (3 out of 12 superforecasters did).

" Maybe it’s possible to say with confidence that a 41% chance to be better than a 40% chance, and for us to discover this, and to hand it to policy-makers charting plans that rely on knowing what will happen."

That's the bit that gives me the heebie-jeebies about using prediction markets. A change to 41% from 40% isn't a huge increase so probably the policy-makers, if they take any notice of it at all, won't adjust their plans too far from what they originally thought. And of course people who make policies are already using experts and trying to shave more and more uncertainty off predictions.

But what about when predictors start going "We're 70% sure. We're 80% sure. We're 90% sure"? That's a big divergence and *if* the policy-makers trust the predictors, that means a big change from what they were originally intending to do.

Some predictions are simple yes/no - there will or there won't be a cease-fire in Ukraine. But what concerns me is the logic-chopping even in the toy example: did Nancy Pelosi retire or not? Is Eric Lander a Cabinet-level official or not?

It's not much consolation to the smoking crater where a village used to be that "Well, *technically* the question was resolved correctly if you re-do the wording" which basically means "Not our fault the policy-makers picked the wrong decision based on our predictions".

If even a simple term like "retire" can't be settled without arguing "She did retire" "But not as a Congresswoman so she didn't in fact retire retire", or "No Cabinet official quit" "By a technical definition this obscure guy is Cabinet level", then why expect any policy-maker to give you the time of day?

What are the questions that entrants got wrong the most?

As Skerry said above, 2022 didn't seem as weird as, say, 2020.

Personally, I didn't think Russia would invade Ukraine (figuring it would do something more debatable, like in 2014), but I was wrong and the U.S. government (as of 12/31/21) was right, so I wouldn't rank that as too weird, more just me being wrong.

One problem with forecasting contests is that the really weird events of importance don't have any questions about them because nobody saw them coming. For instance, I doubt anybody at the end of 2014 asked if Angela Merkel would let a million Muslims in in 2015. Of course, she did, and that wound up making more weird events more likely in 2016, like Brexit and Trump.



So I did not do well last year; somewhere around the 20th percentile.

My failed predictions were highlighted by "while Russia is probably going to do something militarily in Ukraine, it certainly isn't going to do (the exact invasion we got)". After I adjusted for "Putin has gone mad and is going to pursue losing strategies consistently" in March, my predictions on the topic have gotten better.

At a per-question level, I lost the most points on "will any state legalize a psychedelic in 2022"; I said 20% (the average was 75%), and I still think that was a defensible guess even though it did happen, in Colorado.

Unfortunately the Google spreadsheet is too large and cumbersome for me to find my other specific predictions.

My university is inviting us to attend a 2-day "foresight fundamentals workshop" offered by the Institute for the Future (https://www.iftf.org/). I never see this organization mentioned in discussions of forecasting. Does anyone know anything about this group? Would attending their trainings be worthwhile?

> A person who estimates a 99.99999% chance of a cease-fire in Ukraine next year is clearly more wrong than someone who says a 41% chance.

Technically, if there is a cease-fire in Ukraine next year, the person who gives a probability of 99.99999% is *less* wrong than someone who gives a probability of 41%. At least, in terms of probability as a thing that is scored with reference to reality.

Some epistemologists think there is an objective notion of "evidence" that makes some probabilities be a "correct" report of the evidence. But if there is such a thing, you can't use calibration or scoring rules to measure it (at least not directly).

I don't believe in an idea of an objectively correct report of evidence. Instead, I think the way we do this work is by asking whether a person's *method* of forming probabilities does well in terms of score (as a match with reality) not just in the actual world, but in nearby possibilities. I think that reliably being relatively accurate is the only evidence-type thing that we can have.

> Actually, if you analyze raw scores, liberals did outperform conservatives, and old people did outperform young people. [...] some people did extremely badly, so their raw scores could be extreme outliers

This seems to imply that conservatives/young'uns have a greater number of individuals who are confidently *very wrong* (countered by a segment that are slightly more correct than average).

So in other words, the takeaway is that the high-temperature right wing influencer takes that look like dumb predictions probably *are* dumb, but reasonable conservatives that don't base their predictions on Ben Shapiro are likely to be grounded in reality.

Well, either that or liberals are just boring centrists as usual.

It's interesting to see Ryan explicitly calling out that he sought to maximize the probability of winning the contest, not to minimize his expected log loss. In case anyone is interested, colleagues and I have a paper on forecasting contest design. We show that any contest where the winner is chosen deterministically will suffer from a similar problem (truthfully reporting probabilities might not be an optimal strategy). If you're willing to choose a winner randomly (typically: non-uniform), you can get around this problem, at the cost of selecting a bad forecaster as your winner with some probability. Given independence of event outcomes, this probability gets smaller and smaller the more events are in the contest.

Can I check: your footnote refers to group people into <150 and >150, but which group are you putting people who record an IQ of exactly 150? It makes a surprisingly big difference, because the first round responses have 29 people claiming an IQ>150 and 8 people claiming an IQ of exactly 150. Incidentally, one person claims an IQ of 212, which I think must be either a typo or a lizard: I've removed it by hand.

So, I looked at my scores (in the excel file) again...overall, I wasn't that different from average, but where I really did better was in election predictions , particularly for the US midterms. Seems like most people were much more bullish on the GOP to win both the Senate and House, which was reflected in the prediction markets...Idk why I was more cautious here, but I guess it might have to do with me taking into account the education realignment benefiting the Democrats more than the GOP (especially at midterms)?

This reminded me of a maybe too-popularised classic, The Art of War, written roughly in the 5th century BC.

Just compare a quote from the blog: ".. people who can do lots of research beat people who do less research."

with a quote from a text written a couple of millennia earlier:

"The general who wins a battle makes many calculations the battle is fought. The general who loses a battle makes but few calculations beforehand. It is by attention to this point that I can foresee who is likely to win or lose."

And just for the record: people who are near Russia might be more likely to predict what happens here. I spotted an anomaly in the amount of updates in domains related to Kremlin propaganda during the summer of 2021. Only when it started peaking did I mention it publicly.

There needs to be some justification as to why log loss is the correct loss function to use here. Log loss has properties that make it nice for training ML classifiers, but those properties make it really weird in other contexts. For example:

* predicting 1% for something that happens 5% of the time has the same expected loss as predicting 80% for something that happens 99% of the time. Being 4% off is the same as being 19% off!

* predicting 90% for something that happens 80% of the time is 50% more lossy on average than predicting 80% for something that happens 90% of the time. If you're going to be wrong, it pays to be wrong on the less extreme side of the equation.

There is also a symmetric cost assumption in log loss which likely jars with human intuition. The cost of false positives is rarely the same as the cost of false negatives for any real prediction. Nor for that matter is the value of true positives and true negatives likely to be the same.

In doing these kinds of surveys, I think people also tend to assume bounded loss per question. I'm not sure many people realize that getting one question wrong with a prediction of 1% is over 6 times worse than answering 50%. This is likely why averaging tends to improve loss: it reduces the effect of outlier incorrect predictions that dominate the loss total.
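All three claims above check out numerically; a quick sketch:

```python
import math

def expected_log_loss(q, p):
    # Binary event with true frequency q, stated probability p,
    # scored with natural-log loss.
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))

# 1% for a 5% event vs 80% for a 99% event: nearly identical expected loss.
assert abs(expected_log_loss(0.05, 0.01) - expected_log_loss(0.99, 0.80)) < 0.01

# 90% for an 80% event is ~50% more lossy than 80% for a 90% event.
ratio = expected_log_loss(0.80, 0.90) / expected_log_loss(0.90, 0.80)
assert 1.45 < ratio < 1.55

# One question answered 1% that resolves YES costs ~6.6x a 50% answer.
assert abs(math.log(1 / 0.01) / math.log(1 / 0.5) - 6.64) < 0.01
```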

I'm curious to see what tools and code you all use for belief aggregation. I already entered so it's too late for me to profit from this, but I want to understand this better for some of my other work projects like bit.ly/eaunjournal.

DM me if interested and I can share my code and data on this on a 'share for share' basis.

Is the 2023 contest still open?

People with self-reported IQ above 150 doing better is pretty surprising to me. I would have expected them to do worse.

typo: with a background in economist (should be economics)

I am still not sure where I can see how I answered each individual question?

Would there be any interest in the following contest? I could offer a $1000 prize.

- Participants will submit one prediction for 2023 on any topic or event

- If there are many participants, the top 30 most interesting predictions, as chosen by me or by Scott if he wishes to publicize it, will be selected

- People will vote on whether they expect each of these predictions to come true

- The winner will be the person with the least likely correct prediction

Basically, I am not fond of the format of pre-selected predictions. Any potential improvements?

Well, I predicted the Ukraine invasion in advance. Here's proof.

https://old.reddit.com/r/TheMotte/comments/s2gk0v/will_nato_expansionism_lead_to_a_war_between_the/?sort=controversial

Sadly, I was banned from the Motte because I don't show enough respect for other people's feelings. I guess making sure idiots feel validated is more important than being able to predict and stop a potential war, at least in the eyes of Reddit mods

You should form a focus panel of the top 10 forecasters for 2022 and have them make joint predictions for 2023, and out to 2028. Dissenting opinions would be noted.

I could help you do it.

Mean ~39.3

Standard Deviation ~6.6

M + 3SD ~59.0

M - 3SD ~19.6

1. Of course it is a game (so blue ribbon for lowest score or best coin flipper of heads). But if we are serious, unless a score was below 19.6, why is it remarkable? (IOW, why should we confuse systemic cause variation with special cause variation?)

2. Handling outliers - there are some scores where the participants did so badly as to be well beyond 3 standard deviations. For example, 106.36!, 92.65!, 74.47, 71.64, 70.52, 63.97, 63.8, 59.93. That is what you should be investigating!

If you remove these (you really should have a reason for removing them), then Mean 38.75, UCL 51.81, LCL 25.69. This will still leave no particularly noteworthy 'best score' and still a few outside on the tail. In fact, there is quite a tail, as is obvious from looking at the histogram. Why is that? Here is a plausible theory - some people weren't really "forecasting"; instead they were trying to "win" by making contrarian guesses that would make them stand out from the crowd.

So I did not do well last year; somewhere around the 20th percentile.

My failed predictions clustered around "while Russia is probably going to do something militarily in Ukraine, it certainly isn't going to do (the exact invasion we got)". After I adjusted for "Putin has gone mad and is going to pursue losing strategies consistently" in March, my predictions on the topic got better.

At the per-question level, I lost the most points on "will any state legalize a psychedelic in 2022"; I said 20% (the average was 75%), and I still think that was a defensible guess even though it did happen, in Colorado.

Unfortunately the Google spreadsheet is too large and cumbersome for me to find my other specific predictions.

My university is inviting us to attend a 2-day "foresight fundamentals workshop" offered by the Institute for the Future (https://www.iftf.org/). I never see this organization mentioned in discussions of forecasting. Does anyone know anything about this group? Would attending their trainings be worthwhile?

Can we have linked footnotes, please?

> A person who estimates a 99.99999% chance of a cease-fire in Ukraine next year is clearly more wrong than someone who says a 41% chance.

Technically, if there is a cease-fire in Ukraine next year, the person who gives a probability of 99.99999% is *less* wrong than someone who gives a probability of 41%. At least, in terms of probability as a thing that is scored with reference to reality.
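Under a proper log scoring rule this is easy to see numerically. A minimal sketch (the probabilities are the ones from the quote):

```python
# Log loss of a stated probability p given whether the event occurred.
# If the cease-fire happens, the 99.99999% forecaster scores far better
# than the 41% forecaster; if it doesn't, the extreme forecaster is
# punished enormously.
import math

def log_loss(p, outcome):
    """Negative log of the probability assigned to what actually happened."""
    return -math.log(p if outcome else 1 - p)

for outcome in (True, False):
    print(f"cease-fire={outcome}: "
          f"99.99999% forecaster loses {log_loss(0.9999999, outcome):.2f}, "
          f"41% forecaster loses {log_loss(0.41, outcome):.2f}")
```

So "more wrong" in the quoted sentence can only mean wrong relative to the evidence available beforehand, not wrong relative to the realized outcome.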

Some epistemologists think there is an objective notion of "evidence" that makes some probabilities be a "correct" report of the evidence. But if there is such a thing, you can't use calibration or scoring rules to measure it (at least not directly).

I don't believe in the idea of an objectively correct report of evidence. Instead, I think the way we do this work is by asking whether a person's *method* of forming probabilities does well in terms of score (as a match with reality), not just in the actual world but in nearby possibilities. I think that reliably being relatively accurate is the only evidence-type thing we can have.

> Actually, if you analyze raw scores, liberals did outperform conservatives, and old people did outperform young people. [...] some people did extremely badly, so their raw scores could be extreme outliers

This seems to imply that conservatives/young'uns have a greater number of individuals who are confidently *very wrong* (countered by a segment that are slightly more correct than average).

So in other words, the takeaway is that the high-temperature right wing influencer takes that look like dumb predictions probably *are* dumb, but reasonable conservatives that don't base their predictions on Ben Shapiro are likely to be grounded in reality.

Well, either that or liberals are just boring centrists as usual.

Edit: Zvi's twitter list[1] gave me an example of the exact weird bullshit predictions I was talking about: https://twitter.com/RichardHanania/status/1617765690693521408

[1] https://twitter.com/i/lists/83102521

Stop ruining my intricate user interaction theories with your lived experience.

It's interesting to see Ryan explicitly calling out that he sought to maximize the probability of winning the contest, not to minimize his expected log loss. In case anyone is interested, colleagues and I have a paper on forecasting contest design. We show that any contest where the winner is chosen deterministically will suffer from a similar problem (truthfully reporting probabilities might not be an optimal strategy). If you're willing to choose a winner randomly (typically: non-uniform), you can get around this problem, at the cost of selecting a bad forecaster as your winner with some probability. Given independence of event outcomes, this probability gets smaller and smaller the more events are in the contest.

https://pubsonline.informs.org/doi/10.1287/mnsc.2022.4410

We'd be interested to hear anyone's thoughts.
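A rough Monte Carlo sketch of the incentive problem the comment describes: under a deterministic "lowest total log loss wins" rule, a contestant can raise their chance of winning by extremizing away from their true beliefs, even though it worsens their expected loss. All the numbers here (true probabilities, field size, the 0.8 extremized report) are invented for illustration; this is not the model from the linked paper.

```python
# One contestant vs. a field of rivals who all honestly report the true
# probabilities. Honest reporting minimizes expected log loss, but it ties
# with every honest rival, so it only wins its 1/(N+1) share. Extremizing
# adds variance, and in contests variance buys outright wins.
import math
import random

random.seed(0)
TRUE_P = [0.6] * 40        # true probability of each of 40 events (made up)
N_RIVALS = 50              # honest rivals, all reporting TRUE_P

def total_loss(report, outcomes):
    return sum(-math.log(r if o else 1 - r) for r, o in zip(report, outcomes))

def win_rate(my_report, trials=2000):
    """Fraction of simulated contests where my_report strictly beats the
    (identical) honest rivals' total log loss."""
    wins = 0
    for _ in range(trials):
        outcomes = [random.random() < p for p in TRUE_P]
        if total_loss(my_report, outcomes) < total_loss(TRUE_P, outcomes):
            wins += 1
    return wins / trials

honest_baseline = 1 / (N_RIVALS + 1)   # honest report ties everyone; ties split
extremized = win_rate([0.8] * 40)      # shade every 60% belief up to 80%
print(f"honest win rate ~{honest_baseline:.3f}, extremized ~{extremized:.3f}")
```

The extremized strategy wins outright more often than the honest strategy's tie-splitting share, which matches Ryan's stated approach of maximizing the probability of winning rather than minimizing expected log loss.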

PPV should go to school and work in finance or something

Given that superforecasters seem to be fairly accurate in their predictions, what do they predict regarding AGI risk?

Can I check: your footnote groups people into <150 and >150, but into which group are you putting people who record an IQ of exactly 150? It makes a surprisingly big difference, because the first-round responses have 29 people claiming an IQ >150 and 8 people claiming an IQ of exactly 150. Incidentally, one person claims an IQ of 212, which I think must be either a typo or a lizard; I've removed it by hand.

So, I looked at my scores (in the Excel file) again... overall, I wasn't that different from average, but where I really did better was in election predictions, particularly for the US midterms. Seems like most people were much more bullish on the GOP to win both the Senate and House, which was reflected in the prediction markets... Idk why I was more cautious here, but I guess it might have to do with me taking into account the education realignment benefiting the Democrats more than the GOP (especially at midterms)?

This reminded me of a maybe too-popularised classic, The Art of War, written roughly in the 5th century BC.

Just compare a quote from the blog: ".. people who can do lots of research beat people who do less research."

with a quote from a text written a couple of millennia earlier:

"The general who wins a battle makes many calculations the battle is fought. The general who loses a battle makes but few calculations beforehand. It is by attention to this point that I can foresee who is likely to win or lose."

And just for the record: people who are near Russia might be more likely to predict what happens here. I spotted an anomaly in the number of updates to domains related to Kremlin propaganda during the summer of 2021. Only when it started peaking did I mention it publicly.

https://twitter.com/riiajarvenpaa/status/1484591975642902538?t=xX-ugJjgQzlahODnb_7i0g&s=19

There needs to be some justification as to why log loss is the correct loss function to use here. Log loss has properties that make it nice for training ML classifiers, but those properties make it really weird in other contexts. For example:

* predicting 1% for something that happens 5% of the time has the same expected loss as predicting 80% for something that happens 99% of the time. Being 4% off is the same as being 19% off!

* predicting 90% for something that happens 80% of the time is 50% more lossy on average than predicting 80% for something that happens 90% of the time. If you're going to be wrong, it pays to be wrong on the less extreme side of the equation.

There is also a symmetric cost assumption in log loss which likely jars with human intuition. The cost of false positives is rarely the same as the cost of false negatives for any real prediction. Nor for that matter is the value of true positives and true negatives likely to be the same.

In doing these kinds of surveys, I think people also tend to assume bounded loss per question. I'm not sure many people realize that getting one question wrong with a prediction of 1% is more than 6 times worse than answering 50%. This is likely why averaging tends to improve loss: it reduces the effect of outlier incorrect predictions that dominate the loss total.
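The arithmetic behind these claims is easy to check directly. A short sketch (the "6 times" figure comes out closer to 6.6×):

```python
# Expected log loss of reporting probability q for an event whose true
# long-run frequency is p.
import math

def expected_log_loss(p, q):
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

# 1% reported for a 5% event vs 80% reported for a 99% event: nearly equal
print(expected_log_loss(0.05, 0.01))   # ~0.240
print(expected_log_loss(0.99, 0.80))   # ~0.237

# 90% for an 80% event is ~50% more lossy than 80% for a 90% event
print(expected_log_loss(0.80, 0.90) / expected_log_loss(0.90, 0.80))  # ~1.51

# a single confident miss at 1% vs an even-money 50% answer: ~6.6x worse
print(math.log(0.01) / math.log(0.5))  # ~6.64
```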

I'm curious to see what tools and code you all use for belief aggregation. I already entered, so it's too late for me to profit from this, but I want to understand this better for some of my other work projects, like bit.ly/eaunjournal.

DM me if interested and I can share my code and data on this on a 'share for share' basis.