284 Comments

I think your second poll question's caveat that you were off "by a non-trivial amount" may play in here. If I was really confident in the distance from Paris to Moscow, or the weight of a cow, my second guess would be pretty close to the first. But the way the question was phrased, most people would feel compelled to change it up for their second one, even if they were very confident the first time.


This feels a bit like a human "let's think step by step" hack. Also, it seems like some part of this benefit comes from the common advice to "sleep on an important decision" and not make super important decisions impulsively.


I’m mad because I was actually super happy with how close my first guess was - but I didn’t read the question right and guessed in miles, not km. My second guess was in the wrong direction anyway, so I mostly just got lucky.


I'm out of the loop: OP == "overpowered"?


In theory, if there is no systematic bias, the error-vs-crowd-size graph should be an inverse square root, not the inverse logarithm you fit to the curve. This follows from the central limit theorem given a couple of assumptions about individual errors (i.e. finite moments).

This actually makes the wisdom of crowds much more impressive as the inverse square root tends to zero much more quickly.
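A minimal simulation of this point (assuming unbiased, finite-variance individual errors; the noise level is illustrative, not taken from the survey):

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE = 2487  # km, the value used in the post's calculations

# Assume each guess = truth + unbiased noise with finite variance.
guesses = TRUE + rng.normal(0, 900, size=100_000)

for n in [1, 4, 16, 64, 256]:
    crowds = rng.choice(guesses, size=(5_000, n))        # 5,000 crowds of size n
    mae = np.abs(crowds.mean(axis=1) - TRUE).mean()      # error of the crowd average
    print(f"n={n:4d}  MAE={mae:7.1f} km  MAE*sqrt(n)={mae*np.sqrt(n):7.1f}")
# MAE*sqrt(n) stays roughly constant, i.e. error shrinks like 1/sqrt(n).
```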


I think the poll's instruction to assume that your first answer was wrong by some 'non trivial amount' is important. It's effectively simulating the addition of new data and telling you to update accordingly. Whether the update is positive will depend on the quality of the new data, which in turn depends on the quality of the first answer!

i.e. if my first answer was actually pretty close to reality (mine was; I forget the numbers and the question now, but I remember checking after I finished the survey and seeing that I was within 100km of reality), a 'non-trivial' update is pretty likely to make the second answer worse, not better. That's quite different to simply 'chuck your guess out and try again'. It also suggests that ACX poll-takers may be relatively good at geography (compared to... pollees whose first guesses were more than what they think of as a trivial amount wrong? I don't know what the baseline is here).

Without reading through all the links above it's not clear whether the internal crowds referenced were subject to the same 'non trivial error' second data point. In the casino presumably there was some feedback because they didn't win, but I don't know how much feedback. I'm about to go to bed so I will leave that question to the wisdom of the ACX crowd and check back in the morning.


You gave a handful of examples where we could hypothetically benefit from the wisdom of crowds. But in each case, we *already* leverage the wisdom of crowds, albeit in an informal way.

E.g. my decision of academia vs industry is based not just on a vague personal feeling, but also aggregating the opinions of my friends and mentors, weighted by how much I trust them. True, the result is still a vague feeling, but somewhere under the hood that feeling is being driven by a weighted average of sorts.

I'm not sure there'd be much utility in formalizing and quantifying that--we'd probably only screw it up in the process (as you point out).


I use wisdom of the crowds when I cut wood and I don't have my square: if I need a perpendicular line across the width of a piece, I'll just measure a constant distance from the nearest edge and draw a dozen or so markings at that constant. They won't all line up (because I can't measure at a perfect right angle), but I just draw a line through the middle of them, and more often than not it's square enough, because I'm off evenly on either side of 90°.


With your last point, an important part of this is whether "wisdom of crowds" is a spooky phenomenon that comes from averaging numeric responses, or whether it's an outcome of individuals mostly having uncorrelated erroneous ideas and correlated correct ideas (so that the mistakes get washed out in the averages).

If it's the second, you'd expect that all sorts of informal and non-quantitative ways of aggregating beliefs should also work. If you want to know whether to go to academia or industry, you ask 10 friends for advice and notice if lots of them are all saying the same thing (both in terms of overall recommendation or in terms of making the same points). If you want to build a forecasting model, you can hire 10 smart analysts to work on it together.

Of course, the details matter--if you have people make a decision together, maybe you end up with groupthink because one person dominates the discussion, pulls everyone else to their point of view, and then becomes overconfident about their ideas because they're being echoed by a bunch of other people. If the "consensus information" and "individual errors" in people's thinking are fairly legible, on the other hand, you might do a lot better with discussion and consensus than with averages because people can actually identify and discard their erroneous assumptions by talking to other people.

Feb 6, 2023·edited Feb 6, 2023

What happens if you compare people's second guesses against their first? I.e., is the model predicting “thinking longer causes better guesses” excluded by the data?

My intuition is that wisdom of the crowd of one would predict that the second guess shouldn't be consistently better.


The systematic error might be better known as Jaynes's "Emperor of China fallacy".

One question I have is whether language models (and NNs in general) can be used to generate very large 'crowds'. They are much better at flexibly roleplaying than we are, can be randomized much more easily, have been shown to be surprisingly good at replicating economics & survey questions in human-like fashions, and this sort of 'inner crowd' is already how several inner-monologue approaches work, particularly the majority-voting (https://gwern.net/doc/ai/nn/transformer/gpt/inner-monologue/index#wang-et-al-2022-section “Self-Consistency Improves Chain of Thought Reasoning in Language Models”, Wang et al 2022).
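For concreteness, a sketch of the majority-voting scheme linked above (Wang et al 2022). `sample_model` is a hypothetical stand-in, not a real API; any chat-completion call at temperature > 0 would slot in:

```python
from collections import Counter

def sample_model(prompt: str) -> str:
    """Hypothetical stand-in: one sampled LLM completion at temperature > 0."""
    raise NotImplementedError

def self_consistency(prompt: str, k: int = 20) -> str:
    # Sample k independent chains of thought, keep each one's final line
    # as its answer, and take the majority vote across the k samples.
    answers = [sample_model(prompt).strip().splitlines()[-1] for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```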


I use the single-player mode a lot when I'm guessing what something will cost - and I use it on my wife too. I start with two numbers, one obviously too low and one obviously too high. I then ask:

Would it cost more than [low]?

Would it cost less than [high]?

Would it cost more than [low+$10]?

Would it cost less than [high-$10]?

. . . . and so on. You know you're getting close when the hesitance becomes more thoughtful.

I'm sure I'm not the only one who does this, but I believe that many of us do something similar in a less deliberate or structured way. If you've lived in Europe, you probably have a good feel for the scale of the place and of one country relative to the next. You may even have travelled from Paris to Moscow. If you live in North America, you may zoom out and rotate a globe of the Earth in your mind's eye until you reach Europe, and then do some kind of scaling. Estimating by either method will almost certainly give a better result than a WAG most of the time. So your "very wrong" answers weren't necessarily from lizardmen, but were just WAGs rather than thoughtful estimates.
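A sketch of the questioning procedure described above; the `ask` callback stands in for the person answering yes/no, and the bounds and step size are whatever you'd use in practice:

```python
def price_guess(ask, low=0.0, high=1000.0, step=10.0):
    """Tighten an obviously-too-low floor and an obviously-too-high ceiling
    until hesitation sets in on both ends, then split the difference."""
    while high - low > step:
        moved = False
        if ask(f"Would it cost more than ${low + step:.0f}?"):
            low += step    # still confident it's above: raise the floor
            moved = True
        if ask(f"Would it cost less than ${high - step:.0f}?"):
            high -= step   # still confident it's below: lower the ceiling
            moved = True
        if not moved:
            break          # thoughtful hesitation on both ends: close enough
    return (low + high) / 2
```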

Feb 6, 2023·edited Feb 6, 2023

Post gets only a 7/10 enjoyment factor; I still don't know how far apart Paris and Moscow are in surface kilometres and am now forced to go look it up. Upon reflection that my personal enjoyment might have been wrong, I've revised my estimate to 5/10 and have now averaged this out to 6/10... or was it... the square root of 5*7, or 35^(1/2), for an enjoyment of 5.92/10? I don't even know anymore!


I took the instruction to assume that I was off by a significant amount seriously. I decided I was more likely to be greatly underestimating than overestimating, so I took my first estimate and multiplied it by 10. In other words, I really didn’t re-estimate from scratch at all. If this analysis was your intention all along, perhaps explaining your intentions would have gotten people to rethink it in a more straightforward way.


This is both a great example of and a horrible case for the "wisdom of crowds" in forecasting. The problem isn't when you're guessing at something known approximately to a large part of the population: there, a larger sample more reliably gives you a median close to the ideal median of the entire population, which will be somewhere in the vicinity of the real thing because the real value has some decent penetration into the populace.

In forecasting you're guessing at something that isn't known to a large part of the population. The population, and ergo your sample, will have some basic superstitions on the issue that come mostly from mass media and social media, so even when you get a good measurement of the median, the prediction is still crap: you've polled an accurate representation of the superstition, not the real thing.

Say you want to know when Putin will end the Ukraine war - only Putin and a few select individuals know when that will be, if it's known at all and not made up on the go. But everybody will have some wild guesstimate, since newsperson A or blogger B or socialite Z (pun intended) posted some random ass-pull on Twitter, not necessarily claiming but certainly implying to know when it will happen. That is what you're going to get in your poll.

Wisdom of crowds is useless for forecasting and only works when the superstition has some bearing on the issue at hand, i.e. when public opinion itself influences the policy, or when there is a strong feedback loop ensuring that what happens conforms to the emotional state of "the masses". That, mostly, doesn't appear to be the case.


This is something that I've been thinking about in the context of LLMs. Ask an LLM a question once, and you are sampling its distribution. Ask it the question 10 times, consider the mean and variance, and you have a much better sense of the LLM's actual state of knowledge.

Here is an LMTK script I wrote in Jan which demonstrates this in the context of math problems: https://github.com/veered/lmtk/blob/main/examples/scripts/math_problem.md
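A sketch of the idea (distinct from majority voting over discrete answers); `sample_model` is again a hypothetical stand-in for one sampled completion:

```python
import statistics

def numeric_crowd(sample_model, prompt, k=10):
    """Sample a numeric answer k times; the mean is the estimate and the
    standard deviation is a rough proxy for the model's uncertainty."""
    xs = [float(sample_model(prompt)) for _ in range(k)]
    return statistics.mean(xs), statistics.stdev(xs)
```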


I guess Walt Whitman was on to something when he wrote "I contain multitudes"!


Is the data from the study saying that the average guess was many times larger than the actual answer? It seems that might be part of the reason why you got different error measurements. Guessing geographical distances has a limit on upper bounds in a way that guessing a number of objects doesn't.


Doesn't Caplan's Myth of the Rational Voter deal with how the wisdom of the crowds only works when people aren't systematically biased on the subject in question?


For those who were (like me) confused by what "geometric_mean[absolute_value(geometric_mean<$ANSWERX, $ANSWERY> - 2487)]" is supposed to mean, here's the ChatGPT explanation which makes sense:

This expression calculates the geometric mean of the absolute value of the difference between the geometric mean of two values ($ANSWERX, $ANSWERY) and 2487.

The geometric mean of two values is calculated by multiplying the two values and taking the square root of the result. So the expression "geometric_mean<$ANSWERX, $ANSWERY>" calculates the geometric mean of the two values.

The difference between this geometric mean and 2487 is then taken, and the absolute value of this difference is calculated, ensuring that the result is always positive.

Finally, the geometric mean of these absolute values is taken across all respondents, which gives a single value as the final result.
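In code, a sketch of the same computation (the outer geometric mean runs across respondents; note it degenerates if any respondent's error is exactly zero):

```python
import numpy as np

def crowd_of_two_error(first, second, true_km=2487):
    first, second = np.asarray(first, float), np.asarray(second, float)
    per_person = np.sqrt(first * second)     # inner GM: average your two guesses
    abs_err = np.abs(per_person - true_km)   # each respondent's error in km
    return np.exp(np.log(abs_err).mean())    # outer GM: across all respondents
```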


> So is the percent chance that your country would win. If you could cut your error rate by 2/3 by using wisdom of crowds techniques with a crowd of ten, isn’t that really valuable?

Aren't people doing that all the time? Governmental organizations have committees; large corporations have teams; some of them even hire smaller companies as contractors specifically to answer these types of questions.


re: "What about larger crowds? I found that the crowd of all respondents, ie a 6924 person crowd, got higher error than the 100 person crowd (243 km). This doesn’t seem right to me..."

I'm also suspicious, and would predict this observation will reverse with enough resamples of the 100-person subsets. 6,000 instead of 60 would probably do it? Or maybe some outlier was simply missed.

There is likely a simple proof based on sum-of-squares decompositions showing that the average error over all 100-person subsets is higher than the error of the full crowd's average.


This looks like much ado about nothing.

Part of the crowd guesses high, the other part guesses low.

Nobody is right, but it averages out closer to right.

'Nuff said.


I think the spookiness of "inner crowds" improving your answers mostly comes from an intuition that whatever you were doing originally can be approximated as being an ideal reasoner. An ideal reasoner shouldn't be able to improve their answers by making multiple guesses.

But humans are often pretty far from being ideal reasoners. If this works, I see that more as an indictment of how bad humans are at numerical estimates, rather than a spooky oracle.

(Though this doesn't prevent it from being useful...)


One hypothetical mechanism for why it works is that it forces the forecaster to make an estimate of their uncertainty and take a second draw from the implied distribution. It’s similar to when someone wants to “sleep on it”, even though they aren’t going to get any new information. They are just going to think about the worst case (and maybe best case) and get a second draw after thinking more about the distribution of results.


> I think the answer is something like: you can only use wisdom of crowds on numerical estimates, very few people (currently) make those decisions numerically, and the cost of making those decisions numerically is higher (for most people) than the benefit of using wisdom of crowds on them.

Actually, I think you're wrong on this one: wisdom of the crowds really is OP and we're severely under-using it.

An example that immediately comes to mind is pair programming: by having two people work on the same code simultaneously, you can immensely increase their productivity. Every time I've tried it, I had positive results, and yet most companies are *very* hostile to the idea.

The part about getting diminishing returns as you add more people is interesting too. I wonder if you could drastically reduce design-by-committee problems in an organization by making sure all committees involved have at most three or four people in them.


Maybe a non-spooky explanation is that when we do not know the exact answer to a question, we instead have a distribution of possible answers. When you force someone to collapse that wave function down to a single scalar measurement, they will randomly pick one possible answer as per the probability distribution. But you have lost all the rest of the information contained in the distribution. When you ask again, you make a second sampling from the distribution, which adds precision. Note that if you keep asking you will get more points, but inevitably you still lose information.

Example: I might know that there is either $100 or $200 in my bank account because I don't know if a check cleared yet. If you force me to pick a single value, I'll pick either at random. Ask me twice and there's a 50% chance I'll pick the other. By your way of measuring, it looks like I don't know much, when in fact I have complete information less a single bit.

Similar argument works for crowds as well.


I also made an analysis of the inner crowd on the same survey question, using different statistics.

https://astralcodexten.substack.com/p/acx-survey-results-2022/comment/12089011

tl;dr: I got similar results as Scott: the inner crowd helps a bit, but not too much. Strangely, the second estimate was much worse than the first. Some speculated that this was due to Scott's phrasing "off by a non-trivial amount" in the second question, but the same effect (a worse second estimate) was also found in the literature, where they probably didn't have such a phrasing. (But my source was much less sophisticated than Scott's VD and VDA paper.)

Highlight numbers, GM stands for "geometric mean":

- The first estimate was off by a factor 1.815. (This means that the GM of all those factors was 1.815)

- The second estimate was off by a factor 1.901.

- The GM of the two estimates was off by a factor 1.791.

- How often was the first estimate better than the second: in 53.3% of the cases.

- How often was the GM better than the first estimate: in 52.8% of the cases.

- How often was the GM better than the second estimate: in 60.0% of the cases.
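A sketch of the "off by a factor X" statistic, under my reading of it (the geometric mean of each guess's ratio to the truth, or its reciprocal, whichever exceeds 1):

```python
import numpy as np

def gm_error_factor(guesses, true_km=2486):
    r = np.asarray(guesses, float) / true_km
    factors = np.maximum(r, 1 / r)         # 'off by a factor of', always >= 1
    return np.exp(np.log(factors).mean())  # geometric mean of those factors
```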


When asked to guess a number, my mental process is to first find a range, then pick (somewhat arbitrarily, honestly) within that range. I suspect that repeatedly sampling the same person is just a rough, inefficient way to find their range estimate.

I suggest trying a similar question but asking for the 70th percentile upper and lower bound on the distance (with another question asking if the person knows what that means as a filter).


What if you tried bootstrapping the larger groups of individuals (i.e. sampling with replacement)? I’m on vacation or I’d do it myself, but I’d be curious whether that improves the error.
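A sketch of that suggestion, assuming the pool of guesses is available as an array:

```python
import numpy as np

def bootstrap_crowd_error(guesses, crowd_size=100, n_boot=10_000,
                          true_km=2487, seed=0):
    """Mean absolute error of the crowd average, over n_boot crowds
    resampled *with* replacement from the full pool of guesses."""
    rng = np.random.default_rng(seed)
    guesses = np.asarray(guesses, float)
    idx = rng.integers(0, len(guesses), size=(n_boot, crowd_size))
    crowd_means = guesses[idx].mean(axis=1)
    return np.abs(crowd_means - true_km).mean()
```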


I think you raise a really interesting question.


This suggested to me that the 'internal crowd' was almost entirely worthless. "P < 0.001!" Yes, but magnitude <2% improvement? I have low confidence in a result like this one (even with a great p-value!) that purports to demonstrate a method for 1.5% improvement in guessing accuracy.


IIRC the reasoning for *why* the (outer) wisdom of crowds works, is that the crowd contains a few experts who will be biased in favor of the correct answer... while everyone else errs randomly above or below the correct answer. So there was no inner wisdom of crowds in this version.


“Estimate the number of balls in this jar” and “Estimate the distance between Paris and Moscow” seem like qualitatively very different tasks to me.

Estimating the balls in the jar seems like a visual reasoning task, whereas estimating the distance seems like a preexisting knowledge task.

I didn’t know where Moscow is within Russia. I didn’t know how many countries were between France and Russia. I didn’t remember whether a kilometer was bigger or smaller than a mile. And I didn’t know any reference large distances to use for comparison, except that the radius of the earth is 4000 mi. So there were many inferential steps in my distance guess where additional error could creep in, whereas my guess about balls in a jar seems to test just one skill.


I remember unfortunately ruining my results for this by immediately looking up the answer after putting in my guess for the first question (since I didn't know there was going to be a second).


Hi Scott; it's the inverse-square-root. The standard error of an estimate declines as a function of 1 / sqrt(n) for sample size n (because the variance declines with 1/n).

If the estimates are biased, the root-mean-square error is going to be sqrt(bias^2 + (variance / n)) for sample size n, i.e. the mean squared error will decline hyperbolically. This isn't something the study found; it's a mathematically-derived formula, which they then fit to the data to get estimates for bias^2 and variance. Because estimates taken from 1 person are going to be substantially biased, the error will never reach 0; it asymptotes out very quickly. The average of many people is going to be much less biased, such that the variance probably dominates.
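That formula can be fit directly to error-vs-crowd-size data; a sketch with illustrative numbers (not the survey's):

```python
import numpy as np
from scipy.optimize import curve_fit

def rms_error(n, bias_sq, var):
    # RMS error of an n-person average when individuals share a common bias
    return np.sqrt(bias_sq + var / n)

ns = np.array([1, 2, 5, 10, 20, 50, 100])
errs = np.array([900, 750, 550, 450, 380, 330, 310], dtype=float)  # illustrative
(bias_sq, var), _ = curve_fit(rms_error, ns, errs, p0=[1e4, 1e5])
print(f"error asymptotes at |bias| ≈ {np.sqrt(bias_sq):.0f} km")
```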


I probably produced two of the very far outliers because of being very bad at geography and spatial reasoning generally. I think I put down a guess that was an order of magnitude wrong, and then, being told by the second question to answer as though my first was wrong, changed my answer by an order of magnitude in the wrong direction. I don't know if this information is helpful to anybody; but some of us don't realize we're being lizardmen because we have no idea how to meaningfully connect the ideas "kilometer", "Paris", and "Moscow". 1,000 km seems as reasonable to me as 200,000 km.


I'd be curious to see if those with dissociative identity disorder (or those who self-identify as systems, since that's probably more common than an official diagnosis) are better than the rest of us at this internal wisdom of the crowds.


proposal for improvement:

- right before asking the first time, ask people to provide the last 3 digits of their zip code, or any other essentially random number

- preface the second question with an explanation of anchoring and ask people to provide a new estimate without referring to their previous one.

Benefits:

- providing a plausible reason for people to give a new estimate without inducing too much distortion

- measuring how much anchoring affects ACX readers

- measuring how much "inner crowd" can counteract anchoring.


This hits on why I don’t see fast AI takeoff being a thing. GPT is wisdom of the crowds. A bunch of text is averaged together and gets you an answer that is directionally correct (as far as text completion goes) but is only going to asymptotically approach reality.

To “know” facts you need a different methodology, one that is essentially brute force. How do you know the distance? You looked it up from a reputable source, which is reputable thanks to a reputation that took thousands to millions of person-hours to cultivate, and on top of that someone had to actually physically go and measure (or just wait until we launch satellites that account for general relativity into space and compute it from their data).

Wisdom of the crowd works because it is actually very very hard to obtain real knowledge, but we think it is easy because we have a superficial experience of “knowing” many different things. Averaging a bunch of estimates allows more real knowledge to contribute.

All this gives me a low prior on AI takeoff even being a thing. We will burn out on modelling existing human knowledge and then begin the hard work of developing machines that can do the hard and painstaking work of actually gaining new knowledge. It will not be fast because knowing things is really a lot of work. Those 10^46 simulated humans will probably get bored and want to do something easier.


Only now do I actually look up the distance from Paris to Moscow, and holy cow, I was almost right on the money. My first guess was 2500 km.


My initial impulse is to ask for a control! What happens if you pick a random number in a given range to guess (say, for Paris to Moscow the range would be something like 50 to 50,000 km - and yes, I know that no two points on earth's surface are separated by more than 20,000 km, but some of your readers might not know it), then take a random distribution on the log scale, then pick two random samples? Would the "wisdom of crowds" effect be random chance?
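That control is easy to simulate; a sketch assuming know-nothing guesses drawn log-uniformly from the 50–50,000 km range suggested above:

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE, N = 2487, 100_000

# Two independent know-nothing guesses per 'respondent', log-uniform 50..50,000 km
g1 = np.exp(rng.uniform(np.log(50), np.log(50_000), N))
g2 = np.exp(rng.uniform(np.log(50), np.log(50_000), N))

single = np.abs(g1 - TRUE).mean()
paired = np.abs(np.sqrt(g1 * g2) - TRUE).mean()   # geometric mean of the two
print(f"one random guess: {single:.0f} km off; average of two: {paired:.0f} km off")
# Any improvement here is pure variance reduction, with zero knowledge involved.
```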


I've also found the "wisdom of the random duo" effect in my research (https://braff.co/advice/f/forecasting-masterclass-7-find-the-martha-to-your-snoop). I wonder if you or I could simulate the inner crowd by looking at forecasts on props that are highly correlated within the same contest? You have a bunch of Ukraine props where the average-across-3-props for a given forecaster may be a more accurate read on the whole battle than any one forecast?


I also have poor intuition about this problem. However, when I got to the second question about the distance from Paris to Moscow, essentially asking me if I wanted to change my first guess, Monty Hall came immediately to mind.

Is there a logical comparison between this and the Monty Hall problem? Did anyone else think this? Should I look up some old Marilyn vos Savant posts?


If people's first answer was generally closer than their second answer, that means it'd probably be best to take a weighted average that puts more weight on the first answer than the second.


Some examples of how I use wisdom of crowds:

In games, like Codenames or Wavelength, people on my team independently come up with their guesses before we share and discuss them with each other.

In forecasting, I consider what range of forecasts I might plausibly make and average them. I also make multiple forecasts using different methods (e.g. using two different relevant reference classes) and then average them, to make use of information from independent sources. I also consult others' forecasts on the question when available to aggregate their views.

In general, when a group is collaboratively seeking the truth on a topic or trying to make a decision, I encourage giving everyone time to think of their own independent impression before having individuals share their view.


Not sure what an "inner crowd" is.


Just came here to say that when I answered the distance question in the survey, I was SO off. I had no concept of the size of the earth, so no idea what a reasonable distance would be. I can't remember now in which direction, but I was off by a whole order of magnitude. So yeah, probably one of the outliers. Just to put it out there that we're not all lizardmen; some of us just don't have a good model of these distances.


I find this really bizarre. I thought the basis for the wisdom of crowds was Condorcet's Jury Theorem: assume (Independence) that individual voters have independent probabilities of voting for the correct alternative. Also assume (Competence) that these probabilities exceed ½ for each voter. It follows that as the size of the group of voters increases, the probability of a correct majority increases and tends to one (infallibility) in the limit. Suppose the number of voters = 1. While the single voter could make multiple guesses, how would that not violate the independence condition?
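The theorem's mechanism is easy to see in simulation, assuming the independence the comment worries about:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.6  # each voter independently correct with probability p > 1/2

for n in [1, 11, 101, 1001]:
    votes = rng.random((100_000, n)) < p           # 100,000 simulated juries
    majority = (votes.sum(axis=1) > n / 2).mean()  # fraction with a correct majority
    print(f"jury of {n:4d}: majority correct {majority:.3f} of the time")
# Tends to 1 as n grows -- but only because the voters' errors are independent.
```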


Is it possible that since most of your readers are American, they had some idea in miles, and many just gave that same number in km due to unfamiliarity with the conversion? Taking the mean guess and reading it as miles instead of km would put it a lot closer to the true answer.


Wisdom of the crowds is like ensemble learning for humans. Or maybe ensemble learning is wisdom of the crowds for machine learning models.


Thinking of the mechanism behind the "crowd of one" effect: at first I thought it's a variant of the Monty Hall effect - first guess under complete uncertainty, second guess somewhere else in the spectrum, with some uncertainty removed. But more likely it is about combining sources of incomplete information. People will have different hypotheses or heuristics in mind to make a guess. They will only use one heuristic for the first guess. They will use a different one for the second guess. So now there is more information present than with a single guess. For example, if a person is completely uncertain about Paris-Moscow, they might first use the heuristic "Russia is huge", then the heuristic "but Europe is small". The average of the two biases produces a better result.


I've most recently used this trick to estimate how much wine I need for an event with a few dozen people.

I came up with estimates using different mental models:

- One where everyone is thirsty and drinking wine (upper bound)

- One where people are hardly drinking any wine (lower bound)

- 2-3 more best guesses using different formulas

The numbers came out as: Upper > best guesses > lower.

So I felt pretty confident about how many cases of wine to order by averaging the best guesses and then adding some.


>>> What about in finance, where people often make numerical estimates (eg what a stock will be worth a year from now)? Maybe they have advanced models calculating that, and averaging their advanced models with worse models or people’s vague impressions would be worse than just trusting their most advanced model, in a way that’s not true of an individual trusting their first best guess?

In fact this is standard practice in finance and most other ML applications - see https://en.wikipedia.org/wiki/Ensemble_learning - and is known to be one of the few methods that systematically yield better predictions (another is increasing the dataset size). Multiple different models are typically created using different sources of information, underlying architectures, training techniques, etc., which are then "averaged" to make the final predictions. The models are usually as advanced as possible (i.e. they are a crowd of experts), and the averaging is typically also learned (i.e. instead of choosing between arithmetic and geometric means, you would learn the actual ensembling function to better account for each model's biases, ideally making use of their self-reported uncertainty). I doubt there's any big financial trading firm that does not have this in place, including the presence of multiple non-communicating teams working on various models for the same purpose, each of them without access to the other models or the final ensemble.
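A toy sketch of "learning the ensembling function": a stacked linear combiner over three noisy base models, with everything synthetic for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
y = rng.normal(size=500)  # the target
# Stand-ins for out-of-sample predictions from three models of varying quality
preds = np.column_stack([y + rng.normal(0, s, 500) for s in (0.5, 0.8, 1.2)])

stacker = LinearRegression().fit(preds, y)         # learned weighted average
print("learned weights:", stacker.coef_.round(2))  # better models get more weight
```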


I have heard 'wisdom of the crowds' described very differently: when you get large groups of people, a small number will have specialized knowledge of the question, and a larger number will have general knowledge. If the wildly ignorant are simply guessing, then their errors will frequently (but not always) cancel each other out, and what you are left with pushing the data are the experts. You aren't averaging a bunch of guesses; you are asking enough people to find someone who knows the answer and then averaging out all the bad guesses.


Perhaps quite tangential, but this has me thinking about how criticism and ratings can help us use “the wisdom of crowds” to predict things that aren’t objective in any real sense.

For example, a movie’s quality is arguably entirely subjective, and whether any one person will like a given film is hard to predict, but we all commonly use the wisdom of crowds to estimate a film’s quality and help us predict whether it will be worth our time or not.

Each person who rates a film is “guessing” the film’s objective quality, since no one person actually gets to claim that objective perspective. But if we add up enough subjective guesses, we can kind of approximate some kind of “objective” value.

I think there are probably a lot of ways we use a kind of vague sense of what “the wisdom of the crowd” Is about certain issues to help us make judgment calls.


>What about in finance, where people often make numerical estimates (eg what a stock will be worth a year from now)?

Isn't the price of the stock, in a meaningful way, already the wisdom-of-crowds answer to something like that question?


> This looks like some specific elegant curve, but which one? A real statistician would be able to give a good answer to this question.

Under the simplest hypotheses, it should be the sum in quadrature (i.e., a ⊕ b = √(a² + b²)) of a so-called "statistical uncertainty" proportional to 1/√n and a so-called "systematic uncertainty" which stays constant.


Regarding the answers for the Paris - Moscow distance. I think it's hilarious how you were surprised at some very wrong answers and assumed the reason is lizardmen/trolls. You're really just underestimating how bad some people are regarding distances and geography. I tried hard to give a good estimate but ended up with what is essentially a random number that could have been the distance to the moon for all I know.


Note that the error can never go completely to zero for the infinite crowd. There should be a lower bound on persistent error set merely by the resolution of typical maps - plus an additional contribution from people's natural tendency to round large numbers. Sorry if this comes across as too pedantic, but I think generally these limits set by resolution are interesting and often neglected!

Feb 7, 2023·edited Feb 7, 2023

I think on non-numerical things, we already instinctively use wisdom of the crowds. You feel vaguely positive about academia *because* you’ve heard people say more good stuff than bad stuff about academia. Our brains are very good at subconsciously "averaging" status signals, perceived utils, etc, but not so good at averaging actual numbers, so it’s only once we start putting numbers on things that we have to remember to do the averaging step explicitly.

Feb 7, 2023·edited Feb 7, 2023

(This is Eric; I helped run the 2022 forecasting contest.)

I've thought a lot about this -- indeed, the first paper I wrote in grad school can be summarized as "the wisdom of crowds is a *mathematical* fact" (if you aggregate forecasters in a way that accords with how you score them). I'm planning to write a blog post about this, but let me briefly illustrate what's going on in this comment.

Suppose you put 100 candies in a jar and ask people to estimate how many candies there are. You're then going to score each person based on how far off they were, and compare two quantities: the average of everyone's scores, versus the score of the average of all the estimates (the latter is the wisdom of the crowd).

We're gonna score each participant based on the *square* of the distance to the right answer. (Why the square? Briefly, this choice incentivizes each participant to truthfully report how many candies they expect are in the jar.)

Let's say that the estimates are 90, 100, 110, 120, and 130: so, the participants disagree with each other but are also somewhat biased upward.

From first to last, the (squared) errors of the five participants are 100, 0, 100, 400, and 900, for an average of 300. By contrast, the average of all five estimates is 110, which is only off by 100.

In fact, it is *always* the case that the second number (error of the average) will be smaller than the first (average error), no matter which numbers I chose for my example. An intuition you could have is that the first number is equal to the second number, *plus noise*, where the "noise" is the variance in the participants' estimates. (Check it out: the (population) variance of {90, 100, 110, 120, 130} is 200, which is equal to 300 - 100!)

(Feel free to skip this aside, but: what's the math behind this? Briefly, let X be the random variable equal to the signed error of a randomly chosen expert -- so in our example, X would take on the values -10, 0, 10, 20, and 30 with equal probability. Then the average error is E[X^2], whereas the error of the average estimate is E[X]^2. The former quantity is larger, and the difference is E[X^2] - E[X]^2, which is the variance of X.)
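A quick numeric check of that decomposition (average error = error of the average + variance) with the example numbers:

```python
import numpy as np

estimates = np.array([90, 100, 110, 120, 130])
truth = 100

average_error = ((estimates - truth) ** 2).mean()   # 300: average of the scores
error_of_average = (estimates.mean() - truth) ** 2  # 100: score of the average
variance = estimates.var()                          # 200: population variance
assert average_error == error_of_average + variance
```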

The math here is sensitive to the fact that I chose squared error (and to the fact that I chose to aggregate estimates by averaging them). If - as Scott did - you take the *absolute value* of the error instead of the squared error, it's no longer *mathematically* true. However, I would bet that it's empirically true a large fraction of the time. That's because if some participants underestimate the quantity and others overestimate it, both errors count positively toward the average error, but they *cancel each other out* when you look at the average estimate.

As for whether your error will go to zero as the crowd size goes to infinity: no. This is only true under a really strong assumption, which is that the crowd is *unbiased*. So for example, if in my example you have a huge crowd but they're systematically biased so their estimates are centered at 110 instead of 100, then in the limit of an infinite crowd you're still going to be off (your average will be 110).

And -- last point -- regarding making multiple estimates on your own and averaging them: it's definitely an interesting frame, but I'd say that you've reinvented the art of *thinking longer about the problem* :)

Here's what I mean: suppose you're weighing going into grad school versus getting a tech job. You think for a while, and you realize: "I'll be 9/10 happy with my pay at the tech job, but only 5/10 happy with my grad school pay." Then you think longer and realize: "I'll be 8/10 happy with the sorts of problems I'll be thinking about in grad school, but only 6/10 happy with the sorts of problems I'll be thinking about in tech." Then you think longer and realize: "the weather at the grad school I'm considering is 3/10, while the weather in the Bay Area tech job is 9/10". And so on. If you wanted to, you could think of each of these things (pay; intellectual interestingness; weather) as separate estimates. And then you can be like "wow, my decision will be more accurate if I average all my estimates together than if I make my decision based on a single factor!" -- I think that's basically all that's going on with the "wisdom of the crowds" here.

Feb 7, 2023·edited Feb 7, 2023

Stupid question: hasn't this topic been done to death in statistics? I'm not an expert, but from what I remember, yes, you can combine lots of inaccurate predictors into a more accurate predictor - provided the individual predictors are unbiased, i.e., they don't systematically over- or underestimate.

My gut feeling is that this is the hard part - finding a dozen people knowledgeable enough to give a meaningful estimate is doable. Finding a dozen people who are not all influenced by the same sources of information to be overly optimistic or pessimistic is the hard part, and if you don't, you converge with great confidence on an inaccurate answer.

Edit: should have read Unexpected Values' answer above before I wrote this...

Feb 7, 2023·edited Feb 7, 2023

There is a UK quiz show called "Who Wants to Be a Millionaire" in which an individual is selected from a dozen or so competitors by their correctness and speed in answering a preliminary question, such as "Put such and such into alphabetical order", and is then asked a series of questions by the host. Each question has four possible answers, shown to the contestant, and one of these is the correct answer.

The contestant starts with three so-called "lifelines", which they can use once each for any question whose answer they are unsure of or don't know: "50 50" (which halves the number of alternative answers), "Phone a friend", and "Ask the Audience".

The "Ask the Audience" lifeline is the most relevant to this discussion. When it is invoked, each audience member selects on a key pad the answer they know or guess is correct, and the contestant is then shown a bar chart of the percentage of selections of each remaining possible answer.

For a commonly known answer, to a question relating to sport or soap operas for example, the "Ask the audience" lifeline is usually fairly conclusive, and one of their choices obviously predominates and turns out to be the correct answer. But sometimes a majority, occasionally spectacularly so, chooses the wrong answer!

It is interesting to speculate why so many people would choose the same wrong answer, presumably guessed. From my observation of several examples, the main reason is that they are biased toward a name they have heard of, or an association familiar to them, among others they have not.

I have also observed that another source of audience bias obviously occurs when a contestant is unsure of the answer and the host asks them, before they commit to a choice, which answer they think is correct. It seems very foolish for a contestant to divulge a guess in that situation and then go on to use the "Ask the Audience" lifeline, as they will have influenced equally unsure audience members in advance, but many do!


I just asked my partner: First answer was 2000 km. Second answer was 1500 km.

In that case the error got bigger. Could there be a failure mode for this technique, that while on average it may make you more correct, there are doom cases where it makes you catastrophically more wrong?


>Since we only have one datapoint for the n = 6924 crowd size, it’s not significant and we should throw it out.

I have no background in statistics, but that seems wrong to me. Is that the only rationale for throwing it out? On average, it should be at least as good a crowd as any other, and according to the theory (larger crowd = better), it should be the most representative of the average participant's wisdom.

Also, how did you come up with the crowd size of 100 for doing the analysis? If you tried different crowd sizes, were the results different on average? Did you try a sample of random crowd sizes?


This is similar to something I sometimes have to do for my job. We need to get estimates from experts on quantities of interest. There are various techniques you can employ to get them to give unbiased answers. The simplest and most useful one is after you ask for their best guess you ask "is it more likely that the true answer is above or below your guess?"

It's a technique I employ in my own decision making too, and from the comments it seems that lots of other people do.


> As mentioned above, the average respondent was off by 918 km on their first guess. They were off by 967 km on their second guess.

Was the second guess (on average) higher than the first? Estimating a distance has this asymmetry where there are a finite number of ways to undershoot, but no limit to how far you can overshoot.

If so, maybe the you-are-the-crowd hypothesis has a better shot at holding true in something like betting the point differential in a game?


You should check out 'Noise' by Kahneman, Sibony and Sunstein. It's a whole book about this stuff. They discuss lots of experiments on the wisdom crowds including a crowd of 1. Especially interesting is when they address real world applications - in sentencing, insurance, executive search and more.


Isn't the wisdom of the crowd sort of the whole idea of democracy?

Assuming everyone makes some kind of internal estimate of how good/bad each candidate's policies are, a fair election should spit out the best option according to the median estimate. It's a lossy compression - we lose the numbers themselves and skip straight to the decision, and I'm not sure how well crowd wisdom works with the median, but I think our systems do *try* to apply this principle more than we give them credit for.


Love the math! I was a stat addict in college so this post struck the right chord.


Writing a diary is the traditional way to access the wisdom of the crowd of your past selves.

(Though I admit I have never seen a diary that would end every entry with: "My today's estimate of the distance from Paris to Moscow is 1234 km.")


I'm a bit late to this, and I haven't read all the comments, so it's possible someone else mentioned this, but it seems like "wisdom of crowds" becomes less useful for predictions about outcomes that our own decisions can influence, as opposed to estimates of objective, concrete facts. If wisdom-of-crowds predictions are made about things over which our decisions have influence, the mere knowledge of the crowd's "answer" to a question influences our future decisions, rendering the prediction unreliable, because learning the prediction changes the likelihood of the outcome (making it either more or less likely).


I don't know if anybody else did this, but when I guessed the second time, I imagined how I would guess if I knew I was off significantly with my first guess. So it wasn't a clean guess. I guessed 5 or 10 times my first guess because I was imagining how I'd react if someone told me my first guess was way off... If that makes sense.

Feb 7, 2023·edited Feb 7, 2023

1. Use and share a histogram.

2. Mean and standard deviation

3. Geometric mean vs. arithmetic mean - why are you fooling around with this? Is this purposeful obscurantism?

4. parrhesia - start over and rewrite this essay.


My first guess, unburdened by the thought process, was 1500 miles, which turns out to be within 72 km of the right answer. I surprised myself. For my second guess, my reasoning was: well, if I'm off by a non-trivial amount....

So maybe the second question should be just "guess again", which is closer to how a crowd works.


You say the right answer is 2486, but then use 2487 in all the calculations.


I've taken a quick gander at the survey results, and I think you might have ballsed this one up.

There's a problem with the first question, in that there are two possible answers: by road (2834 km) or by air (2486 km). That's a difference of ~350 km, or about 12%, before you start.

As you used "non-trivial amount" in the second question, there's a spot of priming/framing going on, such that the second answer can be reasonably expected to be further away from the first than would otherwise be the case.

Anyway, off for a spot of fun slicing and dicing.


Isn’t wisdom of the crowds in everyday issues a big part of asking a friend for advice? This seems pretty self-evident to me.


I'm a little mystified that you assert nobody thinks about the wisdom of the crowds, either inner or outer, in their ordinary lives.

In my world, asking people who you know (and sometimes even those you don't know, like in a blog comment section) for their thoughts before you make an important decision - consulting the "outer" crowd - is ubiquitous. I can't think of anyone who *doesn't* do this. Similarly, "sleeping on it" or "not making decisions hastily", which amounts to asking the inner crowd (i.e. "re-evaluate this estimate again after some time has passed"), is also ubiquitous.


We use wisdom of crowds all the time, and have for thousands of years, without a numerical component. A king's advisors are literally a core example, but extend to something like the Cabinet in the US, or a board of directors at a large company. If I'm thinking of making an important life decision, I may check in with my spouse, my sister, my best friend, my pastor, my financial advisor, and whoever else. All are using a type of wisdom-of-crowds.

Do you not see those as the same for some reason?


I already do this when I make estimates, and I think many other people (less than 1 in ten, but at least 1% or so) do too!

Specifically, when I am making a you-only-get-one-guess kind of guess, and it's important that I'm maximally precise (such as when the best guess, of hundreds, wins a prize, but there's no prize for being almost as close), I start by asking what number I'd throw out. I just cough something out via whatever estimation tool pops to mind. Then I try to identify, assuming that's *wrong*, which *way* it's wrong--meaning take a second guess, with a different estimation tool, or a more careful use of the first tool. Then a third. And etc. I'll also put error bars on my guesses' estimation tools (e.g. an estimate arrived at via multiplying four numbers with plus-or-minus 50% has a much bigger error range than an estimate arrived at via adding four numbers with plus-or-minus 50% error bars.)

I think when the stakes are high, people already do this. When the stakes are low, mostly they don't--so the "this is OP" will show up in survey questions, but not in real life (or less so) for stuff that matters.

Feb 7, 2023·edited Feb 7, 2023

There are 6,379 questionnaires where the question was answered both times. 6,076 had both answers between 500 and 20,000 km; I am using those 6,076 as my population for the analysis. I binned the answers into increments of 250 (e.g. bin 2,500 would contain the count of all guesses from 2,251 to 2,500) in order to remove odd cutoffs: since most answers were round numbers, I didn't want a bin ending in a 0 or a 9 to change my results much.

For the 2 closest bins (2,500 and 2,750, which together account for guesses between 2,251 and 2,750 km), there were the following numbers of guesses in that range:

Guess 1s: 579

Guess 2s: 626

Averages: 914

Of those that originally guessed in that range, 227 (39%) had Averages in those 2 bins.

When you include the next 2 bins (2,001-3,000 range) the effect gets smaller

Guess 1s: 1,485

Guess 2s: 1,583

Averages: 1,627

When you include the next 2 bins (1,751-3,250 range) the average gets worse

Guess 1s: 2,424

Guess 2s: 2,375

Averages: 2,185

870 had their Guess 1 in the 2,000 bin, of those, 444 had a better average (Between 2,250 and 2,750). 102 had higher averages that were just as bad or worse (>=3,000) and 320 had lower averages that were just as bad or worse. (<=2000)

817 had their Guess 1 in the 3,000 bin, of those, 311 had a better average (Between 2,500 and 2,750).

456 had higher averages that were just as bad or worse (>=3,000) and 50 had lower averages that were just as bad or worse (<=2,250)

The below contains most of the data 5,357 of the observations. The first column is the bin their Guess 1 was in, the second column is the count of guesses, the third column is the amount of averages that were better than the Guess 1, and the fourth column is just column 3 as a percentage.

Bin    Count  Improvements  % Improvements
 500      66            62             94%
 750      80            48             60%
1000     533           314             59%
1250     182           118             65%
1500     412           277             67%
1750     146            88             60%
2000     870           444             51%
2250      89            39             44%
2500     526             0              0%
2750      53             0              0%
3000     817           311             38%
3250      69            28             41%
3500     276           135             49%
3750      30            13             43%
4000     521           261             50%
4250      25            14             56%
4500     139            59             42%
4750      10             5             50%
5000     513           264             51%

Overall, it looks like those who initially guessed low had the average improve their score, while those who initially guessed high did not. I am not sure what to take away from all of this, besides that it's not obvious that repeated guessing is a good way for an individual to increase accuracy. There may be an effect where the best original guessers could do even better with multiple attempts.


Maybe the real world doesn't use proper scoring rules, so we don't instinctively make decisions that are good at maximising Brier scores or whatever metric you want to use.

Like if you come to a fork in the road, with one path taking you north and the other east, and you think that the eastern path is likely to be 3 times worse than the northern one, you don't average them and set off NNE.


Random fun and reasonably relevant math fact: arithmetic mean and geometric mean are both special cases of an elegant generalization of pretty much all possible kinds of means: the power mean. https://en.wikipedia.org/wiki/Generalized_mean
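For reference, the power mean in question:

```latex
M_p(x_1, \dots, x_n) = \left( \frac{1}{n} \sum_{i=1}^{n} x_i^{\,p} \right)^{1/p}
```

p = 1 gives the arithmetic mean, p = -1 the harmonic mean, the limit p → 0 the geometric mean, and p → ±∞ the maximum and minimum.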


Not at all what I thought you were testing. I thought you had found some different magic hack, because when you asked the first time I said 3000km and the second time I said 2500km, almost bang on when I checked.


About your last few paragraphs: I agree with most of your remarks about the general inability of people to consistently and adequately quantify "vague feelings", and that, knowing that, finance people will prefer models they trust over relying on people.

However, something in your paragraph about forecasting and the way it's not used in the real world brought me back to the fact that most decision-making is not an applied probability but a black-or-white decision. "What is the probability that I should work in academia" makes little actionable sense unless it's >80% or <20% (adapt the numbers to your risk aversion). And actually, going back to forecasting, it is very counter-intuitive that black-or-white events should be assigned probabilities: either you believe they will happen, or you believe they will not - it's not like there can be anything in between. What people making decisions want is not a 66% chance: they want a 0-or-1 belief that can be explained.

Feb 7, 2023·edited Feb 12, 2023

Your method of using the geometric mean of the absolute error doesn't work well as a summary of how far off the typical answer was. Suppose for example the true answer to some question is 20, and the guesses are distributed uniformly randomly within the interval [19,21] (the exact form of the distribution doesn't matter, so long as it's continuous with a non-zero density near the true answer). After taking the absolute error, it's uniformly random in [0,1]. If this average is well behaved, in the limit we can replace the product with an integral exp(integral from 0 to 1 of ln(x) dx). The integral is negative infinity [edit: As Matthieu pointed out this is not correct, and therefore neither are the things that follow.] so the answer (exp of that) is 0. This means the average is not well-behaved, but intuitively it implies that the geometric mean of the absolute error will tend to zero as the number of samples tends to infinity, even if the actual average error remains constant. Note that this is not the error of the average, but the average of the error, which should remain non-zero. This is probably why the size-ten crowds appeared better than the full-size crowd, because this method of averaging over-emphasises values near zero.

A more reasonable way to combine the geometric mean with estimating errors would be to take the logarithms of all the estimates and the true value, calculate the mean-squared error or mean absolute error or something of the logarithms, then either use this result as-is ("estimates were off by X orders of magnitude on average") or take the exponential of it ("estimates were off by a factor of [e^X] on average"). In either case the result is dimensionless rather than being a kilometer value.

I may at some point re-do your analysis with this method and see how much it changes the results.
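A sketch of the proposed metric:

```python
import numpy as np

def log_error_factor(guesses, true_km=2487):
    """Mean absolute error of the logs, exponentiated: a dimensionless
    'off by a factor of X on average', immune to the near-zero blow-up."""
    log_err = np.abs(np.log(np.asarray(guesses, float) / true_km))
    return np.exp(log_err.mean())
```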


Could you get valuable results for a single person by asking about the distances between two different pairs of cities which are (roughly) the same distance away from each other? You would get some confounding results based on personal knowledge of geography, but it might be useful way of averaging multiple rough guesses from a single person.

If you're asking about a single data point, I wouldn't expect multiple guesses from a single person to be particularly useful. Asking for an error margin (or a min/max value) is the only way I can think of to actually get additional useful information from a single person.


Opinions are not formed independently; there are latent (hidden) relationships that imply correlation structure among the responses (gender, politics, shared exposure to blogs, location, SES, ...), effectively reducing the size of the crowd.


Scott, surprised you didn't mention the TLP article on the same topic.

https://thelastpsychiatrist.com/2009/01/gods_cheat_code_for_accuracy.html

Feb 8, 2023·edited Feb 8, 2023

I don't think your second question's formulation was useful for testing your hypothesis, but it is an interesting illustration of the wisdom of the crowd effect. If everybody gives their best guess first, is told to assume it is very wrong, and is made to give a second, "guessier" guess, you don't have the wisdom of a crowd of size 2x; you have the wisdom of a crowd of size x, plus the intentionally worse effort of another crowd of size x, and you're taking the average.

Here, everybody gave their best guess and were 918 km off. They were told to assume their first guess was non-trivially wrong, and take a second guess. Collectively they were able to reduce their individual "wisdom" by making a guess that was not their actual first choice, and as a crowd were successfully not as correct as their actual collective first choice.

I'm curious if this would be replicable. "If you ask a crowd a knowledge question and ask them to guess, then tell them to assume their first guess was nontrivially incorrect and guess again, the average of the first guesses will be closer than the average of the second guesses to the correct answer." In other words, respondents are successfully and correctly assigning a relative probability ranking to their top two guesses.

Expand full comment

https://www.youtube.com/watch?v=CNvz91Jyzbg&t=3043s

Generally interesting interview with Edward Thorp-- this is the section where he explains how he found out that Madoff was a fraud, and the importance (quite specifically) of actually understanding what's going on rather than trusting the wisdom of crowds. The crowd trusted Madoff.

I *think* the wisdom of crowds is about estimates, while it's possible that there are things which can actually be understood. How can you tell if you're dealing with something which can be understood?

Expand full comment

https://www.youtube.com/watch?v=CNvz91Jyzbg&ab_channel=TimFerriss

Not the first place I've heard the idea, but Edward Thorp argues that holding index funds is the best investment, partly because getting in and out of individual stocks (and other investments?) is heavily taxed.

Is substantially taxing getting in and out of particular stocks a good idea? Are there costs to pushing people into index funds?

Generally interesting interview.

Expand full comment

Isn't democracy wisdom of the crowd at scale? And explicitly so if we look at the jury theorem, which was an argument for democracy from Mr. Condorcet.
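For what it's worth, the theorem is easy to check numerically (a sketch, with an assumed per-voter accuracy of 0.55):

```python
from math import comb

def majority_correct(n, p):
    """Probability that a majority of n independent voters (n odd)
    picks the right answer when each is right with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Barely-better-than-chance voters become a near-certain majority:
for n in (1, 101, 1001):
    print(n, round(majority_correct(n, 0.55), 3))  # 0.55, ~0.844, ~0.999
```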

Expand full comment

"If you could cut your error rate by 2/3 by using wisdom of crowds techniques with a crowd of ten, isn’t that really valuable?"

Yes! We use wisdom of crowds all the time! When you ask your friends for advice, that's wisdom of crowds. You don't ask them for numerical estimates of how happy you'd be in academia vs. industry, because our brains didn't evolve to deal with numbers. Instead, you ask them whether they think you should go into academia or industry. They subconsciously weigh all the information they have and give you a subjective sentiment. If Alice, Bob, and Charlie all say you'll be miserable in academia because of X, Y, and Z, except Alice and Bob say you'll be slightly miserable and Charlie says you'll be very miserable, that's pretty good evidence against going into academia.

Democracy is also wisdom of crowds. You don't ask people what the chances of Russia invading Ukraine are. Instead, you do opinion polls to ask them what the proper foreign policy toward Russia should be, which is what you want to know anyways. The war hawks that would get us into a nuclear war cancel out the hippies who would have Russia roll all over us, and in the median you get a more or less reasonable foreign policy.

Expand full comment

I was one of the people who gave a ridiculously far-off estimate. Partly this was due to me not remembering how far a kilometer is and going "well, it's either roughly half a mile or roughly two miles ... roughly two miles it is."

Expand full comment

There's an easier analysis of the guessing twice strategy:

For each individual, which guess or average is closest to the correct answer?

Using the public data of everyone who answered both,

First guess is best 3036 times.

Second guess is best 2569 times.

The average of the two guesses is best 643 times.

The geometric mean of the two is best 358 times.

So in this data the first guess is most often the closest one. However, if you combine the second guess and the two averages, one of them (but you don't know which) is best more often than the first guess.

That doesn't feel surprising. It's another way of saying "consider other values, and move your guess closer to the values you consider probable," which is something you're probably doing while guessing anyway. There might be some value in writing down a number of guesses, rating them, and then iterating with more guesses and ratings until you feel you can't improve any more, simply as a way to force yourself to spend more thinking time on the problem considering different possibilities.
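A sketch of how such a tally can be computed (the true-distance constant and the array handling here are assumptions for illustration, not the survey's actual schema):

```python
import numpy as np

TRUE_KM = 2487  # approximate great-circle distance Paris-Moscow

def best_strategy_counts(first, second):
    """Count, per respondent, which lands closest to the truth:
    first guess, second guess, their mean, or their geometric mean."""
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    strategies = {
        "first": first,
        "second": second,
        "mean": (first + second) / 2,
        "geo_mean": np.sqrt(first * second),
    }
    errors = np.vstack([np.abs(v - TRUE_KM) for v in strategies.values()])
    winners = np.argmin(errors, axis=0)  # index of smallest error per person
    return {name: int((winners == i).sum())
            for i, name in enumerate(strategies)}
```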

Expand full comment

You have used the abbreviation “OP” twice in this post without mentioning what it means.

I have noticed that ACX and LW posters also tend to use abbreviations a lot and just assume that their audience is in the know. This is very frustrating and does a disservice to those who are new to your community and are trying to learn and enjoy being a part of these discussions.

Does anyone know what “OP” refers to?

Thanks

Expand full comment

In the case of prediction market contests, there's an additional factor that would cause averages to be more accurate, even beyond the usual wisdom of crowds. The goal of a participant isn't to minimize their *expected* error, the goal is to *maximize the chances of rising to the top*. Realistically, the only way to win a contest like this is to gamble on some unlikely outcomes and hope for the best. But if you average a crowd together, they'll probably gamble on different things and the outliers disappear.
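A toy simulation of the effect (all numbers invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
truth, n = 0.0, 100

# Contestants privately estimate the truth fairly well, then shift
# their public answer by a big random "gamble" to try to stand out.
private = truth + rng.normal(0, 1, n)
public = private + rng.normal(0, 5, n)

print(np.abs(public - truth).mean())  # typical individual error: ~4
print(abs(public.mean() - truth))     # error of the average: ~0.5
```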

Expand full comment

I'm pretty sure at least one of the very low outliers was my honest answer. I guessed that the round-number km distance from equator to pole was probably 1000 rather than 1,000,000, and Paris-Moscow was probably about a fifth of that.

I think it was wrong to remove outliers.

Expand full comment

This phenomenon is similar to test-time augmentation in ML: there you make multiple predictions with a bit of randomness injected into the same input, then average those predictions, all sourced from the same model.

If you instead enable dropout at prediction time (which is kind of like a human being on LSD), it's called Monte Carlo dropout.
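A minimal PyTorch sketch of the idea (assuming a model that already contains Dropout layers):

```python
import torch

def mc_dropout_predict(model, x, n_samples=30):
    """Average predictions over several stochastic forward passes.
    train() keeps dropout active, so each pass samples a different
    sub-network: an 'inner crowd' drawn from a single model."""
    model.train()  # note: this also affects batchnorm layers, use with care
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)  # estimate + uncertainty
```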

Expand full comment

Spooky. The Van Dolder paper is on my very short list of studies to read soon.

The Good Judgement Project had an experimental condition where some participants made predictions by simultaneously betting in a prediction market and sharing the expected likelihood of an event directly. Their aggregation algorithm achieved highest accuracy when both the bet and the direct prediction were incorporated, implying that they each held some different information. They speculated this was due to the "inner wisdom of crowds" thing, and I think cited that same study.

Expand full comment

Ask the entire population of the world whether God exists. The wisdom of the crowds would point to yes.

Expand full comment

Why would you have to convert life decisions into numerical values? Consulting family, friends, etc. is pretty basic human behaviour, and they may well have a clearer view of your preferences, character, capacities, etc. than you do, at least in certain respects and in many instances.

Expand full comment

I can offer an anecdote about that.

Just yesterday, at a mall, I spotted one of these "guess how many marbles are inside and post the answer online" contests. As it happens, I read this post just recently, so I figured I might stand a good chance if I tried this technique: make a few logical guesses, each time assuming my previous guess was wrong in some way, then average them all.

My guesses? 6000, 9000, 4851, 5198. So my answer was their root mean square, 6472. (For some reason I felt the r.m.s. was better than a mere average.)

The actual answer? 6498.

99.6% accuracy. My mind was blown.

(In case you're wondering what I won, the answer is: nothing. Turns out the contest ended 18 months ago, and then they just left the big glass case with marbles there, complete with out-of-date instructions.)
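(For anyone checking the arithmetic, it does work out:)

```python
import math

guesses = [6000, 9000, 4851, 5198]
rms = math.sqrt(sum(g * g for g in guesses) / len(guesses))  # ~6472
mean = sum(guesses) / len(guesses)                           # ~6262
print(round(rms), round(mean))  # the r.m.s. happened to land closer to 6498
```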

Expand full comment

Since multiple estimates from one person seem to be more powerful the less correlated they are, I wonder if there are any strategies-of-thumb which might reduce the correlation among an individual's estimates (other than the obvious wait-until-they-forget-their-other-answers).

e.g.

Estimate 1: "gut feeling",

Estimate 2: "Fermi estimate",

Estimate 3: "some other semi-structured way of extrapolating unknown quantities from known quantities",

etc

Expand full comment

This is a well-studied property of statistics. If you are trying to estimate the mean, odds are that just taking the mean of your sample will be off from the true value. Counterintuitively, resampling your own data can help you characterize and reduce that error. It's sort of like the Monty Hall three-doors paradox.

This may seem totally strange, but statisticians use this kind of resampling in a process called bootstrapping (there are variants, e.g. jackknifing). The idea is to take the means of many resamples of the original sample and use their distribution to quantify, and correct for, the error in the estimate of the mean.

It's not so much a property of crowds as a property of numbers.
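A minimal bootstrap sketch (illustrative; the usual payoff is an estimate of the uncertainty around the sample mean):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_mean(sample, n_resamples=10_000):
    """Resample with replacement and collect each resample's mean;
    the spread of those means estimates the standard error of the
    sample mean without distributional assumptions."""
    sample = np.asarray(sample, dtype=float)
    means = np.array([rng.choice(sample, sample.size, replace=True).mean()
                      for _ in range(n_resamples)])
    return means.mean(), means.std()
```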

Expand full comment

> 1/ERROR = 2.34 + [1.8 * ln(CROWD_SIZE)]

I think something is wrong with this equation. 1/ERROR should always be < 0.005, because the error is always ~200 km or more, while ln(CROWD_SIZE) is ≥ 1 for any crowd size of at least e (2.718). So, e.g., a crowd size of e gives a predicted inverse error of 2.34 + (1.8 * 1) = 4.14, i.e. an error of 1/4.14 = 0.24 km, which is way off. Unless I'm missing something, which is entirely possible.

I tried replicating this and the best-fit curve I found was 1/ERROR = 0.00093 + [0.00073 * ln(CROWD_SIZE)]. (Side note: I only had n=6537, not 6924 as you said, after eliminating all blank answers.)
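For anyone else replicating, a sketch of how such a fit can be done with scipy (the data arrays are left to the reader; the functional form is the post's):

```python
import numpy as np
from scipy.optimize import curve_fit

def inv_error_model(crowd_size, a, b):
    """The post's functional form: 1/ERROR = a + b * ln(CROWD_SIZE)."""
    return a + b * np.log(crowd_size)

def fit_inverse_error(crowd_sizes, mean_errors_km):
    """Fit (a, b) given the mean absolute error (km) per crowd size."""
    (a, b), _ = curve_fit(inv_error_model,
                          np.asarray(crowd_sizes, dtype=float),
                          1.0 / np.asarray(mean_errors_km, dtype=float))
    return a, b  # my replication gave roughly (0.00093, 0.00073)
```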

Expand full comment