I think your second poll question's caveat that you were off "by a non-trivial amount" may play in here. If I was really confident in the distance from Paris to Moscow, or the weight of a cow, my second guess would be pretty close to the first. But the way the question was phrased, most people would feel compelled to change it up for their second one, even if they were very confident the first time.

This feels a bit like a human "let's think step by step" hack. Also, it seems like some part of this benefit is obtained from the common advice to "sleep on an important decision" and not make super important decisions impulsively.

I’m mad because I was actually super happy with how close my first guess was - but I didn’t read the question right and guessed in miles, not km. My second guess was in the wrong direction anyway, so I mostly just got lucky.

In theory if there is no systematic bias the error vs crowd size graph should be an inverse square root, not the inverse logarithm you fit to the curve. This follows from the central limit theorem if we have a couple assumptions about individual errors (ie finite moments).

This actually makes the wisdom of crowds much more impressive as the inverse square root tends to zero much more quickly.
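The inverse-square-root claim is easy to check numerically. The sketch below is only an illustration under assumed conditions: every guess is the truth plus independent, unbiased Gaussian noise, and the function name and parameters are made up for this example rather than taken from the post's analysis.

```python
import math
import random

def rms_crowd_error(crowd_size, trials=2000, sigma=1.0, seed=0):
    """RMS error of the crowd average when each guess is truth + unbiased noise."""
    rng = random.Random(seed)
    total_squared_error = 0.0
    for _ in range(trials):
        # Truth is taken to be 0, so the crowd average is itself the error.
        crowd_mean = sum(rng.gauss(0.0, sigma) for _ in range(crowd_size)) / crowd_size
        total_squared_error += crowd_mean ** 2
    return math.sqrt(total_squared_error / trials)

for n in (1, 4, 16, 64):
    print(n, round(rms_crowd_error(n), 3), "vs 1/sqrt(n) =", round(1 / math.sqrt(n), 3))
```

Under these assumptions the simulated error should track 1/sqrt(n) closely; any systematic bias in the guesses would add a floor that this sketch leaves out.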

I think the poll's instruction to assume that your first answer was wrong by some 'non trivial amount' is important. It's effectively simulating the addition of new data and telling you to update accordingly. Whether the update is positive will depend on the quality of the new data, which in turn depends on the quality of the first answer!

I.e., if my first answer was actually pretty close to reality (mine was; I forget the numbers and the question now but I remember checking after I finished the survey and seeing that I was within 100km of reality), a 'non trivial' update is pretty likely to make my second answer worse, not better. That's quite different to simply 'chuck your guess out and try again'. It also suggests that ACX poll-takers may be relatively good at geography (compared to... pollees whose first guesses were more than what they think of as a trivial amount wrong? I don't know what the baseline is here).

Without reading through all the links above it's not clear whether the internal crowds referenced were subject to the same 'non trivial error' second data point. In the casino presumably there was some feedback because they didn't win, but I don't know how much feedback. I'm about to go to bed so I will leave that question to the wisdom of the ACX crowd and check back in the morning.

You gave a handful of examples where we could hypothetically benefit from the wisdom of crowds. But in each case, we *already* leverage the wisdom of crowds, albeit in an informal way.

E.g. my decision of academia vs industry is based not just on a vague personal feeling, but also aggregating the opinions of my friends and mentors, weighted by how much I trust them. True, the result is still a vague feeling, but somewhere under the hood that feeling is being driven by a weighted average of sorts.

I'm not sure there'd be much utility in formalizing and quantifying that--we'd probably only screw it up in the process (as you point out).

I use wisdom of the crowds when I cut wood and I don't have my square; if I need a perpendicular line across the width of a piece, I'll just measure a constant from the nearest edge and draw a dozen or so markings along at that constant. They won't all line up (because I can't measure at a perfect right angle) but I just draw a line through the middle of them and more often than not it's square enough, because I'm off evenly either side of 90°.

With your last point, an important part of this is whether "wisdom of crowds" is a spooky phenomenon that comes from averaging numeric responses, or whether it's an outcome of individuals mostly having uncorrelated erroneous ideas and correlated correct ideas (so that the mistakes get washed out in the averages).

If it's the second, you'd expect that all sorts of informal and non-quantitative ways of aggregating beliefs should also work. If you want to know whether to go to academia or industry, you ask 10 friends for advice and notice if lots of them are all saying the same thing (both in terms of overall recommendation or in terms of making the same points). If you want to build a forecasting model, you can hire 10 smart analysts to work on it together.

Of course, the details matter--if you have people make a decision together, maybe you end up with groupthink because one person dominates the discussion, pulls everyone else to their point of view, and then becomes overconfident about their ideas because they're being echoed by a bunch of other people. If the "consensus information" and "individual errors" in people's thinking are fairly legible, on the other hand, you might do a lot better with discussion and consensus than with averages because people can actually identify and discard their erroneous assumptions by talking to other people.

What happens if you compare people's second guesses against their first? I.e., is the model predicting “thinking longer causes better guesses” excluded by the data?

My intuition is that wisdom of the crowd of one would predict that the second guess shouldn't be consistently better.

The systematic error might be better known as Jaynes's "emperor of china fallacy".

One question I have is whether language models (and NNs in general) can be used to generate very large 'crowds'. They are much better at flexibly roleplaying than we are, can be randomized much more easily, have been shown to be surprisingly good at replicating economics & survey questions in human-like fashions, and this sort of 'inner crowd' is already how several inner-monologue approaches work, particularly the majority-voting (https://gwern.net/doc/ai/nn/transformer/gpt/inner-monologue/index#wang-et-al-2022-section “Self-Consistency Improves Chain of Thought Reasoning in Language Models”, Wang et al 2022).

I use the single-player mode a lot when I'm guessing what something will cost - and I use it on my wife too. I start with two numbers, one obviously too low and one obviously too high. I then ask:

Would it cost more than [low]?

Would it cost less than [high]?

Would it cost more than [low+$10]?

Would it cost less than [high-$10]?

. . . . and so on. You know you're getting close when the hesitance becomes more thoughtful.

I'm sure I'm not the only one who does this, but I believe that many of us do something similar in a less deliberate or structured way. If you've lived in Europe, you probably have a good feel for the scale of the place and of one country relative to the next. You may even have travelled from Paris to Moscow. If you live in North America, you may zoom out and rotate a globe of the Earth in your mind's eye until you reach Europe, and then do some kind of scaling. Estimating by either method will almost certainly give a better result than a WAG most of the time. So your "very wrong" answers weren't necessarily from lizardmen, but were just WAGs rather than thoughtful estimates.

Post gets only a 7/10 enjoyment factor, I still don't know how far apart Paris and Moscow are in surface kilometres and am now forced to have to go look it up. Upon reflection that my personal enjoyment might have been wrong, I've revised my estimate to 5/10 and have now averaged this out to 6/10...or was it ....the square root of 5*7 or 35^(1/2) for an enjoyment of 5.92/10? I don't even know anymore!

I took the instruction to assume that I was off by a significant amount seriously. I decided I thought I was more likely to be greatly underestimating than overestimating, and so took my first estimate and multiplied it by 10. In other words, I really didn’t re-estimate from scratch at all. If this analysis was your intention all along, perhaps explaining your intentions would have gotten people to rethink it in a more straightforward way.

This is both a great example of and a horrible case of the "wisdom of crowds" fallacy in forecasting - the problem isn't that you're guessing at something known to a large part of the population approximately and so a larger sample more reliably gives you a median that is close to the ideal median of the entire population, which will be somewhere in the vicinity of the real thing because there is some decent penetration of the real value into the populace.

In forecasting you're guessing at something that isn't known to a large amount of the population, but the population and ergo your sample will have some basic superstitions on the issue that come mostly from mass media and social media and so even when you get a good measurement of the median, the prediction is still crap because you polled yourself an accurate representation of the superstition and not the real thing.

Say you want to know when Putin will end the Ukraine war - only Putin and a few select individuals know when that will be - if at all and this isn't made up on the go. But everybody will have some wild guesstimate, since newsperson A or blogger B or socialite Z (pun intended) posted some random ass-pull on twitter not necessarily claiming but certainly implying to know when it will happen. This is the result you're gonna get in your poll.

Wisdom of crowds is useless as forecasting and only works when the superstition has some bearing on the issue at hand, i.e. the policy itself is influential on public opinion or there is a strong feedback loop which ensures conformity of what's happening with the emotional state of "the masses". That, mostly, doesn't appear to be the case.

This is something that I've been thinking about in the context of LLMs. Ask an LLM a question once, and you are sampling its distribution. Ask it the question 10 times, consider the mean and variance, and you have a much better sense of the LLM's actual state of knowledge.

Is the data from the study saying that the average guess was many times larger than the actual answer? It seems that that might be part of the reason why you got different error measurements. Guessing geographical distances has a limit on upper bounds in a way that guessing a number of objects doesn't.

Doesn't Caplan's Myth of the Rational Voter deal with how the wisdom of the crowds only works when people aren't systematically biased on the subject in question?

For those who were (like me) confused by what "geometric_mean[absolute_value(geometric_mean<$ANSWERX, $ANSWERY> - 2487)]" is supposed to mean, here's the ChatGPT explanation which makes sense:

This expression calculates the geometric mean of the absolute value of the difference between the geometric mean of two values ($ANSWERX, $ANSWERY) and 2487.

The geometric mean of two values is calculated by multiplying the two values and taking the square root of the result. So the expression "geometric_mean<$ANSWERX, $ANSWERY>" calculates the geometric mean of the two values.

The difference between this geometric mean and 2487 is then taken, and the absolute value of this difference is calculated, ensuring that the result is always positive.

Finally, the geometric mean of this absolute value is calculated, which gives a single value as the final result.
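For what it's worth, here is how I would sketch that expression in code. One assumption on my part, which the explanation above glosses over: the outer geometric mean only makes sense if it is taken across all respondents, with each respondent contributing one absolute error. The function names and the sample pairs are hypothetical.

```python
import math

TRUE_DISTANCE_KM = 2487  # the value used in the quoted expression

def geometric_mean(values):
    """Geometric mean of positive numbers, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def inner_crowd_error(answer_pairs, truth=TRUE_DISTANCE_KM):
    """Geometric mean, across respondents, of |GM(first, second) - truth|."""
    errors = [abs(geometric_mean(pair) - truth) for pair in answer_pairs]
    return geometric_mean(errors)  # undefined if any error is exactly 0

# Hypothetical (first guess, second guess) pairs in km, not survey data.
pairs = [(2000, 3000), (1500, 2500), (3000, 4000)]
print(inner_crowd_error(pairs))
```

A respondent whose combined guess lands exactly on 2487 would make the outer geometric mean undefined (log of zero), so real data would need some rule for that case.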

> So is the percent chance that your country would win. If you could cut your error rate by 2/3 by using wisdom of crowds techniques with a crowd of ten, isn’t that really valuable?

Aren't people doing that all the time? Governmental organizations have committees; large corporations have teams; some of them even hire smaller companies as contractors specifically to answer these types of questions.

re: "What about larger crowds? I found that the crowd of all respondents, ie a 6924 person crowd, got higher error than the 100 person crowd (243 km). This doesn’t seem right to me..."

I'm also suspicious, and would predict this observation will reverse with enough resamples of the 100-person subsets. 6,000 instead of 60 would probably do it? Or maybe some outlier was simply missed.

There is likely a simple proof based on sum-of-squares decompositions that would show the average over all subsets of the 100-person group has higher error.

I think the spookiness of "inner crowds" improving your answers mostly comes from an intuition that whatever you were doing originally can be approximated as being an ideal reasoner. An ideal reasoner shouldn't be able to improve their answers by making multiple guesses.

But humans are often pretty far from being ideal reasoners. If this works, I see that more as an indictment of how bad humans are at numerical estimates, rather than a spooky oracle.

(Though this doesn't prevent it from being useful...)

One hypothetical mechanism for why it works is that it forces the forecaster to make an estimate of their uncertainty and take a second draw from the implied distribution. It’s similar to when someone wants to “sleep on it”, even though they aren’t going to get any new information. They are just going to think about the worst case (and maybe best case) and get a second draw after thinking more about the distribution of results.

> I think the answer is something like: you can only use wisdom of crowds on numerical estimates, very few people (currently) make those decisions numerically, and the cost of making those decisions numerically is higher (for most people) than the benefit of using wisdom of crowds on them.

Actually, I think you're wrong on this one: wisdom of the crowds really is OP and we're severely under-using it.

An example that immediately comes to mind is pair programming: by having two people work on the same code simultaneously, you can immensely increase their productivity. Every time I've tried it, I had positive results, and yet most companies are *very* hostile to the idea.

The part about getting diminishing returns as you add more people is interesting too. I wonder if you could drastically reduce design-by-committee problems in an organization by making sure all committees involved have at most three or four people in them.

Maybe a non-spooky explanation is that when we do not know the exact answer to a question, we instead have a distribution of possible answers. When you force someone to collapse that wave function down to a single scalar measurement, they will randomly pick one possible answer as per the probability distribution. But you have lost all the rest of the information contained in the distribution. When you ask again, you make a 2nd sampling from the distribution, which adds precision. Note that if you keep asking you will get more points, but inevitably you still lose information.

For example, I might know that there is either $100 or $200 in my bank account because I don't know if a check cleared yet. If you force me to pick a single value I'll pick either at random. Ask me twice and there's a 50% chance I'll pick the other. By your way of measuring it looks like I don't know much, when in fact I have complete information less a single bit.

tl;dr: I got similar results as Scott: the inner crowd helps a bit, but not too much. Strangely, the second estimate was much worse than the first. Some speculated that this was due to Scott's phrasing "off by a non-trivial amount" in the second question, but the same effect (worse second estimate) was also in the literature, where probably they didn't have such a phrasing. (But my source was much less sophisticated than Scott's VD and VDA paper.)

Highlight numbers, GM stands for "geometric mean":

- The first estimate was off by a factor 1.815. (This means that the GM of all those factors was 1.815)

- The second estimate was off by a factor 1.901.

- The GM of the two estimates was off by a factor 1.791.

- How often was the first estimate better than the second: in 53.3% of the cases.

- How often was the GM better than the first estimate: in 52.8% of the cases.

- How often was the GM better than the second estimate: in 60.0% of the cases.
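For anyone who wants to reproduce numbers of this kind on their own data, here is a minimal sketch of how such statistics can be computed. It assumes "off by a factor" means max(estimate/truth, truth/estimate) and that ties count as "not better"; both are assumptions for illustration, and the sample guesses are made up.

```python
import math

def error_factor(estimate, truth):
    """How many times too high or too low an estimate is (always >= 1)."""
    return max(estimate / truth, truth / estimate)

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def summarize(pairs, truth):
    first = [error_factor(a, truth) for a, _ in pairs]
    second = [error_factor(b, truth) for _, b in pairs]
    # The combined estimate is the geometric mean of the two guesses.
    combined = [error_factor(math.sqrt(a * b), truth) for a, b in pairs]
    return {
        "gm_factor_first": geometric_mean(first),
        "gm_factor_second": geometric_mean(second),
        "gm_factor_combined": geometric_mean(combined),
        "first_beats_second": sum(f < s for f, s in zip(first, second)) / len(pairs),
        "combined_beats_first": sum(c < f for c, f in zip(combined, first)) / len(pairs),
        "combined_beats_second": sum(c < s for c, s in zip(combined, second)) / len(pairs),
    }

# Hypothetical guesses in km, not the survey data.
print(summarize([(2000, 3000), (1500, 2500), (3000, 4000)], truth=2487))
```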

When asked to guess a number, my mental process is to first find a range, then pick (somewhat arbitrarily, honestly) within that range. I suspect that repeatedly sampling the same person is just a rough, inefficient way to find their range estimate.

I suggest trying a similar question but asking for the 70th percentile upper and lower bound on the distance (with another question asking if the person knows what that means as a filter).

I can think of a couple of mundane reasons this is probably correct.

In one case, you *sorta* know the answer and thus can make a guess about how to improve your first guess. On the Moscow question, I knew Moscow was probably longer than it seemed and thus if my first answer was wrong, it was likely I had guessed too low rather than too high. My second guess was higher and closer to the real answer and thus my average was better.

In some other case you have 0 idea at all. I have 0 idea how far some specific exoplanet is from Earth. Thus I'm likely to make wild guesses that cover my bases. Uh, 5 light years? Uh, 1500 light years? Almost assuredly the average here will be better even if I have 0 idea.

What if you tried bootstrapping the larger groups of individuals (i.e. sample with replacement)? I’m on vacation or I’d do it myself, but I’d be curious whether that improves the error.
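Roughly what I have in mind, as a sketch rather than a reproduction of the post's method: the guesses below are hypothetical stand-ins for the survey responses, and the crowd is aggregated with a plain arithmetic mean for simplicity (a geometric mean could be swapped in).

```python
import random
import statistics

def bootstrap_crowd_error(guesses, truth, crowd_size, resamples=1000, seed=0):
    """Mean absolute error of the crowd average over bootstrap resamples."""
    rng = random.Random(seed)
    errors = []
    for _ in range(resamples):
        crowd = rng.choices(guesses, k=crowd_size)  # sampling WITH replacement
        errors.append(abs(statistics.fmean(crowd) - truth))
    return statistics.fmean(errors)

# Hypothetical guesses in km standing in for the survey responses.
guesses = [1200, 1800, 2000, 2400, 2500, 3000, 3500, 5000, 8000]
for size in (1, 10, 100):
    print(size, round(bootstrap_crowd_error(guesses, 2487, size), 1))
```

Because each crowd is drawn with replacement, the crowd size can exceed the number of respondents, which is what makes it possible to probe "larger" crowds than the data strictly contains.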

This suggested to me that the 'internal crowd' was almost entirely worthless. "P < 0.001!" Yes, but magnitude <2% improvement? I have low confidence in a result like this one (even with a great p-value!) that purports to demonstrate a method for 1.5% improvement in guessing accuracy.

IIRC the reasoning for *why* the (outer) wisdom of crowds works, is that the crowd contains a few experts who will be biased in favor of the correct answer... while everyone else errs randomly above or below the correct answer. So there was no inner wisdom of crowds in this version.

“Estimate the number of balls in this jar” and “Estimate the distance between Paris and Moscow” seem like qualitatively very different tasks to me.

Estimating the balls in the jar seems like a visual reasoning task, whereas estimating the distance seems like a preexisting knowledge task.

I didn’t know where Moscow is within Russia. I didn’t know how many countries were between France and Russia. I didn’t remember whether a kilometer was bigger or smaller than a mile. And I didn’t know any reference large distances to use for comparison except that the radius of the earth is 4000 mi. Therefore there were so many inferential steps in my distance guesses wherein to introduce additional error; as compared to my guess about balls in a jar, which seems to just be testing my skill at 1 thing.

I remember unfortunately ruining my results for this by immediately looking up the answer after putting in my guess for the first question (since I didn't know there was going to be a second).

Hi Scott; it's the inverse-square-root. The standard error of an estimate declines as a function of 1 / sqrt(n) for sample size n (because the variance declines with 1/n).

If the estimates are biased, the root-mean-square error is going to be sqrt(bias^2 + (variance / n)) for sample size n, i.e. the mean squared error will decline hyperbolically. This isn't something the study found; it's a mathematically-derived formula, which they then fit to the data to get estimates for bias^2 and variance. Because estimates taken from 1 person are going to be substantially biased, the error will never reach 0; it asymptotes out very quickly. The average of many people is going to be much less biased, such that the variance probably dominates.
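To make the shape of that curve concrete, here is the formula as a small function. The bias and variance values in the loop are hypothetical, chosen only to show the error flattening out at the bias.

```python
import math

def rms_error(n, bias, variance):
    """Root-mean-square error of the average of n guesses with a shared bias."""
    return math.sqrt(bias ** 2 + variance / n)

# Hypothetical numbers: bias of 200 km, individual variance of 1000**2 km^2.
for n in (1, 10, 100, 10_000):
    print(n, round(rms_error(n, bias=200, variance=1000 ** 2), 1))
```

With zero bias this reduces to the plain standard error, sqrt(variance / n); with nonzero bias the curve approaches |bias| no matter how large n gets.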

I probably produced two of the very far outliers because of being very bad at geography and spatial reasoning generally. I think I put down a guess that was an order of magnitude wrong, and then, being told by the second question to answer as though my first was wrong, changed my answer by an order of magnitude in the wrong direction. I don't know if this information is helpful to anybody; but some of us don't realize we're being lizardmen because we have no idea how to meaningfully connect the ideas "kilometer" "Paris" and "Moscow". 1,000 km seems as reasonable to me as 200,000 km.

I'd be curious to see if those with dissociative identity disorder (or those who self-identify as systems, since that's probably more common than an official diagnosis) are better than the rest of us at this internal wisdom of the crowds.

This hits on why I don’t see fast AI takeoff being a thing. GPT is wisdom of the crowds. A bunch of text is averaged together and gets you an answer that is directionally correct (as far as text completion goes) but is only going to asymptotically approach reality.

To “know” facts you need a different methodology, that is essentially brute force. How do you know the distance? You looked it up from a reputable source, which is reputable thanks to a reputation that took thousands to millions of person-hours to cultivate, and on top of that someone had to actually physically go and measure (or just wait until we launch satellites that account for general relativity into space and compute it from their data.)

Wisdom of the crowd works because it is actually very very hard to obtain real knowledge, but we think it is easy because we have a superficial experience of “knowing” many different things. Averaging a bunch of estimates allows more real knowledge to contribute.

All this gives me a low prior on AI takeoff even being a thing. We will burn out on modelling existing human knowledge and then begin the hard work of developing machines that can do the hard and painstaking work of actually gaining new knowledge. It will not be fast because knowing things is really a lot of work. Those 10^46 simulated humans will probably get bored and want to do something easier.

My initial impulse is to ask for control! What happens if you pick a random number in a given range to guess (say, for Paris to Moscow the range would be something like 50 to 50,000 km, and yes I know that no two points on earth's surface are separated by more than 20,000 km, but some of your readers might not know it), then take a random distribution on the log scale, then pick two random samples? Would the "wisdom of crowds" effect be random chance?
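A sketch of what such a control could look like, assuming guesses that are uniform on the log scale between 50 and 50,000 km and error measured in log units (that is, "off by what factor"). All names and defaults here are illustrative choices, not anything from the post.

```python
import math
import random

def control_experiment(truth=2487, low=50, high=50_000, trials=10_000, seed=0):
    """Compare single random guesses with the geometric mean of two of them."""
    rng = random.Random(seed)
    log_truth = math.log(truth)
    single_error = 0.0
    pair_error = 0.0
    for _ in range(trials):
        # Two independent guesses, uniform on the log scale between low and high.
        a = rng.uniform(math.log(low), math.log(high))
        b = rng.uniform(math.log(low), math.log(high))
        # Mean log error of the two individual guesses.
        single_error += (abs(a - log_truth) + abs(b - log_truth)) / 2
        # Log error of their geometric mean (the average in log space).
        pair_error += abs((a + b) / 2 - log_truth)
    return single_error / trials, pair_error / trials

single, pair = control_experiment()
print(f"single guesses: geometric-mean error factor {math.exp(single):.2f}")
print(f"pairs combined: geometric-mean error factor {math.exp(pair):.2f}")
```

With error measured this way, the combined guess can never do worse than the average of the two individual errors (that is just the triangle inequality), so some improvement shows up even for purely random guesses. The interesting question is whether the improvement in the real data is larger than in a control like this one.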

I've also found the "wisdom of the random duo" effect in my research (https://braff.co/advice/f/forecasting-masterclass-7-find-the-martha-to-your-snoop). I wonder if you or I could simulate the inner crowd by looking at forecasts on props that are highly correlated within the same contest? You have a bunch of Ukraine props where the average-across-3-props for a given forecaster may be a more accurate read on the whole battle than any one forecast?

I also have poor intuition about this problem. However, when I got to the second question about the distance from Paris to Moscow, essentially asking me if I wanted to change my first guess, Monty Hall came immediately to mind.

Is there a logical comparison between this and the Monty Hall problem? Did anyone else think this? Should I look up some old Marilyn vos Savant posts?

If people's first answer was generally closer than their second answer, then that means it'd probably be best to take a weighted average that puts more weight on the first answer than the second.

In games, like Codenames or Wavelength, people on my team independently come up with their guesses before we share and discuss them with each other.

In forecasting, I consider what range of forecasts I might plausibly make and average them. I also make multiple forecasts using different methods (e.g. using two different relevant reference classes) and then average them, to make use of information from independent sources. I also consult others' forecasts on the question when available to aggregate their views.

In general, when a group is collaboratively seeking the truth on a topic or trying to make a decision, I encourage giving everyone time to think of their own independent impression before having individuals share their view.

Just came here to say when I answered the distance question in the survey, I was SO off. I had no concept of the size of the earth so no idea what a reasonable distance would be. I can't remember now in which direction, but I was off by a whole order of magnitude. So yeah, probably one of the outliers. Just to put it out there that we're not all lizardmen, some of us just don't have a good model of these distances.

I find this really bizarre. I thought the basis for the wisdom of crowds was Condorcet's Jury Theorem: assume (Independence) that individual voters have independent probabilities of voting for the correct alternative. Also assume (Competence) that these probabilities exceed ½ for each voter. It follows that as the size of the group of voters increases, the probability of a correct majority increases and tends to one (infallibility) in the limit. Suppose the number of voters = 1. While the single voter could make multiple guesses, how would that not violate the independence condition?
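For reference, the theorem's claim can be computed directly from the binomial distribution. This is a minimal sketch under the theorem's own assumptions (independent voters, the same competence p for everyone, an odd number of voters so there are no ties); the function name is made up for the example.

```python
from math import comb

def majority_correct_probability(n_voters, p_correct):
    """Chance that a strict majority of independent voters is correct (n odd)."""
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

for n in (1, 11, 101):
    print(n, round(majority_correct_probability(n, 0.55), 4))
```

The calculation leans entirely on the independence assumption, which is exactly the condition a single person guessing repeatedly would seem to strain.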

Is it possible that since most of your readers are American, they had some idea in miles, and many just gave that same guess in km due to unfamiliarity with the conversion? Taking the mean guess in km and changing the units to miles would give something a lot closer to the true answer.

Thinking of the mechanism behind the "crowd of one" effect. At first I thought it's a variant of the Monty Hall effect - first guess under complete uncertainty, second guess somewhere else in the spectrum, with some uncertainty removed. But more likely it is a matter of combining sources of incomplete information. People will have different hypotheses or heuristics in mind to make a guess. They will only use one heuristic for the first guess. They will use a different one for the second guess. So now there is more information present than with a single guess. For example, if a person is completely uncertain about Paris-Moscow, they first might use the heuristic of "Russia is huge", then the heuristic "but Europe is small". The average of both biases produces a better result.

>>> What about in finance, where people often make numerical estimates (eg what a stock will be worth a year from now)? Maybe they have advanced models calculating that, and averaging their advanced models with worse models or people’s vague impressions would be worse than just trusting their most advanced model, in a way that’s not true of an individual trusting their first best guess?

In fact this is standard practice in finance and most other ML applications, see https://en.wikipedia.org/wiki/Ensemble_learning , and is known to be one of the few methods systematically resulting in better predictions (another is increasing the dataset size). Multiple different models are typically created using different sources of information, underlying architectures, training techniques, etc, which are then "averaged" to make the final predictions. The models are usually as advanced as possible (i.e. they are a crowd of experts), and the averaging is typically also learned (i.e. instead of choosing between arithmetic and geometric means, you would learn the actual ensembling function to better account for each of the model's biases, ideally making use of their self-reported uncertainty). I doubt there's any big financial trading firm that does not have this in place, including the presence of multiple non-communicating teams working on various models for the same purpose, each of them without access to the other models or final ensemble.

I have heard 'Wisdom of the crowds' described very differently: when you get large groups of people, a small number will have specialized knowledge of the question, and a larger number will have general knowledge. If the wildly ignorant are simply guessing then their errors will frequently (but not always) cancel each other out and what you are left with pushing the data are the experts. You aren't averaging a bunch of guesses, you are asking enough people to find someone who knows the answer and then averaging out all the bad guesses.

Perhaps quite tangential, but this has me thinking about how criticism and ratings can help us use “the wisdom of crowds” to predict things that aren’t objective in any real sense.

For example a movie’s quality is pretty arguably entirely subjective, and whether any one person will like a given film is hard to predict, but we all commonly use the wisdom of crowds to estimate a film’s quality and help us predict if it will be worth our time or not.

Each person who rates a film is “guessing” the film’s objective quality, since no one person actually gets to claim that objective perspective. But if we add up enough subjective guesses, we can kind of approximate some kind of “objective” value.

I think there are probably a lot of ways we use a kind of vague sense of what “the wisdom of the crowd” is about certain issues to help us make judgment calls.

> This looks like some specific elegant curve, but which one? A real statistician would be able to give a good answer to this question.

Under the simplest hypotheses, it should be the sum in quadrature (i.e., a ⊕ b = √(a² + b²)) of a so-called "statistical uncertainty" proportional to 1/√n and a so-called "systematic uncertainty" which stays constant.

Regarding the answers for the Paris - Moscow distance. I think it's hilarious how you were surprised at some very wrong answers and assumed the reason is lizardmen/trolls. You're really just underestimating how bad some people are regarding distances and geography. I tried hard to give a good estimate but ended up with what is essentially a random number that could have been the distance to the moon for all I know.

Note that the error can never go completely to zero for the infinite crowd. There should be a lower bound on persistent error set merely by the resolution of typical maps - plus an additional contribution from people's natural tendency to round large numbers. Sorry if this comes across as too pedantic, but I think generally these limits set by resolution are interesting and often neglected!

I think on non-numerical things, we already instinctively use wisdom of the crowds. You feel vaguely positive about academia *because* you’ve heard people say more good stuff than bad stuff about academia. Our brains are very good at subconsciously "averaging" status signals, perceived utils, etc, but not so good at averaging actual numbers, so it’s only once we start putting numbers on things that we have to remember to do the averaging step explicitly.

(This is Eric; I helped run the 2022 forecasting contest.)

I've thought a lot about this -- indeed, the first paper I wrote in grad school can be summarized as "the wisdom of crowds is a *mathematical* fact" (if you aggregate forecasters in a way that accords with how you score them). I'm planning to write a blog post about this, but let me briefly illustrate what's going on in this comment.

Suppose you put 100 candies in a jar and ask people to estimate how many candies there are. You're then going to score each person based on how far off they were, and compare two quantities: the average of everyone's scores, versus the score of the average of all the estimates (the latter is the wisdom of the crowd).

We're gonna score each participant based on the *square* of the distance to the right answer. (Why the square? Briefly, this choice incentivizes each participant to truthfully report how many candies they expect are in the jar.)

Let's say that the estimates are 90, 100, 110, 120, and 130: so, the participants disagree with each other but are also somewhat biased upward.

From first to last, the (squared) errors of the five participants are 100, 0, 100, 400, and 900, for an average of 300. By contrast, the average of all five estimates is 110, which has a squared error of only 100.

In fact, it is *always* the case that the second number (error of the average) will be no larger than the first (average error), no matter which numbers I chose for my example, and strictly smaller unless everyone gives the same estimate. An intuition you could have is that the first number is equal to the second number, *plus noise*, where the "noise" is the variance in the participants' estimates. (Check it out: the (population) variance of {90, 100, 110, 120, 130} is 200, which is equal to 300 - 100!)

(Feel free to skip this aside, but: what's the math behind this? Briefly, let X be the random variable equal to the signed error of a randomly chosen expert -- so in our example, X would take on the values -10, 0, 10, 20, and 30 with equal probability. Then the average error is E[X^2], whereas the error of the average estimate is E[X]^2. The former quantity is larger, and the difference is E[X^2] - E[X]^2, which is the variance of X.)
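The identity in the aside can be checked numerically on the five example estimates. This is only a minimal sketch in Python; the variable names are illustrative and do not come from the comment itself.

```python
from statistics import mean, pvariance

truth = 100
estimates = [90, 100, 110, 120, 130]

# Signed error of each participant: X takes these values with equal probability.
errors = [e - truth for e in estimates]

average_of_squared_errors = mean(x ** 2 for x in errors)  # E[X^2]
squared_error_of_average = mean(errors) ** 2               # E[X]^2
spread = pvariance(estimates)                              # population variance of the estimates

print(average_of_squared_errors, squared_error_of_average, spread)
```

The three values come out as 300, 100, and 200, so the gap between the average error and the error of the average is exactly the variance.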

The math here is sensitive to the fact that I chose squared error (and to the fact that I chose to aggregate estimates by averaging them). If -- as Scott did -- you take the *absolute value* of error instead of the squared error, it's no longer *mathematically* true. However, I would bet that it's empirically true a large fraction of the time. That's because if some participants underestimate the quantity and others overestimate it, they both count positively toward the average error, but they *cancel each other out* when you look at the average.

As for whether your error will go to zero as the crowd size goes to infinity: no. This is only true under a really strong assumption, which is that the crowd is *unbiased*. So for example, if in my example you have a huge crowd but they're systematically biased so their estimates are centered at 110 instead of 100, then in the limit of an infinite crowd you're still going to be off (your average will be 110).
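A small simulation can illustrate the point about bias. This is a sketch under loud assumptions: the guesses are drawn from a Gaussian with a standard deviation of 20, and the function name `crowd_error` is made up for the example; none of this comes from real survey data.

```python
import random
from statistics import mean

def crowd_error(crowd_size, center, spread, truth, rng):
    """Absolute error of the averaged guess of one simulated crowd."""
    # Assumption: individual guesses are Gaussian around `center`.
    guesses = [rng.gauss(center, spread) for _ in range(crowd_size)]
    return abs(mean(guesses) - truth)

rng = random.Random(0)  # fixed seed so the run is repeatable
truth = 100
for size in (1, 10, 100, 10_000):
    unbiased = crowd_error(size, center=100, spread=20, truth=truth, rng=rng)
    biased = crowd_error(size, center=110, spread=20, truth=truth, rng=rng)
    print(size, round(unbiased, 2), round(biased, 2))
```

As the crowd grows, the unbiased crowd's error should shrink toward zero, while the biased crowd's error should settle near 10, the size of the bias.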

And -- last point -- regarding making multiple estimates on your own and averaging them: it's definitely an interesting frame, but I'd say that you've reinvented the art of *thinking longer about the problem* :)

Here's what I mean: suppose you're weighing going into grad school versus getting a tech job. You think for a while, and you realize: "I'll be 9/10 happy with my pay at the tech job, but only 5/10 happy with my grad school pay." Then you think longer and realize: "I'll be 8/10 happy with the sorts of problems I'll be thinking about in grad school, but only 6/10 happy with the sorts of problems I'll be thinking about in tech." Then you think longer and realize: "the weather at the grad school I'm considering is 3/10, while the weather in the Bay Area tech job is 9/10". And so on. If you wanted to, you could think of each of these things (pay; intellectual interestingness; weather) as separate estimates. And then you can be like "wow, my decision will be more accurate if I average all my estimates together than if I make my decision based on a single factor!" -- I think that's basically all that's going on with the "wisdom of the crowds" here.

Stupid question: hasn't this topic been done to death in statistics? I'm not an expert, but from what I remember, yes, you can combine lots of inaccurate predictors into a more accurate predictor - provided the individual predictors are unbiased, i.e., they don't systematically over- or underestimate.

My gut feeling is that this is the hard part - finding a dozen people knowledgeable enough to give a meaningful estimate is doable. Finding a dozen people who are not all influenced by the same sources of information to be overly optimistic or pessimistic is the hard part, and if you don't, you converge with great confidence on an inaccurate answer.

Edit: should have read Unexpected Values' answer above before I wrote this...

There is a UK quiz show called "Who Wants to be a Millionaire" in which an individual is selected from a dozen or so competitors by their correctness and speed in answering a preliminary question, such as "Put such and such into alphabetical order", and is then asked a series of questions by the host. Each question has four possible answers, shown to the contestant, and one of these is the correct answer.

The contestant starts with three so-called "lifelines", which they can use once each for any question whose answer they are unsure of or don't know: "50 50" (which halves the number of alternative answers), "Phone a friend", and "Ask the Audience".

The "Ask the Audience" lifeline is the most relevant to this discussion. When it is invoked, each audience member selects on a key pad the answer they know or guess is correct, and the contestant is then shown a bar chart of the percentage of selections of each remaining possible answer.

For a commonly known answer, to a question relating to sport or soap operas for example, the "Ask the audience" lifeline is usually fairly conclusive, and one of their choices obviously predominates and turns out to be the correct answer. But sometimes a majority, occasionally spectacularly so, chooses the wrong answer!

It is interesting to speculate why so many people would choose the same wrong answer, presumably guessed. From my observation of several examples, the main reason for this is that they are biased toward a name they have heard of, or association familiar to them, among others they have not.

I have also observed that another source of audience bias obviously occurs when a contestant is unsure of the answer and the host asks them, before they commit to a choice, which answer they think is correct. It seems very foolish for a contestant to divulge a guess in that situation and then go on to use the "Ask the Audience" lifeline, as they will have influenced equally unsure audience members in advance, but many do!

I just asked my partner: First answer was 2000 km. Second answer was 1500 km.

In that case the error got bigger. Could there be a failure mode for this technique, that while on average it may make you more correct, there are doom cases where it makes you catastrophically more wrong?

>Since we only have one datapoint for the n = 6924 crowd size, it’s not significant and we should throw it out.

I have no background in statistics, but that seems wrong to me. Is that the only rationale for throwing it out? On average, it should be at least as good a crowd as any other, and according to the theory (larger crowd = better), it should be the most representative of the average participant's wisdom.

Also, how did you come up with the crowd size of 100 for doing the analysis? If you tried different crowd sizes, were the results different on average? Did you try a sample of random crowd sizes?

This is similar to something I sometimes have to do for my job. We need to get estimates from experts on quantities of interest. There are various techniques you can employ to get them to give unbiased answers. The simplest and most useful one is after you ask for their best guess you ask "is it more likely that the true answer is above or below your guess?"

It's a technique I employ in my own decision making too, and from the comments it seems that lots of other people do.

> As mentioned above, the average respondent was off by 918 km on their first guess. They were off by 967 km on their second guess.

Was the second guess (on average) higher than the first? Estimating a distance has this asymmetry where there are a finite number of ways to undershoot, but no limit to how far you can overshoot.

If so, maybe the you-are-the-crowd hypothesis has a better shot at holding true in something like betting the point differential in a game?

You should check out 'Noise' by Kahneman, Sibony and Sunstein. It's a whole book about this stuff. They discuss lots of experiments on the wisdom of crowds, including a crowd of 1. Especially interesting is when they address real-world applications - in sentencing, insurance, executive search and more.

Isn't the wisdom of the crowd sort of the whole idea of democracy?

Assuming everyone makes some kind of internal estimate of how good/bad each candidate's policies are, a fair election should spit out the best option according to the median estimate. It's a lossy compression - we lose the numbers themselves and skip straight to the decision, and I'm not sure how well crowd wisdom works with the median, but I think our systems do *try* to apply this principle more than we give them credit for.

I'm a bit late to this, and I haven't read all the comments, so it's possible someone else mentioned this, but it seems like "wisdom of crowds" becomes less useful for highly subjective future predictions, as opposed to estimations of objective concrete facts which are not influenced by decisions made after the estimates. If wisdom-of-crowds predictions are made about things over which our decisions have influence, the mere knowledge of the wisdom-of-crowds "answer" to a question influences our future decisions, rendering the prediction unreliable, because learning the prediction changes the likelihood of the outcome (making it either more or less likely).

I don't know if anybody else did this, but when I guessed the second time, I imagined how I would guess if I knew I was off significantly with my first guess. So it wasn't a clean guess. I guessed 5 or 10 times my first guess because I was imagining how I'd react if someone told me my first guess was way off... If that makes sense.

My first guess, unburdened by the thought process, was 1500 miles, which turns out to be within 72 km of the right answer. I surprised myself. For my second guess my reasoning was, well, if I'm off by a non-trivial amount....

So, maybe the second question should be just "guess again", which is closer to how a crowd works.

I've taken a quick gander at the survey results, and I think you might have ballsed this one up.

There's a problem with the first question, in that there are two possible answers: by road (2834) or by air (2486). That's a difference of ~350km or about 12% before you start.

As you used "non-trivial amount" in the second question, there's a spot of priming/framing going on, such that the second answer can be reasonably expected to be further away from the first than would otherwise be the case.

I'm a little mystified that you assert nobody thinks about the wisdom of the crowds, either inner or outer, in their ordinary lives.

In my world, asking people who you know (and sometimes even that you don't know, like in a blog comment section) for their thoughts before you make an important decision -- consulting the "outer" crowd -- is ubiquitous. I can't think of anyone who *doesn't* do this. Similarly, "sleeping on it" or "not making decisions hastily," which amounts to asking the inner crowd (i.e. "re-evaluate this estimate again after some time has passed") is also ubiquitous.

We use wisdom of crowds all the time, and have for thousands of years, without a numerical component. A king's advisors are literally a core example, but extend to something like the Cabinet in the US, or a board of directors at a large company. If I'm thinking of making an important life decision, I may check in with my spouse, my sister, my best friend, my pastor, my financial advisor, and whoever else. All are using a type of wisdom-of-crowds.

I already do this when I make estimates, and I think many other people (less than 1 in ten, but at least 1% or so) do too!

Specifically, when I am making a you-only-get-one-guess kind of guess, and it's important that I'm maximally precise (such as when the best guess, of hundreds, wins a prize, but there's no prize for being almost as close), I start by asking what number I'd throw out. I just cough something out via whatever estimation tool pops to mind. Then I try to identify, assuming that's *wrong*, which *way* it's wrong--meaning I take a second guess, with a different estimation tool, or a more careful use of the first tool. Then a third. And so on. I'll also put error bars on my guesses' estimation tools (e.g. an estimate arrived at via multiplying four numbers with plus-or-minus 50% error bars has a much bigger error range than an estimate arrived at via adding four numbers with plus-or-minus 50% error bars).

I think when the stakes are high, people already do this. When the stakes are low, mostly they don't--so the "this is OP" will show up in survey questions, but not in real life (or less so) for stuff that matters.

There are 6,379 questionnaires where the question was answered both times. 6,076 had both answers between 500 and 20,000 km. I am using the 6,076 as my population for the analysis. I binned the answers into increments of 250 (e.g. Bin 2,500 would contain the count of all guesses from 2,251 to 2,500) in order to remove odd cutoffs: since most answers were round numbers, I didn’t want a bin ending in a 0 or a 9 to change my results much.

For the 2 closest bins (2,500 and 2,750, which would account for guesses between 2,251 and 2,750 km), there were the following number of guesses in that range:

Guess 1s: 579

Guess 2s: 626

Averages: 914

Of those that originally guessed in that range, 227 (39%) had Averages in those 2 bins.

When you include the next 2 bins (2,001-3,000 range) the effect gets smaller

Guess 1s: 1,485

Guess 2s: 1,583

Averages: 1,627

When you include the next 2 bins (1,751-3,250 range) the average gets worse

Guess 1s: 2,424

Guess 2s: 2,375

Averages: 2,185

870 had their Guess 1 in the 2,000 bin, of those, 444 had a better average (Between 2,250 and 2,750). 102 had higher averages that were just as bad or worse (>=3,000) and 320 had lower averages that were just as bad or worse. (<=2000)

817 had their Guess 1 in the 3,000 bin, of those, 311 had a better average (Between 2,500 and 2,750).

456 had higher averages that were just as bad or worse (>=3,000) and 50 had lower averages that were just as bad or worse (<=2,250)

The table below contains most of the data: 5,357 of the observations. The first column is the bin their Guess 1 was in, the second column is the count of guesses, the third column is the number of averages that were better than the Guess 1, and the fourth column is just column 3 as a percentage of column 2.

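| Bin | Count | Improvements | % Improvements |
|---|---|---|---|
| 500 | 66 | 62 | 94% |
| 750 | 80 | 48 | 60% |
| 1000 | 533 | 314 | 59% |
| 1250 | 182 | 118 | 65% |
| 1500 | 412 | 277 | 67% |
| 1750 | 146 | 88 | 60% |
| 2000 | 870 | 444 | 51% |
| 2250 | 89 | 39 | 44% |
| 2500 | 526 | 0 | 0% |
| 2750 | 53 | 0 | 0% |
| 3000 | 817 | 311 | 38% |
| 3250 | 69 | 28 | 41% |
| 3500 | 276 | 135 | 49% |
| 3750 | 30 | 13 | 43% |
| 4000 | 521 | 261 | 50% |
| 4250 | 25 | 14 | 56% |
| 4500 | 139 | 59 | 42% |
| 4750 | 10 | 5 | 50% |
| 5000 | 513 | 264 | 51% |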

Overall, it looks like those who initially guessed low had their average improve their score, while those who initially guessed high did not. I am not sure what to take away from all of this, besides that it’s not obvious that an individual guessing twice is a good way to go about increasing accuracy. There may be an effect where the best original guessers could do even better by multiple attempts.

Maybe the real world doesn't use proper scoring rules, so we don't instinctively make decisions that are good at maximising Brier scores or whatever metric you want to use.

Like if you come to a fork in the road, with one path taking you north and the other east, and you think that the eastern path is likely to be 3 times worse than the northern one, you don't average them and set off NNE.

Random fun and reasonably relevant math fact: arithmetic mean and geometric mean are both special cases of an elegant generalization of pretty much all possible kinds of means: the power mean. https://en.wikipedia.org/wiki/Generalized_mean
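For concreteness, here is a minimal sketch of the power mean in Python. The function name `power_mean` and the two sample guesses are hypothetical, chosen only to show how one formula covers the arithmetic, geometric, root-mean-square, and harmonic means.

```python
import math

def power_mean(values, p):
    """Generalized (power) mean of positive numbers with exponent p."""
    n = len(values)
    if p == 0:
        # The p -> 0 limit of the power mean is the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1 / p)

guesses = [2000.0, 3000.0]       # hypothetical pair of guesses
print(power_mean(guesses, 1))    # arithmetic mean
print(power_mean(guesses, 0))    # geometric mean
print(power_mean(guesses, 2))    # root mean square
print(power_mean(guesses, -1))   # harmonic mean
```

For positive inputs the result is non-decreasing in p, so the harmonic mean is the smallest of the four and the root mean square the largest.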

Not at all what I thought you were testing. I thought you had found some different magic hack, because when you asked the first time I said 3000km and the second time I said 2500km, almost bang on when I checked.

About your last few paragraphs: I agree with most of your remarks about the general inability of people to consistently and adequately quantify "vague feelings", and that, knowing that, finance people will prefer models they trust to relying on people.

However, something in your paragraph about forecasting and the way it's not used in the real world brought me back to the fact that most decision-making is going to be not an applied probability, but a black-or-white decision. "What is the probability that I should work in academia" actually makes little actionable sense if it's not >80% or <20% (adapt the numbers to your risk-aversedness). And actually, going back to forecasting, it is very counter-intuitive that black-or-white events should be probabilized: either you believe they will happen, or you believe they will not -- it's not like there can be anything in between. What people making decisions want is not a 66% chance: they want a 0-or-1 belief that can be explained.

Your method of using the geometric mean of the absolute error doesn't work well as a summary of how far off the typical answer was. Suppose for example the true answer to some question is 20, and the guesses are distributed uniformly randomly within the interval [19,21] (the exact form of the distribution doesn't matter, so long as it's continuous with a non-zero density near the true answer). After taking the absolute error, it's uniformly random in [0,1]. If this average is well behaved, in the limit we can replace the product with an integral exp(integral from 0 to 1 of ln(x) dx). The integral is negative infinity [edit: As Matthieu pointed out this is not correct, and therefore neither are the things that follow.] so the answer (exp of that) is 0. This means the average is not well-behaved, but intuitively it implies that the geometric mean of the absolute error will tend to zero as the number of samples tends to infinity, even if the actual average error remains constant. Note that this is not the error of the average, but the average of the error, which should remain non-zero. This is probably why the size-ten crowds appeared better than the full-size crowd, because this method of averaging over-emphasises values near zero.
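For reference, the integral flagged in the edit note can be evaluated in closed form. This is a standard calculus result, worked here under the comment's own assumption of an absolute error uniform on [0, 1]:

```latex
\int_0^1 \ln x \, dx
  = \Big[\, x \ln x - x \,\Big]_0^1
  = (0 - 1) - \lim_{x \to 0^+} \left( x \ln x - x \right)
  = -1,
\qquad
\exp\!\left( \int_0^1 \ln x \, dx \right) = e^{-1} \approx 0.368 .
```

So in that example the geometric mean of the absolute error would tend to about 0.37 rather than to 0, compared with an arithmetic mean of 0.5.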

A more reasonable way to combine the geometric mean with estimating errors would be to take the logarithms of all the estimates and the true value, calculate the mean-squared error or mean absolute error or something of the logarithms, then either use this result as-is ("estimates were off by X orders of magnitude on average") or take the exponential of it ("estimates were off by a factor of [e^X] on average"). In either case the result is dimensionless rather than being a kilometer value.
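The log-scale error described above could be computed along these lines. This is only a sketch: the function name `log_scale_error`, the choice of mean absolute error with natural logs, and the sample numbers are all assumptions for illustration, not the survey data.

```python
import math
from statistics import mean

def log_scale_error(estimates, truth):
    """Mean absolute error between ln(estimate) and ln(truth)."""
    return mean(abs(math.log(e) - math.log(truth)) for e in estimates)

# Hypothetical example: one guess half the truth, one exact, one double.
truth = 2500
estimates = [1250, 2500, 5000]

x = log_scale_error(estimates, truth)
print(x)            # error in natural-log units
print(math.exp(x))  # "off by a factor of e^X on average"
```

Both outputs are dimensionless, and a guess of half the true value is penalized exactly as much as a guess of double it.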

I may at some point re-do your analysis with this method and see how much it changes the results.

Could you get valuable results for a single person by asking about the distances between two different pairs of cities which are (roughly) the same distance away from each other? You would get some confounding results based on personal knowledge of geography, but it might be useful way of averaging multiple rough guesses from a single person.

If you're asking about a single data point, I wouldn't expect multiple guesses from a single person to be particularly useful. Asking for an error margin (or a min/max value) is the only way I can think of to actually get additional useful information from a single person.

Opinions are not formed independently, there are latent (hidden) relationships that imply correlation structure among the responses (gender, politics, shared exposure to blogs, location, SES, ...) effectively reducing the size of the crowd.

I don't think your second question's formulation was useful for testing your hypothesis, but it is an interesting illustration of the wisdom of the crowd effect. If everybody gives their best guess first, and is told to think it is very wrong, and made to give a second "guessier" guess, you don't have the wisdom of a crowd of size 2X; you have the wisdom of a crowd of size X and the intentionally worse effort of a crowd also of size X, and are taking the average.

Here, everybody gave their best guess and were 918 km off. They were told to assume their first guess was non-trivially wrong, and take a second guess. Collectively they were able to reduce their individual "wisdom" by making a guess that was not their actual first choice, and as a crowd were successfully not as correct as their actual collective first choice.

I'm curious if this would be replicable. "If you ask a crowd a knowledge question and ask them to guess, then tell them to assume their first guess was nontrivially incorrect and guess again, the average of the first guesses will be closer than the average of the second guesses to the correct answer". In other words, respondents are successfully and correctly assigning a relative probability rating to their top two guesses.

Generally interesting interview with Edward Thorp-- this is the section where he explains how he found out that Madoff was a fraud, and the importance (quite specifically) of actually understanding what's going on rather than trusting the wisdom of crowds. The crowd trusted Madoff.

I *think* the wisdom of crowds is about estimates, while it's possible that there are things which can actually be understood. How can you tell if you're dealing with something which can be understood?

Not the first place I've heard the idea, but Edward Thorp argues that holding index funds is the best investment, and part of that is because getting in and out of stocks (and other investments?) is heavily taxed.

Is substantially taxing getting in and out of particular stocks a good idea? Are there costs to pushing people into index funds?

Isn't democracy wisdom of the crowd at scale? And explicitly so if we look at the jury theorem, which was an argument for democracy from Mr. Condorcet.

"If you could cut your error rate by 2/3 by using wisdom of crowds techniques with a crowd of ten, isn’t that really valuable?"

Yes! We use wisdom of crowds all the time! When you ask your friends for advice, that's wisdom of crowds. You don't ask them for numerical estimates of how happy you'd be in academia vs. industry because our brains didn't evolve to deal with numbers. Instead, you ask them whether they think you should go into academia or industry. They subconsciously weigh all the information they have, and give you a subjective sentiment. If Alice, Bob, and Charlie all say you'll be miserable in academia because of X, Y, and Z, except Alice and Bob say you'll be slightly miserable and Charlie says you'll be very miserable, that's pretty good evidence against going into academia.

Democracy is also wisdom of crowds. You don't ask people what the chances of Russia invading Ukraine are. Instead, you do opinion polls to ask them what the proper foreign policy toward Russia should be, which is what you want to know anyways. The war hawks that would get us into a nuclear war cancel out the hippies who would have Russia roll all over us, and in the median you get a more or less reasonable foreign policy.

I was one of the people who gave a ridiculously far-off estimate. Partially this was due to me not remembering how far a kilometer is and going "well, it's either roughly half a mile or roughly two miles ... roughly two miles it is".

There's an easier analysis of the guessing twice strategy:

For each individual, which guess or average is closest to the correct answer?

Using the public data of everyone who answered both,

First guess is best 3036 times.

Second guess is best 2569 times.

The average of the two guesses is best 643 times.

The geometric mean of the two is best 358 times.

So in this data the first guess is most often the closest one. However, if you combine the second guess and the two averages, one of them (but you don't know which) is best more often than the first guess.
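The tally above could be reproduced with something like the following sketch. The guess pairs and the true value of 2500 are hypothetical stand-ins, not rows from the public survey data, and `closest_candidate` is a made-up helper name.

```python
import math
from collections import Counter

def closest_candidate(first, second, truth):
    """Name of whichever candidate lies nearest the true value."""
    candidates = {
        "first": first,
        "second": second,
        "arithmetic": (first + second) / 2,
        "geometric": math.sqrt(first * second),
    }
    # On an exact tie, min() keeps the earliest entry in the dict order above.
    return min(candidates, key=lambda name: abs(candidates[name] - truth))

# Hypothetical (first, second) guesses in km; NOT taken from the survey.
pairs = [(2000, 1500), (3000, 2500), (1000, 5000), (2400, 2600)]
truth = 2500  # hypothetical true value

tally = Counter(closest_candidate(a, b, truth) for a, b in pairs)
print(tally)
```

How ties are broken is a design choice that shifts the counts: if ties are credited to more than one candidate, the four totals can add up to more than the number of respondents.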

That doesn't feel surprising. It's another way of saying "Consider other values, and move your guess closer to values you consider probable." Which is something you're probably doing when you're guessing anyways. There might be some value in writing down a number of guesses, rating them, and then iterating with more guesses and ratings until you feel like you can't improve any more. As a way to simply force yourself to have more thinking time on the problem considering different possibilities.

You have used the abbreviation “OP” twice in this post without mentioning what it means.

I have noticed that ACX and LW posters also tend to use abbreviations a lot and just assume that their audience is in the know. This is very frustrating and does a disservice to those who are new to your community and are trying to learn and enjoy being a part of these discussions.

In the case of prediction market contests, there's an additional factor that would cause averages to be more accurate, even beyond the usual wisdom of crowds. The goal of a participant isn't to minimize their *expected* error, the goal is to *maximize the chances of rising to the top*. Realistically, the only way to win a contest like this is to gamble on some unlikely outcomes and hope for the best. But if you average a crowd together, they'll probably gamble on different things and the outliers disappear.

I'm pretty sure at least one of the very low outliers was my honest answer. I guessed that the round-number km distance from equator to pole was probably 1000 rather than 1,000,000, and Paris-Moscow was probably about a fifth of that.

This phenomenon is similar to test time augmentation in ML. There you do multiple predictions with a bit of randomness injected into the same inputs. Then you take those multiple predictions sourced from the same model and average them.

If you instead enable dropout (which is kind of like a human being on LSD), it's called Monte Carlo dropout.

Spooky. The Van Dolder paper is on my very short list of studies to read soon.

The Good Judgement Project had an experimental condition where some participants made predictions by simultaneously betting in a prediction market and sharing the expected likelihood of an event directly. Their aggregation algorithm achieved highest accuracy when both the bet and the direct prediction were incorporated, implying that they each held some different information. They speculated this was due to the "inner wisdom of crowds" thing, and I think cited that same study.

Why would you have to convert life decisions into numerical values? Consulting family, friends, etc is pretty basic human behaviour, and they certainly may be of clearer mind about your preferences, character, capacities etc than you, at least in certain aspects, and many instances.

Just yesterday, at a mall, I spotted one of these "guess how many marbles are inside and post the answer online" contests. As it happens, I read this post just recently, so I figured I might stand a good chance if I tried this technique: make a few logical guesses, each time assuming my previous guess was wrong in some way, then average them all.

My guesses? 6000, 9000, 4851, 5198. So my answer was their root mean square, 6472. (For some reason I felt the r.m.s. was better than a mere average.)

The actual answer? 6498.

99.6% accuracy. My mind was blown.

(In case you're wondering what I won, the answer is: nothing. Turns out the contest ended 18 months ago, and then they just left the big glass case with marbles there, complete with out-of-date instructions.)
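The arithmetic in this anecdote is easy to check. A minimal sketch in Python, using only the four guesses and the actual count quoted above:

```python
import math

guesses = [6000, 9000, 4851, 5198]
actual = 6498

# Root mean square: square each guess, average, then take the square root.
rms = math.sqrt(sum(g ** 2 for g in guesses) / len(guesses))
# Plain arithmetic mean, for comparison.
arithmetic = sum(guesses) / len(guesses)

print(round(rms), round(arithmetic))
print(round(100 * rms / actual, 1))
```

The root mean square rounds to 6472 and lands at 99.6% of the actual count, while the plain average of the same four guesses would have been about 6262.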

Since multiple estimates from one person seem to be more powerful the less correlated they are, I wonder if there are any strategies-of-thumb which might reduce the correlation among an individual's estimates (other than the obvious wait-until-they-forget-their-other-answers).

e.g.

Estimate 1: "gut feeling",

Estimate 2: "Fermi estimate",

Estimate 3: "some other semi-structured way of extrapolating unknown quantities from known quantities",

This is a well studied property of statistics. If you are trying to estimate the mean, odds are just taking the mean of your sample will be incorrect. If you throw in a completely random number, the odds are good that you will get a better estimate. It's sort of like the Monty Hall three doors paradox.

This may seem totally strange, but statisticians use this to improve their estimates of the mean by a process called bootstrapping. (There are variants of this, e.g. jackknifing.) The idea is to take the means of different subsamples of the original sample and use them to derive a better estimate of the mean.

It's not so much a property of crowds as a property of numbers.

I think something is wrong with this equation. 1/ERROR should always be < 0.005 because the error is always ~200km or more, ln(CROWD_SIZE) should be >1 for any crowd sizes of at least e (2.718), so eg a crowd size of e (2.718) gives predicted inverse error 2.34 + (1.8 * 1) = 4.14 or an error of 1/4.14 = 0.24 km, which is way off. Unless I'm missing something, which is entirely possible.

I tried replicating this and the best-fit curve I found was 1/ERROR = 0.00093 + [0.00073 * ln(CROWD_SIZE)]. (Side note: I only had n=6537, not 6924 as you said, after eliminating all blank answers.)
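To compare the two sets of coefficients side by side, here is a small sketch. It assumes both fits have the form 1/ERROR = intercept + slope * ln(CROWD_SIZE) with ERROR in km; `predicted_error` is a hypothetical helper name.

```python
import math

def predicted_error(crowd_size, intercept, slope):
    """Error implied by a fit of the form 1/ERROR = intercept + slope * ln(crowd_size)."""
    return 1 / (intercept + slope * math.log(crowd_size))

for size in (1, 10, 100):
    quoted = predicted_error(size, 2.34, 1.8)         # coefficients as quoted in the comment above
    refit = predicted_error(size, 0.00093, 0.00073)   # coefficients from the replication
    print(size, round(quoted, 3), round(refit))
```

With the replicated coefficients a crowd of one comes out at roughly 1075 km and a crowd of ten at roughly 383 km, whereas the quoted coefficients give errors well under one kilometer.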

I'm out of the loop: OP == "overpowered"?

I think the poll's instruction to assume that your first answer was wrong by some 'non trivial amount' is important. It's effectively simulating the addition of new data and telling you to update accordingly. Whether the update is positive will depend on the quality of the new data, which in turn depends on the quality of the first answer!

ie. If my first answer was actually pretty close to reality (mine was; I forget the numbers and the question now but I remember checking after I finished the survey and seeing that I was within 100km of reality), a 'non trivial' update is pretty likely to make your second answer worse, not better. That's quite different to simply 'chuck your guess out and try again'. It also suggests that ACX poll-takers may be relatively good at geography (compared to... pollees whose first guess were more than what they think of as a trivial amount wrong? I don't know what the baseline is here).

Without reading through all the links above it's not clear whether the internal crowds referenced were subject to the same 'non trivial error' second data point. In the casino presumably there was some feedback because they didn't win, but I don't know how much feedback. I'm about to go to bed so I will leave that question to the wisdom of the ACX crowd and check back in the morning.

You gave a handful of examples where we could hypothetically benefit from the wisdom of crowds. But in each case, we *already* leverage the wisdom of crowds, albeit in an informal way.

E.g. my decision of academia vs industry is based not just on a vague personal feeling, but also aggregating the opinions of my friends and mentors, weighted by how much I trust them. True, the result is still a vague feeling, but somewhere under the hood that feeling is being driven by a weighted average of sorts.

I'm not sure there'd be much utility in formalizing and quantifying that--we'd probably only screw it up in the process (as you point out).

I use wisdom of the crowds when I cut wood and I don't have my square: if I need a perpendicular line across the width of a piece, I'll just measure a constant from the nearest edge and draw a dozen or so markings along at that constant. They won't all line up (because I can't measure at a perfect right angle) but I just draw a line through the middle of them and more often than not it's square enough, because I'm off evenly either side of 90°.

With your last point, an important part of this is whether "wisdom of crowds" is a spooky phenomenon that comes from averaging numeric responses, or whether it's an outcome of individuals mostly having uncorrelated erroneous ideas and correlated correct ideas (so that the mistakes get washed out in the averages).

If it's the second, you'd expect that all sorts of informal and non-quantitative ways of aggregating beliefs should also work. If you want to know whether to go to academia or industry, you ask 10 friends for advice and notice if lots of them are all saying the same thing (either in terms of overall recommendation or in terms of making the same points). If you want to build a forecasting model, you can hire 10 smart analysts to work on it together.

Of course, the details matter--if you have people make a decision together, maybe you end up with groupthink because one person dominates the discussion, pulls everyone else to their point of view, and then becomes overconfident about their ideas because they're being echoed by a bunch of other people. If the "consensus information" and "individual errors" in people's thinking are fairly legible, on the other hand, you might do a lot better with discussion and consensus than with averages because people can actually identify and discard their erroneous assumptions by talking to other people.

What happens if you compare people's second guesses against their first? I.e., is the model predicting “thinking longer causes better guesses” excluded by the data?

My intuition is that wisdom of the crowd of one would predict that the second guess shouldn't be consistently better.

The systematic error might be better known as Jaynes's "emperor of china fallacy".

One question I have is whether language models (and NNs in general) can be used to generate very large 'crowds'. They are much better at flexibly roleplaying than we are, can be randomized much more easily, have been shown to be surprisingly good at replicating economics & survey questions in human-like fashions, and this sort of 'inner crowd' is already how several inner-monologue approaches work, particularly the majority-voting (https://gwern.net/doc/ai/nn/transformer/gpt/inner-monologue/index#wang-et-al-2022-section “Self-Consistency Improves Chain of Thought Reasoning in Language Models”, Wang et al 2022).

I use the single-player mode a lot when I'm guessing what something will cost - and I use it on my wife too. I start with two numbers, one obviously too low and one obviously too high. I then ask:

Would it cost more than [low]?

Would it cost less than [high]?

Would it cost more than [low+$10]?

Would it cost less than [high-$10]?

. . . . and so on. You know you're getting close when the hesitance becomes more thoughtful.

I'm sure I'm not the only one who does this, but I believe that many of us do something similar in a less deliberate or structured way. If you've lived in Europe, you probably have a good feel for the scale of the place and of one country relative to the next. You may even have travelled from Paris to Moscow. If you live in North America, you may zoom out and rotate a globe of the Earth in your mind's eye until you reach Europe, and then do some kind of scaling. Estimating by either method will almost certainly give a better result than a WAG most of the time. So your "very wrong" answers weren't necessarily from lizardmen, but were just WAGs rather than thoughtful estimates.

Post gets only a 7/10 enjoyment factor, I still don't know how far apart Paris and Moscow are in surface kilometres and am now forced to have to go look it up. Upon reflection that my personal enjoyment might have been wrong, I've revised my estimate to 5/10 and have now averaged this out to 6/10...or was it ....the square root of 5*7 or 35^(1/2) for an enjoyment of 5.92/10? I don't even know anymore!

I took the instruction to assume that I was off by a significant amount seriously. I decided I thought I was more likely to be greatly underestimating than overestimating, and so took my first estimate and multiplied it by 10. In other words, I really didn’t re-estimate from scratch at all. If this analysis was your intention all along, perhaps explaining your intentions would have gotten people to rethink it in a more straightforward way.

This is both a great example of and a horrible case of the "wisdom of crowds" fallacy in forecasting. In this case you're guessing at something that a large part of the population knows approximately, so a larger sample more reliably gives you a median that is close to the ideal median of the entire population, which will be somewhere in the vicinity of the real thing because there is some decent penetration of the real value into the populace.

In forecasting you're guessing at something that isn't known to a large amount of the population, but the population and ergo your sample will have some basic superstitions on the issue that come mostly from mass media and social media and so even when you get a good measurement of the median, the prediction is still crap because you polled yourself an accurate representation of the superstition and not the real thing.

Say you want to know when Putin will end the Ukraine war - only Putin and a few select individuals know when that will be - if at all and this isn't made up on the go. But everybody will have some wild guesstimate, since newsperson A or blogger B or socialite Z (pun intended) posted some random ass-pull on twitter not necessarily claiming but certainly implying to know when it will happen. This is the result you're gonna get in your poll.

Wisdom of crowds is useless as forecasting and only works when the superstition has some bearing on the issue at hand, i.e. the policy itself is influential on public opinion or there is a strong feedback loop which ensures conformity of what's happening with the emotional state of "the masses". That, mostly, doesn't appear to be the case.

This is something that I've been thinking about in the context of LLMs. Ask an LLM a question once, and you are sampling its distribution. Ask it the question 10 times, consider the mean and variance, and you have a much better sense of the LLM's actual state of knowledge.
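A minimal sketch of that idea in Python, assuming a hypothetical `ask` callable that returns one numeric answer per call. It stands in for whatever LLM client you use and is not part of LMTK; the canned answers are made up so the sketch runs on its own.

```python
from statistics import mean, stdev

def summarize_samples(ask, question, n=10):
    """Ask the same question n times and summarize the answers.

    `ask` is a hypothetical callable (question -> float); swap in a real
    LLM call that samples with nonzero temperature.
    """
    answers = [ask(question) for _ in range(n)]
    return mean(answers), stdev(answers)

# Stand-in sampler with canned answers, only so the sketch is self-contained.
canned = iter([2400.0, 2600.0] * 5)
center, spread = summarize_samples(lambda q: next(canned), "Paris to Moscow in km?")
print(center, round(spread, 1))
```

A small spread suggests the model is fairly settled on its answer; a large one suggests a single sample would tell you little.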

Here is an LMTK script I wrote in Jan which demonstrates this in the context of math problems: https://github.com/veered/lmtk/blob/main/examples/scripts/math_problem.md

I guess Walt Whitman was on to something when he wrote "I contain multitudes"!

Is the data from the study saying that the average guess was many times larger than the actual answer? It seems that that might be part of the reason why you got different error measurements. Guessing geographical distances has a limit on upper bounds in a way that guessing a number of objects doesn't.

Doesn't Caplan's Myth of the Rational Voter deal with how the wisdom of the crowds only works when people aren't systematically biased on the subject in question?

For those who were (like me) confused by what "geometric_mean[absolute_value(geometric_mean<$ANSWERX, $ANSWERY> - 2487)]" is supposed to mean, here's the ChatGPT explanation which makes sense:

This expression calculates the geometric mean of the absolute value of the difference between the geometric mean of two values ($ANSWERX, $ANSWERY) and 2487.

The geometric mean of two values is calculated by multiplying the two values and taking the square root of the result. So the expression "geometric_mean<$ANSWERX, $ANSWERY>" calculates the geometric mean of the two values.

The difference between this geometric mean and 2487 is then taken, and the absolute value of this difference is calculated, ensuring that the result is always positive.

Finally, the geometric mean of this absolute value is calculated, which gives a single value as the final result.
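For anyone who prefers code, here is one way to read that formula, as a sketch. The inner geometric mean combines one respondent's two answers; taking the outer geometric mean across respondents is an assumption about what the formula intends, and the sample answers are made up.

```python
from statistics import geometric_mean

TRUE_DISTANCE_KM = 2487  # the constant that appears in the quoted formula

def inner_crowd_error(pairs, truth=TRUE_DISTANCE_KM):
    """Geometric mean, across respondents, of |GM(first, second) - truth|."""
    per_person = [abs(geometric_mean([x, y]) - truth) for x, y in pairs]
    # Assumes every per-person error is positive (no exact hits).
    return geometric_mean(per_person)

answers = [(2000, 3000), (1000, 4000)]  # hypothetical (first, second) guesses
print(round(inner_crowd_error(answers), 1))
```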

> So is the percent chance that your country would win. If you could cut your error rate by 2/3 by using wisdom of crowds techniques with a crowd of ten, isn’t that really valuable?

Aren't people doing that all the time ? Governmental organizations have committees; large corporations have teams; some of them even hire smaller companies as contractors specifically to answer these types of questions.

re: "What about larger crowds? I found that the crowd of all respondents, ie a 6924 person crowd, got higher error than the 100 person crowd (243 km). This doesn’t seem right to me..."

I'm also suspicious, and would predict this observation will reverse with enough resamples of the 100-person subsets. 6,000 instead of 60 would probably do it? Or maybe some outlier was simply missed.

There is likely a simple proof based on sum-of-squares decompositions that would show the average over all subsets of the 100-person group has higher error

This looks like much ado about nothing.

Part of the crowd guesses high, the other part guesses low.

Nobody is right, but it averages out closer to right.

'Nuff said.

I think the spookiness of "inner crowds" improving your answers mostly comes from an intuition that whatever you were doing originally can be approximated as being an ideal reasoner. An ideal reasoner shouldn't be able to improve their answers by making multiple guesses.

But humans are often pretty far from being ideal reasoners. If this works, I see that more as an indictment of how bad humans are at numerical estimates, rather than a spooky oracle.

(Though this doesn't prevent it from being useful...)

One hypothetical mechanism for why it works is that it forces the forecaster to make an estimate of their uncertainty and take a second draw from the implied distribution. It’s similar to when someone wants to “sleep on it”, even though they aren’t going to get any new information. They are just going to think about the worst case (and maybe best case) and get a second draw after thinking more about the distribution of results

> I think the answer is something like: you can only use wisdom of crowds on numerical estimates, very few people (currently) make those decisions numerically, and the cost of making those decisions numerically is higher (for most people) than the benefit of using wisdom of crowds on them.

Actually, I think you're wrong on this one: wisdom of the crowds really is OP and we're severely under-using it.

An example that immediately comes to mind is pair programming: by having two people work on the same code simultaneously, you can immensely increase their productivity. Every time I've tried it, I had positive results, and yet most companies are *very* hostile to the idea.

The part about getting diminishing returns as you add more people is interesting too. I wonder if you could drastically reduce design-by-committee problems in an organization by making sure all committees involved have at most three or four people in them.

Maybe a non-spooky explanation is that when we do not know the exact answer to a question, we instead have a distribution of possible answers. When you force someone to collapse that wave function down to a single scalar measurement, they will randomly pick one possible answer as per the probability distribution. But you have lost all the rest of the information contained in the distribution. When you ask again, you make a 2nd sampling from the distribution, which adds precision. Note that if you keep asking you will get more points, but inevitably you still lose information.

Example: I might know that there is either $100 or $200 in my bank account because I don't know if a check cleared yet. If you force me to pick a single value I'll pick either at random. Ask me twice and there's a 50% chance I'll pick the other. By your way of measuring it looks like I don't know much, when in fact I have complete information less a single bit.

Similar argument works for crowds as well.

I also made an analysis of the inner crowd on the same survey question, using different statistics.

https://astralcodexten.substack.com/p/acx-survey-results-2022/comment/12089011

tl;dr: I got similar results as Scott: the inner crowd helps a bit, but not too much. Strangely, the second estimate was much worse than the first. Some speculated that this was due to Scott's phrasing "off by a non-trivial amount" in the second question, but the same effect (worse second estimate) was also in the literature, where probably they didn't have such a phrasing. (But my source was much less sophisticated than Scott's VD and VDA paper.)

Highlight numbers, GM stands for "geometric mean":

- The first estimate was off by a factor 1.815. (This means that the GM of all those factors was 1.815)

- The second estimate was off by a factor 1.901.

- The GM of the two estimates was off by a factor 1.791.

- How often was the first estimate better than the second: in 53.3% of the cases.

- How often was the GM better than the first estimate: in 52.8% of the cases.

- How often was the GM better than the second estimate: in 60.0% of the cases.
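For readers who want to reproduce this kind of statistic, here is a rough Python sketch of one way to compute "off by a factor" numbers. It is a guess at the method, not the actual analysis script; the guesses are made up and 2487 km is the true distance used in the post's formula.

```python
import math

TRUE_DISTANCE_KM = 2487

def factor_off(guess, truth=TRUE_DISTANCE_KM):
    """Ratio between guess and truth, flipped if needed so it is always >= 1."""
    return max(guess / truth, truth / guess)

def gm_factor_off(guesses, truth=TRUE_DISTANCE_KM):
    """Geometric mean of the per-guess factors."""
    logs = [math.log(factor_off(g, truth)) for g in guesses]
    return math.exp(sum(logs) / len(logs))

first, second = 1243.5, 4974.0        # made-up guesses: half and double the truth
combined = math.sqrt(first * second)  # geometric mean of the two guesses
print(factor_off(first), factor_off(second), factor_off(combined))
```

This example is deliberately the best case for the inner crowd: the two guesses miss by the same factor in opposite directions, so their geometric mean lands on the true value.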

When asked to guess a number, my mental process is to first find a range, then pick (somewhat arbitrarily, honestly) within that range. I suspect that repeatedly sampling the same person is just a rough, inefficient way to find their range estimate.

I suggest trying a similar question but asking for the 70th percentile upper and lower bound on the distance (with another question asking if the person knows what that means as a filter).

I can think of a couple of mundane reasons this is probably correct.

In one case, you *sorta* know the answer and thus can make a guess about how to improve your first guess. On the Moscow question, I knew Moscow was probably farther away than it seemed and thus if my first answer was wrong, it was likely I had guessed too low rather than too high. My second guess was higher and closer to the real answer and thus my average was better.

In some other case you have 0 idea at all. I have 0 idea how far some specific exoplanet is from Earth. Thus I'm likely to make wild guesses that cover my bases. Uh, 5 light years? Uh, 1500 light years? Almost assuredly the average here will be better even if I have 0 idea.

What if you tried bootstrapping the larger groups of individuals (i.e. sample with replacement)? I’m on vacation or I’d do it myself but I’d be curious on if that improves the error
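A sketch of what that bootstrap could look like in Python. The survey data is not included here, so the guesses are a synthetic stand-in (log-normal around 2487 km), and averaging the absolute errors arithmetically is a simplifying assumption rather than the post's exact metric.

```python
import math
import random
from statistics import geometric_mean, mean

def bootstrap_crowd_error(guesses, truth, crowd_size, n_resamples=500, seed=0):
    """Average absolute error of the geometric-mean guess of resampled crowds."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_resamples):
        crowd = rng.choices(guesses, k=crowd_size)  # sample with replacement
        errors.append(abs(geometric_mean(crowd) - truth))
    return mean(errors)

# Synthetic stand-in for the survey answers (an assumption, not real data).
data_rng = random.Random(1)
guesses = [data_rng.lognormvariate(math.log(2487), 0.5) for _ in range(2000)]

for size in (1, 10, 100):
    print(size, round(bootstrap_crowd_error(guesses, 2487, size), 1))
```

Because each crowd is drawn with replacement, the crowd size can even exceed the number of respondents, which is one way to probe the odd result for the full 6924-person crowd.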

I think you raise a really interesting question.

This suggested to me that the 'internal crowd' was almost entirely worthless. "P < 0.001!" Yes, but magnitude <2% improvement? I have low confidence in a result like this one (even with a great p-value!) that purports to demonstrate a method for 1.5% improvement in guessing accuracy.

IIRC the reasoning for *why* the (outer) wisdom of crowds works, is that the crowd contains a few experts who will be biased in favor of the correct answer... while everyone else errs randomly above or below the correct answer. So there was no inner wisdom of crowds in this version.

“Estimate the number of balls in this jar” and “Estimate the distance between Paris and Moscow” seem like qualitatively very different tasks to me.

Estimating the balls in the jar seems like a visual reasoning task, whereas estimating the distance seems like a preexisting knowledge task.

I didn’t know where Moscow is within Russia. I didn’t know how many countries were between France and Russia. I didn’t remember whether a kilometer was bigger or smaller than a mile. And I didn’t know any reference large distances to use for comparison except that the radius of the earth is 4000 mi. Therefore there were so many inferential steps in my distance guesses wherein to introduce additional error; as compared to my guess about balls in a jar, which seems to just be testing my skill at 1 thing.

I remember unfortunately ruining my results for this by immediately looking up the answer after putting in my guess for the first question (since I didn't know there was going to be a second).

Hi Scott; it's the inverse-square-root. The standard error of an estimate declines as a function of 1 / sqrt(n) for sample size n (because the variance declines with 1/n).

If the estimates are biased, the root-mean-square error is going to be sqrt(bias^2 + (variance / n)) for sample size n, i.e. the mean squared error will decline hyperbolically. This isn't something the study found; it's a mathematically-derived formula, which they then fit to the data to get estimates for bias^2 and variance. Because estimates taken from 1 person are going to be substantially biased, the error will never reach 0; it asymptotes out very quickly. The average of many people is going to be much less biased, such that the variance probably dominates.
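To make the shape of that curve concrete, here is a small Python sketch of the formula above. The bias and variance values are made up for illustration and are not fitted to any data.

```python
import math

def rms_error(bias, variance, n):
    """RMS error of the average of n estimates that share a bias and have independent noise."""
    return math.sqrt(bias ** 2 + variance / n)

bias, variance = 200.0, 1000.0 ** 2  # hypothetical values, in km and km^2
for n in (1, 2, 10, 100, 10_000):
    print(n, round(rms_error(bias, variance, n), 1))
```

With zero bias this reduces to the plain 1 / sqrt(n) decline; with nonzero bias the error levels off at the bias no matter how large n gets.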

I probably produced two of the very far outliers because of being very bad at geography and spatial reasoning generally. I think I put down a guess that was an order of magnitude wrong, and then, being told by the second question to answer as though my first was wrong, changed my answer by an order of magnitude in the wrong direction. I don't know if this information is helpful to anybody; but some of us don't realize we're being lizardmen because we have no idea how to meaningfully connect the ideas "kilometer" "Paris" and "Moscow". 1,000 km seems as reasonable to me as 200,000 km.

I'd be curious to see if those with dissociative identity disorder (or those who self-identify as systems, since that's probably more common than an official diagnosis) are better than the rest of us at this internal wisdom of the crowds.

proposal for improvement:

- right before asking the first time, ask people to provide the last 3 digits of their zip code, or any other essentially random number

- preface the second question with an explanation of anchoring and ask people to provide a new estimate without referring to their previous one.

Benefits:

- providing a plausible reason for people to give a new estimate without inducing too much distortion

- measuring how much anchoring affects ACX readers

- measuring how much "inner crowd" can counteract anchoring.

This hits on why I don’t see fast AI takeoff being a thing. GPT is wisdom of the crowds. A bunch of text is averaged together and gets you an answer that is directionally correct (as far as text completion goes) but is only going to asymptotically approach reality.

To “know” facts you need a different methodology, that is essentially brute force. How do you know the distance? You looked it up from a reputable source, which is reputable thanks to a reputation that took thousands to millions of person-hours to cultivate, and on top of that someone had to actually physically go and measure (or just wait until we launch satellites that account for general relativity into space and compute it from their data.)

Wisdom of the crowd works because it is actually very very hard to obtain real knowledge, but we think it is easy because we have a superficial experience of “knowing” many different things. Averaging a bunch of estimates allows more real knowledge to contribute.

All this gives me a low prior on AI takeoff even being a thing. We will burn out on modelling existing human knowledge and then begin the hard work of developing machines that can do the hard and painstaking work of actually gaining new knowledge. It will not be fast because knowing things is really a lot of work. Those 10^46 simulated humans will probably get bored and want to do something easier.

Only now do I actually look up the distance from Paris to Moscow, and holy cow I was almost right on the money. My first guess was 2500 km

My initial impulse is to ask for control! What happens if you pick a random number in a given range to guess (say, for Paris to Moscow the range would be something like 50 to 50,000 km, and yes I know that no two points on earth's surface are separated by more than 20,000 km, but some of your readers might not know it), then take a random distribution on the log scale, then pick two random samples? Would the "wisdom of crowds" effect be random chance?
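This control is easy to simulate. Below is a rough Python sketch under the assumptions stated in the comment (log-uniform "guesses" between 50 and 50,000 km); measuring error in log space and using 2487 km as the true distance are choices made for this sketch.

```python
import math
import random

TRUE_DISTANCE_KM = 2487
LOW_KM, HIGH_KM = 50, 50_000

def log_uniform(rng):
    """One 'guess' drawn uniformly on a log scale between LOW_KM and HIGH_KM."""
    return math.exp(rng.uniform(math.log(LOW_KM), math.log(HIGH_KM)))

def log_error(guess):
    """Absolute error in log space, i.e. the log of the factor the guess is off by."""
    return abs(math.log(guess / TRUE_DISTANCE_KM))

def run_control(n_pairs=20_000, seed=0):
    """Mean log error of a single random guess vs the geometric mean of a random pair."""
    rng = random.Random(seed)
    single_total = pair_total = 0.0
    for _ in range(n_pairs):
        a, b = log_uniform(rng), log_uniform(rng)
        single_total += log_error(a)
        pair_total += log_error(math.sqrt(a * b))  # geometric mean of the pair
    return single_total / n_pairs, pair_total / n_pairs

single, paired = run_control()
print(round(single, 2), round(paired, 2))
```

Under these assumptions the pairs come out with a smaller error even though nobody knows anything, because averaging pulls guesses toward the middle of the range and the true distance is not far from that middle. How large the effect is depends on where the truth sits inside the chosen range.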

I've also found the "wisdom of the random duo" effect in my research (https://braff.co/advice/f/forecasting-masterclass-7-find-the-martha-to-your-snoop). I wonder if you or I could simulate the inner crowd by looking at forecasts on props that are highly correlated within the same contest? You have a bunch of Ukraine props where the average-across-3-props for a given forecaster may be a more accurate read on the whole battle than any one forecast?

I also have poor intuition about this problem. However, when I got to the second question about the distance from Paris to Moscow, essentially asking me if I wanted to change my first guess, Monty Hall came immediately to mind.

Is there a logical comparison between this and the Monty Hall problem? Did anyone else think this? Should I look up some old Marilyn vos Savant posts?

If people's first answer was generally closer than their second answer, then that means it'd probably be best to take a weighted average that puts more weight on the first answer than the second.

Some examples of how I use wisdom of crowds:

In games, like Codenames or Wavelength, people on my team independently come up with their guesses before we share and discuss them with each other.

In forecasting, I consider what range of forecasts I might plausibly make and average them. I also make multiple forecasts using different methods (e.g. using two different relevant reference classes) and then average them, to make use of information from independent sources. I also consult others' forecasts on the question when available to aggregate their views.

In general, when a group is collaboratively seeking the truth on a topic or trying to make a decision, I encourage giving everyone time to think of their own independent impression before having individuals share their view.

Not sure what an "inner crowd" is.

Just came here to say when I answered the distance question in the survey, I was SO off. I had no concept of the size of the earth so no idea what a reasonable distance would be. I can't remember now in which direction, but I was off by a whole order of magnitude. So yeah, probably one of the outliers. Just to put it out there that we're not all lizardmen, some of us just don't have a good model of these distances.

I find this really bizarre. I thought the basis for the wisdom of crowds was Condorcet's Jury Theorem: assume (Independence) that individual voters have independent probabilities of voting for the correct alternative. Also assume (Competence) that these probabilities exceed ½ for each voter. It follows that as the size of the group of voters increases, the probability of a correct majority increases and tends to one (infallibility) in the limit. Suppose the number of voters = 1. While the single voter could make multiple guesses, how would that not violate the independence condition?
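For reference, the theorem's claim can be checked numerically. This Python sketch computes the probability of a correct majority under exactly the two assumptions named above (Independence and Competence); the value p = 0.6 is an arbitrary illustration.

```python
from math import comb

def majority_correct(n, p):
    """Probability that a strict majority of n independent voters is right.

    Assumes n is odd and each voter is correct with the same probability p.
    """
    need = n // 2 + 1
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(need, n + 1))

for n in (1, 3, 11, 101):
    print(n, round(majority_correct(n, 0.6), 3))
```

The function takes independence for granted, which is precisely the assumption in doubt when one person supplies every guess.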

Is it possible that since most of your readers are American, they had some idea in miles, and many just gave that same guess in km due to unfamiliarity with the conversion? The mean guess in km and changing the units to miles would be a lot closer to the true answer.

Wisdom of the crowds is like ensemble learning for humans. Or maybe ensemble learning is wisdom of the crowds for machine learning models.

Thinking of the mechanism behind the "crowd of one" effect. At first I thought it's a variant of the Monty Hall effect - first guess under complete uncertainty, second guess somewhere else in the spectrum, with some uncertainty removed. But more likely it is, combining sources of incomplete information. People will have different hypotheses or heuristics in mind to make a guess. They will only use one heuristic for the first guess. They will use a different one for the second guess. So now there is more information present than with a single guess. Example, if a person is completely uncertain about Paris-Moscow, they first might use the heuristic of "Russia is huge", then the heuristic "but Europe is small". The average of both biases produces a better result

I've most recently used this trick to estimate how much wine I need for an event with a few dozen people.

I came up with an estimate using different mental models:

- One where everyone is thirsty and drinking wine (upper bound)

- One where people are hardly drinking any wine (lower bound)

- 2-3 more best guesses using different formulas

The numbers came out as: Upper > best guesses > lower.

So I felt pretty confident about how many cases of wine to order by averaging the best guesses and then adding some.

>>> What about in finance, where people often make numerical estimates (eg what a stock will be worth a year from now)? Maybe they have advanced models calculating that, and averaging their advanced models with worse models or people’s vague impressions would be worse than just trusting their most advanced model, in a way that’s not true of an individual trusting their first best guess?

In fact this is standard practice in finance and most other ML applications, see https://en.wikipedia.org/wiki/Ensemble_learning , and is known to be one of the few methods systematically resulting in better predictions (another is increasing the dataset size). Multiple different models are typically created using different sources of information, underlying architectures, training techniques, etc, which are then "averaged" to make the final predictions. The models are usually as advanced as possible (i.e. they are a crowd of experts), and the averaging is typically also learned (i.e. instead of choosing between arithmetic and geometric means, you would learn the actual ensembling function to better account for each of the model's biases, ideally making use of their self-reported uncertainty). I doubt there's any big financial trading firm that does not have this in place, including the presence of multiple teams that do not communicate with each other working on various models for the same purpose, each of them without access to the other models or final ensemble.

I have heard 'Wisdom of the crowds' described very differently, when you get large groups of people a small number will have specialized knowledge of the question, and a larger number will have general knowledge. If the wildly ignorant are simply guessing then their errors will frequently (but not always) cancel each other out and what you are left with pushing the data are the experts. You aren't averaging a bunch of guesses, you are asking enough people to find someone who knows the answer and then averaging out all the bad guesses.

Perhaps quite tangential, but this has me thinking about how criticism and ratings can help us use “the wisdom of crowds” to predict things that aren’t objective in any real sense.

For example a movie’s quality is pretty arguably entirely subjective, and whether any one person will like a given film is hard to predict, but we all commonly use the wisdom of crowds to estimate a film’s quality and help us predict if it will be worth our time or not.

Each person who rates a film is “guessing” the film’s objective quality, since no one person actually gets to claim that objective perspective. But if we add up enough subjective guesses, we can kind of approximate some kind of “objective” value.

I think there are probably a lot of ways we use a kind of vague sense of what “the wisdom of the crowd” is about certain issues to help us make judgment calls.

>What about in finance, where people often make numerical estimates (eg what a stock will be worth a year from now)?

Isn't the price of the stock already, in a meaningful way, the wisdom-of-crowds answer to something like that question?

> This looks like some specific elegant curve, but which one? A real statistician would be able to give a good answer to this question.

Under the simplest hypotheses, it should be the sum in quadrature (i.e., a ⊕ b = √(a² + b²)) of a s.c. "statistical uncertainty" proportional to 1/√n and a s.c. "systematic uncertainty" which stays constant.

Regarding the answers for the Paris - Moscow distance. I think it's hilarious how you were surprised at some very wrong answers and assumed the reason is lizardmen/trolls. You're really just underestimating how bad some people are regarding distances and geography. I tried hard to give a good estimate but ended up with what is essentially a random number that could have been the distance to the moon for all I know.

Note that the error can never go completely to zero for the infinite crowd. There should be a lower bound on persistent error set merely by the resolution of typical maps - plus an additional contribution from people's natural tendency to round large numbers. Sorry if this comes across as too pedantic, but I think generally these limits set by resolution are interesting and often neglected!

I think on non-numerical things, we already instinctively use wisdom of the crowds. You feel vaguely positive about academia *because* you’ve heard people say more good stuff than bad stuff about academia. Our brains are very good at subconsciously "averaging" status signals, perceived utils, etc, but not so good at averaging actual numbers, so it’s only once we start putting numbers on things that we have to remember to do the averaging step explicitly.

(This is Eric; I helped run the 2022 forecasting contest.)

I've thought a lot about this -- indeed, the first paper I wrote in grad school can be summarized as "the wisdom of crowds is a *mathematical* fact" (if you aggregate forecasters in a way that accords with how you score them). I'm planning to write a blog post about this, but let me briefly illustrate what's going on in this comment.

Suppose you put 100 candies in a jar and ask people to estimate how many candies there are. You're then going to score each person based on how far off they were, and compare two quantities: the average of everyone's scores, versus the score of the average of all the estimates (the latter is the wisdom of the crowd).

We're gonna score each participant based on the *square* of the distance to the right answer. (Why the square? Briefly, this choice incentivizes each participant to truthfully report how many candies they expect are in the jar.)

Let's say that the estimates are 90, 100, 110, 120, and 130: so, the participants disagree with each other but are also somewhat biased upward.

From first to last, the (squared) errors of the five participants are 100, 0, 100, 400, and 900, for an average of 300. By contrast, the average of all five estimates is 110, which is only off by 100.

In fact, it is *always* the case that the second number (error of the average) will be no larger than the first (average error), no matter which numbers I chose for my example. An intuition you could have is that the first number is equal to the second number, *plus noise*, where the "noise" is the variance in the participants' estimates. (Check it out: the (population) variance of {90, 100, 110, 120, 130} is 200, which is equal to 300 - 100!)

(Feel free to skip this aside, but: what's the math behind this? Briefly, let X be the random variable equal to the signed error of a randomly chosen expert -- so in our example, X would take on the values -10, 0, 10, 20, and 30 with equal probability. Then the average error is E[X^2], whereas the error of the average estimate is E[X]^2. The former quantity is larger, and the difference is E[X^2] - E[X]^2, which is the variance of X.)
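The numbers in this example are easy to verify. Here is a short Python check using the five estimates and the true count of 100 from the comment; the variable names are just for illustration.

```python
from statistics import mean, pvariance

truth = 100
estimates = [90, 100, 110, 120, 130]

# Average of each participant's squared error: E[X^2].
average_error = mean([(e - truth) ** 2 for e in estimates])

# Squared error of the crowd's average estimate: E[X]^2.
error_of_average = (mean(estimates) - truth) ** 2

# Population variance of the estimates, the "noise" term.
spread = pvariance(estimates)

print(average_error, error_of_average, spread)
print(average_error - error_of_average == spread)
```

Shifting every estimate by the same amount changes the first two numbers but leaves their difference fixed at the variance.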

The math here is sensitive to the fact that I chose squared error (and to the fact that I chose to aggregate estimates by averaging them). If -- as Scott did -- you take the *absolute value* of error instead of the squared error, it's no longer *mathematically* true. However, I would bet that it's empirically true a large fraction of the time. That's because if some participants underestimate the quantity and others overestimate it, they both count positively toward the average error, but they *cancel each other out* when you look at the average.

As for whether your error will go to zero as the crowd size goes to infinity: no. This is only true under a really strong assumption, which is that the crowd is *unbiased*. So for example, if in my example you have a huge crowd but they're systematically biased so their estimates are centered at 110 instead of 100, then in the limit of an infinite crowd you're still going to be off (your average will be 110).
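A rough way to see the floor that bias puts on crowd accuracy: assume (and this is an assumption, not something measured here) that each participant's error is independent, with a shared bias and a shared standard deviation. Then the expected squared error of the crowd average is the squared bias plus the variance divided by the crowd size. The function name and the numbers below are illustrative, reusing bias 10 and variance 200 from the jar example.

```python
def expected_squared_error_of_average(bias, sigma, crowd_size):
    """Expected squared error of the mean of `crowd_size` independent
    estimates that share a common bias and standard deviation sigma."""
    return bias ** 2 + sigma ** 2 / crowd_size

bias = 10            # crowd is centered at 110 instead of 100
sigma = 200 ** 0.5   # spread matching the variance of 200 above

for n in (1, 10, 100, 10_000, 1_000_000):
    print(n, round(expected_squared_error_of_average(bias, sigma, n), 3))
```

The variance term shrinks toward zero as the crowd grows, but the squared-bias term (100 here) never goes away.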

And -- last point -- regarding making multiple estimates on your own and averaging them: it's definitely an interesting frame, but I'd say that you've reinvented the art of *thinking longer about the problem* :)

Here's what I mean: suppose you're weighing going into grad school versus getting a tech job. You think for a while, and you realize: "I'll be 9/10 happy with my pay at the tech job, but only 5/10 happy with my grad school pay." Then you think longer and realize: "I'll be 8/10 happy with the sorts of problems I'll be thinking about in grad school, but only 6/10 happy with the sorts of problems I'll be thinking about in tech." Then you think longer and realize: "the weather at the grad school I'm considering is 3/10, while the weather in the Bay Area tech job is 9/10". And so on. If you wanted to, you could think of each of these things (pay; intellectual interestingness; weather) as separate estimates. And then you can be like "wow, my decision will be more accurate if I average all my estimates together than if I make my decision based on a single factor!" -- I think that's basically all that's going on with the "wisdom of the crowds" here.

Stupid question: hasn't this topic been done to death in statistics? I'm not an expert, but from what I remember, yes, you can combine lots of inaccurate predictors into a more accurate predictor - provided the individual predictors are unbiased, i.e., they don't systematically over- or underestimate.

My gut feeling is that this is the hard part - finding a dozen people knowledgeable enough to give a meaningful estimate is doable. Finding a dozen people who are not all influenced by the same sources of information to be overly optimistic or pessimistic is the hard part, and if you don't, you converge with great confidence on an inaccurate answer.

Edit: should have read Unexpected Values' answer above before I wrote this...

There is a UK quiz show called "Who Wants to be a Millionaire" in which an individual is selected from a dozen or so competitors by their correctness and speed in answering a preliminary question, such as "Put such and such into alphabetical order", and is then asked a series of questions by the host. Each question has four possible answers, shown to the contestant, and one of these is the correct answer.

The contestant starts with three so-called "lifelines", which they can use once each for any question whose answer they are unsure of or don't know: "50 50" (which halves the number of alternative answers), "Phone a friend", and "Ask the Audience".

The "Ask the Audience" lifeline is the most relevant to this discussion. When it is invoked, each audience member selects on a key pad the answer they know or guess is correct, and the contestant is then shown a bar chart of the percentage of selections of each remaining possible answer.

For a commonly known answer, to a question relating to sport or soap operas for example, the "Ask the audience" lifeline is usually fairly conclusive, and one of their choices obviously predominates and turns out to be the correct answer. But sometimes a majority, occasionally spectacularly so, chooses the wrong answer!

It is interesting to speculate why so many people would choose the same wrong answer, presumably guessed. From my observation of several examples, the main reason for this is that they are biased toward a name they have heard of, or association familiar to them, among others they have not.

I have also observed that another source of audience bias obviously occurs when a contestant is unsure of the answer and the host asks them, before they commit to a choice, which answer they think is correct. It seems very foolish for a contestant to divulge a guess in that situation and then go on to use the "Ask the Audience" lifeline, as they will have influenced equally unsure audience members in advance, but many do!

I just asked my partner: First answer was 2000 km. Second answer was 1500 km.

In that case the error got bigger. Could there be a failure mode for this technique, that while on average it may make you more correct, there are doom cases where it makes you catastrophically more wrong?

>Since we only have one datapoint for the n = 6924 crowd size, it’s not significant and we should throw it out.

I have no background in statistics, but that seems wrong to me. Is that the only rationale for throwing it out? On average, it should be at least as good a crowd as any other, and according to the theory (larger crowd = better), it should be the most representative of the average participant's wisdom.

Also, how did you come up with the crowd size of 100 for doing the analysis? If you tried different crowd sizes, were the results different on average? Did you try a sample of random crowd sizes?

This is similar to something I sometimes have to do for my job. We need to get estimates from experts on quantities of interest. There are various techniques you can employ to get them to give unbiased answers. The simplest and most useful one is after you ask for their best guess you ask "is it more likely that the true answer is above or below your guess?"

It's a technique I employ in my own decision making too, and from the comments it seems that lots of other people do.

> As mentioned above, the average respondent was off by 918 km on their first guess. They were off by 967 km on their second guess.

Was the second guess (on average) higher than the first? Estimating a distance has this asymmetry where there are a finite number of ways to undershoot, but no limit to how far you can overshoot.

If so, maybe the you-are-the-crowd hypothesis has a better shot at holding true in something like betting the point differential in a game?

You should check out 'Noise' by Kahneman, Sibony and Sunstein. It's a whole book about this stuff. They discuss lots of experiments on the wisdom of crowds, including a crowd of 1. Especially interesting is when they address real world applications - in sentencing, insurance, executive search and more.

Isn't the wisdom of the crowd sort of the whole idea of democracy?

Assuming everyone makes some kind of internal estimate of how good/bad each candidate's policies are, a fair election should spit out the best option according to the median estimate. It's a lossy compression - we lose the numbers themselves and skip straight to the decision, and I'm not sure how well crowd wisdom works with the median, but I think our systems do *try* to apply this principle more than we give them credit for.

Love the math! I was a stat addict in college so this post struck the right chord.

Writing a diary is the traditional way to access the wisdom of the crowd of your past selves.

(Though I admit I have never seen a diary that would end every entry with: "My today's estimate of the distance from Paris to Moscow is 1234 km.")

I'm a bit late to this, and I haven't read all the comments, so it's possible someone else mentioned this, but it seems like "wisdom of crowds" becomes less useful for highly subjective future predictions, as opposed to estimates of objective concrete facts that are not influenced by decisions made after the estimates. If wisdom-of-crowds predictions are made about things over which our decisions have influence, the mere knowledge of the wisdom-of-crowds "answer" to a question influences our future decisions, rendering the prediction unreliable, because learning the prediction changes the likelihood of the outcome (making it either more or less likely).

I don't know if anybody else did this, but when I guessed the second time, I imagined how I would guess if I knew I was off significantly with my first guess. So it wasn't a clean guess. I guessed 5 or 10 times my first guess because I was imagining how I'd react if someone told me my first guess was way off... If that makes sense.

1. Use and share a histogram.

2. Mean and standard deviation

3. Geometric mean vs. arithmetic mean - why are you fooling around with this? Is this purposeful obscurantism?

4. parrhesia - start over and rewrite this essay.

My first guess, unburdened by the thought process, was 1500 miles, which turns out to be within 72 km of the right answer. I surprised myself. For my second guess, my reasoning was: well, if I'm off by a non-trivial amount....

So, maybe the second question should be just "guess again" which is closer to how a crowd works

You say the right answer is 2486, but then use 2487 in all the calculations.

I've taken a quick gander at the survey results, and I think you might have ballsed this one up.

There's a problem with the first question, in that there are two possible answers: by road (2834), or by air (2486). That's a difference of ~350km or about 12% before you start.

As you used "non-trivial amount" in the second question, there's a spot of priming/framing going on, such that the second answer can be reasonably expected to be further away from the first than would otherwise be the case.

Anyway, off for a spot of fun slicing and dicing.

Isn’t wisdom of the crowds in everyday issues a big part of asking a friend for advice? This seems pretty self-evident to me.

I'm a little mystified that you assert nobody thinks about the wisdom of the crowds, either inner or outer, in their ordinary lives.

In my world, asking people who you know (and sometimes even that you don't know, like in a blog comment section) for their thoughts before you make an important decision -- consulting the "outer" crowd -- is ubiquitous. I can't think of anyone who *doesn't* do this. Similarly, "sleeping on it" or "not making decisions hastily," which amounts to asking the inner crowd (i.e. "re-evaluate this estimate again after some time has passed") is also ubiquitous.

We use wisdom of crowds all the time, and have for thousands of years, without a numerical component. A king's advisors are literally a core example, but extend to something like the Cabinet in the US, or a board of directors at a large company. If I'm thinking of making an important life decision, I may check in with my spouse, my sister, my best friend, my pastor, my financial advisor, and whoever else. All are using a type of wisdom-of-crowds.

Do you not see those as the same for some reason?

I already do this when I make estimates, and I think many other people (less than 1 in ten, but at least 1% or so) do too!

Specifically, when I am making a you-only-get-one-guess kind of guess, and it's important that I'm maximally precise (such as when the best guess, of hundreds, wins a prize, but there's no prize for being almost as close), I start by asking what number I'd throw out. I just cough something out via whatever estimation tool pops to mind. Then I try to identify, assuming that's *wrong*, which *way* it's wrong--meaning take a second guess, with a different estimation tool, or a more careful use of the first tool. Then a third. And etc. I'll also put error bars on my guesses' estimation tools (e.g. an estimate arrived at via multiplying four numbers with plus-or-minus 50% has a much bigger error range than an estimate arrived at via adding four numbers with plus-or-minus 50% error bars.)
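The point about error bars on estimation tools can be made concrete with a worst-case calculation. This is only a sketch: the four inputs of 10 are hypothetical placeholders, all quantities are assumed positive, and the ranges are worst-case bounds rather than a proper probabilistic treatment.

```python
from math import prod

def product_range(factors, rel_err):
    """Worst-case (low, high) of a product, as multiples of the nominal value,
    when every positive factor may be off by +/- rel_err."""
    nominal = prod(factors)
    low = prod(f * (1 - rel_err) for f in factors)
    high = prod(f * (1 + rel_err) for f in factors)
    return low / nominal, high / nominal

def sum_range(terms, rel_err):
    """Worst-case (low, high) of a sum, as multiples of the nominal value,
    when every positive term may be off by +/- rel_err."""
    nominal = sum(terms)
    low = sum(t * (1 - rel_err) for t in terms)
    high = sum(t * (1 + rel_err) for t in terms)
    return low / nominal, high / nominal

inputs = [10, 10, 10, 10]  # hypothetical nominal values
print("product:", product_range(inputs, 0.5))
print("sum:    ", sum_range(inputs, 0.5))
```

With four inputs at plus-or-minus 50%, the sum stays within 0.5x to 1.5x of its nominal value, while the product can range from about 0.06x to about 5x.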

I think when the stakes are high, people already do this. When the stakes are low, mostly they don't--so the "this is OP" will show up in survey questions, but not in real life (or less so) for stuff that matters.

There are 6,379 questionnaires where the question was answered both times. 6,076 had both answers between 500 and 20,000 km. I am using the 6,076 as my population for the analysis. I binned the answers into increments of 250 (e.g. Bin 2,500 would contain the count of all guesses from 2,251 to 2,500) in order to remove odd cutoffs: since most answers were round numbers, I didn’t want a bin ending in a 0 or a 9 to change my results much.

For the 2 closest bins (2,500 and 2,750, which would account for guesses between 2,251 and 2,750 km), there were the following number of guesses in that range:

Guess 1s: 579

Guess 2s: 626

Averages: 914

Of those that originally guessed in that range, 227 (39%) had Averages in those 2 bins.

When you include the next 2 bins (2,001-3,000 range) the effect gets smaller

Guess 1s: 1,485

Guess 2s: 1,583

Averages: 1,627

When you include the next 2 bins (1,751-3,250 range) the average gets worse

Guess 1s: 2,424

Guess 2s: 2,375

Averages: 2,185

870 had their Guess 1 in the 2,000 bin, of those, 444 had a better average (Between 2,250 and 2,750). 102 had higher averages that were just as bad or worse (>=3,000) and 320 had lower averages that were just as bad or worse. (<=2000)

817 had their Guess 1 in the 3,000 bin, of those, 311 had a better average (Between 2,500 and 2,750).

456 had higher averages that were just as bad or worse (>=3,000) and 50 had lower averages that were just as bad or worse (<=2,250)

The table below contains most of the data: 5,357 of the observations. The first column is the bin their Guess 1 was in, the second column is the count of guesses, the third column is the number of averages that were better than the Guess 1, and the fourth column is just column 3 as a percentage.

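| Bin | Count | Improvements | % Improvements |
|------|-------|--------------|----------------|
| 500 | 66 | 62 | 94% |
| 750 | 80 | 48 | 60% |
| 1000 | 533 | 314 | 59% |
| 1250 | 182 | 118 | 65% |
| 1500 | 412 | 277 | 67% |
| 1750 | 146 | 88 | 60% |
| 2000 | 870 | 444 | 51% |
| 2250 | 89 | 39 | 44% |
| 2500 | 526 | 0 | 0% |
| 2750 | 53 | 0 | 0% |
| 3000 | 817 | 311 | 38% |
| 3250 | 69 | 28 | 41% |
| 3500 | 276 | 135 | 49% |
| 3750 | 30 | 13 | 43% |
| 4000 | 521 | 261 | 50% |
| 4250 | 25 | 14 | 56% |
| 4500 | 139 | 59 | 42% |
| 4750 | 10 | 5 | 50% |
| 5000 | 513 | 264 | 51% |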

Overall, it looks like those who initially guessed low had their average improve their score while those who initially guessed high did not. I am not sure what to take away from all of this, besides it’s not obvious that an individual guessing is a good way to go about increasing accuracy. There may be an effect where the best original guessers could do even better by multiple attempts.
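To put a number on the low-versus-high pattern, the bin counts listed above can be pooled. This is a small sketch that copies the rows as given; treating bins below 2,500 as "low" and bins above 2,750 as "high" is my own reading of the split, not something stated explicitly above.

```python
# (bin, count, improvements) copied from the rows above
rows = [
    (500, 66, 62), (750, 80, 48), (1000, 533, 314), (1250, 182, 118),
    (1500, 412, 277), (1750, 146, 88), (2000, 870, 444), (2250, 89, 39),
    (2500, 526, 0), (2750, 53, 0), (3000, 817, 311), (3250, 69, 28),
    (3500, 276, 135), (3750, 30, 13), (4000, 521, 261), (4250, 25, 14),
    (4500, 139, 59), (4750, 10, 5), (5000, 513, 264),
]

def improvement_rate(selected):
    """Share of first guesses in the selected bins whose average was better."""
    total = sum(count for _, count, _ in selected)
    improved = sum(better for _, _, better in selected)
    return improved / total

low = [r for r in rows if r[0] < 2500]    # first guess below the closest bins
high = [r for r in rows if r[0] > 2750]   # first guess above the closest bins

total_observations = sum(count for _, count, _ in rows)
low_rate = improvement_rate(low)
high_rate = improvement_rate(high)

print(total_observations)
print(f"low guessers improved:  {low_rate:.1%}")
print(f"high guessers improved: {high_rate:.1%}")
```

Pooled this way, roughly 58% of the low first guesses improved by averaging, against roughly 45% of the high ones.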

Maybe the real world doesn't use proper scoring rules, so we don't instinctively make decisions that are good at maximising brier scores or whatever metric you want to use.

Like if you come to a fork in the road, with one path taking you north and the other east, and you think that the eastern path is likely to be 3 times worse than the northern one, you don't average them and set off NNE.

Random fun and reasonably relevant math fact: arithmetic mean and geometric mean are both special cases of an elegant generalization of pretty much all possible kinds of means: the power mean. https://en.wikipedia.org/wiki/Generalized_mean
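A minimal sketch of the power mean for positive values, with the geometric mean handled as the p = 0 limiting case. The function name is my own.

```python
import math

def power_mean(values, p):
    """Generalized (power) mean of positive numbers with exponent p.
    p = 1 is the arithmetic mean, p = 2 the root mean square,
    p = -1 the harmonic mean, and p = 0 (the limit) the geometric mean."""
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1 / p)

sample = [1, 4]
for p in (-1, 0, 1, 2):
    print(p, power_mean(sample, p))
```

For any fixed set of positive numbers the result is non-decreasing in p, which is why the harmonic, geometric, arithmetic, and root-mean-square means always come out in that order.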

Not at all what I thought you were testing. I thought you had found some different magic hack, because when you asked the first time I said 3000km and the second time I said 2500km, almost bang on when I checked.

About your last few paragraphs, I agree with most of your remarks about the general inability of people to consistently and adequately quantify "vague feelings", and that knowing that, finance people will prefer models they trust to relying on people.

However, something in your paragraph about forecasting and the way it's not used in the real world brought me back to the fact that most decision-making is going to be not an applied probability, but a black-or-white decision. "What is the probability that I should work in academia" actually makes little actionable sense if it's not >80% or <20% (adapt the numbers to your risk-aversedness). And actually, going back to forecasting, it is very counter-intuitive that black-or-white events should be probabilized: either you believe they will happen, or you believe they will not -- it's not like there can be anything in between. What people making decisions want is not a 66% chance: they want a 0-or-1 belief that can be explained.

Your method of using the geometric mean of the absolute error doesn't work well as a summary of how far off the typical answer was. Suppose for example the true answer to some question is 20, and the guesses are distributed uniformly randomly within the interval [19,21] (the exact form of the distribution doesn't matter, so long as it's continuous with a non-zero density near the true answer). After taking the absolute error, it's uniformly random in [0,1]. If this average is well behaved, in the limit we can replace the product with an integral exp(integral from 0 to 1 of ln(x) dx). The integral is negative infinity [edit: As Matthieu pointed out this is not correct, and therefore neither are the things that follow.] so the answer (exp of that) is 0. This means the average is not well-behaved, but intuitively it implies that the geometric mean of the absolute error will tend to zero as the number of samples tends to infinity, even if the actual average error remains constant. Note that this is not the error of the average, but the average of the error, which should remain non-zero. This is probably why the size-ten crowds appeared better than the full-size crowd, because this method of averaging over-emphasises values near zero.

A more reasonable way to combine the geometric mean with estimating errors would be to take the logarithms of all the estimates and the true value, calculate the mean-squared error or mean absolute error or something of the logarithms, then either use this result as-is ("estimates were off by X orders of magnitude on average") or take the exponential of it ("estimates were off by a factor of [e^X] on average"). In either case the result is dimensionless rather than being a kilometer value.
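One way the log-scale version could look, using mean absolute error of the logarithms. The function names are mine, and the guesses are made up for illustration rather than taken from the survey; 2487 is the distance figure used in the post's calculations.

```python
import math

def mean_abs_log_error(estimates, true_value):
    """Average absolute difference between ln(estimate) and ln(true value)."""
    return sum(abs(math.log(e / true_value)) for e in estimates) / len(estimates)

def typical_factor(estimates, true_value):
    """Exponential of the mean absolute log error: 'off by a factor of ...'."""
    return math.exp(mean_abs_log_error(estimates, true_value))

true_km = 2487
guesses = [1500, 2000, 3000, 5000]  # hypothetical guesses

print(mean_abs_log_error(guesses, true_km))
print(typical_factor(guesses, true_km))
```

A guess of twice the true value and a guess of half the true value count as equally wrong under this measure, and both outputs are dimensionless.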

I may at some point re-do your analysis with this method and see how much it changes the results.

Could you get valuable results for a single person by asking about the distances between two different pairs of cities which are (roughly) the same distance away from each other? You would get some confounding results based on personal knowledge of geography, but it might be useful way of averaging multiple rough guesses from a single person.

If you're asking about a single data point, I wouldn't expect multiple guesses from a single person to be particularly useful. Asking for an error margin (or a min/max value) is the only way I can think of to actually get additional useful information from a single person.

Opinions are not formed independently, there are latent (hidden) relationships that imply correlation structure among the responses (gender, politics, shared exposure to blogs, location, SES, ...) effectively reducing the size of the crowd.

Scott, surprised you didn't mention the TLP article on the same topic.

https://thelastpsychiatrist.com/2009/01/gods_cheat_code_for_accuracy.html

I don't think your second question's formulation was useful for testing your hypothesis, but it is an interesting illustration of the wisdom of the crowd effect. If everybody gives their best guess first, and is told to think it is very wrong, and made to give a second "guessier" guess, you don't have the wisdom of a crowd of size 2X; you have the wisdom of a crowd of size X, and the intentionally worse effort of a crowd also of size X, and are taking the average.

Here, everybody gave their best guess and were 918 km off. They were told to assume their first guess was non-trivially wrong, and take a second guess. Collectively they were able to reduce their individual "wisdom" by making a guess that was not their actual first choice, and as a crowd were successfully not as correct as their actual collective first choice.

I'm curious if this would be replicable. "If you ask a crowd a knowledge question and ask them to guess, then tell them to assume their first guess was nontrivially incorrect and guess again, the average of the first guesses will be closer than the average of the second guesses to the correct answer". In other words, respondents are successfully and correctly assigning a relative probability rating to their top two guesses.

https://www.youtube.com/watch?v=CNvz91Jyzbg&t=3043s

Generally interesting interview with Edward Thorp-- this is the section where he explains how he found out that Madoff was a fraud, and the importance (quite specifically) of actually understanding what's going on rather than trusting the wisdom of crowds. The crowd trusted Madoff.

I *think* the wisdom of crowds is about estimates, while it's possible that there are things which can actually be understood. How can you tell if you're dealing with something which can be understood?

https://www.youtube.com/watch?v=CNvz91Jyzbg&ab_channel=TimFerriss

Not the first place I've heard the idea, but Edward Thorp argues that holding index funds is the best investment, and part of that is because getting in and out of stocks (and other investments?) is heavily taxed.

Is substantially taxing getting in and out of particular stocks a good idea? Are there costs to pushing people into index funds?

Generally interesting interview.

Isn't democracy wisdom of the crowd at scale? And explicitly so if we look at the jury theorem, which was an argument for democracy from Mr. Condorcet.
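For reference, the jury theorem's core calculation is short. The sketch below assumes the textbook setup: an odd number of voters, each independently correct with the same probability. The function name and the 0.6 and 0.4 competence values are illustrative.

```python
from math import comb

def majority_correct_probability(n_voters, p_correct):
    """Probability that a strict majority of n independent voters,
    each correct with probability p_correct, picks the right option."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

for n in (1, 3, 11, 101):
    print(n, round(majority_correct_probability(n, 0.6), 4),
          round(majority_correct_probability(n, 0.4), 4))
```

With voters who are individually right more often than not, the majority gets more reliable as the group grows; with voters who are individually wrong more often than not, it gets less reliable, which is the same bias caveat raised elsewhere in this thread.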

"If you could cut your error rate by 2/3 by using wisdom of crowds techniques with a crowd of ten, isn’t that really valuable?"

Yes! We use wisdom of crowds all the time! When you ask your friends for advice, that's wisdom of crowds. You don't ask them for numerical estimates of how happy you'd be in academia vs. industry because our brains didn't evolve to deal with numbers. Instead, you ask them whether they think you should go into academia or industry. They subconsciously weigh all the information they have, and give you a subjective sentiment. If Alice, Bob, and Charlie all say you'll be miserable in academia because of X, Y, and Z, except Alice and Bob say you'll be slightly miserable and Charlie says you'll be very miserable, that's pretty good evidence against going into academia.

Democracy is also wisdom of crowds. You don't ask people what the chances of Russia invading Ukraine are. Instead, you do opinion polls to ask them what the proper foreign policy toward Russia should be, which is what you want to know anyways. The war hawks that would get us into a nuclear war cancel out the hippies who would have Russia roll all over us, and in the median you get a more or less reasonable foreign policy.

I was one of the people who gave a ridiculously far-off estimate. Partially this was due to me not remembering how far a kilometer is and going "well, it's either roughly half a mile or roughly two miles ... roughly two miles it is"

There's an easier analysis of the guessing twice strategy:

For each individual, which guess or average is closest to the correct answer?

Using the public data of everyone who answered both,

First guess is best 3036 times.

Second guess is best 2569 times.

The average of the two guesses is best 643 times.

The geometric mean of the two is best 358 times.

So in this data the first guess is most often the closest one. However, if you combine the second guess and the two averages, one of them (but you don't know which) is best more often than the first guess.
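The tally can be reproduced along these lines. This is a sketch of the idea rather than the exact script used for the counts above: the response pairs are hypothetical, ties go to whichever candidate is listed first, and 2487 km is taken as the true distance.

```python
import math
from collections import Counter

TRUE_KM = 2487

def best_candidate(first, second, truth=TRUE_KM):
    """Name of the candidate closest to the truth for one respondent.
    Ties are broken in the order the candidates are listed."""
    candidates = {
        "first guess": first,
        "second guess": second,
        "arithmetic mean": (first + second) / 2,
        "geometric mean": math.sqrt(first * second),
    }
    return min(candidates, key=lambda name: abs(candidates[name] - truth))

# Hypothetical (first guess, second guess) pairs in km
responses = [(3000, 2500), (2000, 3000), (1500, 4000), (2400, 5000)]

tally = Counter(best_candidate(g1, g2) for g1, g2 in responses)
print(tally)
```

Note that either mean can only win when the two guesses land on opposite sides of the true value; if both guesses are too low or both too high, one of the raw guesses is always closest.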

That doesn't feel surprising. It's another way of saying "Consider other values, and move your guess closer to values you consider probable." Which is something you're probably doing when you're guessing anyways. There might be some value in writing down a number of guesses, rating them, and then iterating with more guesses and ratings until you feel like you can't improve any more. As a way to simply force yourself to have more thinking time on the problem considering different possibilities.

You have used the abbreviation “OP” twice in this post without mentioning what it means.

I have noticed that ACX and LW posters also tend to use abbreviations a lot and just assume that their audience is in the know. This is very frustrating and does a disservice to those who are new to your community and are trying to learn and enjoy being a part of these discussions.

Does anyone know what “OP” refers to?

Thanks

In the case of prediction market contests, there's an additional factor that would cause averages to be more accurate, even beyond the usual wisdom of crowds. The goal of a participant isn't to minimize their *expected* error, the goal is to *maximize the chances of rising to the top*. Realistically, the only way to win a contest like this is to gamble on some unlikely outcomes and hope for the best. But if you average a crowd together, they'll probably gamble on different things and the outliers disappear.

I'm pretty sure at least one of the very low outliers was my honest answer. I guessed that the round-number km distance from equator to pole was probably 1000 rather than 1,000,000, and Paris-Moscow was probably about a fifth of that.

I think it was wrong to remove outliers.

This phenomenon is similar to test time augmentation in ML. There you do multiple predictions with a bit of randomness injected into the same inputs. Then you take those multiple predictions sourced from the same model and average them.

If you instead enable dropout at prediction time, which is kind of like a human being on LSD, it's called Monte Carlo dropout.

Spooky. The Van Dolder paper is on my very short list of studies to read soon.

The Good Judgement Project had an experimental condition where some participants made predictions by simultaneously betting in a prediction market and sharing the expected likelihood of an event directly. Their aggregation algorithm achieved highest accuracy when both the bet and the direct prediction were incorporated, implying that they each held some different information. They speculated this was due to the "inner wisdom of crowds" thing, and I think cited that same study.

Ask the entire population of the world whether God exists. The wisdom of the crowds would point to yes.

I hope Scott reads 4Denthusiast's comment and my complement, because they are the answer to his statistical puzzle.

Why would you have to convert life decisions into numerical values? Consulting family, friends, etc is pretty basic human behaviour, and they certainly may be of clearer mind about your preferences, character, capacities etc than you, at least in certain aspects, and many instances.

I can offer an anecdote about that.

Just yesterday, at a mall, I spotted one of these "guess how many marbles are inside and post the answer online" contests. As it happens, I read this post just recently, so I figured I might stand a good chance if I tried this technique: make a few logical guesses, each time assuming my previous guess was wrong in some way, then average them all.

My guesses? 6000, 9000, 4851, 5198. So my answer was their root mean square, 6472. (For some reason I felt the r.m.s. was better than a mere average.)

The actual answer? 6498.

99.6% accuracy. My mind was blown.
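For anyone who wants to check the arithmetic, here is the root mean square of the four guesses next to their plain average:

```python
import math

guesses = [6000, 9000, 4851, 5198]
actual = 6498

arithmetic_mean = sum(guesses) / len(guesses)
root_mean_square = math.sqrt(sum(g ** 2 for g in guesses) / len(guesses))

print(arithmetic_mean)           # plain average of the four guesses
print(round(root_mean_square))   # the answer submitted above
print(root_mean_square / actual) # ratio to the actual count
```

The root mean square weights the larger guesses more heavily than the plain average does, so it always comes out at least as high; in this one case that pushed it closer to the actual count.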

(In case you're wondering what I won, the answer is: nothing. Turns out the contest ended 18 months ago, and then they just left the big glass case with marbles there, complete with out-of-date instructions.)

Since multiple estimates from one person seem to be more powerful the less correlated they are, I wonder if there are any strategies-of-thumb which might reduce the correlation among an individual's estimates (other than the obvious wait-until-they-forget-their-other-answers).

e.g.

Estimate 1: "gut feeling",

Estimate 2: "Fermi estimate",

Estimate 3: "some other semi-structured way of extrapolating unknown quantities from known quantities",

etc

This is a well studied property of statistics. If you are trying to estimate the mean, odds are just taking the mean of your sample will be incorrect. If you throw in a completely random number, the odds are good that you will get a better estimate. It's sort of like the Monty Hall three doors paradox.

This may seem totally strange, but statisticians use this to improve their estimates of the mean by a process called bootstrapping. (There are variants of this, e.g. jackknifing.) The idea is to take the means of different subsamples of the original sample and use them to derive a better estimate of the mean.
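For readers who have not seen it, a bare-bones bootstrap of the mean looks roughly like this. The sample values, the resample count, and the function name are all made up for illustration. In standard usage the spread of the resampled means is read as a gauge of how uncertain the sample mean is.

```python
import random
from statistics import mean, stdev

def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Means of resamples drawn with replacement, each the size of the sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        mean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    ]

sample = [1500, 2000, 2500, 3000, 4000, 5000]  # hypothetical guesses in km
resampled = bootstrap_means(sample)

print("sample mean:           ", mean(sample))
print("mean of resample means:", mean(resampled))
print("spread of resamples:   ", stdev(resampled))
```

Each resample draws the same number of values from the original sample with replacement, so some values repeat and others drop out.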

It's not so much a property of crowds as a property of numbers.

> 1/ERROR = 2.34 + [1.8 * ln(CROWD_SIZE)]

I think something is wrong with this equation. 1/ERROR should always be < 0.005 because the error is always ~200km or more, ln(CROWD_SIZE) should be >1 for any crowd sizes of at least e (2.718), so eg a crowd size of e (2.718) gives predicted inverse error 2.34 + (1.8 * 1) = 4.14 or an error of 1/4.14 = 0.24 km, which is way off. Unless I'm missing something, which is entirely possible.

I tried replicating this and the best-fit curve I found was 1/ERROR = 0.00093 + [0.00073 * ln(CROWD_SIZE)]. (Side note: I only had n=6537, not 6924 as you said, after eliminating all blank answers.)
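Plugging both sets of coefficients into the same formula makes the scale problem easy to see. The function name is mine; the coefficients are the ones quoted above.

```python
import math

def predicted_error_km(intercept, slope, crowd_size):
    """Error implied by the fit 1/ERROR = intercept + slope * ln(CROWD_SIZE)."""
    return 1 / (intercept + slope * math.log(crowd_size))

original = (2.34, 1.8)           # coefficients as printed in the post
replicated = (0.00093, 0.00073)  # coefficients from my replication

for n in (1, 10, 100, 6537):
    print(n,
          round(predicted_error_km(*original, n), 3),
          round(predicted_error_km(*replicated, n), 1))
```

The printed coefficients give errors of a fraction of a kilometer at every crowd size, while the replicated ones give errors in the hundreds of kilometers, which is the range the post actually reports.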