323 Comments

> And then generalize further to the entire world population over all of human history, and it stops holding again, because most people are cavemen who eat grubs and use shells for money, and having more shells doesn’t make it any easier to find grubs.

This is inaccurate. The numbers are pretty fuzzy but I find reputable-looking estimates (e.g. https://www.ined.fr/en/everything_about_population/demographic-facts-sheets/faq/how-many-people-since-the-first-humans/) that roughly 50% of humans who ever lived were born after 1 AD.


I'm pretty confused by this kind of attitude. To be quite frank I think it's in-group protectionism.

I'll start off by saying I think most psych studies are absolute garbage and aella's is no worse. But that doesn't mean aella's are _good_.

In particular, aella's studies are often related to extremely sensitive topics like sex, gender, wealth, etc. She's a self-proclaimed "slut" who posts nudes on the internet. Of course the people who answer these kinds of polls _when aella posts them_ are heavily biased relative to the population!

I think drawing conclusions about sex, gender, and other things from aella's polls is at least as fraught as drawing those conclusions from college freshmen. If you did a poll on marriage and divorce rates among college-educated people you would get wildly different results than at the population level. I don't see how this is any different from aella's polls.


If smart people eat bananas because they know they are good for their something something potassium then we should be skeptical about the causal language in your putative study title. Perhaps something more like "Study finds Higher IQ People Eat More Bananas" would be more amenable to asterisking caveats and less utterly and completely false and misleading.


I think the real difference here is that the studies are doing hypothesis testing, while the surveys are trying to get more granular information.

I mean you have a theory that bananas -> potassium -> some mechanism -> higher IQ, and you want to check if it is right, so you ask yourself how does the world look different if it is right versus if it is wrong. And you conclude that if it is correct, then in almost any population you should see a modest correlation between banana consumption and IQ, whereas the null hypothesis would be little to no correlation. So if you check basically any population for correlation and find it, it is evidence (at least in the Bayesian sense) in favor of your underlying theory.

On the other hand, if you were trying to pin down the strength of the effect (in terms of IQ points/ banana/ year or something), then measuring a correlation for just psych 101 students really might not generalize well to the human population as a whole. In fact, you'd probably want to do a controlled study rather than a correlational one.


I agree that most people rush to "selection bias" too quickly as a trump card that invalidates any findings (up there with "correlation doesn't mean causation"). However, I disagree that "polls vs. correlations" is the right lens to look at this through (after all, polls are mostly just discovering correlations as well).

The problem is not the nature of the hypotheses or even the rigor of the research so much as whether the method by which the units were selected was itself correlated with the outcome of interest (i.e., selecting on the dependent variable). In those cases, correlations will often be illusory at best, or in the wrong direction at worst.


What do you all think about the dominance of Amazon’s Mechanical Turk in finding people for studies? Has it worsened studies by only drawing from the same pool over and over?


"Selection bias is fine-ish if..."

I'm interpreting this as saying that one's prior on a correlation not holding for the general population should be fairly low. But it seems like a correlation being interesting enough to hear about should be a lot of evidence in favour of the correlation not holding, because if the correlation holds, it's more likely (idk by how much, but I think by enough) to be widely known -> a lot less interesting, so you don't hear about it.

As an example, I run a survey on my blog, Ex-Translocated, with a thousand readers, a significant portion of whom come from the rationality community. I have 9 innocuous correlations I'm measuring which give me exactly the information that common sense would expect, and one correlation between "how much time have you spent consuming self-help resources?" and "how much have self-help resources helped you at task X?" which is way higher than what common sense would naively expect. The rest of my correlations are boring and nobody hears about them except for my 1,000 readers, but my last correlation goes viral on pseudoscience Twitter, which assumes this generalises to all self-help when it doesn't and uses it to justify actually unhelpful self-help. (If you feel the desire to nitpick this example you can probably generate another.)

I agree that this doesn't mean one ought to dismiss every such correlation out of hand, but I feel like it does mean that if I hear about an interesting correlation from a survey or psych study in a context where I didn't also previously hear about the survey/study's intention to investigate said correlation (preregistration alone doesn't fix this, because of memetic selection effects), I should ignore it unless I know enough to speculate about the actual causal mechanisms behind that correlation.

This pretty much just bottoms out in "either trust domain experts or investigate every result of a survey/every study in the literature" which seems about right to me. So when someone e.g. criticises Aella for trying to run a survey at all to figure things out, that's silly, but it's also true that if one of Aella's tweets talking about an interesting result goes viral, they should ignore it, and this does seem like the actual response of most people to crazy-sounding effects; if anything, people seem to take psych studies too seriously rather than not taking random internet survey results seriously enough.
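To make the selection-for-interestingness dynamic concrete, here is a minimal Python sketch (all numbers invented): even when every true correlation is exactly zero, the one result per survey that is extreme enough to spread will systematically look like a finding.

```python
import numpy as np

# Hypothetical setup: many survey-runners each measure 10 correlations on pure
# noise; from each survey, only the single most extreme correlation "goes viral".
rng = np.random.default_rng(0)
n_surveys, n_questions, n_readers = 1000, 10, 1000

viral_rs = []
for _ in range(n_surveys):
    answers = rng.normal(size=(n_readers, n_questions + 1))  # no real relationships at all
    rs = [np.corrcoef(answers[:, 0], answers[:, k])[0, 1]
          for k in range(1, n_questions + 1)]
    viral_rs.append(max(rs, key=abs))  # only the most surprising result spreads

print(f"average |r| of the results you hear about: {np.mean(np.abs(viral_rs)):.3f}")
print("true correlation behind every one of them: 0.000")
```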


Like any kind of bias, selection bias matters when the selection process is correlated with BOTH the independent and dependent variables and as such represents a potential confounder. Study design is how you stop selection bias from making your study meaningless.


The way I think about the key difference here (which I learned during some time doing pharma research, where these kinds of issues are as bad as... well) is that when claiming that a correlation doesn't generalize, some of the *burden of proof* shifts to the person criticizing the result. Decent article reviewers were pretty good at this: giving an at least plausible-sounding mechanism by which, in a different population, some *additional* effect would cancel or reverse the correlation. It's the fact that a failure of the correlation to generalize requires this extra mechanism that goes against Occam's Razor.


It's not about correlations, it's about the supposed causal mechanism. Your Psych 101 sample is fine if you are dealing with cognitive factors that you suppose are universal. If you're dealing with social or motivational ones, then you're perhaps going to be in danger of making a false generalization. This is particularly disastrous in educational contexts because of the wide variety of places and populations involved in school learning. It really does happen all the time, and the only solution is for researchers to really know the gamut of contexts (so that they realize how universal their mechanisms are likely to be) and make the context explicit and clear instead of burying it in limitations (so that others have a chance of catching them in an over-generalization, if there is one). Another necessary shift is for people to simply stop looking for universal effects in the social sciences and instead expect heterogeneity.


“But real studies by professional scientists don’t have selection bias, because . . . sorry, I don’t know how their model would end this sentence.”

...because they control for demographics, is how they’d complete the sentence.

Generically, we know internet surveys are terrible for voting behavior. Whether they’re good for the kinds of things Aella uses them for is a good question!

I’m on the record in talks as saying “everything is a demand effect, and that’s OK.” I see surveys as eliciting not what a person thinks or feels, but what they are willing to say they think and feel in a context constructed by the survey. Aella is probably getting better answers about sexual desire (that’s her job, after all!) and better answers on basic cognition. Probably worse on consumer behavior, politics, and generic interpersonal.


>And then generalize further to the entire world population over all of human history, and it stops holding again, because most people are cavemen who eat grubs and use shells for money, and having more shells doesn’t make it any easier to find grubs.

I know this is somewhat tongue-in-cheek, but for accuracy's sake: the number of people who were born before widespread adoption of agriculture was on the order of 10 billion, vs. about 100 billion after. https://www.prb.org/articles/how-many-people-have-ever-lived-on-earth/


I am a professor of political science who does methodological research on the generalizability of online convenience samples. The gold standard of political science studies is indeed *random population samples* -- it's not the whole world, but it is the target population of American citizens. Yes this is getting harder and harder to do and yes imperfections creep in. But studies published in eg the august Public Opinion Quarterly are still qualitatively closer to "nationally representative" than are convenience samples, and Scott's flippancy here is I think a mistake.

My research is specifically about the limitations of MTurk (and other such online convenience samples) for questions related to digital media. My claim is that the mechanism of interest is "digital literacy" and that these samples are specifically biased to exclude low digital literacy people. That is, the people who can't figure out fake news on Facebook also can't figure out how to use MTurk, making MTurk samples almost uniquely bad for studying fake news.

(ungated studies: http://kmunger.github.io/pdfs/psrm.pdf

https://journals.sagepub.com/doi/full/10.1177/20531680211016968 )

This post is solid but it doesn't emphasize enough the crucial point: "If you’re right about the mechanism...". More generally, I think that there are good reasons that Scott's intuitions ('priors') about this are different from mine: medical mechanisms are less likely to be correlated with selection biases than are social scientific mechanisms.

There is a fundamental philosophy of science question at stake here. Can the study of a convenience sample *actually* test the mechanism of interest? As Scott says, there is always the possibility of eg collider bias (the relationship between family income and obesity "collides" in the sample of college students).

So how much evidence does a correlational convenience sample *actually* provide? This requires a qualitative call about "how good" the sample is for the mechanism at issue. And at that point, if we're making qualitative calls about our priors and about the "goodness" of the sample....can we really justify the quantitative rigor we're using in the study itself?

In other words: should a study of a given mechanism on a given convenience sample be "valid until proven otherwise"? Or "valid until hypothesized otherwise"? Or "Not valid until proven otherwise"? Or "Not valid until hypothesized otherwise"?


Is there a reason why you just wouldn't want to be somewhat specific with the headline of what you're publishing? So instead of "Study Finds Eating Bananas Raises IQ," you instead publish “Study Finds Eating Bananas Raises IQ in College Students," if they're all college students.


I think the important issue is whether the selection bias is plausibly highly correlated with the outcomes being measured. I think the reason people scream selection bias about internet polls is that frequently participation is selected for based on strong feelings about the issue under discussion.

So if you are looking for surprising correlations in a long poll (as you do with your yearly polls), that's less of an issue. But with the standard internet survey, the audience can either guess at the intended analysis and decide to participate based on their feelings about it, or they are drawn to the blogger/tweeter because of similar ways of understanding the world, and so are quite likely to share whatever features of the author prompted them to generate the hypothesis in the first place.

Choosing undergrads based on a desire for cash is likely to reduce the extent of these problems (unless it's a study looking at something about how much people will do for money).


Real scientists control for demographic effects when making generalizations outside the specifics of the dataset used. I'm confused why this article doesn't mention the practice - demographic adjustments are a well-understood phenomenon and Scott would have been exposed to them thousands of times in his career. And honestly, I think an argument can be made that the ubiquity of this practice in published science but its absence in amateur science mostly invalidates the thesis of this article, and I worry that Scott is putting on his metaphorical blinders due to his anger at being told off in his previous post for making this mistake.

This article does not feel like it was written in the spirit of objectivity and rationalism - it feels like an attempt at rationalization in order to avoid having to admit to something that would support Scott's outgroup.


(1) It's also worth noting that you can do a lot of sensitivity tests to see how far the results within your sample appear to be influenced by different subgroups which can help indicate where the unrepresentativeness of your sample might be a problem. IIRC the EA Survey does this a lot. This also helps with the question of whether an effect will generalise to other groups or whether, e.g. it only works in men.

Of course, this doesn't work for unobservables (ACX subscribers or Aella's Twitter readers are likely weird in ways that are not wholly captured by their observed characteristics, like their demographics).

(2) I think you are somewhat understating the potential power of "c) do a lot of statistical adjustments and pray", and with it the potential gap between an unrepresentative internet sample which you can and do statistically weight and one (like a Twitter poll) which you don't weight. Weighting very unrepresentative convenience samples can be extremely powerful in approximating the true population, while Twitter polls are almost always not going to be representative of it.
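For what it's worth, the simplest version of such weighting is easy to sketch. In the Python sketch below, the age groups, outcomes, and population shares are invented purely for illustration; real adjustment (e.g. raking over several variables at once) is more involved.

```python
import pandas as pd

# Hypothetical convenience sample: young respondents are heavily over-represented.
sample = pd.DataFrame({
    "age_group": ["18-34"] * 70 + ["35-64"] * 25 + ["65+"] * 5,
    "outcome":   [1] * 40 + [0] * 30 + [1] * 5 + [0] * 20 + [1] * 1 + [0] * 4,
})

# Known population shares (e.g. from a census), assumed here for illustration.
population_share = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

sample_share = sample["age_group"].value_counts(normalize=True)
sample["weight"] = sample["age_group"].map(lambda g: population_share[g] / sample_share[g])

raw = sample["outcome"].mean()
weighted = (sample["outcome"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"raw sample estimate:      {raw:.2f}")
print(f"post-stratified estimate: {weighted:.2f}")
```

Of course, this only fixes imbalance on the variables you weight on, which is exactly the "unobservables" caveat from point (1).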


Seems like a good argument for rejecting studies done on Psych 101 undergrads, not for accepting surveys done on highly idiosyncratic groups of blog readers.


Since someone evaluating a claim can never know how many polls didn't show interesting results, the facts that real-world surveys are much more expensive to conduct and that online polls leave fewer variables under the survey giver's control (accepted practice is not to tell the undergrads what they're coming in for, and cash is the primary motivator in all of them) are a strong justification for treating online polls as less reliable.

In some sense the real selection bias is the bias in which polls you never hear about, but that is a good reason. Though it leads to an interesting epistemic situation where the survey giver may have no more reason to doubt their own poll than the academics polling undergrads do, while the people they inform about it do.


> It doesn’t look like saying “This is an Internet survey, so it has selection bias, unlike real-life studies, which are fine.”

Eh, this seems like a highly uncharitable gloss of the concern. I would summarize it more as "Selection (and other) biases are a wicked hard problem even for 'real-life' studies that try very hard to control for them; therefore, one might justly be highly suspicious of internet studies for which there were no such controls."

One good summary of the problem of bias in 'real-life' studies: https://peterattiamd.com/ns003/

The issue is always generalization. How much are you going to try to generalize beyond the sample itself? If not at all, then there is no problem. But, c'mon, the whole point of such surveys is that people do want to generalize from them.


So, this is kinda accurate, but I feel like you're underestimating the problems of selection bias in general. In particular, selection bias is a much bigger deal than I think you're realizing. The correlation coefficient between responding to polls and vote choice in 2016 was roughly 0.005 (Meng 2018, "Statistical Paradises and Paradoxes in Big Data"). That was enough to flip the outcome of the election. So for polls, even an R^2 of *.0025%* is enough to be disastrous. So yes, correlations are more resistant to selection bias, but that's not a very high bar.

Correlations are less sensitive, but selection effects can still matter a lot. As an example, consider that among students at any particular college, SAT reading and math scores will be strongly negatively correlated, despite being strongly positively correlated in the population as a whole: if a student had a higher score on both reading and math, they'd be going to a better college, after all, so we're effectively holding total SAT constant at any particular school.

So the question is, are people who follow Aella or read SSC as weird a population as a particular college's student body? I'd say yes. Of course though, it depends on the topic. For your mysticism result, I'm not worried, because IIRC you observe the same correlations in the GSS and NHIS--which get 60% response rates when sampling a random subset of the population. But I definitely wouldn't trust the magnitude, and I'd have made an attempt at poststratifying on at least a couple variables. Just weighting to the GSS+Census by race, income, religion, and education would probably catch the biggest problems.
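The SAT example above is easy to reproduce in a toy simulation (all coefficients invented): scores that are clearly positively correlated in the whole population come out strongly negative once you condition on a narrow band of total score, i.e. on attending one particular college.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
ability = rng.normal(size=n)
# Math and reading both load on general ability, plus independent noise.
math_score    = 600 + 80 * (0.8 * ability + 0.6 * rng.normal(size=n))
reading_score = 600 + 80 * (0.8 * ability + 0.6 * rng.normal(size=n))

total = math_score + reading_score
one_college = (total > 1350) & (total < 1400)  # admits only a narrow band of total SAT

print(f"corr(math, reading), whole population: {np.corrcoef(math_score, reading_score)[0, 1]:+.2f}")
print(f"corr(math, reading), at one college:   "
      f"{np.corrcoef(math_score[one_college], reading_score[one_college])[0, 1]:+.2f}")
```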


If this is about the last article, your general point is correct, but you polled a readership that's notoriously hostile to spirituality to determine whether mental health correlates with spirituality. It'd be like giving Mensa folks a banana and measuring their IQ. You selected specifically for one of the variables, and that's likely to introduce confounders.


Going to Aella's tweet that was linked:

> using it as a way to feel superior to studies, than judiciously using it as criticism when it's needed

just because people use selection bias as a way to feel superior to studies doesn't mean that the study isn't biased in the first place

and

> But real studies by professional scientists don’t have selection bias, because...

this ignores the fact that professional studies control for selection bias, or at least have a section in the paper specifying who the participants were, unlike Twitter polls


Selection bias can and absolutely does break correlations, frequently. The most obvious way is through colliders (http://www.the100.ci/2017/03/14/that-one-weird-third-variable-problem-nobody-ever-mentions-conditioning-on-a-collider/) - but there's tons of other ways in which this can happen: the mathematical conditions that have to hold for a correlation to generalize to a larger population when you are observing it in a very biased subset are pretty strict.

Further: large sample sizes do help, but they do not help very much. There is a very good paper, requiring only fairly basic math, that tackles the problem of bias in surveys: https://statistics.fas.harvard.edu/files/statistics-2/files/statistical_paradises_and_paradoxes.pdf (note: this is not specifically about correlations, but the problem is closely related). Here is the key finding:

Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a ρ_{R,X} ≈ −0.005 for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from 1% of the US eligible voters, that is, n ≈ 2,300,000, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size n ≈ 400, a 99.98% reduction of sample size (and hence our confidence).

And keep in mind - this is in polling, which 'tries' to obtain a somewhat representative sample (ie, this sample is significantly less biased than a random internet sample).
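Plugging the quoted numbers back into the paper's error decomposition (estimation error = data-defect correlation × sqrt((N − n)/n) × population SD) reproduces the n ≈ 400 figure; N here is assumed to be the 2016 US eligible-voter population, i.e. about 100x the 1% sample in the quote.

```python
# Setting the biased sample's MSE equal to sigma^2 / n_eff for a simple random
# sample gives n_eff ≈ n / (rho^2 * (N - n)).
n = 2_300_000   # the 1% non-random sample from the quote
N = 100 * n     # assumed: ~230 million US eligible voters in 2016
rho = 0.005     # data-defect correlation for self-reported Trump vote

n_eff = n / (rho**2 * (N - n))
print(f"effective simple-random-sample size: {n_eff:.0f}")  # ≈ 400
```

That is the ~99.98% effective-sample-size reduction mentioned in the quote.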


Looking at Aella's data & use of it, I don't have the same concerns I may have about the SSC survey used on religious issues.

So this chart, for example:

https://twitter.com/Aella_Girl/status/1607641197870186497

I am not aware of a likely rationale for these results to change because of the selection effect, specifically on the axis studied. I might raise selection-effects concerns if the specific slope were the question, but not for the mere presence of a slope.

Further, I don't get the sense that Aella is trying to turn a very messy and vague original problem statement into something to refute without providing a number of caveats.

It is valid to push back that selection effects are everywhere. It is valid to argue that SSC data has some evidentiary value, and that as good Bayesians we should use it as evidence. But the tone of the post does not hit the right note to keep that argument from being rejected.

However, to push back on the push-back: I would seriously try to assess whether you have difficulty dealing with disagreement or challenges. Not to psychologize this too much, but is this post actually trying to raise the discourse? Or is it just trying to nullify criticism? Are you steelmanning the concern, or merely rebutting it?


An interesting solution to the problem that surveys are so easy to give online (creating strong publication/heard-of bias) would be to set up a website where poll givers have to post a certain-sized donation (say, to GiveWell) in order to run the survey. That would duplicate the effect of offline polls being expensive to run, thereby reducing publication bias.


I’ve been thinking a possible online business / salve for democracy would be “a weekly election on what matters most to you.” Basically like a Twitter poll but slightly less crazy.

If people volunteer their demographic info, this would be very valuable for customers like businesses and politicians. End users get the satisfaction of someone somewhere finally listening.


I'm sympathetic to pushing back on lazy criticism, but also I think the context of how the result was produced is very important for calibrating how strongly one can take it as evidence. It's certainly true that all surveys are inherently "flawed" due to selection bias issues. There's a few ways to proceed from this:

(1) Throw up one's hands, declare the truth unknowable, and post a picture of an airplane wing with bullet holes.

(2) Acknowledge that this survey, like all surveys, is imperfect. But hey, the result sure is interesting, it makes some kind of intuitive sense, and there's no obvious reason why it really shouldn't generalize. Take the exact numbers with a grain of salt and hope that the first order effect dominates, as it often does.

(3) Do a lot of careful statistical analysis to attempt to correct for unrepresentative aspects of the sample. Compare results to literature for previous research into related questions. Submit to peer review and respond to critical feedback. Attempt to replicate.

Response (1) is the kind of lazy critique that this post argues against, and I agree that it is poor form and doesn't contribute much. Response (2) is reasonable for generating hypotheses and building intuition about the world, but it will also lead you astray a nontrivial fraction of the time. Response (3) is closer to what a professional researcher would do, but it takes a lot more time and expertise and will still be wrong sometimes.

I think the interesting conflict comes from conflating (2) and (3). Someone accustomed to (3) may look at people doing (2) as naïve and out of their depth, and also as dilutive to more rigorous work because it may look the same to undiscerning lay people. Meanwhile, someone doing (2) may look at people demanding (3) as gatekeepers with excessive demands for rigor whose preferred methods aren't exactly bulletproof either. This could easily degenerate into a toxic discourse where people just yell past each other. But provided they are given appropriate context, I think both (2) and (3) can be useful ways to build knowledge about the world. Rigor is useful, but it's not a binary where everything insufficiently rigorous must be discarded as useless and anything that meets the bar accepted as eternal truth.


"But generalize to the entire US population, and poor people will be more obese, because they can’t afford healthy food / don’t have time to exercise / possible genetic correlations."

And, to be impolite, because many of the same things that make them more likely to be poor make them more likely to be obese: lower intelligence, less ability to defer gratification, less ability to plan and follow through, etc.


Well, sure. I think the steelman argument is that selection bias is often much worse for a survey on the internet than for a Psych 101 study. No psych professor has to worry about whether their respondents are all horny, always-online boys because they recruited by posting nudes on Twitter, or whether they're all participating in the study just to fuck with someone's results.

Also, your banana study title is killing me. It shows correlation, not causation, and as we all know…


When I was diagnosed with pancreatitis, I immediately searched the internet for information. Unfortunately, the first serious-looking research paper I found declared the ailment had a 60% survival rate in five years.

I didn't like that one bit, so I kept looking. After a couple weeks I found another paper that declared the five-year survival rate was over 90%. I liked that paper a lot better.

Seven years on, my survival rate is 100%. So, is my confirmation bias confirmed?


I wonder if the mere fact that you restrict the sample on the x axis, or the y axis, causes the correlation between the x and y variables to be completely different than in the general population.

For example: suppose that psychology students never eat less than one banana per year - other than that they do not have any fancy physiology or mental properties - wouldn't that alone restrict the "elliptic" picture of the x-y correlation to a fragment in which this ellipse has a particular slope?

I've made a tool to help me visualize this:

https://codepen.io/qbolec/pen/qBybXQe

in this demo there are two variables:

X is a normal variable with mean=0 and variance=1

Y depends on X, in that it is a Gaussian with mean=X*0.3 and variance=1

So, we expect the correlation to be positive, because the higher the X, the higher the Y in general and indeed the white dots form a slanted elliptic cloud. And the correlation in general population seems to be ~0.29.

But if we restrict the picture to the green zone in the upper right corner of the ellipse, I sometimes get negative correlation for such sub-sample, and I never get close to 0.3.

(Sorry, I could not get this demo to robustly show the negative value, though)

IIRC the https://www.lesswrong.com/posts/dC7mP5nSwvpL65Qu5/why-the-tails-come-apart was about this phenomenon.
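For anyone who doesn't want to click through, here is a rough Python equivalent of that demo (the exact cutoff for the "green zone" is a guess): restricting to the upper-right corner of the cloud drags the observed correlation far below the full-population value of about 0.29.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)            # X ~ N(0, 1)
y = 0.3 * x + rng.normal(size=100_000)  # Y ~ N(0.3 * X, 1)

corner = (x > 1) & (y > 1)  # roughly the "green zone" in the upper-right corner

print(f"correlation, whole population:   {np.corrcoef(x, y)[0, 1]:.2f}")  # ~0.29
print(f"correlation, upper-right corner: {np.corrcoef(x[corner], y[corner])[0, 1]:.2f}")
```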


Scott is a clever guy, but here he is on thin ice, for reasons others have pointed out above. Testing correlations (hunting for causality) the way he did in the blog post he refers to (healthy people less often report mystical experiences) is a subtler version of what is commonly referred to as “red wine research”.

…lots of studies find a positive correlation between drinking red wine and scoring high on various health measures. Some researcher is then quoted in the news media suggesting a causal relationship: There must be something in red wine that improves health. And there may be.

However, drinking red wine is correlated to being upper middle class. And upper middle class people score higher on many/most health indicators.

You can do multivariate regressions and the like to reduce the problem, but the number of control variables will always be limited. Unobserved heterogeneity is always with us, in such correlation studies. The problem is particularly acute if you do not even have a time series (panel study).

The problem with the correlation between health and mystical experience is more subtle - it is not a straightforward 3rd variable problem. So it is not a straightforward “red wine research” problem (I do not want to insinuate that Scott is not aware of statistics 101). The subtler problem has to do, first, with possible selection in which healthy people are ACX readers and which of them filled in the survey. Perhaps they are a particularly secular bunch of healthy people, who give secular explanations to “strange” personal experiences that run-of-the-mill healthy people would label mystical experiences. Secondly, it has to do with the possibility that not-so-healthy ACX readers who filled in the survey may be a more mystically oriented bunch of people than run-of-the-mill not-so-healthy people. If so, they might be more likely than other not-so-healthy people to interpret “strange” experiences as mystical.

…this is based on a speculative hypothesis that ACX readers are composed of two groups of people: particularly secular rationalists drawn to Scott’s writing on rationalism, and particularly mystically-oriented people drawn to his writings on, well, mystical experiences of various sorts. And that there are correlations with self-declared health between these two select groups of readers (who responded to the survey).

Who knows.


Is this because of all the comments on your last post?

The issue I had wasn't that selection bias is present in your survey, that's unavoidable. The issue I had was that you were far more conclusive than your survey allowed you to be. You misused your data and stood on a soapbox at the end there.


Sample selection can be a problem for other reasons as well (e.g. Berkson's paradox).


There's a specific circumstance where selection bias is fatal for correlations: when examining correlations on characteristics related to selection. Take your obesity example:

"in a population of Psych 101 undergrads at a good college, family income is unrelated to obesity. This makes sense; they’re all probably pretty well-off, and they all probably eat at the same college cafeteria. But generalize to the entire US population, and poor people will be more obese, because they can’t afford healthy food / don’t have time to exercise / possible genetic correlations."

The big problem here isn't that everyone's reasonably well-off, it's that because college selects for well-off people, people who aren't well-off and who end up in college anyway will have a bunch of compensatory characteristics that help them get selected into college. To make it extremely simple, we could imagine that whether you go to college is entirely a function of family income and something like personal grit/self-control. In this case, we'd expect that the minimum amount of self-control necessary to get into college would be higher for lower-income people. As a result, if there was no other relationship between self-control and family income, we'd end up with a negative correlation between the two among college students that was stronger the more selective the college was (and thus the more people are on the line between being selected and not).

So now when you do your obesity study, you'll get a biased estimate of the effect of family income on obesity because family income will be negatively associated with self-control, which is itself negatively associated with obesity. This will be true despite the fact that there's no relationship between self-control and family income in the full population.

In the case of the ACX reader surveys, this might mean that people who are least like other ACX readers (for instance, non tech people, women) will be more selected for ACX-ness than are the people most likely to read ACX.

My favorite example of this is basketball players and height, btw. My guess is that if you surveyed NBA players on how much time they spent playing basketball as kids, the shorter players would have spent more time playing basketball than the taller players, because short people need fantastic basketball skills to be NBA players while tall people only need decent basketball skills. This would be the exact opposite correlation you would get with any other group of people.
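That basketball example also makes a nice toy simulation (all coefficients invented): height and childhood practice are independent in the population, but if making the league requires a weighted sum of the two to clear a high bar, they come out negatively correlated among the players who make it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
height   = rng.normal(size=n)  # standardized height
practice = rng.normal(size=n)  # standardized childhood practice, independent of height

# Selection favors height, but enough practice can compensate for being shorter.
makes_the_league = (2.0 * height + 1.0 * practice) > 4.5

print(f"corr(height, practice), everyone:       {np.corrcoef(height, practice)[0, 1]:+.2f}")
print(f"corr(height, practice), league players: "
      f"{np.corrcoef(height[makes_the_league], practice[makes_the_league])[0, 1]:+.2f}")
```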


This is a side issue but because of very high recent population growth this almost certainly isn’t true right: “And then generalize further to the entire world population over all of human history, and it stops holding again, because most people are cavemen who eat grubs and use shells for money, and having more shells doesn’t make it any easier to find grubs.” I’m not sure when the median person who ever lived was born but I bet it was sometime in the 20th century, no?


The problem is not with the surveys themselves, the problem is with how people interpret their results. Yes, you're smart enough and savvy enough to mentally append "this is true for the types of people who follow Aella's Twitter account" to every conclusion you draw from her surveys. But I doubt that all of Aella's followers are also that smart and savvy. A lot of people will probably assume that the results are representative of the general populace, simply because they haven't even considered the fact that the results might be unrepresentative.

And for what it's worth, I really like Aella's surveys, and I genuinely think there's a lot of value to be found in them! I just also think saying "take internet survey results with a grain of salt" is a useful reminder, because not everyone takes their full context into consideration by default.


"Internet studies have selection bias and academic studies don't" is a strawman. A stronger form of the argument is that it's typical for Internet studies to *select on the dependent variable* in ways that are much more concerning than the typical Mechanical-Turk or psych-undergrad samples of an academic study.

While academic studies often use WEIRD samples that are somewhat better-educated, richer, etc than the global average, Internet convenience samples—particularly those from blogs like this or Aella's that have a strong "flavor"—are biased along ideological, cultural, or interest-based affinity dimensions, in addition to selecting for literacy and Internet access in ways similar to psych-undergrad studies. Furthermore, a typical Internet study asks questions about topics specific to the interest(s) distinctive to the sampled population, which makes it much more likely that results will be unrepresentative and even correlations won't generalize.

Aella is a clear example of this: she's a former sex worker who has gathered a following by flouting normal social taboos about sex and sex talk, and she asks these followers about exactly these topics. It's certainly interesting to see what this large sample of highly-open-to-discussing-sex-in-written-English people thinks, but there are obvious reasons to think most people's thoughts about sex are more similar to those of the median Mechanical Turk user than the median Aella poll participant.


> (obviously there are many other problems with this study, like establishing causation - let’s ignore those for now)

I agree that failure to generalize out of sample can be fine-ish if you already know or don't care about the causal model, but when I see something criticized for selection bias, it's almost always to caution against making inferences related to causation.


>It doesn’t look like saying “This is an Internet survey, so it has selection bias, unlike real-life studies, which are fine.” Come on!

Do you think the people willing to criticize your blog generally trust shitty psych studies?


Is the data available for download? If so, does someone have a link? Thanks in advance.


Aella neither has to face peer review nor the scrutiny of replication by other, unaffiliated scientists.

This feels like a major shortcoming.


This is why I pay attention to anecdotal evidence, Amazon reviews, and dietary cults. Where humans are not homogeneous, people select diets/products/etc. based upon their individual circumstances. A diet/drug/product/meditation technique may be beneficial to some and harmful to others for an average effect of zero. Even attempting a scientific study to determine higher order statistical terms is double plus expensive. Anecdotal evidence is often mixed with revealed preference.

It takes a bit of human judgment to tease out the underlying truth of such data. People will cling to a diet or ideology even if it isn't working for them. But even there, the truth leaks out. For example, the fact that most Paleo and Keto advocates tout weight training and dismiss aerobic workouts is a pretty decent indicator that such diets are not optimal for running marathons. Conversely, carbo-loading advocates tend to be big on aerobics, and the tiny number of vegan bodybuilders out there are big on "superfoods."


Aella's followers seem to be mostly men. Who are mostly heterosexual, I guess, as they like to look at pics of her with not too much clothing on - if I understand correctly - with enough text in between that even Alex Tabarrok feels OK reading her tweets. I would not see this as a reason to consider her polls unrepresentative. - Now can someone please provide links to those fabulous pics - instead of tweets where she argues that "my samples are likely more reliable than most published social research" - which may very well be true. Just gimme those pics! Please! And maybe some links to her legendary polls. ;)


I once asked a bunch of Ivy league students working in a physics lab to try to draw maps of the USA from memory. The results were pretty interesting.

One thing that was apparent was that the students I thought were dullest produced the best maps. I think this is best explained by Berkson's paradox--smart students don't need as good a memory to get into Ivy League schools.

I worry about correlation in SSC surveys.


> But real studies by professional scientists don’t have selection bias, because . . . sorry

Because they do have selection bias. Which is why psychology (most of it) cannot be trusted as a science. It doesn't get a pass.


Instead of talking about Selection Bias in the abstract, as many commenters have, why not speak about it in the particular instance? Let's agree that there is some degree of selection bias at play in the correlations drawn from internet surveys of certain groups, what now? Should we totally throw out the results? That doesn't seem right if we're genuinely interested in learning something. I wholeheartedly agree with the Aella tweet that too many people on the internet use some of these biases as easy ways to dismiss research they don't like. Open up r/science on reddit and you can see countless examples of this, even examples where the researchers are accused of not controlling for things they specifically did control for. Similarly, in this comment section you can see people troubled by the selection bias in SSC surveys while simultaneously using faulty logic and personal anecdotes to make causal claims.

On a more meta-level, I think whenever statistical techniques are used to draw inference one either has to be very careful and specific about their conclusion (as Scott demonstrates in another comment) or one opens their analysis up to some methodological criticism. Unfortunately, particularly on the internet, there's little to no effort on the part of the criticizer to demonstrate that the cited bias (OVB, SB, etc.) is actually important here. Instead, they can merely claim it exists, write down some plausibly true example of it playing out and call it a day.

I'm not sure where Andrew Gelman would land on all of this, but to me this whole thing is propagated by the NHST framework that encourages a binary classification of research as either being good or bad. Lastly, I'll just say that despite years of studying statistics/econometrics I still find colliders hard to think about and that makes me mad.


Aren't most studies by (respectable) academics these days done on Mechanical Turkers? Which is surely a skewed sample, but probably less skewed than, say, Cornell undergrads or Aella's Twitter followers.


Perhaps banana eating is more popular at highly competitive high schools and geographic diversity criteria make a higher IQ necessary to be admitted from them.


Nice rejoinder. I think of this as one of the Wikipedia memes, the argument conventions that come from people steeped in Wikipedia (or similar collaborative efforts) to the point that assorted WP rules seem like the only natural way to put rules on argument. So you get this one, and the wild overuse of "correlation is not causation" and assorted other logi-slogans, plus the belief that adding a citation to anything you say necessarily increases its logical force tenfold.

Arguably it's all a reason to restore the study of rhetoric to greater prominence in general education.


There are also reasons to distrust surveys generally that have nothing to do with selection bias.

https://carcinisation.com/2020/12/11/survey-chicken/

> Comprehension is difficult enough in actual conversation, when mutual comprehension is a shared goal. Often people think they are talking about the same thing, and then find out that they meant two completely different things. A failure of comprehension can be discovered and repaired in conversation, can even be repaired as the reading of a text progresses, but it cannot be repaired in survey-taking. Data will be produced, whether they reflect the comprehension of a shared reality or not.

(And yes, I think we should somewhat distrust professionally done research too.)


Look, sometimes you want to say "consider this way your data may be biased."

That doesn't mean "your data is trash, we can learn literally nothing from your trash contaminated data. sit in the corner and feel bad." it means "consider this way your data may be biased."

If a political poll gets retweeted by e.g. Contrapoints and no other big bluechecks, and so 70% of the voters are Contrapoints followers, that is really worth mentioning while people are trying to derive meaning from the poll!


Selection bias can be fatal to polls, but like many poisons it is all a matter of degree. How much selection bias? What kind of selection bias? The kind that has a big effect on the kind of question being polled for? What measures have been taken to minimize the effect of any selection bias in the poll? Since there's always selection bias, these are the important questions. There's a whole technology for minimizing the effects of selection bias in polls, and it works, but it's not always used, because the purpose of many polls is to support a result rather than detect it.

With correlations, it's actually the same. It matters to the degree it affects the findings. The questions are: is the selection bias relevant to the conclusion, how much has the selection been biased, and what has been done to account for it? Yes, this posting does a good job of saying that there's a different relation between a poll, which attempts to detect things like what a population believes, and a correlation study, which attempts to detect relationships between characteristics and perhaps generalize from them. But this cannot mean selection bias isn't important in correlation studies. It only means it plays a different role. If a sample is very biased with regard to the matter under study, that is going to distort the result. And in a good correlation study, measures would be taken to account for the inevitable selection bias, as well as, to the extent feasible, to minimize it. But as with polling, it's all a matter of degree.

As the posting convincingly points out, selection bias is always present to some degree in both these areas, polling and studying correlations. I'm not sure what is gained by trying to say that selection bias per se is a big deal in polls but not a deal at all with correlations, except to create a false dichotomy in support of ignoring some selection bias and overvaluing other selection bias.


Information is information. Just modulate how you take it based on stuff like selection bias and sample size.


I would recommend deeper research before asserting an opinion. This is math-related, and in that field there is no room for opinion; hypotheses, yes. A reading I recommend as a starting point is the book Seeing Through Statistics.


It kinda varies. When Aella does an "imagine a random number generator" type study, there's probably some selection bias, but no worse than academic studies. When she tries finding correlations with something like "do you think abortion should be legal | do you support bestiality", her audience bias is much, much worse.


The alternative I'd compare it to is something like how opinion polling is done, where they put a major effort into getting demographically representative population samples and/or weight the final result proportionately. Obviously there's some debate about the best way to do this, but the general technique is accepted.


Stop Confounding Yourself <-> Selection Bias Is A Fact Of Life <-> On Bounded Distrust

Probably forgetting some other related posts, like anything which references the Elderly Hispanic Woman Effect. Yes, I remember that bit. Anyway, sorta feels like there's a more general principle that ought to cover all these cases, without going *too* meta (e.g. Knowing About Biases Can Hurt You, which of course is on LW).

Also, real-life studies suffer from selection bias - they disproportionately exclude Very Online people who'd never see a physical bulletin board. (Does it count if a real-life study recruits people using the internet, or vice versa? Maybe that's the secret!)


This is a horrible take. She isn't just hampered by selection biases, she's actively engaging with people so as to seek certain results. It's bunk science methodology that politicians and corporations might use to discredit narratives. If you want to do sociology, just do sociology; don't pretend you can generalize to all humans because EvErY oNe ElSe Is DoInG iT! lol


You cannot eliminate selection bias completely (unless you have literally all people in a database and can force the randomly selected ones to take your test), but depending on how you design the study, the bias can be smaller or greater.

I think the argument is that on the scale from smaller selection bias to larger selection bias, Aella is not even trying.


I read “this internet study has selection bias” as “some subset of users are likely gaming your survey to produce amusing results.” Any system that doesn’t have robust anti-trolling systems in place is open to “Boaty McBoatface” brigade attacks or script kiddies. Is this an actual problem in your results? Given the way your surveys work I’d guess not but I think Aella’s Twitter poll format is more vulnerable.


Would this post: "Will people look back and say this is where ACX jumped the shark? Let's do a poll." meet the 2 of 3 criterion?

I find it useful to frequently come back to W. E. Deming's important paper: On Probability as a Basis for Action, The American Statistician, November 1975, Vol. 29, No. 4

https://deming.org/wp-content/uploads/2020/06/On-Probability-As-a-Basis-For-Action-1975.pdf


You are missing collider bias, where the selection mechanism induces a correlation. The classic example is the correlation between good looks and acting ability among Hollywood actors. You only get to be a Hollywood actor if you are good-looking or a good actor, so we don't see any bad-looking bad actors, and there is a negative correlation between ability and looks that need not hold in the general population.


Imagine a medieval peasant hearing that people will get obese because of poverty.

A few years ago, I lost 15 kg when my bank accounts were blocked and I had only a bit of cash to buy food with for some months.

btw, how do you pronounce 'Aella'?


I've seen Aella make the claim that "Aella's audience that responds to Aella's surveys" is pretty close to equivalent to other populations, but I'm not sure I buy it - "very online > twitter > rat adjacent > Aella" is a pretty strong filter; I'd expect among other things the normal skew towards "more autistic than most groups" that you end up seeing when you survey most rat populations. Ditto "is likely to be pretty accepting of a wide variety of sex stuff" and similar.

She claims this isn't a problem, but the *way* she claims this bothers me a bit. Here's two quotes from her main article on this (https://aella.substack.com/p/you-dont-need-a-perfectly-random):

***And key to this, I can see how their responses differ. I have a pretty good grasp on the degree to which “people who follow me” is a unique demographic. And surprise - for most (though admittedly not all!) things I measure (mostly sex stuff, which is the majority of my focus), they’re very similar to other sources.***

***I also am really familiar with my twitter follower demographics, so I can anticipate when stuff might be confounded or warped due to selection bias.***

These kinds of statements are essentially her asking me to trust her, but my general impression of Aella is that she's extremely eager to prove that most out-there sex stuff is very, very healthy and good and you should definitely be weird topical sex/relationship outlier X. I don't particularly trust her to be perfect at factoring this out.

I don't think this is unique to Aella - basically everything I've said here replicates in my views on, say, most diet studies/surveys, or any survey I see about some overton-window friendly marginalized group. I'm in the rarer "surveys in general are trash" group of people.

I think that gets worse when you start to get into the *kind of stuff Aella asks about*. Most people don't care a ton about how many bananas they eat - it's sort of a factual thing. And you can test them for IQ, so their bias doesn't enter into that part of it as much (or at least doesn't have to). But Aella asks questions that often boil down to the general sphere of "Is polyamory great, and should everyone do it?" - questions that you'd expect to interact a lot with people's self-worth and tribalism.

To put that another way, I suspect that Aella's following is highly:

1. Autistic

2. Attracted to Aella specifically and trying to get her attention

3. Sexually liberal

And she asks a lot of questions that interact with that. I *do* expect that someone following Aella is more likely to want to impress her than most, and that they aren't unaware that she's incredibly pro-weird-sex-stuff. I *do* expect they are more autistic than most and tend to approach potentially disturbing/provocative questions more analytically than most. I do expect there's a greater amount of people who would be reluctant to report that their experiences with weird sex/relationship thing X have been negative because it would be letting other-tribe have an opportunity to count coup.

Again, this isn't unique to her. But her being the example at hand is a way for me to talk about how much I distrust surveys in general.


Very reasonable writeup.

I don't see a couple of real issues being addressed:

1) Structural biases due to the differences in social and/or economic class between the average online user vs. the overall population, and

2) Structural biases due to the web sites/email lists involved.

The former is an issue because the average online user is significantly wealthier and more educated than the overall population, and that has an impact. Income differentials introduce large skews in health, in political views, etc., etc.

The latter is an issue because no web site or mass emailing is likely to be random even above the inherent online vs. overall skew. Just as the average Fox online viewer is different than the average CNN online viewer - every web site has a largely self-selected population of like thinkers. Email lists also derive from something, somewhere and are just as likely to contain inherent structural biases.

This doesn't invalidate your main points but the different types of subtle structural fingers on the scales are very potentially problematic.


Maybe it's me, but I don't recall having seen many criticisms of Aella's or Scott's polls that don't suggest any mechanism by which the selection bias could be affecting the results. And I wouldn't expect tweets to flesh out every argument in the mind of whoever wrote them.

But yeah, I get that published psych papers deserve more scrutiny, that this can be frustrating, and also that no information should be fully discounted just because one can think of a possible bias.


...so is this article an argument that everyone needs to specify "relevant selection bias" instead of just "selection bias"? Would that extra word satisfy the complaint?


The thing is, the questions Aella asks really are polls, so the answers she's going to get are going to be subject to selection bias. And tbh, I'd expect Aella's followers to be a particularly unrepresentative group, because Aella herself is deeply weird.

But to be fair to Aella, she's smart, and I'm sure she realises that the results from her survey questions don't generalise to wider society. Probably the response she'd get from your average normie is "why the hell should I care about this dumb hypothetical scenario?"


> Sometimes the scientists will get really into cross-cultural research, and retest their hypothesis on various primitive tribes - in which case their population will be selected for the primitive tribes that don’t murder scientists who try to study them.

Given how (relatively) frequently scientists try this, and how few people live in primitive tribes, does that mean some primitive tribes are spending a significant part of their day-to-day life responding to scientific surveys?


Selection bias is correlated with the topic of the online survey. When you post a survey online, it gets passed around people with an interest in the topic you're surveying. If that banana-IQ survey gets passed around a forum dedicated to the banana-IQ hypothesis and populated by people who are going to give 110% on the IQ test because they care a lot about the topic, you have a problem with bias that a static group selection will never have.

This is actually great if you want to say, find out what the other beliefs of the banana-IQ believers are, but you can't test the banana-IQ hypothesis that way.


"I think these people are operating off some model where amateur surveys necessarily have selection bias, because they only capture the survey-maker’s Twitter followers, or blog readers, or some other weird highly-selected snapshot of the Internet-using public. But real studies by professional scientists don’t have selection bias, because . . . sorry, I don’t know how their model would end this sentence."

Certainly there are many people who are inconsistent on this, but "it's fine because academic psychology does it" is only valid if academic psychology is actually fine. As a wise man once argued in "The Control Group Is Out Of Control" (https://slatestarcodex.com/2014/04/28/the-control-group-is-out-of-control/), academic psychology is not fine, and its methods and epistemics don't even suffice to disprove psi powers.

Even when used well, the method of running surveys to uncover psychological truths is generally pretty weak, and it is extraordinarily hard to use well. Most attempts produce noise and nonsense. This is as true of Twitter surveys as it is of academic surveys.


Here is a study that successfully replicated studies based on convenience surveys, using representative sampling: https://www.pnas.org/doi/full/10.1073/pnas.1808083115.


I think this is the first time I’m sure Scott is completely wrong about something.

Aella polls are completely and totally useless. If you take a sample of psych undergraduates, your result will be biased. But there’s a lot of diversity within that population of psych undergraduates. The results may not generalize, but if something is true of that population there’s good reason to think it may be true of people in general (obviously more representative samples are better).

In contrast, the type of people who follow and engage with Aella are nearly a distinct population. Her content is directed at such a strange consortium of techno-optimists/crypto people/sex workers/intersectionalists/etc. that none of her poll results have any validity whatsoever. The type of person who regularly answers an Aella poll is simply built different; they are not representative of any population except themselves.

You can’t just throw your hands up in the air and say “Well, everything is biased anyway who knows what the truth is.” If I take a survey in front of the Hershey’s chocolate factory asking people what their favorite candy is, it is a shit survey. If you don’t try to control bias you might as well not bother doing a study.


Thanks, Kevin. Though I don't share your opinion. After all, how can you tell there is not any bias? My comment was pointing to the fact that correlations are hard to detect, but they will always appear.
