
Relevant to AI forecasting: Someone set up a bot that queries the GPT-4 API and uses its responses to bet on Manifold Markets. It leaves a comment with every bet explaining its reasoning.

It's... not great. Currently in the negatives, and doesn't look to be coming back from that any time soon. But this is comparable to the average Manifold user, so I think it could reasonably be called a "human-level" forecaster.

https://manifold.markets/GPT4
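For anyone curious about the mechanics, the core loop of such a bot is easy to sketch. Here's a minimal sketch in Python, assuming the public Manifold REST API and the OpenAI chat-completions endpoint; the exact endpoint paths and payload fields are from memory and worth double-checking against both APIs' docs, and the probability parsing is deliberately naive:

```python
import requests

OPENAI_KEY = "sk-..."    # placeholder
MANIFOLD_KEY = "..."     # placeholder

def gpt4_probability(question: str) -> tuple[float, str]:
    """Ask GPT-4 for a probability plus brief reasoning."""
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_KEY}"},
        json={"model": "gpt-4",
              "messages": [{"role": "user", "content":
                  f"Start your answer with a probability between 0 and 1, "
                  f"then give one sentence of reasoning: {question}"}]},
    ).json()
    text = resp["choices"][0]["message"]["content"]
    return float(text.split()[0]), text  # naive parse; a real bot needs more care

def place_bet(contract_id: str, market_prob: float, question: str) -> None:
    """Bet toward GPT-4's probability on a Manifold market."""
    model_prob, reasoning = gpt4_probability(question)
    requests.post(
        "https://api.manifold.markets/v0/bet",
        headers={"Authorization": f"Key {MANIFOLD_KEY}"},
        json={"contractId": contract_id, "amount": 10,
              "outcome": "YES" if model_prob > market_prob else "NO"},
    )
    # Posting the reasoning as a comment uses a separate endpoint
    # (/v0/comment at time of writing), omitted here for brevity.
```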


>Actual GPT-4 probably would just give us some boring boilerplate about how the future is uncertain and it’s irresponsible to speculate. But what if AI researchers took some other model that had been trained not to do that, and asked it?

Sounds like a job for a certain Chad McCool...


Who decides architectural significance? And how intact must it be?


Highly relevant for the next ~day.

Please buy 1 No share on this Manifold market if you want prediction markets to succeed: https://manifold.markets/IsaacKing/will-the-whales-win-this-market#R1Q3FCdYceg2c55fvpw8

Isaac King (the guy in the post below, sorting chronologically) just dropped $22k to win this prediction market. All of that money is going to Manifold. If you bet $1 on No, he'll have to spend $100 more on Manifold. Five seconds of work means $100 for Manifold Markets, and this way they'll have more runway.


Re: Autocast. Interestingly, you can't just run GPT-3 or GPT-4 on their dataset to get a better score. The rules preclude this because those models were trained after the resolution dates of the curated forecasts, meaning their training data is likely tainted with the correct outcomes. I still agree more powerful LLMs will forecast more accurately, and I'm especially excited about the parallels between "let's think this through step by step" and Fermi estimation. I wrote about these and related topics here: https://damienlaird.substack.com/p/research-forecasting-with-large-language
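To make that leakage rule concrete, here's a small sketch of the kind of filter you'd want before evaluating a model on forecasting questions; the cutoff date is the approximate public figure for GPT-4, and the records are made up:

```python
from datetime import date

GPT4_TRAINING_CUTOFF = date(2021, 9, 1)  # approximate public training cutoff

questions = [  # illustrative records, not real Autocast data
    {"q": "Will X happen by 2021?", "resolved_on": date(2020, 6, 1)},
    {"q": "Will Y happen by 2023?", "resolved_on": date(2023, 1, 15)},
]

# Questions that resolved before the cutoff may appear in the training data,
# so the model could simply have memorized the outcome; exclude them.
fair_eval_set = [q for q in questions if q["resolved_on"] > GPT4_TRAINING_CUTOFF]
```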


Talking of bets, anyone want to post another attempt at those stained glass windows Scott made a bet about, with today's models? I'd love to see what Midjourney v5 cooks up.


Prediction, especially based on detection of subtle patterns in enormous noisy data sets, is certainly a good idea for neural network models -- indeed, that's kind of the reason they were invented. But it seems like if you train it on *only* data before "the present" and you reward it for predicting stuff that humans are likely to want to hear, you're going to get the conventional wisdom of "the present" ipso facto -- which is boring. You could have gotten that a lot cheaper with a poll.

On the other hand, if you train it to predict what humans "in the future" (once the prediction has or hasn't come true) will want to hear, then you could indeed be training the network usefully. That seems quite promising: a way to leverage this enormous text-data collection to do something far more practically useful than making a talking robot.
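One way to make that second idea concrete: score each forecast against the eventual outcome rather than against present-day opinion, e.g. with the Brier score (squared error; lower is better). A toy sketch with made-up numbers:

```python
def brier(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and the realized 0/1 outcome."""
    return (forecast - outcome) ** 2

# A confident consensus call that turned out wrong vs. a hedged contrarian one:
print(brier(0.9, 0))  # 0.81 -- heavily penalized once the future arrives
print(brier(0.3, 0))  # 0.09 -- rewarded for having doubted the conventional wisdom
```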


>We’ve talked before about LLMs playing chess; they can sort of do it, but they’re not very good yet. The market thinks 34% chance they’ll get much better in the next five years; I think my estimate is lower.

Can LLMs exceed the performance of human experts? Even if they are trained on data from experts, surpassing that level would require predicting something *different* from what an expert would do. But LLMs are trained to predict the "next move" [token] as accurately as possible given their training data.


My take on the utility of a super-predictor (like a really, really good one) is that it would be used as an interdiction layer between humans and all uses of AI: predicting whether something really bad would happen as a result of a given request, and then denying that specific request.


". . . to decide if some evidence bears on a forecast. . ."

Maybe we should just ask the evidence bears. But we might have to stipulate an anonymity claws.


Re: banning mifepristone, it will depend how you define "nationwide".

If that means "every state in the Union", then unless you imagine that by some fluke California agrees with a ban, the winning bet is "no", and Gavin is doing his best to safeguard the nation against this:

https://www.gov.ca.gov/2023/04/10/california-announces-emergency-stockpile-of-abortion-medication-defending-against-extreme-texas-court-ruling/

"Governor Newsom announced that California has secured an emergency stockpile of up to 2 million pills of Misoprostol, a safe and effective medication abortion drug, in the wake of an extremist judge seeking to block Mifepristone, a critical abortion pill.

California shared the negotiated terms of its Misoprostol purchase agreement to assist other states in securing Misoprostol, at low cost.

While California still believes Mifepristone is central to the preferred regimen for medication abortion, the State negotiated and purchased an emergency stockpile of Misoprostol in anticipation of Friday’s ruling by far-right federal judge Matthew Kacsmaryk to ensure that California remains a safe haven for safe, affordable, and accessible reproductive care. More than 250,000 pills have already arrived in California, and the State has negotiated the ability to purchase up to 2 million Misoprostol pills as needed through CalRx. To support other states in securing Misoprostol at a low cost, California has shared the negotiated terms of the purchase agreement with all states in the Reproductive Freedom Alliance."

If, however, some jiggery-pokery with definitions goes on, such that "by 'nationwide', I meant at least one state in the West, one in the Middle, and one in the East, that is, geographically extending from one coast of the country to the other", then it could happen.

But I still think "no" is the way to bet.


Ask AI to forecast the likelihood of consciousness if a trillion data points were achieved. Let's take the Max Tegmark & Future of Life AGI pledge & pause six months to get our feet under us.


I'm not an AI expert, but I naively expect an LLM to perhaps asymptotically approach, but always do worse than, the wisdom of the crowd at predicting the future. The reason is that an LLM has no model of the outside world, just a model of human language. I could believe in some hand-wavy way that a large and advanced enough LLM might approximate the wisdom of the crowd, but without a true model of the outside world I don't think it would ever surpass it.


If AIs are so damn smart, why can't Google or MS come up with a spell checker or autocorrect that actually works?

Mine is worse than random chance, and worse still, it keeps changing correct text to incorrect.


Not covering the whale/minnows drama on Manifold is a big miss, Scott


Reminder that Futuur currently has over 700 open real-money markets on a lot of these topics.

https://futuur.com

Also, we just launched our beta API (request access on your Futuur settings page or ping me directly).

I expect there will be a lot of interesting trading opportunities, both via arbitrage, and leveraging the new AI models.


> This is my Long Bet with Samo Burja - the resolution criteria are slightly different, but close enough to make me feel a little more confident I’m on the right side.

The way this and your other bet are worded has me slightly confused. "Something comparable to GT from slightly before GT" seems plausible to me in a way that "100,000-year-old Ice Age civilization that taught the Egyptians how to make pyramids" doesn't.


Not that I think it can't do it, because LLMs seem to keep making progress that people said they couldn't, but how exactly would an LLM come to play chess better than a grandmaster?

I can understand, as a first step, recognizing the input as a request for chess moves and outputting things that look like chess moves. Then, as a second step, recognizing the patterns for what makes something a legal chess move, and not just a capital letter followed by a lowercase letter from a to h followed by a number from 1 to 8 (see the sketch at the end of this comment).

Then even as a third step recognizing a connection between the prompt and the idea that they're supposed to be good chess moves, along with recognizing what is considered by the stuff in the training data to be a good chess move.

But how do you get from that to beating a grandmaster? Unless the dataset is filled with games played better than grandmaster level, and labelled as such, but that doesn't seem to be the case now. Maybe if Google dumps like 1 billion AlphaZero chess matches onto an online database somewhere?

This also leads to a related question, which is how reliably it can know which moves are the good ones. For something like the scholar's mate, presumably most of the references to it in the training data are near words like "bad" and "stupid" and "don't do this", so it can tell it's bad. But it's not clear to me that this same mechanism would distinguish the best moves from the merely good moves (which is really a broader question than just chess).

If I dump 50 billion chess matches onto an online database but they're all shit, along with a bunch of (AI-generated of course) commentary of "oh what a brilliant move here", would that make GPT-5 really bad at chess?
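Continuing the second-step point above: the gap between "looks like a chess move" and "is a legal chess move" is easy to demonstrate. A minimal sketch assuming the python-chess package; the candidate strings are made up:

```python
import re
import chess

board = chess.Board()  # standard starting position

def looks_like_a_move(text: str) -> bool:
    """Pure pattern-matching: optional piece letter, then file a-h, then rank 1-8."""
    return re.fullmatch(r"[KQRBN]?[a-h][1-8]", text) is not None

def is_legal_here(text: str) -> bool:
    """Actual legality: parse as standard algebraic notation on this board."""
    try:
        board.parse_san(text)
        return True
    except ValueError:
        return False

for candidate in ["e4", "Nf3", "Ke5", "a9"]:
    print(candidate, looks_like_a_move(candidate), is_legal_here(candidate))
# "Ke5" matches the surface pattern but is illegal from the start; "a9" fails both.
```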


I hate to see "anything can happen" getting made fun of because, to a first approximation, it's often correct. Other than physical and logical impossibilities, we're not usually putting zero probability weight on the things we think won't happen. I should think that anyone worried about existential risks would be well aware of the importance of low-probability scenarios?

Furthermore, most of the time we communicate and reason using metaphor and analogy. They're good for imagining possibilities you might not have considered otherwise. They're pretty rubbish for calculating probabilities or ruling things out. Even people who like math are most often using math as metaphor. Have you calculated anything using a prior probability today?


I had the same idea w.r.t. testing LLM predictions on already-past events a few days ago, and quizzed GPT-4 on the first dozen significant-seeming questions that occurred to me: https://twitter.com/glaebhoerl/status/1649547678718500866. Not systematic or scientific in any way, unlike the paper! I hope someone puts in the elbow grease to see how newer and more capable models perform.


So two things about the chess LLM question:

1. There is a large difference between an LLM that *consistently* plays at the grandmaster level and one that can beat a grandmaster once, given an extremely large number of chances (including situations where the grandmaster is not playing at normal strength due to being drunk/not paying attention/etc.).

2. An LLM that consistently plays at the grandmaster level (Elo > 2500) would almost have to be a superintelligence, right? If an LLM can play chess (a game that doesn't lend itself well to an LLM's text-based information processing) at an elite human level, then it would almost surely be the best programmer in the world, probably by a significant margin.
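To put rough numbers on point 1, Elo's expected-score formula shows how different "beats a grandmaster once in many tries" is from "plays at 2500". A sketch; it treats expected score as a per-game win probability (which overstates wins, since Elo's figure includes draws), and the game count is arbitrary:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Elo expected score for a player rated r_a against one rated r_b."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

p = expected_score(2000, 2500)   # ~0.053 per game for the weaker player
n = 200                          # arbitrary number of attempts
print(1 - (1 - p) ** n)          # ~0.99998: one lucky win is nearly guaranteed,
                                 # yet a 2000 player is nowhere near 2500 strength
```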


>Actual GPT-4 probably would just give us some boring boilerplate about how the future is uncertain and it’s irresponsible to speculate. But what if AI researchers took some other model that had been trained not to do that, and asked it?

How unaligned AGI got created because Scott wanted to test an LLM in a prediction market and forgot to be paranoid.

Only kidding, I don't think anything of the sort is likely, whether for this or any other reason. I just thought it was funny.


I feel forecasting is AGI-hard? For questions already discussed in public ("will X invade Y this year"), the best LLMs can probably do is search for op-eds on the question and compute a weighted average over them, and in that pipeline the LLM is only really needed to convert each opinion piece into a probability. If we had a neural net whose world model was good enough that it could predict as well as a superforecaster, that already would seem somewhat x-risky?
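A toy sketch of that pipeline, with llm_to_probability() stubbed out as a stand-in for "ask the LLM to read an op-ed and output P(event)"; the texts, weights, and numbers are all made up:

```python
def llm_to_probability(op_ed_text: str) -> float:
    # Stand-in for an LLM call; stubbed so the sketch runs.
    return 0.3 if "unlikely" in op_ed_text else 0.7

def aggregate(op_eds: list[tuple[str, float]]) -> float:
    """Weighted average of per-source probabilities (weights = e.g. track record)."""
    total = sum(weight for _, weight in op_eds)
    return sum(llm_to_probability(text) * weight
               for text, weight in op_eds) / total

op_eds = [("Invasion this year seems unlikely...", 2.0),
          ("All signs point to escalation...", 1.0)]
print(aggregate(op_eds))  # (0.3*2 + 0.7*1) / 3 ≈ 0.433
```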

The question about the biggest twitterer creating a question seems silly? It would be like betting on boxing-match outcomes when it's acceptable for contestants to bet against themselves and fake a knockout: no rational non-contestant would take that bet. Given that this is all play money, nobody cares, I figure?

As a PR move for the platform, it is brilliant. It is basically a welcome gift for the biggest twitterer who spends five minutes to create some market.


Further on the mifepristone issue, Manifold also has this market which currently gives a 35% chance that SCOTUS overrules the Texas decision (and by implication gives a 65% chance that it is upheld): https://manifold.markets/BTE/will-the-supreme-court-reverse-the-304d1ed78694

Putting that and the other question together, it seems like the market expects mifepristone to be banned nationally, but for the case not to be resolved before 2024.
