Transcript

Disclaimer: The transcript that follows has been generated using artificial intelligence. We strive to be as accurate as possible, but minor errors and slightly off timestamps may be present.

Jeremie Harris (00:00):

Hi everyone and welcome back to the Towards Data Science podcast. And before we dive into things, a quick reminder that Towards Data Science is always looking for great submissions about machine learning, data science, and analytics. So if you’ve been wondering about whether you should put together that blog post on your research project or your cool data visualization work, then please, please do. We’d love to see it. Okay, onto the show here, and this is going to be a really fun one. Today, I got to speak to Ala Shaabana and Jacob Steeves, two machine learning researchers with world-class pedigrees, who decided to build a company that puts AI on the blockchain. Now, to most people, myself included, AI on the blockchain sounds like a winning entry in startup buzzword bingo. But what I discovered talking to Jacob and Ala was that they actually have good reason to combine those two ingredients. At a high level, doing AI on a blockchain allows you to decentralize AI research and reward labs for building better models, not for publishing papers in flashy journals with often biased reviewers. And that’s not all. As you’ll see, Ala and Jacob are taking on some of the thorniest current problems in AI with their decentralized approach to machine learning. Everything from the problem of designing robust benchmarks to rewarding good AI research, and even the centralization of power in the hands of a few large companies building powerful AI systems: these problems are all in their sights as they look to build out Bittensor, their AI-on-the-blockchain startup. Ala and Jacob joined me to talk about all these things and more on this episode of the Towards Data Science podcast.

(01:43):

I’m really excited for this chat. And just for a little bit of background for our listeners here, because there’s a little bit more backstory to this than the average podcast in a way. So Ala, you were a researcher at FOR.ai, which is a company that, for listeners, was co-founded by a co-inventor of the transformer architecture, Aidan Gomez, a co-author of the “Attention Is All You Need” paper and friend of the show, I’m pleased to say. And Jacob, you were at Google before this, which is a company co-founded by Eric Schmidt, also friend of the show, I’m sure. Hi, Eric. Okay, so you guys came together somehow with those two different backgrounds, and you co-founded this company Bittensor, and we’ll get to what Bittensor is and what it’s doing. I think it’s really cool, and there’s a lot of stuff for everybody to learn on this. But can you just start us off with the story of how you met and how two software engineer, ML engineer people started to work on something at the intersection of machine learning and crypto?

Ala Shaabana (02:38):

I can jump on this one, Jake, unless you want to take it. So actually, Jacob was also a researcher at FOR.ai. We kind of met through Aidan Gomez at FOR.ai as well. So that was kind of the interesting part of it. I had joined FOR.ai right after finishing my tenure at VMware. I had been working on distributed computing problems and effectively, you know, something in the shadow of what Bittensor is. And Aidan thought it was a good idea for me to give a pitch of what I’d been working on and what I would love to jump onto and stuff like that.

(03:08):

And I had just joined and I decided to kind of pitch what I was doing. And then Jake messages me and he’s like, hey, I’ve been working on this problem for a lot longer than you have. Come join me because I’ve kind of got like a nice prototype running and let’s see how this goes. And we kind of just hit it off from there and things kind of just ran quite well since then. And effectively just ended up leaving my full-time job to pursue this full-time. And yeah, things have been cracking since. Aidan actually introduced us, which is interesting.

Jeremie Harris (03:33):

Oh, very cool. So the union was blessed by Aidan.

Jacob Steeves (03:37):

Blessed by the godfather, Aidan, exactly.

Jeremie Harris (03:39):

Excellent. All right. Glad you got that endorsement, too. That’s a very powerful thing. So the approach I was thinking about for this podcast ahead of time, because full disclosure, as I mentioned before we started recording, I know very, very little, dangerously little, about the blockchain, about cryptocurrencies, blah, blah, blah. And so what I’d love to do is just start with what you think are the basic ingredients that listeners should be aware of, that we’re going to build on as we start to explain the problem you’re tackling and how you’re tackling it. So maybe, Jacob, I’ll turn to you. Can you introduce those ingredients?

Jacob Steeves (04:13):

I mean, so if you look at blockchain, you can see a lot of different components there. There’s decentralization and censorship resistance, and that has a kind of libertarian aspect. But what was also invented by Bitcoin was this concept of digital trust, and the ability for computers to reach consensus about this thing, about money, right? And that allows us to build incentive mechanisms or markets in a digital space that never could exist before. So blockchains are basically tech tools for having computers reach consensus about something.

(04:48):

And in Bitcoin, that’s the state of the blockchain. And one way to think about it is that this is kind of like a market for hashing power. So that’s the bridge between blockchains and what we do. We go, hey, look, Bitcoin’s this market for hashing power. It’s the largest in the world. And we’re going to take that same technique of building these global markets, which grew Bitcoin to this global size, and apply it to another computational problem, which also has its own digital commodity, digital intelligence, and basically use markets to create it. OK. So that’s where blockchains come into Bittensor.

Jeremie Harris (05:29):

And sorry, do you mind if I just ask for the explain-it-like-I’m-five version of a couple of those concepts? Because I love that high-level thing and I think it’s a great motivator. So we have the sense in which some measure of trust and reliability, the automation of trust and so on, is going to be important to the story. But I’d like to take a step back just for people who hear blockchain, hear decentralization, hear all these terms, and they’re trying to connect it to, like, what is the thing that’s sitting on the database somewhere? Or how does a transaction work, at that kind of basic level, if possible?

Jacob Steeves (06:07):

Sure, that’s good, I can explain it like that. So essentially, a blockchain is just an append-only database.

(06:15):

It’s distributed across many computers and they effectively vote on what the next thing in the database is going to be. In Bitcoin, they vote with computational power, and in a proof of stake system, they vote with monetary wealth, and that’s essentially the core of the system. Now the reason why there needs to be this kind of voting is that it’s hard to determine who is legitimate inside of the Internet. IPs are basically infinite and can be spoofed. So what is the thing against which you could vote in this decentralized way to determine the state of the system? And that’s really important if you want to build trust about who’s good and who’s bad, for instance, or which transactions came before which transactions in the case of Bitcoin. So that’s sort of the concept of a blockchain. Usually the way in which data is appended is in terms of blocks of data, not a single edit. So that’s where the block comes from. And the fact that it’s append-only, that things get added one at a time in a forward manner, makes it a chain where every piece of data locks itself to the previous one. So you can’t edit the history of the network. This allows us to create state which everybody can agree on in a decentralized way on the Internet, which is the foundation for smart contracts and also what we do at Bittensor.
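To make the "every piece of data locks itself to the previous" idea concrete, here is a minimal Python sketch of an append-only hash chain. It is only an illustration: the block layout and the function names (`block_hash`, `append_block`, `is_valid`) are assumptions made for this example, and it leaves out the voting and consensus part entirely.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Hash a block's full contents (hypothetical block layout)."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain: list, data: list) -> None:
    """Append a block that stores the hash of the block before it."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "data": data, "prev_hash": prev})

def is_valid(chain: list) -> bool:
    """Check that every block still points at the real hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain = []
append_block(chain, ["alice pays bob 5"])
append_block(chain, ["bob pays carol 2"])
append_block(chain, ["carol pays dave 1"])
```

Editing any earlier block changes its hash, so the next block's `prev_hash` no longer matches and `is_valid` returns `False`. That is the sense in which history cannot be quietly rewritten.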

Jeremie Harris (07:40):

Okay, perfect. So thank you, by the way, for that 101. I think it’s just really helpful to have these ideas in mind. So, okay, this database, there are copies of it, of course, stored on a whole bunch of different computers. In the case of some blockchains, partial copies. In the case of others, I guess, full copies, or that would be an option, I suppose. Okay, so this is the starting point. How do we get from this to what you’re doing at Bittensor? How does this connect to machine learning?

Jacob Steeves (08:07):

Right, exactly. It’s a really good question. So, I mean, I kind of touched on it in the beginning. You know, as many of the viewers may know, the largest supercomputer in the world is Bitcoin. It’s like 500 times bigger than all of Google’s data centers. Magnitudes larger. And that’s astonishing. I worked at Google and I remember, when they finished building a data center, they would just start building another data center right beside it. That’s the amount of power that this decentralized system has been able to extract. It’s actually extracted this computational power from the world.

Jeremie Harris (08:40):

And sorry, is this just from laptops, like my laptop and so on? Or GPUs that are mining?

Jacob Steeves (08:48):

It started that way. But because Bitcoin created a market which was very precise in what it needed, it created a very efficient market where anybody in the world could work on the arbitrage opportunities. They could find some energy that’s not being used, or they could discover a different way of solving this hashing problem using a specific type of chip. And basically, you unlock the power of the hive mind to find these arbitrage opportunities across the globe, and that creates this massive network. Now, today, nobody uses laptops because it’s inefficient, right? Bitcoin is really mined next to a freezing river in Finland or using geothermal energy in Iceland. And the reason is that that’s just the most efficient place to put these computational tools. That’s really what we’re tapping into with Bittensor.

(09:38):

We’re going, hey, look, the power of markets, the power of arbitrage, the power of the hive mind, the global hive mind in a borderless decentralized system, is incredibly powerful. We can get to Google and OpenAI size in terms of neural networks and in the production of machine intelligence if we can use the same tooling: the creation of a market that gives anybody, any engineer in the world, the ability to go, hey, look, I’ve got a computer. I’m going to run it next to a freezing river in Finland or in a volcano in El Salvador. And it’s just going to churn away solving a specific aspect of this larger problem, which is the understanding and the creation of intelligence.

Jeremie Harris (10:23):

OK, great. So here’s my understanding so far and where I suspect you’re going with this, and I’m curious if I’m completely wrong. So right now, the way I picture it, there are a bunch of GPUs in these giant clusters, like you say, next to geothermal vents or whatever in Iceland, not yet on Bittensor, but potentially. And currently systems like that are mining Bitcoin. And the way they’re doing that is by just guessing random numbers to see if they end up guessing the right random number to, for various convoluted reasons, lock in the next block in the blockchain. And that is part of the whole trust thing. That’s great. But basically, these things are just guessing random numbers. And so my gut says these two guys are about to propose a way that maybe we can get these things not just to guess random numbers, but to spit out computations that are more useful than that. So they’re not just burning up our atmosphere with CO2 emissions by doing random crap, but specifically human-value-generating crap. Is that fair?
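The "guessing random numbers" step can be sketched as a toy proof-of-work loop in Python. This is a simplified illustration, not Bitcoin's actual scheme (which uses double SHA-256 over a block header and a numeric target): the `mine` and `verify` functions and the "leading hex zeros" difficulty rule are assumptions made for this example.

```python
import hashlib

def digest(block_data: str, nonce: int) -> str:
    """SHA-256 hex digest of the block data combined with a nonce."""
    return hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()

def mine(block_data: str, difficulty: int) -> int:
    """Try nonces 0, 1, 2, ... until the digest starts with `difficulty` zeros."""
    nonce = 0
    while not digest(block_data, nonce).startswith("0" * difficulty):
        nonce += 1
    return nonce

def verify(block_data: str, nonce: int, difficulty: int) -> bool:
    """Checking a claimed nonce takes a single hash, however long mining took."""
    return digest(block_data, nonce).startswith("0" * difficulty)

nonce = mine("block 42", difficulty=3)
```

Finding the nonce takes thousands of attempts on average at this difficulty, while `verify` needs one hash. That asymmetry is why hashes are cheap for everyone to agree on.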

Jacob Steeves (11:28):

Right. Yeah, exactly. I mean, if you look at machine intelligence, you look at a neural network, right? You can break it apart into different components. And those components are adding something to the system, right? For instance, some of your viewers probably know what a mixture of experts model or an ensemble model is, right? In an ensemble model, you have a specific part of the machine learning model that is producing a representation of the input. It’s like, here are the features of the image, or this is the semantics of some text. And that representation is valuable. And so that’s the digital commodity that we’re working on. It’s literally the representation of inputs.

(12:09):

And so in Bitcoin, the system is designed in such a way that this digital commodity is very easy to validate, right? Hashes are immediate to validate and everybody can agree, whereas the value of a representation of an input is not so obvious, right? So our consensus mechanism is designed to work with that sort of fuzzy data inside a machine learning system. But essentially, these GPUs are learning from, for instance, example text to produce something like the representations created by GPT, right? When you query GPT, you get representational knowledge. So, hey, what do you think of this text? And it gives you this representation in a tensor. And that’s very useful for solving machine learning problems. And so that’s the digital commodity that we focused on, because we think that representational knowledge is the most foundational thing in AI. All of the other problems, specific supervised problems, derive themselves from first asking, what is the meaning of this?

(13:12):

Without an understanding of the task. So specifically, your viewers will know what unsupervised learning is, and that’s what we’re working on.

Jeremie Harris (13:18):

OK, great. And so just to double check, triple confirm here, for listeners who maybe focus on the more classical side of data science and do less neural network stuff: in a neural network, you know, as stuff propagates down the network, at a certain layer fairly deep, you end up with a good representation of the thing, like a list of numbers that represents, in some sense, the meaning of the thing you’ve sent in. And then, let’s say, the puck gets put in the net by a couple of layers, or maybe one layer, just above that, which uses that representation to draw a conclusion, like, I think this is an image of a dog or a cat or whatever.

(13:53):

And now you’re looking at that representation and saying, OK, all the value’s there, because we really just kind of do a little tweaking on top of it to get the thing we ultimately want. So maybe the main thing of interest is that. OK, so how do you convince people? I’m not even sure what the right question is. Like, how does this work? Are you trying to get a bunch of GPUs, or people, to agree on a particular representation to validate it? Or what’s the value generation part of this?

Jacob Steeves (14:23):

Yeah, really good question. So of course, in this decentralized system, we need to reach agreement, because functionally we need to determine what the incentive is, who’s going to get money and who’s not. So we need to determine that. And that comes down to this problem of consensus, which is much more on the blockchain side. Like, how do we determine who’s got useful information? And so the way that we do that is, you could say we have two types of miners: those that are producing the knowledge, producing those representations, which are kind of like maps of the territory.

(14:59):

And then you have computers that are validating that knowledge, so they’re going, hey, well, was it useful? And, you know, concretely, we work on next token prediction problems with basically a large unlabeled text corpus, and the validators are basically attempting to use that knowledge to solve those specific problems. That’s the same problem that GPT is trained on. Right. So they try to use representations to understand text.

(15:28):

They take that in and use it to understand text, and while they’re doing that, they can actually learn who’s valuable, because if you’re not producing a representation of the text which is useful for doing next token prediction (NTP), well, machine learning by its very nature sifts out, you know, signal from noise. So are you producing signal?

(15:47):

I mean, and just to diverge a little bit, it’s very interesting because if you think about it, Bittensor is essentially a new type of benchmark, right? It’s an incentivized benchmark for these types of representational models. But it’s a very high-resolution benchmark where you’re actually measuring the performance of the machine learning models in terms of representational knowledge, where they can occupy just a little niche inside of this large model. It’s essentially a mixture of experts model.

Jeremie Harris (16:18):

Sorry, can I test my understanding of this so far? So there’s some person in Iceland who is going to go, hey, I want to be the one who gets to add the next block in the blockchain, and I get a reward for doing that because that’s how blockchains work. That’s the incentive that causes me to want to do this. And in order to do that, rather than getting a random hash, I have to generate, like, some

(16:43):

latent representation of a sentence or a token that I want to add. So this would be at the generation end. This is different from the validation end. I guess I’m confused about who gets the money for what in that context.

Jacob Steeves (17:01):

It’s a really good question. Perhaps this is something that we should have described before. The mechanism for reaching consensus inside of Bittensor is done on top of a blockchain. So the way in which blocks are generated is using a pretty traditional mechanism from blockchain. We use a proof of stake system in order to determine the ordering of blocks, what information goes on those blocks, and how the state is updated. Aside from that, we basically run a smart contract on the blockchain.

(17:36):

That’s where the validators are syncing information, and the consensus in the system is actually run on that. So the concept of blocks is, you could say, below us; we’re like layer two. Right. OK, but to answer your question more specifically: once these validators are determining this informational significance, obviously there’s a little bit of fuzziness there. Right. OK, well, he’s valuable sometimes to me, but way more to you. So we need to reach a sort of agreement about who’s at the intersection

(18:07):

of everybody, and once we know who’s at the intersection, we can fairly trust that they are producing something of value. And that intersection is computed on the blockchain. And that’s how we determine who gets inflation, and we just mint new tokens to them.
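One simple way to picture "who's at the intersection of everybody" is to take, for each miner, the median of the scores that the validators assign. The Python sketch below does only that. It is an assumption chosen for illustration, not a description of Bittensor's actual on-chain consensus, and the names `consensus_scores`, `v1` to `v3`, and the miner labels are hypothetical.

```python
from statistics import median

def consensus_scores(validator_scores: dict) -> dict:
    """Per-miner median of the scores reported by each validator.

    A miner that only one validator rates highly ends up with a low
    consensus score, because most validators do not agree.
    """
    miners = sorted({m for scores in validator_scores.values() for m in scores})
    return {
        m: median(scores.get(m, 0.0) for scores in validator_scores.values())
        for m in miners
    }

validator_scores = {
    "v1": {"a": 0.9, "b": 0.1, "c": 0.0},
    "v2": {"a": 0.8, "b": 0.2, "c": 0.0},
    "v3": {"a": 0.1, "b": 0.1, "c": 0.8},  # an outlier that favours miner "c"
}
consensus = consensus_scores(validator_scores)
```

Here miner `a` keeps a high score because two of three validators agree on it, while miner `c`, valued by a single validator, drops to zero.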

Jeremie Harris (18:24):

OK, and how do you make money from this? Like, do you print a bunch of tokens and they go to Bittensor and then they appreciate in value, or is there some other strategy?

Ala Shaabana (18:37):

So it’s a little bit different. We decided to go with what’s called a fair launch. Our cryptocurrency is called TAO, and that means any TAO on the system has actually been mined fairly. So it’s actually contributed to the general knowledge of the whole system. We didn’t do any kind of initial coin offering. We didn’t really offer anything. The idea behind it is, as Jake was mentioning, there are contributors and there are consumers. Right. And then what happens is the contributors are effectively rewarded for their contributions to the system if they’re contributing valid representations, representations that are generally considered useful by the validators. And what happens is they’ll get a fraction of the TAO that is being generated at each block, at the first layer that Jake was describing. So every 12 seconds or so there is a new block generated. And with that block, there’s one TAO that is awarded to the highest performers, basically the ones who do the best in relation to their neighbors. And they’ll effectively just get a portion of that TAO as a result. So, for example, one of the things that the blocks actually contain, information-wise, is a ranking. It’s basically a matrix that contains the ranking of all of the contributors to the system and all the validators as well. And this goes to show who is the most useful. Basically, it’s an informational significance measurement, and effectively the ones that rank the highest are the ones that get the largest portions, and the ones that rank lowest get the smallest portions.
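The "portion of that TAO" idea can be sketched as a proportional split of one block's reward. The one-TAO-per-block figure comes from the conversation. The proportional rule and the function name `split_block_reward` are assumptions made for this example, since the exact emission formula is not described here.

```python
def split_block_reward(scores: dict, block_reward: float = 1.0) -> dict:
    """Split one block's reward across peers in proportion to their scores."""
    total = sum(scores.values())
    if total == 0:
        # Nobody contributed anything useful, so nothing is paid out.
        return {peer: 0.0 for peer in scores}
    return {peer: block_reward * s / total for peer, s in scores.items()}

# Hypothetical ranking scores for three peers in a single block.
rewards = split_block_reward({"a": 3.0, "b": 1.0, "c": 0.0})
```

With these scores, peer `a` receives three quarters of the block's TAO, `b` one quarter, and `c` nothing.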

Jeremie Harris (20:00):

Okay, and how do you actually measure this? This goes to Jacob’s point of how things get verified and validated. Like, how do you actually assess that, okay, this person has the more valuable system? Do you have a bunch of people running GPT-2 or whatever on their local machines, and then they just kind of see how similar a representation is to theirs? Or what’s the similarity metric?

Jacob Steeves (20:23):

We basically use a technique called Fisher information, which gives an approximation of how much worse off the validator would be if a particular node was removed. So the peers are actually evaluated as a collective. You take all of the embeddings, you combine them at the validator, and you learn which ones are valuable, and as you’re learning, you can actually determine these weights. So the specific techniques fall into a category called salience techniques. Imagine you had a neural network and a specific neuron in that network. You can think of an individual neuron producing value as one of our miners, and then you can go, hey, what happens if I remove this one?

(21:11):

Is that good or bad? You know, work has been going on in this problem for a very long time. Yann LeCun wrote a paper about this a long time ago where he used Fisher information. And so we use that same technique to go and calculate the exact value of, OK, well, here, if we remove that peer, what happens? How much worse off are we in the system? And then that actual metric is essentially entropy if we’re using a KL divergence.

Jeremie Harris (21:40):

OK, that’s actually really, so you’re basically doing a kind of ablation study on the different peers, the different latent representations. OK, that makes sense. It’s almost like a permutation feature importance measure, or anyway, it’s got a lot of analogs. OK. Yeah, analogs and other ones like Shapley values, similar, right?
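The "remove one peer and see how much worse off we are" idea can be written out directly as a leave-one-out ablation. Note the swap: the conversation describes a Fisher information approximation, whereas this Python sketch recomputes the loss exactly with each peer removed, which is simpler to read but more expensive. It also assumes, purely for illustration, that each peer returns a next-token probability distribution and that the validator combines peers by plain averaging.

```python
import math

def mixture(preds: list) -> list:
    """Average several probability distributions element-wise."""
    return [sum(p[i] for p in preds) / len(preds) for i in range(len(preds[0]))]

def kl_divergence(p: list, q: list) -> float:
    """KL(p || q); with a one-hot p this equals the cross-entropy of q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def leave_one_out_salience(peer_preds: dict, target: list) -> dict:
    """For each peer, how much the loss rises when that peer is removed."""
    full_loss = kl_divergence(target, mixture(list(peer_preds.values())))
    salience = {}
    for name in peer_preds:
        rest = [p for other, p in peer_preds.items() if other != name]
        salience[name] = kl_divergence(target, mixture(rest)) - full_loss
    return salience

# Hypothetical peers predicting a 3-token vocabulary; token 0 is correct.
peer_preds = {
    "good": [0.7, 0.2, 0.1],
    "okay": [0.5, 0.3, 0.2],
    "noisy": [0.1, 0.1, 0.8],
}
salience = leave_one_out_salience(peer_preds, target=[1.0, 0.0, 0.0])
```

A positive salience means the validator is worse off without that peer. A negative value, as for the `noisy` peer here, means removing it actually helps.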

Jacob Steeves (21:56):

Oh, no, very cool.

Jeremie Harris (22:01):

OK, so it’s a practical question. Like, I don’t know, I’m about to launch a text autocomplete app or something. And I’m wondering, should I fire up my own personal GPUs, or should I go to Amazon Web Services, or should I go to, you know, Bittensor and try to use their decentralized infrastructure? Like, why would I choose one over the other? Let’s say, what scenario would Bittensor really be most valuable in?

Ala Shaabana (22:30):

One of the things that Bittensor actually provides, as part of what we’re building, is what is effectively a large decentralized data set, right? There is an inherent data set that Bittensor is sitting on. And it’s just a giant text corpus that contains everything from books to emails to articles, abstracts, papers and everything. And the idea is that we kind of place this into what is effectively a large decentralized system; it’s hosted over something called the IPFS network, the InterPlanetary File System.

(23:02):

And it’s basically just sharded all over the place. And what happens is every Bittensor node will download a sip of that data set and start using it directly. Right. So if you’re training a language model, you don’t really need to bring your own data set. You can use the data set we’re already working with and just download it through Bittensor. In addition, you’re effectively, in a lot of ways, learning from other systems and from other nodes in the system. Right. You’re not just learning on your own data set. You’re also speaking to other nodes, and they’re also exchanging information with you, and you’re learning from them. Right. So one of the biggest problems in AI that Jacob and I both noticed independently during our careers is that in AI itself, the research compounds every year.

(23:40):

Right. So we’re always building on top of what we’ve learned. But to reach the state of the art, we keep retraining models. Right. So GPT-2 was trained over all the same data that GPT-1 had learned, when really GPT-1 already knew all this. Right. But it had to basically learn the same thing to surpass it eventually. And eventually GPT-1 was discarded, and now it’s kind of obsolete in a way, even though it could provide some very interesting insight regardless of how well it’s performing. So one of the things at Bittensor that we are actively trying to verify and actively working on, one of the biggest points, is that information itself on the network also compounds. Right. You don’t need to relearn things anymore because you’re speaking to other models that might have already learned something that your model doesn’t know yet. So not only are you dealing with models that know things that you don’t, you’re also kind of dealing with adversarial data sets. Right. You’ve got models sending you stuff that your model might not have seen before, and you’re adapting and learning. So what you’re doing is creating more of a resilient generalist model as opposed to a very narrow specialist. Right. So your model becomes more, I guess I want to say resilient again, just because it’s dealing with stuff that it hasn’t seen before all the time. Whereas if you trained on, say, AWS or anything else, not only are you paying for the compute you’re sitting on top of and not really getting any of that back, you’re also training on the one specific data set you have. You’re not really seeing anything new. You’re not really learning from all the other models at the same time.

Jeremie Harris (25:06):

Interesting. And does this feed into the question of these, I keep thinking ablation studies, sorry, the strategy that you’ve developed to measure importance or the value added by a particular peer in this network? If I have a peer whose shard of the master data set is a crappier shard, let’s say, does that unfairly disadvantage that particular peer? Like, no matter how good the model is, no matter how much processing power it’s invested, maybe the data is just kind of junk, at least for the particular task that it’s working on. Is that an issue here, of almost like data fairness? Like, do they all have access to equal quality?

Jacob Steeves (25:48):

Well, they all have access to the entire data set. That’s not a problem. It may actually be very valuable to work on a smaller niche aspect of the problem, because then you can dominate that. I mean, because we’re using conditional computation at the validators, the validators can learn that, oh, you’re valuable for this particular type of data. So I’m going to query this German model when I’m thinking about German things. And I’m going to query this arXiv model when I’m reading arXiv papers.

(26:14):

But the purpose here is to create essentially a corpus or a library of machine intelligence that is continually growing and sharing knowledge between the peers. It’s like a continuous machine learning library.

(26:34):

And that’s what we’re trying to create as value for our customers. Hey, look, what if every epoch you also talk to this network, and maybe there’s something in there that you can learn from? We have models of a very large variety. We’re essentially creating a protocol where each of our endpoints is like the company OpenAI. You know, it’s like, hey, you can talk to me. I’m going to give you my embeddings for the language model that sits behind my endpoint.

(27:05):

We’re the protocol that stitches together all of those AI companies so that they can share collectively what knowledge they’re adding to this global corpus library, the neural Internet, as we like to say in a kind of cliche way. But that’s, I think, the real value here: to exponentiate the amount of machine intelligence that one can access.

Ala Shaabana (27:30):

Interesting. Just to add to what Jacob was saying, another thing is that we do have our own data set. But you know, anybody joining the network is free to bring their own as well. They don’t really need to abide by the one that we’re using, right? The one that we currently use is just there because it covers quite a large corpus of different books and different sources and stuff. But we’ve had folks bringing their own data as well. And, you know, as long as it’s the same modality, in this case text, it’s fine.

Jeremie Harris (27:56):

OK, very cool. And Ala, you mentioned that these models would learn from each other. I’m not sure if I misheard that or misunderstood it, but what would the mechanism be that facilitates that? Like, how does one model benefit in robustness terms from exposure to others?

Ala Shaabana (28:11):

In a sense. So learning from each other is almost, I won’t say it’s a misnomer, but it is kind of the closest representation of what is happening here. What happens is every model, as Jake mentioned, every epoch will start basically trying to ping the network and trying to find other models that it can talk to. Right. And what Bittensor effectively is, is a protocol. That’s really all we are. We’re an API. And as a machine learning engineer, you don’t necessarily need to worry about the blockchain aspect. You can really just focus on the API aspect. And since we are effectively a gigantic, decentralized mixture of experts, what happens is at the Bittensor layer, any model that you create, when it plugs into the Bittensor layer, is going to have what is effectively a way to pick experts.

(28:52):

But another way to think about it, if you’re looking at it from a more software engineering point of view, is that it’s a DNS lookup, and what you’re doing is pinging other models out there that are the experts you’re talking to, and they’re responding back with information. So what happens when you’re pinging them in the first place is that you’re sending them a portion of data and you’re saying, hey, can you run this through and give me back your results?

(29:11):

And what happens is that model will basically run it through and give you back a result. And what you do is you take the output of their last layer and kind of use it as your own input. Our white paper goes into much more detail about this. I’m kind of just like condensing it all into like a few sentences. There’s a lot more going on there. But what we’re effectively doing is we’re kind of stitching these models together using the Bittensor protocol. So what’s really happening is that it’s kind of almost a massive interconnected model over the internet, basically using the Bittensor protocol. It kind of acts as like this mesh web between all of them. And what happens is if, say, for example, you took the information from this model, you ran it through your own model, and your loss function ended up suffering as a result, you might be less likely to talk to it next time because you didn’t really optimize. You ended up doing a little bit worse. So their information or their architecture or whatever their output is is maybe not that useful to what you’re building.
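The loop Ala describes (query a few peers, use what they send back, and talk less to peers whose output hurt your loss) can be sketched as a small simulation. This is only an illustrative toy, not Bittensor's actual protocol or API: the `Peer` class, the noise levels, the reconstruction-style `loss`, the exploration rate, and the moving-average scoring rule are all assumptions made up for the example.

```python
import random


class Peer:
    """Hypothetical stand-in for a remote model on the network."""

    def __init__(self, name, noise):
        self.name = name
        self.noise = noise  # assumed: how noisy this peer's output is

    def forward(self, x, rng):
        # A real peer would return last-layer activations.
        # Here it simply echoes the input with Gaussian noise added.
        return [v + rng.gauss(0.0, self.noise) for v in x]


def loss(representation, target):
    """Mean squared error between the peer's output and the target."""
    return sum((r - t) ** 2 for r, t in zip(representation, target)) / len(target)


def train(peers, steps=200, k=2, lr=0.1, explore=0.2, seed=0):
    """Return a usefulness score per peer after `steps` rounds of querying."""
    rng = random.Random(seed)
    scores = {p.name: 0.0 for p in peers}
    for _ in range(steps):
        if rng.random() < explore:
            # Occasionally query random peers so new ones get a chance.
            chosen = rng.sample(peers, k)
        else:
            # Otherwise query the k peers that have helped the most so far.
            chosen = sorted(peers, key=lambda p: scores[p.name], reverse=True)[:k]
        x = [rng.uniform(-1.0, 1.0) for _ in range(8)]
        for p in chosen:
            peer_loss = loss(p.forward(x, rng), x)
            # Exponential moving average of negative loss:
            # peers that make the loss worse drift toward lower scores.
            scores[p.name] += lr * (-peer_loss - scores[p.name])
    return scores


if __name__ == "__main__":
    peers = [Peer("good", 0.05), Peer("ok", 0.5), Peer("bad", 2.0)]
    for name, score in train(peers).items():
        print(name, round(score, 4))
```

In this toy, the noisy peer ends up with the lowest score and is queried mostly during exploration, which mirrors the "less likely to talk to it next time" behaviour described above.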

Jeremie Harris (30:04):

Okay. That’s actually really helpful. And it helps to also kind of answer a question that I had, which was, you know, I think is this a mixture of experts thing or is it an ensembling thing? So for people who need a little bit more background here, ensembling is when you have a bunch of different models and you get them to basically each vote on what the output, the prediction, or the generated product might be. And you kind of pick the one that got the most votes. And so you get to benefit from the wisdom of the crowd. Whereas mixture of experts models usually involve an initial step where you decide which models in an intelligent fashion, you pick which models get to contribute to this particular prediction based on the kind of problem it is. It sounds more like a mixture of experts model than the ensemble because there’s that initial assessment. Okay. Okay. Very cool.

Ala Shaabana (30:50):

Yeah. So what’s effectively happening is each model is kind of picking and choosing, right? You’re not kind of picking all of them at the same time. You’re kind of picking a subset of the ones that you’re in contact with and you’re kind of optimizing over time who that subset comprises.
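The distinction Jeremie draws between ensembling and a mixture of experts can be made concrete with a short sketch. It is a generic illustration under simple assumptions (models are plain Python callables that return a label, and the hypothetical `gate` function returns one score per expert); it is not taken from Bittensor's code.

```python
from collections import Counter


def ensemble_vote(models, x):
    """Ensembling: every model predicts and the most common label wins."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def mixture_of_experts(experts, gate, x, k=1):
    """Mixture of experts: a gate scores the experts for this input,
    and only the top-k experts are actually run."""
    scores = gate(x)  # one score per expert
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    votes = Counter(experts[i](x) for i in top)
    return votes.most_common(1)[0][0], top


if __name__ == "__main__":
    # Three toy classifiers that disagree about where "large" starts.
    models = [
        lambda x: "large" if x > 10 else "small",
        lambda x: "large" if x > 20 else "small",
        lambda x: "large" if x > 100 else "small",
    ]
    print(ensemble_vote(models, 50))

    # Two toy experts and a gate that routes on the sign of the input.
    experts = [lambda x: "handled by negative expert",
               lambda x: "handled by positive expert"]
    gate = lambda x: [1.0, 0.0] if x < 0 else [0.0, 1.0]
    print(mixture_of_experts(experts, gate, -3))
```

The key difference is where the work happens: the ensemble runs every model on every input, while the mixture of experts decides first and runs only the selected subset.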

Jeremie Harris (31:03):

Okay. Very cool. And so you’ve both now talked about this as a sort of marketplace for intelligence and API. Like a lot of this reminds me of a podcast that I recorded a couple of months ago with Ben Goertzel where I think he’s the founder of SingularityNet and he’s got this like open cog framework that he uses for that. But basically it sounded very similar. It sounded like a marketplace for this sort of activity. I don’t know if you know much about SingularityNet. I’d just be curious, like given that there are two different efforts in this domain, like what differentiates them, what’s different in the philosophy?

Ala Shaabana (31:37):

I feel very, very passionate about this. This is one of the things that speaks to me truly about BitTensor. So one of the other… I mentioned one of the big inefficiencies of AI is what we call retraining stuff over and over, right? The other kind of side that is also inefficient is the idea of contribution, right? How do I contribute to the system? Today, really, if I want to contribute to, let’s say, an OpenAI or some massive GPT model that I’m creating, I really kind of have to be either part of the Googles and the Facebooks of the world or the rather large research institutions, but I can’t really contribute any other way, right? That’s kind of the first end. The second end is that if I want to contribute to the research end of things, there’s kind of a minefield to navigate, right? If you want to publish the best papers, if you want to publish in the best journals, if you want to kind of reach out to the right audience and stuff like that. And a lot of the times, if you’re not part of, again, these massive institutions, it’s going to be fairly difficult for you to kind of contribute something meaningful.

(32:30):

And one of the ethos, one of the massive things that Jake and I are both very, very passionate about at BitTensor is this idea that why is this happening? Why can’t anybody really contribute regardless of their stature in the community? And so at BitTensor, the idea is that you as a human have no input over what models you’re talking to. It’s your own model that’s picking and choosing who to talk to, who is best for it. So we kind of remove this human bias part of it.

(33:00):

There’s no kind of, you know, there’s no kind of, let’s say, marketplace you can kind of walk into, get a pre-trained model, download it similar to like a Hugging Face setup, and then start using it, right? It’s just, you have your own model. Your model through the BitTensor API is enabled to be able to speak to the web and speak to other models and realize for itself who it wants to reward and who it doesn’t, right? So the system truly becomes a market as Jake was describing. And this market is really driven by ingenuity and really how good of an engineer you actually are as opposed to, you know, what journal you published in or what institution you’re part of and so on and stuff like that.

Jeremie Harris (33:37):

Very interesting. Oh, and sorry, Jacob, did you have something to talk about?

Jacob Steeves (33:41):

Yeah, I think exactly the same thing that Ala was saying, but I think SingularityNet is a human mediated market. And I also don’t think that there’s very much collaboration between models. Do models compose? You know, BitTensor is designed around the language Torch, so everything is directly composable as if you were composing a neural machine learning system. But I think the major point here is that it’s a machine to machine market.

(34:06):

It’s neural networks learning which other neural networks are valuable and rewarding them for such. So that’s, I think, one of the unique aspects of what BitTensor is. I don’t know too much about the technical aspects of what Ben is doing. I mean, I’m sure it’s pretty good, but so I can’t really speak too much on that comparison. I might also add that there’s similarities. And I think, like, you know, there’s a couple other teams out there that, you know, see this potential for, hey, we should take neural networks into the Internet era.

(34:36):

Are we in 1983 where it’s going from mainframes to the Internet, right? And I think that there’s going to be many different approaches to this task. And, you know, we have our specific group, but the thing that actually makes us very similar to Ben is that we see that vision, right?

(34:59):

We’re like, wow, in the future, artificial intelligence is going to be connected together. They’re going to be talking across the Web. What’s that protocol going to be? What’s the thing that they’re going to be sharing? And is that going to allow us to scale from the 500 billion to the 100 trillion? And I think that is actually how we’re going to scale. Right. You know, the only thing bigger than the OpenAIs and the Googles is all of them combined.

(35:23):

Right. And if you look at the largest, you know, at the beginning of the call, the largest supercomputer in the world is Bitcoin, right? And it gets to that size by being permissionless open over the Internet. And so, you know, there’s many people, including Ben, that are taking that step, you know, drawn by the vision, the potential of, you know, Internet-based machine learning.

Jeremie Harris (35:46):

It’s a really interesting facet of this, too, is, yeah, that kind of perspective.

(35:52):

It’s hard not to notice the quality of the people who are working on this problem. And any time I’ve seen that before, like whether it, you know, our time at YC or our time like looking at like some kinds of like crypto investments, I remember talking actually, this is going to sound like a name drop, but it’s actually just to showcase my own stupidity here. So like at Y Combinator, one of our batch mates was Devin Finzer, who went on to be the, he was the co-founder at the time of this little dinky company called OpenSea that seemed like a really dumb idea. And I had a bunch of conversations with Devin where he was like telling me about what he did. And he would tell me like, oh, what you’re working on sounds super cool. And I was like, well, yes, it does. And he’d go on about his thing and I’d be like, yeah, it sounds interesting. And, you know, anyway, the quality of this guy, you could already tell him and his co-founder Alex were just like top notch people. But the idea, I kind of have this problem where like I can’t seem to grasp that like crypto vision, but I keep seeing people of such insane caliber going towards it that I sort of resign myself to like, you know, I don’t get NFTs, but I think if like Garry Tan of Initialized Capital, if YC, if Andreessen Horowitz, if the entire planet of like super skilled people like you guys is working on this problem, it’s very hard to be short that position.

(37:09):

I do want to ask a question about scale. So you alluded to it, this idea that you’re reaching a big scale. Bitcoin is the biggest supercomputer and so on. So how large is the network right now? How is it growing? How fast is it growing? Can you speak to that a little bit? Maybe I’ll check in with Ala actually on that.

Ala Shaabana (37:27):

We had our official launch back in November 2021 and we have been kind of growing rather steadily since then.

(37:36):

Don’t want to kind of knock on wood, but we have been growing rather well in the sense that we have set up what is effectively almost an active mining pool, where pool is not the right word, but it’s an active mining set of the Bittensor nodes that are allowed in the system, of approximately 2048 nodes. It’s kind of similar to the idea of the mixture of experts paper where they use 2048 experts. So we kind of use the same thing to get a sense of how things are working, how the network behaves, in kind of almost like a fish tank kind of manner. And interestingly, it has grown quite a bit, and one bit of caveat to that is that these 2000 are miners, not all individual users. So it could be that somebody would run, say, three or four miners or even 100 perhaps. But the system is kind of growing steadily. We have been kind of researching within the system to see how well we can perform, how good of a model we can build, what does the adversarial dataset look like, what does the adversarial performance look like, and so on, and have been incrementally getting really promising results from this whole system. And what’s happening now is that we’ve actually set up a Sybil-resistant algorithm, it’s a bit of a proof of work that people have to do before they can join the network. And the more people try to join, the harder it becomes, because we need to almost act like a funnel to kind of prevent somebody from deploying, let’s say, 10,000 miners trying to basically join at the same time. It kind of just becomes tighter and tighter the more people try to join. And it’s getting to the point now where there’s been almost a wait list in a lot of ways for people trying to join the network in the first place, because with a lot of people trying to join at the same time it is becoming rather difficult. So while we do have 2048 active nodes in the system, the number of people trying to kind of use it and grow into it has likely been much larger than that. So growth has been rather steady and rather relevant, and we’re kind of projecting to grow a little further this year to reach the states that we always talk about.
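The registration funnel described here, a proof of work that gets harder as more would-be miners queue up, can be illustrated with a hash puzzle. The conversation does not spell out the actual algorithm, so everything below is an assumption chosen for illustration: SHA-256 as the hash, "leading zero bits" as the difficulty measure, and the hypothetical `difficulty_bits` rule that adds one bit per batch of pending joins.

```python
import hashlib


def difficulty_bits(pending_joins, base_bits=8, joins_per_extra_bit=4):
    """Hypothetical rule: every `joins_per_extra_bit` pending joins
    add one more required leading zero bit."""
    return base_bits + pending_joins // joins_per_extra_bit


def meets_difficulty(digest, bits):
    """True if the 256-bit hash has at least `bits` leading zero bits."""
    return int.from_bytes(digest, "big") < (1 << (256 - bits))


def puzzle_hash(challenge, nonce):
    """Hash the network's challenge together with the candidate nonce."""
    return hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()


def solve(challenge, bits):
    """Brute-force search for a nonce; expected work doubles per extra bit."""
    nonce = 0
    while not meets_difficulty(puzzle_hash(challenge, nonce), bits):
        nonce += 1
    return nonce


def verify(challenge, nonce, bits):
    """Checking a solution costs one hash, however hard it was to find."""
    return meets_difficulty(puzzle_hash(challenge, nonce), bits)


if __name__ == "__main__":
    challenge = b"example-block-hash"  # placeholder challenge value
    for pending in (0, 8, 16):
        bits = difficulty_bits(pending)
        nonce = solve(challenge, bits)
        print(pending, bits, nonce, verify(challenge, nonce, bits))
```

Because each extra bit doubles the expected search time while verification stays at a single hash, a rule like this makes it progressively more expensive to register thousands of miners at once, which is the funnel effect described above.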

(39:30):

From a kind of more machine learning view, though, we have been kind of trying to measure, well, approximate the number of parameters on the network right now, which is within those 2048 nodes. And I think Jake, you might be able to speak to that better, but I think we’re reaching GPT-3 levels and we’ve only been launched for about four months. So it’s been rather promising at this point. I think we’ve passed 110 billion parameters at this point and we’re kind of growing further. So, yeah.

Jeremie Harris (39:57):

And so that’s the total parameter count of like all the neural networks that are in these different peers. Okay. Exactly. Yeah, exactly.

Ala Shaabana (40:05):

Oh, so it’s happened. It’s a little bit hard to kind of approximate the exact number just because a lot of people will use perhaps a pre-trained model that they download from Hugging Face. We have no visibility into that. We have no idea what it is that is actually running on the system.

Jeremie Harris (40:15):

Oh, that’s so interesting. That’s right. You have no idea what the actual architectures are that are being… So how do you even get a parameter count then if you don’t know?

Jacob Steeves (40:23):

It’s a very conservative estimate. So 110 is based on basically the smallest model that people are likely to be running, which is actually about half the size of the default. But I mean, in reality, it could be much larger than we know. Like if somebody’s just running one GPT-3 model on the network, we get 175 billion parameters right there, and potentially all 2,000 nodes could be running 175 billion parameters. The scaling though is difficult because to run a single forward pass of a GPT-3 model is actually incredibly slow if you’re just running on a single computer. So one of the ways that we’re scaling right now, and we’re measuring this very closely, is okay, how many requests can the system process? How large can the sequence lengths be through the validators? And then as that number grows, we can go, okay, great. So we’re getting a lot more fidelity into the system and we’re growing out this network to be highly performant. And the one thing that we have that’s cool is that, at the token price, we can calculate how much computational power we could be extracting in a perfect market. And so we’re kind of growing up to that number, into the hundreds of thousands of dollars per day of compute, and making sure that the validators are doing a good job of asking the network to expend that energy in terms of, what is it, running more compute or having better internet connections, et cetera.

(41:58):

So in the end, it’s hard for us to tell exactly the number of parameters, but a really conservative estimate is about 110 billion.
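As a back-of-the-envelope check on these figures, the arithmetic can be written out. The node count (2048), the conservative total (110 billion) and the 175-billion-parameter model size all come from the conversation; the per-node figure and the upper bound are simply derived from them and are not numbers the speakers state.

```python
NODES = 2048                   # active nodes mentioned in the conversation
CONSERVATIVE_TOTAL = 110e9     # conservative network-wide parameter estimate
LARGE_MODEL = 175e9            # size of the large model used as an example

# Implied size of the "smallest likely model" if every node ran it.
per_node = CONSERVATIVE_TOTAL / NODES

# Hypothetical ceiling if every node ran the 175-billion-parameter model.
upper_bound = NODES * LARGE_MODEL

print(f"implied per-node size: {per_node / 1e6:.1f} million parameters")
print(f"upper bound: {upper_bound / 1e12:.1f} trillion parameters")
```

The gap between the two numbers is wide, which is consistent with the point that the true total cannot be observed from outside and 110 billion is only a floor.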

Jeremie Harris (42:05):

Okay, that’s really interesting. And that also makes me wonder, when you zoom out and you look at this entire network and what it’s doing, you effectively have like, I don’t know how many, like a dozen or hundreds maybe of instances of something like GPT-2 running, making predictions, comparing the predictions, blah, blah, for like what, each token, each sentence or something.

(42:23):

So people will often argue like, oh, this is terrible for the environment. GPT-3, these large language models, like there’s a lot of pollution. Does this exacerbate that? Do we have this just happening in parallel and how does this ultimately make economic sense if I could turn to GPT-3 and get one price per token, but then I go over to this system and economically, it seems like it almost has to accommodate like a hundred times that price per token. I’m sure I’m missing something, but I’m just curious what it is.

Jacob Steeves (42:52):

Well, I mean, I think one of our fundamental arguments here is that there’s a lot of waste in the machine learning world as is today, right? You have tons of researchers training the same models and publishing them to NIPS and we’re saying, hey, well, why don’t we do it collaboratively? We don’t need to train one system to make this work. And another thing is that we’re saying, hey, look, there’s efficiency in the way that we’re measuring this digital commodity.

(43:17):

Hey, we want informational significance. We want embeddings that produce value. So that’s exactly what we want. We’re going to measure that exact thing and then we’re going to build a market around that, and hopefully the market will be efficient in producing that thing instead of papers. We’re getting the commodity we want. And then also there’s other aspects. For instance, in Bitcoin, the computational power tends to arbitrage for cheap electrical resources. That’s one thing. I mean, and I think it’s hard to say exactly what the token price will be. And I wouldn’t want to speculate, but we’re aiming to produce something that is state of the art. And I think that our competitors are using a lot of electrical value already. 75% of Google’s energy costs go to machine learning. So I think that we’re definitely going to be using electricity. There’s no question about it, but hopefully we’re going to use it in an effective way, in the most effective way that we can. And what else can you really hope for?

Jeremie Harris (44:23):

Yeah, no, I mean, to be clear, I find most of the arguments for that kind of thing, the kind of environmental arguments around large language models, like profoundly uncompelling when you look at just the scale of things and like, what does it mean? What is the marginal value of a token generated in terms of just like human thought time and how that would translate into like food consumption? And I mean, like there’s a whole like other side of the ledger that is completely unexplored and unexplorable. Okay, cool. I want to ask one last question that is about the sort of what should we, shouldn’t we side of this too. There are arguments that are interesting at the very least when it comes to the malicious use of AI, you know, OpenAI, for example, famously decided they’re not going to release GPT-3, citing in part concerns over malicious use, same with DALL-E, certainly same with DALL-E 2 and so on. And there are other firms that are kind of making similar noises. Starting with malicious use, there’s also obviously the kind of alignment story that we’ve been exploring a lot on the podcast long term. Where do you go with systems that get more and more human capable? How do you ensure that these systems are developed safely and aligned and don’t have, you know, side effects they produce when they solve problems in dangerously creative ways? So I guess those two things, like malicious use and alignment risk, superficially seem like they’re exacerbated when you decentralize and democratize access like this, but I’d be really curious to hear you both opine.

Jacob Steeves (45:49):

So I think that the fair use and the problems of, you know, malicious users tend to be things that centralized organizations focus on because the more glaring problem is the fact that they’re deciding who gets to decide what is good and where AI should be moving and who owns it. And I think that the decentralization is really about that. It’s about making sure that the question of what can or cannot be done with AI is answered collectively in a democratic way. Instead of, you know, the ivory tower people saying, I don’t like this and I don’t like that. Now, caveat here, right?

(46:32):

Ivory tower people are really smart and they maybe have a really good view from their tower. But I think that our thesis here is that the bigger issue for AI is not the AI fire the nukes and they take over the world, it’s more that like a small group of people come to own it and use it for themselves rather than that collectively it’s aligned with humanity.

(46:58):

Decentralization solves that problem very well because decentralization is all about the decentralization of power, right? And so one of the things that we think we can do with tokenization and building these open protocols is allow for that kind of equality to happen. So there’s not just five big players, there’s tens of thousands of people that control AI and get to decide what it learns and who gets to use it and who gets to talk to them and all these things. And I think that’s how we’re going to, as humanity, make sure that AI is working for us, by making sure that the AI’s bottom line is us and not some small group of people. Interesting.

Ala Shaabana (47:35):

So effectively, Jake, you’ve kind of touched on the final last bit of ethos for Bittensor, right? We talked initially about the inefficiency, we talked about the issue of access, and now we’re talking about centralization. You kind of nailed it on the head, but the one last bit is that, I think one of the things that people always talk about is AGI, right? Artificial general intelligence, which is the Holy Grail really of AI. The problem itself requires a lot more ingenuity, teamwork, people, really generally just a lot more work than can be done by a single company, right? And to be completely honest with you, I find it a little bit more terrifying that a single company, which I don’t want to mention, would own AGI compared to all of us owning a piece of it, right? That’s a company always using it for their profit margins, whereas if we all own a piece of it, we might just be able to do something more interesting than that with it.

Jeremie Harris (48:29):

Okay, very interesting. So I think I find myself like temperamentally very much in agreement with like the thesis and a lot of the things that underpin it because, because I have a perspective that’s a little different when it comes to long-term AI risk and I view AGI as like basically intrinsically dangerous. I think that like we currently have not solved the alignment problem and I expect the default outcome from strong AI to be catastrophic.

(48:59):

Like in that worldview, which you know, it’s totally cool, you guys don’t agree with it and then there’s a lot of disagreement with that perspective obviously, there’s far from consensus. So would you say that there is, so there really isn’t much consolation for somebody from my perspective in this approach from that perspective at least. Is that fair to say?

Jacob Steeves (49:19):

Perhaps. I mean, I think maybe it comes down to the conception of what AGI does, what it looks like, right? And what is it tethered to? And I think, you know, what is it working for? What is it, if it’s going to be a traditional machine learning system, does it have an objective? We’re tying that objective to tokenization and decentralization. We’re saying, hey, it’s got to be tied to the holders of these tokens. And so then that’s its leash for AGI. But I think maybe that’s not, maybe it’s that, perhaps, well..

Jeremie Harris (49:55):

So like, so the default argument for AGI risk is that the moment you specify any kind of optimization function, optimization objective, a sufficiently advanced AI system will come up with dangerously creative ways of making that number go up. So whether that’s the stock market or the like a categorical cross entropy of a classifier or whatever, the ways that cause that number to go like to the moon have as a side effect almost invariably the destruction of all human value on the planet.

(50:30):

And so like, while I couldn’t necessarily describe the specific way in which this system could get hacked to blazes and cause like the world to turn into a pile of paperclips, the argument would be that this is something that a sufficiently capable AI system would end up doing. It’s just like, it’s a question of how rather than of if from that perspective. But it may just like, it just depends ultimately on the extent to which you buy that argument. And again, a lot of people don’t. And so it’s not like this is just the way things are.

Jacob Steeves (51:02):

No, I do actually buy that argument. I believe it. But like, for instance, you know, some people think that one of the ways that you can, for instance, save the environment, like, hey, capitalism is that AI. We’re this AI that all we care about is, you know, GDP number goes up. And so we cut down on the forests.

(51:22):

Right. And, you know, some ideas from economic environmentalism is to make sure that you can’t, that there’s this alignment between the two systems, right, that, hey, the bottom line of the company is able to see that it needs to maintain the forest. Right. And so there’s an alignment between those two things. And I think that that’s, you know, if we want AGI to be in alignment with humanity while it tries to solve this problem, you know, we don’t want it to just kill humans to make the number go up. Well, we need to make sure that the humans are owning the thing that it wants. So it doesn’t kill the humans. Right. It’s aligned. It’s aligned with us. So it goes, hey, you know, that avenue is not actually making the number go up. Right. So it’s about crafting that objective function in such a way that this AGI doesn’t go that direction.

(52:15):

Number go up means actually help all those people that are holding these tokens, that it needs to help. And, you know, that symbiosis is a really difficult problem. And we don’t really have the answers to that. So, I mean, we’re not claiming that we have the perfect answer. But we think that tokenization allows us a lot of maneuverability in the way that we can represent that alignment problem.

Jeremie Harris (52:36):

Yeah, yeah. It certainly is like a new axis on which to define loss functions that feels more fuzzy, which I mean that in a good way. So like one of the challenges with like well-formed objective functions is that they tend to be more intrinsically side effecty. Like you can really kind of nail the crap out of this one specific number. And like because it’s such a specific number, it’s not tied to kind of a process. It’s for various reasons anyway more vulnerable to this. I guess I just tend to assume that like something like this in the limit though remains a problem just because an AI could like would find itself incentivized to find ways to convince people to say that its outputs were valuable or things like that. Like the meta game becomes the issue. But again, I think one of the challenges is like for people like me, I don’t honestly know what I’m proposing as an alternative. This stuff is going to continue. So there is the question of what happens in the meantime, even if I end up being right and everything goes to shit. Like do we want the world to go to shit in the hands of a specific company or other?

Jacob Steeves (53:41):

So I feel like we can have a much longer conversation about this particular point.

Jeremie Harris (53:44):

Absolutely. I mean, that’s the problem with alignment.

Ala Shaabana (53:48):

This is a beer problem. We just sit down, have a beer and talk about it.

Jeremie Harris (53:52):

Absolutely. No, look, I super appreciate you guys especially being patient with my very limited understanding of blockchain stuff and giving me that explain it like I’m five version. Where can people go to follow BitTensor? Where can they go to follow your work and dive deeper into it?

Ala Shaabana (54:08):

We’re a fully open source project. Our entire stuff is on GitHub and you’re free to go to bittensor.com, check out our network. There’s a network tab, kind of show you what the product looks like right now. Check out our GitHub, check out our white paper as well as light paper that’s coming out very soon. And we also have a Discord link that I’m going to share with you as well.

Jeremie Harris (54:27):

Awesome. Actually, if you can, I’ll also share that on the blog post that will come with the podcast. And if you’re listening to this and you’re really excited about this, do check that out because we’ll have all those links there. All right, Jacob, thanks so much for joining me for this. This is a ton of fun.

Ala Shaabana (54:40):

Thanks, I really appreciate your time.

Jacob Steeves (54:41):

A lot of fun. Yeah, thank you so much.


Video Description

Ala Shaabana and Jacob Steeves, two ML researchers, unpack problems ranging from designing robust benchmarks to rewarding good AI research, and even the centralization of power in the hands of a few large companies building powerful AI systems.

Intro music: ➞ Artist: Ron Gelinas ➞ Track Title: Daybreak Chill Blend (original mix) ➞ Link to Track: https://youtu.be/d8Y2sKIgFWc

0:00 Intro 2:40 Ala and Jacob’s backgrounds 4:00 The basics of AI on the blockchain 11:30 Generating human value 17:00 Who sees the benefit? 22:00 Use of GPUs 28:00 Models learning from each other 37:30 The size of the network 45:30 The alignment of these systems 51:00 Buying into a system 54:00 Wrap-up

