EPISODE 1779

[INTRODUCTION]

[0:00:00] ANNOUNCER: Data analytics and business intelligence involve collecting, processing, and interpreting data to guide decision-making. A common challenge in data-focused organizations is how to make data accessible to the wider organization without the need for large data teams. Metabase is an open-source business intelligence tool that focuses on data exploration, visualization, and analysis. It offers a lightweight deployment strategy and aims to solve common challenges around data-driven decision-making. A key aspect of its interface is that it allows users to interact with data with or without SQL. Sameer Al-Sakran is the Founder and CEO of Metabase. He joins the show to talk about the challenge of data accessibility, the evolution of the data analytics field, key lessons from his 14 years leading Metabase, why the platform uses the Clojure language, and much more.

This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[INTERVIEW]

[0:01:09] SF: Sameer, welcome to the show.

[0:01:10] SA: Thank you, thank you. It's great to be here again.

[0:01:12] SF: Yeah, thanks so much. We're talking analytics today. Various approaches to analytics have been around for a long time. There have been a bunch of different generations of BI tools, from Tableau to Looker. There are also different takes on the problem of analyzing data and driving insights, using things like Streamlit, where you're actually building your own dashboards, and then there are frameworks like dbt that are focused on data transformation and preparing the data before it reaches the visualization stage.

Metabase has been around for nearly a decade now. You've probably seen a few evolutions of products in the space. What's the background of the company, and where does Metabase fit into this world of analytics?

[0:01:53] SA: Yeah. I think we're the last mile. In general, in most companies, there's data that people are interested in that lives in one or more places, depending on how buttoned up you are. It could all live in one really gleaming, perfect data warehouse, where everything is perfectly organized and everything's right. Or it could just be this complete de-federated mess. I think we are the thing that lets you get data into the hands of the people that have real jobs.

I think one of our core tenets has always been the poor sucker with the day job. Really, it's about how to get as much of their curiosity as possible satisfied by them, without help from anyone else. There have been, like you say, a lot of really compelling ways to instrument your company, to have a better understanding of what's happening, to have better awareness of what certain segments of your customers are doing, what people are clicking on, who's signing up for what, when. All these things have been pretty dialed in for decades.

I think the playground that we're trying to really do well in is just: there's a normal person that has questions, and we don't want there to have to be an analyst or an engineer that has to deal with every single question that comes up. A way to think about us is that most times you see a dashboard, there's a certain set of questions that dashboard answers. Then there's an easily anticipated set of questions three through 20. We're trying to make it really easy for someone to answer those. Compared to a Looker, or Streamlit, or Tableau, we're less about creating a dashboard that's consumed as-is, and more about creating a dashboard that really just sparks some amount of interest or curiosity, and the subsequent clicks and subsequent iterations and refinements are where a lot of the magic in Metabase shows up.

[0:03:42] SF: Essentially, the typical target user then is a non-technical user that needs to be able to not only analyze the data, but perform, essentially, this discovery process, because they might not even know, necessarily, what they're looking for, or what questions they have. They want to be able to mix and match and explore?

[0:03:58] SA: Yeah. That's the final user and the final constituent. I do think that analytics is often a multiplayer game, and there are different roles people fill. We are generally set up by engineers. For the most part, an analyst is not the person who stands up Metabase. It's usually an engineer, and it's usually someone that has a database that's lying around. There are people out there in the company that need stuff from that database.

Usually, the person that downloads a Docker image and then runs it is not the final user. It's the person that is serving the final user. We do separate, in our own internal lingo, the installer persona, versus the end user, versus analysts and pro users. In general, our heart and soul is helping the poor sucker with the day job get their questions answered themselves, or at least a subset of them, and taking the burden off of the engineer that's currently a bottleneck.

[0:04:51] SF: Okay. Yeah. I definitely want to get into some of the details on the configuration and setup process. Maybe before jumping there, since you've been in this space for a long time, and I think you've founded multiple companies related to analytics, what are your thoughts on the evolution of the analytics stack through your career? What surprised you? What trends are you paying attention to today that maybe were not on your radar a few years ago?

[0:05:15] SA: I mean, I think the big one and the easy one is just that natural language processing finally kicked up a viable solution to many things. It turns out to be even more broad-ranging than previously feared. I think that's an easy, immediate response. I think it's part of a larger secular trend that I've been following for, I don't know, a decade at this point, which is just that tools have gotten easier to use, and that there is an intrinsic complexity in analytics. There are certain things that are just naturally annoying about how to calculate net revenue retention, for example.

There's some math you've got to be aware of. There are some choices you have to make. The actual equations are annoying. Encoding those in SQL, or Python, or what have you is just a pain. There's also a lot of just unnecessary complexity that has, over the years, been chipped away at. If you were trying to calculate anything in the 70s, you had to write a bunch of code. Gradually, SQL took over. You walk that forward through Excel, SQL, and Tableau to the last couple of generations of more post-iPhone software, where everyone just has a much higher bar for interaction quality and ease of use. I do think that there has been a very palpable simplification of the tools themselves. It's not that what users are doing is getting simpler. It's just that there's less self-inflicted annoyance.

I do think that the general user experience has improved dramatically over the last 10 or 20 years. I do think there are also a lot of interesting things happening around data shaping itself. There's always been a question of how you should store data, what the appropriate format is, how to deal with consistency. There are all kinds of textbooks on data warehousing. One of the things that I don't want to say is surprising, but that I don't think I would have given as much weight as I currently do, is that the success of a self-service data organization largely revolves around what schema you present to users.

Given a choice of where to spend time, I'd spend it getting the schema cleaned up, specifically in a way that lets a normal person with a normal cognitive model of their business look at it and recognize what they're looking at. I do think there are often data-at-rest formats that make a ton of sense from an efficiency, or consistency, or just convenience perspective, but that essentially make it impossible for anyone that's not eyeballs-deep in the actual code base of that application to make sense of the data.

[0:07:45] SF: Yeah. What does that look like, in order to present a schema that's understandable by someone who's not in the weeds of the database or the data warehouse?

[0:07:54] SA: I think it's fundamentally about resisting the urge to normalize everything, and having workhorse tables that are both enriched and manageable in size. Anything over 20-ish columns becomes harder and harder to use. The columns should have English names, or whatever language your company runs under. You should be able to understand what's in a column without having to look something up. There should be a relative, let's just call it simplicity, to how concepts are represented. Users have addresses, and ideally, it's not two tables with a foreign key from an address into the user.

While that may be accurate, and it may represent the fact that certain people have addresses historically, that sort of thing makes it really difficult for someone that's just trying to look up customer data to make heads or tails of it. There are a lot of things like that, where you probably want to have specialized data sets that are just views on whatever the data looks like at rest. You probably want to iterate on those by department, or use case. There are a lot of very brilliant ideas that an average database designer would have that essentially make that data set unusable by anyone who is not as smart as them.
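
A minimal sketch of the idea of presenting a flattened "workhorse" view over a normalized schema. The schema and names here are invented for illustration, using SQLite; the point is that the consumer sees one readable table with plain-language column names, not the join:

```python
import sqlite3

# Hypothetical normalized schema: users and addresses joined by a foreign key.
# The view below is the kind of enriched, readable data set a non-technical
# user can query without knowing the underlying join exists.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT, signup_date TEXT);
CREATE TABLE addresses (id INTEGER PRIMARY KEY,
                        user_id INTEGER REFERENCES users(id),
                        city TEXT, country TEXT);
INSERT INTO users VALUES (1, 'Ada Lovelace', '2024-01-15');
INSERT INTO addresses VALUES (10, 1, 'London', 'UK');

CREATE VIEW customers AS
SELECT u.full_name   AS "Customer Name",
       u.signup_date AS "Signup Date",
       a.city        AS "City",
       a.country     AS "Country"
FROM users u LEFT JOIN addresses a ON a.user_id = u.id;
""")

row = conn.execute('SELECT "Customer Name", "City" FROM customers').fetchone()
print(row)  # ('Ada Lovelace', 'London')
```

The view is purely a presentation layer: the data at rest stays normalized, while the consumer-facing schema uses human-readable names and one row per concept.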

I do think there's a need to, I don't want to say dumb things down, because people in tactical roles are generally fairly intelligent in my experience. But they know their world; they don't know the world of the latent databases. Making them learn the world of the latent databases to get their stuff done puts up an initial barrier. If you'll forgive me, I've often used the blogging metaphor for this, where once upon a time, to write something on the Internet, you had to learn PHP and how to do various command-line shenanigans and set up this, that, and the other thing. At some point, the amount of effort it took to set up a blog was more about configuration and installation than it was about the quality of writing.

Hence, you had really, really good writers that weren't able to get their words out. When you reduce the technical burden of getting the written word out there, the people that are actually skilled at writing, versus skilled at setting up Unicorn, or NGINX, or what have you, are the ones that actually get the word out. I think something similar happens in most organizations, where inside of a company, the people who have the most nuanced view of revenue retention, or active users, or the specific mechanics of a checkout funnel are not necessarily people that know how to write Python or SQL.

The people that are running that funnel, or that retention analysis, are the ones actually talking to users and that have a fairly specific understanding of what people are doing. They're the ones that know whether you should count a plan upgrade toward the retention of the original plan, or the final plan.
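
To make the plan-upgrade question concrete, here is a small hypothetical example (all numbers and policy names invented): the same cohort produces two different retention numbers depending on how an upgrade is attributed, and that choice is a business decision, not a technical one.

```python
# Hypothetical cohort: one customer stays, one upgrades mid-period, one churns.
customers = [
    {"id": 1, "start_plan": "basic", "end_plan": "basic", "retained": True},
    {"id": 2, "start_plan": "basic", "end_plan": "pro",   "retained": True},   # upgraded
    {"id": 3, "start_plan": "basic", "end_plan": None,    "retained": False},  # churned
]

def retention(customers, plan, policy):
    """Retention rate of `plan`'s starting cohort under two attribution policies."""
    cohort = [c for c in customers if c["start_plan"] == plan]
    if policy == "count_upgrades":
        # An upgrade still counts as retained on the original plan.
        kept = [c for c in cohort if c["retained"]]
    else:  # "original_plan_only"
        # An upgrade leaves the plan, so it no longer counts as retained on it.
        kept = [c for c in cohort if c["retained"] and c["end_plan"] == plan]
    return len(kept) / len(cohort)

print(round(retention(customers, "basic", "count_upgrades"), 2))      # 0.67
print(round(retention(customers, "basic", "original_plan_only"), 2))  # 0.33
```

Two defensible policies, two different numbers; only someone who knows the business can say which one the company should report.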

[0:10:38] SF: I think, also, besides having that domain-specific knowledge of the business side that the person who's writing the query doesn't have, different people are also going to have different perspectives and experiences. They might be able to solve a problem by bringing in new information from another domain that feels disconnected, but because they have experience in it, they're able to recognize patterns across things that on the surface seem disconnected.

[0:11:08] SA: Yeah, for sure. I mean, I mostly agree with that. Just to pop the stack a little bit, I do think that one of the critical things is not prematurely abstracting, and letting the different usages of data sets have different data set shapes. I know it's not exactly what you're talking about - sorry, my brain just went off on a tangent there.

[0:11:31] SF: One of the things you talked about there was the analogy about blogging. If we can reduce the friction of getting set up, and allow someone who's good at writing to write without having to deal with these technical hurdles, then you're going to end up with a lot more people just writing. If we can reduce the technical hurdles and configuration steps involved with accessing data, then we're going to have a lot more people who are good at actually driving insights for the business from the data available to do that.

Now, you mentioned, essentially, all the interest, of course, around LLMs. I think there's a number of companies that are trying to leverage generative AI now as essentially this interface to democratize access to data. I'm curious, what are your thoughts on that and how does Metabase fit into that world?

[0:12:16] SA: Yeah. I mean, I think there are maybe two different angles on that. I'll carve off two pieces, and there's some residual. I think that it's pretty clear, to me at least, that some subset of people want to talk to the computer. The idea of unstructured, just natural language as an interface for existing functionality is pretty much written into the timeline. I suspect that there is going to be some set of things where people will just naturally and organically want to start talking, or typing, in a way that's conversational and natural language. I think that it's going to increasingly be just the hard expectation for all tools in analytics and otherwise to support that as a UX paradigm. Much the same way as when mice existed, all of a sudden, you needed to have a menu system. If you don't have a menu system and everything is command shortcuts, you're weird and you have to explain yourself.

Now, there are still tools that are 95-plus-percent driven by keyboards today, even in a world with phones and touch pads and mice. I do think that, going forward, we need to figure out where squishy natural language is the right user interface. I think that's a separate notion from using LLMs to generate queries, or generate analyses, or generate deep-dive execution plans, or whatever everyone calls them. I'm somewhat less bullish on that. That said, I'm fairly excited about what you can do with agents that are wielding deterministic tools. I think that there are going to be a lot of ways to push forward what a malleable, squishy agent that is basically working in LLM land - with hallucinations, with all of the usual caveats you have there - is able to give users, if it is able to then invoke tools that return absolutely correct numbers.

I think one of the things with analytics is it's a very harsh place in terms of expected accuracy. If something is wrong 2% of the time at organizational scale, it doesn't work. If 2% of the numbers in your company are wrong and you just can't tell which 2%, that really doesn't fly. Especially if that 2% changes randomly on you. I think that trying to generate SQL, or whatever target language you have, is probably a rocky road. That will work well after, I think, the game has been played and won. What is exciting, and what I think will start taking hold, is: I have this toolbox of deterministic stuff, I have single or multiple agents that can use that, and then a lot of the heavy lifting is going to come from the actual deterministic tools themselves.

Just to bring it all in, I do think it's important that if the machine produces a number, that number is right. I think that a world where the number is just like, eh, it's close, rapidly falls apart once you're talking about real operations with real stuff that people care about. I also think that most people that are not in analytics underestimate how much time goes into understanding why number X is not the same as number Y. My revenue number here is 1.25. My revenue number there is 1.27. Which one's right? Working analysts tend to spend a disgusting amount of time dealing with that. I think things that make that harder are, net-net, a larger burden on analytics.
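
A rough sketch of the "agent wielding deterministic tools" pattern described above. Everything here is invented for illustration - the "LLM" is a stub that only chooses which tool to call, so the number itself always comes from deterministic code and is exactly right whenever it is produced at all:

```python
# Toy ledger a deterministic tool can sum over.
REVENUE = {"2024-01": 125_000, "2024-02": 127_000}

def total_revenue(months):
    """Deterministic tool: sums known ledger entries, never guesses."""
    return sum(REVENUE[m] for m in months)

TOOLS = {"total_revenue": total_revenue}

def fake_llm_plan(question):
    # Stand-in for an LLM: the squishy part returns a *tool call*, not a number.
    return {"tool": "total_revenue", "args": [["2024-01", "2024-02"]]}

def answer(question):
    plan = fake_llm_plan(question)
    return TOOLS[plan["tool"]](*plan["args"])

print(answer("What was revenue in Jan and Feb 2024?"))  # 252000
```

The hallucination risk is confined to tool *selection*; a wrong selection can produce an irrelevant answer, but never a subtly wrong number, which is the failure mode analytics can least tolerate.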

In terms of what's happening with Metabase, for us, our target persona has always been the non-engineer, the non-analyst. I do think those people are rightfully going to want to talk to computers. We've had two different iterations of Metabot, a chatbot that we've had. We're constantly playing around with stuff. We have a couple of dark alphas. We've had various classification, clustering, and recommendation algorithms in the code base for ages, and we've gradually played around with replacing some or all of those with LLMs at the relevant locations. I think that there's some very, very interesting stuff around, again, the new idiom being, I can talk to the computer. I think that's where we're putting a lot of our chips. An LLM as an analyst - I think that'll happen. I just think that'll happen well after a bunch of really, really cool stuff gets produced in other ways.

[0:16:37] SF: Yeah. I mean, I think that what you're saying is right. You need to start with the types of tasks that LLMs are reliable for today, especially when we're talking about analytics. You can't get the wrong revenue number. You can't get these numbers wrong, or it's going to lead to all kinds of problems. Back to Metabase - if I want to get started with the product, what is that process? I'm assuming that an engineer is the first person that's working with Metabase to get it set up. What is that setup and configuration process?

[0:17:10] SA: Yeah. Our whole bag has been that we're the laziest possible option. We've tried to make it very easy for someone to spin us up alongside a very early-stage project. You just pull a Docker image, you run it, you point us to your data warehouse. At that point, you just add a database and give people accounts. There are some options in the open-source version, and better ones in the pro version. Generally, just download a jar if you run jars, download the Docker image if you don't, or we have a cloud service if you don't want to do either. It's literally a couple of minutes, and for folks that are super early in the cycle of their product, or for projects at a larger company, we actually suggest that you don't do anything else.

Don't make dashboards, don't write reports. Let that happen organically. Setting us up before there is an analyst is usually a very strong recommendation, because it can delay the need for an analyst just by having there be a controlled place where people have accounts. They can run SQL questions if they know how to write SQL. You can give them SQL templates to run. There's a query builder they can use on their own. There are lots of easy ways to click and hunt and peck their way to Nirvana. I do think that for us, the primary thing we're trying to do is delay the need to get serious about data. There is a certain desire people have to set up a data warehouse, to set up dbt, to set up a bunch of other stuff. That all makes a ton of sense. But you should probably have something that lets the normal humans in your company ask questions months or years before that moment.

[0:18:47] SF: Yeah. Then, is the cloud service the main way you commercialize?

[0:18:53] SA: It's one of the main ways. There are three ways we commercialize. One of those is just, hey, you don't want to run it yourself, we'll run it for you. We do have an open-core model, so there are some features that will help you at a larger scale that you can buy from us. Then, if you want to slap your logo on it and embed it in your application - if you want to white-label us in your application - there's a separate license for that.

[0:19:14] SF: Okay. Then, as a user interacting with the front end of this, what's that experience like, and then what is going on behind the scenes to essentially pull the data?

[0:19:24] SA: Yeah. There are a couple of different folks that I'll talk about. I think the person who's setting this up is probably going to be smashing SQL together. You show up, you hit a button, you can write SQL, you can save that, you can build dashboards. There's a power-user mode, effectively, where if you know what you're doing, you can do all kinds of rich dashboards, templated SQL, data transformations, model things, persist models, etc.

I think there's also, from the end-user perspective, just the ability for me to click on stuff. When I click on stuff, it changes. I can use a simple query tool where I just click on buttons and I get answers. For that, we have a target language called MBQL. It's just a pre-parsed, pseudo-SQL-ish thing. Our user interface generates MBQL, and MBQL then gets transpiled to various SQL dialects, or Mongo, or some other non-SQL target. We have a couple of other community drivers for non-SQL-based languages. That gets executed, so everything that runs, runs on your database or data warehouse. Then it gets pulled back, there's a bit of processing, and then it gets chucked over to the client.

For the most part, for a whole host of reasons, we don't want to generate SQL directly, and we don't want to force people to have to write SQL directly. The heart of the application is a transpiler.
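
To make the transpilation idea concrete, here is a rough Python analogue. This is not Metabase's actual MBQL format or code (which is Clojure data and much richer); the structure and field names below are invented. The point is the shape of the pipeline: the UI emits a structured query, and a transpiler renders it into a target dialect, so nobody hand-writes dialect-specific SQL:

```python
# Illustrative structured query, loosely MBQL-flavored.
query = {
    "source-table": "orders",
    "aggregation": [["count"]],
    "filter": ["=", "status", "shipped"],
    "breakout": ["country"],
}

def to_sql(q, quote='"'):
    """Render the structured query as SQL; `quote` stands in for per-dialect
    identifier quoting (e.g. backticks for MySQL)."""
    ident = lambda name: f"{quote}{name}{quote}"
    agg = {"count": "COUNT(*)"}[q["aggregation"][0][0]]
    cols = [ident(c) for c in q.get("breakout", [])]
    select = ", ".join(cols + [agg])
    sql = f"SELECT {select} FROM {ident(q['source-table'])}"
    if "filter" in q:
        op, col, val = q["filter"]
        sql += f" WHERE {ident(col)} {op} '{val}'"
    if cols:
        sql += " GROUP BY " + ", ".join(cols)
    return sql

print(to_sql(query))
# SELECT "country", COUNT(*) FROM "orders" WHERE "status" = 'shipped' GROUP BY "country"
print(to_sql(query, quote="`"))  # backtick-quoted variant for a MySQL-style dialect
```

Because the query is data rather than a SQL string, the front end can manipulate it safely (add a filter, change a breakout) and the same query can target any supported dialect.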

[0:20:37] SF: Who's writing the MBQL statements?

[0:20:40] SA: The computer is. I click on stuff, and then we have, essentially, React components that do some stuff. They invoke an MBQL lib library, which is ClojureScript. The ClojureScript manipulates this parse tree, effectively, and then that gets kicked over the wire. That's how most of our queries get presented.

[0:20:57] SF: Then how many different languages do you have to transpile to?

[0:21:02] SA: I always get this wrong, but I want to say, there are something like 20 first-party drivers, and then maybe another 10 third-party drivers. We wrote a bunch of drivers for common databases, and then every once in a while, someone in the community writes something for a database we don't support. But on the order of 30 different databases are targets of MBQL.

[0:21:22] SF: Okay. Why did you choose Clojure as the development language?

[0:21:26] SA: I mean, originally, it was Python. The first version of this was written in Python. Then we thought about the deployment and installation story. I glibly mentioned that we made installation and configuration really easy. We actually went through a lot of trouble for that, and we used Clojure to do that in many ways. We wanted to have a single atomic binary and download. We wanted to have mature database drivers. We really didn't want to be forced to run lots of weird processes in a Python Docker image, where there are just multiple modes of failure.

We ended up deciding to use a JVM language. We tried to port the Python to Scala. That didn't go all that well. Then we decided to move to Clojure after a week of banging our heads against Scala. It was specifically the ability to manage the transpiler, and just dealing with parse trees, that made the choice of Clojure specifically compelling.

[0:22:20] SF: That was the main advantage over, say, writing the code directly in Java against the JVM?

[0:22:26] SA: Yeah. We knew we wanted JDBC drivers. I still think that, in general, the driver ecosystem in the Java world is pretty robust and pretty reliable, especially compared to Go, or JavaScript at the time. JavaScript has gotten better, Go is still what it is, [inaudible 0:22:39]. So, there was a choice between writing it in Java, or Scala, or Clojure, but the decision to use the JVM was probably the first easy decision that we made.

[0:22:49] SF: Has that choice of language been a challenge in terms of bringing in new engineers to the company? Is it harder to find people that know the language?

[0:23:00] SA: Not at all.

[0:23:01] SF: No?

[0:23:01] SA: I mean, it's actually been beneficial in that regard. I think a lot of people want to write in Clojure. It's one of those languages which just has a specific set of ergonomics. If you don't like parentheses, sorry, it's really not going to be fun for you. I do think it's given us a pretty concrete advantage, where lots of people just want to write Clojure for a living, and we have that as a benefit of working on the code base. From that perspective, it's been very, very beneficial. I also think - and this is my personal opinion - there are good engineers and bad engineers. Good engineers can pick up new languages. If you're a good engineer in C#, or F#, you can probably learn Clojure. In general, we have been very cool with people coming in wanting to learn Clojure, even if they don't have it dialed in yet.

[0:23:49] SF: Are there certain advantages or disadvantages to running on the JVM for this particular application?

[0:23:55] SA: The main advantage we have is specifically for the open-source, self-hosted world, where it is just a single file. You download an uber-jar; it's a single download. You run java -jar, and it either works or it doesn't. There's just a certain predictability and atomicity to the installation. That's been a huge, huge thing. I really don't think that we would have grown as fast, or as well, had we had a 20-page installation process that required compiling native extensions and scouring some repository for the right version, or something. Our ability to build that single binary has been critical. I still think that was categorically the right thing to do all along.

Dealing with the JVM is a dark art. There are certain times when debugging things in Clojure land and JVM land has been challenging. The ecosystem is definitely leaps and bounds beyond where it was when we started. I'd say that it's probably a bigger, fatter binary than we might have gotten in other places. Because it's an uber-jar, because it has everything bundled in, it is a heavier file than if it was just a stripped-down code base where you go pull in all your dependencies.

[0:25:15] SF: With the transpiling to different versions and flavors of SQL, different DBNFs, was there a particular hard engineering challenges with creating that?

[0:25:25] SA: I mean, it was a pain. Yeah. It's a lot of code. I've lost track of exactly how much it is, but I want to say it's 50,000 or 70,000 lines of just fairly dense Clojure. There's a ton of adjacent stuff we use. It's highly non-trivial. I think it's fairly gnarly, complicated code. It was a difficult task. I think the folks on the team did it really well, and we've gotten someplace really cool with it. It was a fairly difficult undertaking that people managed to pull off, and we've gotten a lot of benefit from it.

It was probably a dumb idea. Looking back on it, it felt like, "Hey, we're going to write a compiler." A more sensible, measured person might have said, "Yeah, let's try to figure out a way to win without doing that." In some ways, it was taking the hard way down the mountain.

[0:26:17] SF: If you did it again and went a different direction, what would that direction look like?

[0:26:22] SA: I still think I would make the big decisions the same way, given what I knew. I still think that having a target-independent intermediate language is the right way to do it. Doing it all over again, I'd probably change the level of granularity and abstractness of the language, and have it be even further away from SQL than it actually is. One of the things that has been challenging is that, every once in a while, there's a set of conceptual domain models we have about user land - metrics, models, these things that live in that world - that are hard to map to MBQL primitives. There is a tension with the primitives MBQL is built off of. I'd liken it to: if SQL is assembly, MBQL is C. If I were to do it all over again, rather than creating a C compiler, I'd create a Lisp compiler, where there's the ability to have a higher-level DSL closer to actual user-land concepts, rather than having to express user-land things at the level of abstractness of C on top of assembly. I'd rather have more scaffolding and, in some ways, more abstract concepts that compile down to the target language.
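
A small sketch of what "a higher-level DSL that compiles down" could mean, with all names and structures invented for illustration (this is not how Metabase defines metrics): a named business concept expands into lower-level query primitives, instead of the UI assembling those primitives directly.

```python
# A business concept defined once, in business terms.
METRICS = {
    "shipped_orders": {
        "source-table": "orders",
        "aggregation": [["count"]],
        "filter": ["=", "status", "shipped"],
    }
}

def expand(metric_name, breakout=None):
    """Compile the high-level concept down to a primitive query form,
    optionally refined with a per-question breakout."""
    q = dict(METRICS[metric_name])  # copy the canonical definition
    if breakout:
        q["breakout"] = breakout
    return q

q = expand("shipped_orders", breakout=["country"])
print(q["filter"])  # ['=', 'status', 'shipped']
```

The expansion step is the "more scaffolding" idea: users reason about `shipped_orders`, and only the compiler ever sees the filter and aggregation primitives underneath.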

[0:27:51] SF: If you wanted to go in that direction and build this different level of abstraction, is that something that would be a reasonable project to take on now? Or is it essentially too late, because too many product dependencies exist on the MBQL system?

[0:28:10] SA: I think it's one of those things where there's a lot that's working that we don't want to mess up. Rewriting the target language, which I want to say is at the center of on the order of 200,000 lines of code - is the additional benefit worth it? I'm not sure. Given where we got to, things worked out, but I think we probably could have gotten here faster. Some of this is not just, are we at a place that is good, but also, getting here took a while. I think that we could have speedrun a lot of it by having better abstractions - for things like metrics and models, some of the higher-level concepts we now have, the way we deal with dimensions, the way we deal with column abstractions, unifying those across different databases when they point to the same thing. For example, latitude really means the same thing in any database. It's not that a column is latitude; there's just a latitude concept. I think we could have speedrun our way to where we got in maybe half the time by having higher-level scaffolding. I don't know if I would rip it all out now.

[0:29:16] SF: Is there some level of caching of the data that's happening within Metabase as well?

[0:29:21] SA: Yeah. There are a couple of variants of caching. The simplest one is just, hey, you ran a query, we'll cache it for you. That has some speedup at some level. I don't know, it's just caching, right? Different vendors have different ways of saying, "We can speed up your stuff by 2,000% by whatever." We have in-memory caching. We do a fair amount of pre-computation, especially for models and metrics, where we will essentially pre-compute on some schedule, or on some push trigger. Those are two different ways of viewing it. Then there's a more manual version, where as you start thinking about cross-database data sets, you just have those live in a centralized data warehouse, or some centralized place. Depending on how you structure things, you can treat that as a cache, where you're pulling things from a database of record and stuffing them into this other place that's much faster, then using that as a read-only cache. But then it's usually a periodic pull, or a push from the centralized databases. So, two-ish layers of caching, and arguably a third level as well.
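
The simplest layer described above - cache a query's result for N seconds - can be sketched in a few lines. This is an illustrative toy, not Metabase's implementation; the `slow_run` function stands in for a real warehouse round trip:

```python
import time

class QueryCache:
    """Cache query results for a fixed TTL, keyed by the query text."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # query text -> (timestamp, result)

    def get(self, query, run):
        """Return a fresh cached result, or execute `run` and cache it."""
        hit = self.entries.get(query)
        now = time.time()
        if hit and now - hit[0] < self.ttl:
            return hit[1]
        result = run(query)
        self.entries[query] = (now, result)
        return result

calls = []
def slow_run(q):
    calls.append(q)  # pretend this hits the data warehouse
    return 42

cache = QueryCache(ttl_seconds=60)
first = cache.get("SELECT count(*) FROM orders", slow_run)
second = cache.get("SELECT count(*) FROM orders", slow_run)
print(first, second, len(calls))  # 42 42 1 -- second call served from cache
```

The TTL is also exactly where the staleness trade-off discussed next lives: a longer TTL means fewer warehouse hits, but a wider window in which the cached number can drift from the source.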

[0:30:31] SF: Do you run into any challenges with, essentially, the data getting out of sync? What the user's pulling is coming from the cache, but the actual underlying data has changed in some significant way?

[0:30:42] SA: In theory, yes. In practice, not that often. I think that usually manifests when something's busted. Data staleness is usually the way this stuff comes up, as opposed to the cache itself being a problem. In general, we cache things for N seconds, or N days. A lot of analytics still is not fully real-time. You don't have a single database that is consistently and always and forever up to date. There are often multiple writers into it that have different schedules. It's not uncommon to have daily numbers for some data sets, or to have, for example, a pull from Salesforce every 20 minutes. The underlying data set often has a distribution of data freshness.

I think that the overall analytics profession has just learned to absorb this and to try to find ways to both live with the fact that there's going to be different data freshness and try to propagate freshness through lineage, or whatever tools you have, as well as try to make the way that you calculate the numbers that matter be done in a way that doesn't require you to be able to hit a fully fresh data set that's completely consistent.

Just as an example, you'll often be pulling, I think, we pull from 20 different data sources into our data warehouse. We have stuff in Stripe, stuff in our CRM, stuff in different services we run. Those are all happening on different schedules. They're not all happening exactly on the minute that the data point is generated. There is often a little bit of soft inconsistency. But for the most part, you can get around it, get around the implications most of the time.
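That "distribution of data freshness" can be made concrete: each source lands in the warehouse on its own cadence, so "fresh" means something different per source. A minimal sketch, with entirely hypothetical source names and sync intervals:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sync schedules: each source syncs into the warehouse on
# its own cadence, so staleness has to be judged per source.
EXPECTED_INTERVALS = {
    "stripe": timedelta(hours=1),
    "salesforce": timedelta(minutes=20),
    "app_db": timedelta(days=1),
}


def stale_sources(last_synced, now=None):
    """Return sources whose last sync is older than their expected interval."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, ts in last_synced.items()
        if now - ts > EXPECTED_INTERVALS.get(name, timedelta(days=1))
    )
```

A check like this is one way to tell "the data is stale because something's busted" apart from "the data is as fresh as its schedule allows", which is the distinction Sameer draws above.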

[0:32:28] SF: How does the permissioning model work and how fine-grained is that?

[0:32:32] SA: Permissioning is the bane of my existence. If you were to ask me, what did we mess up? A lot of those roads go to permissions. I do think it's actually really, really hard to construct a permission system that gives everyone the knobs they need without creating a monster. I think that we've had very different perspectives on this over the years. Maybe just to make this somewhat entertaining, people can have a good time off of our misery.

Once upon a time, we were just really centered around this idea that you give people access to data, and then the actual products of the data figure out whether someone has access to a given report or not. That didn't really go down very well. After a lot of kicking and screaming, we were pulled over into a world where we have a parallel system of folders, where you have collections, collections have permissions, and they have sub-collections. There's a mixture of the ability to lock things down department by department, or function by function. But anything you put in collections, you can use that folder metaphor. People have read, write, and admin access to those.

We simultaneously have the ability to lock things down by data set. For example, you can say, these three tables have PII and these eight groups can't touch that. You're not able to look up user addresses, for example, if you're an intern. Then on the more paid side, we also have data sandboxing, where you have the ability to lock things down by column or row, where you can basically say, interns are allowed to see aggregate metrics based on users, but they're not allowed to look up phone numbers of customers. So there are three different permission systems: data access, collection permissions, and then lastly, more bespoke and more complicated conditional ways of creating hierarchical permissions, or column or row controls.
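The row- and column-level sandboxing described here boils down to stripping forbidden columns and filtering rows before results reach a restricted group. A minimal sketch under assumed semantics; the policy structure, group, table, and column names are all made up for illustration (Metabase's real sandboxing is configured in the product, not hand-coded like this):

```python
# Hypothetical sandbox policy: per group and table, which columns may be
# seen, and a row predicate applied before results are returned.
SANDBOX_POLICIES = {
    "interns": {
        "users": {
            "allowed_columns": {"id", "signup_date", "plan"},
            "row_filter": lambda row: row.get("region") != "restricted",
        }
    }
}


def apply_sandbox(group, table, rows):
    """Strip forbidden columns and drop disallowed rows per the group's policy."""
    policy = SANDBOX_POLICIES.get(group, {}).get(table)
    if policy is None:
        return rows  # no sandbox for this group/table: full access
    cols = policy["allowed_columns"]
    keep = policy["row_filter"]
    return [
        {k: v for k, v in row.items() if k in cols}
        for row in rows
        if keep(row)
    ]
```

With a policy like this, an intern querying the users table gets aggregate-friendly columns but never sees a phone number, which matches the scenario described above.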

[0:34:33] SF: Can I also control, if I create some view of the data, can I control what level of access someone has to manipulate it, so I could essentially create a view of the data that is maybe a read-only view that I embed in my application?

[0:34:47] SA: 95% of our usage is read-only. We do have the ability to do write-back, but that's not a common thing. But, yeah, there are definitely lots of ways to create safe little sandboxes for people you have differential trust in to play. A lot of what we sell is really things that help you in these various scenarios. I think for most people that are operating Metabase in a pretty high-trust environment, where everyone has the same permissions, and you're all part of the same team, the open-source version is more than good enough. Then at some point, as you have less trust and less spontaneity in your group, the paid features are really good.

[0:35:23] SF: You have around, I think, 40,000 GitHub stars. Tell me about the motivation behind open sourcing Metabase.

[0:35:31] SA: Yeah. I mean, I'd say, we probably have fewer stars than we should, given our footprint. I think we've never really played much in the way of the GitHub vanity metric games. We're open source first and foremost because I think that's the right way to consume software. If you're running something in your data center and you're touching data warehouses that matter, I actually think that open source is a better format for consuming software. I mean, if you want to consume a service, that's great. Those work out really well. I do think there's something to your data stack being open source first and foremost. There's just a lot of things that that simplifies. It's easy to do audits. It's easy to be paranoid with security measures. It's easy to fix things yourself, rather than waiting for them to be fixed by a vendor at a speed you may not like.

I just think there's a lot - and maybe this is just me talking from my own formative career, but I've often had to run software from vendors that was just not being fixed, or was breaking in weird ways, and the ability to go into the source code and muck around was something that I really valued. Just on a personal level, I think that's how most software should be delivered, at least at this point in time. As the world changes, my opinion there could change. I think, given that we have a lot of interoperability, so we're targeting 30-ish databases, having people be able to inspect the drivers and be like, actually, the way you're hitting the index here is hokey, you should do it this way instead, is very beneficial.

I do think that we have gained a ton from being an open-source product in terms of information, adoption, usage. We still very much appreciate people complaining. It sounds weird, but we get a lot of value from people complaining. It gives us a pretty clear sense of who wants what, and how badly they want it. It's an amount of information that, in other contexts, I would have spent a lot of money to generate. Having something that is in the public eye is actually very valuable in that regard.

[0:37:31] SF: For something like this, where you talked about how you feel open source is, essentially, the model through which software should be consumed. Then, from a business side, the value the business can charge money for is no longer, essentially, the lines of code that they've written. They have to find other ways of, essentially, bringing value. In companies that are open source first, or really investing in open source, how do you think they need to think about bringing business value, so that they can actually pay the bills at some point?

[0:38:03] SA: The general frame that I have there is you should understand what you'll be charging for very, very early on. I think that it's dangerous to write the project, release it, run it for a year, and then be like, "Gee whiz. How do I make money off this thing?" I think that most software ideally has a specific user. It has a specific set of constituents and people that get value from it. You should understand who's using it, why they're using it, what they value, what the other cast of characters are, and then, assuming you're going to commercialize it, what the lines of commercialization are, and then try to do a really good job of drawing those lines.

I think we, from very early on, knew we wanted to charge for white labeling, and that if you wanted to embed us in your application, that's great. We're an application first and foremost, so if you want to white label us, that's going to be a paid thing. We're not building an open-source library for you to build your own analytics applications. We're explicitly building an application that you can embed. I think that created a lot of clarity. It made it easy to understand how the roadmap should look. It, hopefully, made us predictable to our users. I don't think we've ever pulled any rugs out from under anyone, where we took away features, or did anything too capricious.

I think that if you're planning to, as an entrepreneur, as a founder, or as a company, release software through open source, understand what people will eventually pay for. The clearer that vision is, and the more justifiable it is, the more likely you are to get the lines right. There are a lot of projects that tried to commercialize and it bombed. For a long time, there were no open-source companies, then there was a flurry, and then a lot of them had a come-to-Jesus moment. I think that one of the things that has separated the people that have won has just been some sense of, okay, this is why someone pays.

I think it's important to separate out the winning products. Without a winning product, you're not really playing the open-source game. You're just having some weird, half-assed marketing side adventure. Understand what you're giving away and why, and why people want it, and make sure that it actually can replace the alternatives, and it's not just a crippled version. Secondarily, like, cool, if you win that, what exactly is it you're selling? For us, a lot of that just boils down to understanding the installer, and then their boss. We try to make the things that installers value free, and the things we think their bosses will demand after it's successful, paid. That was the general heuristic we ran with. It worked in some ways, not in others. But I think having something like that from the very, very early days, that you believe in, that you're able to validate somehow, some way, even before you start charging money, is really important.

[0:41:02] SF: Yeah. I mean, I think what you can charge for, and how people evaluate the value you're delivering, has changed over time. There was a time where you could write shrink-wrapped software, and you were explicitly charging, essentially, for that software. Obviously, it's bringing value, but you were in a lot of ways charging for, essentially, the lines of code that you wrote. I think now, especially with managed services and other ways of monetizing and commercializing businesses, the model has changed, where you can essentially give away the source code, and the value is not there; it's somewhere else, whether that's making it really easy to run, or certain enterprise features that are maybe not available in the open-source model, or whatever it is.

Do you think now where more and more code is essentially being written with at least the assistance of AI, that in some ways even lowers the value of the lines of code even more, where it makes sense to figure out other ways of, essentially, delivering value to your customer?

[0:42:01] SA: I mean, I think this depends on what the implicit rate of improvement for AI is. I think there's a version where no humans have any value, therefore, don't bother. I'm not quite that extremist, but I think there is another version where it's like, it's mostly just going to be where it is today with slightly better ergonomics. Somewhere between those two poles is the path we're on. The reason I bring that up is there's some part of that spectrum where the ability to turn arbitrary incantations in something resembling natural language into something that works remains very, very valuable. That LLMs and co-pilots and all that are really just a higher-level language, but you're still fundamentally working in a higher-level language.

In some ways, the LLM is really just a compiler, or interpreter, for your super-leveraged DSL. There's still someone that has to make the incantation. The people that can make that incantation will have valuable skills. The people that are able to pull that together to solve actual problems are still valuable. I do think that as the level of skill required to build a certain system decreases, or changes, it starts to shift value to the people that are able to understand what to build. The relative value of someone that knows, actually, I need to build this specific Lego to make money, is even more important.

I still think that, for most of that spectrum of how far AI goes, there still needs to be someone holding a wand and speaking the incantation. I just think that the nature of that language will change. How much of the value is in the prompting, versus the actual post-processing or pre-processing, how much of it is in the actual model training, how much is in fine-tuning. There's going to be a lot of stuff that is still high value that has to be done by somebody. Unless you assume that LLMs and the systems you build around them get so advanced so fast that all of that gets done by them, there are still going to be humans doing all this. Whether we call them a software engineer, or a prompt engineer, or a product creation specialist, or a magician, it doesn't really matter. There's still going to be some number of people.

I do think that it will probably change the leverage. You will not need a thousand software engineers to build something. You might only need 10 prompt engineers to build something of equal scale. Hopefully, this means we do bigger and crazier stuff, and then we have better toys in the future, and then we're able to tackle bigger projects. I still think that for quite a portion of that spectrum, there will be someone that, and companies that, will still need to figure out what those Legos are, identify them, build the best version of that Lego, and then somehow find a way to get in front of people and have people want to buy from them.

[0:45:11] SF: I think this is a nice way to tie things back to what we were talking about even at the beginning, where you use the analogy about blogging, where if we can reduce, essentially, the configuration setup steps to help people who want to write and put their stuff out there, then you're going to get a lot more creative work that's going on. I think it's similar where if you can essentially lower the barrier of entry being able to create a lot of code and eventually, products, then I don't think it's that there's less people doing that stuff. There's actually more people doing that stuff, because now, it's not that anybody can do it, but someone who has some level of skill can now, essentially, create some kind of product experience, or at least will get there in some stage.

[0:45:54] SA: If I can give maybe a concrete example, which might crystallize this, I think, again, barring some weird singularity, we're probably going to still want iPhone workout apps. Someone's going to have to build the best workout app. The question is whether the primary skill behind the person building the app will be having at least a certain level of proficiency with iOS development and Objective-C and blah, blah, blah, or having the best idea for a workout app. I think that what's going to happen is that having those mechanical skills, which were critical when the iPhone launched, when the best workout app of the first generation came from whoever was able to write a bug-free app, has shifted to who has the best ideas around how to structure the thing.

There's still a market for it. You still have to build it. You still have to build a better one than the next person. There's still going to be people that build that app. Again, they just might have a different title and might be working in a different editor.

[0:46:52] SF: Well, Sameer, thanks so much for being here. I really enjoyed the conversation. We ended up going deep at the end, which I like. I think there's a lot to digest, especially when we're talking about products that are really focused on reducing, I think, the barrier to entry, or the friction involved with accessing, analyzing, and driving value from data.

[0:47:11] SA: Likewise. Had a great time. Thank you for having me on here.

[0:47:13] SF: All right. Thanks, and cheers.

[END]