
At the end of last year, another live broadcast of the Russian PostgreSQL community #RuPostgres took place, during which its co-founder Nikolai Samokhvalov talked with Flant CTO Dmitry Stolyarov about this DBMS in the context of Kubernetes.
We are publishing a transcript of the main part of that discussion; the full video is available on the community's YouTube channel:
Databases and Kubernetes
NS: We're not going to talk about VACUUM and CHECKPOINTs today. We want to talk about Kubernetes. I know you have many years of experience with it. I've watched your videos and even rewatched some of them... Let's get straight to the point: why run Postgres or MySQL in K8s at all?

DS: There is no single answer to this question, and there can't be. But in general, it's simplicity and convenience... potentially. After all, everyone wants managed services.
NS: Something like RDS, only on your own infrastructure?

DS: Yes: like RDS, only anywhere.
NS: "Anywhere" is a good point. In large companies, everything is spread across different places. Then why, if it's a big company, not take a ready-made solution? For example, Nutanix has its own offering, and other companies (VMware, ...) offer the same "RDS, only on your own infrastructure."

DS: But those are single implementations that only work under certain conditions. If we are talking about Kubernetes, there is a huge variety of infrastructure it can run on. Essentially, K8s is the standard API to the cloud...
NS: It's also free!

DS: That's not so important. Being free doesn't matter to a very large segment of the market. Something else matters... You probably remember my talk "Databases and Kubernetes"?
NS: Yes.

DS: I realized it was received very ambiguously. Some people thought I was saying, "Guys, let's move all databases to Kubernetes!", while others decided it was all a pile of terrible reinvented wheels. But I wanted to say something else entirely: "Look at what is happening, what the problems are and how they can be solved. Should databases go to Kubernetes right now? In production? Well, only if you like... doing certain things. But for dev, I can say I recommend it. For dev, dynamically creating and deleting environments is very important."
NS: By dev, do you mean all environments that are not prod? Staging, QA...

DS: If we are talking about performance testing stands, then probably not, because the requirements there are specific. If we are talking about special cases where a very large database is needed on staging, then probably not either... If it's a static, long-lived environment, what is the benefit of having the database in K8s?
NS: None. But where do we even see static environments? A static environment becomes outdated tomorrow.

DS: Staging can be static. We have clients...
NS: Yes, so do I. The big problem is when you have a 10 TB database in production and 200 GB on staging...

DS: I have a really cool case! On staging there is a production-like database in which changes are made, and there is a button: "roll out to production." These changes, the deltas, are then applied (it seems they are simply synchronized via an API) to production. A very exotic option.
NS: I've seen startups in the Valley that are on RDS, or even still on Heroku — these are stories from 2-3 years ago — and they download the dump to their laptops. Because the database is only 80 GB so far, and there is room on the laptop. Then they buy extra disks for everyone so they have 3 databases each and can work on different development tasks. That happens too. I've also seen that some are not afraid to copy prod into staging — it depends a lot on the company. And I've seen that some are very afraid, and that they often lack the time and the hands. But before we move on to that topic, I want to hear about Kubernetes. Do I understand correctly that nobody is running it in prod yet?

DS: We have small databases in prod. We are talking about volumes of tens of gigabytes and non-critical services for which we didn't bother making replicas (and there is no such need) — and provided that there is proper storage under Kubernetes. One of these databases used to run in a virtual machine — in VMware, roughly speaking, on top of a storage system. We placed it in a PV and now we can move it from machine to machine.
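To make that concrete, here is a minimal, hypothetical sketch of what such a setup looks like on the Kubernetes side: a PersistentVolumeClaim that the Postgres pod mounts, so the data volume can be re-attached when the pod moves to another node. The claim name, storage class and size below are placeholders, not Flant's actual configuration.

```bash
# Hypothetical sketch: claim persistent storage for a small Postgres database
# so the data volume survives pod rescheduling to another node.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard   # assumed storage class backed by network storage
  resources:
    requests:
      storage: 50Gi            # "tens of gigabytes", as in the conversation
EOF

# The Postgres pod mounts this claim at its data directory; if the node dies,
# the PV is re-attached to a new pod on another node.
```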
NS: Databases of that size, up to 100 GB, can be rolled out in a few minutes on good disks with a good network, right? A speed of 1 GB per second is no longer exotic.

DS: Yes, for a linear operation that's not a problem.
NS: Okay, prod is still something we only think about. But if we consider Kubernetes for non-prod environments — how do we do it? I see that Zalando is making an operator, Crunchy is building one, and there are some other options. And there is OnGres — that's our good friend Alvaro from Spain: they are making not just an operator but a whole distribution (StackGres), into which, in addition to Postgres itself, they decided to also pack backups, an Envoy proxy...

DS: Envoy for what? Balancing Postgres traffic specifically?
NS: Yes. The way they see it: if you take a Linux distribution and its kernel, then vanilla PostgreSQL is the kernel, and they want to make a distribution that is cloud-friendly and runs on Kubernetes. They bundle components (backups, etc.) and debug them so they work well together.

DS: Very cool! Essentially, it's software for building your own managed Postgres.
NS: Linux distributions have the eternal problem of making drivers so that all hardware is supported. Their idea is that they will run on Kubernetes. I know that in the Zalando operator we recently saw ties to AWS, and that's not great. There shouldn't be dependencies on a specific infrastructure — what's the point otherwise?

DS: I don't know what specific situation Zalando got into, but Kubernetes storage is currently designed in such a way that it is impossible to take a disk snapshot in a generic way. Recently the standard — the latest version of the CSI specification — added the possibility of snapshots, but where is it implemented? Honestly, it's all still so raw... We are trying CSI on top of AWS, GCE, Azure, vSphere, but as soon as you start actually using it, you can see it's not ready yet.
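For readers who want to see what the snapshot capability DS mentions looks like, here is a minimal hedged sketch using the beta snapshot API (the VolumeSnapshotClass name and PVC name are placeholders; the cluster must already have a CSI driver with snapshot support and the snapshot CRDs installed):

```bash
# Sketch of the beta CSI snapshot API (snapshot.storage.k8s.io/v1beta1, the
# version that reached beta around Kubernetes 1.17). Names are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: pg-data-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass   # placeholder snapshot class
  source:
    persistentVolumeClaimName: pg-data     # PVC holding PGDATA
EOF

# A new PVC can later be created from this snapshot via spec.dataSource —
# exactly the "generic disk snapshot" capability discussed above.
kubectl get volumesnapshot pg-data-snapshot
```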
NS: So sometimes you have to tie yourself to a specific infrastructure. I think this is still an early stage — growing pains. A question: what would you recommend to beginners who want to try PostgreSQL in K8s? Which operator, perhaps?

DS: The problem is that Postgres is 3% of our work. We also have a very long list of other software in Kubernetes — I won't even list it all. Elasticsearch, for example. There are a lot of operators: some are actively developed, others are not. We drew up a list of requirements an operator must meet for us to take it seriously — an operator specifically for Kubernetes, not an "operator for doing something under Amazon's conditions"... In fact, there is only one operator we use massively (= for almost all clients): the one for Redis (we will publish an article about it soon).
NS: Not for MySQL too? I know that Percona... since they now deal with MySQL, MongoDB, and Postgres, they will have to build some kind of universal operator: for all databases, for all cloud providers.

DS: We haven't had time to look at the operators for MySQL. It's not our main focus right now. MySQL works fine standalone. Why use an operator if you can just start the database... You can start a Docker container with Postgres, or you can just start it the plain way.
NS: That was also my question. No operator at all?

DS: Yes, for 100% of our clients PostgreSQL runs without an operator. So far, anyway. We actively use operators for Prometheus and for Redis. We have plans to find an operator for Elasticsearch — it's the most pressing, because we want to install it in Kubernetes in 100% of cases. Just as we want MongoDB to always be installed in Kubernetes too. Certain wishes arise here — there is a feeling that something could be done in these cases. As for Postgres, we haven't even looked. Of course, we know about the existence of different options, but in practice we run it standalone.
Testing databases in Kubernetes
NS: Let's move on to the topic of testing. How do you roll out changes to a database — from the DevOps perspective? There are microservices, many databases, something is changing somewhere all the time. How do you ensure normal CI/CD so that everything stays in order from the DBMS point of view? What is your approach?

DS: There is no single answer. There are several options. The first factor is the size of the database we want to roll out. You yourself mentioned that companies have different attitudes toward having a copy of the prod database on dev and staging.
NS: And with GDPR in mind, I think they are getting more and more careful... I can say that in Europe fines have already started.

DS: But you can often write software that takes a dump of production and obfuscates it. You get production data (a snapshot, dump, binary copy...), but anonymized. Alternatively, there may be generation scripts: fixtures, or simply a script that generates a large database. The problem is: how long does it take to create the base image? And how long does it take to deploy it to the target environment?
We arrived at this scheme: if the client has a fixture set (a minimal version of the database), we use it by default. If we are talking about review environments — when a branch is created and an application instance is deployed for it — we roll out a small database there. But another option also turned out well: once a day (at night) we take a dump from production and build a Docker image of PostgreSQL or MySQL with that data already loaded. If you then need to deploy the database 50 times from that image, it is quite simple and fast.
NS: By simple copying?

DS: The data is stored directly in the Docker image. That is, we have a ready-made image, even if it's 100 GB. Thanks to Docker's layers, we can quickly deploy this image as many times as needed. The method is dumb, but it works pretty well.
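A rough sketch of what such a "data baked into the image" scheme could look like — this is not Flant's actual tooling, just an illustration; the dump file name, Postgres version, data directory and registry address are all made up:

```bash
# Sketch: bake an (anonymized) nightly dump straight into a Postgres image,
# so each review environment only has to start a container from it.
cat > Dockerfile.dbimage <<'EOF'
FROM postgres:12
# Keep PGDATA outside the VOLUME declared by the base image,
# otherwise data written at build time would be discarded.
ENV PGDATA=/pgdata
COPY prod-anonymized.sql /prod-anonymized.sql
RUN set -eux; \
    mkdir -p /pgdata && chown postgres:postgres /pgdata; \
    su postgres -c 'initdb -D /pgdata'; \
    su postgres -c 'pg_ctl -D /pgdata -w start'; \
    su postgres -c 'psql -v ON_ERROR_STOP=1 -f /prod-anonymized.sql postgres'; \
    su postgres -c 'pg_ctl -D /pgdata -m fast -w stop'
EOF

docker build -f Dockerfile.dbimage -t registry.example.com/db-snapshot:nightly .
# Every review environment then just runs this image; Docker's layered
# copy-on-write storage makes starting the Nth copy cheap.
```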
NS: Then, during testing, the data changes right inside Docker, right? Copy-on-write inside Docker — throw it away and start over, everything is fine. Great! And you already use this at full tilt?

DS: Have been for a long time.
NS: We do very similar things. Only we don't use Docker's copy-on-write but a different one.

DS: Then it's not generic. The Docker one works everywhere.
NS: In theory, yes. But we also have modules there: you can write different modules and work with different file systems. Here's the thing: from the Postgres side, we look at all this differently. Now I've looked at it from the Docker side and I see that it all works for you. But if the database is huge, say 1 TB, then everything takes long: the nightly operations, and stuffing it all into Docker... And if 5 TB gets stuffed into Docker... Or is that fine?

DS: What difference does it make: they're blobs, just bits and bytes.
NS: Here's the difference: do you do it through dump/restore?

DS: Not necessarily. The methods for generating this image can vary.
NS: For some clients we've set it up so that, instead of regularly generating a base image, we keep it constantly up to date. It is essentially a replica, but it receives data not directly from the master but through the archive — a binary archive into which WALs are shipped every day and backups are taken as well... These WALs then arrive — with a slight delay (literally 1-2 seconds) — at the base image. We then clone it any way we like — right now ZFS is our default.

DS: But with ZFS you are limited to a single node.
NS: Yes. But ZFS also has the magical send: with it you can send a snapshot and even (I haven't really tested this yet, but...) send the delta between two PGDATA states. In fact, we have another tool that we hadn't really considered for such tasks. PostgreSQL has pg_rewind, which works like a "smart" rsync, skipping a lot of things it doesn't need to look at because nothing has changed there for sure. We can do a quick synchronization between two servers and rewind in the same way. So we are trying, from this more DBA-ish side, to build a tool that lets you do the same thing you described: we have one database, but we want to test something on it 50 times, almost simultaneously.

DS: 50 times means you need to order 50 Spot instances.
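As an illustration of the ZFS-based thin cloning described above — this is a sketch, not the exact tool; the dataset names, mountpoints and port are invented:

```bash
# The continuously updated replica keeps PGDATA on a ZFS dataset.
zfs snapshot tank/pgdata@for-test-42                  # instant, copy-on-write
zfs clone    tank/pgdata@for-test-42 tank/clone-42
zfs set mountpoint=/clones/42 tank/clone-42

# Start an independent Postgres on the clone (a separate port per clone):
pg_ctl -D /clones/42 -o "-p 6042" start

# When the experiment is done, the whole clone is thrown away in seconds:
pg_ctl -D /clones/42 stop
zfs destroy tank/clone-42

# zfs send can also ship a snapshot, or just the delta between two snapshots,
# to another host:
zfs send -i tank/pgdata@yesterday tank/pgdata@today | ssh otherhost zfs recv tank/pgdata
```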
NS: No, we do everything on one machine.

DS: But how do you deploy it 50 times if this one database is, say, a terabyte? Most likely it needs something like 256 GB of RAM?
NS: Yes, sometimes a lot of memory is needed — that's normal. But here is a real-life example. The production machine has 96 cores and 600 GB of RAM. At the same time, the database uses 32 cores (sometimes even just 16 now) and 100-120 GB of memory.

DS: And 50 copies fit in there?
NS: There is only one actual copy, and then copy-on-write (the ZFS kind) does its thing... Let me explain in more detail.

For example, we have a 10 TB database. A disk was made for it, and ZFS compressed it by another 30-40 percent. Since we don't do load testing, the exact response time doesn't matter to us: let it be up to 2 times slower — that's okay.

We let programmers, QA, DBAs and so on run tests in 1-2 threads. For example, they can run some migration. It doesn't need 10 cores at once — it needs one Postgres backend, one core. The migration starts, maybe autovacuum kicks in too — then a second core gets used. We have allocated 16-32 cores, so 10 people can work simultaneously without problems.

Since PGDATA is physically the same, it turns out we are essentially fooling Postgres. The trick is this: say we start 10 Postgres instances at the same time. What is usually the problem? People set shared_buffers to, say, 25%. That would be 200 GB, and you couldn't start more than three instances before running out of memory.

But at some point we realized this isn't necessary: we set shared_buffers to 2 GB. PostgreSQL also has effective_cache_size, and in reality only it affects plans. We set it to 0.5 TB. And it doesn't even matter that this memory isn't really there: the planner builds plans as if it were.

Accordingly, when we test some migration, we can collect all the plans and see how it will go in production. The timings will be different (slower), but the data we actually read and the plans themselves (what JOINs are used, etc.) come out exactly the same as in production. And you can run many of these checks in parallel on one machine.

DS: Don't you think there are a few problems here? The first: this is a solution that only works for PostgreSQL. The approach is very specific, it is not generic. The second: Kubernetes (and that's where the cloud is heading now) involves many nodes, and those nodes are ephemeral. In your case it is a stateful, persistent node. These things contradict each other for me.
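A minimal sketch of the shared_buffers / effective_cache_size trick NS describes above, assuming a clone laid out as in the earlier ZFS sketch; the path, port and config location are illustrative:

```bash
# Give each thin clone a tiny real buffer cache but tell the planner it has
# production-sized memory, so query plans match production even though only a
# fraction of the RAM actually exists.
cat >> /clones/42/postgresql.conf <<'EOF'
shared_buffers = 2GB            # small real allocation: ~10 instances fit on one box
effective_cache_size = 500GB    # what the planner assumes; this memory need not exist
EOF

pg_ctl -D /clones/42 -o "-p 6042" restart
# EXPLAIN on the clone now shows the same join order and index choices as prod.
```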
NS: First — I agree, this is a purely Postgres story. I think if we had direct IO and a buffer pool taking up almost all the memory, this approach wouldn't work — the plans would come out different. But for now we only work with Postgres; we're not thinking about the others.

About Kubernetes. You yourself keep saying that we have a persistent database. If the instance crashes, the main thing is to preserve the disk. Here, too, we have the entire platform in Kubernetes, with the Postgres component kept separate (although someday it will be there as well). So it works like this: the instance died, but we preserved its PV and simply attached it to another (new) instance, as if nothing had happened.

DS: From my point of view, we create pods in Kubernetes. K8s is elastic: pods get provisioned on their own as needed. The task is simply to create a pod, say that it needs X resources, and then K8s figures it out by itself. But storage support in Kubernetes is still unstable: in 1.16, in 1.17 (this release came out just weeks ago), these features only reach beta.

Six months or a year will pass and it will become more or less stable, or at least will be declared as such. Then the possibility of snapshots and resizing solves your problem completely. Because you have a base to build on. Yes, it may not be very fast, but the speed depends on what is "under the hood," because some implementations can copy and do copy-on-write at the disk subsystem level.
NS: All the providers (Amazon, Google, ...) also need to start supporting this version — that takes time too.

DS: For now we don't use those features. We use our own.
Local development under Kubernetes
NS: Have you run into the kind of wish where you need to bring up all the pods on one machine and do some light testing? To quickly get a proof of concept and see that the application runs in Kubernetes, without allocating a bunch of machines for it. There's Minikube, right?

DS: It seems to me that this case — deploying on a single node — is exclusively about local development, or some manifestations of that pattern. There is Minikube, there are k3s and KIND. We are heading toward using KIND (Kubernetes IN Docker). We've now started using it for tests.
NS: I used to think it was an attempt to wrap all the pods into one Docker image. But it turned out to be about something else. There are still separate containers and separate pods — just inside Docker.

DS: Yes. And a rather funny imitation is done there, but the point is... We have a deployment tool, werf. We want to add a mode to it — conditionally, werf up: "bring me up a local Kubernetes." And then run a conditional werf follow. Then the developer will be able to edit code in the IDE, while a process runs in the system that watches for changes, rebuilds the images, and redeploys them into the local K8s. That is how we want to try to solve the problem of local development.
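For the "Kubernetes in Docker" case, here is a minimal sketch with kind. Note that the werf up / werf follow modes above are described as plans, not shipped commands, so only kind and kubectl — which exist today — are shown; the cluster name and manifest path are placeholders:

```bash
# Bring up a throwaway single-node Kubernetes as Docker containers on a laptop.
kind create cluster --name local-dev
kubectl cluster-info --context kind-local-dev

# Deploy the application under test into this local cluster, e.g.:
kubectl apply -f ./deploy/        # hypothetical manifests directory

# Tear everything down when done:
kind delete cluster --name local-dev
```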
Snapshots and database cloning in the K8s world
NS: Coming back to copy-on-write. I've noticed that clouds also have snapshots, and they work differently. For example, in GCP: you have a multi-terabyte instance on the US east coast. You take snapshots periodically. You bring up a disk copy on the west coast from a snapshot — in a few minutes everything is ready, it works very quickly, only the cache in memory needs to be warmed up. But those clones (snapshots) exist to provision a new volume. That's great when you need to create many instances.

But for tests, it seems to me, the snapshots you're talking about in Docker, or I'm talking about in ZFS, btrfs or even LVM... — they let you avoid actually creating new data on the same machine. In the cloud you still pay for them every time and wait not seconds but minutes (and in the case of lazy loading, possibly hours).

Instead, you can get this data in a second or two, run a test and throw it away. These snapshots solve different problems: the first case is about scaling out and getting new replicas, the second is about tests.

DS: I don't agree. Making volume cloning work properly is the cloud's job. I haven't looked at their implementations, but I know how we do it on our own hardware. We have Ceph, and in it you can tell any physical volume (RBD) to be cloned and get a second volume with the same characteristics, the same IOPS, etc., in tens of milliseconds. You have to understand that there is a tricky copy-on-write inside. Why shouldn't the cloud do the same? I'm sure they are trying to, one way or another.
NS: But it will still take seconds, tens of seconds, to bring up the instance, get Docker onto it, and so on.

DS: Why bring up a whole instance every time? We have an instance with 32 cores, or 16... and several environments fit into it — four, for example. When we order a fifth, a new instance comes up, and later it gets deleted.
NS: Yes, it's interesting — with Kubernetes the story is different. Our database isn't in K8s; it's a single instance. But cloning a multi-terabyte database takes no more than two seconds.

DS: That's cool. But my original point is that this is not a generic solution. Yes, it's cool, but it only suits Postgres and only on a single node.
NS: It suits not only Postgres: the plans, as I described, will only work out that way in Postgres. But if we don't care about the plans and we just need all the data for functional testing, then it suits any DBMS.

DS: Many years ago we did this with LVM snapshots. It's a classic. The approach was used very actively. It's just that stateful nodes are a pain: you must not drop them, you always have to keep them in mind...
NS: Do you see any hybrid possibility here? Say, the stateful part is some pod that serves several people (many testers). We have one volume, but thanks to the file system the clones are local. If the pod dies, the disk remains — the pod comes back up, reads the information about all the clones, picks everything up again and says: "Here are your clones on these ports, keep working with them."

DS: Technically that means that within Kubernetes it's one pod inside which we run many Postgres instances.
NS: Yes. It has a limit: say, no more than 10 people work with it at the same time. If you need 20 — launch a second such pod. It's entirely realistic to clone it: having received a second full volume, it will have the same 10 "thin" clones. Don't you see such a possibility?

DS: We also have to add security concerns here. That kind of setup implies the pod has elevated capabilities, because it can perform non-standard operations on the file system... But I repeat: I believe that in the medium term storage in Kubernetes will be fixed, the whole story with volumes in the clouds will be fixed — everything will "just work." There will be resizing, cloning... There is a volume — we say, "Create a new one based on it," and a second and a half later we get what we need.
NS: I don't believe in a second and a half for many terabytes. With Ceph you do it yourself, but you're talking about the clouds. Go to the cloud, on EC2, make a clone of a multi-terabyte EBS volume and see what the performance is. It won't take a few seconds. I'm very curious when they will reach that level. I understand what you're saying, but allow me to disagree.

DS: Ok, but I said in the medium term, not the short term. Over several years.
About the PostgreSQL operator from Zalando
In the middle of the broadcast, Alexey Klyukin, a former Zalando developer, also joined the conversation and spoke about the history of the PostgreSQL operator:
It's great that this topic came up at all: both Postgres and Kubernetes. When we started doing this at Zalando in 2017, it was a topic everyone wanted to tackle, but nobody did. Everyone already had Kubernetes, but when asked what to do with the databases, even people like Kelsey Hightower, who preached K8s, said something like:
"Go use managed services; don't run databases in Kubernetes. Otherwise your K8s will decide, for example, to do an upgrade, shut down all the nodes, and your data will fly far, far away."
We decided to make an operator that, contrary to this advice, would run the Postgres database in Kubernetes. And we had a good foundation — Patroni. This is automatic failover for PostgreSQL done right, i.e. using etcd, Consul or ZooKeeper as the store for cluster information — a store that gives everyone who asks (for example, who the current leader is) the same information, even though everything is distributed, so there is no split brain. Plus we already had a Docker image for it.
The need for auto failover in the company appeared after the migration from an internal bare-metal data center to the cloud. The cloud was based on an in-house PaaS (Platform-as-a-Service) solution. It is open source, but bringing it up took a lot of work. It was called STUPS.
Initially there was no Kubernetes. More precisely, when our own solution was being deployed, K8s already existed, but it was so crude that it wasn't suitable for production. That was, I believe, 2015 or 2016. By 2017 Kubernetes had become more or less mature — we needed to migrate there.
And we already had a Docker container. There was a PaaS that used Docker. So why not try K8s? Why not write our own operator? Murat Kabilov, who came to us from Avito, started it on his own initiative as a side project — "to play around" — and the project "took off."
But mainly I wanted to talk about AWS — why there was historically AWS-specific code in it...
When you run something in Kubernetes, you need to understand that K8s is a work in progress. It is constantly developing, improving, and periodically even breaking. You need to closely follow all the changes in Kubernetes, be prepared to dive into it and learn in detail how it works — perhaps more than you would like. This applies, in principle, to any platform on which you run your databases...
So, when we were writing the operator, we had Postgres working with an external volume (EBS in this case, since we ran on AWS). The database kept growing, and at some point it needed a resize: for example, the initial EBS size is 100 TB, the database has grown to fill it, and now we want to make the EBS 200 TB. How? You could, say, dump and restore onto a new instance, but that is long and involves downtime.

So we wanted a resize that would expand the EBS volume and then tell the file system to use the new space. And we did it, but at that time Kubernetes had no API for the resize operation. Since we ran on AWS, we wrote code that called its API directly.
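Roughly what that amounts to from the command line — an illustration, not the operator's actual code; the volume ID, sizes, device name and PVC name are placeholders:

```bash
# Grow the EBS volume via the AWS API (size is in GiB), then grow the filesystem.
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 200
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0   # wait for completion

sudo resize2fs /dev/xvdf     # ext4; use xfs_growfs for XFS

# Today Kubernetes offers a generic path: edit the PVC, provided the storage
# class has allowVolumeExpansion enabled.
kubectl patch pvc pg-data -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
```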
Nothing stops anyone from doing the same for other platforms. There is no assumption in the operator that it can only run on AWS and won't work anywhere else. In general, it's an open-source project: if anyone wants to speed up adoption of the new API, they are welcome. There is GitHub and there are pull requests — the Zalando team tries to respond to them quickly and move the operator forward. As far as I know, the project took part in Google Summer of Code and some other similar initiatives. Zalando works on it very actively.
PS Bonus!
If you are interested in the topic of PostgreSQL and Kubernetes, note that another #RuPostgres live session took place last week, in which Alexander Kukushkin from Zalando talked with Nikolai. The video is available here.