Should Kubernetes controllers+CRDs managing resources in the external world (non-k8s APIs) try to fix drifts (such as out of band modifications made in the external API) periodically –or assume they're the only actor, and wait for the next modification to reconfigure? This makes scaling harder. - ThreadSky

ahmet.dev • 45 days ago

Should Kubernetes controllers+CRDs managing resources in the external world (non-k8s APIs) try to fix drift (such as out-of-band modifications made in the external API) periodically, or assume they're the only actor and wait for the next modification to reconfigure? Periodic drift correction makes scaling harder.

Comments

kakkoyun.me•44 days ago

Using k8s for external APIs bloats the API server, hurts performance, and adds fragility.

Not every system needs to fit the k8s paradigm.

ahmet.dev•44 days ago

There's no point in writing the KRM paradigm off that quickly. We achieve marvelous things with it. Any alternative would look like writing hefty Temporal workflows that are not event-driven. I like this better.

kakkoyun.me•44 days ago

Convenience is undeniably powerful, and KRM enables incredible things. But pushing it too far can come at a cost. Convenient isn’t always the same as good, and it’s worth balancing trade-offs carefully.

ry.codes•44 days ago

What makes you think the CR is more correct than reality? IMO once reconciled the operator is done.

ahmet.dev•44 days ago

The CR is the intended config, so I expect it to overwrite reality. The question is whether it should try to course-correct periodically.

There are some systems (like Prodspec/Annealing) that learn from the current state and bring it back to the configuration SoT, but they're few and rare.

ry.codes•44 days ago

My interpretation is that the CR is the intended state during reconciliation, not forever. If something/someone changes the actual state, they'll be surprised to find it asynchronously revert. I don't think the CR can realistically be a SoT most of the time.

ahmet.dev•44 days ago

For most folks that ship has sailed. KRM-based tools like Crossplane manage infra in many shops. That's probably why, as a platform team, the moment you give users access to CRs, you should take away their access to modify the resource any other way.

ry.codes•44 days ago

Yeah that worries me. My team's operators own their resources -- even we can't manipulate them ourselves. But there is still drift. Some state transitions are irreversible, for example.

ahmet.dev•44 days ago

Yeah, we have a few APIs like that too; they don't play well with the "open world" assumption of KRM. We do our best to validate the state transitions in webhooks.

IMO not being able to manipulate them is a feature. :P When we need to manipulate them, we set a field called `spec.paused: true`.
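
A minimal sketch of the pause switch described above, assuming a purely in-memory setup. Only `spec.paused` comes from the comment; `Bucket`, `FakeExternalAPI`, and `reconcile` are hypothetical names for illustration and are not part of any real operator.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """Hypothetical custom resource: desired config plus a pause switch."""
    name: str
    config: dict
    paused: bool = False  # mirrors the `spec.paused` field mentioned above


@dataclass
class FakeExternalAPI:
    """Stand-in for an external API (assumption); stores live config per name."""
    live: dict = field(default_factory=dict)


def reconcile(cr: Bucket, api: FakeExternalAPI) -> str:
    """Overwrite external state with the CR unless reconciliation is paused."""
    if cr.paused:
        # While paused, out-of-band edits are left alone.
        return "skipped"
    if api.live.get(cr.name) == cr.config:
        return "in-sync"
    api.live[cr.name] = dict(cr.config)
    return "updated"
```

With this shape, a manual change made while `paused` is true survives every reconcile and is overwritten on the first reconcile after the flag is cleared.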

willgottschalk.bsky.social•45 days ago

To me, it kinda depends on how much you trust the users of the CRD. Fixing the drift is more in the spirit of Kubernetes though

embano1.mgasch.com•45 days ago

If you’re in camp „k8s is the source of truth“ (like me) you might want to check the AWS ACK project which uses this model by default to reconcile AWS resources - incl periodic resync.

cc/ @a-hilaly.bsky.social

ahmet.dev•45 days ago

Concrete example: imagine a Bucket resource that creates/configures buckets via the S3 API. Someone goes to the S3 console and edits the bucket configuration. Unless your controller periodically queries the S3 API, it won't overwrite the out-of-band modifications until the next time the Bucket resource is modified.
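
The difference between the two behaviors in this example can be sketched roughly as follows. This is an illustrative in-memory model, not a real controller: `BucketController`, `maybe_reconcile`, and `resync_period` are hypothetical names, and a plain dict stands in for the S3 API.

```python
import copy


class BucketController:
    """Toy controller: reconciles on spec changes, optionally also on a timer."""

    def __init__(self, external, resync_period=None):
        self.external = external            # dict standing in for the S3 API
        self.resync_period = resync_period  # None = react to spec changes only
        self.observed_generation = {}
        self.last_sync = {}

    def maybe_reconcile(self, name, generation, desired, now):
        """Return True if the external resource was (re)written."""
        spec_changed = self.observed_generation.get(name) != generation
        resync_due = (
            self.resync_period is not None
            and now - self.last_sync.get(name, float("-inf")) >= self.resync_period
        )
        if not (spec_changed or resync_due):
            return False
        # The CR is treated as the source of truth and overwrites live state.
        self.external[name] = copy.deepcopy(desired)
        self.observed_generation[name] = generation
        self.last_sync[name] = now
        return True
```

In the event-only mode, a console edit stays in place until the resource's generation changes; with a resync period, it is overwritten once the period elapses, at the cost of one external call per resource per period.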

karlkfi.bsky.social•45 days ago

Ideally an operator would watch the resources it manages and revert drift as soon as possible, to minimize downtime.

But in the case where watching is not possible, polling may be acceptable.

For scaling, consider adding a component that centralizes the polling and transforms it into watches.
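
One way to picture such a component is a single poller that lists everything in one pass, diffs against the previous snapshot, and fans out watch-style events. The sketch below assumes the external API offers some bulk list call (represented by the hypothetical `list_all` callable); `PollingWatcher` and `diff_snapshots` are illustrative names.

```python
import copy
from typing import Callable, Dict, List, Tuple

Snapshot = Dict[str, dict]
Event = Tuple[str, str]  # (event_type, resource_name)


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[Event]:
    """Turn two polled snapshots into ADDED/MODIFIED/DELETED events."""
    events: List[Event] = []
    for name in sorted(old.keys() | new.keys()):
        if name not in old:
            events.append(("ADDED", name))
        elif name not in new:
            events.append(("DELETED", name))
        elif old[name] != new[name]:
            events.append(("MODIFIED", name))
    return events


class PollingWatcher:
    """Centralizes polling and notifies subscribers only when something changed."""

    def __init__(self, list_all: Callable[[], Snapshot]):
        self._list_all = list_all  # hypothetical bulk-list call on the external API
        self._last: Snapshot = {}
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        self._subscribers.append(callback)

    def poll_once(self) -> List[Event]:
        # Deep-copy so later mutations of the source do not alter our snapshot.
        current = copy.deepcopy(self._list_all())
        events = diff_snapshots(self._last, current)
        self._last = current
        for event in events:
            for callback in self._subscribers:
                callback(event)
        return events
```

Controllers would then subscribe to events instead of each polling on their own, so unchanged resources generate no reconcile work.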

ahmet.dev•45 days ago

To convert data polled from external API to watches, would you bring data from external API back into KRM APIs?

If the assumption is that the CRD will always overwrite the external API, I'm not sure there's value in reconciling sooner (vs. eventually, when the resource.generation++).

karlkfi.bsky.social•45 days ago

Depends on the use case.

In ConfigSync we have multiple layers and users can decide which they want:
1. Periodic Remediation
2. (Reactive) Drift Remediation
3. Drift Prevention

karlkfi.bsky.social•45 days ago

In most cases, drift prevention is actually too draconian, and periodic remediation too slow.

Most users prefer reactive remediation, with the option to pause remediation to deliberately apply drift in case of emergency.

nikovirtala.io•45 days ago

how about using a resource policy that prevents out of band configuration changes?

drebes.org•45 days ago

It’s the old Terraform (you’re the only actor) versus Kubernetes resource model (reconciliation loops/annealing) conflict. In both cases they seem to have taken a conscious approach.

drebes.org•45 days ago

The problem with the first is that it implies a need for much more ordering (where TF benefits from having a resource graph, which has its own issues), as on a single run you’re expected to completely converge to the model.

drebes.org•45 days ago

The problem with the second is that you can’t be sure you can ever converge to the model, since it’s part of the design to rerun, rerun, rerun until spec and status converge.

drebes.org•45 days ago

For simple systems, the single run/actor model tends to do better IMO; for complex ones, the latter. But then the operator that reconciles becomes its own beast.

drebes.org•45 days ago

In a way, we’re back to the old configuration management (Ansible versus Puppet) arguments. :)

lalithsuresh.bsky.social•45 days ago

I'd expect that they're periodically querying the S3 API in this case. At a high level, operators are supposed to continuously reconcile the current and desired states (which implies watching out for drift). Unfortunately, most external systems lack APIs to listen for changes, hence the polling.

ahmet.dev•45 days ago

If your controller is managing 20,000 buckets, you'll make 20k external API calls during controller startup full sync to verify buckets exist and their config is up-to-date (99.99% are) –during which you won't get to processing a new object soon enough, bc the workqueue is still full.

lalithsuresh.bsky.social•45 days ago

This sounds like foreground and background tasks are sharing the same queue. You could consider two task queues internally, with weighted fair sharing or priorities in how they are drained. This way, the 20K external API calls during startup don't starve the more important new object processing.
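
A rough sketch of the two-queue idea, using a simple weighted round-robin. This is not how any particular controller library implements its workqueue; `TwoTierQueue` and `foreground_weight` are hypothetical names chosen for illustration.

```python
from collections import deque


class TwoTierQueue:
    """Foreground (new/changed objects) and background (resync) work,
    drained so that background items cannot starve foreground ones."""

    def __init__(self, foreground_weight=4):
        self.foreground = deque()
        self.background = deque()
        self.weight = foreground_weight  # foreground items served per background item
        self._fg_streak = 0

    def add(self, item, background=False):
        (self.background if background else self.foreground).append(item)

    def get(self):
        """Return the next item to process, or None if both queues are empty."""
        fg_turn = bool(self.foreground) and (
            self._fg_streak < self.weight or not self.background
        )
        if fg_turn:
            self._fg_streak += 1
            return self.foreground.popleft()
        if self.background:
            self._fg_streak = 0
            return self.background.popleft()
        return None
```

With a startup resync enqueued as background work, a newly created object added to the foreground queue is picked up on the next `get()` rather than waiting behind the whole resync backlog, while the weight still guarantees the background queue makes progress.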

embano1.mgasch.com•45 days ago

question: is there a reason a single controller needs to do this work vs partitioning for scale and blast radius containment?

ahmet.dev•45 days ago

I'm going off the usual assumption that most people run k8s controllers w/ a single active leader + a lot of external APIs have rate limits. Partitioning controllers is not a trivial task (Tim Ebert has a master's thesis on this with a prototype impl.) and nobody I know does this.

embano1.mgasch.com•44 days ago

Sorry, I didn’t mean partitioning using controller patterns (our good Knative friends also prototyped this), but by AWS accounts (spreading the buckets and running multiple controllers per account), if that’s an option in your design (cell-based architectures have benefits).

thock.in•45 days ago

Same in k8s with load-balancers. If you have 100 services with LBs, and someone modifies one behind the scenes, we don't know to reconcile, because (AFAICT) none of the clouds have watch APIs and they all have API quota. Polling 100 LBs every 10 seconds = 10 QPS of 99.999% do-nothing traffic.
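
The arithmetic behind these numbers, as a small sketch. `polling_qps` and `full_sync_seconds` are illustrative helpers, and the 50 QPS rate limit in the usage note is an assumed figure, not one given in the thread.

```python
def polling_qps(resources: int, interval_seconds: float) -> float:
    """Steady-state request rate when every resource is polled once per interval."""
    return resources / interval_seconds


def full_sync_seconds(resources: int, rate_limit_qps: float) -> float:
    """Lower bound on a full resync when each resource needs one rate-limited call."""
    return resources / rate_limit_qps
```

`polling_qps(100, 10)` gives the 10 QPS quoted above. For the 20,000-bucket case mentioned elsewhere in the thread, `full_sync_seconds(20_000, 50)` comes to 400 seconds under the assumed 50 QPS limit, which is time during which a shared workqueue stays busy with resync work.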

ahmet.dev•45 days ago

I'm in exactly this boat, but there are lots of differing opinions. IMO if you're going behind your own back making out-of-band edits, don't expect the controller (which assumes it's the sole actor managing this resource and will overwrite any changes) to have the responsibility of promptly fixing the resource.

thock.in•44 days ago

OTOH, maybe Kubernetes will prompt cloud providers to support some form of WATCH operations, just to alleviate the load of millions of kube clusters polling?

I'm an optimist. :)
