There's an army of developers working on Linux as well, employed by companies like IBM and Oracle. To be honest, I don't see a huge difference from Microsoft here.

This might all be true, but has it actually resulted in better software for end users? More stability, faster delivery of useful features? That is my concern.

Being able to create a portable artifact containing only the userspace components, one that can be shipped and run anywhere with minimal fuss, is something that didn't really exist before containers.

Tried that. The devs revolted and said the whole point of containers was to escape the tyranny of ops. Management sided with them, so it's the wild west there.

On top of that, either the OCI spec is broken or it's just AWS being nuts, but unlike GitLab and Nexus, AWS ECR doesn't support automatically creating folders (e.g. ".dkr.ecr..amazonaws.com/foo/bar/baz:tag"); it can only do flat storage, which leaves you with either seriously long image names or long tags.
Yes, you can theoretically create a repository object in ECR via Terraform to mimic that behavior, but that's painful in pipelines where the resulting image path is dynamic: you need to grant the CI pipeline's IAM role more privileges than I'm comfortable with, not to mention that I don't like having any AWS resources managed outside of the central Terraform repository. [1] https://stackoverflow.com/questions/64232268/storing-images-...
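To illustrate the pre-creation workaround (my sketch, not from the thread): with the AWS SDK for Go v2 you can create a repository whose name contains slashes before the push happens. The repository name below is a placeholder, and a real pipeline would tolerate the repository already existing instead of failing:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ecr"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ecr.NewFromConfig(cfg)

	// ECR accepts slashes in repository names, but the repository must
	// exist before a push -- there is no auto-creation on first push.
	_, err = client.CreateRepository(ctx, &ecr.CreateRepositoryInput{
		RepositoryName: aws.String("foo/bar/baz"), // hypothetical name
	})
	if err != nil {
		// A real pipeline would ignore RepositoryAlreadyExistsException here.
		log.Fatal(err)
	}
}
```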
IIRC it's not in the spec because administration of resources is out of scope. For example, perhaps you offer a public registry and you want folks to sign up for an account before they can push? Or you want an approval process before new repositories are created?
Regardless, it's a huge pain that ECR doesn't support this. Everybody I know of who has used ECR has run into it. There's a long-standing issue open which I've been subscribed to for years now: https://github.com/aws/containers-roadmap/issues/853

Looks cool. Thanks for linking it.
It does mention that it's limited to 500 MB per layer. For some use cases that limitation might not be a big deal, but for others it's a dealbreaker.

Source: I have implemented an OCI-compliant registry [1], though for the most part I've been following the behavior of the reference implementation [2] rather than the spec, on account of the latter's convolutedness.
When the client finalizes a blob upload, it needs to supply the digest of the full blob. This requirement evidently serves to let the server validate the integrity of the supplied bytes. If the server only started checking the digest as part of the finalize HTTP request, it would have to read back all the blob contents that had already been written into storage by previous HTTP requests. For large layers, this can introduce an unreasonable delay. (Because of specific client requirements, I have verified my implementation to work with blobs as large as 150 GiB.)
Instead, my implementation runs the digest computation throughout the entire sequence of requests. As blob data is taken in chunk by chunk, it is simultaneously streamed into the digest computation and into blob storage. Between requests, the state of the digest computation is serialized into the upload URL that is passed back to the client in the Location header. This is roughly where it happens in my code: https://github.com/sapcc/keppel/blob/7e43d1f6e77ca72f0020645... I believe the reference implementation uses the same approach. Because digest computation can only work sequentially, the upload has to proceed sequentially as well.
[1] https://github.com/sapcc/keppel [2] https://github.com/distribution/distribution
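As a rough sketch of that pattern (mine, not the actual keppel code): in Go, the crypto/sha256 hash state implements encoding.BinaryMarshaler, so the running digest can be checkpointed between chunk requests and resumed on the next one. The function names and the base64 encoding of the state are assumptions for illustration:

```go
package blobupload

import (
	"crypto/sha256"
	"encoding"
	"encoding/base64"
	"hash"
	"io"
)

// resumeDigest restores a running SHA-256 computation from the opaque
// state that was embedded in the upload URL of the previous request.
func resumeDigest(encodedState string) (hash.Hash, error) {
	h := sha256.New()
	if encodedState == "" {
		return h, nil // first chunk: start fresh
	}
	state, err := base64.RawURLEncoding.DecodeString(encodedState)
	if err != nil {
		return nil, err
	}
	if err := h.(encoding.BinaryUnmarshaler).UnmarshalBinary(state); err != nil {
		return nil, err
	}
	return h, nil
}

// writeChunk streams one chunk into blob storage and into the digest
// computation at the same time, then returns the serialized digest state
// to embed into the upload URL handed back to the client.
func writeChunk(chunk io.Reader, storage io.Writer, encodedState string) (string, error) {
	h, err := resumeDigest(encodedState)
	if err != nil {
		return "", err
	}
	// Every byte goes to both the storage backend and the hash, so nothing
	// has to be read back from storage when the upload is finalized.
	if _, err := io.Copy(io.MultiWriter(storage, h), chunk); err != nil {
		return "", err
	}
	state, err := h.(encoding.BinaryMarshaler).MarshalBinary()
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(state), nil
}
```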
Layers are fully independent of each other in the OCI spec (which makes them reusable). They are wired together by a separate manifest file that lists the layers of a specific image.
It's a mystery... Here are the bits of the OCI spec about multipart pushes (https://github.com/opencontainers/distribution-spec/blob/58d...). In short, you can only upload the next chunk after the previous one finishes, because you need to use information from the previous response's headers.
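To make the sequencing concrete, here is a rough client-side sketch of a chunked push over the /v2/ API (my own illustration; auth, retries, and proper merging of query parameters are omitted). Each PATCH has to finish so its Location header can be used as the target for the next chunk, and the final PUT carries the digest of the whole blob:

```go
package chunkedpush

import (
	"bytes"
	"fmt"
	"net/http"
)

// pushChunks uploads one blob to a registry chunk by chunk. uploadURL is
// the Location returned by the initial POST /v2/<name>/blobs/uploads/
// request; digest is the digest of the complete blob, e.g. "sha256:abc...".
func pushChunks(client *http.Client, uploadURL, digest string, chunks [][]byte) error {
	offset := 0
	for _, chunk := range chunks {
		req, err := http.NewRequest(http.MethodPatch, uploadURL, bytes.NewReader(chunk))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/octet-stream")
		// The spec's chunk range format is "<start>-<end>" (inclusive).
		req.Header.Set("Content-Range", fmt.Sprintf("%d-%d", offset, offset+len(chunk)-1))

		resp, err := client.Do(req)
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusAccepted {
			return fmt.Errorf("unexpected status %d", resp.StatusCode)
		}
		// The next chunk must go to the Location returned by this response,
		// which is why chunks of one blob cannot be uploaded in parallel.
		if loc := resp.Header.Get("Location"); loc != "" {
			uploadURL = loc
		}
		offset += len(chunk)
	}

	// Finalize with a PUT that carries the digest of the whole blob.
	// (A real client would merge the digest parameter with any query
	// string already present in uploadURL.)
	req, err := http.NewRequest(http.MethodPut, uploadURL+"?digest="+digest, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return fmt.Errorf("unexpected status %d on finalize", resp.StatusCode)
	}
	return nil
}
```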
Thanks, that helps a lot, and I didn't know about it :) It's a touch less powerful than full transactions (because AFAICT you can't, say, merge a COPY and a RUN together), but it's a big improvement.

That's a pretty cool use case!
Personally, I just use Nexus because it works well enough (and supports everything from OCI images to apt packages, plus custom Maven, NuGet, and npm repositories, etc.), though both the configuration and the resource usage are a bit annoying, especially when it comes to cleanup policies: https://www.sonatype.com/products/sonatype-nexus-repository
That said:

> More specifically, I logged the requests issued by docker pull and saw that they are "just" a bunch of HEAD and GET requests.

This is immensely nice, and I wish more tech out there made common-sense decisions like this: just use what has worked for a long time and don't overcomplicate. I'm a bit surprised that there aren't more simple container registries out there (especially with auth and cleanup support), since Nexus and Harbor are both a bit complex in practice.
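For reference, the pull path really is that plain. A toy sketch against the standard /v2/ endpoints (registry host and image name are placeholders; auth and manifest parsing are omitted):

```go
package pullsketch

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// pull fetches a manifest by tag. In the distribution API this is a plain
// GET (a HEAD to the same URL just checks existence), and each layer is
// then another GET at /v2/<name>/blobs/<digest>.
func pull(registry, name, tag string) error {
	manifestURL := fmt.Sprintf("https://%s/v2/%s/manifests/%s", registry, name, tag)
	req, err := http.NewRequest(http.MethodGet, manifestURL, nil)
	if err != nil {
		return err
	}
	// Ask for an OCI image manifest rather than a legacy format.
	req.Header.Set("Accept", "application/vnd.oci.image.manifest.v1+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// A real client would parse the manifest JSON and then GET each layer at
	// https://<registry>/v2/<name>/blobs/sha256:<layer digest>.
	_, err = io.Copy(os.Stdout, resp.Body)
	return err
}
```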
Private (cloud) registries are very useful when a project has mandatory AuthN/AuthZ requirements around its Docker images. You can terraform/bicep/pulumi everything per environment.

To be clear, the 8x figure was comparing the slowest ECR throughput measurement against the fastest S3 one. In any case, the improvement is significant.

That's true, but I'd assume the server would want to double-check that the hashes are valid (for robustness/consistency)... That's something my little experiment doesn't do, obviously.

That's true, unfortunately. I'm thinking about ways to somehow support private repos without introducing a proxy in between... Not sure if it will be possible.

This is such a wonderful idea, congrats. There is a real use case for this in some high-security sectors. I can't put complete info here for security reasons, but let me know if you are interested.

I didn't expect that! It's a pity they don't expose an API for parallel uploads, for those of us who need to maximize throughput and don't mind using something non-standard.

Make sure you use HTTPS, or someone could theoretically inject malicious code into your container. If you want to use your own domain, though, you'll have to put CloudFront in front of S3.

The source code is proprietary, but fortunately it shouldn't take much work to replicate (you just need to upload files at the right paths).
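To give an idea of what "the right paths" likely means (my guess, not the author's code): the read side of the registry API only needs objects served at /v2/<name>/manifests/<tag> and /v2/<name>/blobs/<digest>, so uploading with the AWS SDK for Go v2 could look roughly like this (bucket, image name, and digest are placeholders):

```go
package s3registry

import (
	"bytes"
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// putManifest writes a manifest JSON so that a plain
// GET <bucket endpoint>/v2/<name>/manifests/<tag> can serve it.
func putManifest(ctx context.Context, bucket, name, tag string, manifest []byte) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := s3.NewFromConfig(cfg)
	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String("v2/" + name + "/manifests/" + tag),
		Body:   bytes.NewReader(manifest),
		// The media type a pulling client expects for an OCI image manifest.
		ContentType: aws.String("application/vnd.oci.image.manifest.v1+json"),
	})
	return err
}

// putBlob does the same for a layer or config blob, keyed by its digest
// (e.g. "sha256:abc..."), at /v2/<name>/blobs/<digest>.
func putBlob(ctx context.Context, bucket, name, digest string, blob []byte) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := s3.NewFromConfig(cfg)
	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String("v2/" + name + "/blobs/" + digest),
		Body:   bytes.NewReader(blob),
	})
	return err
}
```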
> According to the specification, a layer push must happen sequentially: even if you upload the layer in chunks, each chunk needs to finish uploading before you can move on to the next one.
As far as I've tested with DockerHub and GHCR, chunked upload is broken anyway, and clients upload each blob/layer as a whole. The spec also promotes `Content-Range` value formats that do not match the RFC 7233 format.
(That said, there is parallelism at the level of blobs, just not within a single blob.)
Another gripe of mine is that they missed the opportunity to standardize pagination for listing tags, because they accidentally deleted some text from the standard [1]. Now different registries roll their own.
[1] https://github.com/opencontainers/distribution-spec/issues/4...