Original link: https://news.ycombinator.com/item?id=43353223

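A Hacker News thread discusses Git's new bundle-URI feature and strategies for speeding up clones of large repositories (multi-GB, on the order of a million commits). Several users share their experience with large repositories and the problems they run into, including slow clones and throttling by services such as GitHub. Solutions mentioned include:

* Storing a "seed" repository in S3 to speed up the initial clone.
* Using EBS snapshots (which, surprisingly, are not always faster).
* The Linux kernel's bundle files served from a CDN.
* Shallow clones (`git clone --depth 1`), although their performance impact on services such as GitHub is disputed.
* Tarballs of a snapshot.

The discussion also covers the inefficiency of cloning a project's entire history when only a single snapshot is needed. Other topics include the alternative version control system Mercurial and ways of storing large binary files in Git.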

Original text

Going down the rabbit hole of Git's new bundle-URI (gitbutler.com)
120 points by chmaynard 5 hours ago | 35 comments

jakub_g 2 hours ago

This is super interesting, as I maintain a 1M commits / 10GB size repo at work, and I'm researching ways to have it cloned by the users faster. Basically for now I do a very similar thing manually, storing a "seed" repo in S3 and having a custom script to fetch from S3 instead of doing `git clone`. (It's faster than cloning from GitHub, as apart from not having to enumerate millions of objects, S3 doesn't throttle the download, while GH seems to throttle at 16MiB/s.)

Semi-related: I always wondered but never got time to dig into what exactly the contents of the exchange between server and client are; I sometimes notice that when creating a new branch off main (still talking about the 1M-commit repo), with just one new tiny commit, the amount of data the client sends is way bigger than I expected (tens of MBs). I always assumed the client somehow established with the server that it has a certain sha, and only uploads the missing commit, but it seems that's not exactly the case when creating a new branch.
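One way to look at that exchange is Git's `GIT_TRACE_PACKET` environment variable, which logs each protocol packet to stderr. The sketch below is only an illustration under assumptions: it pushes to a throwaway local bare repository over `file://` instead of a real server, and the repository, branch, and file names are made up.

```shell
#!/bin/sh
# Sketch: trace the packets exchanged during a push of a new branch.
# Everything lives in a temporary directory; all names are hypothetical.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Stand-in for the hosting service.
git init -q --bare "$work/server.git"

# Client repository with one commit already pushed to main.
git init -q "$work/client"
cd "$work/client"
git symbolic-ref HEAD refs/heads/main
echo one > file.txt
git add file.txt
git commit -q -m "initial commit"
git remote add origin "file://$work/server.git"
git push -q origin main

# New branch off main with one tiny commit, as in the comment above.
git checkout -q -b topic
echo two >> file.txt
git commit -q -am "tiny change"

# GIT_TRACE_PACKET=1 sends the packet log to stderr; capture it in a file.
GIT_TRACE_PACKET=1 git push -q origin topic 2> "$work/trace.log"

# Show the ref advertisement and the ref update command that were exchanged.
grep 'refs/heads/' "$work/trace.log" | head -n 20
```

Against a real remote, the same variable can be set on an ordinary `git push` or `git fetch`. The log shows which refs the server advertised, which is the information the client has when deciding what to send.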

maccard 2 hours ago

Funny you say this. At my last job I managed a 1.5TB Perforce depot with hundreds of thousands of files and had the problem of “how can we speed up CI”. We were on AWS, so I synced the repo, created an EBS snapshot and used that to make a volume, with the intention of reusing it (as we could shove build intermediates in there too).

It was faster to just sync the workspace over the internet than it was to create the volume from the snapshot, and a clean build was quicker from the just sync’ed workspace than the snapshotted one, presumably to do with however EBS volumes work internally.

We just moved our build machines to the same VPC as the server and our download speeds were no longer an issue.

dijit 2 hours ago

I used to use FUSE and overlayfs for this. I’m not sure it still works well, as I’m not a build engineer and I did it for myself.

It's a lot faster in my case (little over 3TiB for latest revision only).

jclarkcom 2 hours ago

VMware?

captn3m0 2 hours ago

The Linux kernel does the same thing, and publishes bundle files over a CDN[0] for CI systems using a script called linux-bundle-clone[1].

[0]: https://www.kernel.org/best-way-to-do-linux-clones-for-your-...

[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/mricon/k...
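The seed-repo and CDN-bundle approaches above share the same shape: download one pre-packed file, clone from it, then fetch only the difference from the real remote. The following is a minimal sketch of that shape, not the kernel's or anyone's actual script. It assumes a local stand-in for the upstream repository and a local file in place of the S3 or CDN download; all paths and names are hypothetical. The `--bundle-uri` option to `git clone` that the linked article covers is meant to automate roughly these steps.

```shell
#!/bin/sh
# Sketch: seed a clone from a bundle file, then top it up from the real remote.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Stand-in for the real upstream repository.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/main
echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two >> file.txt
git commit -q -am "second"

# Publisher side: pack all refs and their history into a single file.
# In practice this file would be uploaded to S3 or a CDN.
git bundle create -q "$work/seed.bundle" --all
git bundle verify -q "$work/seed.bundle"

# Upstream moves on after the seed was published.
echo three >> file.txt
git commit -q -am "third"

# Consumer side: clone from the downloaded bundle (a plain file read).
git clone -q -b main "$work/seed.bundle" "$work/clone"
cd "$work/clone"

# Point origin at the real remote and fetch only what the seed lacks.
git remote set-url origin "file://$work/upstream"
git fetch -q origin
git merge -q --ff-only origin/main

git log --oneline
```

The bundle can be regenerated on a schedule; a stale seed only means the final `git fetch` transfers a little more.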

djfivyvusn 1 hour ago

Have you tried downloading the .zip archive of the repo? Or does that run into similar throttling?

autarch 3 hours ago

> This has resulted in a contender for the world's smallest open source patch:

Hah, got you beat: https://github.com/eki3z/mise.el/pull/12/files

It's one ASCII character, so a one-byte patch. I don't think you can get smaller than that.

yangman 2 hours ago

There is a cursor rendering fix in xf86-video-radeonhd (or perhaps -radeon) that flips a single bit.

It took the group several years to narrow in on.

ZeWaka 2 hours ago

That's a line modification, so presumably you'd count just an insertion or just a deletion as 'smaller'.

autarch 2 hours ago

Yes, but so is the PR shown in the article. You're not going to get a diff that's less than one line unless you are using something besides the typical diff and patch tools.

san1t1 2 hours ago

My smallest PR was adding a missing executable file permission.

autarch 2 hours ago

I think that wins, since presumably it was smaller than a one-byte change. I guess the smallest would be a single-bit file mode change, maybe?

timdorr 2 hours ago

Sure you can: https://github.com/timdorr/-/commit/9e5a571abd3fc4f8714e8c40...

falcor84 2 hours ago

What's the story behind that? Did you just deploy a blank commit to trigger a hook?

nine_k 2 hours ago

Only accepted and merged commits count!

ks2048 1 hour ago

How much bandwidth and time is wasted cloning the entire history of large projects when people only need a single snapshot of a single branch?

According to SO, newer versions of Git can do this:

  git init
  git remote add origin <url>
  git fetch --depth 1 origin <commit-or-branch>
  git checkout FETCH_HEAD
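A runnable version of those four steps, as a sketch: it assumes a local three-commit repository served over `file://` as a stand-in for the real remote, and fetches a branch name rather than a commit ID (fetching an arbitrary commit ID depends on server settings). All names are made up for illustration.

```shell
#!/bin/sh
# Sketch: fetch only the tip snapshot of one branch into an empty repository.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Stand-in upstream with three commits on main.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/main
for n in 1 2 3; do
  echo "$n" >> file.txt
  git add file.txt
  git commit -q -m "commit $n"
done

# The four steps from the comment, pointed at the stand-in.
git init -q "$work/snapshot"
cd "$work/snapshot"
git remote add origin "file://$work/upstream"
git fetch -q --depth 1 origin main
git checkout -q FETCH_HEAD

# Only one commit is present locally, but the working tree is the full tip.
git rev-list --count HEAD   # prints 1
cat file.txt
```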

jes5199 1 hour ago

I have a vague recollection that GitHub is optimized for whole-repo cloning and that they were asking projects not to do shallow fetching automatically, for performance reasons.

nyanpasu64 38 minutes ago

As I understand it, this issue affected Homebrew and CocoaPods: https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...

> Apparently, most of the initial clones are shallow, meaning that not the whole history is fetched, but just the top commit. But then subsequent fetches don't use the --depth=1 option. Ironically, this practice can be much more expensive than full fetches/clones, especially over the long term. It is usually preferable to pay the price of a full clone once, then incrementally fetch into the repository, because then Git is better able to negotiate the minimum set of changes that have to be transferred to bring the clone up to date.
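For a long-lived checkout that started out shallow, the quoted advice amounts to converting it to a full clone once and fetching incrementally afterwards. A minimal sketch of that conversion follows, assuming a local `file://` stand-in for the remote and made-up names; `git fetch --unshallow` and `git rev-parse --is-shallow-repository` are standard Git commands.

```shell
#!/bin/sh
# Sketch: turn a shallow clone into a full one so later fetches are incremental.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Stand-in upstream with three commits.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/main
for n in 1 2 3; do
  echo "$n" >> file.txt
  git add file.txt
  git commit -q -m "commit $n"
done

# Initial shallow clone: only the tip commit is transferred.
git clone -q --depth 1 "file://$work/upstream" "$work/checkout"
cd "$work/checkout"
git rev-parse --is-shallow-repository   # prints true
git rev-list --count HEAD               # prints 1

# Pay for the full history once.
git fetch -q --unshallow
git rev-parse --is-shallow-repository   # prints false
git rev-list --count HEAD               # prints 3
```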

sureIy 38 minutes ago

I don't know if that applies anymore or if it doesn't apply on GitHub Actions, but shallow clones are the default there. See `actions/checkout`.

acheong08 1 hour ago

`git clone --depth 1` works as well. If you're just cloning to build and not contributing, it makes much more sense.

mikepurvis 1 hour ago

GitHub can also just serve you a tarball of a snapshot, which is faster and smaller than a shallow clone (and therefore it's the preferred option for a lot of source package managers, like Nix, Homebrew, etc.).

It’s frustrating that tarball URLs are a proprietary thing and not something that was ever standardized in the Git protocol.
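Git itself can produce the same kind of snapshot tarball from a repository you already have, via `git archive`. The sketch below shows only that local equivalent, under the assumption of a throwaway repository with made-up names; it does not reproduce GitHub's tarball URLs, which are not given in the thread.

```shell
#!/bin/sh
# Sketch: export a history-free snapshot of one commit as a tar stream.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Throwaway repository with two commits.
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/main
echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two >> file.txt
git commit -q -am "second"

# Write the tree of HEAD as a tar stream and unpack it elsewhere.
mkdir -p "$work/out"
git archive --format=tar --prefix=snapshot/ HEAD | tar -xf - -C "$work/out"

# The result is plain files: the tip content, with no .git directory.
ls -A "$work/out/snapshot"
```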

bobbylarrybobby 59 minutes ago

I believe there is a bit of a footgun here because if you don't `git clone` then you don't fetch all branches, just the default. Can be very confusing and annoying if you know a branch exists on the remote but don't have it locally (the first time you hit it, at least).
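A related case is `git clone --depth`, which implies `--single-branch`, so other remote branches are not tracked at all. One way out, sketched here with a local `file://` stand-in and made-up branch names, is to widen the tracked branches with `git remote set-branches` and fetch again.

```shell
#!/bin/sh
# Sketch: a shallow clone tracks one branch; widen it to see the others.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Stand-in upstream with two branches: main and feature.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/main
echo one > file.txt
git add file.txt
git commit -q -m "first"
git checkout -q -b feature
echo f > feature.txt
git add feature.txt
git commit -q -m "feature work"
git checkout -q main

# Shallow clone: only the default branch is tracked.
git clone -q --depth 1 "file://$work/upstream" "$work/clone"
cd "$work/clone"
before=$(git branch -r | grep -c feature || true)
echo "remote branches matching 'feature' before: $before"

# Track every remote branch, then fetch their tips (still shallow).
git remote set-branches origin '*'
git fetch -q --depth 1 origin
git branch -r
```

Passing `--no-single-branch` to the initial `git clone --depth 1` avoids the problem up front.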

geenat 1 hour ago

Git needs built-in handling of large binary files without a ton of hassle, it's all I ask. It'd make Git universally applicable to all projects.

Mercurial has had it for ages.

SVN has had it for ages.

Perforce has had it for ages.

Just keep the latest binary, or the last x versions. Let us purge the rest easily.

GrantMoyer 15 minutes ago

Getting better: https://git-scm.com/docs/partial-clone
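As a rough illustration of the partial clone feature linked above: `git clone --filter=blob:none` transfers commits and trees but leaves file contents to be fetched on demand. The sketch assumes a local `file://` stand-in for the server with `uploadpack.allowFilter` enabled (hosted services decide this on their side), and the file name is made up.

```shell
#!/bin/sh
# Sketch: blobless partial clone; file contents are left out of the transfer.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Stand-in server repository holding two versions of a "large" file.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/main
printf 'version 1 of a big asset\n' > asset.bin
git add asset.bin
git commit -q -m "add asset"
printf 'version 2 of a big asset\n' > asset.bin
git commit -q -am "update asset"

# The serving side must allow object filtering for partial clones to work.
git config uploadpack.allowFilter true

# Clone history without any blobs; skip checkout so nothing is fetched yet.
git clone -q --filter=blob:none --no-checkout "file://$work/upstream" "$work/partial"
cd "$work/partial"

# The filter is recorded on the remote so later fetches reuse it.
git config remote.origin.partialclonefilter

# Count objects that are referenced by history but not present locally.
missing=$(git rev-list --objects --missing=print HEAD | grep -c '^?' || true)
echo "missing objects: $missing"
```

In a clone like this, a later checkout fetches only the blobs needed for that revision, which is close to the "just keep the latest binary" behaviour asked for above.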

mbac32768 1 hour ago

One consequence of git clone is that if you have mega repos, it kind of ejects everything else from your cache for no win.

You'd actually rather special case full clones and instruct the storage layer to avoid adding to the cache for the clone. But this isn't always possible to do.

Git bundles seem like a good way to improve the performance of other requests, since they punt off to a CDN and protect the cache.

jedimastert 3 hours ago

This actually might solve a massive CI problem we've been having...will report back tomorrow

jwpapi 3 hours ago

!remind me

andrewshadura 2 hours ago

Interestingly, Mercurial solved bundles more than ten years ago, and back then they already worked better than Git's do today.

capitainenemo 2 hours ago

Not the only Mercurial feature where that's the case... Sad. I keep rooting for the project to implement a Mercurial frontend over a Git DB, but they seem to be limited by missing Git features.

kps 1 hour ago

Jujutsu (jj) is heavily inspired by Mercurial (though with some significant differences) and can operate with git as a storage backend. https://github.com/jj-vcs/jj

nine_k 2 hours ago

But branches were more problematic.

capitainenemo 2 hours ago

Mercurial has had Git-like "lightweight branches"/bookmarks, without the revision record of Mercurial named branches, for over 15 years. There are good reasons to use the traditional branches though.

https://mercurial.aragost.com/kick-start/en/bookmarks/

DrinkyBird 1 hour ago

The topics[0] feature in the evolution extension is probably even closer to Git branches, since they are completely mutable and needn't be a permanent part of your repo. Bookmarks are just pointers to changesets, and although that's technically how Git branches work, it's not how they work in practice in Mercurial because of its focus on immutability (and because hg and git work differently).

[0]: https://www.mercurial-scm.org/doc/evolution/tutorials/topic-...

dgfitz 1 hour ago

Someone once put together an LLM-backed list of things people on HN post about a lot; mine was about this “other” DVCS.

It is superior, and it’s not even much of a comparison.

Already__Taken 1 hour ago

I used Mercurial in anger for about 9 months or something, with a GitLab fork too. When Git goes wrong, there are forums, blogs, books and manuals. When hg does, it's a Python stack trace; good luck.
