(comments)

Original link: https://news.ycombinator.com/item?id=38171322

Based on the provided material, my observations are as follows:

1. The discussion focuses on whether Bluesky's PDS infrastructure can scale effectively for its rapidly growing user base, and whether its decision to build on existing open-source work will have any significant impact on its business strategy.

2. Much of the discussion revolves around sharing and distributing the limited invites to join Bluesky. Some users received invites through referrals from contacts or by replying directly to posts containing invite links; however, a significant portion reported that the available invites were no longer valid.

3. Bluesky's user growth rate is compared with Mastodon's, with the latter appearing slower. People also expressed interest in Bluesky's potential impact relative to Mastodon.

4. Notably, the tech community tends to focus on smaller, special-purpose social networks rather than mainstream ones.

5. Comments suggest that Bluesky's lack of user-interaction counts, compared with Mastodon, may affect engagement. Some also expressed skepticism about Bluesky's lack of "mob culture" and "addictiveness," and about the long-term viability of an ad-revenue model.

Overall, the discussion centers on scalability, popularity, and the potential limitations of adopting existing open-source projects and business strategies on the new platform, along with shared influences within niche communities and concerns about potential consequences.

Related articles

Original text
Bluesky migrates to single-tenant SQLite (github.com/bluesky-social)
340 points by HillRat 1 day ago | hide | past | favorite | 231 comments


Love SQLite, but in general there are many challenges with any kind of schema-per-tenant or database-per-tenant setup. Consider the luxury of a shared instance with row-level security, where your migration either works or rolls back. Not so here! If you are doing a data migration and failed to account for some unexpected data, you now have people on different schema versions until you figure it out. Yes, if you are at sharding scale this may happen anyway, but until you hit that point, a single database is easiest.

You may also want to combine the data for some reason in the future, or move ownership of resources atomically.

I'm not opposed to this setup at all and it does have its place. But we are running away from schema-per-tenant setup at warp speed at work. There are so many issues if you don't invest in it properly and I don't think many are prepared when they initially have the idea.

The funny thing is that about a decade ago, the app was born on a SQLite per tenant setup, then it moved to schema per tenant on Postgres, now it's finally moving to a single schema with RLS. So, the exact opposite progression.



I don't know; I have experience working with monster DBs in production, and never again. Under a large enough load, every change becomes risky because you can't fully test performance corner cases. Having a free-tier user take out your prod because they found a non-indexed code path is also a classic.


> If you are doing a data migration and failed to account for some unexpected data, now you have people on different schema versions until you figure it out.

That shouldn't be a big issue. Any service large/complex enough to care does the schema upgrades in phases, so it's 1. Make code future compatible. 2. Migrate data. 3. Remove old schema support.

So typically it should be safe to run between steps 1 and 2 for a long time. (Modulo new bugs of course) As an ops-y person I'm comfortable with the system running mid-migration as long as the steps as described are used.



> That shouldn't be a big issue. Any service large/complex enough to care does the schema upgrades in phases, so it's 1. Make code future compatible. 2. Migrate data. 3. Remove old schema support.

Exactly this: schema migrations should be an append, deprecate, drop operation over time.
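One way to picture the append → deprecate → drop flow, with a made-up `users` table and column names (SQLite here for brevity, but the idea applies to any engine):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("bob",)])

# Step 1 (append): add the new column. Old code ignores it and keeps
# working; new code writes both columns.
con.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Step 2 (migrate/deprecate): backfill existing rows; old readers can
# keep running against `name` in the meantime.
con.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")

# Step 3 (drop): only once all code reads display_name, remove `name`
# (SQLite 3.35+ supports ALTER TABLE users DROP COLUMN name).
rows = con.execute("SELECT display_name FROM users ORDER BY id").fetchall()
print(rows)  # [('ada',), ('bob',)]
```

The point is that the system is valid between every pair of steps, so running mid-migration for a long time is safe.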



I wish there were ways to enforce this on the db so you never accidentally grabbed a table lock during these operations.

I've definitely shot myself in the foot with Postgres on this.



> I wish there were ways to enforce this on the db so you never accidentally grabbed a table lock during these operations.

You can use a linter for PostgreSQL migrations https://squawkhq.com/



And Sqitch is a wonderful Perl tool for this as well.


> now you have people on different schema versions until you figure it out.

That can be a good thing if your product has say

I guess it depends on the business structure.



Totally correct. But not a good thing in our case!


> The funny thing is that about a decade ago, the app was born on a SQLite per tenant setup, then it moved to schema per tenant on Postgres, now it's finally moving to a single schema with RLS.

To be fair, RLS was not available yet a decade ago :) It appeared in PostgreSQL 9.5 in 2016.





> If you are doing a data migration and failed to account for some unexpected data, now you have people on different schema versions until you figure it out. Now, yes, if you are at sharding scale this may occur anyway, but consider that before you hit that point, a single database is easiest.

This can be accounted for and handled. Though if schema issues are enough of a scare, I wonder whether a document-style embeddable database like CouchDB/PouchDB might make more sense.



What do they mean by "Since SQLite does not support concurrent transactions"? It supports them, as long as you don't access the .db file through a file share (UNC, NFS, etc.): https://www.sqlite.org/wal.html

I've been using this to update/read a DB from multiple threads/processes on the same machine. You can also do snapshotting with the SQLite backup API if you want a consistent view without holding a transaction open (or copy in-memory).

But maybe I'm missing something here... Also haven't touched sqlite in years, so not sure...



Nope! I was mistaken; it's really multiple readers, single writer. I was probably assuming things the whole time and did not spend enough time thoroughly checking; granted, most of the DBs I've done with SQLite were more about reading than writing.

So I stand corrected!



Have patience; eventually hctree [1] will become stable and we'll be offered the choice between the traditional backend and the newly implemented one that supports concurrency!

[1] https://sqlite.org/hctree/doc/hctree/doc/hctree/index.html



> Writers merely append new content to the end of the WAL file. Because writers do nothing that would interfere with the actions of readers, writers and readers can run at the same time. However, since there is only one WAL file, there can only be one writer at a time.

I think the OP meant that updates have to run sequentially.
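That single-writer behavior is easy to demonstrate in a small sketch (the file path is a throwaway; `timeout=0` makes a busy connection fail immediately instead of waiting):

```python
import sqlite3, tempfile, os

# isolation_level=None puts the connections in autocommit mode so we
# control transactions explicitly.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
w1 = sqlite3.connect(path, timeout=0, isolation_level=None)
w1.execute("PRAGMA journal_mode=WAL")
w1.execute("CREATE TABLE t (x INTEGER)")

w2 = sqlite3.connect(path, timeout=0, isolation_level=None)
reader = sqlite3.connect(path, timeout=0, isolation_level=None)

w1.execute("BEGIN IMMEDIATE")            # first writer takes the write lock
w1.execute("INSERT INTO t VALUES (1)")

# Under WAL a reader still works while the write is in flight...
during = reader.execute("SELECT COUNT(*) FROM t").fetchone()
print(during)                            # (0,) -- sees the committed snapshot

# ...but a second concurrent writer is rejected.
err = None
try:
    w2.execute("BEGIN IMMEDIATE")
except sqlite3.OperationalError as e:
    err = e
print(err)                               # database is locked

w1.execute("COMMIT")
after = reader.execute("SELECT COUNT(*) FROM t").fetchone()
print(after)                             # (1,)
```

So readers and one writer can overlap, but updates themselves serialize, which is exactly why a database-per-user layout sidesteps the limitation.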



Which just means the lock happens at user scope in this case, instead of per table or row. This limitation still causes so much confusion, even though it's a completely reasonable design.


I've been importing data into SQLite databases that are being actively written to for years. It just throws an exception if the database is locked, and I retry. I do 10k-row batches with a small sleep between them. No issues. It helps if your use case doesn't really care about data being in order, I guess.
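A sketch of that retry loop (the `events` table and the batch/retry/backoff parameters are made up for illustration):

```python
import sqlite3, time

def insert_batches(con, rows, batch_size=10_000, retries=5, sleep_s=0.05):
    """Insert rows in batches, retrying when SQLite reports the db as locked.

    `con` should be opened with isolation_level=None so transactions are
    managed explicitly here.
    """
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        for _ in range(retries):
            try:
                con.execute("BEGIN IMMEDIATE")   # take the write lock up front
                con.executemany("INSERT INTO events (payload) VALUES (?)", batch)
                con.execute("COMMIT")
                break
            except sqlite3.OperationalError:     # e.g. "database is locked"
                if con.in_transaction:
                    con.execute("ROLLBACK")
                time.sleep(sleep_s)              # back off, then retry
        else:
            raise RuntimeError(f"batch at {start} failed after {retries} retries")

con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE events (payload TEXT)")
insert_batches(con, [("e%d" % i,) for i in range(25)], batch_size=10)
print(con.execute("SELECT COUNT(*) FROM events").fetchone())  # (25,)
```

`BEGIN IMMEDIATE` fails fast if another writer holds the lock, so the whole batch either commits or is retried as a unit.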


It works if there is low traffic, but as soon as you get bigger transactions or the amount of concurrent writes becomes heavier, you will at some point (even with WAL enabled) get "database is locked" issues. You can work around that at the application level up to a point, but in general, if you are at that point, you should really consider using another database backend.


They probably mean there's still no row-level locking and, at least last I checked, very limited table-level locking. The writer still grabs a lock on the entire DB, per the docs.


Did you disable auto-checkpointing? Wouldn't checkpointing result in potential corruption, or at least data loss, if two processes did it simultaneously? Or is that scenario exhaustively prevented with a lock file?


Your post made me look more into checkpoints and better understand the tradeoffs around them in SQLite+WAL. For example, less frequent checkpointing means a larger log (WAL file), hence slower reads, since reads have to go through the WAL file (there is an index, but still) to vet the data; in exchange, writing is faster.

Conversely, more frequent checkpoints mean faster reads (no need to go through a bigger WAL file, and a smaller index), but slower writes.

So it really depends on what's happening right now, if you can anticipate it: e.g., when populating data for the first time, maybe decrease the checkpoint frequency, then turn it back up afterwards.

Or if you constantly log and read only rarely (though if you have constraints or triggers, I'm not sure there are no hidden reads).

Or the opposite: a "read-only" version of the sqlite.db, if possible, would ideally have no WAL at all.

So your post helped understand that there is stuff that I don't know and need to look further into it.

Thanks!
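Those knobs are exposed as pragmas; here is a sketch of the bulk-load pattern described above (table and file names are arbitrary):

```python
import sqlite3, tempfile, os

path = os.path.join(tempfile.mkdtemp(), "log.db")
con = sqlite3.connect(path, isolation_level=None)
con.execute("PRAGMA journal_mode=WAL")

# Bulk load: turn auto-checkpointing off so every commit just appends
# to the WAL file (fast writes, but a growing log).
con.execute("PRAGMA wal_autocheckpoint=0")
con.execute("CREATE TABLE log (msg TEXT)")
con.execute("BEGIN")
con.executemany("INSERT INTO log VALUES (?)", [("m%d" % i,) for i in range(500)])
con.execute("COMMIT")

# Then checkpoint manually and truncate the WAL so later reads don't
# have to scan a large log. Returns (busy, wal_pages, checkpointed_pages).
busy, wal_pages, ckpt_pages = con.execute(
    "PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
print(busy)                              # 0 -- checkpoint completed

# Restore the default threshold (~1000 pages) for normal operation.
con.execute("PRAGMA wal_autocheckpoint=1000")
```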



This is the tradeoff of SQLite: it is extremely fast, as long as you mostly have only one user. With WAL you can get multiple readers, but it doesn't scale the same way that e.g. PostgreSQL does.


Always happy to see more server-side SQLite/Litestream adoption; we've also been using it to build our new apps.

SQLite + Litestream is an even better choice for tenant databases, and it's vastly cheaper to replicate/back up to S3/R2 than expensive cloud-managed databases [1] (up to 3900% cheaper vs. SQL Server on Azure).

[1] https://docs.servicestack.net/ormlite/litestream



What does 3900% cheaper mean? I don't get it.


yeah.. a weird way to say 39 times cheaper ;)


100% of, say, 42 is 42. So 100% less than 42 is 0.

3900% cheaper makes no sense.



Interesting... I like the strategy of having each user be 1:1 with a DB. What would be done for data that needs to be aggregated across users though? If I'm subscribed to another user and they post, how does my DB get updated with that new post? Or is this meant just for durable data and not feed data (like profile data, which users are followed / not followed / etc.) and all the interactive stuff happens separately?

I like that "connection pooling" is just limiting the number of open handles in a LRU cache. It's also interesting because instead of having to manage concurrency at the connection level, it handles it at the tenancy level since each DB connection is single-threaded. You could build up per-DB rate limiting on top of this pretty easily to prevent abuse by a given user.

Is there a straightforward way to set up Litestream to handle any arbitrary number of DBs?
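The handle-cache idea above could be sketched like this (names are hypothetical; a real pool would also need locking for multi-threaded use):

```python
import sqlite3
from collections import OrderedDict

class ConnectionPool:
    """Keep at most `capacity` SQLite handles open, evicting the
    least-recently-used one. `path_for` maps a user DID to that
    user's database file."""

    def __init__(self, path_for, capacity=4):
        self.path_for = path_for
        self.capacity = capacity
        self.conns = OrderedDict()          # did -> sqlite3.Connection

    def get(self, did):
        if did in self.conns:
            self.conns.move_to_end(did)     # mark as recently used
            return self.conns[did]
        if len(self.conns) >= self.capacity:
            _, old = self.conns.popitem(last=False)   # evict LRU handle
            old.close()
        con = sqlite3.connect(self.path_for(did))
        self.conns[did] = con
        return con

pool = ConnectionPool(lambda did: ":memory:", capacity=2)
pool.get("did:a")
pool.get("did:b")
pool.get("did:c")                 # evicts did:a's handle
print(list(pool.conns))           # ['did:b', 'did:c']
```

Because each cached connection serves exactly one tenant, concurrency control effectively moves from the connection level to the tenancy level, as the comment notes; per-DB rate limiting could hang off `get`.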





To summarise the relevant details, the "AppView" service is responsible for the sorts of queries that aggregate across users, and that has its own database setup - I think postgres but I'm not 100% sure on that.


You're right, as usual. AppView is on a Postgres cluster with read replicas doing timeline generation (and other things) on-demand. We're in the process of moving it toward a beefy ScyllaDB cluster designed around a fanout-on-write system.

The v1 backend system was optimized for rapid development and served us well. The v2 backend will be somewhat less flexible (no joins!) but is designed for much higher scale.



Does the BGS pull all the tenants' individual SQLite data? Or does the PDS push new posts to the BGS?


The BGS (which is an atproto "relay" service) subscribes to all PDS event streams on the entire network, and aggregates and relays them.

This way it's possible to get all network data from a single place (the BGS) rather than having to connect to every PDS, which is simpler for consumers and dramatically reduces the workload of PDS hosts.

Some details about event streams here, although the APIs are still evolving: https://atproto.com/specs/event-stream



Thank you!


> The BGS handles "big-world" networking. It crawls the network, gathering as much data as it can, and outputs it in one big stream for other services to use. It’s analogous to a firehose provider or a super-powered relay node.

"Big-world" networking by Big Tech-to-be Bluesky with super-powers, I wonder? Is this BGS also going to be federated, or is that the big centralized beating heart of this platform managed exclusively by BS?



There can be multiple BGSes (like there are a few Web search engines) but it's expensive to run so there probably won't be many. Alternative designs are either more expensive or don't have the same features.


At a previous fintech role the company would store customer accounts as encrypted sqlite3 files on blob storage ... this worked out decently well for our access patterns.


How did they lock the file when re-uploading it after edits?


Each was encrypted before and after. The keys were stored securely, retrieved temporarily, and then rotated, I think. It's a process I wasn't 100% privy to, as I was too junior.


On the surface, this looks like the worst combined with the awful. I hope someone will write a good article with some hard numbers to explain the benefits and analyze the assumed flaws, because this could be something really fascinating to learn about.


Can you explain why this looks like the "worst combined with the awful" to you?

To me, on the surface, particularly assuming you are building a distributed system to be run and deployed by many users, some of which are not professional sysadmins (which I believe is likely to be a goal here, and should be), this seems like quite a sane choice. I'd definitely expect a design goal to be avoiding the need to setup/configure/look after any additional database or other servers.



This looks like someone building their own file-based database system, in TypeScript, while still using mature features of database servers. So instead of trusting an optimized, regularly maintained, and battle-tested solution, they build something themselves. This smells ugly, like something that will scale poorly in performance and have security and tooling problems.

Simplification of installation does not seem like a good enough reason to trust your whole backend to this. Installing and maintaining a database server is not that hard today; it is well established and documented, unlike this. But I also don't know enough about this app; maybe this is just one of several options, meant for a specific use case? Using this in a standalone desktop app would make sense, while still offering a mature SQL backend for server installations.



Using SQLite is most certainly not "building their own filebased database-system"

SQLite is just about as mature and well-tested as it gets in the entire world of software: https://www.sqlite.org/testing.html

Each user's data is naturally partitioned at the atproto repository level, so this is the sweet spot for per-user SQLite databases. It would make total sense for a PDS instance to have just a single user on it, and in fact that is likely for many self-hosters. It's also worth noting that the PDS software already had SQLite support, which made this change somewhat easier.

There are legitimate trade-offs to this kind of system, but it comes out way ahead in this case, and it's not as wild as it may seem to those unfamiliar with the power of SQLite.

A major consideration is that we're planning to run at least 100+ instances, which would require operating 100+ high availability (primary+replica) Postgres clusters. This would be a huge amount of operational and financial overhead.

We chose this route because it is better for us as a small team, with relatively limited resources. But it also has the property of being much easier for self-hosters, of which we hope there will be many.



> Using SQLite is most certainly not "building their own filebased database-system"

[..] Each user has their own SQLite file [..]

[..] We also introduce 3 separate SQLite databases for managing service state [..]

This doesn't use SQLite for the database management, but for the individual "document". The database management itself is handled in the application server. You juggle files around and poke wherever it matches; this is basically a classical file-based database management system.

> It would make total sense for a PDS instance to have a single user, and in fact that is likely for many self-hosters.

Sure, if it's just a low-user instance, the performance is not much of a deal. But my impression is that this is also the code Bluesky uses for everything else, from low- to massively high-user instances. And then I want to see how RAM holds up when you have 10k+ user databases open at the same time on one instance.

> There are trade-offs to this kind of a system but it comes out way ahead in this case.

Which is why I want to see some actual numbers and solid explanations going into more detail than the gossip in the comments here.

> A major consideration is that we're planning to run at least 100+ instances, which would require operating 100+ high availability (primary+replica) Postgres clusters.

Are those independent instances, or just 100+ instances of servers from the same company in different locations? I don't see how this can replace a whole Postgres cluster without removing significant functionality. I mean, SQLite does not have good replication on its own AFAIK, so since you seem to still use replication, you just replace it with another solution? Which also means you remove the same options for anyone else and force them to use your solution. I don't see how this will be beneficial for self-hosters.



I'm not a professional coder, only side-projects. Never formally taught. I looked at the solution and thought it kinda sounds like something I'd come up with. Like when I didn't know how to use data tables and would hold data in an array of arrays to form the rows and columns. Somewhat clever, "works", but would probably make my professional coder friends vomit if I explained it to them.


If anyone needs a Bluesky invite, there are three in the About section of my profile page here (note: you will need to prepend "bsky-social-" to the code).


Can someone that knows more about bluesky explain what data is stored in sqlite and not? Because i assume it isnt messages etc between users.


I assume that messages between users are stored in those SQLite DBs.

Think email. When you send an email and CC five other people as well then seven people now have the same copy of the email stored on their email servers. That is, there’s no central database that contains a single email that is referenced by others.

This is basically how sharding with relational DBs works as well.

This sort of data denormalization is almost a requirement as applications scale and especially for many-to-many applications that have a high write to read ratio.

Low write to read and you can get away with a single master to many slave relational DB architecture for quite astonishing numbers of requests and data!



It's all your posts and replies as a user. While they currently host the only* PDS themselves, the end goal is for every end user to have their own PDS. Inrupt/SOLID calls this concept a "pod".

*(actually they just onboarded a second production PDS yesterday.. progress!)



By messages, do you mean direct messages (private messages between two parties)? Because Bluesky doesn't have those at the moment. There's only public messages broadcast to the world.

Haven't done any research to determine if there are plans for direct messages.



Why SHA-256 hash the user ID to get a two-character target directory? Wouldn't MD5 be much faster and solve the same problem?
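For context, the scheme being asked about looks roughly like this (the DID values are made up):

```python
import hashlib

def bucket_dir(did: str) -> str:
    """First two hex chars of SHA-256: one of 256 possible shard directories."""
    return hashlib.sha256(did.encode()).hexdigest()[:2]

# The same DID always maps to the same directory, and a uniform hash
# spreads user databases evenly, keeping per-directory file counts low.
print(bucket_dir("did:plc:example123"))          # two hex characters
buckets = {bucket_dir(f"did:plc:user{i}") for i in range(10_000)}
print(len(buckets))                              # at most 256
```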


At a guess: that hash is performed relatively few times, so any performance difference is lost in the noise floor. Never having to answer "why did you use this insecure hash" or eliminating/minimising any possibility of a class of security problem is worth more.


This has nothing to do with security; it's just wasted CPU. I imagine you have to do this every time you make a query, to look up the user's DB?

Security is not a concern here; it's literally just bucketing IDs. Also, this is not needed with modern file systems.



All modern server CPUs have intrinsics for SHA-256; it just doesn't matter CPU-wise.


It looks like for something the size of a UUID on Node 18, on my 8th Gen i7 (main machine broken) MD5 is only 10% faster. I guess I was remembering a time when it was like twice as fast... Neat. :)


At their scale maybe they're worried about collisions?

Or, like me, they're drowning in security tooling from corporate and don't want to have to carve out exceptions for md5 usage in each.



> At their scale maybe they're worried about collisions?

With their scheme, collisions are already guaranteed to happen if they have >256 users.



I guess the parent meant an abusable non-uniform distribution of collisions (they have collisions anyway, as they take only the first two characters, according to the GP comment).


It could be they didn't want to explain the md5 usage, yeah. But that's kinda nuts if they do this every query.


It's probably not healthy to have broken cryptographic hashes running around. If you don't need a secure hash there are plenty of fast non-cryptographic hashes.


There's nothing about security here. By this logic you should probably stop using hashmaps, then? :)


That's literally not their logic.

They said:

if you need security don't use md5.

If you don't need security, use something faster than md5.

md5 is neither secure nor fast, why use it at all?



That's dumb. Security is not a component here; they literally just want to put the files into buckets because of filesystem limits. Also, MD5 is much faster in my experience and on any benchmark I can find.


This is probably not about collisions but about filesystem limitations (max number of files in a directory).


I've done something similar and that's absolutely what it was. I'm no pro, knew I wasn't doing it the right way, but it was for a personal side project and Windows starts to get weird when you have a million files in a single directory.


Having a good hash uniformly distribute content helps scaling (by sharding the data).


This should make leaving the service rather simple. Download your sqlite file and throw up a simple local-only html front end to the data and you're solid.


Is Bluesky still invite only?


It is, but not as a "growth hack" or anything. It's just a way of limiting growth while the system is scaled (in terms of the backend and abuse prevention).

There's a dedicated waitlist for developers that will get you access quite quickly: https://atproto.com/blog/call-for-developers



It's pretty hard not to see it as a growth hack, given that posts can't even be viewed without an account. That seems pretty transparently to be a system for creating a feeling of FOMO/exclusivity: you don't just need an account to participate, you need an account even to see what the network is, or to follow anyone on it at all.

As a comparison, Cohost limited account setup when it launched as a way to limit growth. But it didn't lock viewing the entire site behind an account requirement because... come on. What does that have to do with scaling, we all know why that restriction is there :)

To be fair, it seems to be working. Needing to seek out and find invite codes means that signups are more visible -- signup codes get shared over social media and that means mentioning Bluesky publicly and keeping it in people's minds. It also forces people to ask publicly about access, which makes the network feel more exclusive and turns every signup or expression of interest into an advertisement for the network. It's a good marketing strategy, and I suspect that a nontrivial portion of Bluesky's current buzz comes from that marketing strategy, so I can understand why it hasn't been abandoned yet. I mean, look at the current thread; if people didn't need to coordinate publicly on HN to get access then this subthread wouldn't exist and then there wouldn't be a public thread where a bunch of people express interest in trying out the network -- and that publicly expressed interest in this very subthread makes Bluesky feel more in-demand.

In fact, this is such an effective marketing strategy that I've seen Bluesky users complain that invite codes are too common now and that their invite codes aren't in as much demand as they used to be. That FOMO loop is so powerful that it's even affecting the people who already have access to the network who enjoyed the feeling of being in control of an artificially scarce resource.

But sure, all of this is definitely not a growth hack, I believe you ;)

Regardless of whether it's good marketing, the account requirements make the platform a lot less relevant in any serious discussions about the direction of social media, because despite its plans for the future for federation and access, what Bluesky is today is a platform that is in practice even more locked down than Twitter is.



I asked the Bluesky devs about this back in May (of 2023).

Me: "if the network is intended to be public, why are user profiles and posts currently hidden behind a login wall?"

Paul Frazee: "it was a kind of bad artifact of how we set things up initially (just trying to ship). once we realized it communicated the wrong idea it was too late, and we now need to spend a heavy bit of effort communicating before we spring it on everybody."



Oof. That is not a fantastic answer for them to give. I'm not even 100% sure what that means.

> Once we realized it communicated the wrong idea it was too late

For what? Too late to change the technical side of things? Is there a major technical barrier to having a public interface that matches the public firehose APIs? Because I can't figure out what that barrier would be.

What magical deadline or restriction was in place that would have prevented fixing an obvious barrier to the network?

And "once we realized it communicated the wrong idea"? People aren't misinterpreting the message, they're accurately assessing that Bluesky is not an Open network even though it is marketed as one and pretends to be one. This isn't a communication problem, it's not that blocking public access communicates the wrong idea to the public, it communicates correctly that the network isn't Open. It's complete nonsense to try and phrase a failure to fulfill the basic promises of the network as if it's actually just a PR problem.

----

> and we now need to spend a heavy bit of effort communicating before we spring it on everybody.

Communicating to whom? The users? Is this an admission that Bluesky users don't view the network as public or that they don't want the network to be public?

This is phrased like "we need to spend a bunch of time clarifying and explaining how this will all work before we pull the rug out from under people's feet" but who on earth would this be pulling the rug out from under? Who would be confused about this change? This isn't actually complicated; if a user without an account looks at a post it will either be visible or it won't be visible. That doesn't require a FAQ.

If the idea of that post being visible is contrary to community expectations and if the devs feel they literally can't make open decisions because the community would oppose those changes, then that's a pretty heckin big problem and it sounds like they should stop advertising that this is intended to be an open network or that federation is coming any day now, because it doesn't sound like the community is on board with that idea.

Or is it a communication problem for people outside the network? But how? What would that even mean?

Who outside of Bluesky would be confused if the devs took measures to fulfill the promises they've been publicly making since day 1 of the network? This isn't some complicated thing where people will be misinformed or they'll be confused by the idea that they can view without an account but can't post without one. That's how most networks work, locking viewing behind a login is the abnormal confusing decision to outsiders.

Ultimately, the network will be publicly viewable or it won't be. I do not understand what about that would require a PR campaign. Were people signing up for Bluesky thinking that the network was going to be permanently private? Because if so, that is something the owners should be horribly embarrassed about.

----

It's just a fundamentally weird statement. The only thing that a closed-down network communicates is that it's closed down and exclusive. The only reason that fixing the network to reflect their own marketing would be a problem is if the network doesn't want to reflect the marketing. In which case, they should stop pretending that this is a temporary limit on growth to help prevent out-of-control scaling.

The most charitable take I can have about the response is that it's corporate bullcrap from people trying to take a simple decision that was made for marketing reasons (or has accidentally been found to be extremely valuable for marketing) and to after-the-fact justify it as something complicated and difficult so that they have an excuse to avoid actually making the change.

The less charitable take I could have is that they're being honest, and they're unable to make changes to make the network more open because their userbase would be hostile to those changes or without a PR campaign would view opening up the network as if it was an attack -- and if that's the case, that sure as heck is not making me feel confident that this network has any potential at all as an Open platform. If the users aren't on board with Bluesky as a federated and Open network for everyone, then y'all don't have an Open network and it doesn't matter what your plans are.

And in either case, it's clearly not a temporary technical restriction to help with scaling so I don't know why devs are jumping into threads pretending that it is. It's clearly a deeper problem or else the devs wouldn't be giving you this kind of a nonsense response as soon as you tried to dig into it more.



I still have a single invite left.


I'll take it if you still have it? [email protected].


Sent.


Thanks, and really enjoyed your blog while searching for your email! :)


Yes. I have invite codes if you would like one. Email in my profile.

Edit: they're all gone!



Was browsing around your website (mentioned in profile), noticed https://0x85.org/contact.html only mentions Twitter and email. Maybe the bluesky omission is intentional, but probably it just hasn't been updated yet? I'm not on bsky myself, currently having fun on mastodon and I'm not familiar with bsky enough to know what I'm missing out on, but for other folks I figured I'd mention it


Hey thanks. It's just outdated, what with young kids and grad school. Appreciate the note.


Is your offer invite code available to other randoms like myself? I tried to register on bsky months ago and still haven't been approved.


I have some extra if you'd like one. Let me know how to get it to you and I will.


I'm still trying to get one, if anyone sees this. I keep missing the ones posted. Email in profile.

Thanks.



emailed you with one


Finally! Thank you! My username is the same as on here if you (or anyone) want to connect.


Would also like one if you have an extra. Thx in advance. (Click on username to see my email in profile)


Got an invite from another user. Thx!!!


Sent you a code.


I'd love to have one if you have any left. (email in my bio)


emailed you with a code


I’d also like an invite if anyone still reading has any.


emailed you with one


well received, thanks a lot.


If you're offering I'd love one. My email is my username on hackernews at gmail.com


Sent you an invite code.


Could you send me one? Email in my profile. Thanks.


i emailed you one


Thanks a lot! I was able to signup using the code.


I'm happy to give them to any HN user but I'm afraid I have only three left and there are three emails in my inbox asking for invites, so if one is you then congratulations! Otherwise, sorry.


If anyone else still has invites I am also interested.

My mail is at the bottom of my bio.



I'll also add my 4 invite codes if anyone wants them

EDIT: I'm fresh out for now, sorry!



I'd also love a code if anyone has any to spare (email in bio)


if anyone still has any codes, please DM me one, email in my profile. Thanks muchly


Will the BGS also be federated, or is that to be the centralized big spider in Bluesky's web?


The BGS is a "dumb" relay and mirror of the network, so it generally shouldn't matter which one your client app is ultimately sourcing data from.

But yes, anyone is free to operate a BGS. It does necessarily require a non-trivial amount of storage, compute, and bandwidth. A funded startup, a well-funded non-profit, or just about any cloud provider could likely afford to run one.

It's also entirely possible to operate a BGS that only mirrors a slice of the network (for instance, only users in one country) if desired, which could in some cases make it affordable for a single user or small coop to operate.



In theory you can migrate between BGSes, but you can always just use one at any point in time.

In practice no one will switch because it makes no sense to do it. If there happen to ever be more than one real BGS contender, it will be from something like Cloudflare that will just replicate everything Bluesky Inc decides.



I don't know if it does not make sense. AFAIU these BGSes could be special-purposed e.g. for a business, community or topic of interest. Why wouldn't it make sense to synchronise the collected data between these BGSes and get a combined view on the data? With just a single BGS we have another centralized big tech platform. I think decentralized BGSes are a major factor in how interested people are in becoming part of the ecosystem.


Slightly related: is Bluesky moderated well enough, or do I get lots of right-wing and conspiracy crap like on Twitter currently?

I'd really love to have some more civilized hub again that isn't full of hate and anti-intellectualism.



I think it is still too small and people seem quite nice there. But, that has its drawbacks, as I keep returning to Twitter due to the slow migration in the recent months.

It is a shame, as it seems like a nice alternative that has some cool ideas.



Haven't poked my head in there in a while, but in my experience it was more the opposite where much of the discourse is dominated by tech-left influencer types and their followers who migrated from Twitter. Choose your echo chamber I guess.


It seems a lot nicer than Twitter. Though I'd wonder how much of that is just that it's invite-only right now. I haven't really gotten into it, for a variety of reasons (happy enough with Mastodon for most stuff, no decent client apps, vaguely suspicious of the involvement of Dorsey) but it seems... fine?


Are you assuming that hate and anti-intellectualism are exclusively a rightwing thing?


On Twitter, the place that hired Tucker Carlson after Fox News dumped him? Yeah it is. No need for "both sides"-ing on this one.


One example is hardly proof of an absolute, though. Assuming all conspiracy or anti-intellectual thought comes from one side because of Tucker Carlson is a huge logical misstep, in my opinion.


Not exclusively, but on the mainstream internet in 2023? Yeah, more or less, bar a few tankies.


It's anti-intellectual and uncivilized, but not because of rightwing conspiracy content. There is a strong culture of intolerance and censorship of viewpoints that diverge from the norm.


Get out of your bubble


good luck with running updates.


That looks like the PR from hell - 190 files changed, 143 commits? Mostly with names like "tidy" and "wip"

Props to whoever actually reviewed that, you are a warrior



I prefer to read the unified diff and commits don't matter as much.


Same. Do whatever you want in your feature branch, what matters is the Files list and the description in the PR. The whole thing gets squashed into a single commit anyway (which also makes reverting much easier).


In my experience with teams, as long as you

1. require reviews before merging

2. have PRs that are not fully disjoint (sometimes, e.g. for legacy maintenance projects, you mostly have disjoint PRs; normally you do not)

then you need stacked PRs for productivity, i.e. you need to be able to continue working in a new PR based on the old PR before that one is fully merged (or reviewed).

In this case, in my experience, three workflows work:

1. You (may) squash commits, and rebase stacked PRs once the previous PR has been merged (or sometimes majorly modified, but that is quite advanced rebase usage). This works but has some major pain points: 1) rebases during review are handled terribly badly by GitHub, 2) git doesn't keep track of the original start of a branch, which can lead to issues if you squash the commits when merging, 3) there is no good built-in tooling for it.

2. All forms of history manipulation are forbidden, including rebasing and squashing. It's merge-only; because of this, git doesn't get confused when merging squashed commits and everything seems fine... until you realize that follow-up changes from reviews of a parent PR land in the git history chronologically after your follow-up PR, and that can be a total pain depending on what changed. (Though you are allowed to fully rebase your history before marking a PR as ready for review, so as long as the "stack" of PRs isn't too deep, it's fine.)

3. You agree with Linus that GitHub PRs have major issues and go with a patch-based approach for merging. Now you need completely different tooling, which often has a less nice, modern UI, but it doesn't have any of the issues of points 1 or 2.

It was quite a "wtf are you doing, industry" moment when I realized that the most widely used contribution flows (whether in open source or in companies) are either quite flawed (1 & 2), productivity nightmares (no stacked PRs), or quite inconvenient (3).
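For workflow 1, `git rebase --onto` is the usual tool for restacking a child branch after its parent PR has been squash-merged. A hedged, self-contained sketch in a throwaway repo (repo layout, branch names, and files are all invented):

```shell
set -e
cd "$(mktemp -d)"
git init -q -b main                         # assumes git >= 2.28 for -b
git config user.email dev@example.com && git config user.name dev
git commit -q --allow-empty -m "initial"
git checkout -q -b parent-pr
echo p > p.txt && git add p.txt && git commit -q -m "parent work"
git checkout -q -b child-pr                 # stacked on top of parent-pr
echo c > c.txt && git add c.txt && git commit -q -m "child work"
# parent-pr gets squash-merged, so its commit is not an ancestor of main
# and a plain `git rebase main child-pr` would replay it a second time.
git checkout -q main
git merge --squash parent-pr >/dev/null
git commit -q -m "parent work (squashed)"
# Replay only the child's own commits (parent-pr..child-pr) onto main:
git rebase -q --onto main parent-pr child-pr
git log --oneline main..child-pr            # just "child work" remains
```

The key point is that `parent-pr..child-pr` selects only the child's own commits, so the squashed parent work is not duplicated.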



Reverts are also easy even if one merges the whole branch. Just revert the merge commit.

I almost never look at them, but once in a while it is really great to see the thought process that led to something.



don't know why, but recent teams around me have always made strict rules about number of commits in PRs. I just wanted to tell them the same thing you said: "Why don't you just look at the diffs?" curious for other opinions. (sorry not really about this particular topic)


I prefer to have clear commits that tell a tidy story. For example:

* Refactor function `foo` to accept a second parameter

* Add function `bar`

* Use `bar` and `foo` in component `Baz` to implement feature #X

If you give me a commit history like this, I can easily validate that each step in your claimed process does what you describe.

If you instead give me a messy history and ask me to read the diff, you might know that the change to file `Something.ts` on line 125 was conceptually part of the refactor to `foo`, but I'll have to piece that together myself. It's not obvious to the person who didn't write the code what the purpose of any given change was supposed to be.

This isn't a huge deal if your team's process is such that each step above is a PR on its own, but if your PRs are at the coarseness of a full feature, it's helpful to break down the sub-steps into smaller (but sane and readable) diffs.



This is reasonable, but the problem I encounter is how stifling it seems to ask others to structure their work so specifically. By way of comparison, getting compliance on conventional commit messages is a challenge, and that's an appreciably smaller ask than this.


Oh, for sure. This is how I structure my own PRs, but I've certainly never bothered to ask a coworker to do so, I just appreciate it when I see it.

That said, OP is in an environment where it sounds like this kind of structure is already the cultural norm.



From another one who tries to do the same (but doesn't enforce it):

Thanks!



In the context of Github PR you can’t leave reviews on commits other than what’s currently the tip commit of the pr branch so structuring this way is just wasted effort.

What you should be doing is breaking down PRs more finely so that your unrelated refactors are all separate single-commit PRs. That ofc requires that your pr review round trip time is fast



I'm pretty sure I've left comments on a commit before in a GitHub PR. The comment just goes in the right place in the PR diff, assuming no changes, or comments can actually be attached to commits themselves (which is what happens when a comment becomes stale—it retains a reference to the original commit).


Funny that two of your commits don't actually tell us why they exist, one simply describes the diff (which you should never need lol?) and the other proxies that responsibility to some other system.

You could have simply randomized the text in each commit, put the ticket id and the one "why" in the merge commit body, and gotten the same amount of real information in the end.



The first line of the commit message isn't about including information that couldn't be gleaned from the commit. That can be done in subsequent lines. The first line is for two purposes:

* Priming the reader so they are able to quickly interpret what they're seeing when they open the commit.

* Making it easy to search or scan for a specific change.

The last commit message in my example would probably have included the name of the feature as well as the ticket number, but I couldn't be bothered to invent an actual feature name.

DRY doesn't really apply to technical writing, at least not as extremely as you seem to think it should. Headings are supposed to summarize the contents, and that's what commit messages are: headings.



> Making it easy to search or scan for a specific change.

I'm trying to imagine the near infinite terms I would have to search for to find the commit where I "changed from a hash to a set".

Regardless, every other thing you said could also just be done in the central PR body (and thus the merge commit) and be much easier to access.

Instead of "priming the reader" it's infinitely more helpful to tell the reader why you did something, because you can't extract that from a diff.



> Instead of "priming the reader" it's infinitely more helpful to tell the reader why you did something, because you can't extract that from a diff.

Again, that can go in the PR body or in subsequent lines. You have ~50 characters in that first line, which is never going to be enough to fully explain anything.

I'm also not suggesting that you eliminate the PR body: that should also include more context. All I'm suggesting is that taking the trouble to organize your commits into discrete units helps reviewers to understand how you perceive the various changes in a single PR as being related to one another, and no amount of text in the PR body will provide the same benefit as being able to look at several distinct diffs containing related changes.



I like to leave comments like this too:

loop i up to n times

break when false

check value returned is not null



A good practice is to rebase your commits into a single commit before creating a PR. You are free to commit as many times as you want while doing your work. This minimizes the noise in the log.


It's only a good practice if the PR is a single logical change.


Squash is our git given right.


Maybe it's just an approach to try to force logically smaller PRs without trying to limit the number of lines changes.

I.e. with an idea like:

- if we try to commit so that each commit does a singular change

- then by limiting the number of commits we limit the number of "logical" changes in a PR

- and in turn make reviews and similar easier



Easy workaround. Start with feature branch f.

1. Branch f-prime from master. 2. Squash merge f to f-prime. 3. Pull request f-prime to master. 4. Profit.
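The steps above can be sketched end-to-end in a throwaway repo (branch names follow the comment; the files and commit messages are invented):

```shell
set -e
cd "$(mktemp -d)"
git init -q -b master                       # assumes git >= 2.28 for -b
git config user.email dev@example.com && git config user.name dev
git commit -q --allow-empty -m "initial"
git checkout -q -b f                        # the messy feature branch
echo a > a.txt && git add a.txt && git commit -q -m "wip"
echo b > b.txt && git add b.txt && git commit -q -m "typo"
# 1. Branch f-prime from master.
git checkout -q -b f-prime master
# 2. Squash-merge f into f-prime as one staged change, then commit.
git merge --squash f >/dev/null
git commit -q -m "feature X (squashed from f)"
# 3. Open the pull request from f-prime instead of f. 4. Profit.
git rev-list --count master..f-prime        # prints 1
```

f keeps its full history locally, while reviewers only ever see the single squashed commit on f-prime.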



Commit and push often. Put a novel explaining yourself in the PR. And that's enough IMO.


> Commit and push often. Put a novel explaining yourself in the PR. And that's enough IMO.

Someone reading the git log 5 years down the line most likely wouldn't be able to find your "novel" in the PR, and definitely won't appreciate it if, instead of a "novel", you ended up having a "short call" with the assigned reviewer in which you explained what you actually did in your 50 "wip" commits.



Someone reading 5 year old git logs is lost to begin with.


When debugging I routinely explore git blame and read the changelog. This sometimes leads to 3, 5 or even 10 years old code. Doesn't mean I'm lost.


You do you but, at the point of publishing a branch for review, I'd insist the changes are presented as a story, with well-written commit messages that helps the reader/reviewer orient themselves and presents a coherent narrative.

Anthing else, I call it a landfill site, not a maintained repository.

In fact, I'd go as far as using their commit habit as a measure of a candidate's consideration for their colleagues.



Commit your code and commit it often. There's no reason not to.


Sure, but then there's nothing wrong with rebasing it and making a nicer story for other people that want to review it.

Diffs are great but sometimes they're just as overwhelming in a huge PR. It's nice to first follow 5-10 commits in chunks of logical change.



I don't know why people are obsessed with squash merging. I always rebase (when needed) to preserve commit history. It's a good best practice, and makes it easier to spot errors after fixing conflicts.

I suspect squashers use the wrong tools. Use Sourcetree or, if you are on Linux, SmartGit. You can see a detailed log, which makes it much easier.



Dont send huge prs. They are hard/impossible to review anyway with good commit history or not


Sure, commit often while you're working.

But then when you're done, turn it into a series of patches for a reviewer to read. In the words of Greg Kroah-Hartman, "pretend I'm your math teacher and show your working".

In a maths assignment, you spend ages making a big mess on a scrap of paper. Then when you've got the solution, you list the steps nice and clearly for the teacher as if you got it right first time. In software development, if you're not a dick, you do the same. You make a big old mess with loads of commits, then when you're done and it's review time, you turn it into a series of tidy commits that are easy for someone to review one-by-one.



Why on Earth did people flag this? Indeed, you won't have a good time sending series of 50 "wip" commits to any kernel mailing list. Having a good split with proper commit messages and cover letter will both make your code much easier to understand for current reviewers and any future "code archeologist" who will have to fix bug in that code 10 years down the line.

Am I living in a bubble, and do all the glorified 500k-TC FAANG devs from HN really routinely submit changes consisting of a tangled mess of 50 "wip" commits for code review without any repercussions?



Commit and commit often, but then clean up the history into discrete, readable chunks.

If your PRs are tiny it's not a big deal, but with 190 files changed in this one, it absolutely should have been rebased into a more reasonable commit history.



Unless you’d like to maintain your train of thought.

I don’t want to interrupt my flow with intermediary commits.



Also continuously integrate (from trunk) if you want to hit that moving target sooner.


I don't think any method is gonna make it easy to grok 3,336 added lines and 5,421 removed


This is the answer.


Those two work very closely together, so probably not as nightmarish as it may appear to an observer. But, the two of them are most certainly warriors.


What if it was 190 files changed in 1 commit, would that make a difference?


It might.

With commits like "typo", you might as well squash these into the commit which introduced the typo in the changeset.

If there are changes across many files, and the changes were made automatically with some search-and-replace (or some refactoring tool).. by having a commit that's only that automatic change, it's easy to look at that commit and tell what the changes were. -- Presumably, non-automatic changes are going to be smaller.

I guess roughly, if it makes sense to apply a changeset that changes 5 things, you'd want 5 commits. Having commits like "typo" means there are more commits; but squashing those 5 things together makes it harder to discern the granular change.



> Props to whoever actually reviewed that, you are a warrior

Or a ghost.



lgtm


This seems like a very misleading title; the Bluesky PDS is the meant-for-self-hosting thing they distribute, not the Bluesky service as experienced and used by most of its users.


The “Personal” in PDS doesn’t mean it is only for self-hosting.

Bluesky has a main PDS instance at https://bsky.social that serves almost all of the Bluesky user base.

There is a good overview of the architecture here:

https://blueskyweb.xyz/blog/5-5-2023-federation-architecture

Here’s a snippet from the protocol roadmap they published 3-4 weeks ago [1]:

Multiple PDS instances

The Bluesky PDS (bsky.social) is currently a monolithic PostgreSQL database with over a million hosted repositories. We will be splitting accounts across multiple instances, using the protocol itself to help with scaling.

[1] https://atproto.com/blog/2023-protocol-roadmap



AFAIK there's only one version of the software so "the service" runs the same thing that you self-host. SQLite seems like it will simplify the single-user case though.


That's right. This is the same code Bluesky is running on our new PDS hosts. It's all open source.

The main motivation in moving from a big central Postgres cluster to single tenant SQLite databases is to make hosting users much more efficient, inexpensive, and operationally simpler.

But it's also part of the plan to run regional PDS hosts near users, increasing performance by decreasing end-to-end latency.

The most experimental part of this setup is using Litestream to replicate these many SQLite databases (there are almost 2 million user repositories) to cloud storage. But we're not relying on this alone, we're also going to maintain standard SQLite ".backup" snapshots as well.
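The standard SQLite `.backup` snapshot mentioned here can be sketched in a couple of commands (the database name, table, and contents below are invented for illustration):

```shell
set -e
cd "$(mktemp -d)"
# A stand-in for one per-user repository database.
sqlite3 user.db "CREATE TABLE repo(k TEXT, v TEXT); INSERT INTO repo VALUES('greeting','hi');"
# .backup takes an online, transactionally consistent copy, even while
# other connections are writing.
sqlite3 user.db ".backup 'user.db.bak'"
sqlite3 user.db.bak "SELECT v FROM repo WHERE k='greeting';"   # prints hi
```

Litestream complements this by continuously streaming WAL changes to object storage, whereas `.backup` produces discrete point-in-time copies.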



No, this actually moves every single user currently on the service into this setup. Everyone gets their own SQLite under the hood.


Cool, but maybe let people actually use your service before everyone forgets what it is?


They have over 1.8 million users currently, or do you mean PDSes specifically? Federation is in open beta on a test network, you can try it out today if you'd like.


I have been on the wait list since they launched. They seem to mostly rely on invites.


  bsky-social-scbch-eolha
  bsky-social-fs26y-d6gnv
  bsky-social-2lx5u-ntrdv
  bsky-social-hboq7-dyuue
  bsky-social-b2v3f-3a23q


Thank you so very much! :-)


damn, seems already all gone


    bsky-social-lkzsp-7x7ja
    bsky-social-p4vwr-nrthu
    bsky-social-bdu6c-6tbv4
    bsky-social-fkpgk-oestw


Got one, thank you sir!


Got a few invites, DM me on twitter, substack or masto if you want them (listing on https://bitecode.dev)


It shouldn't be too difficult to find an invite? They hand them out pretty frequently.


I have a few invites. Email me and I'll pass them out. :)


all gone


I’ve some invites lying around. DM me if you want one.


I'd like to take you up on that, if you still have one going?


> They have over 1.8 million users currently

How many of them active?



Come on, "over 1.8 million users" is not an impressive number.

These kinds of moves make me think they're not serious about scaling up. Wouldn't surprise me if they end up as an also-ran.



Maybe not impressive but none of the services of my customers had or has 1.8 million users. And yet they do well (my customers.)


That. And none of the big social media platforms were big at the start either.


Yes, but 1.8M at a time when people are longing for a Twitter alternative is just leaving money on the table.


There are two usual strategies for growing: 1) low cost, organic and slow or 2) high cost, throw a lot of money at advertisement, saturate all media, grow quickly or bust.

The exceptions are those rare products that despite a low cost marketing sell themselves so well that their organic growth is fast and in a few months everybody use them.

Maybe Bluesky doesn't have the money to advertise, or is not compelling enough. As one data point: I know about Mastodon, but I think I learned about Bluesky only today. I went to their site and there is nothing to explain how it works except that it's some social thing. I learned more by reading the comments here. Apparently it's being marketed at a very low cost.



Anyone interested in joining Bluesky, please grab these. I have extra and I've already invited all my Twitter mutuals I wanted to invite.

Edit: I'm all out now :)



All used :( Do you have anymore ?


I am curious, does the HN folks know if bluesky is more active than nostr or the mastodon network?


Less active than Mastodon, I'd assume more active than Nostr.

But the interesting thing for me isn't activity — it's the people on there.

Of the cohort who had >100k followers on Twitter, I think more of them post regularly on Bluesky than post on Mastodon. Bluesky definitely has a more cohesive feel, especially because there's currently just one instance & mod team.



I'm a donating supporter of the Mathstodon.xyz instance, but (sadly?) most of "math Twitter", at least the education-focused university faculty, ended up on Bluesky. I think there's a strong appeal in "a straightforward Twitter clone without Musk" for a lot of people.


Mastodon, and the Fediverse in general, make user interaction decisions on purpose to limit many of the issues common to social media. Think about: mob culture, addiction, and the like.

I wonder if BlueSky intends to follow on those. For example, hiding user actions counts (repeats, favourites, etc...) until the user acts on one.

Things like these may be strange for those accustomed to Twitter, but personally, that's what makes me stick with smaller instances on the Fediverse.



My bet is regardless of any initial good intentions, since BlueSky is a company, market pressures will inevitably force them into dark patterns like we see on every other commercial social network (going back to the early days of the companies, Facebook, Twitter, and even Google looked really good early on until all were corrupted by profit motive). My belief is that the profit motive is necessarily at odds with free communication.

To me, the ActivityPub network (Mastodon and friends) is relatively unique in the social media space in having no direct commercial pressures (the protocol is developed by W3C) and therefore being inoculated against the causes for these dark patterns.



I don't know about nostr, but I find it is a lot less active than Mastodon. In general the tech accounts I am interested in have moved to Mastodon rather than bluesky. I imagine this would depend on whose activity you are interested in, and where they have chosen to migrate to


Less active than Mastodon, more active than nostr.

https://vqv.app/stats/chart is useful for Bluesky and draws on the Bluesky firehose for data. https://stats.nostr.band/ seems useful for nostr.



Bluesky is much smaller than Mastodon, but how active it feels will depend on who you're following. It also has an Algorithm (TM); I never really missed this when I went from Twitter to Mastodon as I mostly used the linear timeline anyway, but I gather that some people find that Mastodon feels empty/inactive without one.


Bluesky is pretty positive and definitely lacks the "American Suburbia HOA" energy that some Mastodon instances have.

It's pretty active during North American hours.



They’re all just arbitrary ghettos that aren’t dissimilar to each other. None of them matter in terms of influence but are like nice Reddit boards for certain interests.


I've got a bunch of invites if folks want them:

bsky-social-etdu7-njigu

bsky-social-2ktcs-uwoxg

bsky-social-6f5nh-36gnq

bsky-social-ciwro-3gzk5

bsky-social-y4h57-dxh3g



The codes are all gone. That was fast.

E: Happy to take one, if somebody happens to have a spare one left. Email is in my bio.



Grabbed bsky-social-6f5nh-36gnq, thanks!


Damn, they all gone.


There you go folks:

bsky-social-h3d4w-u6yn4

bsky-social-74bqi-vkmcq

bsky-social-n3fdq-46nxz

bsky-social-yippe-32vdr

bsky-social-l2fbt-xnscx



Either people were really prepared for these codes to appear, or they are being scraped. Regardless, they're all gone


It seems a temporary, anonymous, private, receive-only dropbox (not the USB-drive-replacement kind) on the Internet is an unsolved problem. It doesn't have to be completely out-of-band like email; it could be just an encrypted public reply decoded via `cat | base64 -d | openssl rsautl -decrypt -inkey temp.key`, so long as up to a few bits (70 in this instance) of encrypted content were allowed on a platform.
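A hedged sketch of that scheme, using `openssl pkeyutl` in place of the deprecated `rsautl` subcommand (the key file names and the invite code are invented placeholders):

```shell
set -e
cd "$(mktemp -d)"
# The requester posts temp.pub publicly and keeps temp.key private.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out temp.key 2>/dev/null
openssl pkey -in temp.key -pubout -out temp.pub
# A code-giver encrypts the short secret against the posted public key
# and replies in public with the base64 ciphertext.
printf 'bsky-social-xxxxx-yyyyy' \
  | openssl pkeyutl -encrypt -pubin -inkey temp.pub \
  | base64 > public_reply.txt
# Only the holder of temp.key can recover the plaintext:
base64 -d public_reply.txt | openssl pkeyutl -decrypt -inkey temp.key
```

RSA can only encrypt a payload shorter than the key modulus, but a ~23-character invite code fits comfortably in a 2048-bit key.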


piracy websites were using base64 encoding for this purpose a while ago, but now it seems they've moved on to a proprietary algorithm


Any more?


Some more for y'all!

bsky-social-ge2mz-mfmpi

bsky-social-hykwa-x3ox4

bsky-social-gh4mt-2od6p

bsky-social-dejzy-mmcxf

edit: all gone :(



Snagged bsky-social-hykwa-x3ox4


4 more: (prepend bsky-social-)

7poji-p36pm

irn4h-ncvic

2hb2e-xhxnb

2k4na-5qiqu



And they are gone.


Gone in 60 seconds.


Any more?


bsky-social-lbjkg-gcxs4

bsky-social-zigwm-f3qpq

bsky-social-2jlu7-apy5a

bsky-social-6ct52-4egmz

bsky-social-cy64m-53sqn



Ya'll got anymore of that.


Maybe if you stop posting them with the easily-greppable first part they won't be so easy to scrape.


No one seems to be taking the two codes I'm putting up without the prefix for ~an hour, so this is likely the case

e: second one now used, first still up

e: both used



I wonder what they're being used for. The UI doesn't expose it, but the Bluesky API will tell you who redeemed your invites. Open the site, watch for a "com.atproto.server.getAccountInviteCodes" request in your browser's network inspector, look in the "usedBy" field in the response JSON, and append the DID value there onto "https://bsky.app/profile/". Any commenters in the parent chain who got scraped want to take a look?
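Once you've saved such a response, extracting the profile URLs can be scripted; a hedged sketch with `jq` (the `codes[].uses[].usedBy` shape is inferred from this comment, and the sample data below is invented):

```shell
set -e
cd "$(mktemp -d)"
# Stand-in for a saved com.atproto.server.getAccountInviteCodes response.
cat > invites.json <<'EOF'
{"codes":[{"code":"bsky-social-aaaaa-bbbbb",
           "uses":[{"usedBy":"did:plc:example123"}]}]}
EOF
# Build a profile URL for each redeemer DID.
jq -r '.codes[].uses[].usedBy | "https://bsky.app/profile/" + .' invites.json
```

Verify the field names against what your own network inspector shows before relying on this.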


I get "$username joined using your invite code!" in notification tab that leads to the user profile. So far the user hasn't done anything.


grep "bsky-" internet.txt


e: burner email didn't work, sorry

e4: check out dns on [my username].com



can't believe it's gone, too many smart people on this website lol


Not called hn for nothing! Glad I'm absolute bottomest on the floor in terms of intelligence or ability here


Nope that might be me. I started this comment thread and I didn't even manage to snag one despite getting direct replies multiple times with codes.


Thanks a lot!


to the people doing this: your codes will most likely be instantly stolen by bots and not real people


These seem to be gone.


they're all exhausted now :(


That sounds like centralizing


I sure hope they don’t ever want to change their db structure.

Why not use Postgres with row-level security (RLS)?



- simpler db client

- simpler cloud architecture

- simpler resource management

- simpler partial backups/restore

- simpler compliance with law enforcement

- partitioning might be easier, e.g. when handling "user account storage which should be undo-able for a while" (e.g. long-term-absent users' data could be moved to cold storage; blocked/deleted users' data could move to some scheduled-for-deletion space, allowing the deletion to be undone for a while but then reliably auto-deleting it; a copy of a user's data where crime detection triggered (e.g. CSAM) could be moved to a quarantine space; etc.). Each of those spaces can be a completely different server with different storage methods, retention policies, virtual access control, and physical access control. Sure, you can have all of that with RLS + partitioning + triggers + roles in Postgres, but this is the personal data store of a single user, so you don't need cross-user FK constraint enforcement, and it becomes much easier to make sure you don't miss anything wrt. access control or forget to partition/move some columns of a new table, etc.

- maybe simpler billing for storage ("just" size of DB)

Now, simpler doesn't mean better, but it often pays off as long as you don't run into the limits of what is possible with the simpler architecture (and as far as I can tell you can shard this approach really nicely, so at least there shouldn't be scaling performance limits; scaling cost and future feature-complexity limits might still apply).



You didn't systematically document "harder".





