Being able to ask qualifying questions like that, or to present options with different caveats clearly spelled out, is part of the job description IMO, at least for senior roles.
I would have had that uncertainty that you are describing when I was a junior dev.

But now, as a senior, I have the same questions and answers regardless of whether I'm being interviewed or not.
I want my boss to be straight with me; I need those below me to call me out if I'm talking bullshit.

Tell me this isn't big data, and then, if you must, tell me about Hadoop (or whatever big data is).
I think the point is that if it fits on a single drive, you can still get away with a much simpler solution (like a traditional SQL database) than any kind of "big data" stack.
The 980 is an M.2 drive, PCIe 3.0 x4, 3 years old, with up to 3500 MB/s sequential read. You want something like the PM1735: PCIe 4.0 x8, up to 8000 MB/s sequential read.

And while DDR5 is surely faster, the question is what the data access patterns are. In almost all cases (i.e. a mix of random access and occasional sequential reads), just reading from the NVMe drive would be faster than loading to RAM and then reading from there. In some cases you would spend more time processing the data than reading it.

PS: all these RAM bandwidth figures hold for sequential access; as you move toward random access, the bandwidth drops.

https://semiconductor.samsung.com/ssd/enterprise-ssd/pm1733-...
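A rough back-of-the-envelope sketch in plain Python of what those rates mean for a full scan. The drive figures are the ones quoted above, the 6 TiB size comes from the thread's original question, and the memory bandwidth is a ballpark assumption, not a measured number:

```python
# Rough scan-time estimates for a 6 TiB dataset at different sequential
# read rates. Drive figures are from the comment above; the memory
# bandwidth is an assumed ballpark.
TIB = 1024 ** 4
data_bytes = 6 * TIB

rates_mb_per_s = {
    "PCIe 3.0 x4 NVMe (~3500 MB/s)": 3_500,
    "PCIe 4.0 x8 NVMe (~8000 MB/s)": 8_000,
    "DDR5 memory (assumed ~50 GB/s)": 50_000,
}

for name, mb_s in rates_mb_per_s.items():
    minutes = data_bytes / (mb_s * 1_000_000) / 60
    print(f"{name}: full sequential scan in roughly {minutes:.1f} minutes")
```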
I agree completely at this scale. I did want to point out that it's fairly easy these days to do on a single machine the kinds of things one would otherwise do on a cluster, which I learned just a few months ago myself :)
> Parquet is not a database.

This is not emphasized often enough. Parquet is useless for anything that requires writing back computed results, such as data used by signal processing applications.
I do not know a good one.

A former colleague of mine is now working on a memory-mapped log-structured merge tree implementation, and it could be a good alternative. LSM provides elasticity: one can store as much data as one needs; the data is static, so it can be compressed about as well as Parquet-stored data; and memory mapping plus implicit indexing of the data does not require additional data structures.

Something like LevelDB and/or RocksDB can provide most of that, especially when used in covering-index [1] mode.

[1] https://www.sqlite.org/queryplanner.html#_covering_indexes
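A minimal covering-index sketch using Python's built-in sqlite3 module, in the spirit of the SQLite doc linked above. The table and column names are hypothetical; the point is that the index contains every column the query touches, so SQLite can answer it from the index alone:

```python
import sqlite3

# Hypothetical sensor table. The index below covers (sensor_id, ts, value),
# so the query that follows can be satisfied from the index without
# reading the table rows themselves.
con = sqlite3.connect("readings.db")
con.execute("CREATE TABLE IF NOT EXISTS readings (sensor_id INTEGER, ts INTEGER, value REAL)")
con.execute("CREATE INDEX IF NOT EXISTS idx_cover ON readings (sensor_id, ts, value)")

rows = con.execute(
    "SELECT ts, value FROM readings WHERE sensor_id = ? ORDER BY ts",
    (42,),
).fetchall()
con.close()
```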
> which works for CSVs that fit in memory.

What? Why would the CSV be required to fit in memory in this case? I have tested CSVs that are far larger than memory, and it works just fine.
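For what it's worth, a minimal sketch of why the file does not need to fit in memory: Python's csv module (like awk, or any other streaming tool) reads one row at a time, so memory use stays flat regardless of file size. The file and column names here are made up:

```python
import csv

# Stream a CSV that is (much) larger than RAM: rows are processed one at
# a time, so memory use stays constant.
total = 0.0
count = 0
with open("huge.csv", newline="") as f:
    for row in csv.DictReader(f):
        total += float(row["amount"])   # "amount" is a hypothetical column
        count += 1

print(total / count if count else "empty file")
```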
If you want to shine with snide remarks, you should at least understand the point being made:
I've downloaded many CSV files that were malformed (extra commas or tabs, etc.), or had dates in non-standard formats. The Parquet format probably would not have had these issues!
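A small sketch of a cheap sanity check for that kind of malformed CSV, using only the standard library; the file name is hypothetical. It flags rows whose field count disagrees with the header:

```python
import csv

# Count rows whose field count differs from the header, which catches
# the extra-comma / extra-tab problems described above.
with open("downloaded.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    bad = sum(1 for row in reader if len(row) != len(header))

print(f"{bad} rows with an unexpected number of fields")
```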
clickhouse-local has been astonishingly fast for operating on many GB of local CSVs.

I had a heck of a time running the server locally before I discovered the CLI.
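For anyone who wants to script it, a hedged sketch of driving clickhouse-local from Python. The binary name and flags vary between ClickHouse versions (newer single-binary installs use `clickhouse local`), and the file and query here are made up, so treat this as a starting point rather than a recipe:

```python
import subprocess

# Run clickhouse-local against a local CSV, no server required.
# Assumes a recent release where `clickhouse-local --query` and the
# file() table function are available; adjust to your install.
query = "SELECT count(*) FROM file('events.csv', 'CSVWithNames')"
result = subprocess.run(
    ["clickhouse-local", "--query", query],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```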
Log files aren't data. That's your first problem. But that's the only thing most people have that generates more bytes than can fit on screen in a single spreadsheet.
> I already qualified my statement quite well by stating my background

No. You qualified it with "blows my mind". Why would it 'blow your mind' if you don't have any data background?
Are you trolling? Did you miss the part where I said I worked with data but wouldn't say I'm a professional data scientist?

This negative cherry-picking does not do your image any favors.
> The winner of course was the guy who understood that 6TiB is what 6 of us in the room could store on our smart phones, or a $199 enterprise HDD (or three of them for redundancy), and it could be loaded (multiple times) to memory as CSV and simply run awk scripts on it.

If it's not a very write-heavy workload but you still want to be able to look things up, wouldn't something like SQLite be a good choice? It handles databases up to 281 TB: https://www.sqlite.org/limits.html It even has basic JSON support, if you're up against some freeform JSON and not all of your data neatly fits into a schema: https://sqlite.org/json1.html

A step up from that would be PostgreSQL running in a container: it gives you support for all sorts of workloads and more advanced extensions for pretty much anything you might ever want to do (from geospatial data with PostGIS to pgvector, timescaledb, etc.), while still having a plethora of drivers, still not making you drown in complexity, and having no issues with a few dozen/hundred TB of data.

Either of those would be something that most people on the market know, neither will make anyone want to pull their hair out, and they'll give you the benefit of both quick data writes/retrieval and querying. Not that everything needs or can even work with a relational database, but it's still an okay tool to reach for past trivial file-storage needs. Plus, you have to build a bit less of whatever functionality you might need around the data you store, and there are even nice options for transparent compression.
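A minimal sketch of the JSON angle using Python's built-in sqlite3 module, assuming the underlying SQLite build includes the JSON1 functions (most modern builds do). The table, columns, and payload are made up:

```python
import sqlite3

# Stash freeform JSON in a TEXT column and query into it with
# json_extract() from SQLite's JSON1 functions.
con = sqlite3.connect("events.db")
con.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)")
con.execute(
    "INSERT INTO events (payload) VALUES (?)",
    ('{"user": "alice", "bytes": 123}',),
)
con.commit()

rows = con.execute(
    "SELECT json_extract(payload, '$.user'), json_extract(payload, '$.bytes') FROM events"
).fetchall()
print(rows)
con.close()
```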
Now you have to consider the cost of your whole team learning how to use AWK instead of SQL. Then you do these TCO calculations and revert back to the BigQuery solution.
And since the data scientist cannot verify the very complex AWK output that should be 100% compatible with his SQL query, he relies on the GPT output for business-critical analysis.
I'm having flashbacks to some new outside-hire CEO making flim-flam about capex-vs-opex in order to justify sending business towards a contracting firm they happened to know.
> I mean if you're doing data science the data is not always organized and of course you would want multi-processing

Not necessarily - I might not want it or need it. It's a few TB; it can sit on a fast HDD, on an even faster SSD, or even in memory. I can crunch it quite fast even with basic linear scripts/tools. And "organized" could just mean some massaging, or just having it in CSV format.

This is exactly the rushed jump to "needing this" and "must have that" that the OP describes, which leads people to suggest huge setups, distributed processing, and multi-machine infrastructure for use cases and data sizes that could fit on a single server with redundancy and be done with it. DHH has often written about this for Basecamp's needs (scaling vertically where others scale horizontally has worked for them for most of their operation), and there's also this classic post: https://adamdrake.com/command-line-tools-can-be-235x-faster-...

> 1 TB of memory is like 5 grand from a quick Google search then you probably need specialized motherboards.

Not that specialized: I've worked with server deployments (HP) with 1, 1.5, and 2 TB of RAM (and >100 cores); it's trivial to get. And 5 or even 30 grand would still be cheaper (and more effective and simpler) than the "big data" setups some of those candidates have in mind.
I'm trying to understand what the person I'm replying to had in mind when they said to fit six terabytes in memory and search it with awk.

Is this what they were referring to: just a big-ass RAM machine?
I wouldn't underestimate how much a modern machine with a bunch of RAM and SSDs can do vs HDFS. This post [1] is now 10 years old and has find + awk running an analysis in 12 seconds (at a speed roughly equal to his hard drive's) vs Hadoop taking 26 minutes. I've had similar experiences with much bigger datasets at work (think years of per-second manufacturing data across tens of thousands of sensors).

I get that that post is only on 3.5 GB, but consumer SSDs are now much faster, at 7.5 GB/s vs the 270 MB/s HDD back when the article was written. Even with only mildly optimised solutions, people are churning through the 1 billion row (±12 GB) challenge in seconds as well. And if you have the data in memory (not impossible), your bottleneck won't even be reading speed.

[1]: https://adamdrake.com/command-line-tools-can-be-235x-faster-...
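In the same spirit as the linked post and the 1 billion row challenge, here is a rough sketch of a chunked, multi-core scan over one big file using only the Python standard library. The file name and the pattern being tallied are hypothetical; the byte-range chunking aligned to line boundaries is the part that matters:

```python
import os
from multiprocessing import Pool

PATH = "big.csv"      # hypothetical multi-GB file, one record per line
NEEDLE = b"ERROR"     # hypothetical pattern to tally, stand-in for a real analysis

def aligned_chunks(path, n):
    """Split the file into n byte ranges, each starting on a line boundary."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(size * i // n)
            f.readline()            # skip forward to the next line boundary
            bounds.append(f.tell())
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))

def count_matches(chunk):
    """Count matching lines inside one byte range of the file."""
    start, end = chunk
    hits = 0
    with open(PATH, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if NEEDLE in line:
                hits += 1
    return hits

if __name__ == "__main__":
    workers = os.cpu_count() or 4
    with Pool(workers) as pool:
        total = sum(pool.map(count_matches, aligned_chunks(PATH, workers)))
    print(total)
```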
How exactly is this solution easier than putting those very Parquet files on a classic filesystem? Why does the easy solution require an Amazon subscription?
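For comparison, a minimal sketch of doing exactly that with Parquet files on a local filesystem, assuming pyarrow is installed (and pandas for the last step); the file and column names are hypothetical:

```python
import pyarrow.parquet as pq   # assumes pyarrow is installed

# Read a Parquet file straight off the local filesystem; no object store
# or cloud service involved. Column projection keeps I/O low.
table = pq.read_table("events.parquet", columns=["user_id", "ts"])
df = table.to_pandas()         # optional; requires pandas
print(df.head())
```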
Indeed, what I meant to say is that you can load it in multiple batches. However, now that I think about it, I did play around with servers with TiBs of memory :-)
If you look at the article, the data size is more commonly 10 GB, which matches my experience. For these sizes, simple tools are definitely enough.
What did you gather as the 'needed domain' from that comment? The 'needed domain' is often implicit; it's not a blank slate. Candidates assume all sorts of 'needed domain' even before the interview starts; if I am interviewing at a bank, I wouldn't suggest 'load it on your laptops' as my 'stack'.

The OP even mentioned that it is his favorite 'tricky question'. It would definitely trick me, because they used the word 'stack', which has a specific meaning in the industry. There are even websites dedicated to 'stacks': https://stackshare.io/instacart/instacart
On the other hand, if salaries are at 300k, then 10k compared to that is not a huge cost. If a scalable tool can make you even 10 percent more effective, it would be worth 30k.
I don't know anything, but when doing that I always end up next Thursday having the same problem with 4 TB, and the Thursday after with 17, at which point I regret picking a solution that fit so exactly.
In my context, 99% of the problem is the ETL, nothing to do with complex technology. I see people get stuck when they need to pull this data from different sources in different technologies and/or APIs.
I patiently listened to all the BigQuery/Hadoop habla-blabla, even asked questions about the financials (hardware/software/license BOM), and many of them came up with astonishing figures of tens of thousands of dollars yearly.
The winner of course was the guy who understood that 6TiB is what 6 of us in the room could store on our smart phones, or a $199 enterprise HDD (or three of them for redundancy), and it could be loaded (multiple times) to memory as CSV and simply run awk scripts on it.
I am prone to the same fallacy: when I learn how to use a hammer, everything looks like a nail. Yet, not understanding the scale of "real" big data was a no-go in my eyes when hiring.
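As a rough illustration of how little machinery the winning answer needs, here is a single-pass group-by-and-sum over a CSV in plain Python, standing in for the awk scripts described above. The file and column names are hypothetical; the same idea works whether the file is on disk or already loaded into RAM:

```python
import csv
from collections import defaultdict

# One streaming pass over the CSV: group on one column, sum another.
# Roughly what a short awk script would do over the same file.
totals = defaultdict(float)
with open("transactions.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["region"]] += float(row["amount"])

for region, amount in sorted(totals.items()):
    print(region, amount)
```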