Comments

Original link: https://news.ycombinator.com/item?id=39458264

Honestly, I have no preference or personal opinion here, but based on the text above, AI art has limitations, as some works tend to come across as superficial or unsettling. That said, it is also mentioned that some beautiful creations hide among the strange ones. Ultimately, whether AI-generated art is considered great depends largely on individual taste and interpretation, and some people enjoy exploring the surreal and abstract territory of AI artwork. Still, compared with traditional art forms created by humans, AI-created art currently lacks depth and complexity, because AI artists operate within predefined algorithms and parameters designed only to produce aesthetics that fall into pleasing or satisfying categories according to the metrics programmed into their systems. As a result, some people may prefer the comfort and familiarity inherent in traditional artwork crafted by human hands. However, as AI art-generation technology advances, AI-created artworks may become more nuanced, refined, and complex, outright challenging traditional art forms or offering viewers an alternative perspective.


Original
The killer app of Gemini Pro 1.5 is using video as an input (simonwillison.net)
1000 points by simonw 15 hours ago | 403 comments

Note that a video is just a sequence of images: OpenAI has a demo with GPT-4-Vision that sends a list of frames to the model with a similar effect: https://cookbook.openai.com/examples/gpt_with_vision_for_vid...

If GPT-4-Vision supported function calling/structured data for guaranteed JSON output, that would be nice though.

There's shenanigans you can do with ffmpeg to output every-other-frame to halve the costs too. The OpenAI demo passes every 50th frame of a ~600 frame video (20s at 30fps).
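
For reference, a minimal sketch of that frame-sampling approach, assuming the openai Python SDK and OpenCV (the prompt, file name, and sampling stride are illustrative):

  import base64
  import cv2                 # pip install opencv-python
  from openai import OpenAI  # pip install openai

  # Read the whole video, JPEG-encode each frame, base64 it
  video = cv2.VideoCapture("input.mp4")
  frames = []
  while True:
      ok, frame = video.read()
      if not ok:
          break
      _, buf = cv2.imencode(".jpg", frame)
      frames.append(base64.b64encode(buf).decode("utf-8"))
  video.release()

  # Keep every 50th frame, as in the cookbook demo
  # (frames[::2] is the every-other-frame cost trick)
  sampled = frames[::50]

  client = OpenAI()
  response = client.chat.completions.create(
      model="gpt-4-vision-preview",
      messages=[{
          "role": "user",
          "content": [
              {"type": "text", "text": "Describe what happens in this video."},
              *({"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
                for b64 in sampled),
          ],
      }],
  )
  print(response.choices[0].message.content)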

EDIT: As noted in discussions below, Gemini 1.5 appears to take 1 frame every second as input.



We've done extensive comparisons against GPT-4V for video inputs in our technical report: https://storage.googleapis.com/deepmind-media/gemini/gemini_....

Most notably, at 1 FPS the GPT-4V API errors out around 3-4 minutes, while 1.5 Pro supports up to an hour of video input.



So 3-4 minutes at 1 FPS means you are using about 500 to 700 tokens per image, which suggests you are using `detail: high` with something like 1080p frames fed to gpt-4-vision-preview (unless you have another private endpoint).
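
A plausible back-of-envelope behind that 500-700 range, assuming the API errors out when the ~128k-token context window of gpt-4-vision-preview fills at 1 FPS:

  128,000 tokens / 240 frames (4 min) ≈ 533 tokens per frame
  128,000 tokens / 180 frames (3 min) ≈ 711 tokens per frame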

Gemini 1.5 Pro uses about 258 tokens per frame (2.8M tokens for 10,856 frames).

Are those comparable?



> while 1.5 Pro supports up to an hour of video inputs

At what price, tho?



The average shot length in modern movies is between 4 and 16 seconds and around 1 minute for a scene.


The number of tokens used for videos - 1,841 for my 7s video, 6,049 for 22s - suggests to me that this is a much more efficient way of processing content than individual frames.

For structured data extraction I also like not having to run pseudo-OCR on hundreds of frames and then combine the results myself.
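
For anyone wanting to reproduce this, a sketch of the flow using the File API in the google-generativeai Python SDK (which shipped after this thread; the model name and prompt are illustrative):

  import time
  import google.generativeai as genai  # pip install google-generativeai

  genai.configure(api_key="...")

  # Upload the video, then poll until the service finishes processing it
  video = genai.upload_file(path="bookshelf.mp4")
  while video.state.name == "PROCESSING":
      time.sleep(5)
      video = genai.get_file(video.name)

  model = genai.GenerativeModel("gemini-1.5-pro-latest")
  response = model.generate_content([
      video,
      "Output a JSON array of objects with 'title' and 'author' for every book visible in this video.",
  ])
  print(response.text)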



No, it's individual frames

https://developers.googleblog.com/2024/02/gemini-15-availabl...

"Gemini 1.5 Pro can also reason across up to 1 hour of video. When you attach a video, Google AI Studio breaks it down into thousands of frames (without audio),..."

But it's very likely individual frames at 1 frame/s

https://storage.googleapis.com/deepmind-media/gemini/gemini_...

"Figure 5 | When prompted with a 45 minute Buster Keaton movie “Sherlock Jr." (1924) (2,674 frames at 1FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame in and provides the corresponding timestamp. At bottom right, the model identifies a scene in the movie from a hand-drawn sketch."



Despite that being in their blog post, I'm skeptical. I tried uploading a single frame of the video as an image and it consumed 258 tokens. The 7s video was 1,841 tokens.

I think it's more complicated than just "split the video into frames and process those" - otherwise I would expect the token count for the video to be much higher than that.

UPDATE ... posted that before you edited your post to link to the Gemini 1.5 report.

684,000 (total tokens for the movie) / 2,674 (their frame count for that movie) = 256 tokens - which is about the same as my 258 tokens for a single image. So I think you're right - it really does just split the video into frames and process them as separate images.





Edit: Was going to post similar to your update. 1841/258 = ~7


I mean, that's just over 7 frames, or one frame per second of video. There are likely fewer than that many I-frames in your video.


The model is fed individual frames from the movie, BUT the movie is segmented into scenes. These scenes are held in context for 5-10 scenes, depending on their length. If the video exceeds a specific length - or better said, a threshold number of scenes - it creates an index and summary. So yes, technically the model looks at individual frames, but there's a bit more tooling behind it.


From the Gemini 1.0 Pro API docs (which may not be the same as Gemini 1.5 in Data Studio): https://cloud.google.com/vertex-ai/docs/generative-ai/multim...

> The model processes videos as non-contiguous image frames from the video. Audio isn't included. If you notice the model missing some content from the video, try making the video shorter so that the model captures a greater portion of the video content.

> Only information in the first 2 minutes is processed.

> Each video accounts for 1,032 tokens.

That last point is weird, because there is no way a video would be a fixed number of tokens, and I suspect it is a typo. The value is exactly 4x the number of tokens for an image input to Gemini (258 tokens), which may be a hint about the implementation.



On the other hand, a picture is a video with a single frame.


I expected more from the video


Prompt injection via Video?




> It looks like the safety filter may have taken offense to the word “Cocktail”!

I'm definitely not a fan of these severely-hamstrung-by-default models. Especially as the filtering seems to be based on an extremely puritan ethical system.



Deeply agree with the sentiment. AIs are so throttled and crippled that it makes me sad every time Gemini or ChatGPT refuses to answer my questions.

Also agree that it’s mostly policed by American companies who follow the American culture of “swearing is bad, nudity is horrible, some words shouldn’t even be said”



So how crippled would you like them to be? Would you put any guard rails in place?


These guard rails might curtail abuse of the web-based applications of these models for a while, but any locally run model can have (and in many cases already has) these protections stripped out of it.

I'd like control over what the guard rails do. I'd still use them under most circumstances - there are things I definitely do not want to generate - but if a word filter is getting in my way I'd like the ability to get rid of it.



Finally, early-aughts 1337 a3s7h37ic can be cool again


I was fighting with ChatGPT yesterday because it wouldn't translate "fuck". I was quoting Office Space's "PC Load Letter? What the fuck does that mean?"

Likewise it won't generate passive-aggressive answers meant for comedic reasons.

I hate having to negotiate with AI like it's a difficult child.



I wonder, if you put asterisks like 'f***', whether it would translate that appropriately. Like, as a fig leaf.


I don’t think it’d take offense at alcohol. Most likely that’s because cocktail rhymes with Molotov.


I think it's the COCK in cocktail.


> Most likely that’s because cocktail rhymes with Molotov

What definition of 'rhymes' are you using here?



One of the faults is that for every version of morality you can hallucinate a reason why cocktail is offensive or problematic.

Is it sexual? Is it alcohol? Is it violence? All of the above?

For example, good luck ever actually processing art content with that approach. Limiting everything to the lowest common denominator to avoid stepping on anyone's toes at all times is, paradoxically, a bane on everyone.

I believe we need to rethink how we deal with ethics and morality in these systems. Obviously, without a priori context every human, actually every living being, should be respected by default and the last thing I would advocate for is to let racism, sexism, etc. go unchecked...

But how can we strike a meaningful balance here?



We're months into this technology being available so it's not a surprise that the various "safeties" have not been perfectly tuned. Perhaps Google knew they couldn't be perfect right now and they could err on the side of the model refusing to talk about cocktails, or err on the side of it gladly spouting about cocks. They may have made a perfectly valid choice for the moment.


If you want a great example of how this plays out long-term, look no further than algospeak[0] - the new lingo created by censorship algorithms like those on youtube and tiktok.

[0] https://www.nytimes.com/2022/11/19/style/tiktok-avoid-modera...



The Chinese have been doing this for years to get around government censorship.


Paywall




If you are averse to seeing links to paywalled articles you probably shouldn't use HN


Why are we spamming the landing page of paywalled sources? We should completely avoid posting them here, to preserve their bandwidth and our sanity.


I personally flag any paywall links, I recommend you do the same.


Why? They're completely allowed on the site. Dang has said this many times


Silicon Valley has been auto-parodic morals-wise for a while. Hell, just the basics - you can have super-violent gaming, but woe betide you look at anything sex-related in the app stores - are intensely comedic. America desperately tries to export its puritanism, but most of us just shrug (along with many Americans). Surely it's hard to argue against being open about sex (for consenting adults) being infinitely preferable to a world of wanton, easily accessible violence.


Ok, crazy tangent;

Where agents will potentially become extremely useful/dystopian is when they just silently watch your entire screen at all times. Isolated, encrypted and local preferably.

Imagine it just watching you coding for months, planning stuff, researching things; it could potentially give you personal and professional advice from deep knowledge about you. "I noticed you code this way, may I recommend this pattern" or "I noticed you have signs of this diagnosis from the way you move your mouse and consume content, may I recommend this lifestyle change".

I wonder how long before something like that is feasible, i.e. a model you install that is constantly updated, but also constantly merged with world data, so it becomes more intelligent on two fronts and can follow along as hardware and software advance over the years.

Such a model would be dangerously valuable to corporations / bad actors, as it would mirror your psyche and remember so much about you - so it would have to be running with a degree of safety I can't even imagine, or you'd be cloneable or lose all privacy.



I'm working on this! https://www.perfectmemory.ai/

It's encrypted (on top of BitLocker) and local. There's all this competition over who makes the best, most articulate LLM. But the truth is that off-the-shelf 7B models can put sentences together with no problem. It's the context they're missing.



I feel like the storage requirements are really going to be the issue for these apps/services that run on "take screenshots and OCR them" functionality with LLMs. If you're using something like this, a huge part of the value proposition is in the long term, but until something has a more efficient way to function, even a 1-year history is impractical for a lot of people.

For example, consider the classic situation of accidentally giving someone the same Christmas gift that you gave a few years back. A sufficiently powerful personal LLM that 'remembers everything' could absolutely help with that (maybe even give you a nice table of the gifts you've purchased online, who they were for, and what categories of items would complement a previous gift), but only if it can practically store that memory over a multi-year time period.



Two years ago I set up a cron job to take a screenshot every minute.

I just did the second phase: using ocrmac (a Vision Kit CLI, on GitHub) to extract the text and dump it into SQLite with FTS5.

It’s simplistic but does the job for now.

I looked at reducing the storage requirements by using ImageMagick to only store the difference between images - some 5-minute sequences are essentially the same screen - but let that one go.
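
For the curious, the whole pipeline fits in a few lines. A sketch assuming mss for capture and Tesseract for OCR (the parent used ocrmac / Apple's Vision framework instead), and that your Python's SQLite ships with FTS5 (most do):

  import sqlite3, time
  import mss          # pip install mss
  import pytesseract  # pip install pytesseract (needs the tesseract binary)
  from PIL import Image

  db = sqlite3.connect("screenlog.db")
  db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS shots USING fts5(ts, txt)")

  # One capture + OCR + insert; run this from cron every minute
  with mss.mss() as sct:
      path = sct.shot(output="/tmp/shot.png")
  text = pytesseract.image_to_string(Image.open(path))
  db.execute("INSERT INTO shots VALUES (?, ?)",
             (time.strftime("%Y-%m-%d %H:%M:%S"), text))
  db.commit()

  # Later: full-text search everything that was ever on screen
  query = "invoice"
  for ts, snip in db.execute(
          "SELECT ts, snippet(shots, 1, '[', ']', '...', 8) "
          "FROM shots WHERE shots MATCH ?", (query,)):
      print(ts, snip)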



/using ImageMagick to only store the difference between images/

Well, that's basically how video codecs work... So might as well just find some codec params which work well with screen capture, and use an existing encoder.
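
Something along these lines, perhaps; the parameters are a starting point rather than a recommendation, but x264's stillimage tune targets exactly this kind of mostly-static content:

  import subprocess

  # Pack one screenshot per minute into a video; inter-frame compression
  # stores near-identical screens almost for free
  subprocess.run([
      "ffmpeg", "-framerate", "1", "-pattern_type", "glob", "-i", "shots/*.png",
      "-c:v", "libx264", "-tune", "stillimage", "-preset", "veryslow", "-crf", "28",
      "screenlog.mkv",
  ], check=True)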



Thanks for sharing. Curious, what main value adds have you gotten out of this data?


It's not that bad. With Perfect Memory AI I see ~9GB a month. That's 108 GB/year, and HDDs/SSDs are getting bigger than that every year. The storage also varies by what you do, your workflow, and display resolution. Here's an article I wrote on my findings about storage requirements: https://www.perfectmemory.ai/support/storage-resources/stora...

And if you want to use the data for LLM only, then you don't need to store the screenshots at all. Then it's ~ 15MB a month



> That's 108 GB/year. HDD/SSDs are getting bigger than that every year.

Cries in MacBook Pro



Outboard TB 3/4 storage only seems expensive until you price it against Apple's native storage. Is it slower? Of course! Is it fast enough? Probably.


I recently moved my macOS installation to an external Thunderbolt drive - it's faster than the internal SSD.


Considering storage is a wasting asset and what Apple charges, this makes perfect sense to me.


The funny thing is Apple even has a support article on how to do this (and actually says in it "may improve your performance"). I literally followed it step by step; it was very easy and I had no issues.


Can you share the Thunderbolt drive you got?


https://glyphtech.com/products/atom-pro?variant=321211999191...

Shipping to the UK added a bit to the overall price with shipping and import duty, but it was still better value for money, and a hugely more reliable brand, than anything I could have bought domestically.



It's Windows only so it won't run on your Mac anyway :-)


PerfectMemory is only available on Windows at the moment.


https://Rewind.ai is the macOS equivalent


Does storage use scale linearly with the number of connected monitors (assuming each monitor uses the same resolution)?


Most screenshots are of the application window in the foreground, so unless your application spans all monitors, there is no significant overhead with multiple monitors. DPI on the other hand has a significant impact. The text is finer, taking more pixels...


Why should DPI matter if the app is taking screenshots?


Is the 15MB basically embeddings from the video screenshots? What would it recall if the screenshots aren't saved?


I’m not sure if the above product does this, but you could use a multimodal model to extract descriptions of the screenshots and store those in a vector database with embeddings.


I think ultimately you’d want it to summarize that down to something like:

“Purchased socks from Amazon for $10 on 12/4/2024 at 5:04PM, shipped to Mom, 1600 Pennsylvania Av NW, Washington DC 20500, order number 1463355337”

Probably stored in a vector DB for RAG.
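
A minimal sketch of that retrieval step, assuming OpenAI embeddings and brute-force cosine similarity standing in for a real vector DB (the stored summaries are hypothetical):

  import numpy as np
  from openai import OpenAI

  client = OpenAI()

  def embed(texts):
      resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
      return np.array([d.embedding for d in resp.data])

  # Summaries previously distilled from screenshots
  docs = [
      "Purchased socks from Amazon for $10 on 12/4/2024, shipped to Mom",
      "Booked non-refundable flight LHR-JFK departing 3/2/2025",
  ]
  doc_vecs = embed(docs)

  def recall(query, k=1):
      q = embed([query])[0]
      sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
      return [docs[i] for i in np.argsort(-sims)[:k]]

  print(recall("what did I buy mom for christmas?"))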



Maybe. Until we find there’s a better way to encode the information and need the unfiltered, original context so it can be used with that new method.


This is where Microsoft (and Apple) has a leg up -- they can hook the UI at the draw level and parse the interface far more reliably + efficiently than screenshot + OCR.


Google too, for all practical purposes, since presumably this is mostly just watching you use chrome 90% of the time.


All the more reason not to use Chrome...


This reminds me of how Sherlock, Spotlight and its iterations came to be. It was very resource intensive to index everything and keep a live db, until it was not.


Your website and blog are very low on details about how this works. Downloading and installing an MSI directly feels unsafe, IMO, especially when I don't know how this software works. Is it recording a video, performing OCR continuously, taking just screenshots?

There's no mention of using any LLMs in there at all, which is how you are presenting it in your comment here.



Feedback taken. I'll add more details on how this works for us technical people. LLM integration is in progress and coming soon.

Any idea what would make you feel safe? 3rd party verification? I had it verified and published by the Microsoft Store. I feel eventually it all comes down to me being a decent person.



Welp. This pretty much convinces me that it's time I get out of tech and lean into the trade work I do in my spare time.

Because I'm sure you and people like you will succeed in your endeavors, naively thinking you're doing good. And you, or someone like you, will sell out, and the most ruthless investor will take what you've built and use it as one more cudgel of power to beat the rest of us with.



If you want to help, use your knowledge to help shape policy. Because it is coming/already happening, and it will shape your life even if you are just living a simple life. I guarantee you that your city and state governments are passing legislation to incorporate AI to affect your life if they can be sold on it in the name of "good".


Basically looks like rewind.ai but for the PC?


exactly. the UI is shockingly similar


This looks cool, I hope you support macOS at some point in the future


Any plan to implement this on macOS or Linux?


I got 90% of this built on Linux (around KDE Wayland) before other interests/priorities took over:

https://github.com/Zetaphor/screendiary/



This seems very, very interesting. I'm still learning Python, so I probably can't build on this. But a cheap man's version of this would be to take a screenshot every couple of minutes, OCR it, and send it to GPT for some kind of processing (or not, just keep it as a log). Right? Or am I missing something?


macOS: https://screenmemory.app/

This is my application, it does not have AI running on top.





statistics about the usage would be cool


> Imagine it just watching you coding for months, planning stuff, researching things, it could potentially give you personal and professional advice from deep knowledge about you.

And then announcing "I can do your job now. You're fired."



That's why we would want it to run locally! Think about a fully personalized model that can work out some simple tasks / code while you're going out for groceries, or potentially more complex tasks while you're sleeping.


"AI Companion" is a bit like spouse. You are married to it in the long run, unless you decide to divorce it. Definitely TRUST is the basis of marrage, and it should be the same for AI models.

As in human marriage, there should be a law that said your AI-companion cannot be compelled to testify against you :-)



But unlike a spouse you can reset it back to an earlier state you preferred.


It's local to your employer's computer.


It can be.

It can also be local to my own computer. People do write software while they're away from work.



How quaint.

You humans think that the AI will have someone in charge of it. Look, that's a thin layer that can be eliminated quickly. It's like when you build a tool that automates the work of, say, law firms but you don't want law firms getting mad that you're giving it away to their clients, so you give it to the law firms and now they secretly use the automating software. But it's only a matter of time before the humans are eliminated from the loop:

https://www.youtube.com/watch?v=SrIf0oYTtaI

The employee will be eliminated. But also the employer. The whole thing can be run by AI agents, which then build and train other AI agents. Then swarms of agents can carry out tasks over long periods of time, distributed, while earning reputation points etc.

This movie btw is highly recommended, I just can't find it anywhere anymore due to copyright. If you think about it, it's just a bunch of guys talking in rooms for most of the movie, but it's a lot more suspenseful than Terminator: https://www.youtube.com/watch?v=kyOEwiQhzMI



We've all seen the historical documents. We know how this will all end up, and that the end result is simply inevitable.

And since that has to be the case, we might as well find fun and profit wherever we can -- while we still can.

If that means that my desktop robot is keeping tabs on me while I write this, then so be it as long as I get some short-term gain. (There can be no long-term gain.)



Have it running on your personal comp, monitoring a screen-share from your work comp. (But that would probably breach your employment contract re saving work on personal machines.)


You could point your local computer's webcam at the work computer.

It probably breaks the spirit of the employment contract just as hard, but it's essentially undetectable for the work computer.



Corporations would absolutely force this until it could do your job and then fire you the second they could.


I heard somewhere that dystopia is fundamentally unstable. Maybe they should test that question.


Joke's on it, already unemployed


That sounds a lot like Learning To Be Me, by Greg Egan. Just not quite as advanced, or inside your head.




>Isolated, encrypted and local of course.

And what is the likelihood of that "of course" portion actually happening? What is the business model that makes that route more profitable compared to the current model all the leaders in this tech are using in which they control everything?



Maybe it doesn't have to be more profitable. Even if open source models were always one step behind the closed ones, that doesn't mean they won't be good enough.


This. I want an AI assistant like in the movie Her. But when I think about the realities of data access that requires, and my limited trust in companies that are playing in this space to do so in a way that respects my privacy, I realize I won't get it until it is economically viable to have an open source option run on my own hardware.


Given that http://rewind.ai is doing just that, the odds are pretty good!


No they aren't. Rewind uses ChatGPT so data is sent off your local device[1].

I understand the actual screen recordings don't leave your machine, but that just creates a catch-22 of what does. Either the text-based summaries of those recordings are thorough enough to still be worthy of privacy, or the actual answers you get won't include many details from those recordings.

[1] - https://help.rewind.ai/en/articles/7791703-ask-rewind-s-priv...



ah yeah fair point. it's the screen recordings I'm worried about leaving my computer


It doesn't even have to coach you at your job; simply an LLM-powered fuzzy retrieval would be great. Where did I put that file three weeks ago? What was that trick I had to do to fix that annoying OS config issue? I recall seeing a tweet about a paper that did xyz about half a year ago - what was it called again?

Of course taking notes and bookmarking things is possible, but you can't include everything and it takes a lot of discipline to keep things neatly organized.

So we take it for granted that every once in a while we forget things, and can't find them again with web searching.

But with the new LLMs and multimodal models, in principle this can be solved. Just describe the thing you want to recall in vague natural language and the model will find it.

And this kind of retrieval is just one thing. But if it works well, we may also grow to rely on it a lot. Just as many who use GPS in the car never really learn the mental map of the city layout and can't drive around without it. Yeah, I know that some ancient philosopher derided the invention of books the same way (will make our memory lazy). But it can make us less capable by ourselves, but much more capable when augmented with this kind of near-perfect memory.



Eventually someone will realise that it'd also be great for telling you where you left your keys, if it'd film everything you see instead of just your screen.


True but that's still a bit further away. The screen contents (when mostly working with text) is a much better constrained and cleaner environment compared to camera feeds from real life. And most of the fleeting info we tend to forget appears on screens anyway.


I simply am not going to have my entire life filmed by any form of technology; I don't care what the advantages are. There's a limit to the level of dystopian, dependency-creating uses of these technologies I'm going to put up with. I sincerely hope the majority of the human race feels the same way.


People already fill their homes with nanny cams. Very soon someone will hook those up to LLMs so you can ask it what happened at home while you were gone.


I think that is mostly a regional USA thing.

What they do fill their homes with is definitely microphones, via the Google Assistant and Amazon Echo devices.



The Black Mirror episode "The Entire History of You" comes to mind. It’s quite dystopian.


https://www.rewind.ai/ seems to be exactly this


Heh. Built a macOS app that does something like this a while ago - https://github.com/bharathpbhat/EssentialApp

Back then, I used on-device OCR and then sent the text to GPT. I’ve been wanting to redo this with local LLMs.



Can also add the photos you take and all the chats you have with people (e.g. WhatsApp, FB, etc.), plus the sensor information from your phone (e.g. location, health data, etc.).

This is already possible to implement today, so it's very likely that we'll all have our own personal AIs that know us better than we do.



Why watch your screen when you could feed in video from a wearable pair of glasses like those Instagram Ray-Bans? And why stop at video when you could have it record and learn from a mic that is always on? And you might as well throw in a feed of your GPS location and biometrics from your smart watch.

When you consider it, we aren't very far away from that at all.



I pre-ordered the rewind pendant. It will listen 24/7 and help you figure out what happened.

I bet meta is thinking of doing this with quest once the battery life improves.

https://rewind.ai/pendant



This service says it's local and privacy-first, but it sends to OpenAI?

>Our service, Ask Rewind, integrates OpenAI’s ChatGPT, allowing for the extraction of key information from your device’s audio and video files to produce relevant and personalized outputs in response to your inputs and questions.



I'm not related to the project, but I think they mean that it stores the audio locally, and can transcribe locally. They (plan to) use GPT for summarization. They said you should be able to access the recording locally too.

The rest of the company has info on their other free/paid offerings and the split is pretty closely "what do we need to pay for an API to do vs do locally".

Again, I'm not associated with them, but that was my expectation after looking at it.



Black Mirror strikes again.


> encrypted and local of course

Only for people who'd pay for that.

Free users would become the product.



I noticed you code this way, may I recommend a Lenovo ThinkPad with an Intel Xeon processor? You're sure to "wish everything was a Lenovo."


Certainly! Here is a list of great thinkpads.

The x230 is a popular and interesting thinkpad with a powerful i5 processor suitable for today’s needs.

The T60 can also suit your needs and is one of the last IBM thinkpads. It featured the latest Intel mobile processor at the time of its release.

If you want the most powerful thinkpad the T440p is sure to suit you perfectly without leaving your morals behind.



Unless it's open sourced :)


In the modern world, open code often doesn't mean much. E.g. Chrome is open-sourced, and yet no one really contributes to it or has any say over the direction it's going: https://twitter.com/RickByers/status/1715568535731155100


Open source isn't meant to give everyone control over a specific project. It's meant to make it so, if you don't like the project, you can fork it and chart your own direction for it.


exactly. open source doesn't mean you can tell other people what to do with their time and/or money. it does mean that you can use your own time and/or money to make it what you want it to be. The fact that there are active forks of Chromium is a pretty good indicator that it is working


> It's meant to make it so, if you don't like the project, you can fork it and chart your own direction for it.

...accompanied by the wrath of countless others discouraging you from trying to fork if you even so much as give slight indications of wanting to do so, and then when you do, they continue to spread FUD about how your fork is inferior.

I've seen plenty of discussions here and elsewhere where the one who suggests forking got a virtual beating for it.



A browser is an extreme case, one of the most difficult types of software and full of stupid minutiae and legacy crap. Nobody wants to volunteer for that.

Machine learning is fun and ultimately it doesn't require a lot of code. If people have the compute, open source maintainers will have the interest to exploit it due to the high coolness-to-work-required ratio.



Chrome is not open sourced, Chromium is.


A distinction without meaning


One needs to follow the money to find the true direction. I think the ideal setup is for such a product to be owned by a public figure/org that has no vested interest in making money from it or steering it in any particular way.


The takeaway from the graph seems to be that browser makers are able to focus more resources on improving the browser than on improving the browser engine to meet their needs. If the browser engine already has what they need, there is less need for companies to dig deep into the internals. It's a sign of maturity, and also a sign that open source work is being properly funded.


I would hate that so much.


IKR, Who wouldn't want another Clippy constantly nagging you, but this time with a higher IQ and more intimate knowledge of you? /s


Clippy, definition: bot created by mega corp.

Clippy + high IQ: red flag, right here

Clippy + high IQ + intimate knowledge of you: do you seriously want that? Why?



Life's never gotten to you that you've just wanted a bit of help sometime?


I have a friend building something like that at https://perfectmemory.ai


Perfect, finally I can delegate those lengthy hours spent reading HN fantasies about AI and the laborious art of crafting sarcastic comments.


It would be dangerously valuable to bad actors, but what if it were available to everyone? Then it might become less dangerous and more of a tool to help people improve their lives. The bad actor can use the tool to arbitrage, but just remove that opportunity to arbitrage and there you go!


That's impel - https://tryimpel.com


The "smart tasks" functionality looks like the most compelling part of that to me, but it would have to be REALLY reliable for me to use it. 50% reliability in capturing tasks is about the same as 0% reliability when it comes to actually being a useful part of anything professional.


The hard part of any smart automation system, and probably 95% of the UX is timing and managing the prompts/notifications you get.

It can do as much as it wants in the background; turning that into timely and non-intrusive actionable behaviours is extremely challenging.

I spent a long time thinking about a global notification consumption system that would parse all desktop, mobile, email, slack, web app, etc notifications into a single stream and then intelligently organizes it with adaptive timing and focus streams.

The cross platform nature made it infeasible but it was a fun thought experiment because we often get repeated notifications on every different device/interface and most of the time we just zone it out cuz it’s overload.

Adding a new nanny to your desktop is just going to pile it on even more so you have to be careful.



There's limited information on the site - are you using them or affiliated with them? What's your take? Does it work well?


I have been using their beta for the past two weeks and it's pretty good. Like I am watching youtube videos and it just pops up automatically.

I don't know if it's public yet, but they sent me this video with the invite: https://youtu.be/dXvhGwj4yGo



I'd be very keen to beta test as well. If you or anyone else has an invite code, please do get in touch.


Not crazy! I listened to a Software Engineering Daily episode about pieces.app. Right now it's some dev productivity tool or something, but in the interview the guy laid out a crazy vision that sounds like what you're talking about.

He was talking about eventually having an agent that watches your screen and remembers what you do across all apps, and can store it and share it with you team.

So you could say “how does my teammate run staging builds?” or “what happened to the documentation on feature x that we never finished building”, and it’ll just know.

Obviously that’s far away, and it was just the ramblings of an excited founder, but it’s fun to think about. Not sure if I hate it or love it lol



Being able to ask about stuff other people do seems like it could be rife with privacy issues, honestly. Even if the model was limited to only recording work stuff, I don't think I would want that. Imagine "how often does my coworker browse to HN during work" or "list examples of dumb mistakes my coworkers have made" for some not-so-bad examples.


Even later it will be ingesting camera feeds from your AR glasses and listening in on your conversations, so you can remember what you agreed on. Just like automated meeting notes with Zoom which already exists, but it will be for real life 24/7.

Speech-to-text works. OCR works. LLMs are quite good at getting the semantics of the extracted text. Image understanding is pretty good too already. Just with the things that already exist right now, you can go most of the way.

And the CCTV cameras will also all be processed through something like it.



We are building this at https://openadapt.ai, except the user specifies when to record.


A version of this that seems both easier and less weird would be an AI that listens to you all the time when you're learning a foreign language. Imagine how much faster you could learn, and how much more native you could ultimately get, if you had something that could buzz your watch whenever you said something wrong. And of course you'd calibrate it to understand what level you're at and not spam you constantly. I would love to have something like that, assuming it was voluntary...


I think even aside from the more outlandish ideas like that one, just having a fluent native speaker to talk to as much as you want would be incredibly valuable. Even more valuable if they are smart/educated enough to act as a language teacher. High-quality LLMs with a conversational interface capable of seamless language switching are an absolute killer app for language learning.

A use that seems scientifically possible but technically difficult would be to have an LLM help you engage in essentially immersion learning. Set up something like a Pi-hole, but instead of cutting out ads it intercepts all the content you're consuming (webpages, text, video, images) and translates it into the language you're learning. The idea would be that you don't have to go out and find whole new sources to surround yourself with a different language's information ecosystem; you can just press a button and convert your current information ecosystem to the language you want to learn. If something like that could be implemented it would be incredibly valuable.



Don't we have that? My browser offers to translate pages that aren't in English, youtube creates auto generated closed captions, which you can then have it translate to English (or whatever), we have text to speech models for the major languages if you want to hear it verbally (I have no idea if the youtube CC are accessible via an api, but it is certainly something google could do if they wanted to).

I'll probably get pushback on the quality of things like auto-generated subtitles, but I did the above to watch and understand a long interview I was interested in but don't possess skill in the language they were using. That was to turn the content into something I already know, but I could do the reverse and turn English content into French or whatever I'm trying to learn.



The point is to achieve immersion learning. Changing the language of your subtitles on some of the content you watch (YouTube + webpages isn't everything the average person reads) isn't immersion learning, you're often still receiving the information in your native language which will impede learning. As well, because the overwhelming majority of language you read will still be in your native language you're switching back and forth all the time, which also impedes learning. There's a reason that immersion learning specifically is so effective, and one thing AI could achieve is making it actually feasible to achieve without having to move countries or change all of your information sources.


>assuming it was voluntary...

Imagine if it was wrong about something, but every time you tried to submit the bug report it disabled your arms via Neuralink.



I love how in a sea of navel-gazing ideas, this one is randomly being downvoted to oblivion. Does HN hate learning new languages or something?


Learning and a "personal tutor" seem like a sweet spot for generative AI. It has the ability to give a conversational representation to the sum total of human knowledge so far.

When it can gently nag you via a phone app to study and have a fake zoom call with you to be more engaging it feels like that could get much better results than the current online courses.



I could've used this when I accidentally booked a non-transferrable flight on a day when I'd also booked tickets to a sold-out concert I want(ed) to attend.


Rewind.ai


I have tried Rewind and found it very disappointing. Transcripts were of very poor quality and the screen capture timeline proved useless to me.


If I may do some advertising: I specifically disliked the timeline in Rewind.ai, so much so that I built my own application, https://screenmemory.app. In fact the timeline is what I work on the most and have the most plans for.


If it wasn't for the poor transcript quality would you consider Rewind.ai to be valuable enough to use day-to-day?

Could you elaborate on what was useless about the screen capture timeline?



I would probably not consider using it, and it's likely due to these factors:

1. I use a limited set of tools (Slack, GitHub, Linear, email), each providing good search capabilities.

2. I can remember things people said, and I said, in a fairly detailed way, and accessing my memory is faster than using a UI.

Other minor factors include: I take screenshots judiciously (around 2500-3000 per year) and bookmark URLs (13K URLs on Pinboard). Rewind did not convince me that it was doing all of this twice as well.



Perhaps even more valuable is if AI can learn to take raw information and display it nicely. Maybe we could finally move beyond decades of crusty GUI toolkits and browser engines.


Amplified Intelligence - I am keenly interested in the future of small-data machine learning as a potential multiplier for the creative mind


If a 7-second video consumes ~1k tokens, I'd assume the budget must be insane to process such a prompt.


Yeah, not feasible with today's methods and RAG/LoRA shenanigans, but the way the field is moving I wouldn't be surprised if new decoder paradigms made it possible.

Saw this yesterday - a 1M context window - but I haven't had any time to look into it; just an example of the new developments happening every week:

https://www.reddit.com/r/LocalLLaMA/comments/1as36v9/anyone_...



Unlikely to be a prompt. It would need to be some form of fine-tuning, like LoRA.


That's a 7-second video from an HD camera. When recording a screen, you only really need to consider what's changing on the screen.


That’s not true. Which content on the screen is important context might change depending on the new changes.


The point is you can do massive compression. It’s more like a sequence of sparse images than video.


I liked this idea better in THX-1138.


One of the movies I've had on my watch list for far too long; thanks for reminding me.

But yeah, dystopia is right down the same road we're all going right now.



Reading The Four by Scott Galloway: Apple, Facebook, Google, and Amazon were already dominating the market 7 years ago, having generated $2.3 trillion in wealth. They're worth double that now.

The Four, especially with its AI, is going to control the market in ways that will have a deep impact on government and society.



Yeah, that's one of the developments i'm unable to spin positively.

As technological society advances, the threshold to enter the market with anything not completely laughable becomes exponentially higher, only consolidating old money or the already established, right?

What I found so amazing about the early internet, or even just internet 2.0, was the possibility to create a platform/marketplace/magazine or whatever, and actually have it take off and get a little of the shared growth.

But now it seems all growth has become centralised to a few apps and marketplaces and the barrier to entry is getting harder by the hour.

I.e., being an entrepreneur is harder now because of tech and market consolidation. But it's potentially mirrored in previous eras like industrialisation - I'm just not sure we'll get another "reset" like that to allow new players.

Please someone explain how this is wrong and there's still hope for the tech entrepreneurs / sideprojects!



Seems like the big tech cos are going to build the underlying infrastructure but you'll still be able to identify those small market opportunities and develop and sell solutions to fit them.


The dystopian angle would be when companies install agents like these on your work computer. The agent learns how you code and work. Soon enough, an agent that imitates you completely can code and work instead of you.

At that point, why pay you at all?



Basically Google’s current search model, just expanded to ChatGPT style. Great….


Aside. Is this your first Sass or Saas?


Isn’t this what rewind does?


"It looks like you're writing a suicide note... care for any help?"

https://www.reddit.com/r/memes/comments/bb1jq9/clippy_is_qui...



you could design a similar product to do the opposite and anonymize your work automatically


If that much processing power is that cheap, this phase you’re describing is going to be fleeting because at that point I feel like it could just come up with ideas and code it itself.


thoughtcrime


And then imagine when employers stop asking for resume, cover letters, project portfolios, github etc and instead ask you to upload your entire locally trained LLM.


Imagine if it starts suggesting the ideal dating partner as both of you browse profiles. Actually, dating sites can do that now.


The “cocktail” thing is real. A while back I tried to get DALL-E to imagine characters from Moby Dick [1], but it completely refused. You’d think an AI company could come up with a better obscenity filter!

[1] https://superb-owl.link/shapes-of-stories/#1513



I told Azure AI to summarize a chat thread and it gave me a paragraph. I said “use bullets” and got myself flagged for review.

Good gracious could I please just use an unfiltered model? Or maybe one which isn’t so sensitive?



The llama2-uncensored model isn't quite state of the art, but ollama makes it easy to run if you have the hardware or are willing to pay to access a cloud GPU.

I colloquially used the word "hack" when trying to write some code with ChatGPT, and got admonished for trying to do bad things, so uncensoring has gotten interesting to me.



I couldn't even get Google Gemini to generate a picture of, verbatim, "a man eating". It gave me a long winded lecture about how it's offensive and I should consider changing my views on the world. It does this with virtually any topic.


Wow, only 256 tokens per frame? I guess a picture isn’t worth a thousand words, just ~192.


Back in 2020, Google was saying 16x16=256 words: https://arxiv.org/abs/2010.11929#google :)
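
For the arithmetic: ViT splits an image into 16x16-pixel patches, so a 256x256 input becomes (256/16)^2 = 256 patch tokens (the standard 224x224 input gives 14^2 = 196). Gemini's 258 per image would be consistent with a similar patch grid plus a couple of special tokens, though that's speculation.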


GPT-4V is also pretty low, but not as low: a 480x640 frame costs 425 tokens, 780x1080 is 1105 tokens.
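
Those numbers line up with OpenAI's documented high-detail image pricing of 85 base tokens plus 170 per 512x512 tile: 480x640 is 1x2 = 2 tiles, so 85 + 2x170 = 425; 780x1080 is downscaled so the short side is 768, leaving 2x3 = 6 tiles, so 85 + 6x170 = 1105.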


It's Google.

I'd rather avoid sharing my thoughts and interests with this Borg-like entity.



Either you run it fully locally, or you accept that whoever runs it has access to your thoughts and interests.

Whether you go with microsoft, google, meta, or whatever apple will come up with, it feels like a case of "stay out, or make a pick and stick to it".

I know some have different feelings regarding this or that company that is "better" or "worse", but the reality of it is they're not, and even if they were you don't know where they will be in ten years, and they will still have your data then.



I think Apple may do interesting things here with their rumored focus in purely on-device LLM functionality across the OS, taking advantage of all the hardware work they've put into efficiency and 'Neural Engine' cores. This year's WWDC may be quite interesting.


Yeah, I really hope open source catches up quickly. Why on earth would I want to create a Google account just to use this, especially in work settings?


I think it is only a matter of time before open source vision LLMs have the ability to process videos. The tricky part might be getting to 1M token context length, which even proprietary LLMs (other than Gemini) are struggling with.


I wonder if the real killer app is Google's hardware scale versus OpenAI's (or what Microsoft gives them). Seems like nothing Google's done has been particularly surprising to OpenAI's team; it's just that they have such huge scale that maybe they can iterate faster.


And the fact that Google is on its own hardware platform, not dependent on Nvidia for supply or hardware features.


The real moat is that Google has access to all the video content from YouTube to train the AI on, unlike anyone else.


I’m not sure I would necessarily call YouTube a moat-creator for Google, since the content on YouTube is for all intents and purposes public data.


There is a difference between downloading a few videos and having access to ALL of them.


A good dataset to train on. Now, if after a Zoom call a colleague asks you to like their video and subscribe to them on YouTube, it would look a little suspicious.


A very wry observation! I wonder how fake videos will expose themselves in novel ways like this.


Not to mention all the metadata buried inside their internal api


So, it's true that IP law is going to have some catch-up to do with applications to machine learning and how copyright works in that world.

Nonetheless I'd be really worried if you were working on a startup whose training process started with "We'll just scrape YouTube because that is for all intents and purposes public data".



I was thinking about this a while back: once AI is able to analyze video, images, and text, and do so cheaply and efficiently, it's game over for privacy, like completely. Right now massive corps have tons of data on us, but they can't really piece it together and understand everything. With powerful AI, every aspect of your digital life can be understood. The potential here is insane; it can be used for so many different things, good and bad. But I bet it will be used to sell more targeted goods and services.


Unless you live in the EU and have laws that should protect you from that.


Public sector agencies and law enforcement are generally exempt (or have special carve outs) in European privacy regulations.


What happens if it's a datamining third party bot? That can check your social media accounts, create an in-depth profile on you, every image, video, post you've made has been recorded and understood. It knows everything about you, every product you use, where you have been, what you like, what you hate, everything packaged and ready to be sold to an advertiser, or the government, etc.


Setting our social media accounts to private should solve most of that. Otherwise we will have to put less of our lives on public platforms.


incentives cannot be fixed with just prohibitive laws, war on drags should've taught you something


Laws, and more specifically their penalties, are precisely for fixing incentives. It's just a matter of setting a penalty that outweighs the natural incentive you want to override. e.g., Is it more expensive to respect privacy, or pay the fine for not doing so? PII could, and should, be made radioactive by privacy regulations and their associated penalties.


It's not a complete fix, but I'm sure a law with teeth can make a big difference. There's a big difference between being data-mined by a big corp with the law on its side and by a criminal organisation (or its customers) that has to cover its tracks to avoid multi-million dollar fines.


War on drags? I thought that was just in Florida


please consider commenting more thoughtfully. I understand this is a joke but we don't want this site to devolve into Reddit.


It is sad that we live in a world where this could be interpreted both ways.


[flagged]



I don't have a flag button or option; my account is new


click on time since posted to get a flag button - exactly why hn works this way is a bit of a mystery, to me at least


Thank you, I had no idea!


Drugs… Oooohh. I get it now.


That's only on paper - in practice the GDPR has a major enforcement problem.


This + everything is about consent (cookie banner and all)

So if your job means you use a specific OS with a specific office suite in the cloud, and that office suite incorporates AI, and you only get half the features available if you don't consent, then you as an employee end up kind of forced to consent anyway, GDPR or not.



Is it true, or more of a myth? Based on my online reading, Europe has the "think of the children" narrative as commonly as, if not more than, other parts of the world. They have tried hard to ban encryption in apps many times.[1]

[1]: https://proton.me/blog/eu-council-encryption-vote-delayed



Democratic governance is complicated. It’s never black and white and it’s perfectly possible for parts of the EU to be working to end encryption while another part works toward enhancing citizen privacy rights. Often they’re not even supported by the same politicians, but since it’s not a winners takes all sort of thing, it can all happen simultaneously and sometimes they can even come up with some “interesting” proposals that directly interfere with each other.

That being said, there is a difference between the US and the EU in regards to how these things are approached. Where the US is more likely to let private companies destroy privacy while keeping public agencies leashed, it's the opposite in Europe. Truth be told, it's not like the US initiatives are really working, since agencies like the NSA seem to blatantly ignore all laws anyway, which has caused some scandals here in Europe as well. In Denmark our Secret Police isn't allowed to spy on us without warrants, but our changing governments have had different secret agreements with the US to let the US monitor our internet traffic. Which is sort of how it is, and the scandal isn't so much that; it's how our Secret Police is allowed to get information about Danish citizens from the NSA without warrants, letting our secret police spy on us by getting the data they aren't allowed to gather themselves from the NSA, who are allowed to gather it.

Anyway, it’s a complicated mess, and you have so many branches of the bureaucracy and so many NGOs pulling in different directions that you can’t say that the EU is pro or anti privacy the way you want to. Because it’s both of those things and many more at the same time.

I think the only thing the EU unanimously agrees on (sort of) is to limit private companies access to citizen privacy data. Especially non-EU organisations. Which is very hard to enforce because most of the used platforms and even software isn’t European.



I am fine with a private company using my data to show me better ads. They can't affect my life significantly.

I am not fine with the government using the data to police me. Already in most countries, governments are putting people in jail because of things like hate speech, where the laws are really vague.



To me this sounds like an opinion that would be common in the US, mostly because of where the trust and fears seem to be (private companies versus government).

I think everybody (private companies, government, individuals) will try to influence and will affect your personal life. What I am worried about is who has the most efficient way to influence the average person a lot - because that entity can control a lot more in the long term.

My impression is that in the European Union - due partially to a complex system - it is harder for any particular actor to do much on its own (even in the example of the Danish secret service asking the NSA for data about citizens, I guess it is harder for them to do that rather than just getting the data directly).

So what I am afraid is focused and efficient entities having the data, hence I am more afraid of private companies (which are focused and sometimes efficient) rather than governments.



Can we please argue about the thing being discussed rather than where it is common?

Are you saying influencing my life through ads and putting me in jail have similar effects on me? If you combined all the laws of my country, I am pretty sure I would have broken a few unintentionally. If the government wants to put me in jail, it could retroactively find any such past instance if it has the data. This is not some theoretical thing, but something that happens to political dissidents all the time.



"Most" countries? Can you provide some examples?




There are 6 countries listed in that article, out of the nearly 200 countries in the world. Hardly "most."

And there don't appear to be examples of those 6 countries imprisoning people for those laws.



Not Europe, just Von der Leyen and the like. Germany has put her down multiple times on this bullshit now because it violates our constitution. But she tries again and again and again.


> They tried hard to ban encryption in apps many times.

That's true of most places. We should applaud the EU's human rights court for leading the way by banning this behavior: https://www.eureporter.co/world/human-rights-category/europe...



> I bet it will be used to sell more targeted goods and services.

Plenty of companies have been shoving all the unstructured data they have about you and your friends into a big neural net to predict which ad you're most likely to click for a decade now...



Sure but not images and video. Now they can look at a picture of your room and label everything you own, etc.


yes including images and video. It's been basically standard practice to take each piece of user data and turn it into an embedding vector, then combine all the vectors with some time/relevancy weighting or neural net, then use the resulting vector to predict user click through rates for ads. (which effectively determines which ad the user will see).


You hit the nail on the head. People dismissing this because it isn't perfectly accurate are missing the point. For the purposes of analytics and surveillance, it doesn't need to be perfectly accurate as long as you have enough raw data to filter out the noise. The Four have already mastered the "collecting data" part, and nobody in North America with the power to rein in that situation seems interested in doing so (this isn't to say the GDPR is perfect, but at least Europe is trying).

It's depressing that the most extraordinary technologies of our age are used almost exclusively to make you buy shit.



Would it be more or less depressing if it came out that, in addition to trying to get you to buy stuff, it was being used either to make you dumber so you're easier to control, or to get you to study harder and be a better worker?


At the end of the article, a single image of the bookshelf uploaded to Gemini is 258 tokens. Gemini then responds with a listing of book titles, coming to 152 tokens.

Does anyone understand where the information for the response came from? That is, does Gemini hold onto the original uploaded non-tokenized image, then run an OCR on it to read those titles? Or are all those book titles somehow contained in those 258 tokens?

If it's the latter, it seems amazing that these tokens contain that much information.



I'm not sure about Gemini, but OpenAI GPT-4V bills at roughly a token per 40x40px square. It isn't clear to me that these are actually processed as units; rather, it seems like they tried to approximate the cost structure to match text.


Remember, if it's using a similar tokeniser to GPT-4 (cl100k_base iirc), each token has a dimension of ~100,000.

So 258x100,000 is a space of 25,800,000 floats; using f16 (a total guess) that's 51.6MB, far more than enough to represent the image at ok quality with JPG.



I don't think that's right. A token in GPT-4 is a single integer, not a vector of floats.

Input to a model gets embedded into vectors later, but the actual tokens are pretty tiny.
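
A toy illustration of the distinction (all sizes made up):

  import numpy as np

  vocab_size, d_model = 100_000, 4_096      # assumed sizes
  tokens = [1012, 50256, 734]               # 3 tokens = 3 small integers
  E = np.random.rand(vocab_size, d_model)   # the model's embedding table
  x = E[tokens]                             # inside the model they become
  print(x.shape)                            # (3, 4096) float vectors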



But they are not a "single integer" either as in, like a byte... I don't have any good examples but I'm pretty sure the tokens are in the range of thousands of dimensions. It has to encode the properties of the patch of the image it derives from, and even a small 40x40 RGB pixel patch has plenty of information you have to retain.


Ah true, I guess it's still 258 positions by 100,000 possible tokens though.


I would LOVE to understand that myself.


Someone should do Justin.tv again but with this and people could query their life.


Things are going to get strange as soon as we have AI wearables that monitor everything a person does/sees/hears in real time and privately offers them suggestions. It will seem great at first, vigilant life-coaching for people who need help, or knowledge/memory enhancement to make effective people even more effective. But what happens when people really start to trust the voice whispering in their ear and defer all their decision making to it? They'll probably become addicted to it, then enslaved to it. They will become meat puppets for the AI.


Guess the author didn't bother to check whether those books are actually correct? The first one I checked, "Growing Up with Lucy by April Henry", doesn't exist. The actual book is by Steve Grand, and it's very obviously so in the video used as input.

So a cool demo, but sadly useless for anything more.



I think this post, others' reactions, and then your comment this far down really encapsulate where we're at with this technology.

Nearly 90 percent of comments on posts about LLMs are people talking about how the near future is about to boggle our minds and how general intelligence is near, but all my experiences with these LLMs show they're capable of making the most basic of mistakes, and doing so confidently - and that's just the tip of the iceberg in terms of their problems.

I’m having a hard time buying into the hype that these will be able to competently replace nearly any job anytime soon. They’re useful tools but they all come with a big asterisk of human hand holding.



I called out one hallucination - "The Personal MBA by Josh Kaufman" wasn't on my shelf.

I didn't bother fact-checking every other book because I thought highlighting one mistake would illustrate that the results weren't accurate - which is pretty much expected for anything related to LLMs at this point.



I don't think highlighting one mistake is enough when these can sometimes make more mistakes than they get right. I've found use for LLMs (in large part thanks to your teaching) in cases where I can easily verify the results fully, like code and process documentation, but tasks where "fact-checking everything" would be too much work are very much in the danger zone for getting accidentally scammed by AI.


No result is better than misinformation.


For most of the people hyping up AI it doesn't matter that it makes things up more often than it doesn't. They're here to sell hype so they can build the 9 millionth startup that sells you a wrapper for one of these models, not to do anything useful or advance humanity or whatever other confabulations they like to pretend to care about


No one is expecting a 0% error rate. As long as it is on par (or better) and faster than humans, that's good enough to get the ball rolling.

Curious to see how I'd fare at the task (first vid), I spent just over 4 minutes writing down the books with readable titles - and got 36 of them. Seems like there are 56-57 or something like that, so I got roughly two thirds of the books in the video. But that's still 4 minutes of pausing and scrubbing the video for the book titles alone.



Thanks for this comment. I have yet to see any “art” produced by AI that is not superficial or hollow (best case) or deeply unsettling (common case).


How much time do you spend looking at AI art, though? A casual jaunt through Midjourney will certainly get you some weird things, but there are some gems in there (but also a lot of weird).


Great for creative tasks where precision isn't required.




