philh

2 karma · Joined Nov 2022

Comments (3)

So would you say that although you have less faith in Ben than before, Alice and Chloe should have more faith in him? That seems wrong to me; I feel like "faith" in context should cash out as something less interpersonal than that? Like it should be a prediction about how Ben will act in future situations. Then "Alice should have more faith in Ben than me" sounds like a prediction that in future Ben will favor team Alice over team Chris; but that's not a prediction I'd make and I don't think it's a prediction you'd make.

(It does seem reasonable to predict something like "in future, Ben will favor team person-who-was-hurt over team person-on-sidelines-who...". But I don't think that's where you're going with this either?)

I think having to rely on an archive makes this a lot less valuable. If I find something in 2033 and want to prove it existed in 2023, I think that's going to be much harder if I have to rely on the thing itself being archived in 2023, in an archive that still exists in 2033; compared to just relying on the thing being timestamped in 2023.

I also think if you're relying on the Internet Archive, the argument that this is urgent becomes weaker. (And honestly I didn't find it compelling to begin with, though not for legible reasons I could point at.) Consider three possibilities, for something that will be created in May 2023:

  1. We can prove it was created no later than May 2023.
  2. We can prove it was created no later than June 2023.
  3. We can prove it was created no later than June 2023; and that in June 2023, the Internet Archive claimed it was created no later than May 2023.

A one-month delay brings you from (1) to (2) if the IA isn't involved. But if they are, it brings you from (1) to (3). As long as you set it up before IA goes rogue, the cost of delay is lower.

So I think there are a couple of levels this could be at.

There's "it's easy for someone to publish a thing and prove it was published before $time". Honestly that's pretty easy already, depending on whether you have a site you can publish to that you trust not to start backdating things in future (e.g. Twitter, Reddit, LW). Making it marginally lower-friction/marginally more trustless (blockchain) would be marginally good, and I think cheap and easy.

(edit: actually LW wouldn't be good for that because I don't think you can see last-edited timestamps there.)
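
To make that concrete, here's roughly what the low-friction version looks like, as a minimal sketch (the choice of SHA-256 and the idea of posting the digest somewhere are my illustration, not a specific proposal): compute a digest of the content and publish just the digest somewhere you trust not to backdate things.

```python
import hashlib
import sys

def digest_file(path: str) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    # Publish this digest (a tweet, a Reddit comment, a blockchain
    # timestamping service). Later, anyone holding the original bytes can
    # recompute the digest and check it against the timestamped copy.
    print(digest_file(sys.argv[1]))
```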

But by itself it seems not that helpful because most people don't do it. So if someone in ten years shows me a video and says it's from today, it's not weird that they can't prove it.

If we could get it to a point where lots of people start timestamping, that would be an improvement. Then it might be weird if someone in the future can't prove something was from today. And the thing Plex said in comments about being able to train on non-AI generated things becomes more feasible. But this is more a social than technical problem.

But I think what you're talking about here is doing this for all public content, whether the author knows or not. And that seems neat, but... the big problem I see here is that a lot of things get edited after publishing. So if I edit a comment on Reddit, either we somehow pick up on that and it gets re-timestamped, or we lose the ability to verify edited comments. And if imgur decides to recompress old files (AI comes up with a new compression mechanism that gives us 1/4 the size of jpg with no visible loss of quality), everything on imgur can no longer be verified, at least not to any date before the recompression.
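
To illustrate how little it takes to break verification, here's a sketch (using the Pillow imaging library purely as an example; the filename and quality setting are made up): re-encoding an image leaves it looking much the same but changes the bytes, so the digest no longer matches whatever was timestamped.

```python
import hashlib
import io

from PIL import Image  # Pillow, used here only to simulate a host recompressing

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

with open("photo.jpg", "rb") as f:  # example file
    original = f.read()

# Simulate the host recompressing its archive: decode, then re-encode.
recompressed = io.BytesIO()
Image.open(io.BytesIO(original)).save(recompressed, format="JPEG", quality=60)

print(sha256(original))                 # the digest that was timestamped
print(sha256(recompressed.getvalue()))  # the digest of what the host now serves
# The pixels may look near-identical, but the digests differ, so the
# timestamp proof no longer applies to the file you can actually fetch.
```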

So there's an empirical question of how often this happens, and maybe the answer is "not much". But it seems like the kind of thing where, even if it's rare, the few cases where it does happen could be enough to lose most of the value. Like, even if imgur doesn't recompress their files, maybe some other file host has, and you can just tell me it was hosted there.

There's a related question of how you distinguish content from metadata: if you timestamp my blog for me, you want to pick up the contents of the individual posts but not my blog theme, which I might change even if I don't edit the posts. Certainly not any ads that will change on every page load. I can think of two near-solutions for my blog specifically:

  • I have an RSS feed. But I write in markdown which gets rendered to HTML for the feed. If the markdown renderer changes, bad luck. I suppose stripping out all the HTML tags and just keeping the text might be fine? (A rough sketch of that follows this list.)
  • To some extent this is a problem already solved by e.g. firefox reader mode, which tries to automatically extract and normalize the content from a page. But I don't by default expect a good content extractor today to be a good content extractor in ten years. (E.g. people find ways to make their ads look like content to reader mode, so reader mode updates to avoid those.) So you're hoping that a different content-extraction tool, applied to the same content in a different wrapper, extracts the exact same result.

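As a rough sketch of the "strip out the HTML tags and just keep the text" idea mentioned above (assuming Python's standard-library HTML parser is good enough, and that collapsing whitespace is an acceptable normalization):

```python
import hashlib
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content, ignoring tags and attributes."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def canonical_digest(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    # Collapse all whitespace so theme/renderer changes that only affect
    # markup or spacing don't change the digest.
    text = re.sub(r"\s+", " ", "".join(parser.parts)).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# The same post rendered by two different renderers/themes:
a = "<article><p>Hello  <em>world</em></p></article>"
b = "<div class='post'><p>Hello <i>world</i></p></div>"
assert canonical_digest(a) == canonical_digest(b)
```

This only survives changes that leave the text itself alone; a renderer that, say, starts converting straight quotes to curly quotes would still break the digest.
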
This problem goes away if you're also hosting copies of everything, but that's no longer cheap. At that point I think you're back to "addition to the Internet Archive" discussed in other comments; you're only really defending against the Internet Archive going rogue (though this still seems valuable), and there's a lot that they don't capture.

Still. I'd be interested to see someone do this and then in a year go back and check how many hashes can be recreated. And I'd also be interested in the "make it marginally easier for people to do this themselves" thing; perhaps combine a best-effort scan of the public internet with a way for people to add their own content (which may be private), plus some kind of standard for people to point the robots at something they expect to be stable. (I could implement "RSS feed but without rendering the markdown" for my blog.) Could implement optional email alerts for "hey, all your content changed when we rescanned it, did you goof?"
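
For the rescan-and-check part, something like this minimal sketch is what I have in mind (the index filename and the raw-bytes hashing are placeholders; a real version would reapply whatever content-extraction/normalization step was used when the hashes were first recorded):

```python
import hashlib
import json
import urllib.request

def fetch_digest(url: str) -> str:
    # Hash the raw response bytes; a real version would first apply the
    # same content-extraction step used when the digest was recorded.
    with urllib.request.urlopen(url, timeout=30) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

def rescan(index_path: str) -> None:
    # index_path: JSON mapping url -> digest recorded at timestamping time.
    with open(index_path) as f:
        recorded = json.load(f)
    changed = [url for url, digest in recorded.items()
               if fetch_digest(url) != digest]
    print(f"{len(changed)}/{len(recorded)} items no longer match; "
          f"candidates for a 'did you goof?' alert: {changed}")

rescan("timestamp_index.json")  # hypothetical index file
```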