TL;DR
I believe we should use a 5-digit annual budget to create/serve trust-less, cryptographic timestamps of all public content, in order to significantly counteract the growing threat that AI-generated fake content poses to truth-seeking and trust. We should also encourage and help all organizations to do likewise with private content.
THE PROBLEM & SOLUTION
As the rate and quality of AI-generated content keep increasing, it seems inevitable that creating fake content will become easier and verifying or refuting it will become harder. Remember the very recent past, when faking a photo was so hard that simply providing a photo was considered proof? If we do nothing, these AI advances could devastate people's ability to trust both each other and historical material, and might end up having negative value for humanity on net.
I believe that trust-less time-stamping is an effective, urgent, tractable and cheap method to partially, but significantly, counteract this lamentable development. Here's why:
EFFECTIVE
It is likely that fake creation technology will outpace fake detection technology. If so, we will end up in an indefinite state of having to doubt pretty much all content. With trust-less time-stamping, however, the contest instead becomes one between the fake creation technology available at the time of timestamping and the fake detection technology available at the time of truth-seeking.
Time-stamping everything today will protect all past and current content against suspicion of interference by all future fake creation technology. As both fake creation and fake detection technology progress, no matter at what relative pace, the value of timestamps will grow over time. Perhaps in a not so distant future, it will become an indispensable historical record.
URGENT
Need I say much about the pace of progress for AI technology, or the extent of existing content? The value of timestamping everything today rather than in one month is some function of the value of the truth of all historical records and other content, and of the technological development during that time. I suspect there's a multiplication somewhere in that function.
TRACTABLE
We already have the cryptographic technology and infrastructure to make trust-less timestamps. We also have large public archives of digital and/or digitized content, including but not limited to the web. Time-stamping all of it might not be trivial, but it's not particularly hard. It can even be done without convincing very many people that it needs to be done. For non-public content, adding timestamping as a feature in backup software should be similarly tractable; here the main struggle will probably be convincing users of the value of timestamping.
Implementation: Each piece of content is hashed, the hashes are assembled into a Merkle tree, and the root of that tree is published on several popular, secure, trust-less public ledgers. A proof of timestamp is a list of hashes along the Merkle branch from the content up to the root, together with the ledger transaction IDs. This technology already exists, including implementations, services and public ledgers. For private content, you might want to be able to prove a timestamp for one piece of content without divulging the existence of another. To do so, add one bottom level to the Merkle tree in which each content hash is hashed with a pseudo-random value rather than with another content hash. This pseudo-random value can be derived from the content hash itself and a salt that is constant within an organization.
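The construction above can be sketched in a few lines of Python. This is a minimal illustration, not production code: SHA-256, the in-memory tree, and the blinding scheme shown are my own illustrative choices; a real deployment would publish the root through an existing time-stamping service and ledger.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def blinded_leaf(content_hash: bytes, org_salt: bytes) -> bytes:
    # Bottom level for private content: hash each content hash with a
    # pseudo-random value derived from the hash itself and an org-wide
    # salt, so proving one timestamp reveals nothing about siblings.
    blind = h(org_salt + content_hash)
    return h(content_hash + blind)

def merkle_root(leaves):
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    # Sibling hashes along the branch from leaf `index` up to the root.
    proof, level, i = [], list(leaves), index
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        sibling = i + 1 if i % 2 == 0 else i - 1
        proof.append((level[sibling], i % 2 == 0))
        level = [h(level[j] + level[j + 1]) for j in range(0, len(level), 2)]
        i //= 2
    return proof

def verify(leaf, proof, root):
    node = leaf
    for sibling, node_is_left in proof:
        node = h(node + sibling) if node_is_left else h(sibling + node)
    return node == root
```

Publishing `merkle_root(leaves)` in a single ledger transaction timestamps every leaf at once; each proof is just O(log n) hashes plus the transaction ID.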
CHEAP
Timestamping n pieces of content comprising a total of b bytes will incur a one-time cost for processing on the order of O(b+n), a continuous cost for storage on the order of O(n), and a one-time cost for transactions on public, immutable ledgers on the order of O(1). Perhaps most significant is the storage. Using a hash function with h bytes of output, it's actually possible to store it all in nh bytes. When in active use, you want to be able to produce proofs without excessive processing; in that usage mode, it is beneficial to have 2nh bytes available (roughly the leaf hashes plus the interior tree nodes).
Taking archive.org as an example, reasonable values are h = 32 and n = 7.8*10^11 [1], requiring 2nh ≈ 5.0*10^13 bytes ≈ 50 TB of storage (not TiB, since we're talking pricing). At current HDD prices in the 7.50-14.50 USD/TB range [2,3], that is a few hundred bucks. Add storage redundancy, labor, continuously adding timestamps, serving proofs and so on, and we're still talking about a 5-digit amount yearly.
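For concreteness, the arithmetic behind the ~50 TB figure, using only the constants stated above:

```python
# Back-of-envelope storage estimate for timestamping archive.org.
# h = 32-byte hash output, n ≈ 778 billion web pages (footnote [1]).
h_bytes = 32
n = 778e9

hashes_only = n * h_bytes    # nh: just the leaf hashes
with_tree = 2 * n * h_bytes  # 2nh: leaves plus interior nodes, for fast proofs

print(f"hashes only: {hashes_only / 1e12:.1f} TB")  # ≈ 24.9 TB
print(f"with tree:   {with_tree / 1e12:.1f} TB")    # ≈ 49.8 TB
```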
SUMMARY
I have here presented an idea that I believe has a pretty good chance of significantly counteracting some negative consequences of emerging AI technology, using only a 5-digit yearly budget. Beyond effective, tractable and cheap, I have also argued for why I believe it is urgent, in the sense that vast amounts of future public good are lost for every month this is not done. I am not in a position to do this myself, and I failed to raise awareness at HN [4], which arguably wasn't the right forum. It seems too good an opportunity to pass up. Here's hoping to find someone who cares enough to get it done, or to explain why I'm wrong.
Footnotes
- [1] https://archive.org/ tagline "Search the history of over 778 billion web pages on the Internet."
- [2] https://www.backblaze.com/blog/hard-drive-cost-per-gigabyte/
- [3] https://diskprices.com/
- [4] https://news.ycombinator.com/item?id=32817158
Let me clarify the cryptography involved:
There is cryptographic signing, that lets Alice sign a statement X so that Bob is able to cryptographically verify that Alice claims X. X could for example be "Content Y was created in 2023". This signature is evidence for X only to the extent that Bob trusts Alice. This is NOT what I suggest we use, at least not primarily.
There is cryptographic time-stamping, that lets Alice timestamp content X at time T so that Bob is able to cryptographically verify that content X existed before time T. Bob does not need to trust Alice, or anyone else at all, for this to work. This is what I suggest we use.
Back-dating content is therefore cryptographically impossible when using cryptographic time-stamping. That is precisely the point; otherwise I wouldn't be convinced that the value of the timestamps would grow over time. To the extent we rely on cryptographic time-stamping, the argument is not merely 'faking is hard today' but 'back-dating will remain entirely impossible in the future'.
However, cryptographic time-stamping and cryptographic signing can be combined in interesting ways:
We could sign first and then timestamp, achieving a cryptographic proof that in or before 2023, archive.org claimed that content X was created in 1987. This might be valuable if the organization or its cryptographic key at a later date were to be compromised, e.g. by corruption, hacking, or government overreach. Timestamps created after an organization is compromised can still be trusted: You can always know the content was created in or before 2023, even if you have reason to doubt a claim made at that time.
We could timestamp, then sign, then timestamp. This allows anyone to cryptographically verify that e.g. sometime between 2023-01-20 and 2023-01-30, Alice claimed that content X was created in 1987. This could be valuable if we later learn we have reason to distrust the organization before a certain date. Again, we will always know X was created before 2023-01-30, no matter anyone's trustworthiness.
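The timestamp-then-sign-then-timestamp pattern can be sketched as follows. This is a toy model: a plain list stands in for the public ledger, and an HMAC stands in for a real asymmetric signature (e.g. Ed25519); all keys, dates and content are illustrative.

```python
import hashlib
import hmac
import json

LEDGER = []  # stand-in for a public, immutable, trust-less ledger

def timestamp(data: bytes, t: str) -> int:
    """Publish a hash of `data` to the 'ledger'; return a transaction index."""
    LEDGER.append((t, hashlib.sha256(data).hexdigest()))
    return len(LEDGER) - 1

def sign(key: bytes, statement: bytes) -> bytes:
    # HMAC stands in for a real signature scheme here.
    return hmac.new(key, statement, hashlib.sha256).digest()

key = b"archive-signing-key"              # hypothetical org key
content = b"scanned 1987 newspaper page"  # hypothetical content

# 1. Timestamp the content itself.
tx1 = timestamp(content, "2023-01-20")
# 2. Sign a claim about it that references the first timestamp.
claim = json.dumps({"tx": tx1, "created": "1987"}).encode()
sig = sign(key, claim)
# 3. Timestamp the signed claim.
tx2 = timestamp(claim + sig, "2023-01-30")
# A verifier can now check: the content existed before 2023-01-20, and
# the organization's claim about it existed before 2023-01-30 -- even
# if the signing key is later compromised.
```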
As for the issue of 2023 timestamps being misleading for 1995 content: this issue is probably very real, but it's less urgent. Making the timestamps is urgent. On top of the underlying data and cryptographic proofs, different UIs can be built and improved over time.