DNA cassette tapes could solve global data storage problems

Domino@quokk.au · 2 days ago

Salamander@mander.xyz · 1 day ago

Fair point - I completely forgot to take the 3D geometry into account. I guess this could be solved by either making both sp³ (sub the Si-O with Si-Cl) or both sp² (sub the H-O-Si with H-N=Si)? But then writing data becomes more complicated than just adding or removing hydrogens, which, as you said, isn’t as simple as it looks.

I think that the solution that life came up with - a flexible backbone that forms a double helix, with the bases hanging off it - is actually a pretty good way of going about it. Proteins work similarly: a standard flexible backbone with different groups hanging off the chain and influencing how it folds. In your proposal you have the silicon backbone and a single atom as the ‘side chain’, so there is no separation between the backbone and the pairing elements to add this flexibility.

There are also some other details to consider. For example, the amount of data you can store in a given chain length depends on how many different ‘bases’ your chemistry offers. In your example, each position is effectively binary, because the only options are ‘hydrogen bond donor’ or ‘hydrogen bond acceptor’. If you have a chain length of 3, you get only 3 bits, which can store one of 2^3 = 8 values from 0 to 7 (000 to 111). With DNA, you have 4 different bases, so a chain length of three can encode 4^3 = 64 values.
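To make the counting concrete, here is a minimal Python sketch of the arithmetic above. The function names and the A=00, C=01, G=10, T=11 mapping are my own assumptions for illustration, not any standard encoding; real DNA storage schemes add error correction and avoid problematic sequences.

```python
import math

# Hypothetical 2-bit mapping, chosen only for illustration.
BASES = "ACGT"  # A=00, C=01, G=10, T=11


def num_values(alphabet_size: int, length: int) -> int:
    """Number of distinct chains of a given length."""
    return alphabet_size ** length


def bits_per_position(alphabet_size: int) -> float:
    """Information carried by one position, in bits."""
    return math.log2(alphabet_size)


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as 4 bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_dna; expects a length that is a multiple of 4."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    print(num_values(2, 3))      # donor/acceptor chain of length 3 -> 8
    print(num_values(4, 3))      # DNA chain of length 3 -> 64
    print(bits_per_position(6))  # a 6-letter alphabet, about 2.58 bits
    print(bytes_to_dna(b"A"))    # 0x41 = 01 00 00 01 -> CAAC
```

So going from 2 to 4 options per position doubles the bits per position (1 to 2), and each further unnatural base pair adds two letters to the alphabet but gives a smaller gain, since the capacity grows with log2 of the alphabet size.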

That means that, to get a good information density, you would also want to increase the number of possibilities. The challenge here is that you need to tune the set of possibilities so that the thermodynamics are balanced. You don’t want some pairs to stick very strongly while others stick only loosely, and you also don’t want bases to be able to pair with anything other than their intended partner. See: https://en.wikipedia.org/wiki/Nucleic_acid_thermodynamics
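As a rough illustration of how unbalanced pairing strength shows up even in natural DNA, here is a sketch using the Wallace rule, a crude rule of thumb for the melting temperature of short oligonucleotides (about 2 °C per A/T and 4 °C per G/C). This is not the nearest-neighbor model from the linked article, and the function name is just my own choice.

```python
def wallace_tm(seq: str) -> int:
    """Approximate melting temperature in degrees C via the Wallace rule.

    Only a rough estimate for short oligos; it ignores stacking,
    salt concentration and everything else a real model accounts for.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence may only contain A, C, G, T")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    print(wallace_tm("ATATATATATATAT"))  # 14 A/T -> 28
    print(wallace_tm("GCGCGCGCGCGCGC"))  # 14 G/C -> 56
```

Two chains of the same length can differ a lot in stability depending on composition, and any new set of bases would need the same kind of balancing.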

You can perhaps dispense with some of the thermodynamic tuning if you don’t need to be able to easily replicate the data through a process similar to DNA replication, as you don’t actually need to ‘pair’ at all - you have a single string of data. But in that case you lose a very powerful method as you are forced to re-synthesize every data chain from scratch - I think that with such a system you lose too many benefits.

If you go through the steps of creating a system of molecular data storage from scratch, I think it is easy to converge towards something similar to DNA. A lot of ‘origin of life’ research is actually about this - thinking about these systems and how to engineer them from scratch, and… DNA is pretty good at this. When you consider that early chemical evolution was an optimization algorithm to solve this problem, it makes sense that DNA is a good choice.

I do think it is good and fun to explore this. We do have at least some advantages over nature - for example, we have managed to purify many compounds that were not abundant in early chemical soups. So, perhaps we can find something.

Like the dNaM / dTPT3 pair, right? That’s perhaps more viable, at least to increase information density.

Yeah, like those. In this recent paper, for example, researchers sequenced a chain of four anthropogenic base pairs that they refer to as ‘ALIEN bases’: https://www.nature.com/articles/s41467-025-61991-9