[Bioc-devel] Bioc integration with IPFS

Paul Jason pjason1985 at gmail.com
Fri Jul 21 16:06:21 CEST 2017


Hi all,


I am thinking of an open-source platform built on top of IPFS
(https://medium.com/@ConsenSys/an-introduction-to-ipfs-9bba4860abd0).



I recently found out that NCBI/NIH is struggling with the large amounts of
data being generated for non-human genomes and is now directing that data
to Europe (
https://ncbiinsights.ncbi.nlm.nih.gov/2017/05/09/phasing-out-support-for-non-human-genome-organism-data-in-dbsnp-and-dbvar/).
I am guessing that even EVA might soon start suffering from the data
overload, as the volume of genomic data is projected to grow 4-5 fold over
the next decade (PLoS Biology:
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195).
GB/$ or bandwidth/$ is unlikely to grow at that pace, so the problem will
likely get worse. As I see it, the prohibitive cost will drive these
agencies toward some kind of cost sharing using distributed storage and
networking.



In short, IPFS is a p2p distributed file system with built-in incentive
mechanisms that encourage users to cache and distribute data locally. It
combines pre-existing and new systems (Git, distributed hash tables such as
Kademlia) that came out of ARPA/DARPA/IETF/Bell Labs
(https://github.com/ipfs/ipfs#more-about-ipfs).
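
Content addressing is easy to see in action; here is a quick sketch from R
(assuming only that the ipfs binary is installed, is on the PATH, and the
local repo has been initialized with "ipfs init"):

f <- tempfile()
writeLines("hello bioconductor", f)
h1 <- system2("ipfs", c("add", "-q", f), stdout = TRUE)

g <- tempfile()
writeLines("hello bioconductor", g)
h2 <- system2("ipfs", c("add", "-q", g), stdout = TRUE)

identical(h1, h2)   # TRUE: the address depends only on the content

Two files with the same bytes get the same address, no matter where or when
they were added, which is what makes network-wide caching and deduplication
possible.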



I am thinking of this more as a middle layer between the platform
(potentially Bioconductor) and IPFS. Each dataset would be assigned an IPFS
hash, and the hash would be maintained on a website along with any metadata
describing the information contained in the dataset.
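
To make that concrete, here is a minimal sketch in R (BioC's language) of
what such a registry might look like; the dataset names, hashes and
metadata below are made-up placeholders, not real datasets:

registry <- data.frame(
  dataset     = c("athaliana_snps_v1", "zebrafish_rnaseq_v2"),
  ipfs_hash   = c("QmPlaceholderHash1", "QmPlaceholderHash2"),
  description = c("A. thaliana SNP calls", "D. rerio RNA-seq counts"),
  stringsAsFactors = FALSE
)

## Resolve a dataset ID to its IPFS hash, failing loudly on a miss.
resolve_hash <- function(id, reg = registry) {
  hit <- reg$ipfs_hash[reg$dataset == id]
  if (length(hit) != 1) stop("unknown dataset: ", id)
  hit
}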



So instead of users having to query specific servers through server-specific
APIs, they would query the dataset and IPFS through a uniform interface
(ipfs get /ipfs/<hash>).
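
A thin wrapper could hide whether the bytes come from a local IPFS daemon
or from a public HTTP gateway. A rough sketch (assuming either the ipfs
binary on the PATH or reachability of https://ipfs.io; the gateway fallback
handles single files, not directories):

ipfs_get <- function(hash, dest = hash) {
  if (nzchar(Sys.which("ipfs"))) {
    ## Local CLI: "ipfs get <hash> -o <dest>" writes the object to dest.
    status <- system2("ipfs", c("get", hash, "-o", dest))
    if (status != 0) stop("ipfs get failed for ", hash)
  } else {
    ## Gateway fallback over plain HTTP.
    download.file(paste0("https://ipfs.io/ipfs/", hash),
                  destfile = dest, mode = "wb")
  }
  invisible(dest)
}

Combined with the registry sketch above, fetching a dataset would be as
simple as ipfs_get(resolve_hash("athaliana_snps_v1")).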



Potential advantages:

1. Standardizes data access/pipelines across multiple organizations,
replacing multiple server-specific APIs with a single simple interface
(ipfs get <hash>)

2. Reduces the cost of data storage/distribution by spreading it across
the entire network

3. Proven to work with large datasets

4. Backward compatible with existing data transport networks

5. Built-in incentives for users to store and distribute data via Bitswap,
Filecoin and Ethereum



IPFS is designed as a networked file system, so it should be integrated
much the way other software platforms integrate with file systems. I was
therefore thinking it is best to ship it as a default package within the
platform, perhaps as a middle layer. There are already amazing platforms
out there, so I am not proposing to build an entire platform, but to
integrate IPFS with a great existing one so that it benefits potential
suppliers and consumers of data.


Based on this, and because the problem is felt acutely in the genomics
community and BioC is among the most widely used software in the field, I
was thinking it is best to integrate with Bioconductor. However, now that I
have read a little more about BioC, I see that BioC is a set of packages
rather than a single platform.


Do you think it would be better to integrate with RStudio/R instead, since
that is the platform on which BioC is developed?


Are there any existing projects that already do this, or similar projects
that I could look into to get ideas?



Do you see any holes in my logic?


I can go into more detail on the use-case scenario I am currently thinking
of.


Thanks


Paul
