Make sure you have enough disk space available (2 TB for downloading, up to 5 TB while decompressing all files) and
a stable high-bandwidth Internet connection. The use of a download manager is recommended on desktop systems.
In addition, the following options are available:
An S3 client (e.g. Cyberduck, minio-client, rclone) can be used with the URL
s3://objectstore.hpccloud.mpcdf.mpg.de/deepclust/
To download via https on the command line, the following `wget` command can be used:
The total compressed size of this download resource is 1,906,886 MB (approximately 1.9 TB).
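Before starting the download, it can help to confirm that the target filesystem has enough free space. The following is a minimal sketch, not part of the original instructions. It assumes the 2 TB and 5 TB figures above are decimal terabytes (10^12 bytes), and the function name `has_enough_space` is hypothetical.

```python
import shutil

# Space requirements quoted in the text above.
# Assumption: "TB" means decimal terabytes (10**12 bytes).
TB = 10**12
DOWNLOAD_TB = 2      # needed for the compressed archives
DECOMPRESS_TB = 5    # peak usage while decompressing all files


def has_enough_space(path=".", required_tb=DECOMPRESS_TB):
    """Return True if the filesystem holding `path` has at least
    `required_tb` terabytes of free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_tb * TB


# Report on the current working directory.
for label, need in (("download", DOWNLOAD_TB), ("decompression", DECOMPRESS_TB)):
    status = "ok" if has_enough_space(".", need) else "NOT enough space"
    print(f"{label}: need {need} TB -> {status}")
```

Pass the directory that will actually hold the data as `path`, since free space is reported per filesystem.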
File Description
joined_with_index_RowGroupFinal.parquet: Parquet file containing all sequences clustered with DIAMOND DeepClust, as described in the publication.
clust_index_RowGroup.parquet: Index file indicating where in joined_with_index_RowGroupFinal.parquet a cluster can be found.
persistent: DuckDB database created from clust_index_RowGroup.parquet.
SeqIdMapClustId.parquet: Parquet file containing all sequence IDs from the DeepClust database, each of which can be mapped to the cluster it was assigned to.
(These files are contained within the archive DeepClustParquet.tar.zst.)
For more information see: https://github.com/drostlab/deepclust_dataretrieval
clust_bigg_2.mmseqs: MMseqs2-formatted database containing all clusters from the DeepClust database with more than two members, intended for use in protein structure prediction and with ColabFold.
clust_bigg2.fa: FASTA file containing all centroids representing clusters with more than two members.
(These files are contained within the archive clust_bigg2_mmseqs_db.tar.zst.)
For more information see: https://github.com/drostlab/deepclust_colabfold
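Once clust_bigg2.fa has been extracted from its archive, the centroids can be streamed record by record instead of loading the whole file into memory. The sketch below is a generic FASTA reader using only the Python standard library. The function name `read_fasta` and the example records are hypothetical, and the exact header layout of clust_bigg2.fa is not specified here, so headers are returned unparsed.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA text stream.

    The header is returned without the leading '>' and sequence
    lines belonging to one record are concatenated.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Demonstration on made-up records (not taken from the real file).
example = io.StringIO(">centroid_1 example\nMKV\nLAA\n>centroid_2\nMST\n")
for name, seq in read_fasta(example):
    print(name, len(seq))
```

For the real file, open it with `open("clust_bigg2.fa")` and pass the handle to `read_fasta`; because the function is a generator, only one record is held in memory at a time.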