Anaconda Enterprise Scale: distributing Conda packages to data or compute clusters

Last Updated: Dec 04, 2017 01:56PM CST
Purpose: Review the benefits and challenges of four different ways of distributing Conda packages to data or compute clusters.
 
  1. Cloudera Manager Parcels

    Approach: Each parcel contains a complete conda environment with its packages. Generate parcels using Anaconda Repository (an environment.yml file can serve as the spec), upload the parcel to Cloudera Manager, and Cloudera Manager distributes it across the cluster.

    Appropriate for: AE4 & AE5
    NOTE: AE5 deployments currently require an AE4 Repo to support Parcels.

    Benefits: Leverage existing Cloudera management infrastructure and familiar tools.

    Challenges: Deploying a parcel to a cluster requires administrator privileges, i.e., data scientist users cannot deploy packages themselves using this approach.
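    The environment.yml used as the parcel spec is a standard conda environment file. A minimal illustrative example (the environment name and package list are placeholders, not requirements):

    ```yaml
    # Illustrative spec; names and versions are examples only.
    name: analytics
    channels:
      - defaults
    dependencies:
      - python=3.6
      - numpy
      - pandas
    ```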

  2. Ambari Mpacks for Hortonworks

    Approach: Each Management Pack (Mpack) contains a complete conda environment with its packages. Generate Mpacks using Anaconda Repository (an environment.yml file can serve as the spec), upload the Mpack to Ambari, and Ambari distributes it across the cluster.

    Appropriate for: AE4 & AE5
    NOTE: AE5 deployments require an AE4 Repo to support Mpacks.

    Benefits: Leverages existing Ambari infrastructure and familiar tools.

    Challenges: Deploying an Mpack to a cluster requires administrator privileges, i.e., data scientist users typically cannot deploy packages themselves using this approach. The Ambari server must be restarted after each deployment.

  3. Knit-Conda

    Approach: Store conda packages in HDFS, and run a pre-job process that distributes the packages as step 0 of a Spark job.

    Appropriate for: AE4 only. A similar solution works for AE5.

    Benefits: Data scientists and other non-administrator users can deploy conda packages to the cluster. Packages are deployed only to the nodes that the scheduler has identified for a particular job.

    Challenges: Increased overhead the first time a job is run with a new set of dependencies.
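    The "step 0" pattern above can be sketched with conda-pack and spark-submit's --archives mechanism. This is one hedged way to implement the pattern, not the product's exact mechanism; the environment name, HDFS path, and job.py are placeholders.

    ```shell
    # Pack the local conda environment into a relocatable archive.
    conda pack -n analytics -o analytics.tar.gz

    # Stage the archive in HDFS so worker nodes can fetch it.
    hdfs dfs -put -f analytics.tar.gz /user/$USER/envs/analytics.tar.gz

    # Spark ships and unpacks the archive only on the nodes it
    # schedules for this job, aliased locally as ./env.
    spark-submit \
      --archives hdfs:///user/$USER/envs/analytics.tar.gz#env \
      --conf spark.pyspark.python=./env/bin/python \
      job.py
    ```

    The first run pays the cost of packing and staging the archive; subsequent jobs with the same dependencies reuse it, which matches the overhead trade-off noted above.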

  4. Anaconda Adam/Remote Conda (formerly Anaconda Cluster)

    Approach: Install Adam agents on each data node and the Adam master on an edge node. Deploy conda packages to the cluster by instructing the data nodes to run conda install in parallel.

    Appropriate for: AE4 & AE5

    Benefits: Data scientists and other non-administrator users can deploy conda packages to the cluster, and deployment can be limited to a subset of nodes.

    Challenges: Installing Adam involves many steps and components, administrative effort scales with the number of data nodes, and the product is potentially unsupported in the near future.
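    The parallel-install idea behind this approach can be sketched in a few lines of shell. This is a hypothetical illustration, not Adam itself: the node names and environment name are placeholders, and echo stands in for the real ssh call so the sketch is safe to run anywhere.

    ```shell
    #!/usr/bin/env sh
    # Sketch: fan out a conda install to many data nodes in parallel.
    deploy_env() {
      env_name=$1; shift
      for node in "$@"; do
        # A real deployment would run something like:
        #   ssh "$node" conda install -y -n "$env_name" numpy pandas &
        echo "would install $env_name on $node" &
      done
      wait    # block until every node has finished
    }

    deploy_env analytics node01 node02 node03 > plan.txt
    ```

    Because each install runs as a background job, total wall-clock time is roughly that of the slowest node rather than the sum of all nodes, though the administrator still has to reach every node.
    
    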