Promote and support advanced computing to further Tier-One research and education at the University of Houston
Agenda
- CACDS resources
- Operational changes
- Pricing structure
- Timeline
- Questions, discussion, feedback
Opuntia cluster
- Installation: 2014
- 83 nodes / 1,740 cores
- 64 GB memory per node
- 4 GPU nodes / 4 GPUs (K40)
- 600 TB NFS storage
- 56 Gb/s Ethernet network
Sabine cluster

Phase 1 (2017/18):
- 116 nodes / 3,248 cores
- 128/256 GB memory per node
- 8 GPU nodes / 16 GPUs (P100)
- 120 TB NFS storage
- 250 TB Lustre storage
- 100 Gb/s Omni-Path network

Phase 2 (2018):
- 53 nodes / 2,120 cores
- 192 GB memory per node
- 4 GPU nodes / 32 GPUs (V100)
- 600 TB NFS storage
- 100 Gb/s Omni-Path network
Sabine usage policies
- Requires a compute allocation held by the PI
- Home directories are backed up; project directories are not backed up
- Hardware is configured such that it can withstand most common failures without data loss
- Lustre vs. NFS for project directories: selection based on the I/O characteristics of your jobs (typically, Lustre favors large parallel reads and writes, while NFS is adequate for smaller, serial workloads)
CACDS user community

            No. of research groups   No. of users
Opuntia     ~140                     465
Sabine      45                       104

- The user community has grown significantly over the last few years
- High-performance computing (C, C++, Fortran, MPI, OpenMP)
- Data science (Python, R, TensorFlow, …)
CACDS challenges
- Provide compute and active storage resources to the entire university research community
- Balance the requirements of high-demand users with the needs of the overall community
- Provide state-of-the-art hardware to the university user community on a continuous basis
What are other universities doing?
- A survey of numerous sites shows a wide range of solutions
- Common trends:
  - Support for a centralized purchasing mechanism for PIs
  - Capping free storage capacity
  - Managing compute resources through an allocation process
  - Limiting the maximum amount of compute time per group
- Significant differences in how requests beyond this limit are handled
Acquiring commercial compute and storage resources?

Amazon pricing per SU / GPU SU:
  m4.large            $0.08
  m4.xlarge           $0.065
  m4.10xlarge         $0.05325
  p3.2xlarge (V100)   $2.654
  p2.xlarge (K80)     $0.58

Cloud storage pricing:
                 Price per GB/month   Price per TB/month
  Dropbox        $0.008               $8.19
  Google Drive   $0.01                $10.24
  iCloud         $0.02                $20.48
  OneDrive       $0.007               $7.16
Operational Model
- Free compute and storage allocations
  - Limits set per compute resource and per research group
- Purchasing additional compute SUs and storage resources
- Purchasing additional priority compute SUs
  - Quicker job start times guaranteed
- Purchasing condo nodes
  - Regular compute nodes, GPU nodes
Some boring definitions upfront
- 1 Service Unit (SU): using 1 CPU core for 1 hour
- 1 GPU SU: using 1 GPU for 1 hour
- The granularity of SUs is controlled by SLURM accounting
- Cost of a parallel job using multiple cores and multiple GPUs:
  Cost = cost for no. of CPU SUs + cost for no. of GPU SUs
- Since SLURM only tracks the resources used and the time, the real formula is:
  Cost = time × (n_CPUs × weight_CPUs + n_GPUs × weight_GPUs)
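As a minimal sketch of how this accounting formula plays out (the function below is illustrative, not CACDS's actual tooling; the example weights are taken from the "Weight factors" slide later in this deck):

    # Illustrative sketch of the SU accounting formula above.
    # Function and parameter names are hypothetical, not CACDS tooling.
    def job_cost_su(hours, n_cpus, cpu_weight, n_gpus=0, gpu_weight=0.0):
        """SUs charged: time * (n_cpus * cpu_weight + n_gpus * gpu_weight)."""
        return hours * (n_cpus * cpu_weight + n_gpus * gpu_weight)

    # Example: a 10-hour job on Sabine Phase 2 using 40 cores and 2 V100 GPUs
    # (CPU weight 1.0, V100 GPU weight 31, per the weight-factor slide):
    sus = job_cost_su(hours=10, n_cpus=40, cpu_weight=1.0, n_gpus=2, gpu_weight=31)
    print(sus)  # 10 * (40 * 1.0 + 2 * 31) = 1020.0 SUs

At the Sabine Phase 2 rate of $0.0045 per SU, that example job would cost about 1,020 × $0.0045 ≈ $4.59.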
Free Resources granted by CACDS per research group and resource
- A short proposal is required to utilize CACDS resources:
  - Brief description of the research
  - Estimate and justification of the requested resources (SUs + storage)
- Resources are granted up to the maximum defined for the platform
- Compute allocations are valid for one year

                     Max. Free Storage Quota   Max. Free Compute Quota
Opuntia              10 TB                     2 million SU
Sabine (Phase 1+2)   10 TB                     3 million SU
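For a sense of scale (assuming a CPU weight of 1.0): the 3 million SU Sabine allocation corresponds to roughly 3,000,000 / 8,760 ≈ 342 cores running around the clock for a full year, and the 2 million SU Opuntia allocation to roughly 228 cores.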
Opuntia storage utilization (chart)
Purchasing additional compute SUs and storage
If a research group is in need of additional compute SUs or storage capacity, these can be purchased at the following rates:

  Sabine regular compute SU    $0.0045
  Opuntia regular compute SU   $0.0033
  Additional storage           $6.00 per TB per month
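As a worked example (the usage figures are hypothetical): buying 1 million additional Sabine SUs would cost 1,000,000 × $0.0045 = $4,500, and 5 TB of additional storage held for a year would cost 5 × 12 × $6 = $360.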
Purchasing priority compute SUs
- Priority jobs have higher priority than standard jobs
- Jobs are guaranteed to start within 4 hours*
  *up to the maximum amount of resources agreed on

  Sabine priority compute SU    $0.009
  Opuntia priority compute SU   $0.0066
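Note that the priority rates are exactly twice the corresponding regular rates: $0.009 = 2 × $0.0045 on Sabine, and $0.0066 = 2 × $0.0033 on Opuntia.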
Condo model
- PI purchases compute nodes/hardware through CACDS
- Hardware is hosted and operated as part of the CACDS resources
- PI has 100% of the node's compute SUs for the first 4 years
  - CACDS users can utilize unused cycles
  - The start of a PI job will not be delayed by more than 4 hours due to non-PI jobs
- PI has 80% of the node's compute SUs in subsequent years
- Cluster end of life is not expected to exceed 8 years

         Condo compute node            Condo GPU node
Sabine   $7,000 + infrastructure fee*  actual cost + infrastructure fee*

- Regular compute node pricing is based on the actual price of Gen10 nodes
- *Infrastructure fee calculated in Aug. 2018: $1,142
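As a rough worked comparison (assuming a 40-core Sabine Phase 2 node, i.e. 2,120 cores across 53 nodes): a condo node can deliver up to 40 × 8,760 ≈ 350,400 SUs per year, or about 1.4 million SUs over the first four years. Purchased at the regular Sabine rate of $0.0045 per SU, those cycles would cost about $6,300, so the $7,000 + fee condo price pays off mainly for groups that keep the node busy beyond year four.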
Weight factors

               Sabine (Phase 2 nodes)   Sabine (Phase 1 nodes)   Opuntia
1 Compute SU   1.0                      0.85                     0.75 / 1.0
1 GPU SU       31 (V100)                24 (P100)                6 (K40) / 4.5

- Weight factors for CPUs are based on the SPEC CPU 2006 benchmark
- Weight factors for GPUs are based on the cost difference between regular compute nodes and the GPU nodes (P100), and on the theoretical peak performance of the GPU (K40, V100)
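Applied to the Sabine Phase 2 base rate of $0.0045 per compute SU, these weights yield the effective prices in the following table: e.g. a Sabine Phase 1 compute SU costs 0.85 × $0.0045 ≈ $0.0038, a V100 GPU SU costs 31 × $0.0045 ≈ $0.139, and a K40 GPU SU costs 6 × $0.0045 = $0.027.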
Effective pricing of individual systems (non-priority)

System                       CACDS      Amazon AWS
Sabine Phase 2 compute SU*   $0.0045
Sabine Phase 1 compute SU    $0.0038    $0.05325
Opuntia compute SU*          $0.0033
V100 GPU SU                  $0.139     $2.654
P100 GPU SU                  $0.108
K40 GPU SU                   $0.027

* Relevant price for users
Financial value of free resources granted by CACDS per group

                     Free Storage Quota   Free Compute Quota
Opuntia              10 TB                2 million SU
Sabine (Phase 1+2)   10 TB                3 million SU

Financial cost per group:
- Storage: 20 TB × 12 months × $6 = $1,440 per year
- Compute: 2 million × $0.0033 + 3 million × $0.0045 = $20,100
- Total: roughly $21,540 per group per year
Next steps
- CACDS will have to go through a certification step to become a university service center
- Anticipated start date: 03/01/2019
- September 17: Sabine downtime for integrating new hardware and an OS update
- October 15–26: Opuntia downtime for reinstallation
- Adjust Opuntia usage policies to the Sabine model (i.e., allocations required for utilization)
- Every research group will initially get a default allocation valid until 01/31/2019
- Account clean-up: we will need your help to remove accounts of people who are no longer at UH
- A survey on storage requirements from DoR, UIT, and UH Libraries is coming in the next few days
Discussion and Questions