Trying to run a big python job with AWS (EC2), is there a better way?

  • Thread starter: ergospherical
  • Tags: Python
AI Thread Summary
The discussion focuses on running a computationally intensive Python job on AWS EC2, with the user seeking alternatives to improve performance beyond their laptop's capabilities. They have parallelized their code using Joblib and are simulating millions of particles over extensive timescales, which is inherently resource-intensive. Users suggest that free EC2 instances are inadequate for such tasks and recommend larger, more powerful instances, particularly using spot pricing for cost efficiency. Profiling the code is emphasized to identify potential optimizations, and there is a discussion about utilizing GPUs for better performance. Overall, the conversation highlights the challenges of cloud computing setup and the need for strategic resource management in high-performance computing tasks.
ergospherical
I've got some code in a public repo with a module containing my model and a Python script which runs the model and returns a .parquet file with data. I've parallelized all of the important processes with Joblib, and the underlying code is written in pure C, so I can't make it much faster (at least, not to the best of my knowledge; I'm sure someone else could).
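
Roughly, the parallel driver looks like the sketch below (the function names are just placeholders, not the actual repo code):

Code:
# Simplified sketch of a Joblib-parallelized driver (illustrative names only).
from joblib import Parallel, delayed
import pandas as pd

def simulate_batch(batch_id, n_particles):
    """Placeholder for the real call into the C-backed integrator."""
    return {"batch": batch_id, "n_particles": n_particles}

n_batches = 64
results = Parallel(n_jobs=-1)(  # -1 = use every available core
    delayed(simulate_batch)(i, 10_000) for i in range(n_batches)
)

# Combine the per-batch results and write the .parquet output.
pd.DataFrame(results).to_parquet("output.parquet")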

That said, it's taking about 3 days to run on my crappy laptop. Did some digging and found that Amazon AWS might be sensible:

https://docs.aws.amazon.com/systems...egration-github-python.html?tag=pfamazon01-20

There are some EC2 free tier offers, so it's a good idea to make sure it's the right option before committing to anything paid. Are there better alternatives? (I just want the job to run fast on a big compute-optimized cluster; I don't need much memory or storage optimization.)

Has anyone used EC2 before? Is it relatively painless to set up?
 
If your code is not running fast enough, profile it, see where it's spending its time and whether that can be sped up. Don't just assume "C is fast". Bad algorithms can be implemented in C too.
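
For example, a quick first pass with Python's built-in cProfile (a generic sketch; "run_model.py" stands in for whatever your driver script is called):

Code:
# Run the driver under the profiler, then inspect the hot spots.
# In a shell:  python -m cProfile -o profile.out run_model.py
import pstats

stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)   # 20 most expensive call sites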
 
Likes: pbuk and ergospherical
The integrators in the model are pretty well optimized; the job is just simulating millions of particles for timescales of Gigayears, so I reckon it's always going to be a computationally difficult job. I have just gotten a free EC2 instance up and running, but currently I've chosen a free version which is less powerful than my laptop.
 
ergospherical said:
the job is just simulating millions of particles for timescales of Gigayears
That's a lot of particles and years. Do you have any intermediate indicators of how far the calculation has progressed in 3 days? There are a lot of problems that cannot be solved even on supercomputers.

My experience with very long-running programs is to periodically store intermediate results so that the program can be continued from an intermediate point. There are power outages, unplanned system resets, etc.
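
Something as simple as the following pattern is usually enough (a minimal sketch; the state layout and file names are placeholders):

Code:
# Periodic checkpointing so a long run can resume after an interruption.
import os
import pickle

CHECKPOINT = "checkpoint.pkl"  # placeholder path

def save_state(state):
    # Write to a temporary file first so a crash mid-write can't corrupt the checkpoint.
    with open(CHECKPOINT + ".tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace(CHECKPOINT + ".tmp", CHECKPOINT)

def load_state(default):
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return default

state = load_state({"step": 0})
for step in range(state["step"], 1_000_000):
    # state = advance_simulation(state)  # hypothetical call into the real integrator
    state["step"] = step + 1
    if step % 10_000 == 0:
        save_state(state)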

I have no experience with massively parallel algorithms, but you might want to look into the possibility of utilizing a GPU for parallelization.
 
Ah, if I had a dime for everyone who said "My code doesn't need to be profiled. It's already optimal" I'd be a wealthy man.
 
Likes: pbuk and FactChecker
ergospherical said:
Has anyone used EC2 before? Is it relatively painless to set up?
Yes. I'd say that of all the cloud offerings AWS is in general the most painful to set up, but this is just an inconvenience.

ergospherical said:
I have just gotten a free EC2 instance up and running, but currently I've chosen a free version which is less powerful than my laptop.
The free instances are a waste of time for this - they are intended for developing and staging websites.

To get anything better than an average desktop (top gaming machine performance) you need at least, say, a c8g.8xlarge (32 vCPU / 64 GB RAM), but bigger is always better, so a c8g.48xlarge will give you 192 vCPU / 384 GB. At "on demand" prices they will cost you $1.27 and $7.63 an hour respectively, so don't do that: "spot" prices (where your job is fitted around other workloads) are generally around 15% of the cost.
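
If you want to script it rather than click through the console, a boto3 sketch looks roughly like this (the AMI ID, key pair and region are placeholders; note a Graviton instance like c8g needs an ARM AMI):

Code:
# Sketch: request a compute-optimized spot instance with boto3.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")  # placeholder region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder (must be an ARM AMI for c8g)
    InstanceType="c8g.8xlarge",
    KeyName="my-key-pair",             # placeholder key pair
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={"MarketType": "spot"},
)
print(response["Instances"][0]["InstanceId"])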

Pricing on Azure tends to be cheaper for committed resources (which you don't need) but more expensive for spot, though this is a generalization. There are other providers with various levels of price/performance/support, e.g. Google at the top, Vultr in the middle; at the bottom I won't mention any names, but I have not used them for HPC.

ergospherical said:
The integrators in the model are pretty well optimized; the job is just simulating millions of particles for timescales of Gigayears, so I reckon it's always going to be a computationally difficult job.
'just' doing galaxy sims? I'm sure I don't need to point out that this is a challenging task and many sub-optimisations need to be considered (e.g. Barnes-Hut, multipole...). Are you using GADGET-2?

Other considerations come to mind: are you sure your model retains meaningful accuracy over this sort of timescale? How are you dealing with tricky things: collisions, stellar evolution, GR in general?

Have you looked at compiling your code for GPU? I would have thought you would want to do that, whether for running on your own hardware or in the cloud.

Finally do you not have access to an academic institution's computing facilities, or have they already pointed you at AWS?
 
Likes: Vanadium 50, ergospherical, PeterDonis and 1 other person
There are several tuning parameters in the code that control the accuracy of the simulations; I've previously turned these way down, and the results you end up with are qualitatively correct -- even at the lower level of accuracy. Now it would be good to do a higher-accuracy run, except the computation time scales exponentially with the tuning parameters. The bottleneck in the compute speed is the numerical integration of the dynamics, which is a fully parallelized tree code.

How do you mean compile for GPU -- compile into an MPS runtime and deploy that?

I spun up a 96-vCPU (c5.24xlarge) instance on AWS; it is faster (compute time ~ 1 day) but also quite expensive. I think this at least demonstrates that the parallelism is working and the code is effectively distributing the jobs across the cores.

The spot pricing is a bit confusing to me -- when I tried to set it up, AWS gives me a whole load of separate instances. With a single c5.24xlarge instance, I can just connect via SSH, clone the repo, install requirements.txt and run the script. To use the spot pricing service, I don't know if I have to do some more work to manually distribute the job across all of the provided instances.
 
ergospherical said:
How do you mean compile for GPU -- compile into an MPS runtime and deploy that?
If you want to maximize performance gains you will need to do more than that. There are some good references in https://developer.nvidia.com/gpugem...lation/chapter-31-fast-n-body-simulation-cuda

I think there are also some quirks to running the AWS GPU instances; see the SDK documentation for these.

It might also be interesting to compare the performance of a multipole method with your tree method - see for instance https://academic.oup.com/mnras/article/506/2/2871/6312509.

ergospherical said:
The spot pricing is a bit confusing to me -- when I tried to set it up, AWS gives me a whole load of separate instances. With a single c5.24xlarge instance, I can just connect via SSH, clone the repo, install requirements.txt and run the script. To use the spot pricing service, I don't know if I have to do some more work to manually distribute the job across all of the provided instances.
No, the separate instances are independent options. Basically you set up a job (in your case clone a repo, run a script, write some output to somewhere persistent e.g. S3) and choose from the big list of instances what hardware you want to run it on. The wider your choice of hardware, the sooner you will get an available spot instance. Because your job is long-running you will probably find the instance is terminated part way through, so you will want to save state periodically so that the job can recover.
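
For the "write some output to somewhere persistent" step, something like this boto3 sketch is enough (the bucket name and keys are placeholders):

Code:
# Sketch: copy the run's output and latest checkpoint to S3 before the
# spot instance is reclaimed. Bucket name and keys are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file("output.parquet", "my-results-bucket", "runs/run-001/output.parquet")
s3.upload_file("checkpoint.pkl", "my-results-bucket", "runs/run-001/checkpoint.pkl")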
 
Likes: ergospherical