Training a LoRA for Hunyuan Video with diffusion-pipe: VRAM maxed out, but GPU mostly idle.
This is my first time trying to train any lora at all, so there might be some glaring mistakes.
So I have a dataset of videos (with a specific concept).
I chunked them all into 33-frame segments, extracted a random frame from each segment, then used JoyCaption Alpha Two to caption that frame, and that caption is what I use for the whole segment.
I have a dataset of about 3k 33-frame chunks.
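For clarity, the chunking logic boils down to something like this sketch (function names are mine, and the actual frame extraction/captioning tools are omitted; this only shows the segment bookkeeping):

```python
# Minimal sketch of the chunking step described above: split a clip's
# frame count into non-overlapping 33-frame segments, and pick one
# random frame index per segment to send to the captioner.
import random

SEGMENT_LEN = 33

def segment_bounds(total_frames, seg_len=SEGMENT_LEN):
    """Return (start, end) frame ranges for full seg_len-frame chunks;
    a trailing partial chunk is dropped."""
    return [(s, s + seg_len)
            for s in range(0, total_frames - seg_len + 1, seg_len)]

def caption_frame_index(start, end):
    """Pick the random frame within a segment that gets captioned."""
    return random.randrange(start, end)

bounds = segment_bounds(100)  # a 100-frame clip -> 3 full chunks
print(bounds)                 # [(0, 33), (33, 66), (66, 99)]
```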
Now the issue is that each step takes about 5 minutes (also, how can I tell how many steps are in an epoch?).
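For the steps-per-epoch question, as I understand it, it's just the dataset size divided by the effective batch size (micro-batch size × gradient accumulation × number of GPUs), rounded up. A quick sanity check, assuming those knobs exist in the config (the exact key names in diffusion-pipe's TOML may differ):

```python
import math

def steps_per_epoch(num_examples, micro_batch_size,
                    grad_accum_steps=1, num_gpus=1):
    """One optimizer step consumes micro_batch_size * grad_accum_steps
    * num_gpus examples, so an epoch is the dataset size divided by
    that, rounded up."""
    effective_batch = micro_batch_size * grad_accum_steps * num_gpus
    return math.ceil(num_examples / effective_batch)

# ~3000 chunks, micro-batch of 1, grad accum 4, single GPU (illustrative numbers)
print(steps_per_epoch(3000, 1, grad_accum_steps=4))  # 750
```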
The main issue: VRAM is maxed out, but GPU utilization is under 10%.

Should I be using a much smaller dataset in general?

Should I start training in the cloud on an A100, where VRAM will not be a bottleneck?

My diffusion-pipe config trains a LoRA of rank 64; I've seen others use 32. What would that affect?
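On rank, as far as I understand it: LoRA adds two low-rank matrices per targeted weight, so the adapter's trainable parameter count (and its size on disk and in VRAM) scales linearly with rank. Rank 64 is exactly twice the adapter parameters of rank 32, which can capture more detail but also overfits more easily. Rough arithmetic for one weight matrix (the 3072×3072 shape below is illustrative, not the exact Hunyuan dimensions):

```python
def lora_params(out_features, in_features, rank):
    """LoRA factorizes the weight update as B @ A, with A: (rank, in)
    and B: (out, rank), so it adds rank * (in + out) trainable
    parameters per targeted weight matrix."""
    return rank * (in_features + out_features)

# Illustrative attention projection of shape 3072x3072:
print(lora_params(3072, 3072, 32))  # 196608
print(lora_params(3072, 3072, 64))  # 393216 -> exactly 2x rank 32
```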
Edit: I read too fast. The GPU utilization reads as maxed out, but it's not drawing as much power as it normally does and it's not getting hot at all, so I suspect stalls in the VRAM/data pipeline are keeping it from being fully utilized.