Another Error
I tried upgrading torch, torchvision, and torchaudio to see if it made a difference, and now I'm getting a new error. I also downloaded the distilled models in case I cannot run the 24B model.
(magi) root@46e1abf287b8:/workspace/MAGI-1# bash example/24B/run.sh
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 37, in main
[rank0]: pipeline = MagiPipeline(args.config_file)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 32, in __init__
[rank0]: dist_init(self.config)
[rank0]: File "/workspace/MAGI-1/inference/infra/distributed/dist_utils.py", line 48, in dist_init
[rank0]: assert config.engine_config.cp_size * config.engine_config.pp_size == torch.distributed.get_world_size()
[rank0]: AssertionError
[rank0]:[W423 02:54:17.933492678 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0423 02:54:19.241000 5094 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 5163) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
inference/pipeline/entry.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-04-23_02:54:19
host : 46e1abf287b8
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5163)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
It looks like the config needs some modifications. Could you let me know how many GPUs you’re using and what type they are?
Also, make sure that pp_size * cp_size equals the total number of GPUs.
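For example, with a single GPU the engine_config section of example/24B/24B_config.json would look roughly like this (only the parallelism-related fields are shown; with 2 GPUs you could set cp_size to 2 and keep pp_size at 1, or vice versa, as long as the product matches the GPU count):

"engine_config": {
    "distributed_backend": "nccl",
    "pp_size": 1,
    "cp_size": 1
}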
I started all over from scratch. I am getting further but still having problems.
[2025-04-24 01:04:51,105 - INFO] After build_dit_model, memory allocated: 0.02 GB, memory reserved: 0.08 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank0]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank0]: self._run(prompt, prefix_video, output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 48, in _run
[rank0]: dit = get_dit(self.config)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 654, in get_dit
[rank0]: model = load_checkpoint(model)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 155, in load_checkpoint
[rank0]: state_dict = load_state_dict(model.runtime_config, model.engine_config)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 145, in load_state_dict
[rank0]: assert os.path.exists(inference_weight_dir)
[rank0]: AssertionError
E0424 01:04:52.556000 132482488543040 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3378) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
This is from the 24B_config.json file:
"clean_chunk_kvrange": 1,
"clean_t": 0.9999,
"seed": 83746,
"num_frames": 121,
"video_size_h": 540,
"video_size_w": 960,
"num_steps": 8,
"window_size": 4,
"fps": 24,
"chunk_width": 6,
"load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base/inference_weight",
"t5_pretrained": "/workspace/MAGI-1-models/models/T5/ckpt/t5",
"t5_device": "cuda",
"vae_pretrained": "/workspace/MAGI-1-models/models/VAE",
"scale_factor": 0.18215,
"temporal_downsample_factor": 4
I have no idea what is going on, but the files in the directory configured by the "load" parameter are the same as those on Hugging Face. I am not sure about this error: "assert os.path.exists(inference_weight_dir)". I tried pointing "load" one directory level up, but that did not make a difference. I tried this with both a single L40 and with 2x L40s; I am not sure whether those specs are too low for this or not. I will try one of the other configurations with the other models, but I certainly cannot get this to work.
I used cp_size=2 when I was using 2x L40s and cp_size=1 when using a single L40.
Change the load path to "load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base".
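That is, in the runtime_config section of example/24B/24B_config.json the entry should read as follows (neighboring keys shown for context; the loader presumably resolves the inference_weight subdirectory under this path on its own):

"chunk_width": 6,
"load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base",
"t5_pretrained": "/workspace/MAGI-1-models/models/T5/ckpt/t5",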
Same Problem:
(magi) root@ca1683f2b34d:/workspace# ls -l /workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base/inference_weight
total 46757232
-rw-rw-rw- 1 root root 4988160184 Apr 23 21:18 model-00001-of-00006.safetensors
-rw-rw-rw- 1 root root 7247764000 Apr 23 21:18 model-00002-of-00006.safetensors
-rw-rw-rw- 1 root root 19327358992 Apr 23 21:19 model-00003-of-00006.safetensors
-rw-rw-rw- 1 root root 9663682528 Apr 23 21:18 model-00004-of-00006.safetensors
-rw-rw-rw- 1 root root 3623890200 Apr 23 21:18 model-00005-of-00006.safetensors
-rw-rw-rw- 1 root root 3028420248 Apr 23 21:18 model-00006-of-00006.safetensors
-rw-rw-rw- 1 root root 126708 Apr 23 21:17 model.safetensors.index.json
(magi) root@ca1683f2b34d:/workspace#
(magi) root@ca1683f2b34d:/workspace/MAGI-1# cat example/24B/24B_config.json
{
"model_config": {
"model_name": "videodit_ardf",
"num_layers": 48,
"hidden_size": 6144,
"ffn_hidden_size": 16384,
"num_attention_heads": 48,
"num_query_groups": 8,
"kv_channels": 128,
"layernorm_epsilon": 1e-06,
"apply_layernorm_1p": true,
"x_rescale_factor": 0.1,
"half_channel_vae": true,
"params_dtype": "torch.bfloat16",
"patch_size": 2,
"t_patch_size": 1,
"in_channels": 32,
"out_channels": 32,
"cond_hidden_ratio": 0.25,
"caption_channels": 4096,
"caption_max_length": 800,
"xattn_cond_hidden_ratio": 1.0,
"cond_gating_ratio": 1.0,
"gated_linear_unit": true
},
"runtime_config": {
"cfg_number": 1,
"cfg_t_range": [
0.0,
0.0217,
0.1,
0.3,
0.999
],
"prev_chunk_scales": [
1.5,
1.5,
1.5,
1.0,
1.0
],
"text_scales": [
7.5,
7.5,
7.5,
0.0,
0.0
],
"noise2clean_kvrange": [
5,
4,
3,
2
],
"clean_chunk_kvrange": 1,
"clean_t": 0.9999,
"seed": 83746,
"num_frames": 121,
"video_size_h": 540,
"video_size_w": 960,
"num_steps": 8,
"window_size": 4,
"fps": 24,
"chunk_width": 6,
"load": "/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base",
"t5_pretrained": "/workspace/MAGI-1-models/models/T5/ckpt/t5",
"t5_device": "cuda",
"vae_pretrained": "/workspace/MAGI-1-models/models/VAE",
"scale_factor": 0.18215,
"temporal_downsample_factor": 4
},
"engine_config": {
"distributed_backend": "nccl",
"distributed_timeout_minutes": 15,
"pp_size": 1,
"cp_size": 1,
"cp_strategy": "cp_ulysses",
"ulysses_overlap_degree": 1,
"fp8_quant": true,
"distill_nearly_clean_chunk_threshold": 0.3,
"shortcut_mode": "8,16,16",
"distill": true,
"kv_offload": true,
"enable_cuda_graph": false
}
}
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
[W425 00:54:29.094511239 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[2025-04-25 00:54:29,391 - INFO] Initialize torch distribution and model parallel successfully
[2025-04-25 00:54:29,391 - INFO] MagiConfig(model_config=ModelConfig(model_name='videodit_ardf', num_layers=48, hidden_size=6144, ffn_hidden_size=16384, num_attention_heads=48, num_query_groups=8, kv_channels=128, layernorm_epsilon=1e-06, apply_layernorm_1p=True, x_rescale_factor=0.1, half_channel_vae=True, params_dtype=torch.bfloat16, patch_size=2, t_patch_size=1, in_channels=32, out_channels=32, cond_hidden_ratio=0.25, caption_channels=4096, caption_max_length=800, xattn_cond_hidden_ratio=1.0, cond_gating_ratio=1.0, gated_linear_unit=True), runtime_config=RuntimeConfig(cfg_number=1, cfg_t_range=[0.0, 0.0217, 0.1, 0.3, 0.999], prev_chunk_scales=[1.5, 1.5, 1.5, 1.0, 1.0], text_scales=[7.5, 7.5, 7.5, 0.0, 0.0], noise2clean_kvrange=[5, 4, 3, 2], clean_chunk_kvrange=1, clean_t=0.9999, seed=83746, num_frames=121, video_size_h=540, video_size_w=960, num_steps=8, window_size=4, fps=24, chunk_width=6, t5_pretrained='/workspace/MAGI-1-models/models/T5/ckpt/t5', t5_device='cuda', vae_pretrained='/workspace/MAGI-1-models/models/VAE', scale_factor=0.18215, temporal_downsample_factor=4, load='/workspace/MAGI-1-models/models/MAGI/ckpt/magi/24B_base'), engine_config=EngineConfig(distributed_backend='nccl', distributed_timeout_minutes=15, pp_size=1, cp_size=1, cp_strategy='cp_ulysses', ulysses_overlap_degree=1, fp8_quant=True, distill_nearly_clean_chunk_threshold=0.3, shortcut_mode='8,16,16', distill=True, kv_offload=True, enable_cuda_graph=False))
/workspace/MAGI-1/inference/pipeline/video_process.py:229: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/conda/conda-bld/pytorch_1720538438429/work/torch/csrc/utils/tensor_new.cpp:1544.)
video = torch.frombuffer(out, dtype=torch.uint8).view(1, h, w, 3)
[2025-04-25 00:54:46,251 - INFO] Precompute validation prompt embeddings
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:00<00:00, 30.09s/it]
[2025-04-25 00:55:49,201 - INFO] VideoDiTModel(
(x_embedder): Conv3d(32, 6144, kernel_size=(1, 2, 2), stride=(1, 2, 2), bias=False)
(t_embedder): TimestepEmbedder(
(mlp): Sequential(
(0): Linear(in_features=256, out_features=1536, bias=True)
(1): SiLU()
(2): Linear(in_features=1536, out_features=1536, bias=True)
)
)
(y_embedder): CaptionEmbedder(
(y_proj_xattn): Sequential(
(0): Linear(in_features=4096, out_features=6144, bias=True)
(1): SiLU()
)
(y_proj_adaln): Sequential(
(0): Linear(in_features=4096, out_features=1536, bias=True)
)
)
(rope): LearnableRotaryEmbeddingCat()
(videodit_blocks): TransformerBlock(
(layers): ModuleList(
(0): TransformerLayer(
(ada_modulate_layer): AdaModulateLayer(
(act): SiLU()
(proj): Sequential(
(0): Linear(in_features=1536, out_features=12288, bias=True)
)
)
(self_attention): FullyParallelAttention(
(linear_qkv): CustomLayerNormLinear(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(q): Linear(in_features=6144, out_features=6144, bias=False)
(qx): Linear(in_features=6144, out_features=6144, bias=False)
(k): Linear(in_features=6144, out_features=1024, bias=False)
(v): Linear(in_features=6144, out_features=1024, bias=False)
)
(linear_kv_xattn): Linear(in_features=6144, out_features=2048, bias=False)
(linear_proj): Linear(in_features=12288, out_features=6144, bias=False)
(q_layernorm): FusedLayerNorm()
(q_layernorm_xattn): FusedLayerNorm()
(k_layernorm): FusedLayerNorm()
(k_layernorm_xattn): FusedLayerNorm()
)
(self_attn_post_norm): FusedLayerNorm()
(mlp): CustomMLP(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(linear_fc1): Linear(in_features=6144, out_features=32768, bias=False)
(linear_fc2): Linear(in_features=16384, out_features=6144, bias=False)
)
(mlp_post_norm): FusedLayerNorm()
)
(1-46): 46 x TransformerLayer(
(ada_modulate_layer): AdaModulateLayer(
(act): SiLU()
(proj): Sequential(
(0): Linear(in_features=1536, out_features=12288, bias=True)
)
)
(self_attention): FullyParallelAttention(
(linear_qkv): CustomLayerNormLinear(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(q): PerTensorQuantizedFp8Linear()
(qx): PerTensorQuantizedFp8Linear()
(k): PerTensorQuantizedFp8Linear()
(v): PerTensorQuantizedFp8Linear()
)
(linear_kv_xattn): Linear(in_features=6144, out_features=2048, bias=False)
(linear_proj): PerChannelQuantizedFp8Linear()
(q_layernorm): FusedLayerNorm()
(q_layernorm_xattn): FusedLayerNorm()
(k_layernorm): FusedLayerNorm()
(k_layernorm_xattn): FusedLayerNorm()
)
(self_attn_post_norm): FusedLayerNorm()
(mlp): CustomMLP(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(linear_fc1): PerTensorQuantizedFp8Linear()
(linear_fc2): PerChannelQuantizedFp8Linear()
)
(mlp_post_norm): FusedLayerNorm()
)
(47): TransformerLayer(
(ada_modulate_layer): AdaModulateLayer(
(act): SiLU()
(proj): Sequential(
(0): Linear(in_features=1536, out_features=12288, bias=True)
)
)
(self_attention): FullyParallelAttention(
(linear_qkv): CustomLayerNormLinear(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(q): Linear(in_features=6144, out_features=6144, bias=False)
(qx): Linear(in_features=6144, out_features=6144, bias=False)
(k): Linear(in_features=6144, out_features=1024, bias=False)
(v): Linear(in_features=6144, out_features=1024, bias=False)
)
(linear_kv_xattn): Linear(in_features=6144, out_features=2048, bias=False)
(linear_proj): Linear(in_features=12288, out_features=6144, bias=False)
(q_layernorm): FusedLayerNorm()
(q_layernorm_xattn): FusedLayerNorm()
(k_layernorm): FusedLayerNorm()
(k_layernorm_xattn): FusedLayerNorm()
)
(self_attn_post_norm): FusedLayerNorm()
(mlp): CustomMLP(
(layer_norm): LayerNorm((6144,), eps=1e-06, elementwise_affine=True)
(linear_fc1): Linear(in_features=6144, out_features=32768, bias=False)
(linear_fc2): Linear(in_features=16384, out_features=6144, bias=False)
)
(mlp_post_norm): FusedLayerNorm()
)
)
(final_layernorm): FusedLayerNorm()
)
(final_linear): FinalLinear(
(linear): Linear(in_features=6144, out_features=128, bias=False)
)
)
[2025-04-25 00:55:49,212 - INFO] (cp, pp) rank (0, 0): param count 23902014382, model size 24.65 GB
[2025-04-25 00:55:49,212 - INFO] Build DiTModel successfully
[2025-04-25 00:55:49,212 - INFO] After build_dit_model, memory allocated: 0.02 GB, memory reserved: 0.08 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank0]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank0]: self._run(prompt, prefix_video, output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 48, in _run
[rank0]: dit = get_dit(self.config)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 654, in get_dit
[rank0]: model = load_checkpoint(model)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 155, in load_checkpoint
[rank0]: state_dict = load_state_dict(model.runtime_config, model.engine_config)
[rank0]: File "/workspace/MAGI-1/inference/infra/checkpoint/checkpointing.py", line 145, in load_state_dict
[rank0]: assert os.path.exists(inference_weight_dir)
[rank0]: AssertionError
E0425 00:55:50.917000 136965425092416 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3142) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
inference/pipeline/entry.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-04-25_00:55:50
host : ca1683f2b34d
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3142)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(magi) root@ca1683f2b34d:/workspace/MAGI-1#
If you’re using the 24B_base model, please set cfg_number=3, fp8_quant=false, and distill=false.
The default config on GitHub seems a bit confusing — I’ll try to update it when I get a chance.
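Concretely, that means changing these fields in example/24B/24B_config.json (a sketch showing only the settings named above; all other fields stay as they are):

"runtime_config": {
    "cfg_number": 3
},
"engine_config": {
    "fp8_quant": false,
    "distill": false
}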
Well, it got further, but it's still a no-go. I tried it with 4x L40s.
[2025-04-26 04:56:46,189 - INFO] After load_checkpoint, memory allocated: 11.28 GB, memory reserved: 11.31 GB
[2025-04-26 04:56:46,191 - INFO] After high_precision_promoter, memory allocated: 11.28 GB, memory reserved: 11.31 GB
[2025-04-26 04:56:46,350 - INFO] Load checkpoint successfully
[2025-04-26 04:56:46,350 - INFO] special_token = ['HQ_TOKEN', 'DURATION_TOKEN']
[2025-04-26 04:56:46,354 - INFO] Warning: For better performance, please use multiple inputs for PP>1
InferBatch 0: 0%| | 0/6 [00:00<?, ?it/s][2025-04-26 04:56:46,355 - INFO] transport_inputs len: 1
2025-04-26 04:56:51,307 - INFO - flashinfer.jit: Loading JIT ops: silu_and_mul
2025-04-26 04:57:12,262 - INFO - flashinfer.jit: Finished loading JIT ops: silu_and_mul
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank0]: main()
[rank0]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank0]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank0]: self._run(prompt, prefix_video, output_path)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 50, in _run
[rank0]: [
[rank0]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 50, in <listcomp>
[rank0]: [
[rank0]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 763, in generate_per_chunk
[rank0]: for _, _, chunk in sample_transport.walk():
[rank0]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 725, in walk
[rank0]: velocity = self.forward_velocity(infer_idx, 0)
[rank0]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 657, in forward_velocity
[rank0]: velocity = forward_fn(
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 503, in forward_dispatcher
[rank0]: (out_cond_pre_and_text, out_cond_pre, out_uncond, denoise_width) = self.forward_3cfg(
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 414, in forward_3cfg
[rank0]: out_cond_pre_and_text = self.forward(
[rank0]: File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 385, in forward
[rank0]: x = self.videodit_blocks.forward(
[rank0]: File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_module.py", line 1422, in forward
[rank0]: hidden_states = layer(
[rank0]: File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_module.py", line 1311, in forward
[rank0]: hidden_states = self.attn_post_process(core_attn_out, cross_attn_out, residual, condition, condition_map)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_module.py", line 1324, in attn_post_process
[rank0]: hidden_states = self.gating_and_mlp(hidden_states, residual, condition, condition_map)
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_module.py", line 1359, in gating_and_mlp
[rank0]: hidden_states = bias_modulate_add(hidden_states, residual, condition_map, gate_mlp, self.mlp_post_norm).to(
[rank0]: File "/workspace/MAGI-1/inference/model/dit/dit_module.py", line 293, in bias_modulate_add
[rank0]: x = x.float()
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
InferBatch 0: 0%| | 0/6 [00:26<?, ?it/s]
[rank3]: Traceback (most recent call last):
[rank3]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank3]: main()
[rank3]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank3]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank3]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank3]: self._run(prompt, prefix_video, output_path)
[rank3]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 50, in _run
[rank3]: [
[rank3]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 50, in <listcomp>
[rank3]: [
[rank3]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 763, in generate_per_chunk
[rank3]: for _, _, chunk in sample_transport.walk():
[rank3]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 725, in walk
[rank3]: velocity = self.forward_velocity(infer_idx, 0)
[rank3]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 657, in forward_velocity
[rank3]: velocity = forward_fn(
[rank3]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 503, in forward_dispatcher
[rank3]: (out_cond_pre_and_text, out_cond_pre, out_uncond, denoise_width) = self.forward_3cfg(
[rank3]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 414, in forward_3cfg
[rank3]: out_cond_pre_and_text = self.forward(
[rank3]: File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]: return func(*args, **kwargs)
[rank3]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 379, in forward
[rank3]: x = pp_scheduler().recv_prev_data(x.shape, x.dtype)
[rank3]: File "/workspace/MAGI-1/inference/infra/parallelism/pipeline_parallel.py", line 74, in recv_prev_data
[rank3]: self.irecv_prev(recv_tensor).wait()
[rank3]: File "/workspace/MAGI-1/inference/infra/parallelism/pipeline_parallel.py", line 60, in irecv_prev
[rank3]: handle = torch.distributed.irecv(buffer, src=mpu.get_pipeline_model_parallel_prev_rank(), group=mpu.get_pp_group())
[rank3]: File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in irecv
[rank3]: return pg.recv([tensor], group_src_rank, tag)
[rank3]: torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer
[rank3]: Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1720538438429/work/torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[rank3]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78ca3bf76f86 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libc10.so)
[rank3]: frame #1: <unknown function> + 0x599c9de (0x78ca2c19c9de in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x78ca2c197277 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x78ca2c1975a2 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x78ca2c198791 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x78ca2c14d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x78ca2c14d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x78ca2c14d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x78ca2c14d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xaf (0x78c9e51b50df in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0x114c (0x78c9e51c0ebc in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x68a (0x78c9e51deaba in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank3]: frame #12: <unknown function> + 0x593f429 (0x78ca2c13f429 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #13: <unknown function> + 0x5949e8a (0x78ca2c149e8a in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #14: <unknown function> + 0x4f6c42b (0x78ca2b76c42b in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #15: <unknown function> + 0x4f69ca4 (0x78ca2b769ca4 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #16: <unknown function> + 0x176bcd8 (0x78ca27f6bcd8 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #17: <unknown function> + 0x5950e94 (0x78ca2c150e94 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #18: <unknown function> + 0x5956045 (0x78ca2c156045 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank3]: frame #19: <unknown function> + 0xdb6a3e (0x78ca349b6a3e in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank3]: frame #20: <unknown function> + 0x4b00e4 (0x78ca340b00e4 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank3]: frame #21: <unknown function> + 0x144446 (0x628476b1c446 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #22: _PyObject_MakeTpCall + 0x26b (0x628476b1597b in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #23: <unknown function> + 0x1506e6 (0x628476b286e6 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x628476b11022 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #25: _PyFunction_Vectorcall + 0x6c (0x628476b1c8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #26: _PyEval_EvalFrameDefault + 0x13cc (0x628476b0d7dc in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #27: _PyFunction_Vectorcall + 0x6c (0x628476b1c8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #28: _PyEval_EvalFrameDefault + 0x72c (0x628476b0cb3c in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #29: _PyFunction_Vectorcall + 0x6c (0x628476b1c8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #30: _PyEval_EvalFrameDefault + 0x72c (0x628476b0cb3c in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #31: _PyFunction_Vectorcall + 0x6c (0x628476b1c8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #32: PyObject_Call + 0xbc (0x628476b28d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #33: _PyEval_EvalFrameDefault + 0x2d84 (0x628476b0f194 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #34: <unknown function> + 0x150402 (0x628476b28402 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #35: PyObject_Call + 0xbc (0x628476b28d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #36: _PyEval_EvalFrameDefault + 0x2d84 (0x628476b0f194 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #37: <unknown function> + 0x150402 (0x628476b28402 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #38: PyObject_Call + 0xbc (0x628476b28d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #39: _PyEval_EvalFrameDefault + 0x2d84 (0x628476b0f194 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #40: <unknown function> + 0x150402 (0x628476b28402 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #41: PyObject_Call + 0xbc (0x628476b28d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #42: _PyEval_EvalFrameDefault + 0x2d84 (0x628476b0f194 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #43: _PyFunction_Vectorcall + 0x6c (0x628476b1c8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #44: _PyEval_EvalFrameDefault + 0x72c (0x628476b0cb3c in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #45: <unknown function> + 0x157017 (0x628476b2f017 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #46: _PyEval_EvalFrameDefault + 0xa0a (0x628476b0ce1a in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #47: <unknown function> + 0x157017 (0x628476b2f017 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #48: _PyEval_EvalFrameDefault + 0xa0a (0x628476b0ce1a in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #49: _PyFunction_Vectorcall + 0x6c (0x628476b1c8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #50: _PyEval_EvalFrameDefault + 0x320 (0x628476b0c730 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #51: _PyFunction_Vectorcall + 0x6c (0x628476b1c8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #52: _PyEval_EvalFrameDefault + 0x72c (0x628476b0cb3c in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #53: <unknown function> + 0x150402 (0x628476b28402 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #54: _PyEval_EvalFrameDefault + 0x13cc (0x628476b0d7dc in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #55: _PyFunction_Vectorcall + 0x6c (0x628476b1c8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #56: _PyEval_EvalFrameDefault + 0x320 (0x628476b0c730 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #57: <unknown function> + 0x1d7870 (0x628476baf870 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #58: PyEval_EvalCode + 0x87 (0x628476baf7b7 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #59: <unknown function> + 0x207d1a (0x628476bdfd1a in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #60: <unknown function> + 0x203123 (0x628476bdb123 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #61: <unknown function> + 0x9a4d1 (0x628476a724d1 in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: frame #62: _PyRun_SimpleFileObject + 0x1ae (0x628476bd560e in /workspace/miniconda3/envs/magi/bin/python)
[rank3]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank1]: main()
[rank1]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank1]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank1]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank1]: self._run(prompt, prefix_video, output_path)
[rank1]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 50, in _run
[rank1]: [
[rank1]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 50, in <listcomp>
[rank1]: [
[rank1]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 763, in generate_per_chunk
[rank1]: for _, _, chunk in sample_transport.walk():
[rank1]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 725, in walk
[rank1]: velocity = self.forward_velocity(infer_idx, 0)
[rank1]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 657, in forward_velocity
[rank1]: velocity = forward_fn(
[rank1]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 503, in forward_dispatcher
[rank1]: (out_cond_pre_and_text, out_cond_pre, out_uncond, denoise_width) = self.forward_3cfg(
[rank1]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 414, in forward_3cfg
[rank1]: out_cond_pre_and_text = self.forward(
[rank1]: File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 379, in forward
[rank1]: x = pp_scheduler().recv_prev_data(x.shape, x.dtype)
[rank1]: File "/workspace/MAGI-1/inference/infra/parallelism/pipeline_parallel.py", line 74, in recv_prev_data
[rank1]: self.irecv_prev(recv_tensor).wait()
[rank1]: File "/workspace/MAGI-1/inference/infra/parallelism/pipeline_parallel.py", line 60, in irecv_prev
[rank1]: handle = torch.distributed.irecv(buffer, src=mpu.get_pipeline_model_parallel_prev_rank(), group=mpu.get_pp_group())
[rank1]: File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in irecv
[rank1]: return pg.recv([tensor], group_src_rank, tag)
[rank1]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer
[rank1]: Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1720538438429/work/torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[rank1]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7bd236376f86 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libc10.so)
[rank1]: frame #1: <unknown function> + 0x599c9de (0x7bd22839c9de in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7bd228397277 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7bd2283975a2 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7bd228398791 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7bd22834d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7bd22834d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7bd22834d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7bd22834d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xaf (0x7bd1e13b50df in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0x114c (0x7bd1e13c0ebc in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x68a (0x7bd1e13deaba in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank1]: frame #12: <unknown function> + 0x593f429 (0x7bd22833f429 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #13: <unknown function> + 0x5949e8a (0x7bd228349e8a in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #14: <unknown function> + 0x4f6c42b (0x7bd22796c42b in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #15: <unknown function> + 0x4f69ca4 (0x7bd227969ca4 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #16: <unknown function> + 0x176bcd8 (0x7bd22416bcd8 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #17: <unknown function> + 0x5950e94 (0x7bd228350e94 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #18: <unknown function> + 0x5956045 (0x7bd228356045 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank1]: frame #19: <unknown function> + 0xdb6a3e (0x7bd230bb6a3e in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank1]: frame #20: <unknown function> + 0x4b00e4 (0x7bd2302b00e4 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank1]: frame #21: <unknown function> + 0x144446 (0x59d1354a6446 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #22: _PyObject_MakeTpCall + 0x26b (0x59d13549f97b in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #23: <unknown function> + 0x1506e6 (0x59d1354b26e6 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x59d13549b022 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #25: _PyFunction_Vectorcall + 0x6c (0x59d1354a68cc in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #26: _PyEval_EvalFrameDefault + 0x13cc (0x59d1354977dc in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #27: _PyFunction_Vectorcall + 0x6c (0x59d1354a68cc in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #28: _PyEval_EvalFrameDefault + 0x72c (0x59d135496b3c in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #29: _PyFunction_Vectorcall + 0x6c (0x59d1354a68cc in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #30: _PyEval_EvalFrameDefault + 0x72c (0x59d135496b3c in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #31: _PyFunction_Vectorcall + 0x6c (0x59d1354a68cc in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #32: PyObject_Call + 0xbc (0x59d1354b2d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #33: _PyEval_EvalFrameDefault + 0x2d84 (0x59d135499194 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #34: <unknown function> + 0x150402 (0x59d1354b2402 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #35: PyObject_Call + 0xbc (0x59d1354b2d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #36: _PyEval_EvalFrameDefault + 0x2d84 (0x59d135499194 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #37: <unknown function> + 0x150402 (0x59d1354b2402 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #38: PyObject_Call + 0xbc (0x59d1354b2d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #39: _PyEval_EvalFrameDefault + 0x2d84 (0x59d135499194 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #40: <unknown function> + 0x150402 (0x59d1354b2402 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #41: PyObject_Call + 0xbc (0x59d1354b2d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #42: _PyEval_EvalFrameDefault + 0x2d84 (0x59d135499194 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #43: _PyFunction_Vectorcall + 0x6c (0x59d1354a68cc in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #44: _PyEval_EvalFrameDefault + 0x72c (0x59d135496b3c in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #45: <unknown function> + 0x157017 (0x59d1354b9017 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #46: _PyEval_EvalFrameDefault + 0xa0a (0x59d135496e1a in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #47: <unknown function> + 0x157017 (0x59d1354b9017 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #48: _PyEval_EvalFrameDefault + 0xa0a (0x59d135496e1a in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #49: _PyFunction_Vectorcall + 0x6c (0x59d1354a68cc in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #50: _PyEval_EvalFrameDefault + 0x320 (0x59d135496730 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #51: _PyFunction_Vectorcall + 0x6c (0x59d1354a68cc in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #52: _PyEval_EvalFrameDefault + 0x72c (0x59d135496b3c in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #53: <unknown function> + 0x150402 (0x59d1354b2402 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #54: _PyEval_EvalFrameDefault + 0x13cc (0x59d1354977dc in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #55: _PyFunction_Vectorcall + 0x6c (0x59d1354a68cc in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #56: _PyEval_EvalFrameDefault + 0x320 (0x59d135496730 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #57: <unknown function> + 0x1d7870 (0x59d135539870 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #58: PyEval_EvalCode + 0x87 (0x59d1355397b7 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #59: <unknown function> + 0x207d1a (0x59d135569d1a in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #60: <unknown function> + 0x203123 (0x59d135565123 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #61: <unknown function> + 0x9a4d1 (0x59d1353fc4d1 in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: frame #62: _PyRun_SimpleFileObject + 0x1ae (0x59d13555f60e in /workspace/miniconda3/envs/magi/bin/python)
[rank1]: . This may indicate a possible application crash on rank 0 or a network set up issue.
[rank2]: Traceback (most recent call last):
[rank2]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in <module>
[rank2]: main()
[rank2]: File "/workspace/MAGI-1/inference/pipeline/entry.py", line 45, in main
[rank2]: pipeline.run_image_to_video(prompt=args.prompt, image_path=args.image_path, output_path=args.output_path)
[rank2]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 40, in run_image_to_video
[rank2]: self._run(prompt, prefix_video, output_path)
[rank2]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 50, in _run
[rank2]: [
[rank2]: File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 50, in <listcomp>
[rank2]: [
[rank2]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 763, in generate_per_chunk
[rank2]: for _, _, chunk in sample_transport.walk():
[rank2]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 725, in walk
[rank2]: velocity = self.forward_velocity(infer_idx, 0)
[rank2]: File "/workspace/MAGI-1/inference/pipeline/video_generate.py", line 657, in forward_velocity
[rank2]: velocity = forward_fn(
[rank2]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 503, in forward_dispatcher
[rank2]: (out_cond_pre_and_text, out_cond_pre, out_uncond, denoise_width) = self.forward_3cfg(
[rank2]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 414, in forward_3cfg
[rank2]: out_cond_pre_and_text = self.forward(
[rank2]: File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]: return func(*args, **kwargs)
[rank2]: File "/workspace/MAGI-1/inference/model/dit/dit_model.py", line 379, in forward
[rank2]: x = pp_scheduler().recv_prev_data(x.shape, x.dtype)
[rank2]: File "/workspace/MAGI-1/inference/infra/parallelism/pipeline_parallel.py", line 74, in recv_prev_data
[rank2]: self.irecv_prev(recv_tensor).wait()
[rank2]: File "/workspace/MAGI-1/inference/infra/parallelism/pipeline_parallel.py", line 60, in irecv_prev
[rank2]: handle = torch.distributed.irecv(buffer, src=mpu.get_pipeline_model_parallel_prev_rank(), group=mpu.get_pp_group())
[rank2]: File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in irecv
[rank2]: return pg.recv([tensor], group_src_rank, tag)
[rank2]: torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer
[rank2]: Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1720538438429/work/torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
[rank2]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a9534176f86 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libc10.so)
[rank2]: frame #1: <unknown function> + 0x599c9de (0x7a952619c9de in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7a9526197277 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7a95261975a2 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7a9526198791 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7a952614d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7a952614d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7a952614d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7a952614d1e1 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xaf (0x7a94df1b50df in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank2]: frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0x114c (0x7a94df1c0ebc in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank2]: frame #11: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x68a (0x7a94df1deaba in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[rank2]: frame #12: <unknown function> + 0x593f429 (0x7a952613f429 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #13: <unknown function> + 0x5949e8a (0x7a9526149e8a in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #14: <unknown function> + 0x4f6c42b (0x7a952576c42b in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #15: <unknown function> + 0x4f69ca4 (0x7a9525769ca4 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #16: <unknown function> + 0x176bcd8 (0x7a9521f6bcd8 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #17: <unknown function> + 0x5950e94 (0x7a9526150e94 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #18: <unknown function> + 0x5956045 (0x7a9526156045 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
[rank2]: frame #19: <unknown function> + 0xdb6a3e (0x7a952e9b6a3e in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank2]: frame #20: <unknown function> + 0x4b00e4 (0x7a952e0b00e4 in /workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
[rank2]: frame #21: <unknown function> + 0x144446 (0x57f889c9b446 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #22: _PyObject_MakeTpCall + 0x26b (0x57f889c9497b in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #23: <unknown function> + 0x1506e6 (0x57f889ca76e6 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x57f889c90022 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #25: _PyFunction_Vectorcall + 0x6c (0x57f889c9b8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #26: _PyEval_EvalFrameDefault + 0x13cc (0x57f889c8c7dc in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #27: _PyFunction_Vectorcall + 0x6c (0x57f889c9b8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #28: _PyEval_EvalFrameDefault + 0x72c (0x57f889c8bb3c in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #29: _PyFunction_Vectorcall + 0x6c (0x57f889c9b8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #30: _PyEval_EvalFrameDefault + 0x72c (0x57f889c8bb3c in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #31: _PyFunction_Vectorcall + 0x6c (0x57f889c9b8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #32: PyObject_Call + 0xbc (0x57f889ca7d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #33: _PyEval_EvalFrameDefault + 0x2d84 (0x57f889c8e194 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #34: <unknown function> + 0x150402 (0x57f889ca7402 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #35: PyObject_Call + 0xbc (0x57f889ca7d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #36: _PyEval_EvalFrameDefault + 0x2d84 (0x57f889c8e194 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #37: <unknown function> + 0x150402 (0x57f889ca7402 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #38: PyObject_Call + 0xbc (0x57f889ca7d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #39: _PyEval_EvalFrameDefault + 0x2d84 (0x57f889c8e194 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #40: <unknown function> + 0x150402 (0x57f889ca7402 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #41: PyObject_Call + 0xbc (0x57f889ca7d9c in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #42: _PyEval_EvalFrameDefault + 0x2d84 (0x57f889c8e194 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #43: _PyFunction_Vectorcall + 0x6c (0x57f889c9b8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #44: _PyEval_EvalFrameDefault + 0x72c (0x57f889c8bb3c in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #45: <unknown function> + 0x157017 (0x57f889cae017 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #46: _PyEval_EvalFrameDefault + 0xa0a (0x57f889c8be1a in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #47: <unknown function> + 0x157017 (0x57f889cae017 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #48: _PyEval_EvalFrameDefault + 0xa0a (0x57f889c8be1a in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #49: _PyFunction_Vectorcall + 0x6c (0x57f889c9b8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #50: _PyEval_EvalFrameDefault + 0x320 (0x57f889c8b730 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #51: _PyFunction_Vectorcall + 0x6c (0x57f889c9b8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #52: _PyEval_EvalFrameDefault + 0x72c (0x57f889c8bb3c in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #53: <unknown function> + 0x150402 (0x57f889ca7402 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #54: _PyEval_EvalFrameDefault + 0x13cc (0x57f889c8c7dc in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #55: _PyFunction_Vectorcall + 0x6c (0x57f889c9b8cc in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #56: _PyEval_EvalFrameDefault + 0x320 (0x57f889c8b730 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #57: <unknown function> + 0x1d7870 (0x57f889d2e870 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #58: PyEval_EvalCode + 0x87 (0x57f889d2e7b7 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #59: <unknown function> + 0x207d1a (0x57f889d5ed1a in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #60: <unknown function> + 0x203123 (0x57f889d5a123 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #61: <unknown function> + 0x9a4d1 (0x57f889bf14d1 in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: frame #62: _PyRun_SimpleFileObject + 0x1ae (0x57f889d5460e in /workspace/miniconda3/envs/magi/bin/python)
[rank2]: . This may indicate a possible application crash on rank 0 or a network set up issue.
W0426 04:57:14.279000 130591876859712 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3018 closing signal SIGTERM
W0426 04:57:14.281000 130591876859712 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3019 closing signal SIGTERM
W0426 04:57:14.283000 130591876859712 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3020 closing signal SIGTERM
E0426 04:57:14.766000 130591876859712 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3017) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
inference/pipeline/entry.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-04-26_04:57:14
host : 090fdaa4401b
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3017)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================