Run Inference on servers
=====================================================

Inference is the process of using a trained model to make predictions on new data. Because this process can be compute-intensive, running on a dedicated or external service can be an interesting option. The `huggingface_hub` library provides a unified interface to run inference across multiple services for models hosted on the Hugging Face Hub:

1.  [Inference Providers](https://huggingface.co/docs/inference-providers/index): streamlined, unified access to hundreds of machine learning models, powered by our serverless inference partners. This new approach builds on our previous Serverless Inference API, offering more models, improved performance, and greater reliability thanks to world-class providers. Refer to the [documentation](https://huggingface.co/docs/inference-providers/index#partners) for a list of supported providers.
2.  [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index): a product to easily deploy models to production. Inference is run by Hugging Face in a dedicated, fully managed infrastructure on a cloud provider of your choice.
3.  Local endpoints: you can also run inference with local inference servers like [llama.cpp](https://github.com/ggerganov/llama.cpp), [Ollama](https://ollama.com/), [vLLM](https://github.com/vllm-project/vllm), [LiteLLM](https://docs.litellm.ai/docs/simple_proxy), or [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) by connecting the client to these local endpoints.

These services can all be called from the [InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient) object. It acts as a replacement for the legacy [InferenceApi](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceApi) client, adding specific support for tasks and third-party providers. Learn how to migrate to the new client in the [Legacy InferenceAPI client](#legacy-inferenceapi-client) section.

[InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient) is a Python client making HTTP calls to our APIs. If you want to make the HTTP calls directly using your preferred tool (curl, postman,…), please refer to the [Inference Providers](https://huggingface.co/docs/inference-providers/index) documentation or to the [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) documentation pages.

For web development, a [JS client](https://huggingface.co/docs/huggingface.js/inference/README) has been released. If you are interested in game development, you might have a look at our [C# project](https://github.com/huggingface/unity-api).

Getting started
-----------------------------------

Let’s get started with a text-to-image task:

>>> from huggingface_hub import InferenceClient

# Example with an external provider (e.g. replicate)
>>> replicate_client = InferenceClient(
    provider="replicate",
    api_key="my_replicate_api_key",
)
>>> replicate_image = replicate_client.text_to_image(
    "A flying car crossing a futuristic cityscape.",
    model="black-forest-labs/FLUX.1-schnell",
)
>>> replicate_image.save("flying_car.png")

In the example above, we initialized an [InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient) with a third-party provider, [Replicate](https://replicate.com/). When using a provider, you must specify the model you want to use. The model id must be the id of the model on the Hugging Face Hub, not the id of the model from the third-party provider. In our example, we generated an image from a text prompt. The returned value is a `PIL.Image` object that can be saved to a file. For more details, check out the [text\_to\_image()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_to_image) documentation.

Let’s now see an example using the [chat\_completion()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.chat_completion) API. This task uses an LLM to generate a response from a list of messages:

>>> from huggingface_hub import InferenceClient
>>> messages = [
    {
        "role": "user",
        "content": "What is the capital of France?",
    }
]
>>> client = InferenceClient(
    provider="together",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    api_key="my_together_api_key",
)
>>> client.chat_completion(messages, max_tokens=100)
ChatCompletionOutput(
    choices=[
        ChatCompletionOutputComplete(
            finish_reason="eos_token",
            index=0,
            message=ChatCompletionOutputMessage(
                role="assistant", content="The capital of France is Paris.", name=None, tool_calls=None
            ),
            logprobs=None,
        )
    ],
    created=1719907176,
    id="",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    object="text_completion",
    system_fingerprint="2.0.4-sha-f426a33",
    usage=ChatCompletionOutputUsage(completion_tokens=8, prompt_tokens=17, total_tokens=25),
)

In the example above, we used a third-party provider ([Together AI](https://www.together.ai/)) and specified which model we want to use (`"meta-llama/Meta-Llama-3-8B-Instruct"`). We then gave a list of messages to complete (here, a single question) and passed an additional parameter to the API (`max_tokens=100`). The output is a `ChatCompletionOutput` object that follows the OpenAI specification. The generated content can be accessed with `output.choices[0].message.content`. For more details, check out the [chat\_completion()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.chat_completion) documentation.

The API is designed to be simple. Not all parameters and options are available or described for the end user. Check out [this page](https://huggingface.co/docs/api-inference/detailed_parameters) if you are interested in learning more about all the parameters available for each task.
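
For instance, most task methods accept additional keyword arguments beyond the prompt; `text_to_image()` exposes generation parameters such as `negative_prompt` and `num_inference_steps` (whether a given provider honors each parameter depends on the underlying model). A quick illustration with arbitrary values:

>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient(provider="replicate", api_key="my_replicate_api_key")
>>> image = client.text_to_image(
...     "A flying car crossing a futuristic cityscape.",
...     model="black-forest-labs/FLUX.1-schnell",
...     negative_prompt="blurry, low quality",  # arbitrary example values
...     num_inference_steps=4,
... )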

### Using a specific provider

If you want to use a specific provider, you can specify it when initializing the client. The default value is "auto", which will select the first of the providers available for the model, sorted by the user’s order in [https://hf.co/settings/inference-providers](https://hf.co/settings/inference-providers). Refer to the [Supported providers and tasks](#supported-providers-and-tasks) section for a list of supported providers.

>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient(provider="replicate", api_key="my_replicate_api_key")

### Using a specific model

What if you want to use a specific model? You can specify it either as a parameter or directly at an instance level:

>>> from huggingface_hub import InferenceClient
# Initialize client for a specific model
>>> client = InferenceClient(provider="together", model="meta-llama/Llama-3.1-8B-Instruct")
>>> client.text_to_image(...)
# Or use a generic client but pass your model as an argument
>>> client = InferenceClient(provider="together")
>>> client.text_to_image(..., model="meta-llama/Llama-3.1-8B-Instruct")

When using the "hf-inference" provider, each task comes with a recommended model from the 1M+ models available on the Hub. However, this recommendation can change over time, so it’s best to explicitly set a model once you’ve decided which one to use. For third-party providers, you must always specify a model that is compatible with that provider.

Visit the [Models](https://huggingface.co/models?inference=warm) page on the Hub to explore models available through Inference Providers.
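
For example, with the "hf-inference" provider you can omit the `model` argument and let the client fall back to the recommended model for the task (note that the model actually used may change over time):

>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient(provider="hf-inference")
>>> client.text_classification("I love using the InferenceClient!")  # runs the task's recommended model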

### Using Inference Endpoints

The examples above use inference providers, which are very useful for prototyping and quick testing. Once you’re ready to deploy your model to production, you’ll need to use dedicated infrastructure. That’s where [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) comes into play. It allows you to deploy any model and expose it as a private API. Once deployed, you’ll get a URL that you can connect to using exactly the same code as before, changing only the `model` parameter:

>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient(model="https://uu149rez6gw9ehej.eu-west-1.aws.endpoints.huggingface.cloud/deepfloyd-if")
# or
>>> client = InferenceClient()
>>> client.text_to_image(..., model="https://uu149rez6gw9ehej.eu-west-1.aws.endpoints.huggingface.cloud/deepfloyd-if")

Note that you cannot specify both a URL and a provider - they are mutually exclusive. URLs are used to connect directly to deployed endpoints.

### Using local endpoints

You can use [InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient) to run chat completion with local inference servers (llama.cpp, vllm, litellm server, TGI, mlx, etc.) running on your own machine. The API should be OpenAI API-compatible.

>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient(model="http://localhost:8080")

>>> response = client.chat.completions.create(
...     messages=[
...         {"role": "user", "content": "What is the capital of France?"}
...     ],
...     max_tokens=100
... )
>>> print(response.choices[0].message.content)

Similarly to the OpenAI Python client, [InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient) can be used to run Chat Completion inference with any OpenAI REST API-compatible endpoint.

### Authentication

Authentication can be done in two ways:

**Routed through Hugging Face** : Use Hugging Face as a proxy to access third-party providers. The calls will be routed through Hugging Face’s infrastructure using our provider keys, and the usage will be billed directly to your Hugging Face account.

You can authenticate using a [User Access Token](https://huggingface.co/docs/hub/security-tokens). You can provide your Hugging Face token directly using the `api_key` parameter:

>>> client = InferenceClient(
    provider="replicate",
    api_key="hf_****"  # Your HF token
)

If you _don’t_ pass an `api_key`, the client will attempt to find and use a token stored locally on your machine. This typically happens if you’ve previously logged in. See the [Authentication Guide](https://huggingface.co/docs/huggingface_hub/quick-start#authentication) for details on login. You can also pass your token explicitly with the `token` parameter, for which `api_key` is an alias:

>>> client = InferenceClient(
    provider="replicate",
    token="hf_****"  # Your HF token
)

**Direct access to provider**: Use your own API key to interact directly with the provider’s service:

>>> client = InferenceClient(
    provider="replicate",
    api_key="r8_****"  # Your Replicate API key
)

For more details, refer to the [Inference Providers pricing documentation](https://huggingface.co/docs/inference-providers/pricing#routed-requests-vs-direct-calls).

Supported providers and tasks
---------------------------------------------------------------

[InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient)’s goal is to provide the easiest interface to run inference on Hugging Face models, on any provider. It has a simple API that supports the most common tasks. Here is a table showing which providers support which tasks:

| Task | Black Forest Labs | Cerebras | Cohere | fal-ai | Featherless AI | Fireworks AI | Groq | HF Inference | Hyperbolic | Nebius AI Studio | Novita AI | Replicate | Sambanova | Together |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [audio\_classification()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.audio_classification) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [audio\_to\_audio()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.audio_to_audio) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [automatic\_speech\_recognition()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.automatic_speech_recognition) | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [chat\_completion()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.chat_completion) | ❌ | βœ… | βœ… | ❌ | βœ… | βœ… | βœ… | βœ… | βœ… | βœ… | βœ… | ❌ | βœ… | βœ… |
| [document\_question\_answering()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.document_question_answering) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [feature\_extraction()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.feature_extraction) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | βœ… | ❌ | ❌ | βœ… | ❌ |
| [fill\_mask()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.fill_mask) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [image\_classification()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.image_classification) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [image\_segmentation()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.image_segmentation) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [image\_to\_image()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.image_to_image) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [image\_to\_text()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.image_to_text) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [object\_detection()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.object_detection) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [question\_answering()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.question_answering) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [sentence\_similarity()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.sentence_similarity) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [summarization()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.summarization) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [table\_question\_answering()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.table_question_answering) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [text\_classification()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_classification) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [text\_generation()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_generation) | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | βœ… | βœ… | βœ… | βœ… | ❌ | ❌ | βœ… |
| [text\_to\_image()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_to_image) | βœ… | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | βœ… | βœ… | βœ… | ❌ | βœ… | ❌ | βœ… |
| [text\_to\_speech()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_to_speech) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ |
| [text\_to\_video()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_to_video) | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ |
| [tabular\_classification()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.tabular_classification) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [tabular\_regression()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.tabular_regression) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [token\_classification()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.token_classification) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [translation()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.translation) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [visual\_question\_answering()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.visual_question_answering) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [zero\_shot\_image\_classification()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.zero_shot_image_classification) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| [zero\_shot\_classification()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.zero_shot_classification) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | βœ… | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |

Check out the [Tasks](https://huggingface.co/tasks) page to learn more about each task.

OpenAI compatibility
---------------------------------------------

The `chat_completion` task follows [OpenAI’s Python client](https://github.com/openai/openai-python) syntax. What does this mean for you? It means that if you are used to playing with `OpenAI`’s APIs, you can switch to `huggingface_hub.InferenceClient` to work with open-source models by updating just two lines of code!

- from openai import OpenAI
+ from huggingface_hub import InferenceClient

- client = OpenAI(
+ client = InferenceClient(
    base_url=...,
    api_key=...,
)

output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)

And that’s it! The only required changes are to replace `from openai import OpenAI` with `from huggingface_hub import InferenceClient` and `client = OpenAI(...)` with `client = InferenceClient(...)`. You can choose any LLM model from the Hugging Face Hub by passing its model id as the `model` parameter. [Here is a list](https://huggingface.co/models?pipeline_tag=text-generation&other=conversational,text-generation-inference&sort=trending) of supported models. For authentication, you should pass a valid [User Access Token](https://huggingface.co/settings/tokens) as `api_key` or authenticate using `huggingface_hub` (see the [authentication guide](https://huggingface.co/docs/huggingface_hub/quick-start#authentication)).

All input parameters and output format are strictly the same. In particular, you can pass `stream=True` to receive tokens as they are generated. You can also use the [AsyncInferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.AsyncInferenceClient) to run inference using `asyncio`:

import asyncio
- from openai import AsyncOpenAI
+ from huggingface_hub import AsyncInferenceClient

- client = AsyncOpenAI()
+ client = AsyncInferenceClient()

async def main():
    stream = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": "Say this is a test"}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")

asyncio.run(main())

You might wonder why you should use [InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient) instead of OpenAI’s client. There are a few reasons for that:

1.  [InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient) is configured for Hugging Face services. You don’t need to provide a `base_url` to run models with Inference Providers. You also don’t need to provide a `token` or `api_key` if your machine is already correctly logged in.
2.  [InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient) is tailored for both Text-Generation-Inference (TGI) and `transformers` frameworks, meaning you are assured it will always be on par with the latest updates.
3.  [InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient) is integrated with our Inference Endpoints service, making it easier to launch an Inference Endpoint, check its status and run inference on it. Check out the [Inference Endpoints](./inference_endpoints.md) guide for more details.

`InferenceClient.chat.completions.create` is simply an alias for `InferenceClient.chat_completion`. Check out the package reference of [chat\_completion()](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient.chat_completion) for more details. `base_url` and `api_key` parameters when instantiating the client are also aliases for `model` and `token`. These aliases have been defined to reduce friction when switching from `OpenAI` to `InferenceClient`.
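
For instance, the two snippets below are equivalent; this is only a sketch of the aliases, using a local endpoint URL and a placeholder token:

from huggingface_hub import InferenceClient

messages = [{"role": "user", "content": "Say hello"}]

# OpenAI-style aliases: `base_url` maps to `model`, `api_key` maps to `token`,
# and `chat.completions.create(...)` maps to `chat_completion(...)`.
openai_style = InferenceClient(base_url="http://localhost:8080", api_key="hf_****")
output = openai_style.chat.completions.create(messages=messages, max_tokens=10)

native_style = InferenceClient(model="http://localhost:8080", token="hf_****")
output = native_style.chat_completion(messages, max_tokens=10)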

Function Calling
-------------------------------------

Function calling allows LLMs to interact with external tools, such as defined functions or APIs. This enables users to easily build applications tailored to specific use cases and real-world tasks. `InferenceClient` implements the same tool calling interface as the OpenAI Chat Completions API. Here is a simple example of tool calling using [Nebius](https://nebius.com/) as the inference provider:

from huggingface_hub import InferenceClient

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current temperature for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country e.g. Paris, France"
                    }
                },
                "required": ["location"],
            },
        }
    }
]

client = InferenceClient(provider="nebius")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "What's the weather like the next 3 days in London, UK?"
        }
    ],
    tools=tools,
    tool_choice="auto",
)

print(response.choices[0].message.tool_calls[0].function.arguments)

Please refer to each provider’s documentation to verify which of its models support Function/Tool Calling.
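
Once the model returns a tool call, your own code is responsible for executing the corresponding function. A minimal sketch building on the example above (the local `get_weather` implementation is hypothetical):

import json

# Hypothetical local implementation of the `get_weather` tool declared above.
def get_weather(location: str) -> str:
    return f"Sunny, 21Β°C in {location}"

tool_call = response.choices[0].message.tool_calls[0]
if tool_call.function.name == "get_weather":
    # Arguments are typically returned as a JSON-encoded string.
    raw_args = tool_call.function.arguments
    args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
    print(get_weather(**args))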

Structured Outputs & JSON Mode
----------------------------------------------------------------

InferenceClient supports JSON mode for syntactically valid JSON responses and Structured Outputs for schema-enforced responses. JSON mode provides machine-readable data without strict structure, while Structured Outputs guarantee both valid JSON and adherence to a predefined schema for reliable downstream processing.

We follow the OpenAI API specs for both JSON mode and Structured Outputs. You can enable them via the `response_format` argument. Here is an example of Structured Outputs using [Cerebras](https://www.cerebras.ai/) as the inference provider:

from huggingface_hub import InferenceClient

json_schema = {
    "name": "book",
    "schema": {
        "properties": {
            "name": {
                "title": "Name",
                "type": "string",
            },
            "authors": {
                "items": {"type": "string"},
                "title": "Authors",
                "type": "array",
            },
        },
        "required": ["name", "authors"],
        "title": "Book",
        "type": "object",
    },
    "strict": True,
}

client = InferenceClient(provider="cerebras")

completion = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "system", "content": "Extract the books information."},
        {"role": "user", "content": "I recently read 'The Great Gatsby' by F. Scott Fitzgerald."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": json_schema,
    },
)

print(completion.choices[0].message)

Please refer to each provider’s documentation to verify which of its models support Structured Outputs and JSON Mode.
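
Since the returned content is a JSON string conforming to the schema, you can load it into a regular Python object. Continuing the example above:

import json

# Parse the schema-constrained response into a dictionary.
book = json.loads(completion.choices[0].message.content)
print(book["name"], "by", ", ".join(book["authors"]))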

Async client
-----------------------------

An async version of the client is also provided, based on `asyncio` and `aiohttp`. You can either install `aiohttp` directly or use the `[inference]` extra:

pip install --upgrade huggingface_hub[inference]
# or
# pip install aiohttp

After installation all async API endpoints are available via [AsyncInferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.AsyncInferenceClient). Its initialization and APIs are strictly the same as the sync-only version.

# Code must be run in an asyncio concurrent context.
# $ python -m asyncio
>>> from huggingface_hub import AsyncInferenceClient
>>> client = AsyncInferenceClient()

>>> image = await client.text_to_image("An astronaut riding a horse on the moon.")
>>> image.save("astronaut.png")

>>> async for token in await client.text_generation("The Huggingface Hub is", stream=True):
...     print(token, end="")
 a platform for sharing and discussing ML-related content.

For more information about the `asyncio` module, please refer to the [official documentation](https://docs.python.org/3/library/asyncio.html).
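
Since every call returns a coroutine, you can also fan out several requests concurrently, for example with `asyncio.gather`. A small sketch (the prompts are illustrative):

import asyncio
from huggingface_hub import AsyncInferenceClient

async def main():
    client = AsyncInferenceClient()
    prompts = ["a cat on the moon", "a dog on Mars"]  # illustrative prompts
    # Run both text-to-image requests concurrently and wait for all results.
    images = await asyncio.gather(*(client.text_to_image(p) for p in prompts))
    for prompt, image in zip(prompts, images):
        image.save(prompt.replace(" ", "_") + ".png")

asyncio.run(main())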

MCP Client
-------------------------

The `huggingface_hub` library now includes an experimental [MCPClient](/docs/huggingface_hub/v0.33.4/en/package_reference/mcp#huggingface_hub.MCPClient), designed to empower Large Language Models (LLMs) with the ability to interact with external Tools via the [Model Context Protocol](https://modelcontextprotocol.io) (MCP). This client extends an [AsyncInferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.AsyncInferenceClient) to seamlessly integrate Tool usage.

The [MCPClient](/docs/huggingface_hub/v0.33.4/en/package_reference/mcp#huggingface_hub.MCPClient) connects to MCP servers (either local `stdio` scripts or remote `http`/`sse` services) that expose tools. It feeds these tools to an LLM (via [AsyncInferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.AsyncInferenceClient)). If the LLM decides to use a tool, [MCPClient](/docs/huggingface_hub/v0.33.4/en/package_reference/mcp#huggingface_hub.MCPClient) manages the execution request to the MCP server and relays the Tool’s output back to the LLM, often streaming results in real-time.

In the following example, we use the [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) model via the [Nebius](https://nebius.com/) inference provider. We then add a remote MCP server, in this case an SSE server that makes the Flux image generation tool available to the LLM.

import os

from huggingface_hub import ChatCompletionInputMessage, ChatCompletionStreamOutput, MCPClient

async def main():
    async with MCPClient(
        provider="nebius",
        model="Qwen/Qwen2.5-72B-Instruct",
        api_key=os.environ["HF_TOKEN"],
    ) as client:
        await client.add_mcp_server(type="sse", url="https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse")

        messages = [
            {
                "role": "user",
                "content": "Generate a picture of a cat on the moon",
            }
        ]

        async for chunk in client.process_single_turn_with_tools(messages):
            # Log messages
            if isinstance(chunk, ChatCompletionStreamOutput):
                delta = chunk.choices[0].delta
                if delta.content:
                    print(delta.content, end="")

            # Or tool calls
            elif isinstance(chunk, ChatCompletionInputMessage):
                print(
                    f"\nCalled tool '{chunk.name}'. Result: '{chunk.content if len(chunk.content) < 1000 else chunk.content[:1000] + '...'}'"
                )

if __name__ == "__main__":
    import asyncio

    asyncio.run(main())

For even simpler development, we offer a higher-level [Agent](/docs/huggingface_hub/v0.33.4/en/package_reference/mcp#huggingface_hub.Agent) class. This ‘Tiny Agent’ simplifies creating conversational Agents by managing the chat loop and state, essentially acting as a wrapper around [MCPClient](/docs/huggingface_hub/v0.33.4/en/package_reference/mcp#huggingface_hub.MCPClient). It’s designed to be a simple while loop built right on top of an [MCPClient](/docs/huggingface_hub/v0.33.4/en/package_reference/mcp#huggingface_hub.MCPClient). You can run these Agents directly from the command line:

# install latest version of huggingface_hub with the mcp extra
pip install -U huggingface_hub[mcp]
# Run an agent that uses the Flux image generation tool
tiny-agents run julien-c/flux-schnell-generator

When launched, the Agent will load, list the Tools it has discovered from its connected MCP servers, and then it’s ready for your prompts!

Advanced tips
-------------------------------

In the sections above, we saw the main aspects of [InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient). Let’s dive into some more advanced tips.

### Billing

As an HF user, you get monthly credits to run inference through various providers on the Hub. The amount of credits you get depends on your type of account (Free, PRO, or Enterprise Hub). You get charged for every inference request, depending on the provider’s pricing table. By default, the requests are billed to your personal account. However, it is possible to set the billing so that requests are charged to an organization you are part of by simply passing `bill_to="<your_org_name>"` to `InferenceClient`. For this to work, your organization must be subscribed to Enterprise Hub. For more details about billing, check out [this guide](https://huggingface.co/docs/api-inference/pricing#features-using-inference-providers).

>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient(provider="fal-ai", bill_to="openai")
>>> image = client.text_to_image(
...     "A majestic lion in a fantasy forest",
...     model="black-forest-labs/FLUX.1-schnell",
... )
>>> image.save("lion.png")

Note that it is NOT possible to charge another user or an organization you are not part of. If you want to grant someone else some credits, you must create a joint organization with them.

### Timeout

Inference calls can take a significant amount of time. By default, [InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient) will wait "indefinitely" until the inference completes. If you want more control over your workflow, you can set the `timeout` parameter to a specific value in seconds. If the timeout delay expires, an [InferenceTimeoutError](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceTimeoutError) is raised, which you can catch in your code:

>>> from huggingface_hub import InferenceClient, InferenceTimeoutError
>>> client = InferenceClient(timeout=30)
>>> try:
...     client.text_to_image(...)
... except InferenceTimeoutError:
...     print("Inference timed out after 30s.")

### Binary inputs

Some tasks require binary inputs, for example, when dealing with images or audio files. In this case, [InferenceClient](/docs/huggingface_hub/v0.33.4/en/package_reference/inference_client#huggingface_hub.InferenceClient) tries to be as permissive as possible and accepts different types:

*   raw `bytes`
*   a file-like object, opened as binary (`with open("audio.flac", "rb") as f: ...`)
*   a path (`str` or `Path`) pointing to a local file
*   a URL (`str`) pointing to a remote file (e.g. `https://...`). In this case, the file will be downloaded locally before being sent to the API.

>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient()
>>> client.image_classification("https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Cute_dog.jpg/320px-Cute_dog.jpg")
[{'score': 0.9779096841812134, 'label': 'Blenheim spaniel'}, ...]
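
The same call also works with a local path or raw bytes, for example (assuming a local `dog.jpg` file exists):

>>> from pathlib import Path
>>> client.image_classification(Path("dog.jpg"))                # local path
>>> client.image_classification(Path("dog.jpg").read_bytes())   # raw bytes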
