# Installation

In [2]:
!pip install transformers torch pytesseract sentencepiece sacremoses requests beautifulsoup4 pillow --quiet

# 1 - Sentiment analysis

Loading a sentiment analysis model:

In [3]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis", verbose = 0)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Running the model on an example:

In [3]:
review = "Your product is so bad!"

In [4]:
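# Pass the variable, not the quoted name: sentiment_pipeline("review") scores the
# literal word "review" and returned [{'label': 'POSITIVE', 'score': 0.9946893453598022}].
sentiment_pipeline(review)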

# 2 - Translation

Loading a translation model:

In [5]:
en_fr_pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

Running the model on an example:

In [6]:
en_fr_pipe("Where is the library?")

[{'translation_text': 'Où est la bibliothèque ?'}]

# Your Turn!

Using the examples above and what you have learned about Hugging Face, carry out the following tasks:

# 3 - Summarizing a website


The code below fetches the text of a web article. Find a model on Hugging Face that can summarize it.

In [2]:
import requests
from bs4 import BeautifulSoup
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

data = requests.get("https://www.bbc.com/news/technology-66128106")
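# Name the parser explicitly; omitting it makes BeautifulSoup guess and emit a warning
text = BeautifulSoup(data.content, "html.parser").find("article").text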

In [3]:
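# Split a token sequence into windows of max_length that share `overlap` tokens
def chunk_text_with_overlap(tokens, max_length, overlap):
    # A zero step makes range() raise, and a negative one yields no chunks at all
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    chunks = []
    for i in range(0, len(tokens), max_length - overlap):
        chunk = tokens[i:i + max_length]
        chunks.append(chunk)
    return chunks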

# Load the tokenizer and model; the loop below calls model.generate directly,
# so no separate summarization pipeline object is needed
model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Get the maximum length from the model configuration
max_length = model.config.max_position_embeddings

# Define the overlap
overlap = 60  # Adjust overlap as needed

# Tokenize the text
tokens = tokenizer(text, return_tensors='pt', truncation=False)['input_ids'][0]

# Chunk the tokens with overlap
chunks = chunk_text_with_overlap(tokens, max_length, overlap)

# Summarize each chunk
summaries = []
for chunk in chunks:
    input_ids = chunk.unsqueeze(0)  # Add batch dimension
    summary_ids = model.generate(
        input_ids,
        max_length=max_length,  # upper bound on the generated summary length, in tokens
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True,
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    summaries.append(summary)

# Concatenate the summaries
summary = ' '.join(summaries)

print(summary)




'Inevitable' jobs will be more automated, says new AI adviser. Tech entrepreneur Ian Hogarth is five weeks into his role as chair of the government's new AI taskforce. There have already been reports of multiple job losses as a result of companies choosing to use AI tools instead of humans. Ian Hogarth says the UK can still take a seat at the centre of the AI revolution. He says nation states will start to see their role in directing critical AI infrastructure in a new way. Despite the trials ahead, he is optimistic the UK is not too late.
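The overlap chunking used above is easier to check on a toy list than on real token ids. The sketch below is a hypothetical variant, not the notebook's function: `chunk_with_overlap` and the early `break` are assumptions added for illustration. The loop in the cell above emits one more short window whenever the last step starts inside tokens the previous window already covered; stopping as soon as a window reaches the end avoids that.

```python
def chunk_with_overlap(tokens, max_length, overlap):
    """Split a sequence into windows of at most max_length items sharing `overlap` items."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    step = max_length - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_length])
        # Stop once a window reaches the end, so no trailing chunk
        # consists only of items the previous window already covered.
        if start + max_length >= len(tokens):
            break
    return chunks


toy_tokens = list(range(10))
windows = chunk_with_overlap(toy_tokens, max_length=4, overlap=1)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the same inputs, the loop in the cell above would also return a fourth window, `[9]`, which would be summarized on its own.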


# 4 - Question Answering

Using the article text from the previous exercise, build a pipeline that extracts the amount of money raised, as stated in the article.

In [6]:
questions_pipe = pipeline("question-answering", model="deepset/roberta-base-squad2")
questions_pipe(
    question = "how much money she raised?",
    context = text
)

{'score': 2.4412167931586737e-06,
 'start': 4653,
 'end': 4705,
 'answer': 'hoovered up by those companies which can afford them'}
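The score in the result above is about 0.0000024, so the returned span is best read as "no confident answer found" rather than as the amount raised. A minimal sketch of filtering on that score follows; the helper name `answer_or_none` and the 0.1 threshold are assumptions chosen for illustration, not values recommended by the model's authors.

```python
def answer_or_none(result, min_score=0.1):
    """Return the answer text only if the model's confidence reaches min_score."""
    if result["score"] >= min_score:
        return result["answer"]
    return None


# The result recorded in the cell output above
recorded = {
    "score": 2.4412167931586737e-06,
    "start": 4653,
    "end": 4705,
    "answer": "hoovered up by those companies which can afford them",
}

print(answer_or_none(recorded))  # None
```

When the helper returns `None`, rephrasing the question or checking that the article actually states the figure are reasonable next steps.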

# 5 - Image-to-Text

Find a model that generates a description (caption) of the following image. Then try it on other images.

In [7]:
image="https://images.theconversation.com/files/118670/original/image-20160414-4709-vaix4b.jpg?ixlib=rb-1.1.0&rect=0%2C551%2C3444%2C1670&q=45&auto=format&w=1356&h=668&fit=crop"

In [12]:
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")


captioner(image)



[{'generated_text': 'a large group of people sitting on a grassy field '}]

# 6 - Text Classification

Find a model on Hugging Face that can classify the following text by level of urgency. The candidate labels, including the urgency levels, are defined in the variable labels.

In [13]:
# Name the task explicitly instead of relying on it being inferred from the model
classification = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# MODIFY THE TEXT WITHIN THE QUOTES
text = "I have a problem with my iPhone that needs to be resolved asap!!"

# If you want to modify the list of categories, change the classification words within the brackets.
# Make sure to keep them in quotes and separated by commas.
labels = ["urgent", "not urgent", "phone", "tablet", "computer"]

classification(text, labels)



{'sequence': 'I have a problem with my iPhone that needs to be resolved asap!!',
 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'],
 'scores': [0.5097073316574097,
  0.48252567648887634,
  0.004156067967414856,
  0.0023611385840922594,
  0.0012497848365455866]}
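The five scores above sum to 1, so "urgent" and "phone" are competing for the same probability mass even though they answer different questions (urgency versus device). A minimal sketch of reading the urgency decision on its own is shown below. It assumes the pipeline's default single-label mode, where scores are normalized across all candidate labels, so renormalizing over a subset should match what a rerun with only those labels would give; `restrict_to` is a hypothetical helper name.

```python
# The result recorded in the cell output above (sequence omitted)
recorded = {
    "labels": ["urgent", "phone", "computer", "not urgent", "tablet"],
    "scores": [
        0.5097073316574097,
        0.48252567648887634,
        0.004156067967414856,
        0.0023611385840922594,
        0.0012497848365455866,
    ],
}


def restrict_to(result, keep):
    """Renormalize the scores over a subset of the candidate labels."""
    pairs = [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if label in keep
    ]
    total = sum(score for _, score in pairs)
    return {label: score / total for label, score in pairs}


urgency = restrict_to(recorded, {"urgent", "not urgent"})
print(urgency)
```

Running the pipeline twice, once with the urgency labels and once with the device labels, is the more direct alternative if the extra inference time is acceptable.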

# 7 - Named Entity Recognition

Find a model that recognizes different types of entities, such as locations (LOC), organizations (ORG), persons (PER), and miscellaneous (MISC). Then test it on this text; it should find one person and one city.

In [15]:
text = "My name is John and I come from Berlin"

In [16]:
ner_pipeline = pipeline("token-classification", model="dslim/bert-base-NER")

ner_pipeline(text)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity': 'B-PER',
  'score': 0.99863094,
  'index': 4,
  'word': 'John',
  'start': 11,
  'end': 15},
 {'entity': 'B-LOC',
  'score': 0.9996766,
  'index': 9,
  'word': 'Berlin',
  'start': 32,
  'end': 38}]
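The labels in the output above carry a `B-` prefix marking the beginning of an entity (continuation tokens would be tagged `I-`). A small sketch of collapsing the result into a type-to-words mapping follows; `words_by_type` is a hypothetical helper, and it treats every token as its own word, which is enough for this single-token example.

```python
from collections import defaultdict

# The entities recorded in the cell output above
recorded = [
    {"entity": "B-PER", "score": 0.99863094, "index": 4,
     "word": "John", "start": 11, "end": 15},
    {"entity": "B-LOC", "score": 0.9996766, "index": 9,
     "word": "Berlin", "start": 32, "end": 38},
]


def words_by_type(entities):
    """Group recognized words by entity type, dropping the B-/I- prefix."""
    grouped = defaultdict(list)
    for ent in entities:
        entity_type = ent["entity"].split("-", 1)[-1]
        grouped[entity_type].append(ent["word"])
    return dict(grouped)


print(words_by_type(recorded))  # {'PER': ['John'], 'LOC': ['Berlin']}
```

For names spanning several tokens, passing `aggregation_strategy="simple"` when creating the pipeline should merge the pieces into one entity before any grouping like this is needed.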

# 8 - Speech-to-Text

Find a model on Hugging Face that transcribes audio to text.



In [4]:
transcription_pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")


audio_url = "https://voiceage.com/wbsamples/in_mono/Television.wav"
transcription_pipe(audio_url)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


{'text': ' It was dark times Harry, dark times. Voldemort started to gather some followers.'}