DeDeckerThomas committed on
Commit 109e2cc · 1 Parent(s): 7fa2224

Update README.md

Files changed (1)
  1. README.md +54 -3
README.md CHANGED
@@ -122,14 +122,65 @@ For more in detail information, you can take a look at the training notebook (li
 | Early Stopping Patience | 1 |
 
 ### Preprocessing
+The documents in the dataset are already preprocessed into lists of words with the corresponding keyphrases. The only steps that still need to be done are tokenization and joining all keyphrases into one string with a separator of choice (;).
 ```python
-
+def pre_process_keyphrases(text_ids, kp_list):
+    kp_order_list = []
+    kp_set = set(kp_list)
+    text = tokenizer.decode(
+        text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
+    )
+    text = text.lower()
+    for kp in kp_set:
+        kp = kp.strip()
+        kp_index = text.find(kp.lower())
+        kp_order_list.append((kp_index, kp))
+    kp_order_list.sort()
+    present_kp, absent_kp = [], []
+    for kp_index, kp in kp_order_list:
+        if kp_index < 0:
+            absent_kp.append(kp)
+        else:
+            present_kp.append(kp)
+    return present_kp, absent_kp
+
+def preprocess_function(samples):
+    processed_samples = {"input_ids": [], "attention_mask": [], "labels": []}
+    for i, sample in enumerate(samples[dataset_document_column]):
+        input_text = " ".join(sample)
+        inputs = tokenizer(
+            input_text,
+            padding="max_length",
+            truncation=True,
+        )
+        present_kp, absent_kp = pre_process_keyphrases(
+            text_ids=inputs["input_ids"],
+            kp_list=samples["extractive_keyphrases"][i]
+            + samples["abstractive_keyphrases"][i],
+        )
+        keyphrases = present_kp
+        keyphrases += absent_kp
+        target_text = f" {keyphrase_sep_token} ".join(keyphrases)
+        with tokenizer.as_target_tokenizer():
+            targets = tokenizer(
+                target_text, max_length=40, padding="max_length", truncation=True
+            )
+        targets["input_ids"] = [
+            (t if t != tokenizer.pad_token_id else -100)
+            for t in targets["input_ids"]
+        ]
+        for key in inputs.keys():
+            processed_samples[key].append(inputs[key])
+        processed_samples["labels"].append(targets["input_ids"])
+    return processed_samples
 ```
-
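A minimal sketch of how this preprocessing could be applied, assuming the keyphrase dataset is loaded as a 🤗 `datasets` object named `dataset` (that name is not part of the diff):

```python
# Illustrative only: map the batched preprocessing function over the dataset.
# `dataset` is an assumed name for the loaded keyphrase dataset.
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset.column_names,
)
```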
 
 ### Postprocessing
+For the post-processing, you will need to split the string based on the keyphrase separator.
 ```python
-
+def extract_keyphrases(examples):
+    return [example.split(keyphrase_sep_token) for example in examples]
  ```
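A minimal sketch of how the post-processing could be used on generated output (`model` and `document_text` are assumed names, not taken from the diff):

```python
# Illustrative only: generate for one document, decode, then split on the separator.
inputs = tokenizer(document_text, truncation=True, return_tensors="pt")
outputs = model.generate(**inputs, max_length=40)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
keyphrases = extract_keyphrases(decoded)  # list containing one list of keyphrase strings
```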
+
 ## 📝 Evaluation results
 
 One of the traditional evaluation methods is precision, recall and F1-score @k,m, where k is the number of top predicted keyphrases that are considered and m is the average number of predicted keyphrases.
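As a rough illustration of that metric (the function and variable names below are illustrative, not taken from the evaluation code), precision, recall and F1 at a cutoff k can be computed per document as:

```python
# Illustrative sketch: precision/recall/F1@k for a single document.
# `predicted` and `gold` are assumed to be lists of normalized keyphrase strings.
def scores_at_k(predicted, gold, k):
    top_k = predicted[:k]                          # only the first k predictions count
    true_positives = len(set(top_k) & set(gold))
    precision = true_positives / len(top_k) if top_k else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```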