kalle07 committed
Commit 28cd386 · verified · 1 Parent(s): 0a321d0

Update README.md

Files changed (1):
  1. README.md +4 -2

README.md CHANGED
@@ -24,7 +24,7 @@ better input = better output<br>
 
 ...
 
-Most LLM applications simply convert your PDF to txt, nothing more; it is as if you saved your PDF as a txt file. Blocks of text that are close together are often mixed up, and tables cannot be read logically.
+Most LLM applications simply convert your PDF to txt, nothing more; it is as if you saved your PDF as a txt file. For ordinary flowing-text books this is quite okay! But blocks of text that are close together are often mixed up, and tables cannot be read logically.
 Therefore it is better to convert it with the help of a <b>"Parser"</b>. The embedder can then find better context.<br>
 I work with "<b>pdfplumber/pdfminer</b>", non-OCR, so it is very fast!<br>
 <ul style="line-height: 1.05;">
@@ -34,7 +34,7 @@ I work with "<b>pdfplumber/pdfminer</b>", non-OCR, so it is very fast!<br>
 <li>Instant view of the result; click one pdf at the top of the list</li>
 <li>Converts common tables to JSON format inside the txt file, readable for the embedder</li>
 <li>Adds the absolute PAGE number to each page</li>
-<li>Adds the label “Chapter” for large fonts and/or “important” for bold fonts</li>
+<li>Adds the label “chapter” for large fonts and/or “important” for bold fonts</li>
 <li>All txt files are created in the original folder of the PDF</li>
 <li>All previous txt files are overwritten</li>
 <li>approx. 5 to 20 pages/sec - depends on complexity and system power</li>
@@ -46,6 +46,8 @@ It is really hard for me with the GUI and the functions, and in addition to compile it
 For the python file you need to import the missing libraries.<br>
 Of course there is still a lot of room for optimization (saving/error handling) or the use of other parser libraries, but it's a start.
 <br><br>
+I am working on a 50% faster version. In addition, the GUI should allow more influence on the processing, e.g. faster raw text, trim margins (yes/no) with a user-set %, set the size of unimportant text blocks, layout with line breaks, or force more continuous text.<br>
+Give me a hand ;) <br>
 ...
 <br>
 I also have a "<b>docling</b>" parser with OCR (a GPU is needed for fast processing); it is only a python file, not compiled.<br>
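The tables-as-JSON idea from the README can be sketched roughly like this. It assumes the list-of-rows shape that pdfplumber's `page.extract_tables()` returns (first row = header, cells are strings or `None`); `table_to_json` is a hypothetical helper for illustration, not the tool's actual code.

```python
import json

def table_to_json(table):
    """Convert one extracted table (list of rows, first row = header --
    the shape pdfplumber's page.extract_tables() returns) into a JSON
    string of row objects that still reads sensibly inside a txt file."""
    def clean(row):
        # Merged or empty PDF cells come back as None; map them to "".
        return ["" if cell is None else str(cell) for cell in row]
    header, *rows = table
    header = clean(header)
    return json.dumps([dict(zip(header, clean(row))) for row in rows],
                      ensure_ascii=False)

# A small parts table, as a parser might extract it:
table = [["Part", "Qty"], ["Bolt M6", "4"], ["Washer", None]]
print(table_to_json(table))
# → [{"Part": "Bolt M6", "Qty": "4"}, {"Part": "Washer", "Qty": ""}]
```

Emitting each row as an object keyed by its header labels means any chunk the embedder sees carries its column names with it, which is why this is more readable for an embedder than a flattened table.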