Latexifier
Posts: 2
Joined: Mon Oct 29, 2012 4:41 pm

### Introducing LaTeXifier – A Converter from PDF to LaTeX

Hi everybody,

I would like to present a project named LaTeXifier, run by a group of Master's students at ENS Lyon (a French university). Our goal is to provide free software able to convert a PDF generated with LaTeX into a source file (*.tex) whose content is as close as possible to the original.

Such a tool can be useful when you lose your sources, or when you are not the creator of the PDF. We have already developed the core and some basic packages, for recognizing text, lists and sections, and we will add packages progressively.

I would like to have your opinion. What do you think about our project? What do you primarily need? Which packages would you like us to handle?

Benjamin

josephwright
Site Moderator
Posts: 814
Joined: Tue Jul 01, 2008 2:19 pm
I wonder what you mean by 'as close to your input as possible'. From first principles, you can't tell whether TeX, LaTeX or ConTeXt was used for a file, even if you can tell from the fonts that it's likely to be TeX-related. How do you handle, for example, mathematics (which might be hand-adjusted or use something like breqn or nath)?
Joseph Wright

Stefan Kottwitz
Posts: 9593
Joined: Mon Mar 10, 2008 9:44 pm
Hi Benjamin,

welcome to the board!

To my eyes, it seems very challenging. Great that you are starting such a project! Whether or not you achieve the objective, you will learn by developing it, and the LaTeX world will become richer.

I guess it's hardly possible to always come close to the original source. Documents can be based on various classes and can load any of hundreds or thousands of packages. Macros from classes and packages can be used, and there can even be user-defined macros. Perhaps you are planning it like this: based on a PDF document, which your software analyzes,

• assumptions are made regarding a base document class (other classes are often derived from base classes), so you choose a class
• settings are generated which seem to match the layout (such as options to geometry, even if the original document used typearea instead)
• the fonts used are read from the PDF document information, and either the corresponding font package is loaded or you switch to XeLaTeX if meaningful (TrueType, OpenType detected)
• based on the recognized structures, you build itemize and enumerate lists
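
To make the last bullet concrete, here is a minimal Python sketch of how itemize and enumerate lists might be rebuilt from extracted text lines. It is only an illustration, not how LaTeXifier works: the function names `classify` and `lines_to_latex` are hypothetical, the input is assumed to be one already-extracted string per list entry, and special characters are not escaped.

```python
import re

# Assumed markers: a bullet-like glyph, or a number followed by "." or ")".
BULLET = re.compile(r"^\s*[•\-\*]\s+(.*)$")
NUMBER = re.compile(r"^\s*\d+[.)]\s+(.*)$")

def classify(line):
    """Return (kind, text); kind is 'itemize', 'enumerate' or None."""
    m = BULLET.match(line)
    if m:
        return "itemize", m.group(1)
    m = NUMBER.match(line)
    if m:
        return "enumerate", m.group(1)
    return None, line.strip()

def lines_to_latex(lines):
    """Wrap consecutive list-like lines in the matching LaTeX environment."""
    out = []
    current = None  # name of the environment currently open, if any
    for line in lines:
        kind, text = classify(line)
        if kind != current:
            if current is not None:
                out.append(f"\\end{{{current}}}")
            if kind is not None:
                out.append(f"\\begin{{{kind}}}")
            current = kind
        out.append(f"  \\item {text}" if kind else text)
    if current is not None:
        out.append(f"\\end{{{current}}}")
    return "\n".join(out)

print(lines_to_latex(["Intro", "• one", "• two", "1. first", "Outro"]))
```

A real converter would also need indentation to detect nested lists and multi-line items, which this sketch ignores.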

Math formulas can be very difficult. Generally, I guess you are still at the beginning, because the topic seems to be very complex and you mention just sections and lists. Perhaps present it as introducing your plans and asking for opinions, rather than introducing software which many people may doubt can be made at all.

Well, if I lost the source of an important document but still had the PDF file, I would be glad if there were a tool which generates at least a framework-like source document with matching settings as a start, even if I had to redo the math and more.

Finally, I guess it doesn't matter whether the PDF was generated by LaTeX. It seems your tool could be especially useful if it were able to convert, for example, a document made with MS Word into high-quality LaTeX source code.

Stefan

Latexifier
Posts: 2
Joined: Mon Oct 29, 2012 4:41 pm

By "as close as possible", we mean that the output .tex file will be quite similar in content (and meaning). It could differ from the original PDF on graphical issues, but we consider that the user could (and often wants to) change the layout on their own.

Stefan_K, you are quite right about how our tool will work.
We make assumptions about the document class and the packages, and we divide the PDF so that each block is associated with a recognized package.
But, as I said previously, we won't try to match the layout because it's quite meaningless and easy to change.

We began with the packages which seemed the most useful. The software will be free, so the community will be able to handle the packages they need. That's why we are asking which packages are wanted, so that we can maybe include them in our project if enough people ask for them.

About math formulas, it's true that they are rather difficult, but we have good hope of producing a satisfactory tool. In fact, the most difficult part is handling symbols with arguments, such as \frac{}{} or \sqrt{}: for a fraction, the PDF writes one argument, draws a bar and then writes the other argument, so we have to check positions in order to find the beginning and the end of each argument. Right now, we can handle most cases. For simple symbols, we are just completing the tables of symbols that are handled automatically.
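
As an illustration of the position check for fractions, here is a minimal Python sketch. It is not the project's actual code: the `Glyph` and `Rule` types, the `rebuild_fraction` function and the `max_gap` threshold are all assumptions made for the example, and coordinates are taken to follow the PDF convention where y grows upward.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    text: str
    x: float  # horizontal centre of the glyph
    y: float  # baseline; y grows upward (assumed PDF convention)

@dataclass
class Rule:
    x0: float  # left end of the horizontal bar
    x1: float  # right end of the horizontal bar
    y: float   # vertical position of the bar

def rebuild_fraction(glyphs, rule, max_gap=12.0):
    """Split glyphs around a fraction bar.

    Returns (latex, leftover_glyphs), or (None, glyphs) when the bar does
    not have glyphs both above and below it.
    """
    num, den, rest = [], [], []
    for g in glyphs:
        near_bar = rule.x0 <= g.x <= rule.x1 and abs(g.y - rule.y) <= max_gap
        if not near_bar:
            rest.append(g)
        elif g.y > rule.y:
            num.append(g)  # baseline above the bar: numerator
        else:
            den.append(g)  # baseline on or below the bar: denominator
    if not num or not den:
        return None, glyphs

    def join(gs):
        # Read each argument left to right.
        return "".join(g.text for g in sorted(gs, key=lambda g: g.x))

    return f"\\frac{{{join(num)}}}{{{join(den)}}}", rest

glyphs = [Glyph("x", 10, 100), Glyph("=", 20, 100), Glyph("a", 32, 107),
          Glyph("+", 37, 107), Glyph("b", 42, 107), Glyph("c", 37, 94)]
latex, rest = rebuild_fraction(glyphs, Rule(30, 45, 102.5))
print(latex)
```

Nested fractions, and \sqrt{} where the bar sits above a single argument, would need extra rules that this sketch does not attempt.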

The project began in October, so we are just at the beginning, but our software already works on simple PDF files. We hope to handle files generated in other ways too, but it will essentially depend on whether your file is LaTeX-like. In fact, we can handle different fonts (MS Word is acting very weird on that...), but we haven't checked the margins, which may be different and could make the computation more difficult.

Thanks again for your remarks! If you have any questions left, just ask!

didier
Posts: 1
Joined: Thu Jan 10, 2013 3:23 pm
It's a great and ambitious project.

LaTeX is a set of macros built on top of TeX.
In fact, the resulting PDF file is the result of expanding the LaTeX macros into TeX.

Hence, I suggest you proceed in two steps, reversing that process.
First, translate the PDF into TeX, which should not be so tricky.

Second, if possible, attempt to recognize sequences of TeX commands as macros. I guess this means expanding all the forms of the LaTeX macros, comparing them with the TeX sequences obtained in the first step, and substituting the macros for the sequences that match.
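
A toy Python sketch of that second step, under heavy assumptions: the token names and the `EXPANSIONS` table below are made up for illustration (real LaTeX expansions are far longer and depend on context), and `fold_macros` is a hypothetical name. It only shows the idea of greedy, longest-first substitution of known expansions.

```python
# Toy table: macro name -> low-level token sequence it is assumed to expand to.
# These expansions are invented for the example; they are not real TeX output.
EXPANSIONS = {
    r"\section": ("vskip", "bigfont", "bold"),
    r"\textbf": ("bold",),
    r"\emph": ("italic",),
}

def fold_macros(tokens, expansions=EXPANSIONS):
    """Scan left to right, replacing known expansions, longest match first."""
    patterns = sorted(expansions.items(), key=lambda kv: len(kv[1]), reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for name, seq in patterns:
            # Skip empty sequences so the scan always advances.
            if seq and tuple(tokens[i:i + len(seq)]) == seq:
                out.append(name)
                i += len(seq)
                break
        else:
            # No expansion starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out

print(fold_macros(["vskip", "bigfont", "bold", "text", "italic", "word", "bold"]))
```

Trying longer sequences first matters here: otherwise the trailing "bold" of the assumed \section expansion would be folded into \textbf before \section could match.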

Vignesh Kumar
Posts: 1
Joined: Wed Jan 08, 2020 9:24 am
May I know more about LaTeXifier and how to get it?