cut and pasting accented characters from pdf file

frabjous
Posts: 2064
Joined: Fri Mar 06, 2009 12:20 am

Post by frabjous »

novicedude wrote: All the fonts show up real nice in the PDF file, with the exception of the first café. The é shows up as several different symbols, mostly boxes, depending upon which font I'm using.
The first é worked fine when I tested your code with every font I tried (though I had to test with different fonts, since I don't own the rather expensive ones you have! I prefer open source...). Be sure that TeXnicCenter (TXC) is saving the file using Unicode (UTF-8) character encoding. I think older versions of TXC don't support Unicode, so you'll need to use 2.0 alpha or beta or whatever it is now.
novicedude wrote: Then when I attempt to copy & paste into Word, some of the ligatures wouldn't translate correctly. Some of them did. The three-letter ligatures (ffi and ffl) translated correctly, but the two-letter ones (fi and fl) didn't. With the Garamond font, the Th in The didn't translate, but with the other fonts I used, it translated okay.
Does it make a difference what font Word was using when you pasted? I would think the ligatures would only show up if the Word font had the same ones.

novicedude
Posts: 16
Joined: Tue May 25, 2010 10:38 pm

Post by novicedude »

frabjous wrote: The first é worked fine when I tested your code with every font I tried (though I had to test with different fonts, since I don't own the rather expensive ones you have! I prefer open source...). Be sure that TXC is saving the file using Unicode (UTF-8) character encoding. I think older versions of TXC don't support Unicode, so you'll need to use 2.0 alpha or beta or whatever it is now.
I tend to think the issue isn't so much with TeXnicCenter. When I run it with pdflatex, it works, but with xelatex it doesn't. That's with both of them running from TeXnicCenter.
frabjous wrote: Does it make a difference what font Word was using when you pasted? I would think the ligatures would only show up if the Word font had the same ones.
Word normally pastes it in whatever font the original was in. However, an easier way to test it is to attempt a text search with Acrobat. With pdflatex and using those \pdfglyphtounicode mapping commands, I could find words with ligatures (e.g. fluid), but with xelatex, searches only work when there aren't any ligatures (e.g. Fluid, with a capital F).
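
For readers following along, the pdflatex setup referred to above usually looks something like this. It is only a minimal sketch, assuming the stock glyphtounicode.tex file that ships with TeX Live and MiKTeX is available; the test sentence is made up for illustration.

```latex
% Compile with pdflatex.
\documentclass{article}
\usepackage[T1]{fontenc}
% Load the standard table of \pdfglyphtounicode mappings and ask pdfTeX
% to write ToUnicode maps into the PDF, so that ligature glyphs such as
% fi and fl can be searched and copied as ordinary letters.
\input{glyphtounicode}
\pdfgentounicode=1
\begin{document}
The fluid in the office: fi, fl, ff, ffi, ffl.
\end{document}
```

Searching the resulting PDF for "fluid" is a quick way to check whether the mapping took effect with a given font.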
frabjous
Posts: 2064
Joined: Fri Mar 06, 2009 12:20 am

Post by frabjous »

novicedude wrote: I tend to think the issue isn't so much with TeXnicCenter. When I run it with pdflatex, it works, but with xelatex it doesn't. That's with both of them running from TeXnicCenter.
Not supporting UTF-8 is a known problem with TXC 1.0. XeLaTeX expects its input to be Unicode (UTF-8). pdfLaTeX also works with other encodings, like Latin1, which is probably what you're using. That would explain why the é survives under pdflatex but not under xelatex.
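
To make the encoding point concrete, here is a minimal test file, offered as a sketch rather than a diagnosis of the original document. It assumes the editor really does save the file as UTF-8 and relies on the default Latin Modern font that fontspec selects.

```latex
% Compile with xelatex. The file itself must be saved as UTF-8;
% if the editor saves it as Latin-1 instead, the é below will not
% be read correctly and may show up as a box or a wrong symbol.
\documentclass{article}
\usepackage{fontspec} % default font is Latin Modern, which has é
\begin{document}
café
\end{document}
% Under pdflatex the counterpart would be to declare the encoding
% actually used, e.g. \usepackage[latin1]{inputenc} or
% \usepackage[utf8]{inputenc}, together with \usepackage[T1]{fontenc}.
```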

As for the pasting, does it matter what PDF software you use? What are you using now? Adobe Reader?

I don't really understand what the big deal with pasting into Word is. Surely, there are better ways to convert LaTeX to Word, like latex2rtf, pandoc or TeX4ht.
novicedude
Posts: 16
Joined: Tue May 25, 2010 10:38 pm

Post by novicedude »

Copying from Acrobat Reader and from Acrobat Pro yields the same result (ligatures as boxes). I need to get the ligatures right for two reasons. First, so that searches work. Second, I often distribute the PDF files and I want some of the recipients to be able to copy and paste the contents (much of what I make isn't copyrighted or anything like that).
frabjous
Posts: 2064
Joined: Fri Mar 06, 2009 12:20 am

Post by frabjous »

You could simply disable the ligatures. See here.
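
One way to do that under XeLaTeX, not necessarily the one the link described, is through the fontspec package. This is a sketch only: the font name is a placeholder assumption, so substitute whatever font is installed on your system.

```latex
% Compile with xelatex; save the file as UTF-8.
\documentclass{article}
\usepackage{fontspec}
% Ligatures=NoCommon switches off the standard (common) ligatures
% such as fi, fl, ff, ffi and ffl, so each letter becomes its own
% glyph in the PDF and can be searched and copied individually.
% "Latin Modern Roman" is a placeholder; use any installed font.
\setmainfont[Ligatures=NoCommon]{Latin Modern Roman}
\begin{document}
The fluid in the office: café.
\end{document}
% Under pdflatex, the microtype package offers a similar switch:
%   \usepackage{microtype}
%   \DisableLigatures{encoding = *, family = *}
```

The cost is purely typographic: the text loses its ligatures in print as well, which may or may not matter for the documents being distributed.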