Hyphenation of similar words

meho_r
Posts: 823
Joined: Tue Aug 07, 2007 5:28 pm

Hyphenation of similar words

Post by meho_r »

In my language (Bosnian) there are many similar words that should have similar hyphenation break points, but it's a hard job putting all of them in \hyphenation{} in the preamble. For example, consider these words:
"pretpostavka", "pretpostavke", "pretpostavku", "pretpostaviti", "pretpostavljati" etc.

I would like to set this scheme: "pret-po-stav" to apply to all variants of this word (common part) regardless of the suffix. Is it possible? Is there some kind of "regex" that can be used in \hyphenation for this situation (something like: "pret-po-stav*" where "*" will replace suffixes)?

black-wolf
Posts: 7
Joined: Wed Jul 02, 2008 10:58 pm

Hyphenation of similar words

Post by black-wolf »

Hello,

Why don't you disable hyphenation?

Just add this to the preamble:

Code: Select all

\usepackage[none]{hyphenat}
LaTeX will do justification as usual but no hyphenation. I use this by default on my documents.

Greetings from Portugal :D
meho_r
Posts: 823
Joined: Tue Aug 07, 2007 5:28 pm

Re: Hyphenation of similar words

Post by meho_r »

Thanks, but that's not an acceptable solution. In most cases hyphenation works correctly (I use the Croatian hyphenation scheme). But there are some words that need correction.
Juanjo
Posts: 657
Joined: Sat Jan 27, 2007 12:46 am

Hyphenation of similar words

Post by Juanjo »

What you are really looking for are hyphenation patterns for the Bosnian language. To my knowledge, they are not currently available. TeX has a primitive called \patterns to add them, but it can only be used by INITEX, that is, when TeX is dumping a format (like LaTeX). That is too difficult for a normal user, so one has to limit oneself to \hyphenation and build a list of exceptions.
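
As a minimal sketch of such an exception list (only the break points of the common part "pret-po-stav" are marked; since TeX allows no other breaks in a listed word, the suffixes will not be split):

Code: Select all

\documentclass{article}
\usepackage[croatian]{babel}
\begin{document}
% Exceptions are stored for the language that is current
% when \hyphenation is read (here: croatian).
\hyphenation{pret-po-stavka pret-po-stavke pret-po-stavku
  pret-po-staviti pret-po-stavljati}
% Check the result in the log file.
\showhyphens{pretpostavka pretpostavke pretpostavku pretpostaviti pretpostavljati}
\end{document}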

Anyway, I've done a test that may help. I've compiled the following code:

Code: Select all

\documentclass{article}
\usepackage[croatian,serbian,english]{babel}
\begin{document}
\showhyphens{pretpostavka pretpostavke pretpostavku pretpostaviti pretpostavljati}
\selectlanguage{serbian}
\showhyphens{pretpostavka pretpostavke pretpostavku pretpostaviti pretpostavljati}
\selectlanguage{croatian}
\showhyphens{pretpostavka pretpostavke pretpostavku pretpostaviti pretpostavljati}
\end{document}
There is no output. It doesn't matter, since the important things are in the log file: the \showhyphens command writes there the positions where hyphens could be placed. I copy the relevant lines:

Code: Select all

[] \OT1/cmr/m/n/10 pret-postavka pret-postavke pret-postavku pret-postaviti pre
t-postavl-jati

[] \OT1/cmr/m/n/10 pret-po-stavka pret-po-stavke pret-po-stavku pret-po-sta-vit
i pret-po-sta-vljati

[] \OT1/cmr/m/n/10 pret-pos-tavka pret-pos-tavke pret-pos-tavku pret-pos-ta-vit
i pret-pos-tav-ljati
These lines show the possible hyphens when English, Serbian and Croatian, in that order, are the active language. The English patterns are clearly not suitable. However, for the words considered here, the Serbian patterns seem to fit better than the Croatian ones. You may perform more extensive tests and see whether it is convenient for you to switch to Serbian or any other language of your geographical area.
The CTAN lion is an artwork by Duane Bibby. Courtesy of www.ctan.org.
meho_r
Posts: 823
Joined: Tue Aug 07, 2007 5:28 pm

Re: Hyphenation of similar words

Post by meho_r »

In other words, if I use the Croatian hyphenation patterns, there's only one way to handle the exceptions: manually entering all variants. Well... OK then. Thank you very much for your replies.

BTW, a little bit off topic: where can I find instructions on adding support for a new language to babel? I tried to find the babel homepage, but without success. Any advice is appreciated.
Juanjo
Posts: 657
Joined: Sat Jan 27, 2007 12:46 am

Hyphenation of similar words

Post by Juanjo »

meho_r wrote: In other words, if I use croatian hyphenation pattern, for all exception there's only one way: manually input all variants.
Yes, that's right.
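That said, the typing can at least be reduced when the variants share a common part. The following is only a sketch: \stemhyphenation is a hypothetical helper name (not part of LaTeX or babel) that loops over a comma-separated list of suffixes and issues one \hyphenation exception per variant. The suffixes must be given without spaces, and no break is allowed inside them.

Code: Select all

\documentclass{article}
\usepackage[croatian]{babel}
\makeatletter
% \stemhyphenation{<hyphenated stem>}{<suffix1,suffix2,...>}
% Hypothetical helper: builds stem+suffix and passes it to \hyphenation.
\newcommand*\stemhyphenation[2]{%
  \@for\stem@suffix:=#2\do{%
    % expand the suffix first, then execute \hyphenation{<stem><suffix>}
    \edef\stem@tmp{\noexpand\hyphenation{#1\stem@suffix}}%
    \stem@tmp}}
\makeatother
\begin{document}
\stemhyphenation{pret-po-stav}{ka,ke,ku,iti,ljati}
\showhyphens{pretpostavka pretpostavke pretpostavku pretpostaviti pretpostavljati}
\end{document}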
meho_r wrote: BTW, a little bit off topic, where to look for instructions and procedure about adding new language support for babel? I tried to find babel homepage but without success. Any advice is appreciated.
To my knowledge, support is provided at two different levels:
  a. hyphenation patterns,
  b. special macros, name translations, particular layouts...
Let's start with (b). Each language needs two files: <language>.sty and <language>.ldf. Both are generated from a dtx file by running TeX on a suitable ins file. All the sty files are obtained together from base.ins (located, in TeX Live, at texmf-dist/source/generic/babel). Since base.ins should not be modified, for the moment one can get bosnian.sty by copying any other <language>.sty and replacing <language> with bosnian.

The file <language>.ldf is generated by running TeX on <language>.ins, which strips the code from <language>.dtx (also located in texmf-dist/source/generic/babel). A further run of TeX on <language>.dtx yields the documentation for that language. So, remember: first the ins file, then the dtx file. In the attached zip file, I provide bosnian.ins and bosnian.dtx, which come from suitable changes to the corresponding Croatian files. For your convenience, I've generated bosnian.ldf and the documentation (bosnian.pdf). For completeness, I also provide bosnian.sty and a simple test file. You'll see that I've tried to localise the \today macro (I looked up the names of the months in an on-line dictionary). Until Bosnian hyphenation patterns become available, Bosnian uses the Croatian ones (that's the \let\l@bosnian\l@croatian command near the beginning of bosnian.ldf).
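
A quick way to check the installation might look like this. It is only a sketch, not the test file from the zip, and it assumes that bosnian.sty and bosnian.ldf are in a place where TeX can find them:

Code: Select all

\documentclass{article}
\usepackage[bosnian]{babel}
\begin{document}
% Localised date, as defined in bosnian.ldf
\today

% The hyphenation points are written to the log file
\showhyphens{pretpostavka pretpostavke pretpostavku pretpostaviti pretpostavljati}
\end{document}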

I hope that you can continue improving bosnian.dtx. You need to understand the particular syntax used there. Look at this tutorial. Even more, you need to understand the babel package a bit.

Now, let's go on with (a). Hyphenation patterns are defined in the file hyph-<language abbreviation>.tex, located at texmf-dist/tex/generic/hyph-utf8/patterns. The meaning of each pattern and the hyphenation algorithm are explained in Appendix H of The TeXbook. Some configuration files are also required. Adapting all that for Bosnian is a really hard job. And once it is adapted, a new LaTeX format has to be built so that the patterns become usable. You should ask for help: perhaps in comp.text.tex, in Bosnian institutions, from contributors to hyphenation patterns in other languages... I don't know.

In the meantime, I think you should perform extensive tests (with the help of \showhyphens) to see whether the current Croatian patterns are really better suited to Bosnian than the Serbian ones. If you compare hyph-hr.tex (Croatian) and hyph-sh-latn.tex (Serbo-Croatian), it seems that the latter contains more patterns, including groups of four or more letters, while the former considers at most four letters, so it is less complete and more error-prone. If you are finally convinced that hyph-sh-latn.tex could be a better starting point for the Bosnian patterns, I would recommend changing \let\l@bosnian\l@croatian to \let\l@bosnian\l@serbian in bosnian.dtx (and hence in bosnian.ldf).

I hope all this is really of some help for you.
Attachments
bosnian.zip
(57.12 KiB) Downloaded 169 times
meho_r
Posts: 823
Joined: Tue Aug 07, 2007 5:28 pm

Re: Hyphenation of similar words

Post by meho_r »

Wow, thank you very, very much. This is a great help indeed :D

In fact, I tried to make something similar in the past, but couldn't figure out the relation between the files and which files are needed for a language to work properly. I changed every file in the texlive folder that has "croatian" or "hr" in its name :) And all that's needed are four files. Ahhh... And, of course, I missed the key point: \let\l@bosnian\l@croatian, so hyphenation wasn't right.

As far as I can tell, the paths for those four files are:

1. for .ins and .dtx files: <texlive>/texmf-dist/source/generic/babel
2. for .ldf and .sty files: <texlive>/texmf-dist/tex/generic/babel

I tried it and it works perfectly. BTW, you've done terrific work making the changes in the .dtx file. I'm amazed :)

I will try to do something in this regard in the future. I'll try to contact some institutions, although I don't expect much, except maybe from our LUG and their localization team. And, if we manage to make hyphenation patterns, they can be used in OpenOffice.org too (as I was informed), so many will benefit from them ;)

Again, thank you very, very much :D