A look at the new journal style at PLOS


I came across this post by Peter Murray-Rust, who spotted an error in the XML of a PLOS One paper. It might seem like a small point but is actually important. PLOS are one of the few publishers who publish papers in multiple formats, including PDF and XML. This is to be applauded and all publishers should be doing this. Of course the downside is that inquisitive people like Peter and me will examine these with a magnifying glass, but all in a good cause! Like Peter, I care passionately about publishers creating content that is future-proof and machine-readable, and I want publications to be beautiful to look at too. Fortunately, that is possible today and the purpose of this post is to encourage publishers to do just that.

Back to the XML error. The paper has expressions like $0.06 \pm 0.01$ in several places. The PDF looks like

Now you can see that the plus sign is centred and the minus sign is a bit lower than it should be. That is because it has been wrongly coded in the source file.

Looking at that XML, we can see that it is represented by

So the + symbol has been hacked to show $\pm$, not by using the correct character but by underlining it. And the XML (supposedly the definitive version of the publication) reflects this. Now an underline is sometimes used for emphasis, so the XML is saying that not only is this a plus symbol, it most certainly is one! Why is this important? Well, suppose a reader with impaired vision is accessing the page and has a text to speech reader that converts the XML. They will hear a “plus”. They might even hear it emphasized as it is underlined!
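To make the point concrete, here is a minimal Python sketch of how such a hack could be found and repaired. The element name `underline` and the sample fragment are assumptions for illustration, not copied from the PLOS file; the idea is simply to replace an underlined plus with the real plus-minus character (U+00B1).

```python
import xml.etree.ElementTree as ET

PLUS_MINUS = "\u00b1"  # the real plus-minus character


def is_fake_plus_minus(element) -> bool:
    """True for an empty <underline> element whose only text is '+'.

    The tag name 'underline' is an assumption made for this sketch.
    """
    return (
        element.tag == "underline"
        and len(element) == 0
        and (element.text or "").strip() == "+"
    )


def fix_fake_plus_minus(xml_text: str) -> str:
    """Replace every underlined plus with a real plus-minus sign."""
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):
        last_kept = None  # last sibling that was not removed
        for child in list(parent):
            if is_fake_plus_minus(child):
                replacement = PLUS_MINUS + (child.tail or "")
                if last_kept is None:
                    parent.text = (parent.text or "") + replacement
                else:
                    last_kept.tail = (last_kept.tail or "") + replacement
                parent.remove(child)
            else:
                last_kept = child
    return ET.tostring(root, encoding="unicode")


sample = "<p>0.06 <underline>+</underline> 0.01</p>"  # hypothetical fragment
fixed = fix_fake_plus_minus(sample)
print(fixed)
```

With the real character in place, a text-to-speech reader has a chance of announcing "plus or minus" rather than an emphasized "plus".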

A quick hats-off to PLOS and to cc-by

In this post I am going to dissect a paper published by PLOS. I am going to praise and to criticize, giving my personal opinion. First of all thanks to PLOS for not only providing the PDF but also the XML, using a cc-by license, which means I can cut and chop and copy any part of the publication and say anything I want about it, and I merely have to attribute the author and the source of the article. It is quite ironic that the license allows me to criticize some parts, but that’s the way it goes! Other publishers, please join the party!

New journal style, new vendor

While looking at the PDF, I noticed that PLOS had adopted a new journal style with a single column and a wide left margin. This is a very welcome change. (There is now very little justification for 2 or more columns, unless the primary medium for the journal is a large print format.) Were it not for my extreme modesty I would draw attention to the uncanny similarity of the new style and the page style of PeerJ for which I proudly oversaw the page design. 😉

Around the same time as Peter’s article, I came across the Scholarly Kitchen blog post pointing out that the output of PLOS in January 2015 was only around 1/3 of that of December 2014, but for a “good reason”, namely that PLOS had teamed with a new composition vendor, and things would slow down at first, but would soon lead to gains in “speed, efficiency, and quality”. So perhaps it is a bit unfair to get the magnifying glass out while the new composition vendor is settling down, but on the other hand it might be a good time for a gentle look “under the hood”, all in the interests of making PLOS even better than it is. I decided to look at some files from PLOS Computational Biology as these tend to have more math in them, and it would be good to see how good the math is. A typical file was this one (by Saheed Imam and others), which has some math, but not what I would call really heavy math.

The design

As I say, single column is the only way to go these days, although there are exceptions, most notably when high volume printing is the most important medium, and the paper size is A4 or the American equivalent. In this case the only way to save paper is to go double column, else there are too many characters in a line. (90 characters including spaces is regarded as a maximum number of characters per line.) The advantages of a single column are:

• PDF can be read easily on screen
• Automated pagination is far easier
• Figure placement is less complex

So let’s look at some of the details of the new design…

Logos

The main PLOS logo didn’t appear as sharp as it should. Zooming in, I discovered that it is bitmapped (i.e. rasterized). In the December issue, the logo was placed as a (very good) vector file. If this is a deliberate change, then it is hard to see why. The logo is the brand and it should be pin sharp even on the best printer. So on this one, the old composition vendor had the edge!

For those unfamiliar with vector files, these are generally of smaller file size than bitmapped images, and the shapes in the image are defined by mathematical curves, which means they will always be sharp, whatever the resolution of the screen or printer. See this highly enlarged image of the logo showing no degradation of the image:

The Open Access logo is also bitmapped:

The old papers used a different logo, but in vector form:

Finally, there is the CrossMark logo. This has always been bitmapped, but I can’t see why a good vector image provided by CrossRef cannot be used. It doesn’t cost any more! Here is the logo on PLOS papers (left) and one provided by CrossRef (right).

Typography

I will now look at the type, from both a graphic design and a typesetting angle.

Prelims

Let’s take a quick look at the prelims from the paper in question:

I have some suggestions here that will make the text more readable. Why not start each new affiliation on a new line, rather than running them together into one paragraph? Running them together is, in my view, only a remnant of print days when saving paper was paramount. In addition, the numbers would look better if they were pushed into the left margin, thus accentuating the text indent throughout the text. Similarly, the email should be flush left with the asterisk pushed into the margin.

Hyphenation

The text is “ragged-right”, i.e. there is no justification, or alignment, of the right margin. This is a good choice as it allows for large words that cannot be hyphenated (typical in life sciences) without the need for large inter-word spacing. By applying a little extra stretching between the words, hyphenation can be almost eliminated (but not quite). So it is about balancing the “raggedness” and the level of hyphenation. Let’s look at the first few lines of the abstract:

My feeling is that the level of hyphenation is a bit higher than need be, and comparable to that in justified text. In particular, hyphenating a compound word (well-studied) does not help readability. Most of these could be avoided by using the right global settings for word stretching and raggedness.
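As a rough illustration of the trade-off, here is a small Python sketch using the standard `textwrap` module. It wraps plain text rather than typesetting a page, and the sample sentence and line width are invented for the example, but it shows how a single global setting decides whether a compound word like "well-studied" may be split at its hyphen.

```python
import textwrap

# Invented sample sentence and width, chosen so the compound word
# falls at the end of the first line.
sentence = "The response of this well-studied organism"
width = 26

# Default behaviour: a line may end right after the hyphen of a compound word.
split_at_hyphen = textwrap.wrap(sentence, width=width)

# Global setting changed: compound words are kept whole, at the cost
# of a slightly more ragged right edge.
kept_whole = textwrap.wrap(sentence, width=width, break_on_hyphens=False)

for line in split_at_hyphen:
    print(repr(line))
print()
for line in kept_whole:
    print(repr(line))
```

The second version leaves the first line shorter, which is exactly the balance between raggedness and hyphenation described above.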

Tables

The alternate shading certainly aids readability, and I am surprised more publishers don’t use this, simply to guide the eye along rows. My main point about tables in PLOS papers is that if numbers in a column are related (and they usually are) then they should be aligned on the decimal point. In the above example, the numbers are left aligned, which makes it hard to compare them. It is also worth mentioning that the dashes before the numbers should be minus signs, which are longer. These are hyphens, not minus signs, and happen to be particularly short in the typeface chosen. They lack the “strength” of a real minus sign.
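Both fixes are easy to automate. The following Python sketch, using made-up numbers rather than the values in the PLOS table, pads a column so that the decimal points line up and swaps a leading hyphen for a true minus sign (U+2212). It assumes a fixed-width display; a typesetting system would align on the decimal point directly.

```python
MINUS = "\u2212"  # true minus sign, longer than a hyphen


def align_on_decimal(values):
    """Return the values as strings padded so the decimal points line up."""
    cleaned = []
    for value in values:
        value = value.strip()
        if value.startswith("-"):
            value = MINUS + value[1:]  # hyphen -> real minus sign
        cleaned.append(value)

    # Split each number into the part before and after the decimal point.
    parts = [value.partition(".") for value in cleaned]
    left = max(len(whole) for whole, _, _ in parts)
    right = max(len(frac) for _, _, frac in parts)

    rows = []
    for whole, dot, frac in parts:
        # Integers get a space where the decimal point would be.
        rows.append(whole.rjust(left) + (dot or " ") + frac.ljust(right))
    return rows


column = ["-0.5", "12.25", "3", "-10.125"]  # illustrative values only
aligned = align_on_decimal(column)
for row in aligned:
    print(row)
```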

Finally, another point about alignment. Looking at the footnotes to the tables, would it not be nicer to have all the indices (*, a, b, c) hanging to the left of the table rows, and the text aligned to the right edge of (my) red line? It is simple changes like this that can make an article a joy to read, rather than a chore!

Graphics

OK folks, Uncle Kaveh is not happy. Someone please tell me I am wrong, but from what I can see, PLOS accepts figures in TIFF, or EPS, including EPS vector files, but during production, every single vector file is converted to bitmap, and the original discarded. If this is the case, it is just not good enough, and PLOS really should be setting a better example. I do hope it is a case of the new composition vendors “settling in”. (From what I remember, in the old days figures were a mixture of vector and bitmap.) In my view this practice is analogous to going into an art gallery, taking photos of all the masterpieces, then putting those masterpieces in a skip! With a very few (valid) exceptions, vector files should not be rasterized. Here is why:

• More often than not the file gets bigger
• Almost always the quality degrades, especially when printed at high resolution
• Data is destroyed for ever – as Peter keeps telling us!
• The possibility of accessibility for the visually impaired is reduced to near-zero
• It’s disrespectful to authors who have sent clean, sharp EPS files, following the guidelines

Mathematics

Math typesetting is considered one of the toughest areas of publishing, which is why the faint-hearted should not attempt it, at least not commercially! Here are the challenges:

1. Ensuring accuracy and absolute fidelity with author’s manuscript
2. Abiding by subtle conventions of math typography and spacing – much more subtle than body text
3. Ensuring perfectly matching XML/MathML
4. Consistency throughout a publication.

It is the last one, consistency, that allows a paper to exude authority. And, needless to say, consistently good is better than consistently bad! The first two challenges have been around for centuries, but 3 is what compounds the problem of typesetting. And the problem is that the MathML either has to be perfect or it is useless. In fact it is more important for the XML to be correct than for the PDF. There is no point having a definitive archive if it is not absolutely correct. And these are the reasons most typesetters (rightly) steer clear of math. So let’s roll our sleeves up and take a look at the file. Here is a screenshot from the first set of equations:

The first thing I look for is consistency, and there is an immediate “fail” here. Notice that constructions like $Corr_{mean}$ appear both in Roman and in Italic. Clearly they mean the same thing and should be consistently set. Same goes for $Cluster_y$, etc. The general convention is to set words and phrases as Roman, and variables as Italic. This makes the text easier to read. Another point is that the spacing in math seems rather arbitrary, as in these spaces after a comma:

Another example of inconsistency is equation 8:

We have “Target_j” for example, both with j as a subscript and with it inline with main text. Also note numeral 3 which is in a different typeface in an equation and in text. These should use the same typeface.

One more set of equations further down provides a good example of why Roman and Italic should be used judiciously:

These equations look a little confusing, and invite a “double take”, because variables and words have been mixed, and all are in Italic. If words like “with” and the phrases in the last equation were set in Roman, then they would be much easier to read.
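For readers who write their own math, the convention can be sketched in LaTeX using only the names mentioned above. Treating the subscript $y$ as a genuine variable is an assumption here; if it is a label, it too should be upright.

```latex
% Everything in math italic: each letter reads as a separate variable
$Corr_{mean}$, $Cluster_y$

% Words upright, genuine variables italic
$\mathrm{Corr}_{\mathrm{mean}}$, $\mathrm{Cluster}_{y}$
```

In the first line the letters are spaced as a product of variables; in the second, `\mathrm` sets each word as a word, which is what removes the "double take".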

The XML

Again, kudos to PLOS for publishing the XML of the articles, which we’ll now take to bits!!

Dear reader, please do not be put off by the gobbledygook, and stay with Uncle Kaveh! XML looks intimidating but I promise, it is very simple – there’s just a lot of it… So let’s find the part of the XML corresponding to equation 11 which we have just looked at. (I simply searched for the phrase “Recall was calculated”.) Brace yourselves – here it is:

Yes, I know. But give me a second and I will explain. You are looking at MathML, which “describes” a mathematical equation, allowing it to be displayed in different ways, and most important, allowing accessibility to, for example, the blind and visually impaired. Most of the above tags are one of three “elements”, namely

• <mi>x</mi> identifiers, usually a variable like x or y
• <mo>+</mo> operators, $+$, $=$, etc.
• <mn>2</mn> numbers.

And mml: just stands for MathML. Now MathML is allowed to be verbose, so that is not a problem in principle. But the above mass of characters looks to be a bit too much for a small equation like (11). Well, that is because it has been wrongly coded, and very wrongly at that!! To see why, let’s change the contrast a bit:

So the verbose tags are now in the background and you can see the “meat” of the equation in yellow. You should be able to read the letters in the sequence of the equation, e.g. “True positive predictions”, etc. So what’s wrong with this code? Well, every letter is coded as an identifier, or a variable. So the word “True” is actually represented as “$T × r × u × e$” which is not going to help a blind mathematician understand the content! This, I am afraid, is an example of XML that is worse than no XML! I do have sympathy with the new composition vendor, having to deliver say 100 files a day, but these are elementary errors and I really feel I should point them out.
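An error of this kind can be caught automatically. Here is a minimal Python sketch that looks for runs of consecutive single-letter `<mml:mi>` elements, which usually means a word has been spelled out as a product of variables. The sample fragment is hypothetical and much shorter than the real file, and the threshold of three letters is an arbitrary assumption. In correct MathML the word would normally go into a single text element, e.g. `<mml:mtext>True</mml:mtext>`.

```python
import xml.etree.ElementTree as ET

MML = "http://www.w3.org/1998/Math/MathML"  # the MathML namespace


def find_spelled_out_words(mathml: str, min_letters: int = 3):
    """Return words that were coded as runs of single-letter <mi> elements."""
    root = ET.fromstring(mathml)
    words = []
    for parent in root.iter():
        run = []
        # The trailing None flushes a run that ends at the last child.
        for child in list(parent) + [None]:
            text = (child.text or "").strip() if child is not None else ""
            is_letter = (
                child is not None
                and child.tag == f"{{{MML}}}mi"
                and len(text) == 1
                and text.isalpha()
            )
            if is_letter:
                run.append(text)
            else:
                if len(run) >= min_letters:
                    words.append("".join(run))
                run = []
    return words


# Hypothetical fragment: the word "True" coded letter by letter.
sample = (
    f'<mml:mrow xmlns:mml="{MML}">'
    "<mml:mi>T</mml:mi><mml:mi>r</mml:mi><mml:mi>u</mml:mi><mml:mi>e</mml:mi>"
    "<mml:mo>=</mml:mo><mml:mi>x</mml:mi>"
    "</mml:mrow>"
)
suspects = find_spelled_out_words(sample)
print(suspects)
```

A check like this will also flag a genuine product of three or more single-letter variables, so it is a prompt for a human to look, not an automatic fix.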

References

The references look OK in the PDF, but I don’t understand the current trend of using the same type style throughout a reference. The traditional convention of using Bold, Italic, etc, to differentiate the components of a reference is just common sense. It doesn’t cost any more to make the year Bold and the title Italic.

The XML is pretty clean:

I have a personal dislike for the “mixed citation” (used by most publishers), whereby punctuation is put in as part of the data, as you can see from this screenshot. But punctuation is not data. The XML should be pure, and punctuation should be put in on the fly during “rendering” or typesetting. Having these in the XML makes it easier for compositors, but I think we really have to escape from a print-centric world where the “look” of a page is what counts. And again think of the blind reader who really doesn’t care about brackets and semi-colons, and just wants the content!
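The idea of keeping the data pure and adding punctuation at render time can be sketched in a few lines of Python. The field names, the two house styles and the placeholder reference below are all invented for illustration; they are not taken from the PLOS XML or from any real citation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reference:
    """Pure data: no brackets, commas or full stops are stored here."""
    authors: List[str]
    year: int
    title: str
    journal: str
    volume: str
    pages: str


def render_plain(ref: Reference) -> str:
    """One hypothetical house style; punctuation is added at render time."""
    authors = ", ".join(ref.authors)
    return f"{authors} ({ref.year}) {ref.title}. {ref.journal} {ref.volume}: {ref.pages}."


def render_styled(ref: Reference) -> str:
    """Another style from the same data: italic title, bold year (Markdown-like markers)."""
    authors = ", ".join(ref.authors)
    return f"{authors}. *{ref.title}*. {ref.journal} {ref.volume}, {ref.pages} (**{ref.year}**)."


# Placeholder values only, not a real reference.
placeholder = Reference(
    authors=["Author A", "Author B"],
    year=2015,
    title="An example title",
    journal="Example Journal",
    volume="1",
    pages="1-10",
)
print(render_plain(placeholder))
print(render_styled(placeholder))
```

Because the brackets and colons live in the renderer, the same record can feed a PDF, a web page or a screen reader, each with its own presentation.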

Why am I picking on PLOS?

Because I love Open Access and I love PLOS for pushing boundaries. I am afraid that just as Elsevier, the biggest academic publisher, are the constant target for OA proponents, PLOS, as the “poster boys” of OA publishing, are the biggest target in Open Access. We are all looking to them to set an example new OA publishers can aspire to. So my aim is not to gloat at errors I have found in PLOS papers but to encourage them, with the support of their vendors, to be even better than they are, and to show that the OA model can deliver products as good as, if not better than, subscription models. And I will be looking at other publishers’ files too, so watch this space! Unfortunately, it is probably going to be OA publishers (using cc-by) that I will be analysing, due to copyright and access issues. I could write and ask permission, but I really couldn’t be bothered!!

Category: TeX, Publishing, XML

9 comments on “A look at the new journal style at PLOS”

1. Hi Kaveh,
nice article. I have one more suggestion concerning the table of (Motif, the long DNA sequences): I would have made the TGT… strings set to the same width, having equal spaces for each uppercase letter. The difference is small, but visually disturbing and one has the feeling the sequences are not the same length.
2. Yes Norbert. Someone else mentioned that. Proportional fonts have had a detrimental effect on many things, including email. My view is that we would be better off using fixed-width fonts for everything except the final typesetting of a book.

I should say that I might have been a bit unfair in my criticism of PLOS, considering they are publishing 100–200 papers a day! Overall, their production is well above average. Wait till I get to the others. 😉

3. Thanks for an instructive and interesting article.

I am surprised that authors do not typeset maths themselves in latex. I wouldn’t dream of not doing that, but maybe that is something that comes with being a physicist.

On the subject of maths, and the usage of upright and italic, I am personally very much in favour of using standard nomenclature. In physics, this nomenclature is agreed upon by IUPAP, and is described in the “IUPAP

Among what this describes is that physical quantities should be written in italic, but, for example, mathematical constants in upright. For example, “e” (the base of the natural logarithm), the complex “i”, and the “d” as part of a differential should be upright. Thus, the end of an integral should be of the form “\mathrm{d}x”. Subscripts should be italic if they represent a physical quantity, otherwise not. These recommendations are, unfortunately, not adhered to to the extent that I think they should.

4. Hello Anders

First of all, I agree with your usage of Roman and Italic for variables, differentials, labels etc. And it makes the math more readable. In the old days many publishers, e.g. IOP, would copy edit author TeX files, and indeed make all these style changes. The trend in recent years has been to “go with author” with most publishers and just to ensure consistency within each article.

The problems I mentioned above regarding math come in due to the requirement to save the math as MathML where possible. Even when a perfect TeX file has been submitted, how do you convert from TeX to MathML and keep the semantics, and separate the variables from labels, etc? This is not a trivial task and a “filter” has to be written with great care and great sensitivity to the author’s manuscript, hence the problem that we see in the resultant XML files.

5. Thanks Mark and sorry for the late approval. I will remove moderation now…
6. “While looking at the PDF, I noticed that PLOS had adopted a new journal style with a single column and a wide left margin. This is a very welcome change. (There is now very little justification for 2 or more columns, unless the primary medium for the journal is a large print format.)”

Could you expand a bit on the above? What are the advantages of 1 column over 2 columns? Isn’t it easier to read 2 columns, or at least easier to read spans that are not larger than ~10 words? Is it a cost thing? Or about all the hyphenation? Other things?

• Hi Bob

Thanks for looking!

Double column is fine when printed on paper, but very hard to read on screen as you have to navigate from the bottom of one to the top of the next column. The primary reason for having two columns is so that on the traditional large paper (e.g. UK A4) there would be too many letters or words per line (as you rightly point out) and the eye has difficulty finding the start of the following line. As more and more journals are online only, the need for double column reduces.

A more technical advantage of single column pages is that it is easier to create the pages fully automatically from the source XML documents. As we try to reduce the cost of publishing, we want to minimize manual work involved in pagination. Double column usually needs manual intervention to deal with floating objects like figures, but single column can usually be automated.

Hope that explains why I suggested single column is preferable these days.

• Thanks for the reply – very useful!
7. “Double column is fine when printed on paper, but very hard to read on screen as you have to navigate from the bottom of one to the top of the next column.”

Er, maybe your Page-Up and Page-Down keys aren’t working..? It’s trivial to navigate from the bottom of the left column to the top of the right column using page down/up. In addition, some people like to print PDFs to read away from their computer screen, and then these single-column articles with very wide margins are a horrid waste of paper–and even more difficult to read.

The one and only reason for the switch to single-column at PLoS ONE was:  it’s easier and hence cheaper to format. In fact, it’s dirt cheap because there is no formatting! It’s just a long stream of text with the periodic figure.