Monday 10 August 2015

Get single pages from a PDF with pdftk

Another little trick for file manipulation in Linux, which I find useful when making/reading papers, is how to convert a multi-page PDF document into multiple single-page files.

Here is how I do it in Ubuntu, using the lovely program pdftk:

 # Replace 7 with the number of pages in INFILE.pdf
 for i in {1..7}; do echo "Working on page $i"; pdftk A=INFILE.pdf cat "A$i" output "OUTFILEpage$i.pdf"; done
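
If you would rather not hard-code the number of pages, the loop can look it up first. What follows is only a rough sketch: it assumes pdftk's dump_data report contains a 'NumberOfPages:' line, and the function names page_count and split_pdf are made up for illustration rather than being part of pdftk.

```shell
# Read the output of `pdftk FILE dump_data` on stdin and print the page count.
page_count() {
    awk '/^NumberOfPages:/ { print $2; exit }'
}

# Split INFILE into one PDF per page, named PREFIXpage1.pdf, PREFIXpage2.pdf, ...
# (split_pdf and its two arguments are hypothetical names for this sketch.)
split_pdf() (
    infile=$1
    prefix=$2
    pages=$(pdftk "$infile" dump_data | page_count)
    # Give up if the page count could not be read.
    [ -n "$pages" ] || { echo "Could not read page count of $infile" >&2; exit 1; }
    i=1
    while [ "$i" -le "$pages" ]; do
        echo "Working on page $i"
        pdftk A="$infile" cat "A$i" output "${prefix}page$i.pdf"
        i=$((i + 1))
    done
)
```

Calling split_pdf INFILE.pdf OUTFILE should then produce the same files as the one-liner above. If you don't mind pdftk choosing the file names for you, its burst operation (pdftk INFILE.pdf burst) also writes one file per page.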

Friday 7 August 2015

PhD thesis writing advice


When I sat down to write this post, I had an opening line in mind that was going to bemoan me being a bit remiss with my blogging of late. That was before I checked my last post and realised it had been ten months, and 'remiss' feels a bit inadequate: I basically stopped.
I had a pretty good reason, in that I needed to finish my PhD. Those ten months have basically been filled with write paper, submit paper, thesis thesis thesis, do a talk, thesis thesis thesis, get paper rejections, rewrite, resubmit, thesis thesis thesis … then finally viva (and a couple more paper submissions). It is exhausting, and frankly having either a thesis or a paper to be writing at any given time takes the shine off blogging somewhat.
Now that it's done and out of the way (no corrections, wahey!), I begin to turn my mind to getting the blog up and running again. What have I been up to lately that I could write about, I asked myself. Well, there's always this big blue beast.

Thesis writing has been the major task of the last year of my life and I've spent more than a little time thinking about the process, so I thought I'd share some pointers and thoughts, even if some of them are the usual and the obvious.
1) Everything takes longer than expected.
This is the best advice I got going in, and is always the first thing I say to anyone who asks. In a perfect example of Hofstadter's Law, no matter how much time you think a given task should take, in reality it will take longer.
2) Be consistent.
The longer a given document is, the more chance there is that inconsistencies will creep in. Theses are long documents that are invariably only read by accomplished academics (or those in training), many of whom have a keen eye for finding such inconsistencies. 'n=5' on one line and 'n = 4' on another. 'Fig1: Blah', 'Figure 2 – Blah blah' and 'Fig. 3 Even more blah'. These might not seem like big deals, but added up they can make the document feel less professional. Choice of spelling (e.g. UK or US English), where you put your spaces, how you refer to and describe your figures and references, all of these types of things: it doesn't matter so much how you choose to do them, as long as you do them the same each time.
On a related note, many technical notations – such as writing gene, protein or chemical names – are covered by exhaustive committee-approved nomenclatures. Use them, or justify why you're not using them, but again, be consistent.
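
For the figure label example, a quick command-line check can show how many different styles have crept in. This is a minimal sketch, assuming the thesis text is available as a plain-text file (a .tex source, say); the function name label_styles is invented for illustration and the pattern only covers the 'Fig'/'Figure' variants mentioned above.

```shell
# Count how often each figure-label style appears in a text file.
# Every number is replaced by N so that 'Fig. 3' and 'Fig. 12' count as one style.
label_styles() {
    grep -oE 'Fig(ure)?\.? ?[0-9]+' "$1" \
        | sed 's/[0-9][0-9]*/N/' \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Running label_styles on a chapter prints one line per style with its count, most common first. More than one line of output means there is some tidying to do.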
3) Make it easy for yourself: get into good habits.
Writing long documents is difficult, so don't make it hard on yourself. Start as you mean to go on, and go on like that – I made a rather foolish decision to change how I formatted my figures one chapter into my writing, and going back to re-format the old stuff was time I could have spent writing*. Doing something right the first time is much more efficient than re-doing it two or three times.
Make sure you make full use of reference manager software. This will sound obvious to people who use them, but I am consistently surprised by the number of PhD students I meet who write their bibliographies and citations manually. I personally use Mendeley, which works perfectly well and has a nice reading interface too, although there are plenty of others. You are probably going to have a lot (hundreds) of references, and even more citations: doing them manually is a recipe for disaster.
Similarly, don't do any manual cross-referencing if you can avoid it – the document as you write it will likely be entirely fluid and subject to change for months, so any 'hard' references you put in could well end up needing to be changed, which not only takes time but increases the risk of you missing something and carrying an error along to your finished PhD.
If you have the time, I would recommend trying to get into LaTeX (with the 'X' pronounced as a 'K'), which is a free, open-source, code-based typesetting system. It's a bit of a steep learning curve, but there are plenty of good templates and once you've got a grasp of the basic commands it's incredibly powerful. Crucially, as your file is just a text document (which effectively just 'links' to pictures when you compile your PDF) it remains small in size, and therefore easy to load, back up and play with. It also makes referencing, cross-referencing, and generally producing beautiful looking text a lot easier than most word processors.
Theses are often full of technical words and abbreviations, and it's entirely likely your examiners won't know them all beforehand – therefore they need to be defined the first time they're used. However, if you're moving chunks of text around (sometimes whole chapters), how do you know which one is the first time? My tactic was to not define anything while writing, but whenever I did use a new phrase I added it to a spreadsheet, along with its definition. Then once everything was set in place I worked through that spreadsheet and used 'find' to add the appropriate definition to the first instance of every term. What's more, that spreadsheet was then easily alphabetised and converted straight into a convenient glossary!
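
The 'find the first instance' step can also be roughed out on the command line. This is a sketch of the same idea rather than the exact process described above: it assumes the spreadsheet has been exported as a tab-separated file (term, then definition, one pair per line) and that the thesis text is a plain-text file; the function names first_uses and glossary are made up for illustration.

```shell
# For each term in a tab-separated "term<TAB>definition" file, print the term
# and the number of the first line of the text file that contains it.
first_uses() (
    terms_file=$1
    text_file=$2
    tab=$(printf '\t')
    while IFS="$tab" read -r term definition || [ -n "$term" ]; do
        [ -n "$term" ] || continue   # skip blank lines
        # index() does a literal, case-sensitive substring search
        line=$(NEEDLE=$term awk 'index($0, ENVIRON["NEEDLE"]) { print NR; exit }' "$text_file")
        printf '%s\t%s\n' "$term" "${line:-not found}"
    done < "$terms_file"
)

# Print the same file as an alphabetised "term: definition" glossary.
glossary() (
    tab=$(printf '\t')
    sort -f -t "$tab" -k 1,1 "$1" | awk -F '\t' '{ printf "%s: %s\n", $1, $2 }'
)
```

Because the search is a plain substring match, a short abbreviation will also be found inside longer words, so the reported line is a place to start looking rather than a guarantee.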
4) Be prepared to get sick of it ...
You will spend an unhealthy amount of time doing one thing: working on this one document. It will bore you, and it will make you boring, as it will take over your time, thoughts and life**. It is basically guaranteed that you will become bone-weary of sitting down at your computer and working on it again. It's relentless, it just keeps going on and on and on, to the point where you forget that your life hasn't always been, and will not always be, just thesis-writing.
5) ... but remember it will end.
It might not seem like it at the time, but it will. You will finish writing, you will finish checking, you will hand it in. You'll then find some errors, but that's OK, your examiners are never going to read it as closely as you do when you check it. Remember that your supervisor(s)/thesis committees/whoever shouldn't let you submit unless they think you're ready, so the fact you're submitting means you're probably going to be fine!
* For what it's worth, the final way I did my figures was better, I should just have thought about it and done it first. Basically I outputted my plots from Python and R as SVG files, compiled them into whole figures in Inkscape (which is also great for making schematics) and saved these as PDFs. A word of warning though – certain lines/boxes in occasional Python-saved SVGs failed to print (apparently something to do with the way fancy printers flatten the layers), so it is probably worth keeping backup EPS or non-vector versions of your Inkscape files on hand.
** Look at me, I've just closed my file for the last time (before uploading to university servers) and the first thing I do is go and write a thousand words about it!