O'Reilly Hacks


WINDOWS HACK

Windows Wildcards Can Grab Extra Files

Some pdftk users process hundreds of files at a time. Doing this work on a Windows machine can yield unexpected results. The problem arises from how Windows matches wildcards at the command prompt, not from pdftk. For every long filename, Windows creates a short, DOS-compatible (8.3) alias filename. This short alias might match a wildcard expression even when the long filename does not. With pdftk, the result is that you end up with more input files than you wanted.

This article offers a couple of workarounds and then describes the case where this problem arose.



Contributed by:
Sid Steward
[06/21/05]

<h1>The Workarounds</h1>

One workaround is to use a wildcard expression that cannot match a short, DOS-style filename. DOS-style filenames have a base name of at most eight characters and an optional extension of at most three characters. The aliases generated for long filenames look something like this: 343990~1.PDF. In the case below, using the wildcard expression 343990_* solved the problem, because the generated aliases have a tilde, not an underscore, in the seventh position.

Another workaround is to use a shell other than the Windows command-prompt. I use bash, as packaged by MSYS.
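
The same idea, expanding the wildcard somewhere other than the Windows command prompt, can also be sketched in a script. The Python below is an illustration and not part of the original hack: the function name build_pdftk_command is hypothetical, and the sketch assumes that os.listdir reports only long filenames, so the 8.3 aliases never take part in the match. It builds the same pdftk command shown later in this article, but with an explicit list of input files in place of the wildcard.

```python
import fnmatch
import os


def build_pdftk_command(input_dir, pattern, output_pdf):
    """Return a pdftk argument list with the wildcard already expanded.

    The pattern is matched against the names returned by os.listdir,
    which are assumed to be long filenames only.
    """
    names = sorted(
        name for name in os.listdir(input_dir)
        if fnmatch.fnmatch(name, pattern)
    )
    inputs = [os.path.join(input_dir, name) for name in names]
    # Same shape as the command in this article: inputs, then "cat output <file>".
    return ["pdftk", *inputs, "cat", "output", output_pdf]


# Possible usage (assumes pdftk is on the PATH):
#   import subprocess
#   cmd = build_pdftk_command("input", "343990*", os.path.join("output", "343990.PDF"))
#   subprocess.run(cmd, check=True)
```

With hundreds of inputs, the expanded command can grow long, so the maximum command-line length may become the next limit to watch.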

<h1>The Case</h1>

This problem arose in a case where a directory of input files contained 448 PDFs. Their numerical names had incrementing prefixes and suffixes, such as:

343959_0011.pdf
343959_0021.pdf
343959_0031.pdf
343990_0011.pdf
343990_0021.pdf
343990_0031.pdf
343990_0041.pdf
343991_0011.pdf
343991_0021.pdf
343991_0031.pdf
343992_0011.pdf
343992_0021.pdf
343992_0031.pdf
343993_0011.pdf
343993_0021.pdf
343993_0031.pdf
343994_0011.pdf
343994_0021.pdf
343994_0031.pdf
...

When using pdftk to combine these PDF files, extra files were showing up in the output PDF. For example, running:

pdftk input\343990* cat output output\343990.PDF

yields 343990.PDF, which includes these files in this order:

343990_0011.pdf
343990_0021.pdf
343990_0031.pdf
343990_0041.pdf
345089_0131.pdf
345688_1121.pdf 

Is this a pdftk error or a shell error? Running dir with the same wildcard shows that Windows itself matches these unwanted files, so pdftk is only receiving what the wildcard matched:

dir 343990*

06/20/2005  03:58p               1,825 343990_0011.pdf
06/20/2005  03:58p               1,825 343990_0021.pdf
06/20/2005  03:58p               1,825 343990_0031.pdf
06/20/2005  03:58p               1,825 343990_0041.pdf
06/20/2005  03:58p               1,828 345089_0131.pdf
06/20/2005  03:58p               1,828 345688_1121.pdf

The /X switch of dir solves the mystery. It shows the DOS-compatible name on the left and the original, long filename on the right. The last two files have aliases that begin with 343990, even though their long names do not:

dir /X 343990*

06/20/2005  03:58p               1,825 343990~1.PDF    343990_0011.pdf
06/20/2005  03:58p               1,825 343990~2.PDF    343990_0021.pdf
06/20/2005  03:58p               1,825 343990~3.PDF    343990_0031.pdf
06/20/2005  03:58p               1,825 343990~4.PDF    343990_0041.pdf
06/20/2005  03:58p               1,828 343990~5.PDF    345089_0131.pdf
06/20/2005  03:58p               1,828 343990~6.PDF    345688_1121.pdf 
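
The matching behavior behind this listing can be modeled in a few lines. The Python sketch below is a simplified simulation, not the actual Windows matching code: it assumes a file is selected when either its long name or its short alias matches the pattern, ignoring case, and it uses fnmatch as a stand-in for Windows wildcard rules. The function names are hypothetical. The filename pairs are copied from the dir /X output above.

```python
from fnmatch import fnmatchcase

# (short alias, long name) pairs copied from the dir /X listing.
LISTING = [
    ("343990~1.PDF", "343990_0011.pdf"),
    ("343990~2.PDF", "343990_0021.pdf"),
    ("343990~3.PDF", "343990_0031.pdf"),
    ("343990~4.PDF", "343990_0041.pdf"),
    ("343990~5.PDF", "345089_0131.pdf"),
    ("343990~6.PDF", "345688_1121.pdf"),
]


def windows_style_matches(pattern, listing):
    """Long names selected when the long name OR the alias matches."""
    pat = pattern.lower()
    return [
        long_name
        for short_name, long_name in listing
        if fnmatchcase(long_name.lower(), pat)
        or fnmatchcase(short_name.lower(), pat)
    ]


def long_name_matches(pattern, listing):
    """Long names selected when only the long name is considered."""
    pat = pattern.lower()
    return [
        long_name
        for _, long_name in listing
        if fnmatchcase(long_name.lower(), pat)
    ]


print(len(windows_style_matches("343990*", LISTING)))   # all six files
print(len(long_name_matches("343990*", LISTING)))       # the four wanted files
print(len(windows_style_matches("343990_*", LISTING)))  # the four wanted files
```

Under these assumptions, 343990* selects all six files because the two unwanted files are reached through their aliases, while 343990_* selects only the four wanted files because no alias in the listing has an underscore after 343990.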

Thanks to Josh Gray at Daktronics who identified this problem and worked with me to solve it.

See also:

Pdftk, the PDF Toolkit


Estimating Maximum Command-Line Length


Open the MSYS Shell Right Where You Want It on Windows


© 2007 O'Reilly Media, Inc.