aboutsummaryrefslogtreecommitdiff
path: root/doc/pdftohtml.1
diff options
context:
space:
mode:
authorCalvin Morrison <calvin@pobox.com>2023-04-05 14:13:39 -0400
committerCalvin Morrison <calvin@pobox.com>2023-04-05 14:13:39 -0400
commit835e373b3eeaabcd0621ed6798ab500f37982fae (patch)
treedfa16b0e2e1b4956b38f693220eac4e607802133 /doc/pdftohtml.1
xpdf-no-select-disableHEADmaster
Diffstat (limited to 'doc/pdftohtml.1')
-rw-r--r--doc/pdftohtml.1158
1 files changed, 158 insertions, 0 deletions
diff --git a/doc/pdftohtml.1 b/doc/pdftohtml.1
new file mode 100644
index 0000000..5129f0d
--- /dev/null
+++ b/doc/pdftohtml.1
@@ -0,0 +1,158 @@
+.\" Copyright 1997-2022 Glyph & Cog, LLC
+.TH pdftohtml 1 "18 Apr 2022"
+.SH NAME
+pdftohtml \- Portable Document Format (PDF) to HTML converter
+(version 4.04)
+.SH SYNOPSIS
+.B pdftohtml
+[options]
+.I PDF-file
+.I HTML-dir
+.SH DESCRIPTION
+.B Pdftohtml
+converts Portable Document Format (PDF) files to HTML.
+.PP
+Pdftohtml reads the PDF file,
+.IR PDF-file ,
+and places an HTML file for each page, along with auxiliary images
+in the directory,
+.IR HTML-dir .
+The HTML directory will be created; if it already exists, pdftohtml
+will report an error.
+.SH CONFIGURATION FILE
+Pdftohtml reads a configuration file at startup. It first tries to
+find the user's private config file, ~/.xpdfrc. If that doesn't
+exist, it looks for a system-wide config file, typically /etc/xpdfrc
+(but this location can be changed when pdftohtml is built). See the
+.BR xpdfrc (5)
+man page for details.
+.SH OPTIONS
+Many of the following options can be set with configuration file
+commands. These are listed in square brackets with the description of
+the corresponding command line option.
+.TP
+.BI \-f " number"
+Specifies the first page to convert.
+.TP
+.BI \-l " number"
+Specifies the last page to convert.
+.TP
+.BI \-z " number"
+Specifies the initial zoom level. The default is 1.0, which means
+72dpi, i.e., 1 point in the PDF file will be 1 pixel in the HTML.
+Using \'-z 1.5', for example, will make the initial view 50% larger.
+.TP
+.BI \-r " number"
+Specifies the resolution, in DPI, for background images. This
+controls the pixel size of the background image files. The initial
+zoom level is controlled by the \'-z' option. Specifying a larger
+\'-r' value will allow the viewer to zoom in farther without upscaling
+artifacts in the background.
+.TP
+.BI \-vstretch " number"
+Specifies a vertical stretch factor. Setting this to a value greater
+than 1.0 will stretch each page vertically, spreading out the lines.
+This also stretches the background image to match.
+.TP
+.B \-embedbackground
+Embeds the background image as base64-encoded data directly in the
+HTML file, rather than storing it as a separate file.
+.TP
+.B \-nofonts
+Disable extraction of embedded fonts. By default, pdftohtml extracts
+TrueType and OpenType fonts. Disabling extraction can work around
+problems with buggy fonts.
+.TP
+.B \-embedfonts
+Embeds any extracted fonts as base64-encoded data directly in the HTML
+file, rather than storing them as separate files.
+.TP
+.B \-skipinvisible
+Don't draw invisible text. By default, invisible text (commonly used
+in OCR'ed PDF files) is drawn as transparent (alpha=0) HTML text.
+This option tells pdftohtml to discard invisible text entirely.
+.TP
+.B \-allinvisible
+Treat all text as invisible. By default, regular (non-invisible) text
+is not drawn in the background image, and is instead drawn with HTML
+on top of the image. This option tells pdftohtml to include the
+regular text in the background image, and then draw it as transparent
+(alpha=0) HTML text.
+.TP
+.B \-formfields
+Convert AcroForm text and checkbox fields to HTML input elements.
+This also removes text (e.g., underscore characters) and erases
+background image content (e.g., lines or boxes) in the field areas.
+.TP
+.B \-table
+Use table mode when performing the underlying text extraction. This
+will generally produce better output when the PDF content is a
+full-page table. NB: This does not generate HTML tables; it just
+changes the way text is split up.
+.TP
+.BI \-opw " password"
+Specify the owner password for the PDF file. Providing this will
+bypass all security restrictions.
+.TP
+.BI \-upw " password"
+Specify the user password for the PDF file.
+.TP
+.B \-verbose
+Print a status message (to stdout) before processing each page.
+.RB "[config file: " printStatusInfo ]
+.TP
+.B \-q
+Don't print any messages or errors.
+.RB "[config file: " errQuiet ]
+.TP
+.BI \-cfg " config-file"
+Read
+.I config-file
+in place of ~/.xpdfrc or the system-wide config file.
+.TP
+.B \-v
+Print copyright and version information.
+.TP
+.B \-h
+Print usage information.
+.RB ( \-help
+and
+.B \-\-help
+are equivalent.)
+.SH BUGS
+Some PDF files contain fonts whose encodings have been mangled beyond
+recognition. There is no way (short of OCR) to extract text from
+these files.
+.SH EXIT CODES
+The Xpdf tools use the following exit codes:
+.TP
+0
+No error.
+.TP
+1
+Error opening a PDF file.
+.TP
+2
+Error opening an output file.
+.TP
+3
+Error related to PDF permissions.
+.TP
+99
+Other error.
+.SH AUTHOR
+The pdftohtml software and documentation are copyright 1996-2022 Glyph
+& Cog, LLC.
+.SH "SEE ALSO"
+.BR xpdf (1),
+.BR pdftops (1),
+.BR pdftotext (1),
+.BR pdfinfo (1),
+.BR pdffonts (1),
+.BR pdfdetach (1),
+.BR pdftoppm (1),
+.BR pdftopng (1),
+.BR pdfimages (1),
+.BR xpdfrc (5)
+.br
+.B http://www.xpdfreader.com/