File Header Size Information Required for Document Files

AI Thread Summary
The discussion revolves around the challenge of programmatically changing headers in document files such as PDF, MS Word, and LibreOffice formats. The original poster seeks information on the header sizes of these file formats, noting that while they are aware of the fixed header size for BMP files, they find no similar information for PDFs or Word documents. Responses indicate that PDF files do not have a fixed header size, and altering them can be complex. MS Word files in the Open XML format are easier to manipulate, as they are structured as XML documents within a ZIP archive. The conversation emphasizes that different file formats have varying header structures, with some potentially having no header at all. It is suggested to refer to documentation and source code, particularly for LibreOffice, to understand file structures better. Additionally, tools for inspecting PDF file structures are mentioned, along with a reference manual for PDF specifications.
zak100
Hi,
I want to change the header of some document files, such as PDF, MS Word, and LibreOffice files, programmatically. I know that I have to use byte-level functions such as putc() and getc(), but I don't know the header size of the above-mentioned file formats. I saw a list of file formats on Wikipedia: https://en.wikipedia.org/wiki/File_format
but I can't see any information about header size. For instance, I know that the header size of a BMP file is 54 bytes. Can somebody please point me to a link that gives this information?

Zulfi.
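
For reference, the 54-byte figure quoted for BMP is the 14-byte file header plus the common 40-byte BITMAPINFOHEADER; other DIB header variants are longer or shorter, so a reader normally takes the pixel-data offset from the file itself. The following is a minimal sketch of that idea. The function name parse_bmp_header and the synthetic demo bytes are illustrative assumptions, not part of the original post.

```python
import struct


def parse_bmp_header(data: bytes) -> dict:
    """Read the 14-byte BMP file header and the size field of the DIB header."""
    if len(data) < 18 or data[:2] != b"BM":
        raise ValueError("not a BMP file")
    # File header (little-endian): signature, file size, two reserved
    # 16-bit fields, and the byte offset at which pixel data begins.
    _sig, file_size, _res1, _res2, pixel_offset = struct.unpack_from("<2sIHHI", data, 0)
    # The DIB header starts at byte 14; its first field is its own size.
    (dib_header_size,) = struct.unpack_from("<I", data, 14)
    return {
        "file_size": file_size,
        "pixel_offset": pixel_offset,
        "dib_header_size": dib_header_size,
    }


# Synthetic demo data (hypothetical): a 54-byte header followed by 4 pixel bytes.
demo = (
    struct.pack("<2sIHHI", b"BM", 58, 0, 0, 54)
    + struct.pack("<I", 40)
    + bytes(36)  # remaining BITMAPINFOHEADER fields, zeroed for the demo
    + bytes(4)   # stand-in pixel data
)
info = parse_bmp_header(demo)
print(info["pixel_offset"], info["dib_header_size"])
```

Reading pixel_offset rather than hard-coding 54 keeps the code working for BMP files that use a different DIB header.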
 
Where have you looked already? Wikipedia has some information about all of these formats (e.g. https://en.wikipedia.org/wiki/PDF#File_structure) with links to more detailed specifications. PDF files can be very difficult to alter. MS Word (if stored in the Open XML (docx) format, which Open/Libre Office can also read and write) is a little easier, but as this is a set of XML documents stored in a ZIP archive, you will want to work with libraries that do the heavy lifting for you rather than work at the byte level.
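
To illustrate the "XML documents stored in a ZIP archive" point, here is a minimal sketch using Python's standard zipfile module. The helper names are illustrative, and make_demo_package builds a tiny stand-in archive, not a valid Word document; a real .docx contains more parts, though word/document.xml and [Content_Types].xml are among them.

```python
import io
import zipfile


def list_members(package: bytes) -> list:
    """Return the names of the files stored inside a ZIP-based document."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.namelist()


def read_member(package: bytes, name: str) -> bytes:
    """Return the raw bytes of one file stored inside the package."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read(name)


def make_demo_package() -> bytes:
    """Build a tiny stand-in archive (hypothetical content, not a valid .docx)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document>Hello</document>")
    return buf.getvalue()


package = make_demo_package()
print(list_members(package))
print(read_member(package, "word/document.xml"))
```

For a file on disk, zipfile.ZipFile(path) can be opened directly. Any change to the XML parts means rewriting the archive, which is why a library is usually preferable to putc()/getc().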
 
Sorry, I can't find any information about header size in terms of bytes on the link you have provided. Do you have any information about header size related to HTML or notepad files?

Zulfi.
 
zak100 said:
Sorry, I can't find any information about header size in terms of bytes on the link you have provided.
Perhaps that is because PDF files do not have a fixed header size?
zak100 said:
Do you have any information about header size related to HTML or notepad files?
HTML is plain text so it doesn't have a 'header' in the sense you are using this word. Notepad is an editor for plain text files and doesn't have a file format of its own.

'Header size' isn't really a thing; if you want to learn how these files are structured, just read what the documentation says rather than searching it for terms that may not be relevant.
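
On the point that PDF files have no fixed header size: a PDF begins with a short version line such as %PDF-1.4, and readers locate the rest of the structure from the end of the file, where the startxref keyword gives the byte offset of the cross-reference table. A rough sketch follows, assuming the whole file is already in memory; the function names and the sample bytes are illustrative, and the sample is not a complete PDF.

```python
import re
from typing import Optional


def pdf_version(data: bytes) -> Optional[str]:
    """Return the version from the leading '%PDF-x.y' line, or None."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None


def find_startxref(data: bytes) -> Optional[int]:
    """Return the last cross-reference offset recorded near the end of the file."""
    tail = data[-1024:]  # the trailer sits at the end of the file
    offsets = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    return int(offsets[-1]) if offsets else None


# Hypothetical sample: only the first and last lines of a PDF, for illustration.
sample = b"%PDF-1.4\n(objects omitted)\nstartxref\n123\n%%EOF\n"
print(pdf_version(sample), find_startxref(sample))
```

Because the cross-reference table stores byte offsets of objects, inserting or removing bytes earlier in the file invalidates those offsets, which is one reason byte-level edits to PDFs are difficult.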
 
As others said, different formats have different headers. Some have variable size headers. I suppose there must be some with no header at all.
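
As a small illustration of how much the leading bytes differ between formats, the sketch below checks a few well-known signatures. The table is deliberately short and the function name sniff_format is an assumption for this example; plain text and HTML have no signature, so they fall through to None.

```python
from typing import Optional

# (signature bytes, description) pairs for a few well-known formats.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container (e.g. docx, odt)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy .doc)"),
    (b"BM", "BMP"),
]


def sniff_format(data: bytes) -> Optional[str]:
    """Guess a format from the first bytes of a file; None if unrecognised."""
    for signature, description in SIGNATURES:
        if data.startswith(signature):
            return description
    return None


print(sniff_format(b"%PDF-1.7\n"))
print(sniff_format(b"<html><body>plain text</body></html>"))
```

A signature only identifies the container; what follows it is laid out differently in each format, so there is no single byte count to skip.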

How would the information on byte count help you? What are you trying to accomplish?
 
I just saw this thread while wandering around the PF site. I think your best option would be to look at the Libre Office source code and see how it handles the various formats. You can start here: https://www.libreoffice.org/about-us/source-code/
 
The reference manual for PDF v 1.4 is:

PDF Reference, third edition, Adobe Portable Document Format, Version 1.4

It is available as a free download from:

https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf

Size: 8.6MB, 978 pages
(that ought'a keep you busy for a while)

Cheers,
Tom
 