{"id":2692,"date":"2021-07-30T02:30:24","date_gmt":"2021-07-30T02:30:24","guid":{"rendered":"https:\/\/blog.aleperno.com\/?p=2692"},"modified":"2021-09-06T23:51:22","modified_gmt":"2021-09-06T23:51:22","slug":"shallow-diving-into-pdfs-embedded-images","status":"publish","type":"post","link":"https:\/\/blog.aleperno.com\/?p=2692","title":{"rendered":"Shallow Diving into PDFs embedded images"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large is-resized\"><img src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/header-edited.png\" alt=\"\" class=\"wp-image-2695\" width=\"679\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/header-edited.png 1255w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/header-edited-300x169.png 300w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/header-edited-1024x575.png 1024w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/header-edited-768x431.png 768w\" sizes=\"(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><figcaption>Yes&#8230;this is me during one of my PADI OWD Classes<\/figcaption><\/figure>\n\n\n\n<p class=\"has-text-align-justify\">Recently I had the chance&#8230;actually the need to have a look into how a given PDF manages its images. Here I&#8217;ll explain what the initial issue was and the findings along my journey. <strong>Disclaimer<\/strong>: This is written more like a logbook \/ narration than an actual article.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p class=\"has-text-align-justify\">Last week I started reading the online material for my PADI OWD (Open Water Diver) course. Yes&#8230;now you realize this post&#8217;s title is very pun-intended, anyway&#8230; To my surprise the content had quite bad <em>responsiveness<\/em>. Some images failed to load, others disappeared if I scrolled up and down. 
Here&#8217;s a tweet showcasing the issue<\/p>\n\n\n\n<figure class=\"wp-block-embed aligncenter is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"525\" data-dnt=\"true\"><p lang=\"es\" dir=\"ltr\">Este es el material online de PADI. En vez de tener un simple PDF y un viewer online no&#8230; Carga un pdf sin im\u00e1genes, y luego las carga din\u00e1micamente.<br><br>No s\u00f3lo te carga m\u00e1s de una vez la misma imagen si vas y volv\u00e9s en las p\u00e1ginas&#8230; Sino que se bugea, no te las carga bien, lento <a href=\"https:\/\/t.co\/yWQhZwJoDm\">pic.twitter.com\/yWQhZwJoDm<\/a><\/p>&mdash; Alejandro Pernin (@alepernin) <a href=\"https:\/\/twitter.com\/alepernin\/status\/1418400824749875204?ref_src=twsrc%5Etfw\">July 23, 2021<\/a><\/blockquote><script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script>\n<\/div><figcaption>Let the rant begin&#8230;<\/figcaption><\/figure>\n\n\n\n<p>Not only that, but the page requested the images more than once, making it even slower. All these made me wonder how this was made and if it could be improved in any way. 
So, into the browser&#8217;s dev tools we go.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"805\" height=\"453\" src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-10-06.png\" alt=\"\" class=\"wp-image-2766\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-10-06.png 805w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-10-06-300x169.png 300w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-10-06-768x432.png 768w\" sizes=\"(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><figcaption>Google Chrome Network Requests<\/figcaption><\/figure>\n\n\n\n<p>The first two interesting elements are a <em>book.pdf<\/em> and a set of <em>img_X.jpg<\/em> files, which are the images shown in the document. Let&#8217;s first try to open the PDF and see what it looks like.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"936\" height=\"536\" src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-10-45.png\" alt=\"\" class=\"wp-image-2767\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-10-45.png 936w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-10-45-300x172.png 300w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-10-45-768x440.png 768w\" sizes=\"(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><figcaption>Attempt to open PDF<\/figcaption><\/figure>\n\n\n\n<p>The document is password-protected. However, since the viewer is able to show the document without asking for any password, we can assume the password is accessible on the client side at some point. 
Let&#8217;s continue browsing the resources.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"636\" height=\"397\" src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-12-30-edited.png\" alt=\"\" class=\"wp-image-2771\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-12-30-edited.png 636w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-12-30-edited-300x187.png 300w\" sizes=\"(max-width: 636px) 100vw, 636px\" \/><figcaption>index.html<\/figcaption><\/figure>\n\n\n\n<p>In the <em>index.html<\/em> two things caught my eye.<\/p>\n\n\n\n<ul><li>The <em>book<\/em> variable, containing base64-encoded data that is then passed to a <em>PdfViewer<\/em> method.<\/li><li>There is something called <em>iSpring<\/em> which seems to be the viewer.<\/li><\/ul>\n\n\n\n<p>Let&#8217;s first add a breakpoint in that method.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"804\" height=\"343\" src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-13-34.png\" alt=\"\" class=\"wp-image-2774\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-13-34.png 804w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-13-34-300x128.png 300w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-13-34-768x328.png 768w\" sizes=\"(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><\/figure>\n\n\n\n<p>The first thing the method does is create a new instance of an object from the base64 data. Inspecting the object&#8217;s attributes we can see something resembling a document name (&#8220;Open Water Manual 02 Intro&#8221;) and other strings. 
After trying a few of them, I found out the <strong>nh<\/strong> attribute was in fact the document password.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"574\" height=\"490\" src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-18-00.png\" alt=\"\" class=\"wp-image-2775\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-18-00.png 574w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-07-31-18-18-00-300x256.png 300w\" sizes=\"(max-width: 574px) 100vw, 574px\" \/><figcaption>bingo!<\/figcaption><\/figure>\n\n\n\n<p>However, there&#8217;s something odd&#8230; The document lacks any images. My guess is that the viewer loads the PDF and then loads the images dynamically as we scroll through the document. This raises some questions.<\/p>\n\n\n\n<ul><li>How does the viewer know which images to load and where to put them?<\/li><li>Is there any way to reverse the process and obtain the original PDF?<\/li><\/ul>\n\n\n\n<p>However, I don&#8217;t want to keep fiddling around with this document \/ page to avoid any ToS \/ Copyright infringement. Luckily, the very same HTML page provides a hint I mentioned previously: the <em>iSpring<\/em> name. 
After some googling I found the following<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"iSpring Flip Overview\" width=\"525\" height=\"295\" src=\"https:\/\/www.youtube.com\/embed\/QPHrFv7_YvA?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>It seems the document viewer is called <strong><a rel=\"noreferrer noopener\" href=\"https:\/\/www.ispringsolutions.com\/ispring-flip\" data-type=\"URL\" data-id=\"https:\/\/www.ispringsolutions.com\/ispring-flip\" target=\"_blank\">iSpring Flip<\/a><\/strong> and fortunately they do offer a trial version of their product.<\/p>\n\n\n\n<p>I used the software with one of my physics assignments, which you can check <a href=\"http:\/\/pdftest.aleperno.com\" data-type=\"URL\" data-id=\"pdftest.aleperno.com\">here<\/a>. 
iSpring gives you a folder with the following contents<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"337\" height=\"527\" src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/Screenshot-from-2021-07-31-18-58-25.png\" alt=\"\" class=\"wp-image-2786\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/Screenshot-from-2021-07-31-18-58-25.png 337w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/Screenshot-from-2021-07-31-18-58-25-192x300.png 192w\" sizes=\"(max-width: 337px) 100vw, 337px\" \/><\/figure>\n\n\n\n<p>The images contained in the <em>res<\/em> folder are indeed the images embedded in the PDF.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"720\" height=\"181\" src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-09-06-19-22-20.png\" alt=\"\" class=\"wp-image-2789\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-09-06-19-22-20.png 720w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/Screenshot-from-2021-09-06-19-22-20-300x75.png 300w\" sizes=\"(max-width: 720px) 100vw, 720px\" \/><\/figure>\n\n\n\n<p>There&#8217;s also a file named <em>book.pdf.js<\/em>, and taking a look into it we find the following<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"555\" src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/book.pdf.js-1024x555.png\" alt=\"\" class=\"wp-image-2790\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/book.pdf.js-1024x555.png 1024w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/book.pdf.js-300x163.png 300w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/book.pdf.js-768x417.png 768w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/07\/book.pdf.js.png 1366w\" sizes=\"(max-width: 767px) 89vw, (max-width: 1000px) 54vw, 
(max-width: 1071px) 543px, 580px\" \/><\/figure>\n\n\n\n<p>I decoded the base64 into another file and opened it, finding a password-protected PDF. Using what I learned above, I found the password. Here is a comparison between the original PDF and the PDF found in the resources, where we can see that some images are missing.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"639\" src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_pdfs-1024x639.png\" alt=\"\" class=\"wp-image-2791\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_pdfs-1024x639.png 1024w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_pdfs-300x187.png 300w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_pdfs-768x479.png 768w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_pdfs-1536x958.png 1536w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_pdfs.png 1675w\" sizes=\"(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><figcaption>Original on the left, processed on the right<\/figcaption><\/figure>\n\n\n\n<p>I opened the PDF file with a text editor and searched for any reference to an &#8220;img_x.jpg&#8221;. <strong>NOTE:<\/strong> The original PDF is compressed, which makes inspecting its contents hard. 
I used the <em><a href=\"https:\/\/github.com\/qpdf\/qpdf\" data-type=\"URL\" data-id=\"https:\/\/github.com\/qpdf\/qpdf\">qpdf<\/a><\/em> tool to decompress it.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"551\" height=\"257\" src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/xobject.png\" alt=\"\" class=\"wp-image-2792\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/xobject.png 551w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/xobject-300x140.png 300w\" sizes=\"(max-width: 551px) 100vw, 551px\" \/><figcaption>Reference of img_1.jpg<\/figcaption><\/figure>\n\n\n\n<p>In the first result we see some kind of object with a reference to an image (with the correct path), some attributes such as height and width (which match the actual image size) and some sort of stream.<\/p>\n\n\n\n<p>Taking a look into the <a href=\"https:\/\/www.adobe.com\/content\/dam\/acom\/en\/devnet\/pdf\/pdfs\/PDF32000_2008.pdf\" data-type=\"URL\" data-id=\"https:\/\/www.adobe.com\/content\/dam\/acom\/en\/devnet\/pdf\/pdfs\/PDF32000_2008.pdf\">PDF specification<\/a> we can find the following definition:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>An external object (commonly called an XObject) is a graphics object whose contents are defined by a self-contained stream, separate from the content stream in which it is used.<\/p><\/blockquote>\n\n\n\n<p>It looks like we can include an image just by placing its data inside the <em>stream \/ endstream<\/em> block, so let&#8217;s test it. 
I opened a Jupyter Notebook and started fiddling around, ending up with<\/p>\n\n\n\n<p>[gist]https:\/\/gist.github.com\/aleperno\/33a61a582d53edfd736686fe0e9292a7[\/gist]<\/p>\n\n\n\n<p>This code basically does the following:<\/p>\n\n\n\n<ul><li>Finds every object of type XObject and subtype Image<\/li><li>Checks it contains a property F matching &#8216;data\/res&#8217; and retrieves the image name<\/li><li>Using the image name, tries to load the image from a given folder and writes its data into that object&#8217;s stream in the file.<\/li><\/ul>\n\n\n\n<p>And as a result we obtain a PDF with images just like the original one<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"640\" src=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_2-1024x640.png\" alt=\"\" class=\"wp-image-2798\" srcset=\"https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_2-1024x640.png 1024w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_2-300x188.png 300w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_2-768x480.png 768w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_2-1536x960.png 1536w, https:\/\/blog.aleperno.com\/wp-content\/uploads\/2021\/09\/compared_2.png 1676w\" sizes=\"(max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><figcaption>Original (left) vs New (right)<\/figcaption><\/figure>\n\n\n\n<h2>Learnings<\/h2>\n\n\n\n<p>PDFs have a simple way to include images: embedding a stream. This also allows us to use the same image multiple times without duplicating the stream, by using references to the object instead.<\/p>\n\n\n\n<p>Also, because they are streams, we can extract images, modify them and re-add them to the file without much hassle. This could be useful when trying to reduce a document&#8217;s size, since we can try to compress the images.<\/p>\n\n\n\n<p>An important side note is that, being streams, images conserve their metadata, which 
could be &#8220;dangerous&#8221; by enabling anyone to extract the images&#8217; EXIF info.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recently I had the chance&#8230;actually the need to have a look into how a given PDF manages its images. Here I&#8217;ll explain what the initial issue was and the findings along my journey. Disclaimer: This is written more like a logbook \/ narration than an actual article.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.aleperno.com\/index.php?rest_route=\/wp\/v2\/posts\/2692"}],"collection":[{"href":"https:\/\/blog.aleperno.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.aleperno.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.aleperno.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.aleperno.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2692"}],"version-history":[{"count":21,"href":"https:\/\/blog.aleperno.com\/index.php?rest_route=\/wp\/v2\/posts\/2692\/revisions"}],"predecessor-version":[{"id":2801,"href":"https:\/\/blog.aleperno.com\/index.php?rest_route=\/wp\/v2\/posts\/2692\/revisions\/2801"}],"wp:attachment":[{"href":"https:\/\/blog.aleperno.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2692"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.aleperno.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2692"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.aleperno.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2692"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}