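What a zip archive looks like and what we can do with it. Part 2: Data Descriptor and compression

A continuation of the article "What a zip archive looks like and what we can do with it".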


Foreword


Hello everyone.
We are on the air again with some unconventional PHP programming.


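In the previous article, our esteemed readers showed interest in zip compression and zip streaming. Today we will try to open up this topic a little.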


Let's take a look.

Code from the previous article
    <?php

    // Two text files (1.txt and 2.txt) that will go into the archive:
    $entries = [
        '1.txt' => 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc id ante ultrices, fermentum nibh eleifend, ullamcorper nunc. Sed dignissim ut odio et imperdiet. Nunc id felis et ligula viverra blandit a sit amet magna. Vestibulum facilisis venenatis enim sed bibendum. Duis maximus felis in suscipit bibendum. Mauris suscipit turpis eleifend nibh commodo imperdiet. Donec tincidunt porta interdum. Aenean interdum condimentum ligula, vitae ornare lorem auctor in. Suspendisse metus ipsum, porttitor et sapien id, fringilla aliquam nibh. Curabitur sem lacus, ultrices quis felis sed, blandit commodo metus. Duis tincidunt vel mauris at accumsan. Integer et ipsum fermentum leo viverra blandit.',
        '2.txt' => 'Mauris in purus sit amet ante tempor finibus nec sed justo. Integer ac nibh tempus, mollis sem vel, consequat diam. Pellentesque ut condimentum ex. Praesent finibus volutpat gravida. Vivamus eleifend neque sit amet diam scelerisque lacinia. Nunc imperdiet augue in suscipit lacinia. Curabitur orci diam, iaculis non ligula vitae, porta pellentesque est. Duis dolor erat, placerat a lacus eu, scelerisque egestas massa. Aliquam molestie pulvinar faucibus. Quisque consequat, dolor mattis lacinia pretium, eros eros tempor neque, volutpat consectetur elit elit non diam. In faucibus nulla justo, non dignissim erat maximus consectetur. Sed porttitor turpis nisl, elementum aliquam dui tincidunt nec. Nunc eu enim at nibh molestie porta ut ac erat. Sed tortor sem, mollis eget sodales vel, faucibus in dolor.',
    ];

    // The archive is saved as Lorem.zip in the current working directory.
    $destination = 'Lorem.zip';
    $handle = fopen($destination, 'w');

    // Count the bytes written so far and collect the data needed later
    // for the Central Directory File Headers.
    $written = 0;
    $dictionary = [];

    foreach ($entries as $filename => $content) {
        // Fields shared by the Local File Header and the Central Directory File Header.
        $fileInfo = [
            // Minimum version needed to extract
            'versionToExtract' => 10,
            // 0: no encryption, no Data Descriptor
            'generalPurposeBitFlag' => 0,
            // 0: the data is stored without compression
            'compressionMethod' => 0,
            // Modification time and date in MS-DOS format (hard-coded here)
            'modificationTime' => 28021,
            'modificationDate' => 20072,
            // Checksum of the uncompressed content
            'crc32' => hexdec(hash('crc32b', $content)),
            // Without compression both sizes are the same
            'compressedSize' => $size = strlen($content),
            'uncompressedSize' => $size,
            // Length of the file name
            'filenameLength' => strlen($filename),
            // No extra field, so its length is 0
            'extraFieldLength' => 0,
        ];

        // Pack the Local File Header.
        // Note: 'L' and 'S' use the machine byte order; ZIP is little-endian,
        // so this assumes a little-endian host ('V' and 'v' are the portable codes).
        $LFH = pack('LSSSSSLLLSSa*', ...array_values([
            'signature' => 0x04034b50, // Local File Header signature
        ] + $fileInfo + ['filename' => $filename]));

        // Remember everything needed to write the Central Directory File Header later.
        $dictionary[$filename] = [
            'signature' => 0x02014b50, // Central Directory File Header signature
            'versionMadeBy' => 798,    // 0x031E: made on Unix, spec version 3.0
        ] + $fileInfo + [
            'fileCommentLength' => 0,      // No comments
            'diskNumber' => 0,             // Single-disk archive, so always 0
            'internalFileAttributes' => 0,
            'externalFileAttributes' => 2176057344, // 0x81B40000: Unix mode 0100664 in the high 16 bits
            'localFileHeaderOffset' => $written,    // Offset of this entry's Local File Header
            'filename' => $filename,
        ];

        // Write the header, then the file content.
        $written += fwrite($handle, $LFH);
        $written += fwrite($handle, $content);
    }

    // All file data is written; prepare the End of Central Directory record (EOCD).
    $EOCD = [
        'signature' => 0x06054b50, // EOCD signature
        'diskNumber' => 0,         // Single-disk archive
        'startDiskNumber' => 0,
        'numberCentralDirectoryRecord' => $records = count($dictionary),
        'totalCentralDirectoryRecord' => $records,
        // Not known yet; filled in after the Central Directory is written
        'sizeOfCentralDirectory' => 0,
        // The Central Directory starts right after the file data
        'centralDirectoryOffset' => $written,
        'commentLength' => 0
    ];

    // Write the Central Directory File Headers.
    foreach ($dictionary as $entryInfo) {
        $CDFH = pack('LSSSSSSLLLSSSSSLLa*', ...array_values($entryInfo));
        $written += fwrite($handle, $CDFH);
    }

    // Now the size of the Central Directory is known.
    $EOCD['sizeOfCentralDirectory'] = $written - $EOCD['centralDirectoryOffset'];

    // Write the End of Central Directory record.
    $EOCD = pack('LSSSSLLS', ...array_values($EOCD));
    $written += fwrite($handle, $EOCD);

    // Done.
    fclose($handle);

    echo 'Zip archive size: ' . $written . ' bytes' . PHP_EOL;
    echo 'To validate the archive, run `unzip -tq ' . $destination . '`' . PHP_EOL;
    echo PHP_EOL;


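What is wrong with it? Well, to be fair, its only merit is that it works, and there are still problems with it.

In my opinion, the main problem is that we have to write the Local File Header (LFH) with the crc32 and the file length first, and only then the content of the file itself.
What does this threaten us with? Either we load the whole file into memory, compute its crc32, write the LFH and then the file content, which is economical in terms of I/O but unacceptable for large files. Or we read the file twice, first to compute the hash and then to read the content and write it to the archive, which is economical in terms of RAM but, for one thing, doubles the load on the drive, which is not necessarily an SSD.

And what if the file is remote and weighs, say, 1.5 GB? Then we either load all 1.5 GB into memory, or wait until all 1.5 GB have been downloaded to compute the hash and then download them again to write the content. And if we want to serve, for example, a database dump on the fly, reading it from stdout, this is unacceptable altogether: the data in the database has changed, the dump will be different, the hash will not match, and we get an invalid archive. Yes, things look bad.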


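The Data Descriptor structure for streaming archive writing


But do not be upset: the ZIP specification allows us to write the data first and then append, right after the data, a Data Descriptor (DD) structure that holds the crc32, the length of the compressed data and the length of the uncompressed data. To do this, all we need in the LFH is to set generalPurposeBitFlag to 0x0008 and to set crc32, compressedSize and uncompressedSize to 0. Then, after the data, we write the DD structure, which looks like this: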



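    pack('LLLL', ...array_values([
        'signature' => 0x08074b50,              // Data Descriptor signature
        'crc32' => $crc32,                      // crc32 of our file
        'compressedSize' => $compressedSize,    // Length of the compressed data
        'uncompressedSize' => $uncompressedSize, // Length of the uncompressed data
    ]));

In the Central Directory File Header (CDFH), only generalPurposeBitFlag changes; the rest of the data must be real. That is not a problem, because we write the CDFH after all the data, so by then we know the hash and the data lengths anyway.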
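To make the layout concrete, here is a minimal sketch that packs a Data Descriptor for a short string and reads the fields back. The helper name `packDataDescriptor` is hypothetical and not part of the article's script, and the sketch uses the `V` format code (32-bit little-endian) instead of `L`, on the assumption that a portable byte order is wanted. It also assumes 64-bit PHP, where `hexdec()` returns an integer for a crc32 value.

```php
<?php
// Hypothetical helper for illustration only; not part of the article's script.
function packDataDescriptor(int $crc32, int $compressedSize, int $uncompressedSize): string
{
    // 'V' is an unsigned 32-bit little-endian integer, the byte order ZIP uses.
    return pack('VVVV', 0x08074b50, $crc32, $compressedSize, $uncompressedSize);
}

$content = 'Lorem ipsum';
$crc32 = hexdec(hash('crc32b', $content));

// Stored (uncompressed) data: both sizes equal the content length.
$dd = packDataDescriptor($crc32, strlen($content), strlen($content));

// Read the fields back to confirm the layout: four fields of 4 bytes each.
$fields = unpack('Vsignature/Vcrc32/VcompressedSize/VuncompressedSize', $dd);

echo strlen($dd) . ' bytes, signature ' . sprintf('0x%08x', $fields['signature']) . PHP_EOL;
```

The descriptor always has the same size, so writing it after the data costs nothing in terms of buffering.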


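That is all there is to the theory. It only remains to implement it in PHP.
The standard Hash library will help us a lot here. We can create a hashing context, feed it the data chunk by chunk, and get the hash value at the end. This solution is, of course, a little more cumbersome than hash('crc32b', $content), but it saves us an enormous amount of resources and time.

It looks something like this: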

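    $hashCtx = hash_init('crc32b');
    $handle = fopen($source, 'r');

    while (!feof($handle)) {
        // Read the file in 8 KB chunks and feed each one to the hashing context.
        $chunk = fread($handle, 8 * 1024);
        hash_update($hashCtx, $chunk);
        $chunk = null;
    }

    fclose($handle);
    $hash = hash_final($hashCtx);

If everything is done correctly, the value will be no different from hash_file('crc32b', $source) or hash('crc32b', file_get_contents($source)).

Let's try to wrap all this into a single function, so that we can read a file in a convenient way and get its hash and length at the end. Generators will help us with that: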
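The claim is easy to check on a local file. The sketch below writes a temporary file (the path comes from `tempnam`, and the sample text is arbitrary), hashes it in 8 KB chunks, and compares the result with the two one-shot variants.

```php
<?php
// Create a temporary file of 112,000 bytes so that several 8 KB chunks are read.
$source = tempnam(sys_get_temp_dir(), 'crc');
file_put_contents($source, str_repeat('Lorem ipsum dolor sit amet. ', 4000));

// Incremental hashing: feed the context chunk by chunk.
$hashCtx = hash_init('crc32b');
$handle = fopen($source, 'r');
while (!feof($handle)) {
    hash_update($hashCtx, fread($handle, 8 * 1024));
}
fclose($handle);
$incremental = hash_final($hashCtx);

// One-shot variants for comparison.
$oneShot = hash_file('crc32b', $source);
$inMemory = hash('crc32b', file_get_contents($source));

unlink($source);

var_dump($incremental === $oneShot && $oneShot === $inMemory);
```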

    function read(string $path): \Generator
    {
        $length = 0;
        $handle = fopen($path, 'r');
        $hashCtx = hash_init('crc32b');

        while (!feof($handle)) {
            $chunk = fread($handle, 8 * 1024);
            $length += strlen($chunk);
            hash_update($hashCtx, $chunk);

            // Hand the chunk to the caller, then release it.
            yield $chunk;
            $chunk = null;
        }

        fclose($handle);

        // The totals become available through Generator::getReturn().
        return ['length' => $length, 'crc32' => hexdec(hash_final($hashCtx))];
    }

And now we can do this:

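    $reader = read('https://speed.hetzner.de/1GB.bin');
    foreach ($reader as $chunk) {
        // do something with the chunk here
    }

    // The return value is available once the generator has finished.
    ['length' => $length, 'crc32' => $crc32] = $reader->getReturn();

    echo round(memory_get_peak_usage(true) / 1024 / 1024, 2) . 'MB - Memory Peak Usage' . PHP_EOL;

In my opinion, this is quite simple and convenient. For a 1 GB file, my peak memory consumption was 2 MB.

Now let's try to modify the code from the previous article so that it can use this function.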
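The part that may look unusual is the generator returning its totals at the end. Here is that mechanism in isolation, as a small sketch: `countBytes` is a hypothetical stand-in for `read` that works on an in-memory list of chunks instead of a file.

```php
<?php
// Hypothetical stand-in for read(): yields chunks and returns their total length.
function countBytes(iterable $chunks): \Generator
{
    $length = 0;
    foreach ($chunks as $chunk) {
        $length += strlen($chunk);
        yield $chunk;
    }

    // This value is not yielded; it is exposed through getReturn().
    return $length;
}

$reader = countBytes(['Lorem ', 'ipsum ', 'dolor']);

$joined = '';
foreach ($reader as $chunk) {
    $joined .= $chunk;
}

// Only valid after the loop: the generator must have finished.
$length = $reader->getReturn();

echo $joined . ' (' . $length . ' bytes)' . PHP_EOL;
```

Calling `getReturn()` before the generator has run to completion throws an exception, which is why the article's code reads the totals only after the `foreach` loop.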

Final script

    <?php

    function read(string $path): \Generator
    {
        $length = 0;
        $handle = fopen($path, 'r');
        $hashCtx = hash_init('crc32b');

        while (!feof($handle)) {
            $chunk = fread($handle, 8 * 1024);
            $length += strlen($chunk);
            hash_update($hashCtx, $chunk);

            yield $chunk;
            $chunk = null;
        }

        fclose($handle);

        return ['length' => $length, 'crc32' => hexdec(hash_final($hashCtx))];
    }

    $entries = ['https://speed.hetzner.de/100MB.bin', __FILE__];

    $destination = 'test.zip';
    $handle = fopen($destination, 'w');

    $written = 0;
    $dictionary = [];

    foreach ($entries as $entry) {
        $filename = basename($entry);

        $fileInfo = [
            'versionToExtract' => 10,
            // To use the Data Descriptor, this flag must be 0x0008
            // rather than 0x0000 as in the previous version.
            'generalPurposeBitFlag' => 0x0008,
            'compressionMethod' => 0,
            'modificationTime' => 28021,
            'modificationDate' => 20072,
            // Unknown at this point; the real values go into the Data Descriptor.
            'crc32' => 0,
            'compressedSize' => 0,
            'uncompressedSize' => 0,
            'filenameLength' => strlen($filename),
            'extraFieldLength' => 0,
        ];

        $LFH = pack('LSSSSSLLLSSa*', ...array_values([
            'signature' => 0x04034b50,
        ] + $fileInfo + ['filename' => $filename]));

        $fileOffset = $written;
        $written += fwrite($handle, $LFH);

        // Open the file for chunked reading.
        $reader = read($entry);

        foreach ($reader as $chunk) {
            // Write each chunk to the archive as soon as it is read.
            $written += fwrite($handle, $chunk);
            $chunk = null;
        }

        // Now the length and the hash of the file are known.
        ['length' => $length, 'crc32' => $crc32] = $reader->getReturn();

        // Put the real values into $fileInfo; they are needed for the CDFH.
        $fileInfo['crc32'] = $crc32;
        $fileInfo['compressedSize'] = $length;
        $fileInfo['uncompressedSize'] = $length;

        // Write the Data Descriptor.
        $DD = pack('LLLL', ...array_values([
            'signature' => 0x08074b50,
            'crc32' => $fileInfo['crc32'],
            'compressedSize' => $fileInfo['compressedSize'],
            'uncompressedSize' => $fileInfo['uncompressedSize'],
        ]));

        $written += fwrite($handle, $DD);

        $dictionary[$filename] = [
            'signature' => 0x02014b50,
            'versionMadeBy' => 798,
        ] + $fileInfo + [
            'fileCommentLength' => 0,
            'diskNumber' => 0,
            'internalFileAttributes' => 0,
            'externalFileAttributes' => 2176057344,
            'localFileHeaderOffset' => $fileOffset,
            'filename' => $filename,
        ];
    }

    $EOCD = [
        'signature' => 0x06054b50,
        'diskNumber' => 0,
        'startDiskNumber' => 0,
        'numberCentralDirectoryRecord' => $records = count($dictionary),
        'totalCentralDirectoryRecord' => $records,
        'sizeOfCentralDirectory' => 0,
        'centralDirectoryOffset' => $written,
        'commentLength' => 0
    ];

    foreach ($dictionary as $entryInfo) {
        $CDFH = pack('LSSSSSSLLLSSSSSLLa*', ...array_values($entryInfo));
        $written += fwrite($handle, $CDFH);
    }

    $EOCD['sizeOfCentralDirectory'] = $written - $EOCD['centralDirectoryOffset'];

    $EOCD = pack('LSSSSLLS', ...array_values($EOCD));
    $written += fwrite($handle, $EOCD);

    fclose($handle);

    echo 'Memory used: ' . memory_get_peak_usage(true) . ' bytes' . PHP_EOL;
    echo 'Zip archive size: ' . $written . ' bytes' . PHP_EOL;
    echo 'Result of `unzip -tq ' . $destination . '`: ' . PHP_EOL;
    echo '> ' . exec('unzip -tq ' . $destination) . PHP_EOL;
    echo PHP_EOL;


As output we should get a Zip archive named test.zip, which contains a file with the script above and 100MB.bin, about 100 MB in size.

Compression in zip archives


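We now have practically everything we need to compress data and to do it on the fly.
Just as we obtain a hash by feeding small chunks to a function, we can also compress data chunk by chunk, thanks to the wonderful Zlib library and its deflate_init and deflate_add functions.


It looks something like this: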

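    // Fragment: meant to run inside a generator function.
    $deflateCtx = deflate_init(ZLIB_ENCODING_RAW, ['level' => 6]);
    $handle = fopen($source, 'r');

    while (!feof($handle)) {
        $chunk = fread($handle, 8 * 1024);

        // Finish the stream on the last chunk, otherwise just flush.
        yield deflate_add($deflateCtx, $chunk, feof($handle) ? ZLIB_FINISH : ZLIB_SYNC_FLUSH);
        $chunk = null;
    }

I came across a variant like the one below which, compared with the previous one, appends a few zero bytes at the end.

    while (!feof($handle)) {
        $chunk = fread($handle, 8 * 1024);
        yield deflate_add($deflateCtx, $chunk, ZLIB_SYNC_FLUSH);
    }

    yield deflate_add($deflateCtx, '', ZLIB_FINISH);

But unzip complained about it, so I had to give up this simplification.

Let's fix our reader so that it compresses the data right away and at the end returns the hash, the length of the uncompressed data and the length of the compressed data: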
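To see that the first variant produces a valid raw Deflate stream, here is a minimal round-trip sketch. It works on an in-memory string split with `str_split` instead of a file handle (an assumption made so the example is self-contained), compresses it chunk by chunk, and inflates the result with `gzinflate`.

```php
<?php
// Sample data: 114,000 bytes of repetitive text, split into 8 KB chunks.
$original = str_repeat('Lorem ipsum dolor sit amet, consectetur adipiscing elit. ', 2000);
$chunks = str_split($original, 8 * 1024);
$last = count($chunks) - 1;

$deflateCtx = deflate_init(ZLIB_ENCODING_RAW, ['level' => 6]);

$compressed = '';
foreach ($chunks as $i => $chunk) {
    // Same rule as in the article: ZLIB_FINISH on the last chunk, ZLIB_SYNC_FLUSH otherwise.
    $compressed .= deflate_add($deflateCtx, $chunk, $i === $last ? ZLIB_FINISH : ZLIB_SYNC_FLUSH);
}

// gzinflate() decodes raw Deflate data, which is what ZLIB_ENCODING_RAW produces.
$restored = gzinflate($compressed);

echo strlen($original) . ' -> ' . strlen($compressed) . ' bytes' . PHP_EOL;
var_dump($restored === $original);
```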

    function read(string $path): \Generator
    {
        $uncompressedSize = 0;
        $compressedSize = 0;

        $hashCtx = hash_init('crc32b');
        $deflateCtx = deflate_init(ZLIB_ENCODING_RAW, ['level' => 6]);

        $handle = fopen($path, 'r');

        while (!feof($handle)) {
            $chunk = fread($handle, 8 * 1024);

            // The checksum is computed over the uncompressed data.
            hash_update($hashCtx, $chunk);
            $compressedChunk = deflate_add($deflateCtx, $chunk, feof($handle) ? ZLIB_FINISH : ZLIB_SYNC_FLUSH);

            $uncompressedSize += strlen($chunk);
            $compressedSize += strlen($compressedChunk);

            yield $compressedChunk;

            $chunk = null;
            $compressedChunk = null;
        }

        fclose($handle);

        return [
            'uncompressedSize' => $uncompressedSize,
            'compressedSize' => $compressedSize,
            'crc32' => hexdec(hash_final($hashCtx))
        ];
    }

And let's try it on a 100 MB file:

    $reader = read('https://speed.hetzner.de/100MB.bin');
    foreach ($reader as $chunk) {
        // do something with the chunk here
    }

    ['uncompressedSize' => $uncompressedSize, 'compressedSize' => $compressedSize, 'crc32' => $crc32] = $reader->getReturn();

    echo 'Uncompressed size: ' . $uncompressedSize . PHP_EOL;
    echo 'Compressed size: ' . $compressedSize . PHP_EOL;
    echo round(memory_get_peak_usage(true) / 1024 / 1024, 2) . 'MB - Memory Peak Usage' . PHP_EOL;

The memory consumption still shows that we are not loading the whole file into memory.

Let's put it all together and finally get a real archiver script.
Unlike the previous version, our generalPurposeBitFlag changes (its value is now 0x0018), and so does compressionMethod, which becomes 8 (meaning Deflate).
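A quick way to check what actually ends up in a header is to pack one and read the two fields back. The sketch below builds a Local File Header with the same values as in the article for a hypothetical file name `test.txt`, using the little-endian codes `V` and `v` rather than `L` and `S`, and then tests the Data Descriptor bit and the compression method.

```php
<?php
$filename = 'test.txt'; // hypothetical name, for illustration only

// Same field order as the article's LFH: signature, versionToExtract,
// generalPurposeBitFlag, compressionMethod, modificationTime, modificationDate,
// crc32, compressedSize, uncompressedSize, filenameLength, extraFieldLength, filename.
$LFH = pack(
    'VvvvvvVVVvva*',
    0x04034b50, 10, 0x0018, 8, 28021, 20072, 0, 0, 0, strlen($filename), 0, $filename
);

// Read back only the first four fields.
$header = unpack('Vsignature/vversionToExtract/vgeneralPurposeBitFlag/vcompressionMethod', $LFH);

// Bit 3 (0x0008) announces that a Data Descriptor follows the file data.
$hasDataDescriptor = ($header['generalPurposeBitFlag'] & 0x0008) !== 0;
$isDeflate = $header['compressionMethod'] === 8;

var_dump($hasDataDescriptor, $isDeflate);
```

The fixed part of the header is 30 bytes, so with an 8-character name the packed string is 38 bytes long.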

Final script

    <?php

    function read(string $path): \Generator
    {
        $uncompressedSize = 0;
        $compressedSize = 0;

        $hashCtx = hash_init('crc32b');
        $deflateCtx = deflate_init(ZLIB_ENCODING_RAW, ['level' => 6]);

        $handle = fopen($path, 'r');

        while (!feof($handle)) {
            $chunk = fread($handle, 8 * 1024);

            hash_update($hashCtx, $chunk);
            $compressedChunk = deflate_add($deflateCtx, $chunk, feof($handle) ? ZLIB_FINISH : ZLIB_SYNC_FLUSH);

            $uncompressedSize += strlen($chunk);
            $compressedSize += strlen($compressedChunk);

            yield $compressedChunk;

            $chunk = null;
            $compressedChunk = null;
        }

        fclose($handle);

        return [
            'uncompressedSize' => $uncompressedSize,
            'compressedSize' => $compressedSize,
            'crc32' => hexdec(hash_final($hashCtx))
        ];
    }

    $entries = ['https://speed.hetzner.de/100MB.bin', __FILE__];

    $destination = 'test.zip';
    $handle = fopen($destination, 'w');

    $written = 0;
    $dictionary = [];

    foreach ($entries as $entry) {
        $filename = basename($entry);

        $fileInfo = [
            'versionToExtract' => 10,
            // The flag changes again: now 0x0018 instead of 0x0008.
            'generalPurposeBitFlag' => 0x0018,
            'compressionMethod' => 8, // Compression method: 8 is Deflate
            'modificationTime' => 28021,
            'modificationDate' => 20072,
            'crc32' => 0,
            'compressedSize' => 0,
            'uncompressedSize' => 0,
            'filenameLength' => strlen($filename),
            'extraFieldLength' => 0,
        ];

        $LFH = pack('LSSSSSLLLSSa*', ...array_values([
            'signature' => 0x04034b50,
        ] + $fileInfo + ['filename' => $filename]));

        $fileOffset = $written;
        $written += fwrite($handle, $LFH);

        // The reader yields already compressed chunks.
        $reader = read($entry);

        foreach ($reader as $chunk) {
            $written += fwrite($handle, $chunk);
            $chunk = null;
        }

        [
            'uncompressedSize' => $uncompressedSize,
            'compressedSize' => $compressedSize,
            'crc32' => $crc32
        ] = $reader->getReturn();

        $fileInfo['crc32'] = $crc32;
        $fileInfo['compressedSize'] = $compressedSize;
        $fileInfo['uncompressedSize'] = $uncompressedSize;

        $DD = pack('LLLL', ...array_values([
            'signature' => 0x08074b50,
            'crc32' => $fileInfo['crc32'],
            'compressedSize' => $fileInfo['compressedSize'],
            'uncompressedSize' => $fileInfo['uncompressedSize'],
        ]));

        $written += fwrite($handle, $DD);

        $dictionary[$filename] = [
            'signature' => 0x02014b50,
            'versionMadeBy' => 798,
        ] + $fileInfo + [
            'fileCommentLength' => 0,
            'diskNumber' => 0,
            'internalFileAttributes' => 0,
            'externalFileAttributes' => 2176057344,
            'localFileHeaderOffset' => $fileOffset,
            'filename' => $filename,
        ];
    }

    $EOCD = [
        'signature' => 0x06054b50,
        'diskNumber' => 0,
        'startDiskNumber' => 0,
        'numberCentralDirectoryRecord' => $records = count($dictionary),
        'totalCentralDirectoryRecord' => $records,
        'sizeOfCentralDirectory' => 0,
        'centralDirectoryOffset' => $written,
        'commentLength' => 0
    ];

    foreach ($dictionary as $entryInfo) {
        $CDFH = pack('LSSSSSSLLLSSSSSLLa*', ...array_values($entryInfo));
        $written += fwrite($handle, $CDFH);
    }

    $EOCD['sizeOfCentralDirectory'] = $written - $EOCD['centralDirectoryOffset'];

    $EOCD = pack('LSSSSLLS', ...array_values($EOCD));
    $written += fwrite($handle, $EOCD);

    fclose($handle);

    echo 'Memory used: ' . memory_get_peak_usage(true) . ' bytes' . PHP_EOL;
    echo 'Zip archive size: ' . $written . ' bytes' . PHP_EOL;
    echo 'Result of `unzip -tq ' . $destination . '`: ' . PHP_EOL;
    echo '> ' . exec('unzip -tq ' . $destination) . PHP_EOL;
    echo PHP_EOL;

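As a result I got an archive of 360183 bytes (our 100 MB file compressed very well; most likely it is just a run of identical bytes), and unzip found no errors in the archive.

Conclusion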
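How well a run of identical bytes compresses can be tried locally. The sketch below assumes the test file is something like a block of zero bytes (the article only guesses at this) and compresses 1 MB of zeros twice: in one call with `gzdeflate`, and in 8 KB chunks with `ZLIB_SYNC_FLUSH` as the reader does. The exact sizes depend on the zlib build, so the script prints them rather than assuming them.

```php
<?php
// 1 MB of identical bytes, an assumed stand-in for the downloaded test file.
$data = str_repeat("\0", 1024 * 1024);

// One-shot raw Deflate at level 6.
$oneShot = gzdeflate($data, 6);

// Chunked raw Deflate, flushing after every 8 KB chunk as in read().
$deflateCtx = deflate_init(ZLIB_ENCODING_RAW, ['level' => 6]);
$chunks = str_split($data, 8 * 1024);
$last = count($chunks) - 1;

$chunked = '';
foreach ($chunks as $i => $chunk) {
    $chunked .= deflate_add($deflateCtx, $chunk, $i === $last ? ZLIB_FINISH : ZLIB_SYNC_FLUSH);
}

echo 'Input:    ' . strlen($data) . ' bytes' . PHP_EOL;
echo 'One-shot: ' . strlen($oneShot) . ' bytes' . PHP_EOL;
echo 'Chunked:  ' . strlen($chunked) . ' bytes' . PHP_EOL;
```

Each flush emits some output for its chunk, so the chunked size grows with the number of chunks; that is the price paid for keeping memory usage flat.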


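If I have enough energy and time for another article, I will try to show how, and most importantly why, all of this can be used.

If there is anything else on this topic you are interested in, suggest it in the comments and I will try to answer your question. We will most likely not deal with encryption, because the script has grown quite a bit as it is, and it seems to me that in real life such archives are not used very often.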



Thank you for your attention and for your comments.

Source: https://habr.com/ru/post/zh-CN472966/

