
Reduce memory usage when updating ZipArchives #102704

Open · wants to merge 4 commits into base: main
Conversation

@edwardneal (Contributor) commented May 26, 2024

There are a handful of different issues surrounding ZipArchive, so this PR touches on a few. It makes ZipArchive more conservative when writing in WriteFile, by tracking the kind of dirty state of each archive entry and of the archive itself.

Previously, every ZipArchiveEntry's data would be loaded into memory, the output stream would have its length set to zero, then the archive entries would be written out in sequence. This naturally causes problems when working with very large ZIP files.

I've changed this behaviour by forcing the ZipArchive to track the offset of its first deleted entry, _firstDeletedEntryOffset. When writing a file:

  • Find the offset of the first entry with any changes at all (startingOffset), and the first entry with changes to its dynamic-length metadata or entry data (dynamicDirtyStartingOffset).
  • If any entries have an offset greater than startingOffset, add them to the list (entriesToWrite) of entries to persist.
  • If any entries have an offset greater than dynamicDirtyStartingOffset, load their contents into memory.
  • Position the stream offset at startingOffset, then start writing each entry in entriesToWrite. When writing:
    • If the entry is dirty at all (or lies beyond dynamicDirtyStartingOffset), write the header. If not, seek past it.
    • If the entry has changes to its data, write the data (loaded in an earlier step). If not, seek past it.
  • If necessary, write (or seek past) each entry's central directory header. Track whether or not any of these were written.
  • If any central directory headers were written (or if there are no entries in the archive), write the end-of-central-directory block. If not, seek past it.
  • Shrink the stream to its current position.

This relies upon the list of the ZipArchive's existing ZipArchiveEntry records being sorted in offset order, which is now guaranteed on load.
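The planning steps above can be sketched language-agnostically. This is a minimal Python sketch, not the PR's C# code: `Entry`, `ChangeState`, `plan_write` and the flag names are simplified stand-ins I've chosen for illustration, and I've assumed the comparisons against the two offsets are inclusive.

```python
from dataclasses import dataclass
from enum import IntFlag

class ChangeState(IntFlag):
    UNCHANGED = 0       # explicitly zero-valued, as in the PR
    FIXED_LENGTH = 1    # e.g. timestamps/CRC: rewritten in place, same size
    DYNAMIC_LENGTH = 2  # e.g. renamed entry: on-disk size may change
    STORED_DATA = 4     # entry contents replaced

@dataclass
class Entry:
    offset: int
    changes: ChangeState

def plan_write(entries, first_deleted_entry_offset):
    """Mirror the planning steps above. entries must be sorted by offset,
    which the PR guarantees on load. first_deleted_entry_offset acts as a
    sentinel (archive end when nothing was deleted)."""
    starting_offset = first_deleted_entry_offset
    dynamic_dirty_starting_offset = first_deleted_entry_offset
    for e in entries:
        if e.changes != ChangeState.UNCHANGED:
            starting_offset = min(starting_offset, e.offset)
        if e.changes & (ChangeState.DYNAMIC_LENGTH | ChangeState.STORED_DATA):
            dynamic_dirty_starting_offset = min(dynamic_dirty_starting_offset, e.offset)
    # Entries from startingOffset onwards are persisted; entries from
    # dynamicDirtyStartingOffset onwards must be buffered in memory first.
    entries_to_write = [e for e in entries if e.offset >= starting_offset]
    entries_to_load = [e for e in entries if e.offset >= dynamic_dirty_starting_offset]
    return entries_to_write, entries_to_load
```

With only a fixed-length change at offset 100 and a data change at offset 200, the entry at offset 0 is skipped entirely, and only the data-changed entry needs loading into memory.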

Issue links:

There's also an issue (1544) which relates to ZipArchiveEntry storing uncompressed data in memory unnecessarily. This PR doesn't fix that issue, but it ensures that whoever does fix it won't then discover that ZipArchive.Dispose loads the contents into memory anyway.

I've added a number of test cases which cover the corner cases I've thought of, verifying the correct number of writes and the contents of the files as they go. It's quite a core part of ZipArchive though and I'm conscious that a number of libraries depend upon it, so I'm open to adding more.

Edit: From benchmarking the "append case", my results were pretty much as I expected: performance improves, but becomes more variable. Updating/deleting entries at the end of an archive (or after the largest entries in the archive) becomes faster and uses less memory, while updating/deleting entries at the start of an archive takes about as long as it did before.


BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3593/23H2/2023Update/SunValley3)
Intel Core i7-8565U CPU 1.80GHz (Whiskey Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]     : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX2
  Job-QKCECQ : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Toolchain=CoreRun 

| Benchmark | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|-----------|------|-------|--------|------|------|------|-----------|
| Pre-PR | 5.128 s | 0.1346 s | 0.3925 s | 1000.0000 | 1000.0000 | 1000.0000 | 2 GB |
| Post-PR | 8.169 ms | 0.2386 ms | 0.7036 ms | - | - | - | 7.24 KB |
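As a hedged illustration of the "append case" being measured (this is not the PR's BenchmarkDotNet harness), Python's zipfile module in append mode demonstrates the same principle: appending writes only the new local entry and the rewritten central directory, leaving the existing bytes untouched.

```python
import io
import zipfile

def build_archive(entry_count=100, entry_size=1024):
    # Build an in-memory archive standing in for the benchmark's input.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i in range(entry_count):
            zf.writestr(f"entry-{i}.bin", b"x" * entry_size)
    return buf

def append_entry(buf, name="appended.bin", payload=b"y" * 1024):
    # Append mode ("a") reuses the existing bytes; only the new entry and
    # the central directory hit the stream, which is the behaviour the
    # post-PR ZipArchive approximates for this case.
    with zipfile.ZipFile(buf, "a") as zf:
        zf.writestr(name, payload)
    return buf
```

The pre-PR behaviour corresponds instead to reading every entry, truncating the stream, and rewriting everything, which is where the 2 GB allocation figure above comes from.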

- WriteFile now takes a more granular approach to writing the ZipArchive to its stream.
- Adding a new file to the end of the archive will no longer require every file in the archive to be loaded into memory.
- Changing fixed-length metadata inside a file will no longer require the entire archive to be re-written.
@ericstj (Member) left a comment

I like the premise of making ZipArchive writes more incremental - I'd like someone from @dotnet/area-system-io-compression to have a look and see if they can give more guidance on the direction they'd like to see in this PR. Added a couple of comments around what stuck out.

- Renamed Dirty and DirtyState to Changed and ChangeState.
- Explicitly assigned Unchanged as a zero-value ChangeState.
- Reset _originallyInArchive and _offsetOfLocalHeader to private members, exposed as internal properties.
- Also changed _versionToExtract to be private - this isn't ever used in System.IO.Compression outside of ZipArchiveEntry.
@ericstj (Member) commented Jul 22, 2024

@carlossanlop can you please review?

@carlossanlop carlossanlop added this to the 10.0.0 milestone Sep 30, 2024
@udlose commented Oct 8, 2024

@carlossanlop @ericstj

  1. Is there any update on this?
  2. Possibly coming in .NET 9?

@ericstj (Member) commented Oct 9, 2024

It's too late for .NET 9, but we'll have a look at getting it into main for .NET 10.0. There's potential fallout from this change, and we need to make sure it gets a closer look and plenty of time to bake before going out stable.

@edwardneal (Contributor, Author) commented

Thanks for the update @ericstj.

To expand on the note in the original description:

I've made an improvement, but I think a complete fix will be more complex (and might look quite like a compacting heap.)

This PR improves the average case and resolves the case of appending new entries to the ZIP archive, but it's limited: if the user changes the contents of the first entry in the archive, WriteFile will write the entire archive again - even if the content changes have resulted in the entry becoming smaller. The same thing will happen if the user renames the file, even if the encoded name is shorter.

A mechanism similar to a compacting heap would fix this, but I felt that needed more design work. The primary tradeoff which I was thinking about was between unused space in the archives, and the time/IO required to start moving entries around (if that's even possible - ZipArchive deals with unseekable streams.) That tradeoff naturally changes as we move towards the end of an archive, since the IO to append a large entry to the end of an archive is less expensive than the IO which reshuffles enough entries to make free space at the start.
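To make the tradeoff concrete, here's a small, hypothetical cost model (my illustration, not from the PR): under the rewrite-everything-after-the-first-dirty-entry strategy, the IO cost of a size-changing edit is everything from that entry's offset to the end of the archive.

```python
def bytes_rewritten(entries, dirty_index):
    """entries: list of (offset, size) tuples, sorted by offset.
    Returns how many bytes must be rewritten when the entry at
    dirty_index has a dynamically-sized change."""
    archive_end = max(offset + size for offset, size in entries)
    dirty_offset = entries[dirty_index][0]
    return archive_end - dirty_offset
```

Editing the first of three equal-sized entries rewrites the whole archive, while editing the last rewrites only itself; that asymmetry is exactly why the cost of a compacting scheme would vary with position in the archive.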

If there's already a data structure which handles this case in .NET and which accounts for the cost of IO, then this PR can use that directly and reduce the risk.

@steveharter (Member) commented

From the description's first step:

Find the offset of the first entry with any changes at all (startingOffset), and the first entry with changes to its dynamic-length metadata or entry data (dynamicDirtyStartingOffset).

I assume then that all of the entries from the first changed entry to the last entry will be re-written, leaving no fragmentation, and so the archive will be physically the same on disk as before the PR? If so, then we shouldn't need to add a new enum value to ZipArchiveMode; we'd just need to update the documentation for ZipArchiveMode.Append where it says "The content of the entire archive is held in memory."

@@ -415,7 +427,7 @@ internal void AcquireArchiveStream(ZipArchiveEntry entry)
 {
     if (!_archiveStreamOwner.EverOpenedForWrite)
     {
-        _archiveStreamOwner.WriteAndFinishLocalEntry();
+        _archiveStreamOwner.WriteAndFinishLocalEntry(true);
Review comment (Member):

nit:

Suggested change:
- _archiveStreamOwner.WriteAndFinishLocalEntry(true);
+ _archiveStreamOwner.WriteAndFinishLocalEntry(forceWrite: true);

@@ -31,6 +31,8 @@ public class ZipArchive : IDisposable
private byte[] _archiveComment;
private Encoding? _entryNameAndCommentEncoding;

Review comment (Member):

nit: unnecessary extra line here

@@ -295,6 +302,12 @@ public long Length
/// </summary>
public string Name => ParseFileName(FullName, _versionMadeByPlatform);

internal ZipArchive.ChangeState Changed { get; private set; }
Review comment (Member):

nit: "Changed" -> "Changes"?

"Changed" sounds like a bool, while "Changes" sounds like the flags it is.

ZipGenericExtraField.WriteAllBlocks(_cdUnknownExtraFields, _archive.ArchiveStream);
if (_originallyInArchive && Changed == ZipArchive.ChangeState.Unchanged && !forceWrite)
{
long centralDirectoryHeaderLength = 46
Review comment (Member):

nit: add a const for 46 perhaps with the +'s for readability.
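For context on this suggestion: 46 is the fixed portion of the ZIP central directory file header (PKWARE APPNOTE.TXT, section 4.3.12). Writing the constant as a sum of field widths, as the nit proposes, makes it self-documenting; a Python rendering of that sum (the constant name is my own, not from the PR):

```python
# Fixed-size fields of a ZIP central directory file header, in order.
CENTRAL_DIRECTORY_FIXED_HEADER_SIZE = (
    4    # signature (0x02014b50)
    + 2  # version made by
    + 2  # version needed to extract
    + 2  # general purpose bit flag
    + 2  # compression method
    + 2  # last mod file time
    + 2  # last mod file date
    + 4  # crc-32
    + 4  # compressed size
    + 4  # uncompressed size
    + 2  # file name length
    + 2  # extra field length
    + 2  # file comment length
    + 2  # disk number start
    + 2  # internal file attributes
    + 4  # external file attributes
    + 4  # relative offset of local header
)
assert CENTRAL_DIRECTORY_FIXED_HEADER_SIZE == 46
```

The variable-length file name, extra field and comment follow these 46 bytes, which is why the skip-ahead in the code above has to add their lengths separately.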

@@ -633,41 +653,108 @@ private void WriteFile()
// if we are in update mode, we call EnsureCentralDirectoryRead, which sets readEntries to true
Debug.Assert(_readEntries);

// Entries starting after this offset have had a dynamically-sized change. Everything on or after this point must be rewritten.
long dynamicDirtyStartingOffset = 0;
Review comment (Member):

"dynamicDirtyStartingOffset" -> "rewriteStartingOffset"?

}
else
{
entriesToWrite.Add(entry);
Review comment (Member):

nit: for understandability of this long method, swap/move entriesToWrite.Add(entry) to the beginning of the block above (e.g. if (!entry.OriginallyInArchive))

Labels: area-System.IO.Compression, community-contribution (indicates that the PR has been added by a community member)
5 participants