Splitting a Backup into Multiple Locations

Created by Steven Baltakatei Sandoval on 2020-02-14T19:43Z under a CC BY-SA 4.0 license and last updated on 2020-02-15T00:59Z.

Introduction

This blog entry documents two methods I found to create a backup of a large complex file system when the output must be split across multiple locations.

Commands used are the latest versions available in Debian GNU/Linux 10 ("Buster").

The tar and split Method

This method pipes the archive stream produced by tar directly into the split command, which writes the stream out as a series of chunk files. These chunks can later be recombined with a simple cat command and extracted back into the original file(s) and directories with a tar command. The chunk size can be specified. I used this Stack Overflow post as reference material.

Step 1. Create the split archives.

$ tar cz my_large_file_1 my_large_file_2 | split -b 100MB - myfiles_split_tgz_

How the command works:

This command takes the files my_large_file_1 and my_large_file_2, compresses them (thanks to the z option), and feeds the resulting archive data via stdout (thanks to the pipe |) to the split command.

The split command then splits up the archive data into chunks with a size specified with the -b option. In this example, this size is set to 100MB, or "100 megabytes" (GNU split reads the MB suffix as units of 1000 × 1000 bytes). The resulting set of files has a name pattern like the following:

myfiles_split_tgz_aa
myfiles_split_tgz_ab
myfiles_split_tgz_ac
etc.

Step 2. Save the chunks to the different backup locations.
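
Because every chunk is needed for the restore in Step 3, it may be worth recording a checksum for each chunk before copying it elsewhere. The following is only a minimal sketch, assuming GNU sha256sum (coreutils 8.25 or later for --ignore-missing). The chunk contents, the device1 and device2 directories, and the manifest name chunks.sha256 are made up for illustration.

```shell
set -e
workdir=$(mktemp -d)   # scratch area standing in for real storage
mkdir "$workdir/local" "$workdir/device1" "$workdir/device2"

# Stand-ins for two chunks produced in Step 1.
printf 'first chunk\n'  > "$workdir/local/myfiles_split_tgz_aa"
printf 'second chunk\n' > "$workdir/local/myfiles_split_tgz_ab"

# Record one checksum line per chunk in a manifest (hypothetical name).
( cd "$workdir/local" && sha256sum myfiles_split_tgz_* > chunks.sha256 )
manifest_lines=$(wc -l < "$workdir/local/chunks.sha256")

# Save the chunks to different locations, with a copy of the manifest on each.
cp "$workdir/local/myfiles_split_tgz_aa" "$workdir/local/chunks.sha256" "$workdir/device1/"
cp "$workdir/local/myfiles_split_tgz_ab" "$workdir/local/chunks.sha256" "$workdir/device2/"

# Verify each location; --ignore-missing skips chunks that are stored elsewhere.
( cd "$workdir/device1" && sha256sum --check --ignore-missing chunks.sha256 )
( cd "$workdir/device2" && sha256sum --check --ignore-missing chunks.sha256 )
verified=yes
```

If a chunk were damaged in transit, sha256sum --check would report FAILED for it and exit with a non-zero status.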

Step 3. Reconstitute the original file

$ cat myfiles_split_tgz_* | tar xz

How the command works:

This command concatenates all files within the working directory whose file names begin with the string myfiles_split_tgz_ and pipes (|) the resulting data to tar xz which decompresses the data contained across all the chunks back into the original my_large_file_1 and my_large_file_2 files.
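
The round trip in Steps 1 and 3 can be tried out on small throwaway files before trusting it with real data. The following is a minimal sketch, assuming GNU tar, split, and cmp. The temporary directory, the file sizes, and the 1MB chunk size are placeholders chosen for illustration only.

```shell
set -e
workdir=$(mktemp -d)   # scratch area (illustrative)
mkdir "$workdir/src" "$workdir/chunks" "$workdir/restored"

# Two sample files of incompressible data (sizes are arbitrary).
head -c 1500000 /dev/urandom > "$workdir/src/my_large_file_1"
head -c 800000  /dev/urandom > "$workdir/src/my_large_file_2"

# Step 1: archive, compress, and split into chunks of 1MB (1000 * 1000 bytes).
( cd "$workdir/src" && tar cz my_large_file_1 my_large_file_2 ) \
  | split -b 1MB - "$workdir/chunks/myfiles_split_tgz_"

# Step 3: concatenate the chunks in name order and extract them elsewhere.
cat "$workdir/chunks"/myfiles_split_tgz_* | ( cd "$workdir/restored" && tar xz )

# cmp exits non-zero (and set -e aborts) if a restored file differs.
cmp "$workdir/src/my_large_file_1" "$workdir/restored/my_large_file_1"
cmp "$workdir/src/my_large_file_2" "$workdir/restored/my_large_file_2"

chunk_count=$(ls "$workdir/chunks" | wc -l)
echo "round trip OK, chunks written: $chunk_count"
```

The shell expands myfiles_split_tgz_* in sorted order, which matches the order in which split named the chunks (aa, ab, ac, ...), so cat reassembles the stream correctly.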

The rsync with Multiple File Lists Method

This method involves using find to construct a file list of files to be backed up. Then, this file list is sorted with sort and saved to a master file list. Then, file sublists are formed by getting the sizes of different groups of files; sizes of groups are found using du and awk while the groups are formed by using cat, head, and tail. Finally, each sublist is passed to rsync to copy its files to one of the backup locations. I used this Stack Overflow post as reference material.

Step 1. Create a master file list of files to be transferred.

$ find ~/Downloads -readable -type f | sort > /tmp/filelist_master_sorted.txt

How this command works:

  1. First, the find command is told to look at all files within the ~/Downloads directory. By default, this includes all files within subdirectories.
  2. Next, find is told by the -readable option to limit search to items that the user has permission to read.
  3. Next, find is told by the -type f option to limit search to files only (no directories will be returned).
  4. Next, the pipe (|) sends the resulting file list from the stdout stream of find to the next command.
  5. Next, the file list is sorted by sort.
  6. The > character tells sort to write the resulting sorted file list to the file /tmp/filelist_master_sorted.txt.
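
To see what the master list looks like, the same pipeline can be run against a small throwaway tree. This is only a sketch: the directory layout and the location of the list inside a temporary directory are invented for the example, and -readable is a GNU find extension.

```shell
set -e
workdir=$(mktemp -d)   # scratch area standing in for the home directory
mkdir -p "$workdir/Downloads/sub/empty_dir"
touch "$workdir/Downloads/b.txt" "$workdir/Downloads/a.txt" "$workdir/Downloads/sub/c.txt"

# Same pipeline as Step 1, pointed at the sample tree.
find "$workdir/Downloads" -readable -type f | sort > "$workdir/filelist_master_sorted.txt"

# Each line is the full path of one regular file; directories are not listed.
cat "$workdir/filelist_master_sorted.txt"
line_count=$(wc -l < "$workdir/filelist_master_sorted.txt")
echo "files listed: $line_count"
```

The list contains the three files but not empty_dir, which is why empty directories are not carried over by this method.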

Step 2. Identify where the file list should be split according to cumulative file size.

$ cat /tmp/filelist_master_sorted.txt | head -n 77 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'

How this command works:

  1. First, the cat command writes the contents of the file /tmp/filelist_master_sorted.txt to a stdout stream, which the | character passes to the next command.
  2. Next, the head command is told by the -n 77 option to only output the first 77 lines of the file list to the next command, while.
  3. Now, the next few commands comprise a while loop. Basically, each line of the file list is fed to du -b, which outputs a modified file list in which each line starts with the file size in bytes as the first field, followed by the file path.
  4. The modified file list is then piped from this while loop to awk which adds up the file sizes. It does so by taking the first field ($1) of every line it receives and adding it to an internal variable, i. Then, i is printed for you to see.
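
The pipeline can be checked against files whose sizes are known in advance. The sketch below assumes GNU du (for the -b option). The three sample files and their sizes are invented, and read is called as IFS= read -r, a small hardening not present in the command above, so that backslashes and leading spaces in file names are kept intact.

```shell
set -e
workdir=$(mktemp -d)   # scratch area (illustrative)
head -c 100 /dev/zero > "$workdir/a.bin"
head -c 200 /dev/zero > "$workdir/b.bin"
head -c 300 /dev/zero > "$workdir/c.bin"
find "$workdir" -type f -name '*.bin' | sort > "$workdir/filelist.txt"

# du -b prints "<bytes><TAB><path>"; awk sums the first field.
first_two=$(head -n 2 "$workdir/filelist.txt" \
  | while IFS= read -r file; do du -b "$file"; done \
  | awk '{i+=$1} END {print i}')
all_files=$(while IFS= read -r file; do du -b "$file"; done < "$workdir/filelist.txt" \
  | awk '{i+=$1} END {print i}')

echo "$first_two"   # 300 (100 + 200)
echo "$all_files"   # 600 (100 + 200 + 300)
```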

How to use this command:

This command answers the question: "What is the cumulative number of bytes of storage used by the first 77 files listed in /tmp/filelist_master_sorted.txt?".

This command can be used to derive sublists of files whose cumulative size can be arbitrarily set to fit within the available free space of multiple devices.

For example, let's say you have about 6.2 gigabytes of data spread across many small files in a complex directory structure that you want to back up. Also, let's say you have a device with 4 gigabytes of free space and another device with 6 gigabytes of free space. You want to completely fill the first device. You've created filelist_master_sorted.txt in Step 1. What you'll do in Step 3 is transfer files to the first device using file sublists made from this master list. Your goal in this step is to construct a file sublist whose files occupy close to but less than 4 gigabytes of space by adjusting the 77 in the head -n 77 part of the command. The number you end up using instead of 77 depends on your file system but as long as you don't have a single file over 4 gigabytes in size, there should be a number greater than 0 that will work. Your work will end up looking something like this:

$ cat /tmp/filelist_master_sorted.txt | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
6273398923
$ cat /tmp/filelist_master_sorted.txt | head -n 1000 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3613072211
$ cat /tmp/filelist_master_sorted.txt | head -n 1500 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3649060125
$ cat /tmp/filelist_master_sorted.txt | head -n 2500 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3724253511
$ cat /tmp/filelist_master_sorted.txt | head -n 4000 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3850081384
$ cat /tmp/filelist_master_sorted.txt | head -n 6000 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3991857806
$ cat /tmp/filelist_master_sorted.txt | head -n 7000 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
4054134433
$ cat /tmp/filelist_master_sorted.txt | head -n 6500 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
4022820287
$ cat /tmp/filelist_master_sorted.txt | head -n 6250 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
4003911559
$ cat /tmp/filelist_master_sorted.txt | head -n 6100 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3996211531

In this example all files in filelist_master_sorted.txt occupy a total of 6,273,398,923 bytes while the top 6,100 files in filelist_master_sorted.txt occupy 3,996,211,531 bytes (just below 4 gigabytes). You'd use a command like this to write the first sublist:

$ cat /tmp/filelist_master_sorted.txt | head -n 6100 > /tmp/filelist_sub1.txt

Now you can use the tail command to write the remainder sublist (note the + sign and that 6100 is incremented by one):

$ cat /tmp/filelist_master_sorted.txt | tail -n +6101 > /tmp/filelist_sub2.txt

Thus, the contents of filelist_master_sorted.txt have been split between filelist_sub1.txt and filelist_sub2.txt.

Now, you have two file lists that can be used as inputs into rsync for copying all files to two different locations.
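
The trial-and-error search for the split point could also be done in a single pass over the master list. The following is a hypothetical sketch, not the procedure used above: it assumes GNU du, uses five invented 1000-byte sample files, and takes the free space of the first device from a made-up variable named limit_bytes.

```shell
set -e
workdir=$(mktemp -d)   # scratch area (illustrative)
for n in 1 2 3 4 5; do head -c 1000 /dev/zero > "$workdir/file_$n.bin"; done
find "$workdir" -type f -name 'file_*.bin' | sort > "$workdir/filelist_master_sorted.txt"

limit_bytes=3500   # free space on the first device (illustrative value)

# Keep a running total of the sizes; n ends up as the last line number
# whose cumulative size still fits within the limit (0 if none fits).
split_at=$(while IFS= read -r file; do du -b "$file"; done \
             < "$workdir/filelist_master_sorted.txt" \
           | awk -v limit="$limit_bytes" \
               '{ total += $1 } total <= limit { n = NR } END { print n + 0 }')

# Write the two sublists exactly as in the head and tail commands above.
head -n "$split_at" "$workdir/filelist_master_sorted.txt" > "$workdir/filelist_sub1.txt"
tail -n "+$((split_at + 1))" "$workdir/filelist_master_sorted.txt" > "$workdir/filelist_sub2.txt"
echo "split after line $split_at"   # split after line 3
```

Because file sizes are never negative, the running total only grows, so the last line at which it is still within the limit marks the longest prefix of the list that fits on the first device.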

Step 3. Use rsync to copy files using file sublists

$ rsync --files-from="/tmp/filelist_sub1.txt" -avu --progress --dry-run / /media/username/device1/backup_part1
$ rsync --files-from="/tmp/filelist_sub2.txt" -avu --progress --dry-run / /media/username/device2/backup_part2

How this command works:

  1. In the first command, rsync is told to look at file sublist filelist_sub1.txt when considering which files to copy. This is done in the --files-from="/tmp/filelist_sub1.txt" option.

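  2. The options -avu specify a standard set of rsync options I find myself often using: -a (archive mode, which preserves permissions, timestamps, and symbolic links), -v (verbose output), and -u (skip files that are newer at the destination).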

  3. The --progress option causes rsync to provide useful progress information as it works.

  4. The --dry-run option prevents rsync from making any changes to disk. I purposefully included this option as a measure of caution which I see as important when testing new rsync capabilities.

  5. The / argument is present because it tells rsync where to start when considering the paths in each line of filelist_sub1.txt. In Step 1, the master file list was constructed using find, which prints each path prefixed with the starting point it was given; since the shell expands ~/Downloads to an absolute path, every line in the list is a full path starting from the root directory (/). If no file list was provided using the --files-from= option, then the command would instead look something like this: $ rsync -avu --progress --dry-run /tmp/source/ /tmp/destination/ (copying contents of /tmp/source directory to the /tmp/destination directory).

  6. The /media/username/device1/backup_part1 is the path to the destination directory on the 4 gigabyte device.
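
The effect of combining --files-from with / as the source can be observed on a small scale. This is a minimal sketch on invented throwaway paths; it only runs the copy if rsync is installed. Note that the copied file ends up under the destination with its full original path, which is worth keeping in mind when choosing the source path for a later restore.

```shell
set -e
workdir=$(mktemp -d)   # scratch area; mktemp returns an absolute path
mkdir -p "$workdir/source/docs" "$workdir/device1/backup_part1"
echo "hello"   > "$workdir/source/docs/note.txt"
echo "skip me" > "$workdir/source/docs/other.txt"

# A one-line sublist holding the absolute path of the file to copy.
printf '%s\n' "$workdir/source/docs/note.txt" > "$workdir/filelist_sub1.txt"

if command -v rsync >/dev/null 2>&1; then
  rsync --files-from="$workdir/filelist_sub1.txt" -av / "$workdir/device1/backup_part1"
  # Only the listed file is copied, recreated below the destination
  # together with the directories leading to it.
  ls -l "$workdir/device1/backup_part1$workdir/source/docs/"
else
  echo "rsync not installed; skipping the copy"
fi
```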

How to use this command:

If the commands including --dry-run do not return errors (ex: "No such file or directory") and files appear to be set to copy to the correct directories then it probably will work.

Note: It is important for later backup recovery that the destination directories be empty. If not, then the restoration from backup operation in Step 4 will include extra files that may cause problems. However, if this backup rsync command is run often then the destination directory having mostly the same files as the last backup may be useful. Consider using the --delete-before option (WARNING: it will delete ALL files in the destination directory that don't match the source directory).

The file copy operation can be initiated by removing the --dry-run option and running:

$ rsync --files-from="/tmp/filelist_sub1.txt" -avu --progress / /media/username/device1/backup_part1
$ rsync --files-from="/tmp/filelist_sub2.txt" -avu --progress / /media/username/device2/backup_part2

You can check the disk usage of the /media/username/device1/backup_part1 directory with du -cb:

$ du -cb /media/username/device1/backup_part1 | tail -n1
3997055307  total

In this example the resulting disk usage of files copied to device1 is 3,997,055,307 bytes. This is 843,776 bytes larger than the 3,996,211,531 figure found in Step 2. I'm not exactly sure why, but a likely explanation is that du run on a directory also counts the directory entries themselves, which the file list built with find -type f leaves out; 843,776 is exactly 206 × 4,096, which would be consistent with 206 directories of 4,096 bytes each. Either way, it's close enough. The point is that the first 4 gigabytes of files now nearly completely fill the 4 gigabyte storage device and the remainder is copied to the second device. Due to how rsync works, the sections of the original folder structure necessary to store each file are copied over. This means that the original folder structure (minus empty folders in the original) can be recreated by another use of rsync backwards from the devices to the original source directory.
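
One way to check whether directory entries account for a difference like this is to compare the bytes held in regular files with what du reports for the whole tree. The following is a sketch on an invented sample tree, assuming GNU find (for -printf) and GNU du. The size of a directory entry depends on the file system, so the difference it prints will vary between systems.

```shell
set -e
workdir=$(mktemp -d)   # scratch area (illustrative)
mkdir -p "$workdir/backup/a/b"
head -c 1000 /dev/zero > "$workdir/backup/a/one.bin"
head -c 2000 /dev/zero > "$workdir/backup/a/b/two.bin"

# Bytes in regular files only (what a list built with find -type f measures).
file_bytes=$(find "$workdir/backup" -type f -printf '%s\n' | awk '{i+=$1} END {print i+0}')
# Apparent size of the whole tree, directory entries included.
tree_bytes=$(du -sb "$workdir/backup" | awk '{print $1}')
dir_count=$(find "$workdir/backup" -type d | wc -l)

echo "files: $file_bytes  tree: $tree_bytes  directories: $dir_count"
echo "difference: $((tree_bytes - file_bytes))"
```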

Step 4. Use rsync to restore copied files

If the files need to be restored back to their original locations then rsync can be used again to combine files from their backup locations. The file sublists will not be needed since they were used to help rsync identify where to copy from, not where to copy to. Also, there are several variations of commands that might be desirable, depending on how you want to apply the backup:

If you want to blow away all changes to /home/username/original/directory/ and make everything look the same as the backup, give both backup parts as sources to a single rsync command. If each part were restored by its own command with --delete-before, the second command would delete the files that the first one had just restored, since they are absent from the second part:

$ rsync -av --progress --delete-before /media/username/device1/backup_part1/ /media/username/device2/backup_part2/ /home/username/original/directory/

If you want to only restore files that are missing and skip files that were modified more recently than those in the backup, use:

$ rsync -avu --progress --modify-window=1 /media/username/device1/backup_part1/ /home/username/original/directory/
$ rsync -avu --progress --modify-window=1 /media/username/device2/backup_part2/ /home/username/original/directory/

There are various other options available to make rsync do what you want but I have found the above commands to work well for me (as someone who exclusively uses Debian GNU/Linux systems for personal file archiving).
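
When restoring from two backup parts with --delete-before, it matters whether the parts are given to one rsync call or two, because deletion is decided against the file list of the call that is running. The sketch below, on invented throwaway directories, passes both parts as sources to a single call so that their contents are merged before anything is deleted. It only runs if rsync is installed, and the file names are placeholders.

```shell
set -e
workdir=$(mktemp -d)   # scratch area (illustrative)
mkdir -p "$workdir/device1/backup_part1/docs" \
         "$workdir/device2/backup_part2/docs" \
         "$workdir/restored"
echo "one"   > "$workdir/device1/backup_part1/docs/one.txt"
echo "two"   > "$workdir/device2/backup_part2/docs/two.txt"
echo "stale" > "$workdir/restored/stale.txt"   # not present in either part

if command -v rsync >/dev/null 2>&1; then
  # Both parts are sources of one call; files absent from both are deleted.
  rsync -a --delete-before \
    "$workdir/device1/backup_part1/" \
    "$workdir/device2/backup_part2/" \
    "$workdir/restored/"
  find "$workdir/restored" -type f | sort
else
  echo "rsync not installed; skipping the restore"
fi
```

After the call, the restored directory should hold docs/one.txt and docs/two.txt, while stale.txt is gone.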

Comparison

The tar and split Method

Advantages:

  1. This method is better than some other methods that also use tar and split since it does not require the unsplit tar archive file to be written to disk first. Instead, as $ tar cz compresses and formats the archive stdout data stream, it sends the stream continuously to the split command which creates the 100 megabyte chunks as they come in. This has the advantage of saving disk space, especially if my_large_file_1 and my_large_file_2 are large.
  2. This method compresses files which may be useful if the input files aren't already compressed.
  3. The point at which the file data to be backed up are split is automatically chosen, unlike the rsync method which, as outlined above, requires a human to manually adjust the split point using file lists.

Disadvantages:

  1. It isn't obvious how the archived files may be recovered if one of the chunks from split is lost. If the compression option (-z) isn't used then it may be easier to recover data since I know tar basically concatenates files to one another, saving permission data somewhere along the way. I won't go into investigating this option.
  2. This method still uses a significant amount of disk space to hold the output files since all output files must be located within the same directory. If the backup must be stored across multiple devices (due to, for example, each device being unable to hold the entire backup), then this command is not as useful as the rsync with Multiple File Lists Method.
  3. The process may be vulnerable to time loss if the sending of the large chunks is interrupted due to an unreliable network connection.

The rsync with Multiple File Lists Method

Advantages:

  1. Incremental backups are possible since files are copied one-by-one. It's possible to keep recent modifications of files if only a partial restoration from a backup is required.
  2. If the backup must be split it is possible for the user to specify exactly how to perform the split on a file-by-file basis. The tar and split method doesn't have the capability (as far as I know) to specify exactly where the archive data stream is split so as to allow each segment of all archive data to be written to different locations.
  3. If files are being transferred via an unreliable internet connection that fails mid-transfer, rsync can still resume operation where it left off since it transfers files individually. If there is some doubt as to the integrity of files whose transfer was interrupted (in some weird case where a transferred file was interrupted but the resulting partial file is indistinguishable in file size and other attributes from the original to rsync), rsync has an option (-c, --checksum) which forces a full read and checksum comparison of files between the SOURCE and DESTINATION locations.

Disadvantages:

  1. This method is more complex in the file sublist generation step. This method allows very fine adjustment for which file is transferred to which destination thanks to the use of file lists. This may be required for certain situations where limited storage space prevents a user from staging large intermediate files before final transfer to their destination directories. A program like a bash script could, in theory, be used to automate the sublist generation step after only requesting from a user the size of each sublist chunk, but I did not want to write or explore this method since I do not make split backups all that often.

Conclusion

Overall, the more versatile option for creating a backup split across two different devices is the use of rsync with file lists. However, if plenty of space is available to store the intermediate archive chunks and incremental backups are not needed, then the tar and split method is simpler to perform.

References

Commands used

awk

GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2018 Free Software Foundation.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.

cat

cat (GNU coreutils) 8.30, GPLv3+
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Torbjorn Granlund and Richard M. Stallman.

du

du (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Torbjorn Granlund, David MacKenzie, Paul Eggert,
and Jim Meyering.

find

find (GNU findutils) 4.6.0.225-235f
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Eric B. Decker, James Youngman, and Kevin Dalley.
Features enabled: D_TYPE O_NOFOLLOW(enabled) LEAF_OPTIMISATION FTS(FTS_CWDFD) CBO(level=2)

head

head (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by David MacKenzie and Jim Meyering.

rsync

rsync  version 3.1.3  protocol version 31
Copyright (C) 1996-2018 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
append, ACLs, xattrs, iconv, symtimes, prealloc

rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
are welcome to redistribute it under certain conditions.  See the GNU
General Public Licence for details.

sort

sort (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

split

split (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Torbjorn Granlund and Richard M. Stallman.

tail

tail (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Paul Rubin, David MacKenzie, Ian Lance Taylor,
and Jim Meyering.

tar

tar (GNU tar) 1.30
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by John Gilmore and Jay Fenlason.

This work by Steven Baltakatei Sandoval is licensed under CC BY-SA 4.0