Splitting a Backup into Multiple Locations
Created by Steven Baltakatei Sandoval on 2020-02-14T19:43Z under a CC BY-SA 4.0 license and last updated on 2020-02-15T00:59Z.
Introduction
This blog entry documents methods I found to create a backup of a large, complex file system when the output must be split across multiple locations.
The commands used are the latest versions available in Debian GNU/Linux 10 ("Buster").
The tar and split Method
This method involves continuously feeding an archive stream created by tar into the split command, which writes the archive out as multiple chunk files. These chunks can later be recombined with a simple cat command and decompressed back into the original file(s) and directories with a tar command. The chunk size can be specified. I used this Stack Overflow post as reference material.
Step 1. Create the split archives.
$ tar cz my_large_file_1 my_large_file_2 | split -b 100MB - myfiles_split.tgz_
How the command works:
This command takes the files my_large_file_1 and my_large_file_2, compresses them (thanks to the z option), and feeds the resulting archive data via stdout (thanks to the pipe |) to the split command.
The split command then splits the archive data into chunks of the size specified with the -b option. In this example the size is set to 100MB, or "100 megabytes". The resulting set of files have a name pattern like the following:
myfiles_split.tgz_aa
myfiles_split.tgz_ab
myfiles_split.tgz_ac
etc.
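Optionally, before distributing the chunks, you can record a checksum for each one so its integrity can be verified after it is copied back from its backup location. This is not part of the original method, just a minimal sketch assuming sha256sum from GNU coreutils is available:
$ sha256sum myfiles_split.tgz_* > myfiles_split.sha256   # one checksum line per chunk
$ sha256sum -c myfiles_split.sha256   # run later, after recovery, to verify the chunks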
Step 2. Save the chunks to the different backup locations.
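The chunks can be divided between devices however you like, as long as all of them end up back in one directory before Step 3. A minimal sketch, assuming hypothetical mount points /media/username/device1 and /media/username/device2:
$ cp myfiles_split.tgz_a? /media/username/device1/       # chunks aa through az
$ cp myfiles_split.tgz_[b-z]? /media/username/device2/   # any remaining chunks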
Step 3. Reconstitute the original files
$ cat myfiles_split.tgz_* | tar xz
How the command works:
This command concatenates all files within the working directory whose file names begin with the string myfiles_split.tgz_ and pipes (|) the resulting data to tar xz, which decompresses the data contained across all the chunks back into the original my_large_file_1 and my_large_file_2 files. This works because the shell expands the * glob in sorted order, which matches the order in which split named the chunks.
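Before extracting, you can also confirm that the reassembled stream is intact by listing the archive's contents without writing anything to disk (the t option lists instead of extracts, and -f - explicitly reads the archive from stdin):
$ cat myfiles_split.tgz_* | tar tzf -   # errors here suggest a missing or corrupt chunk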
The rsync with Multiple File Lists Method
This method involves using find to construct a list of files to be backed up. This file list is then sorted with sort and saved as a master file list. File sublists are then formed by measuring the sizes of different groups of files; the sizes of the groups are found using du and awk, while the groups themselves are formed using cat, head, and tail. I used this Stack Overflow post as reference material.
Step 1. Create a master file list of files to be transferred.
$ find ~/Downloads -readable -type f | sort > /tmp/filelist_master_sorted.txt
How this command works:
- First, the find command is told to look at all files within the ~/Downloads directory. By default, this includes all files within subdirectories.
- Next, find is told by the -readable option to limit the search to items that the user has permission to read.
- Next, find is told by the -type f option to limit the search to files only (no directories will be returned).
- Next, the pipe (|) tells find to output the resulting file list as a stdout stream to the next command. Each line in the file list contains the full path and name of a file.
- Next, the file list is sorted by sort.
- The > character tells sort to write the resulting sorted file list to the file /tmp/filelist_master_sorted.txt.
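Since Step 2 works by taking the first N lines of this master list, it is handy to know how many lines (files) it contains in total. A quick check (this, like the file lists themselves, assumes no file names contain newline characters):
$ wc -l < /tmp/filelist_master_sorted.txt   # total number of files in the master list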
Step 2. Identify where the file list should be split according to cumulative file size.
$ cat /tmp/filelist_master_sorted.txt | head -n 77 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
How this command works:
- First, the cat command is told by the | character to send the contents of the file /tmp/filelist_master_sorted.txt to a stdout stream.
- Next, the head command is told by the -n 77 option to output only the first 77 lines of the file list to the next command, while.
- Now, the next few commands comprise a while loop. Each line of the file list is fed to du, which, thanks to the -b option, outputs a modified file list with each file's size in bytes as the first field.
- The modified file list is then piped from this while loop to awk, which adds up the file sizes. It does so by taking the first field ($1) of every line it receives and adding it to an internal variable, i. After the last line (END), i is printed for you to see.
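As an aside, the while loop invokes du once per file, which can be slow for long lists. The same sum can be computed with far fewer du invocations; a sketch assuming GNU xargs and file names without newlines:
$ head -n 77 /tmp/filelist_master_sorted.txt | xargs -d '\n' du -b | awk '{i+=$1} END {print i}'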
How to use this command:
This command answers the question: "What is the cumulative number of bytes of storage used by the first 77 files listed in /tmp/filelist_master_sorted.txt?".
This command can be used to derive sublists of files whose cumulative size can be arbitrarily set to fit within the available free space of multiple devices.
For example, let's say you have about 6.2 gigabytes of data spread across many small files in a complex directory structure that you want to back up. Also, let's say you have a device with 4 gigabytes of free space and another device with 6 gigabytes of free space. You want to completely fill the first device. You've created filelist_master_sorted.txt in Step 1. What you'll do in Step 3 is transfer files to the first device using file sublists made from this master list. Your goal in this step is to construct a file sublist whose files occupy close to, but less than, 4 gigabytes of space by adjusting the 77 in the head -n 77 part of the command. The number you end up using instead of 77 depends on your file system, but as long as you don't have a single file over 4 gigabytes in size, there should be a number greater than 0 that will work. Your work will end up looking something like this:
$ cat /tmp/filelist_master_sorted.txt | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
6273398923
$ cat /tmp/filelist_master_sorted.txt | head -n 1000 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3613072211
$ cat /tmp/filelist_master_sorted.txt | head -n 1500 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3649060125
$ cat /tmp/filelist_master_sorted.txt | head -n 2500 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3724253511
$ cat /tmp/filelist_master_sorted.txt | head -n 4000 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3850081384
$ cat /tmp/filelist_master_sorted.txt | head -n 6000 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3991857806
$ cat /tmp/filelist_master_sorted.txt | head -n 7000 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
4054134433
$ cat /tmp/filelist_master_sorted.txt | head -n 6500 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
4022820287
$ cat /tmp/filelist_master_sorted.txt | head -n 6250 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
4003911559
$ cat /tmp/filelist_master_sorted.txt | head -n 6100 | while read file; do du -b "$file"; done | awk '{i+=$1} END {print i}'
3996211531
In this example, all files in filelist_master_sorted.txt occupy a total of 6,273,398,923 bytes, while the top 6,100 files in filelist_master_sorted.txt occupy 3,996,211,531 bytes (just below 4 gigabytes). You'd use a command like this to write the first sublist:
$ cat /tmp/filelist_master_sorted.txt | head -n 6100 > /tmp/filelist_sub1.txt
Now you can use the tail command to write the remainder sublist (note the + sign and that 6100 has been incremented by one):
$ cat /tmp/filelist_master_sorted.txt | tail -n +6101 > /tmp/filelist_sub2.txt
Thus, the contents of filelist_master_sorted.txt have been split between filelist_sub1.txt and filelist_sub2.txt.
Now you have two file lists that can be used as inputs to rsync for copying all files to two different locations.
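As noted in the Comparison section below, this split-point search could in principle be scripted rather than adjusted by hand. A minimal sketch of my own (not from the referenced post) that prints the largest N such that the first N files stay under a byte limit, using the 4,000,000,000-byte budget from the example above:
$ while read file; do du -b "$file"; done < /tmp/filelist_master_sorted.txt | awk -v limit=4000000000 '{i+=$1; if (i > limit) {print NR-1; exit}} END {if (i <= limit) print NR}'
The printed number can then be used in place of the 77 in head -n 77. You can also confirm that no lines were lost when splitting the master list:
$ wc -l /tmp/filelist_sub1.txt /tmp/filelist_sub2.txt /tmp/filelist_master_sorted.txt   # the first two counts should sum to the third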
Step 3. Use rsync to copy files using file sublists
$ rsync --files-from="/tmp/filelist_sub1.txt" -avu --progress --dry-run / /media/username/device1/backup_part1
$ rsync --files-from="/tmp/filelist_sub2.txt" -avu --progress --dry-run / /media/username/device2/backup_part2
How these commands work:
- In the first command, rsync is told to look at the file sublist filelist_sub1.txt when considering which files to copy. This is done with the --files-from="/tmp/filelist_sub1.txt" option.
- The options -avu specify a standard set of rsync options I find myself often using.
- The --progress option causes rsync to provide useful progress information as it works.
- The --dry-run option prevents rsync from making any changes to disk. I purposefully included this option as a measure of caution, which I see as important when testing new rsync capabilities.
- The / argument is present because it tells rsync where to start when considering the paths in each line of filelist_sub1.txt. In Step 2, the master file list was constructed using find, which, by default, includes the entire file path starting from the root directory (/). If no file list were provided via the --files-from= option, then the command would instead look something like this: $ rsync -avu --progress --dry-run /tmp/source/ /tmp/destination/ (copying the contents of the /tmp/source directory to the /tmp/destination directory).
- The /media/username/device1/backup_part1 argument is the path to the destination directory on the 4 gigabyte device.
How to use this command:
If the commands including --dry-run do not return errors (e.g. "No such file or directory") and the files appear set to copy to the correct directories, then the real run will probably work.
Note: It is important for later backup recovery that the destination directories be empty. If they are not, then the restoration-from-backup operation in Step 4 will include extra files that may cause problems. However, if this backup rsync command is run often, then having destination directories that already contain mostly the same files as the last backup may be useful. Consider using the --delete-before option (WARNING: it will delete ALL files in the destination directory that don't match the source directory).
The file copy operation can be initiated by removing the --dry-run option and running:
$ rsync --files-from="/tmp/filelist_sub1.txt" -av --progress / /media/username/device1/backup_part1
$ rsync --files-from="/tmp/filelist_sub2.txt" -av --progress / /media/username/device2/backup_part2
You can check the disk usage of the /media/username/device1/backup_part1 directory with du -cb:
$ du -cb /media/username/device1/backup_part1 | tail -n1
3997055307 total
- The -c option causes du to output a grand total as the final line of its output.
- The -b option causes du to show usage in bytes.
- The output of du is piped to tail -n1 because I only want to see the last line of output from du, which contains the grand total.
In this example the resulting disk usage of files copied to device1 is 3,997,055,307 bytes. I'm not exactly sure why this is 843,776 bytes larger than the 3,996,211,531 figure found in Step 2, but it's close enough (note to self: the Step 2 sum counted only regular files because of find -type f, while du on the destination also counts the directories rsync created, which probably accounts for the difference). The point is that the first 4 gigabytes of files now nearly completely fill the 4 gigabyte storage device and the remainder is copied to the second device. Due to how rsync works, the sections of the original folder structure necessary to store each file are copied over. This means that the original folder structure (minus empty folders in the original) can be recreated by another use of rsync run backwards, from the devices to the original source directory.
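One more sanity check (a sketch, assuming the same paths as above) is to compare the number of files that actually landed on a device against the length of the sublist that drove the copy:
$ find /media/username/device1/backup_part1 -type f | wc -l   # should match the next count
$ wc -l < /tmp/filelist_sub1.txt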
Step 4. Use rsync to restore copied files
If the files need to be restored back to their original locations, then rsync can be used again to combine files from their backup locations. The file sublists will not be needed, since they were used to help rsync identify where to copy from, not where to copy to. Also, there are several variations of commands that might be desirable, depending on how you want to apply the backup:
If you want to blow away all changes to /home/username/original/directory/ and make everything look the same as the backup, use:
$ rsync -av --progress --delete-before /media/username/device1/backup_part1/ /home/username/original/directory/
$ rsync -av --progress --delete-before /media/username/device2/backup_part2/ /home/username/original/directory/
If you want to only restore files that are missing and skip files that were modified more recently than the ones in the backup, use:
$ rsync -avu --progress --modify-window=1 /media/username/device1/backup_part1/ /home/username/original/directory/
$ rsync -avu --progress --modify-window=1 /media/username/device2/backup_part2/ /home/username/original/directory/
- The -a option tells rsync to operate recursively and to preserve file metadata such as permissions, groups, and modification times.
- The -u option tells rsync to skip files that have more recent modification times in /home/username/original/directory.
- The --modify-window=1 option tells rsync to permit up to a 1 second difference when comparing the modification times of files in the source and destination directories.
There are various other options available to make rsync do what you want, but I have found the above commands to work well for me (as someone who exclusively uses Debian GNU/Linux systems for personal file archiving).
Comparison
The tar and split Method
Advantages:
- This method is better than some other methods that also use tar and split, since it does not require the unsplit tar archive file to be written to disk first. Instead, as tar cz compresses and formats the archive stdout data stream, it sends the stream continuously to the split command, which creates the 100 megabyte chunks as they come in. This has the advantage of saving disk space, especially if my_large_file_1 and my_large_file_2 are large.
- This method compresses files, which may be useful if the input files aren't already compressed.
- The point at which the file data to be backed up is split is chosen automatically, unlike the rsync method which, as outlined above, requires a human to manually adjust the split point using file lists.
Disadvantages:
- It isn't obvious how the archived files may be recovered if one of the chunks from split is lost. If the compression option (z) isn't used, then it may be easier to recover data, since I know tar basically concatenates files to one another, saving permission data somewhere along the way. I won't go into investigating this option.
- This method still uses a significant amount of disk space to hold the output files, since all of the chunks must initially be written to the same directory. If the backup must be stored across multiple devices (due to, for example, each device being unable to hold the entire backup), then this method is not as useful as the rsync with Multiple File Lists Method.
- The process may be vulnerable to time loss if the sending of the large chunks is interrupted due to an unreliable network connection.
The rsync with Multiple File Lists Method
Advantages:
- Incremental backups are possible since files are copied one-by-one. It's possible to keep recent modifications of files if only a partial restoration from a backup is required.
- If the backup must be split, it is possible for the user to specify exactly how to perform the split on a file-by-file basis. The tar and split method doesn't have the capability (as far as I know) to specify exactly where the archive data stream is split so as to allow each segment of the archive data to be written to a different location.
- If files are being transferred via an unreliable internet connection that fails mid-transfer, rsync can still resume operation where it left off, since it transfers files individually. If there is some doubt as to the integrity of files whose transfer was interrupted (in some weird case where a transfer was interrupted but the resulting partial file is indistinguishable in file size and other attributes from the original, as far as rsync can tell), rsync has an option, --checksum (-c), which forces a full read and checksum comparison of files between the SOURCE and DESTINATION locations; see the sketch after this list.
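A minimal sketch of such a verification pass against the first device (the -n, i.e. --dry-run, and --itemize-changes options make rsync report, rather than fix, any differences):
$ rsync -avcn --itemize-changes /media/username/device1/backup_part1/ /home/username/original/directory/   # files itemized here differ between backup and original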
Disadvantages:
- This method is more complex in the file sublist generation step. It allows very fine adjustment of which file is transferred to which destination, thanks to the use of file lists; this may be required in situations where limited storage space prevents a user from staging large intermediate files before final transfer to their destination directories. A program like a bash script could, in theory, be used to automate the sublist generation step after only requesting from the user the size of each sublist chunk (the sketch at the end of Step 2 is a starting point), but I did not want to write or explore this further since I do not make split backups all that often.
Conclusion
Overall, the more versatile option for creating a backup split across two different devices is the use of rsync with file lists. However, if plenty of space is available to store the intermediate archive chunks and incremental backups are not needed, then the tar and split method is simpler to perform.
References
Commands used
awk
GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2018 Free Software Foundation.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
cat
cat (GNU coreutils) 8.30, GPLv3+
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Torbjorn Granlund and Richard M. Stallman.
du
du (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Torbjorn Granlund, David MacKenzie, Paul Eggert,
and Jim Meyering.
find
find (GNU findutils) 4.6.0.225-235f
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Eric B. Decker, James Youngman, and Kevin Dalley.
Features enabled: D_TYPE O_NOFOLLOW(enabled) LEAF_OPTIMISATION FTS(FTS_CWDFD) CBO(level=2)
head
head (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by David MacKenzie and Jim Meyering.
rsync
rsync version 3.1.3 protocol version 31
Copyright (C) 1996-2018 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
append, ACLs, xattrs, iconv, symtimes, prealloc
rsync comes with ABSOLUTELY NO WARRANTY. This is free software, and you
are welcome to redistribute it under certain conditions. See the GNU
General Public Licence for details.
sort
sort (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and Paul Eggert.
split
split (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Torbjorn Granlund and Richard M. Stallman.
tail
tail (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Paul Rubin, David MacKenzie, Ian Lance Taylor,
and Jim Meyering.
tar
tar (GNU tar) 1.30
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by John Gilmore and Jay Fenlason.
This work by Steven Baltakatei Sandoval is licensed under CC BY-SA 4.0