Data Compression « Drew Scott Daniels' Blog

November 22, 2015

Minimal backups of a Debian or Ubuntu system

Filed under: Data Compression,technical,Uncategorized — admin @ 11:36 pm

Storage has become so cheap for me that I now use bacula and do full backups based on a broad file set definition. For offsite backups I focus on a smaller set of more important data. At work there are different kinds of requirements and restrictions too.

Back in 2002 I was thinking about minimal backups and I was dealing with quite a bit more software not following the Filesystem Hierarchy Standard A post about my thoughts is at https://lists.debian.org/debian-devel/2002/07/msg02232.html with Message-id <Pine.GSO.4.40.0207311107350.16701-100000@castor.cc.umanitoba.ca>. The same kind of methods can be used on Ubuntu too.

Example minimal backup scripts for Debian:
http://www.linux-backup.net/scripts/aaron.sh.txt
http://qref.sourceforge.net/Debian/reference/examples/backup

I wouldn’t ignore files that I installed that are in special directories and are not packages. The package called cruft or perl program Debarnacle (which can be installed using cpan) can be used to find files that are not in packages. The above scripts may not be smart enough to backup everything. “cruft” has some problems when being used to figure out what files to backup, this is because it is designed to find files that can be removed. Removing unnecessary files and packages is a good first step before a backup, but don’t delete anything you’re unsure about.

I forget all my difficulties in using “cruft” to assist in creating a backup, but I’ll try to recall. One problem was a feature that it ignored directories that it knew would have files that weren’t part of packages like /home. An inconvenience was that I had to sort through all the files that it said should be there, but weren’t (perhaps files to delete after a reinstall of packages, although some I should have had). Another problem was that I had to force it to skip some of the other file systems I had mounted like CD’s and my dos partitions. “cruft” seems to keep a large backup cache of it’s database, this was sometimes helpful, sometimes I needed to delete it (maybe I did to save disk space as I remember it being big).

Debian backup instructions, and backup information (obsolete and deprecated since 2006):
http://www.debian.org/doc/manuals/system-administrator/ch-sysadmin-backup.html
Replaced with:
http://www.debian.org/doc/manuals/debian-reference/ch10.en.html#_backup_and_recovery

“dpkg –get-selections>list_of_selections.txt” can backup a Debian user’s list of currently selected packages. “dpkg –set-selections<list_of_selections.txt” to start a system restore, iff packages are available. Some packages get removed from the Debian archive (http://ftp-master.debian.org/removals.txt has recent examples), if following unstable or testing, then you may want to use dpkg-repack or grab a copy of the packages you are worried about. http://archive.debian.org may also be helpful.

A simple rsync, ssh, cron solution can do regular incremental backups, but might ignore files that are easily available (like packages on cd or packages available by a quick download).

Discussion on backing up a Debian system:
http://wayback.archive.org/web/20020821185115/http://www.debianplanet.org/node.php?id=586

My Debian backup steps:
1. I remove files and packages that I’m sure I don’t want or need (deborphan can help me figure this out. I later learned to use aptitude and apt-cache to help find reverse dependancies)

2. I run “cruft” first on all directories, trying to avoid special devices or removing checks on directories where it seems to stall (not the best way, but the best way available right now). It would also be nice to use the md5 signatures for files to see if I manually changed a file in the file system, I’d of course want to back those up (watching for hacks of course).

3. I remove files that I’m sure I don’t want or need that are indicated to me by cruft. (possibly also removing them from cruft’s report file or files to save from doing step 7)

4. I fix the list of missing files indicated from cruft’s report(s). I also possibly install packages that have files installed, but the package isn’t listed as installed for some reason (not likely to be able to skip step 7 if I do this second part).

5. I look for packages that are not available for download or available in a reliable location (Usually all obsolete packages listed in dselect, sometimes more, some obsolete packages have -src packages that they can be
built from). I then backup these packages using dpkg-repack, or if available, I grab a copy of the proper package (dpkg-repack doesn’t create packages back to the way they were originally. http://archive.debian.org may be useful to find packages that are not in the archive anymore.).

6. I run “dpkg –get-selections>/root/myselections.txt” (this file is important to backup unless you want to go through the list of packages to install again, step 7 should catch this, or you can add it manually if you skip step 7).

7. I re-run “cruft” the same way I ran it for step 2.

8. Go through cruft’s report and remove any information that I don’t want backup (maybe keep a copy of the missing files separate).

9. Create a list of files to backup using cruft’s report as a good guide (remembering to include myselections.txt and any important packages). /etc and all the conf file information under /var/lib/dpkg/info may be good to have (not including unnecessary files would be nice).

The rest of the steps do not allow for incremental backups, and may be modified to allow incremental backups.

10. Append “tar -af backupfile.tar ” in a text file, at the beginning of every line that lists a file to backup except the first one which I do “tar -cf backupfile.tar” (tr may be helpful, but what’s the proper command? I’d prefer to avoid perl, but is it more common than tr? If so what’s the proper perl -e line?). I then make the text file executable and execute it.

11. I ran “bzip2 -9 backupfile.tar”. (p7zip might be one of the best choices for the time of this post)

12. I used xcdroast or cdrecord or something else to burn backupfile.tar to a recoradable cd. Brasero is more commonly used these days and burning to a DVD though USB and hard drives are getting more common. I currently use a porable USB hard drive and network attached storage.

I appreciate the help I’ve gotten so far in generating a program/script to automate these steps (and getting it made into an uploaded package). Help in figuring out a good way to make an incremental backup would be very useful. I think there must be a nice way to do it with tar and freshen, or using the archive bit.

Some packages get removed and the latest version isn’t publicly archived. I don’t like that unmaintained or any potentially installed packages are removed from the Debian archives. I brought this up with the qa group and was refered to http://snapshot.debian.org and “the morgue” which was on auric but now seems to be at merkel.debian.org:/org/ftp.debian.org/morgue/

I worry that not all packages will have md5 signatures of their files. I remember having a problem with many files not having md5 signatures, or even some packages having incorrect md5 signatures. I think *every* file that comes in a .deb should have an md5 signature that gets installed when the corresponding package gets installed. I wanted this to be a policy must as it’s good for backups, and detecting system corruptions (hacks and modifications that are intentional or malicious or accidental). md5 signatures should take little effort to create and maintain with an archive. This would help with tripwire or aide like setups (usually designed for file system based host intrusion detecion systems also called host IDS’s or HIDS).

Drew Daniels

Resume: http://www.boxheap.net/ddaniels/resume.html

Comments Off

May 25, 2011

Locality of Reference

Filed under: Data Compression,technical — admin @ 11:52 pm

In my studies of Computer Science I learned about the principle of locality which might be better called locality of reference. Two basic kinds of locality are temporal locality (locality in time) and spacial locality (locality in space). The theory basically says that similar things will group together. We see the same principle many places in daily life and other disciplines.

Some examples of locality include:

people speaking the same language tend to group together,
wheat fields being on land close together,
forests having many of the same species of tree,
minerals like gold being in high concentrations in certain areas,

In data compression, locality is important to reduce context windows to be small enough to fit in memory, to reduce context windows to reduce processing. A context is a kind of locality. A window is a term meaning the area being looked at (or evaluated). Many compression algorithms use a sliding window. A sliding window is a view of several blocks of data that shifts such that when one block is done being evaluated, the window moves one block.

Why am I talking about locality of reference in data compression? On April 14th, 2004 (2004/04/14), I wrote the following note to myself:

duplicate files
principle of locality
- files by time
- files by directory
- files by type

n(n-1)=0(n^2) Every file in front of every other file can be done in parallel.

This means that for file ordering in an archive, there are some short cuts to finding the optimal order that can take advantage of multi-processor systems. Now checks can also take better advantage of increased parallelism and faster random access provided by solid state disk drives (SSD’s).

In the above note, O(n^2) is Big O notation for order n squared. That means for every extra unit of input, the processing time roughly takes twice as long. This is a simple exponential curve.

For further references look up “locality of reference”, “sliding window” compression, “parallel processing”, and “Big O notation”. Also see my evolving nots on data compression including some future information on “Drew Daniels’ compression magic”.