Abstract
This repository provides an Ada library for writing Tar and PAX
archives (typical file extension .tar
).
The intended use case is creating archives from some other format and then streaming them to any target. As a result, this API does not currently provide any “file”-centric calls, i.e. there is no way to add a file from the file system to a Tar archive.
Instead, the API centers around creating entries – think of e.g. a single file – and formatting their metadata and data such that they become valid Tar blocks that can be streamed.
To put it another way: This library does not in itself have any side effects: The using applications are responsible for reading and writing the actual file data and metadata. This library only helps with formatting them according to Tar format requirements.
The library supports two output formats:
- USTAR
- PAX
Both formats are interpreted as specified by POSIX aka. the Open Group Base Specifications Issue 7, 2018 edition.
The entire library works under the assumption that a
Stream_Element
is a byte of 8 bits.
License
This library is licensed under GPL 3 or later. See
/usr/share/common-licenses/GPL-3
on any Debian system.
Ma_Sys.ma Tar Writer Library for Ada
(c) 2023 Ma_Sys.ma <info@masysma.net>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
Compiling
To compile this library, ant
and gnatmake
tools should be available alongside with a GNAT compiler. If these
dependencies are present, the library can be compiled by entering the
following command:
ant
If the necessary dependencies for building Debian packages are also installed, the following command can be used to create an installable Debian package:
ant package
Alternatively, see lib/build.xml
for a command that can
be used to compile the source files directly. This way it is possible to
compile even without having ant
installed. A minimal
compilation might then work as follows:
cd lib
gnatmake -fPIC -fstack-protector-strong -c tar-writer.adb
gcc -shared -o libtarada.so *.o
Repository Structure
The repsitory file structure is as follows:
bo-tar-ada/
│
├── lib/ *** This is the implementation. ***
│ ├── tar-writer.adb
│ ├── tar-writer.ads
│ ├── tar.ads
│ └── build.xml
│
├── test_suite/ Various test cases to check some basic
│ ├── tartest.adb library functions. Can run coverage tests
│ ├── references.ads by using `ant cov` on Linux systems.
│ └── build.xml
│
├── tool_taradaarc/ Minimal example of an “archiver”
│ ├── build.xml program that can be used to create TAR
│ ├── metadata.adb archives out of file system trees. It
│ ├── metadata.ads demonstrates the usage of the library in
│ ├── pstat.c a non-trivial use-case.
│ └── taradaarc.adb
│
├── tool_taradahello/ “Hello World” example using this library
│ ├── build.xml
│ └── taradahello.adb
│
├── README.md This file
├── debian-changelog.txt Changelog information for .deb build
└── build.xml Top-level build instructions
Example Program
The following example program taradahello.adb
can be
found in the directory tool_taradahello
in the
repository.
with Ada.Streams;
with Ada.Text_IO;
with Ada.Text_IO.Text_Streams;
with Tar;
with Tar.Writer;
procedure TarAdaHello is
: constant access Ada.Streams.Root_Stream_Type'Class :=
Stdout.Text_IO.Text_Streams.Stream(Ada.Text_IO.Standard_Output);
Ada
: constant String := ("Hello, world." & ASCII.LF);
Cnt
: Ada.Streams.Stream_Element_Array(0 ..
Cnt_Ar.Streams.Stream_Element_Offset(Cnt'Length - 1));
Adafor Cnt_Ar'Address use Cnt'Address;
: Tar.Writer.Tar_Entry := Tar.Writer.Init_Entry("hello.txt");
Ent
begin
.Set_Type(Tar.File);
Ent.Set_Access_Mode(8#644#);
Ent.Set_Size(Cnt'Length);
Ent.Set_Owner(1000, 1000);
Ent
.Write(Ent.Begin_Entry);
Stdout.Write(Ent.Add_Content(Cnt_Ar));
Stdout.Write(Ent.End_Entry);
Stdout.Write(Tar.Writer.End_Tar);
Stdout
end TarAdaHello;
The program creates a tar archive with a single file entry (no
directory) and name hello.txt
with the traditional content
of Hello, world.
followed by a newline. The tar data is
directly sent to the standard output in this example.
If you are interested in a more complex example, check the files
under tool_taradaarc
especially
tool_taradaarc/taradaarc.adb
which implements an archiver
that traverses file system trees and then writes the data to stdout as
.tar files.
Using the Library
Assuming the library is already installed on your system, you can
compile and run the example program from subdirectory
tool_taradahello
as follows:
gnatmake -o taradahello taradahello.adb \
-aO/usr/lib/x86_64-linux-gnu/ada/adalib/tar \
-aI/usr/share/ada/adainclude/tar -largs -ltarada
./taradahello | tar -tv
Output:
-rw-r--r-T 1000/1000 14 1970-01-01 01:00 hello.txt
Alternatively, you can provide the path to the sources during
compilation such that the library is statically linked. See the
build.xml
files in this repository for some examples about
how to approach this.
Tar Datatypes (tar.ads
)
subtype U64 is Interfaces.Unsigned_64;
type Dev_Node is mod 10 ** 7;
type Access_Mode is mod 8 ** 7;
type Tar_Entry_Type is
(File, Directory, FIFO, Symlink, Hardlink, Char, Block);
Package Tar
specifies data types which may be useful for
all kinds of Tar processing.
U64
is a standard Unsigned_64 integer. To permit using it without explicitly usingInterfaces
, the package also specifies most of the meaningful operators on it as a renamed version of theirInterfaces
counterpart.Dev_Node
is a modular type representing the supported data range for device node entries. It consists of all non-negative decimal numbers up to 7 digits.Access_Mode
is a modular type representing the supported file modes. It consists of all non-negative octal numbers up to 7 digits.Tar_Entry_Type
distinguishes the various kinds of entries that Tar data can represent.
Note: The restriction on the values of Dev_Node
and
Access_Mode
are not strictly mandated by the standard for
PAX outputs. They are defined this way here because this is (1) easier
to implement and (2) should be sufficient for most use cases. Please
tell me if there are real use cases where these ranges are insufficient,
because using PAX it is possible to lift them by extending the
implementation appropriately.
Tar Writer API
(tar-writer.ads
)
This is the core API for creating Tar archives.
function Init_Entry(Name: in String; Force_USTAR_Format: Boolean := False)
return Tar_Entry;
procedure Set_Type (Ent: in out Tar_Entry; Typ: in Tar_Entry_Type);
procedure Set_Access_Mode(Ent: in out Tar_Entry; Mode: in Access_Mode);
procedure Set_Size (Ent: in out Tar_Entry; SZ: in U64);
procedure Set_Modification_Time(Ent: in out Tar_Entry; M_Time: in U64);
procedure Set_Owner (Ent: in out Tar_Entry; UID, GID: in U64);
procedure Set_Owner (Ent: in out Tar_Entry; U_Name, G_Name: in String);
procedure Set_Link_Target(Ent: in out Tar_Entry; Target: in String);
procedure Set_Device (Ent: in out Tar_Entry; Major, Minor: in Dev_Node);
procedure Add_X_Attr (Ent: in out Tar_Entry; Key, Value: in String);
function Begin_Entry(Ent: in out Tar_Entry) return Stream_Element_Array;
function Add_Content(Ent: in out Tar_Entry; Cnt: in Stream_Element_Array)
return Stream_Element_Array;
function End_Entry(Ent: in out Tar_Entry) return Stream_Element_Array;
function End_Tar return Stream_Element_Array;
The lifecycle of an entire archive is as follows:
- Stream any number of entries
- Send a single archive “footer” through
End_Tar
which is really just 1K of zeroes.
Each entry inside the archive is created as follows:
- Call
Init_Entry
to obtain a context - Set all the metadata and add extended attributes as needed.
- Call
Begin_Entry
(once) to receive a header to stream. - Call
Add_Content
(any number of times) to stream the file contents. - Call
End_Entry
(once) to write the entry footer (think of zero-padding)
For all functions that return a Stream_Element_Array
, it
is intended to stream the returned data to the output in order to obtain
a valid Tar or PAX archive by the concatenation.
Initialization
function Init_Entry(Name; Force_USTAR_Format) return Tar_Entry
Prepares a Tar_Entry
from the entry name which is the
path inside the archive. This may be an absolute path like
e.g. /tmp/test.txt
or a relative path like
lib/build.xml
. A slash must be used to separate the path
components. The encoding must be valid UTF-8.
By default, the entry is created as a valid USTAR entry if the
metadata can be represented in that format. If metadata exceeds the
limits of USTAR, a PAX Extended Header is automatically created
as necessary. This behavior can be disabled by setting
Force_USTAR_Format := True
. In this case, instead of
creating a PAX Extended Header, exception
Not_Supported_In_Format
is raised.
Important cases where the USTAR limits are exceeded are e.g. any of the following:
- Path names longer than 255 characters.
- File sizes greater than 8 GiB.
Metadata
The following procedures can be used to configure the metadata of the
archive entry. They are only valid to be called after
Init_Entry
. The Begin_
routines must not have
been called on the same entry before.
Set_Type(Ent; Typ: in Tar_Entry_Type)
This procedure defines what kind of entry is to be produced. Most of the enumeration values directly correspond to the classic UNIX file types.
There is one peculiarity: The Hardlink
type can be used
to create links to existing entries from the same Tar as follows:
- The entry’s size must be set to 0.
- The
Set_Link_Target
procedure must be called with a path that exists in the same Tar archive.
It is recommended to call this procedure at least once for each entry.
Set_Access_Mode(Ent; Mode: in Access_Mode)
This procedure configures the entry’s access mode which is often
written in octal like e.g. 8#644#
for a typical file that
can be read and written by its owner and read by all other groups and
users.
Set_Size(Ent; SZ: in U64)
This procedure defines the data size of the entry to be created in bytes.
It is recommended to call this procedure at least once for each entry.
Set_Modification_Time(Ent; M_Time: in U64)
This procedure defines the modification time as UNIX timestamps i.e. in seconds since the epoch (1970-01-01 00:00:00 UTC). Earlier file dates are not supported.
Set_Owner(Ent; UID, GID: in U64)
This procedure configures the owner of the entry as a numeric user id (UID) and group id (GID). Typical values on desktop systems are e.g. (1000, 1000) for user-created files and (0, 0) for root-owned files.
Set_Owner(Ent; U_Name, G_Name: in String)
This procedure configures the owner of the entry by giving the user and group name as strings. These values are stored independently of the given numeric fields and upon extraction, tar-compatible applications are expected to prefer these names over the numeric IDs and only use the numeric values when the respective (named) owner does not exist on the current system.
Please consider the intended use case before blindly storing the user and group names here: For some users, the login name may correspond to their actual name and archives may be uploaded to online targets, breaching the users’ anonymity.
Universal archivers like e.g. GNU Tar provide the user with options
to change the default behaviour of storing the owner names (called
--numeric-owner
there). POSIX does not seem to prescribe
such an option for conformant pax
archivers, though.
Set_Link_Target(Ent; Target: in String)
This procedure defines the target of a symlink or a hardlink.
Set_Device(Ent; Major, Minor: in Dev_Node)
If the entry to be created corresponds to a device node, this
procedure sets the associated Major
and Minor
numbers.
Note: While PAX could represent arbitrarily long numbers here, this implementation limits the device node major and minor numbers to the limits defined for USTAR since that seems to cover all practical use cases already.
Add_X_Attr(Ent; Key, Value: in String)
This procedure adds an extended attribute as a free-form key/value pair.
Note that the storage of extended attributes is not defined by PAX
and thus the extended attributes can only be restored by archivers that
support the convention implemented here aka. SCHILY.xattr
,
cf. https://man.freebsd.org/cgi/man.cgi?query=star&sektion=5.
Content
After having configured all metadata for an entry, the associated
header can be obtained with Begin_Entry
. Then, any number
of calls to Add_Content
can be used to format data to be
added for this entry and finally, the entry is concluded by calling
End_Entry
. If no further entries appear in this TAR, obtain
the TAR Footer from End_Tar
.
Begin_Entry(Ent) return Stream_Element_Array
This function returns all the metadata configured for the current
entry as a readily streamable binary blob. It allows that subsequently,
Add_Content
can be called to process the actual file
contents.
Add_Content(Ent; Cnt: in Stream_Element_Array) return Stream_Element_Array
This function may seem like an identity function because it returns the same data as being input. In the course, it counts the number of bytes and keeps track of the alignment to TAR blocks (512 bytes each) which is necessary to properly end the entry.
End_Entry(Ent) return Stream_Element_Array
When all content has been added, End_Entry
concludes the
entry by returning suitable padding as to fill the 512 byte blocks. This
padding may be empty when the entry size is a multiple of 512 bytes.
End_Tar return Stream_Element_Array
When all entries have been added, End_Tar
can be used to
obtain the end of archive marker which is a fancy way of getting 1 KiB
of zero bytes btw.
Performance
This library has not been extensively optimized for performance. One
can use the example taradaarc
to do some very basic
performance tests, though.
Using a small test set of 2166 MiB and with 65685 entries, the
following timings are obtained by taradaarc
and GNU tar
(version 1.34 as shipped by Debian Bookworm). These benchmarks run on a
ramdisk.
$ hyperfine "tar -c /tmp/testset | dd of=/dev/null bs=1M"
Benchmark 1: tar -c /tmp/testset | dd of=/dev/null bs=1M
Time (mean ± σ): 1.103 s ± 0.060 s [User: 0.116 s, System: 1.763 s]
Range (min … max): 1.048 s … 1.260 s 10 runs
$ hyperfine "taradaarc /tmp/testset | dd of=/dev/null bs=1M"
Benchmark 1: taradaarc /tmp/testset | dd of=/dev/null bs=1M
Time (mean ± σ): 1.665 s ± 0.082 s [User: 0.370 s, System: 2.116 s]
Range (min … max): 1.600 s … 1.881 s 10 runs
That means GNU tar achieves 1964 MiB/s and taradaarc using the library achieves 1301 MiB/s of throughput which is significantly slower but probably OK for many practical use cases.
Rationale and Usage Recommendation
This library’s API is opinionated in that it does not supply standard
“archiver” functionality like adding files by a path and having them
read from disk automatically. Hence it may not be well-suited in cases
where a TAR library is required as a replacement for calling the
tar
command on the system. It has rather been designed as a
sort of portable file system abstraction because tar files can contain
many of UNIX’ special file types and attributes even on operating
systems which do not support them natively. If you are developing an
application where output files need to be written in a UNIX-specific
format but that is intended to run on other platforms, too, you could
consider using this library for output and then filtering the output of
your application through tar -x
. On UNIX platforms this
recreates all of the attributes as far as possible (i.e. as limited by
the running user’s capabilities) whereas on other platforms (like
e.g. Windows) it mostly gracefully degrades to what can be represented
there.
This library has not been tested for compatibility with a wide range of implementations. Instead, it was implemented based on reading the specification and then validating that it works with GNU tar. Depending on your use case this may be acceptable or additional validation and integration testing may be required.
The library is designed to minimize memory allocations while still
staying reasonably “easy” to use. Effectively the only place where
unbounded amounts of memory are needed is the
Indefinite_Ordered_Map
for constructing PAX extended
headers. If you enable USTAR mode, this map is not populated and always
remains an Empty_Map
which effectively removes the need for
dynamic memory allocation.
Future Directions
The test suite is a mess, it would benefit from being refactored and probably from moving to a dedicated “test framework” or such. Also, if the library ever gains support for reading archives, it is going to be much easier to perform some tests…
Feel free to send patches with bugfixes or missing functionality directly to info@masysma.net. Include a note to confirm that you are OK with these patches being included under GPL-3 or later license and add your preferred copyright line to the patch or e-mail.
Please note that API breaks are only accepted if good reasons exist to motivate them.