NAME
reposurgeon - surgical operations on repositories
SYNOPSIS
reposurgeon [command...]
DESCRIPTION
The purpose of reposurgeon is to enable risky operations that VCSes (version-control
systems) don't want to let you do, such as (a) editing past comments and metadata, (b)
excising commits, (c) coalescing and splitting commits, (d) removing files and subtrees
from repo history, (e) merging or grafting two or more repos, and (f) cutting a repo in
two by cutting a parent-child link, preserving the branch structure of both child repos.
The original motivation for reposurgeon was to clean up artifacts created by repository
conversions. It was foreseen that the tool would also have applications when code needs to
be removed from repositories for legal or policy reasons.
To keep reposurgeon simple and flexible, it normally does not do its own repository
reading and writing. Instead, it relies on being able to parse and emit the command
streams created by git-fast-export and read by git-fast-import. This means that it can be
used on any version-control system that has both fast-export and fast-import utilities.
The git-import stream format also implicitly defines a common language of primitive
operations for reposurgeon to speak.
Fully supported systems (those for which reposurgeon can both read and write repositories)
include git, hg, bzr, svn, darcs, RCS, and SRC. For a complete list, with dependencies and
technical notes, type prefer at the reposurgeon prompt.
Writing to the file-oriented systems RCS and SRC is done via rcs-fast-import(1) and has
some serious limitations because those systems cannot represent all the metadata in a
git-fast-export stream. Consult that tool's documentation for details and partial
workarounds.
Writing Subversion repositories also has some significant limitations, discussed in the
section on Working With Subversion.
Fossil repository files can be read in using the --format=fossil option of the read
command and written out with the --format=fossil option of the write command. Ignore patterns are
not translated in either direction.
CVS is supported for read only, not write. For CVS, reposurgeon must be run from within a
repository directory.
For guidance on the pragmatics of repository conversion, see the DVCS Migration HOWTO[1].
SAFETY WARNINGS
reposurgeon is a sharp enough tool to cut you. It takes care not to ever write a
repository in an actually inconsistent state, and will terminate with an error message
rather than proceed when its internal data structures are confused. However, there are
lots of things you can do with it - like altering stored commit timestamps so they no
longer match the commit sequence - that are likely to cause havoc after you're done.
Proceed with caution and check your work.
Also note that, if your DVCS does the usual thing of making commit IDs a cryptographic
hash of content and parent links, editing a publicly-accessible repository with this tool
would be a bad idea. All of the surgical operations in reposurgeon will modify the hash
chains, meaning others will become unable to pull from or push to the repo.
Please also see the notes on system-specific issues under the section called “LIMITATIONS
AND GUARANTEES”.
OPERATION
The program can be run in one of two modes, either as an interactive command interpreter
or in batch mode to execute commands given as arguments on the reposurgeon invocation
line. The only differences between these modes are (1) the interactive one begins by
turning on the 'verbose 1' option, (2) in batch mode all errors (including normally
recoverable errors in selection-set syntax) are fatal, and (3) each command-line argument
beginning with “--” has that stripped off (which, in particular, means that --help and
--version will work as expected). Also, in interactive mode, Ctrl-P and Ctrl-N will be
available to scroll through your command history and tab completion of command keywords is
available.
A git-fast-import stream consists of a sequence of commands which must be executed in the
specified sequence to build the repo; to avoid confusion with reposurgeon commands we will
refer to the stream commands as events in this documentation. These events are implicitly
numbered from 1 upwards. Most commands require specifying a selection of event sequence
numbers so reposurgeon will know which events to modify or delete.
For all the details of event types and semantics, see the git-fast-import(1) manual page;
the rest of this paragraph is a quick start for the impatient. Most events in a stream are
commits describing revision states of the repository; these group together under a single
change comment one or more fileops (file operations), which usually point to blobs that
are revision states of individual files. A fileop may also be a delete operation
indicating that a specified previously-existing file was deleted as part of the version
commit; there are a couple of other special fileop types of lesser importance.
Commands to reposurgeon consist of a command keyword, sometimes preceded by a selection
set, sometimes followed by whitespace-separated arguments. It is often possible to omit
the selection-set argument and have it default to something reasonable.
Here are some motivating examples. The commands will be explained in more detail after the
description of selection syntax.
:15 edit ;; edit the object associated with mark :15
edit ;; edit all editable objects
29..71 list ;; list summary index of events 29..71
236..$ list ;; List events from 236 to the last
<#523> inspect ;; Look for commit #523; they are numbered
;; 1-origin from the beginning of the repository.
<2317> inspect ;; Look for a tag with the name 2317, a tip commit
;; of a branch named 2317, or a commit with legacy ID
;; 2317. Inspect what is found. A plain number is
;; probably a legacy ID inherited from a Subversion
;; revision number.
/regression/ list ;; list all commits and tags with comments or
;; committer headers or author headers containing
;; the string "regression"
1..:97 & =T delete ;; delete tags from event 1 to mark 97
[Makefile] inspect ;; Inspect all commits with a file op touching Makefile
;; and all blobs referred to in a fileop
;; touching Makefile.
:46 tip ;; Display the branch tip that owns commit :46.
@dsc(:55) list ;; Display all commits with ancestry tracing to :55
@min([.gitignore]) remove .gitignore delete
;; Remove the first .gitignore fileop in the repo.
SELECTION SYNTAX
The selection-set specification syntax is an expression-oriented minilanguage. The most
basic term in this language is a location. The following sorts of primitive locations are
supported:
event numbers
A plain numeric literal is interpreted as a 1-origin event-sequence number.
marks
A numeric literal preceded by a colon is interpreted as a mark; see the import stream
format documentation for explanation of the semantics of marks.
tag and branch names
The basename of a branch (including branches in the refs/tags namespace) refers to its
tip commit. The name of a tag is equivalent to its mark (that of the tag itself, not
the commit it refers to). Tag and branch locations are bracketed with < > (angle
brackets) to distinguish them from command keywords.
legacy IDs
If the contents of name brackets (< >) does not match a tag or branch name, the
interpreter next searches legacy IDs of commits. This is especially useful when you
have imported a Subversion dump; it means that commits made from it can be referred to
by their corresponding Subversion revision numbers.
commit numbers
A numeric literal within name brackets (< >) preceded by # is interpreted as a 1-origin
commit-sequence number.
$
Refers to the last event.
These may be grouped into sets in the following ways:
ranges
A range is two locations separated by "..", and is the set of events beginning at the
left-hand location and ending at the right-hand location (inclusive).
lists
Comma-separated lists of locations and ranges are accepted, with the obvious meaning.
There are some other ways to construct event sets:
visibility sets
A visibility set is an expression specifying a set of event types. It will consist of
a leading equal sign, followed by type letters. These are the type letters:
┌──┬──────────────────────────┬──────────────────────────┐
│B │ blobs │ Most default selection │
│ │ │ sets exclude blobs; they │
│ │ │ have to be manipulated │
│ │ │ through the commits they │
│ │ │ are attached to. │
├──┼──────────────────────────┼──────────────────────────┤
│C │ commits │ │
├──┼──────────────────────────┼──────────────────────────┤
│D │ all-delete commits │ These are artifacts │
│ │ │ produced by some older │
│ │ │ repository-conversion │
│ │ │ tools. │
├──┼──────────────────────────┼──────────────────────────┤
│H │ head (branch tip) │ │
│ │ commits │ │
├──┼──────────────────────────┼──────────────────────────┤
│O │ orphaned (parentless) │ │
│ │ commits │ │
├──┼──────────────────────────┼──────────────────────────┤
│U │ commits with callouts as │ │
│ │ parents │ │
├──┼──────────────────────────┼──────────────────────────┤
│Z │ commits with no fileops │ │
├──┼──────────────────────────┼──────────────────────────┤
│M │ merge (multi-parent) │ │
│ │ commits │ │
├──┼──────────────────────────┼──────────────────────────┤
│F │ fork (multi-child) │ │
│ │ commits │ │
├──┼──────────────────────────┼──────────────────────────┤
│L │ commits with unclean │ │
│ │ multi-line comments │ │
│ │ (without a separating │ │
│ │ empty line after the │ │
│ │ first) │ │
├──┼──────────────────────────┼──────────────────────────┤
│I │ commits for which │ │
│ │ metadata cannot be │ │
│ │ decoded to UTF-8 │ │
├──┼──────────────────────────┼──────────────────────────┤
│T │ tags │ │
├──┼──────────────────────────┼──────────────────────────┤
│R │ resets │ │
├──┼──────────────────────────┼──────────────────────────┤
│P │ Passthrough │ All event types simply │
│ │ │ passed through, │
│ │ │ including comments, │
│ │ │ progress commands, and │
│ │ │ checkpoint commands. │
├──┼──────────────────────────┼──────────────────────────┤
│N │ Legacy IDs │ Any string matching a │
│ │ │ cookie (legacy-ID) │
│ │ │ format. │
└──┴──────────────────────────┴──────────────────────────┘
references
A reference name (bracketed by angle brackets) resolves to a single object, either a
commit or tag.
┌──────────────┬────────────────────────────────┐
│ type │ interpretation │
├──────────────┼────────────────────────────────┤
│ tag name │ annotated tag with that name │
├──────────────┼────────────────────────────────┤
│ branch name │ the branch tip commit │
├──────────────┼────────────────────────────────┤
│ legacy ID │ commit with that legacy ID │
├──────────────┼────────────────────────────────┤
│assigned name │ name equated to a selection by │
│ │ assign │
└──────────────┴────────────────────────────────┘
Note that if an annotated tag and a branch have the same name foo, <foo> will resolve
to the tag rather than the branch tip commit.
dates and action stamps
A date or action stamp in angle brackets resolves to a selection set of all matching
commits.
┌───────────────────────────────┬────────────────────────────────┐
│ type │ interpretation │
├───────────────────────────────┼────────────────────────────────┤
│ RFC3339 timestamp │ commit or tag with that │
│ │ time/date │
├───────────────────────────────┼────────────────────────────────┤
│action stamp (timestamp!email) │ commits or tags with that │
│ │ timestamp and author (or │
│ │ committer if no author). │
├───────────────────────────────┼────────────────────────────────┤
│ yyyy-mm-dd part of RFC3339 │ all commits and tags with that │
│ timestamp │ date │
└───────────────────────────────┴────────────────────────────────┘
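As a rough illustration of how the three forms in the table differ, the following Python sketch classifies the text found between the angle brackets. The function name classify_reference, the returned tuple layout, the simplified patterns, and the example address are all assumptions made for this illustration; this is not reposurgeon's own parser.

```python
import re

# Simplified patterns assumed for this sketch: only the UTC "Z" form of an
# RFC3339 timestamp is recognized.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")

def classify_reference(text):
    """Return (kind, timestamp, email) for the text between < and >."""
    # An action stamp is a timestamp and an email address joined by '!'.
    timestamp, bang, email = text.partition("!")
    if bang and TIMESTAMP_RE.match(timestamp):
        return ("action stamp", timestamp, email)
    if not bang and TIMESTAMP_RE.match(timestamp):
        return ("timestamp", timestamp, None)
    if not bang and DATE_RE.match(timestamp):
        return ("date", timestamp, None)
    # Anything else would be a tag, branch, legacy ID or assigned name.
    return ("other", text, None)

print(classify_reference("2000-02-06T09:35:10Z"))
print(classify_reference("2000-02-06T09:35:10Z!fred@example.com"))
print(classify_reference("2000-02-06"))
```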
To refine the match to a single commit, use a 1-origin index suffix separated by '#'.
Thus "<2000-02-06T09:35:10Z>" can match multiple commits, but
"<2000-02-06T09:35:10Z#2>" matches only the second in the set.
text search
A text search expression is a Python regular expression surrounded by forward slashes
(to embed a forward slash in it, use a Python string escape such as \x2f).
A text search normally matches against the comment fields of commits and annotated
tags, or against their author/committer names, or against the names of tags; also the
text of passthrough objects.
The scope of a text search can be changed with qualifier letters after the trailing
slash. These are as follows:
┌───────┬──────────────────────────────────┐
│letter │ interpretation │
├───────┼──────────────────────────────────┤
│ a │ author name in commit │
├───────┼──────────────────────────────────┤
│ b │ branch name in commit; also │
│ │ matches blobs referenced by │
│ │ commits on matching branches, │
│ │ and tags which point to commits │
│ │ on matching branches. │
├───────┼──────────────────────────────────┤
│ c │ comment text of commit or tag │
├───────┼──────────────────────────────────┤
│ r │ committish reference in tag or │
│ │ reset │
├───────┼──────────────────────────────────┤
│ p │ text in passthrough │
├───────┼──────────────────────────────────┤
│ t │ tagger in tag │
├───────┼──────────────────────────────────┤
│ n │ name of tag │
├───────┼──────────────────────────────────┤
│ B │ blob content │
└───────┴──────────────────────────────────┘
Multiple qualifier letters can add more search scopes.
(The “b” qualifier replaces the branchset syntax in earlier versions of reposurgeon.)
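Because a search expression is a Python regular expression, its behavior can be previewed in Python itself. The sketch below runs two expressions over a few made-up commit comments and checks that the \x2f escape stands for an embedded slash. The comments list and the matching helper are hypothetical and exist only for this illustration.

```python
import re

# Made-up commit comments, standing in for a repository's comment fields.
comments = [
    "Fix regression in parser",
    "Move docs to doc/manual",
    "Initial import",
]

def matching(pattern, texts):
    """Return the 1-origin positions of texts the regex matches anywhere."""
    rx = re.compile(pattern)
    return [i for i, text in enumerate(texts, start=1) if rx.search(text)]

# Similar in spirit to /regression/c, a search restricted to comment text.
print(matching(r"regression", comments))

# \x2f stands for '/', so no literal slash appears inside the expression.
print(matching(r"doc\x2fmanual", comments))
```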
paths
A "path expression" enclosed in square brackets resolves to the set of all commits and
blobs related to a path matching the given expression. The path expression itself is
either a path literal or a regular expression surrounded by slashes. Immediately after
the trailing / of a path regexp you can put any number of the following characters
which act as flags: 'a', 'c', 'D', 'M', 'R', 'C', 'N'.
By default, a path is related to a commit if the latter has a fileop that touches that
file path - modifies that change it, deletes that remove it, renames and copies that
have it as a source or target. When the 'c' flag is in use the meaning changes: the
paths related to a commit become all paths that would be present in a checkout for
that commit.
A path literal matches a commit if and only if the path literal is exactly one of the
paths related to the commit (no prefix or suffix operation is done). In particular a
path literal won't match if it corresponds to a directory in the chosen repository.
A regular expression matches a commit if it matches any path related to the commit
anywhere in the path. You can use '^' or '$' if you want the expression to only match
at the beginning or end of paths. When the 'a' flag is in use, the path expression
selects commits whose every path matches the regular expression. This is not always a
subset of commits selected without the 'a' flag because it also selects commits with
no related paths (e.g. empty commits, deletealls and commits with empty trees). If you
want to avoid those, you can use e.g. '[/regex/] & [/regex/a]'.
The flags 'D', 'M', 'R', 'C', 'N' restrict match checking to the corresponding fileop
types. Note that this means an 'a' match is easier (not harder) to achieve. These are
no-ops when used with 'c'.
A path literal or regular expression matches a blob if it matches any path that appeared in a
modification fileop that referred to that blob. To select purely matching blobs or
matching commits, compose a path expression with =B or =C.
If you need to embed '[^/]' into your regular expression (e.g. to express "all
characters but a slash") you can use a Python string escape such as \x2f.
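The difference between a plain path regexp and one with the 'a' flag can be modeled with Python's any() and all(). This is only a sketch under stated assumptions: the commits table is invented, each commit is reduced to the list of paths its fileops touch, and the select helper is hypothetical rather than part of reposurgeon.

```python
import re

# Hypothetical commits, each reduced to the paths its fileops touch.
commits = {
    1: ["src/main.c", "README"],
    2: ["src/util.c", "src/util.h"],
    3: [],  # an empty commit has no related paths
}

def select(pattern, all_paths=False):
    """Model [/pattern/] (any path matches) or [/pattern/a] (every path matches)."""
    rx = re.compile(pattern)
    test = all if all_paths else any
    return {n for n, paths in commits.items() if test(rx.search(p) for p in paths)}

# Inside reposurgeon the slash in this pattern would be written \x2f.
any_match = select(r"^src/")
all_match = select(r"^src/", all_paths=True)

print(sorted(any_match))              # commits touching at least one src/ path
print(sorted(all_match))              # also holds the empty commit, vacuously
print(sorted(any_match & all_match))  # models '[/regex/] & [/regex/a]'
```

The empty commit appears in the second result because all() of an empty sequence is true, which is the reason the 'a' form is not always a subset of the plain form.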
function calls
The expression language has named special functions. The sequence for a named function
is “@” followed by a function name, followed by an argument in parentheses. Presently
the following functions are defined:
┌─────┬─────────────────────────────────┐
│name │ interpretation │
├─────┼─────────────────────────────────┤
│min │ minimum member of a selection │
│ │ set │
├─────┼─────────────────────────────────┤
│max │ maximum member of a selection │
│ │ set │
├─────┼─────────────────────────────────┤
│amp │ nonempty selection set becomes │
│ │ all objects, empty set is │
│ │ returned empty │
├─────┼─────────────────────────────────┤
│par │ all parents of commits in the │
│ │ argument set │
├─────┼─────────────────────────────────┤
│chn │ all children of commits in the │
│ │ argument set │
├─────┼─────────────────────────────────┤
│dsc │ all commits descended from the │
│ │ argument set (argument set │
│ │ included) │
├─────┼─────────────────────────────────┤
│anc │ all commits whom the argument │
│ │ set is descended from (argument │
│ │ set included) │
├─────┼─────────────────────────────────┤
│pre │ events before the argument set; │
│ │ empty if the argument set │
│ │ includes the first event. │
├─────┼─────────────────────────────────┤
│suc │ events after the argument set; │
│ │ empty if the argument set │
│ │ includes the last event. │
└─────┴─────────────────────────────────┘
Set expressions may be combined with the operators | and &; these are, respectively, set
union and intersection. The | has lower precedence than intersection, but you may use
parentheses '(' and ')' to group expressions in case there is ambiguity (this replaces the
curly brackets used in older versions of the syntax).
Any set operation may be followed by '?' to add the set members' neighbors and referents.
This extends the set to include the parents and children of all commits in the set, and
the referents of any tags and resets in the set. Each blob reference in the set is
replaced by all commits that refer to it. The '?' can be repeated to extend the
neighborhood depth.
Do set negation with prefix ~; it has higher precedence than & and | but lower than ?
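The operators behave like ordinary set algebra, which the following Python sketch models with built-in sets. The event numbers are invented, and treating ~ as the complement within all events of the repository is an assumption of this sketch; span and negate are hypothetical helper names.

```python
# Hypothetical event numbers for a tiny repository of 10 events.
ALL_EVENTS = set(range(1, 11))
commits = {2, 4, 6, 8, 10}  # stands in for =C
tags = {9}                  # stands in for =T

def span(lo, hi):
    """Model a range lo..hi, inclusive at both ends."""
    return set(range(lo, hi + 1))

def negate(selection):
    """Model prefix ~ as the complement within all events (an assumption)."""
    return ALL_EVENTS - selection

# 1..5 & =C
print(sorted(span(1, 5) & commits))
# =T | 1..5 & =C   (& binds tighter than |, in Python as in the text above)
print(sorted(tags | span(1, 5) & commits))
# ~=C & 1..5       (negation is applied before the intersection)
print(sorted(negate(commits) & span(1, 5)))
```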
IMPORT AND EXPORT
reposurgeon can hold multiple repository states in core. Each has a name. At any given
time, one may be selected for editing. Commands in this group import repositories, export
them, and manipulate the in-core list and the selection.
read [--format=fossil] [directory|-|<infile]
With a directory-name argument, this command attempts to read in the contents of a
repository in any supported version-control system under that directory; read with no
arguments does this in the current directory. If input is redirected from a plain file,
it will be read in as a fast-import stream or Subversion dumpfile. With an argument of
“-”, this command reads a fast-import stream or Subversion dumpfile from standard
input (this will be useful in filters constructed with command-line arguments).
If the contents is a fast-import stream, any "cvs-revision" property on a commit is
taken to be a newline-separated list of CVS revision cookies pointing to the commit,
and used for reference lifting.
If the contents is a fast-import stream, any "legacy-id" property on a commit is taken
to be a legacy ID token pointing to the commit, and used for reference-lifting.
If the read location is a git repository and contains a .git/cvsauthors file (such as
is left in place by git cvsimport -A) that file will be read in as if it had been
given to the authors read command.
If the read location is a directory, and its repository subdirectory has a file named
legacy-map, that file will be read as though passed to a legacy read command.
If the read location is a file and the --format=fossil option is used, the file is
interpreted as a Fossil repository.
The just-read-in repo is added to the list of loaded repositories and becomes the
current one, selected for surgery. If it was read from a plain file and the file name
ends with one of the extensions .fi or .svn, that extension is removed from the load
list name.
Note: this command does not take a selection set.
write [--legacy] [--format=fossil] [--noincremental] [--callout] [>outfile|-]
Dump selected events as a fast-import stream representing the edited repository; the
default selection set is all events. Where to dump to is standard output if there is
no argument or the argument is '-', or the target of an output redirect.
Alternatively, if there is no redirect and the argument names a directory, the
repository is rebuilt into that directory, with any selection set being ignored; if
that target directory is nonempty its contents are backed up to a save directory.
If the write location is a file and the --format=fossil option is used, the file is written
in Fossil repository format.
With the --legacy option, the Legacy-ID of each commit is appended to its commit
comment at write time. This option is mainly useful for debugging conversion edge
cases.
If you specify a partial selection set such that some commits are included but their
parents are not, the output will include incremental dump cookies for each branch with
an origin outside the selection set, just before the first reference to that branch in
a commit. An incremental dump cookie looks like "refs/heads/foo^0" and is a clue to
export-stream loaders that the branch should be glued to the tip of a pre-existing
branch of the same name. The --noincremental option suppresses this behavior.
When you specify a partial selection set, including a commit object forces the
inclusion of every blob to which it refers and every tag that refers to it.
Specifying a partial selection may cause a situation in which some parent marks in
merges don't correspond to commits present in the dump. When this happens and the
--callout option was specified, the write code replaces the merge mark with a callout,
the action stamp of the parent commit; otherwise the parent mark is omitted. Importers
will fail when reading a stream dump with callouts; it is intended to be used by the
graft command.
Specifying a write selection set with gaps in it is allowed but unlikely to lead to
good results if it is loaded by an importer.
Property extensions will be omitted from the output if the importer for the
preferred repository type cannot digest them.
Note: to examine small groups of commits without the progress meter, use inspect.
choose [reponame]
Choose a named repo on which to operate. The name of a repo is normally the basename
of the directory or file it was loaded from, but repos loaded from standard input are
"unnamed". reposurgeon will add a disambiguating suffix if there have been multiple
reads from the same source.
With no argument, lists the names of the currently stored repositories and their load
times. The second column is '*' for the currently selected repository, '-' for others.
drop [reponame]
Drop a repo named by the argument from reposurgeon's list, freeing the memory used for
its metadata and deleting on-disk blobs. With no argument, drops the currently chosen
repo.
rename reponame
Rename the currently chosen repo; requires an argument. Won't do it if there is
already one by the new name.
REBUILDS IN PLACE
reposurgeon can rebuild an altered repository in place. Untracked files are normally saved
and restored when the contents of the new repository is checked out (but see the
documentation of the “preserve” command for a caveat).
rebuild [directory]
Rebuild a repository from the state held by reposurgeon. This command does not take a
selection set.
The single argument, if present, specifies the target directory in which to do the
rebuild; if the repository read was from a repo directory (and not a git-import
stream), it defaults to that directory. If the target directory is nonempty its
contents are backed up to a save directory. Files and directories on the repository's
preserve list are copied back from the backup directory after repo rebuild. The
default preserve list depends on the repository type, and can be displayed with the
stats command.
If reposurgeon has a nonempty legacy map, it will be written to a file named
legacy-map in the repository subdirectory as though by a legacy write command. (This
will normally be the case for Subversion and CVS conversions.)
preserve [file...]
Add (presumably untracked) files or directories to the repo's list of paths to be
restored from the backup directory after a rebuild. Each argument, if any, is
interpreted as a pathname. The current preserve list is displayed afterwards.
It is only necessary to use this feature if your version-control system lacks a
command to list files under version control. Under systems with such a command (which
include git and hg), all files that are neither beneath the repository dot directory
nor under reposurgeon temporary directories are preserved automatically.
unpreserve [file...]
Remove (presumably untracked) files or directories from the repo's list of paths to be
restored from the backup directory after a rebuild. Each argument, if any, is
interpreted as a pathname. The current preserve list is displayed afterwards.
INFORMATION AND REPORTS
Commands in this group report information about the selected repository.
The output of these commands can individually be redirected to a named output file. Where
indicated in the syntax, you can prefix the output filename with “>” and give it as a
following argument. If you use “>>” the file is opened for append rather than write.
list [>outfile]
This is the main command for identifying the events you want to modify. It lists
commits in the selection set by event sequence number with summary information. The
first column is raw event numbers, the second a timestamp in local time. If the
repository has legacy IDs, they will be displayed in the third column. The leading
portion of the comment follows.
stamp [>outfile]
Alternative form of listing that displays full action stamps, usable as references in
selections. Supports > redirection.
tip [>outfile]
Display the branch tip names associated with commits in the selection set. These will
not necessarily be the same as their branch fields (which will often be tag names if
the repo contains either annotated or lightweight tags).
If a commit is at a branch tip, its tip is its branch name. If it has only one child,
its tip is the child's tip. If it has multiple children, then if there is a child with
a matching branch name its tip is the child's tip. Otherwise this function throws a
recoverable error.
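The rule for finding a commit's tip can be sketched as a short recursive function. Everything here is assumed for illustration: the graph table is invented, "at a branch tip" is modeled as "has no children", and TipError merely stands in for the recoverable error mentioned above.

```python
class TipError(Exception):
    """Stands in for the recoverable error described in the text."""

# Hypothetical commit graph: mark -> (branch field, list of child marks).
graph = {
    ":1": ("refs/heads/master", [":2"]),
    ":2": ("refs/heads/master", [":3", ":4"]),
    ":3": ("refs/heads/master", []),
    ":4": ("refs/heads/feature", []),
    ":5": ("refs/tags/v1", [":6", ":7"]),
    ":6": ("refs/heads/a", []),
    ":7": ("refs/heads/b", []),
}

def tip(mark):
    """Return the branch tip name that owns the commit with this mark."""
    branch, children = graph[mark]
    if not children:                # at a branch tip: use its own branch name
        return branch
    if len(children) == 1:          # one child: inherit the child's tip
        return tip(children[0])
    for child in children:          # several children: follow a same-branch child
        if graph[child][0] == branch:
            return tip(child)
    raise TipError("cannot determine the tip of " + mark)

print(tip(":1"))
print(tip(":4"))
```

Calling tip(":5") in this model raises TipError, because neither child carries the branch name of the commit itself.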
tags [>outfile]
Display tags and resets: three fields, an event number and a type and a name. Branch
tip commits associated with tags are also displayed with the type field 'commit'.
Supports > redirection.
stats [repo-name...] [>outfile]
Report size statistics and import/export method information about named repositories,
or with no argument the currently chosen repository.
count [>outfile]
Report a count of items in the selection set. Default set is everything in the
currently-selected repo. Supports > redirection.
inspect [>outfile]
Dump a fast-import stream representing selected events to standard output. Just like a
write, except (1) the progress meter is disabled, and (2) there is an identifying
header before each event dump.
graph [>outfile]
Emit a visualization of the commit graph in the DOT markup language used by the
graphviz tool suite. This can be fed as input to the main graphviz rendering program
dot(1), which will yield a viewable image. Supports > redirection.
You may find a script like this useful:
graph $1 >/tmp/foo$$
shell dot </tmp/foo$$ -Tpng | display -; rm /tmp/foo$$
You can substitute in your own preferred image viewer, of course.
sizes [>outfile]
Print a report on data volume per branch; takes a selection set, defaulting to all
events. The numbers tally the size of uncompressed blobs, commit and tag comments, and
other metadata strings (a blob is counted each time a commit points at it).
The numbers are not an exact measure of storage size: they are intended mainly as a
way to get information on how to efficiently partition a repository that has become
large enough to be unwieldy.
Supports > redirection.
lint [>outfile]
Look for DAG and metadata configurations that may indicate a problem. Presently checks
for: (1) Mid-branch deletes, (2) disconnected commits, (3) parentless commits, (4) the
existence of multiple roots, (5) committer and author IDs that don't look well-formed
as DVCS IDs, (6) multiple child links with identical branch labels descending from the
same commit, (7) time and action-stamp collisions.
Options to issue only partial reports are supported; "lint --options" or "lint -?"
lists them.
The options and output format of this command are unstable; they may change without
notice as more sanity checks are added.
SURGICAL OPERATIONS
These are the operations the rest of reposurgeon is designed to support.
squash [policy...]
Combine or delete commits in a selection set of events. The default selection set for
this command is empty. Has no effect on events other than commits unless the --delete
policy is selected; see the 'delete' command for discussion.
Normally, when a commit is squashed, its file operation list (and any associated blob
references) gets either prepended to the beginning of the operation list of each of
the commit's children or appended to the operation list of each of the commit's
parents. Then children of a deleted commit get it removed from their parent set and
its parents added to their parent set.
The default is to squash forward, modifying children; but see the list of policy
modifiers below for how to change this.
Warning
It is easy to get the bounds of a squash command wrong, with confusing and
destructive results. Beware thinking you can squash on a selection set to merge
all commits except the last one into the last one; what you will actually do is to
merge all of them to the first commit after the selected set.
Normally, any tag pointing to a combined commit will also be pushed forward. But see
the list of policy modifiers below for how to change this.
Following all operation moves, every one of the altered file operation lists is
reduced to a shortest normalized form. The normalized form detects various
combinations of modification, deletion, and renaming and simplifies the operation
sequence as much as it can without losing any information.
After canonicalization, a file op list may still end up containing multiple M
operations on the same file. Normally the tool utters a warning when this occurs but
does not try to resolve it.
The following modifiers change these policies:
--delete
Simply discards all file ops and tags associated with deleted commit(s).
--coalesce
Discard all M operations (and associated blobs) except the last.
--pushback
Append fileops to parents, rather than prepending to children.
--pushforward
Prepend fileops to children. This is the default; it can be specified in a lift
script for explicitness about intentions.
--tagforward
With the "--tagforward" modifier, any tag on the deleted commit is pushed forward to
the first child rather than being deleted. This is the default; it can be
specified for explicitness.
--tagback
With the "--tagback" modifier, any tag on the deleted commit is pushed backward to
the first parent rather than being deleted.
--quiet
Suppresses warning messages about deletion of commits with non-delete fileops.
--complain
The opposite of quiet. Can be specified for explicitness.
Under any of these policies except “--delete”, deleting a commit that has children
does not back out the changes made by that commit, as they will still be present in
the blobs attached to versions past the end of the deletion set. All a delete does
when the commit has children is lose the metadata information about when and by whom
those changes were actually made; after the delete any such changes will be attributed
to the first undeleted children of the deleted commits. It is expected that this
command will be useful mainly for removing commits mechanically generated by
repository converters such as cvs2svn.
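The default push-forward behavior can be pictured with a small model of a linear history. This is a sketch only: the commits table and the squash_forward helper are hypothetical, fileops are reduced to (operation, path) pairs, and the normalization step and tag handling described above are not modeled.

```python
# Hypothetical linear history A -> B -> C with simplified fileops.
commits = {
    "A": {"parents": [], "children": ["B"], "ops": [("M", "README")]},
    "B": {"parents": ["A"], "children": ["C"], "ops": [("M", "main.c")]},
    "C": {"parents": ["B"], "children": [], "ops": [("D", "README")]},
}

def squash_forward(name):
    """Push a commit's fileops onto its children, then unlink the commit."""
    victim = commits.pop(name)
    for child_name in victim["children"]:
        child = commits[child_name]
        # Prepend, so the child's own operations still apply last.
        child["ops"] = victim["ops"] + child["ops"]
        # The child loses the squashed parent and gains that commit's parents.
        child["parents"] = (
            [p for p in child["parents"] if p != name] + victim["parents"]
        )
    for parent_name in victim["parents"]:
        parent = commits[parent_name]
        parent["children"] = (
            [c for c in parent["children"] if c != name] + victim["children"]
        )

squash_forward("B")
print(commits["C"]["ops"])      # B's modify now precedes C's delete
print(commits["C"]["parents"])  # C is reparented onto A
```

Note that the content of B ends up in C, the first commit after the selected one, which is the behavior the warning above cautions about.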
delete [policy...]
Delete a selection set of events. The default selection set for this command is empty.
On a set of commits, this is equivalent to a squash with the --delete flag. It
unconditionally deletes tags, resets, and passthroughs; blobs can be removed only as a
side effect of deleting every commit that points at them.
divide parent [child]
Attempt to partition a repo by cutting the parent-child link between two specified
commits (they must be adjacent). Does not take a general selection set. It is only
necessary to specify the parent commit, unless it has multiple children in which case
the child commit must follow (separate it with a comma).
If the repo was named 'foo', you will normally end up with two repos named 'foo-early'
and 'foo-late' (option and feature events at the beginning of the early segment will
be duplicated onto the beginning of the late one). But if the commit graph would
remain connected through another path after the cut, the behavior changes. In this
case, if the parent and child were on the same branch 'qux', the branch segments are
renamed 'qux-early' and 'qux-late' but the repo is not divided.
expunge [path | /regexp/]...
Expunge files from the selected portion of the repo history; the default is the entire
history. The arguments to this command may be paths or Python regular expressions
matching paths (regexps must be marked by being surrounded with //).
All filemodify (M) operations and delete (D) operations involving a matched file in
the selected set of events are disconnected from the repo and put in a removal set.
Renames are followed as the tool walks forward in the selection set; each triggers a
warning message. If a selected file is a copy (C) target, the copy will be deleted and
a warning message issued. If a selected file is a copy source, the copy target will be
added to the list of paths to be deleted and a warning issued.
After file expunges have been performed, any commits with no remaining file operations
will be removed, along with any tags pointing to them. Commits with deleted fileops pointing
both in and outside the path set are not deleted, but are cloned into the removal set.
The removal set is not discarded. It is assembled into a new repository named after
the old one with the suffix "-expunges" added. Thus, this command can be used to carve
a repository into sections by file path matches.
tagify [--canonicalize] [--tipdeletes] [--tagify-merges]
Search for empty commits and turn them into tags. Takes an optional selection set
argument defaulting to all commits. For each commit in the selection set, turn it into
a tag with the same message and author information if it has no fileops. By default
merge commits are not considered, even if they have no fileops (thus no tree
differences with their first parent). To change that, use the --tagify-merges option.
The name of the generated tag will be 'emptycommit-ident', where ident is generated
from the legacy ID of the deleted commit, or from its mark, or from its index in the
repository, with a disambiguation suffix if needed.
With the --canonicalize option, tagify tries harder to detect trivial commits by first
ensuring that all fileops of selected commits will have an actual effect when
processed by fast-import.
With the --tipdeletes option, tagify also considers branch tips with only deleteall fileops
to be candidates for tagification. The corresponding tags get names of the form
'tipdelete-branchname' rather than the default 'emptycommit-ident'.
With the --tagify-merges option, tagify also tagifies merge commits that have no fileops.
When this is done the merge link is moved to the tagified commit's parent.
coalesce [--debug|--changelog] [timefuzz]
Scan the selection set for runs of commits with identical comments close to each other
in time (this is a common form of scar tissue in repository up-conversions from older
file-oriented version-control systems). Merge these cliques by deleting all but the
last commit, in order; fileops from the deleted commits are pushed forward to that
last one.
The optional second argument, if present, is a maximum time separation in seconds; the
default is 90 seconds.
The default selection set for this command is =C, all commits. Occasionally you may
want to restrict it, for example to avoid coalescing unrelated cliques of "*** empty
log message ***" commits from CVS lifts.
With the --debug option, show messages about mismatches.
With the --changelog option, any commit with a comment containing the string 'empty
log message' (such as is generated by CVS) and containing exactly one file operation
modifying a path ending in ChangeLog is treated specially. Such ChangeLog commits are
considered to match any commit before them by content, and will coalesce with it if
the committer matches and the commit separation is small enough. This option handles a
convention used by Free Software Foundation projects.
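The clique detection described above can be sketched in Python as follows. This is
only an approximation under stated assumptions: commits are modeled as (timestamp,
comment) tuples, separation is measured from the previous commit in the run, and a gap
exactly equal to the time fuzz still counts as close. None of these details are
specified above, and find_cliques is a hypothetical name.

```python
def find_cliques(commits, timefuzz=90):
    """Group runs of commits with identical comments that are close in time.

    commits is a list of (timestamp_seconds, comment) tuples in event
    order (an assumed representation). Only runs of two or more commits
    are returned, since a lone commit has nothing to coalesce with.
    """
    cliques = []
    run = []
    for commit in commits:
        same_comment = bool(run) and commit[1] == run[-1][1]
        if same_comment and abs(commit[0] - run[-1][0]) <= timefuzz:
            run.append(commit)
        else:
            if len(run) > 1:
                cliques.append(run)
            run = [commit]
    if len(run) > 1:
        cliques.append(run)
    return cliques


history = [(0, "fix"), (30, "fix"), (100, "fix"), (500, "fix"), (510, "other")]
cliques = find_cliques(history)
```

Each returned clique would then be merged by keeping its last member and pushing the
fileops of the others forward onto it.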
split {at|by} item
The first argument is required to be a commit location; the second is a preposition
which indicates which splitting method to use. If the preposition is 'at', then the
third argument must be an integer 1-origin index of a file operation within the
commit. If it is 'by', then the third argument must be a pathname to be
prefix-matched.
The commit is copied and inserted into a new position in the event sequence,
immediately following itself; the duplicate becomes the child of the original, and
replaces it as parent of the original's children. Commit metadata is duplicated; the
mark of the new commit is then changed, with 'bis' added as a suffix.
Finally, some file operations - starting at the one matched or indexed by the split
argument - are moved forward from the original commit into the new one. Legal indices
are 2-n, where n is the number of file operations in the original commit.
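A minimal Python sketch of the index-based split, assuming fileops are held in a plain
list (split_fileops is a hypothetical helper, not part of reposurgeon):

```python
def split_fileops(fileops, index):
    """Split a fileop list at a 1-origin index.

    Ops before the index stay with the original commit; the op at the
    index and everything after it move to the new commit. Index 1 is
    rejected because it would leave the original commit with no ops.
    """
    if not 2 <= index <= len(fileops):
        raise ValueError("legal indices are 2 through %d" % len(fileops))
    return fileops[: index - 1], fileops[index - 1 :]


original, clone = split_fileops(["M a", "M b", "D c"], 2)
```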
add {D path | M perm mark path | R source target | C source target}
From a specified commit, add a specified fileop.
For a D operation to be valid there must be an M operation for the path in the
commit's ancestry. For an M operation to be valid, the 'perm' part must be a token
ending with 755 or 644 and the 'mark' must refer to a blob that precedes the commit
location. For an R or C operation to be valid, there must be an M operation for the
source in the commit's ancestry.
remove [index | path | deletes] [to commit]
From a specified commit, remove a specified fileop. The op must be one of (a) the
keyword “deletes”, (b) a file path, (c) a file path preceded by an op type set (some
subset of the letters DMRCN), or (d) a 1-origin numeric index. The “deletes” keyword
selects all D fileops in the commit; the others select one each.
If the “to” clause is present, the removed op is appended to the commit specified by
the following singleton selection set. This option cannot be combined with “deletes”.
Note that this command does not attempt to scavenge blobs even if the deleted fileop
might be the only reference to them. This behavior may change in a future release.
blob
Create a blob at mark :1 after renumbering other marks starting from :2. Data is taken
from stdin, which may be a here-doc. This can be used with the add command to patch
synthetic data into a repository.
renumber
Renumber the marks in a repository, from :1 up to :<n> where <n> is the count of the
last mark. Just in case an importer ever cares about mark ordering or gaps in the
sequence.
mailbox_out [>outfile]
Emit a mailbox file of messages in RFC822 format representing the contents of
repository metadata. Takes a selection set; members of the set other than commits,
annotated tags, and passthroughs are ignored (that is, presently, blobs and resets).
The output from this command can optionally be redirected to a named output file.
Prefix the filename with “>” and give it as a following argument.
May have an option --filter, followed by = and a /-enclosed regular expression. If
this is given, only headers with names matching it are emitted. In this context the
name of the header includes its trailing colon.
mailbox_in [<infile] [--changed >outfile]
Accept a mailbox file of messages in RFC822 format representing the contents of the
metadata in selected commits and annotated tags. Takes no selection set. If there is
an argument it will be taken as the name of a mailbox file to read from; with no argument,
or an argument of '-', the command reads from standard input.
Users should be aware that modifying an Event-Number or Event-Mark field will change
which event the update from that message is applied to. This is unlikely to have good
results.
If the Event-Number and Event-Mark fields are absent, the mailbox_in logic will
attempt to match the commit or tag first by Legacy-ID, then by a unique committer ID
and timestamp pair.
If output is redirected and the modifier “--changed” appears, a minimal set of
modifications actually made is written to the output file in a form that can be fed
back in.
setfield attribute value
In the selected objects (defaulting to none) set every instance of a named field to a
string value. The string may be quoted to include whitespace, and use backslash
escapes interpreted by the Python string-escape codec, such as \n and \t.
Attempts to set nonexistent attributes are ignored. Valid values for the attribute are
internal Python field names; in particular, for commits, “comment” and “branch” are
legal. Consult the source code for other interesting values.
append [--rstrip] [>text]
Append text to the comments of commits and tags in the specified selection set. The
text is the first token of the command and may be a quoted string. C-style escape
sequences in the string are interpreted using Python's string_decode codec.
If the option --rstrip is given, the comment is right-stripped before the new text is
appended.
filter [--shell|--regex|--replace|--dedos]
Run blobs, commit comments, or tag comments in the selection set through the filter
specified on the command line.
In any mode other than --dedos, attempting to specify a selection set including both
blobs and non-blobs (that is, commits or tags) throws an error. Inline content in
commits is filtered when the selection set contains (only) blobs and the commit is
within the range bounded by the earliest and latest blob in the specification.
When filtering blobs, if the command line contains the magic cookie '%PATHS%' it is
replaced with a space-separated list of all paths that reference the blob.
With --shell, the remainder of the line specifies a filter as a shell command. Each
blob or comment is presented to the filter on standard input; the content is replaced
with whatever the filter emits to standard output. At present --shell is required.
Other filtering modes will be supported in the future.
With --regex, the remainder of the line is expected to be a Python regular expression
substitution written as /from/to/ with from and to being passed as arguments to the
standard re.sub() function and applied to modify the content. Actually, any
non-space character will work as a delimiter in place of the /; this makes it easier
to use / in patterns. Ordinarily only the first such substitution is performed;
putting 'g' after the slash replaces globally, and a numeric literal gives the maximum
number of substitutions to perform. Other flags available restrict substitution scope
- 'c' for comment text only, 'C' for committer name only, 'a' for author names only.
With --replace, the behavior is like --regex but the expressions are not interpreted
as regular expressions. (This is slightly faster).
With --dedos, DOS/Windows-style \r\n line terminators are replaced with \n.
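To make the --regex substitution syntax concrete, here is a hedged Python sketch that
parses a specification such as /from/to/g and applies it with re.sub(). It assumes the
delimiter never appears inside the pattern or replacement, and it ignores the scope
flags ('c', 'C', 'a'); apply_substitution is a hypothetical name and this is not the
tool's actual parser.

```python
import re


def apply_substitution(spec, text):
    """Apply a <delim>from<delim>to<delim>[flags] substitution to text."""
    if not spec or spec[0].isspace():
        raise ValueError("the delimiter must be a non-space character")
    parts = spec.split(spec[0])
    if len(parts) != 4:
        raise ValueError("expected <delim>from<delim>to<delim>[flags]")
    _, pattern, replacement, flags = parts
    if "g" in flags:
        count = 0  # re.sub treats count=0 as "replace every match"
    else:
        digits = "".join(ch for ch in flags if ch.isdigit())
        count = int(digits) if digits else 1  # default: first match only
    return re.sub(pattern, replacement, text, count=count)


first_only = apply_substitution("/foo/bar/", "foo foo foo")
everywhere = apply_substitution("/foo/bar/g", "foo foo foo")
at_most_two = apply_substitution("#a/b#x#2", "a/b a/b a/b")
```

The last call shows why an alternate delimiter is convenient: the pattern a/b contains
a slash, so '#' is used to delimit the specification instead.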
transcode codec
Transcode blobs, commit comments and committer/author names, or tag comments and tag
committer names in the selection set to UTF-8 from the character encoding specified on
the command line.
Attempting to specify a selection set including both blobs and non-blobs (that is,
commits or tags) throws an error. Inline content in commits is filtered when the
selection set contains (only) blobs and the commit is within the range bounded by the
earliest and latest blob in the specification.
The encoding argument must name one of the codecs known to the Python standard codecs
library. In particular, 'latin-1' is a valid codec name.
Errors in this command are fatal, because an error may leave repository objects in a
damaged state.
The theory behind the design of this command is that the repository might contain a
mixture of encodings used to enter commit metadata by different people at different
times. After using =I to identify metadata containing non-Unicode high bytes in text,
a human must use context to identify which particular encodings were used in
particular event spans and compose appropriate transcode commands to fix them up.
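The underlying operation is a decode from the named codec followed by a re-encode as
UTF-8. A minimal Python sketch, where transcode_to_utf8 is a hypothetical helper and
the sample bytes are made up for illustration:

```python
def transcode_to_utf8(data, codec):
    """Decode bytes from the named codec and re-encode them as UTF-8."""
    return data.decode(codec).encode("utf-8")


legacy = b"Ren\xe9"  # "René" as encoded in Latin-1
converted = transcode_to_utf8(legacy, "latin-1")

# The original bytes are not valid UTF-8, which is the kind of
# "non-Unicode high byte" a human has to track down before transcoding.
try:
    legacy.decode("utf-8")
    was_valid_utf8 = True
except UnicodeDecodeError:
    was_valid_utf8 = False
```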
edit
Report the selection set of events to a tempfile as mailbox_out does, call an editor
on it, and update from the result as mailbox_in does. If you do not specify an editor
name as second argument, it will be taken from the $EDITOR variable in your
environment.
Normally this command ignores blobs because mailbox_out does. However, if you specify
a selection set consisting of a single blob, your editor will be called directly on
the blob file.
timeoffset offset [timezone]
Apply a time offset to all time/date stamps in the selected set. An offset argument is
required; it may be in the form [+-]ss, [+-]mm:ss or [+-]hh:mm:ss. The leading sign is
required to distinguish it from a selection expression.
Optionally you may also specify another argument in the form [+-]hhmm, a timezone
literal to apply. To apply a timezone without an offset, use an offset literal of +0
or -0.
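The three accepted offset forms can be reduced to a signed number of seconds. The
Python sketch below shows one way to do that; parse_offset is a hypothetical helper
written for this example, not reposurgeon's own parser.

```python
def parse_offset(text):
    """Convert [+-]ss, [+-]mm:ss or [+-]hh:mm:ss into signed seconds."""
    if not text or text[0] not in "+-":
        raise ValueError("a leading sign is required")
    sign = -1 if text[0] == "-" else 1
    fields = text[1:].split(":")
    if not 1 <= len(fields) <= 3 or not all(f.isdigit() for f in fields):
        raise ValueError("malformed offset: %r" % text)
    seconds = 0
    for field in fields:
        # Each further field is a finer unit, so scale what we have by 60.
        seconds = seconds * 60 + int(field)
    return sign * seconds


ninety = parse_offset("+90")
back_ninety = parse_offset("-1:30")
one_hour = parse_offset("+1:00:00")
```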
unite [--prune] reponame...
Unite repositories. Name any number of loaded repositories; they will be united into
one union repo and removed from the load list. The union repo will be selected.
The root of each repo (other than the oldest repo) will be grafted as a child to the
last commit in the dump with a preceding commit date. Running last to first, duplicate
names will be disambiguated using the source repository name (thus, recent duplicates
will get priority over older ones). After all grafts, marks will be renumbered.
The name of the new repo will be the names of all parts concatenated, separated by
'+'. It will have no source directory or preferred system type.
With the option --prune, at each join D operations for every ancestral file existing
will be prepended to the root commit, then it will be canonicalized using the rules
for squashing; the effect will be that only files with properly matching M, R, and C
operations in the root survive.
graft [--prune] reponame
For when unite doesn't give you enough control. This command may have either of two
forms, selected by the size of the selection set. The first argument is always
required to be the name of a loaded repo.
If the selection set is of size 1, it must identify a single commit in the currently
chosen repo; in this case the named repo's root will become a child of the specified
commit. If the selection set is empty, the named repo must contain one or more
callouts matching commits in the currently chosen repo.
Labels and branches in the named repo are prefixed with its name; then it is grafted
to the selected one. Any other callouts in the named repo are also resolved in the
context of the currently chosen one. Finally, the named repo is removed from the load
list.
With the option --prune, prepend a deleteall operation into the root of the grafted
repository.
path [source] rename [--force] [target]
Rename a path in every fileop of every selected commit. The default selection set is
all commits. The first argument is interpreted as a Python regular expression to match
against paths; the second may contain back-reference syntax.
Ordinarily, if the target path already exists in the fileops, or is visible in the
ancestry of the commit, this command throws an error. With the --force option, these
checks are skipped.
paths [{sub|sup}] [dirname] [>outfile]
Takes a selection set. Without a modifier, list all paths touched by fileops in the
selection set (which defaults to the entire repo). This reporting variant supports
>-redirection.
With the 'sub' modifier, take a second argument that is a directory name and prepend
it to every path. With the 'sup' modifier, strip the first directory component from
every path.
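The effect of the two modifiers on a single path can be sketched in Python as below.
Both function names are hypothetical, and leaving a path with no directory component
unchanged under 'sup' is a guess, since that case is not described above.

```python
def sub_path(path, dirname):
    """'sub': prepend a directory name to the path."""
    return dirname + "/" + path


def sup_path(path):
    """'sup': strip the first directory component from the path."""
    _head, separator, tail = path.partition("/")
    # Assumption: a path with no directory component is left unchanged.
    return tail if separator else path


moved_down = sub_path("src/main.c", "proj")
moved_up = sup_path("proj/src/main.c")
```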
merge
Create a merge link. Takes a selection set argument, ignoring all but the lowest
(source) and highest (target) members. Creates a merge link from the highest member
(child) to the lowest (parent).
unmerge
Linearize a commit. Takes a selection set argument, which must resolve to a single
commit, and removes all its parents except for the first.
It is equivalent to reparent first_parent,commit rebase, where commit is the same
selection set as used with unmerge and first_parent is a set resolving commit's first
parent (see the reparent command below).
The main interest of the unmerge is that you don't have to find and specify the first
parent yourself, saving time and avoiding errors when nearby surgery would make a
manual first parent argument stale.
reparent [rebase]
Changes the parent list of a commit. Takes a selection set argument and an optional
policy argument. The selection set must resolve to exactly two commits, the latest of
which is the commit to modify, and the earliest is the new first parent. All other
parent links are cleared; if you want you can recreate them with the 'merge' command.
By default, the manifest of the reparented commit is computed before modifying it; a
deleteall and fileops are prepended so that the manifest stays unchanged even when the
first parent has been changed. Using the keyword 'rebase' as a third argument inhibits
this behavior - no deleteall is issued and the tree contents of all descendants can be
modified as a result.
branch branchname... {rename|delete} [arg]
Rename or delete a branch (and any associated resets). First argument must be an
existing branch name; second argument must be one of the verbs 'rename' or 'delete'.
For a 'rename', the third argument may be any token that is a syntactically valid
branch name (but not the name of an existing branch). For a 'delete', no third
argument is required.
For either name, if it does not contain a '/' the prefix 'refs/heads' is prepended.
tag tagname... {move|rename|delete} [arg]
Move, rename, or delete a tag. First argument must be an existing tag name; second
argument must be one of the verbs “move”, “rename”, or “delete”.
For a “move”, a third argument must be a singleton selection set. For a “rename”, the
third argument may be any token that is a syntactically valid tag name (but not the
name of an existing tag). For a “delete”, no third argument is required.
The behavior of this command is complex because features which present as tags may be
any of three things: (1) True tag objects, (2) lightweight tags, actually sequences of
commits with a common branchname beginning with “refs/tags” - in this case the tag is
considered to point to the last commit in the sequence, (3) Reset objects. These may
occur in combination; in fact, stream exporters from systems with annotation tags
commonly express each of these as a true tag object (1) pointing at the tip commit of
a sequence (2) in which the basename of the common branch field is identical to the
tag name. An exporter that generates lightweight-tagged commit sequences (2) may or
may not generate resets pointing at their tip commits.
This command tries to handle all combinations in a natural way by doing up to three
operations on any true tag, commit sequence, and reset matching the source name. In a
rename, all are renamed together. In a delete, any matching tag or reset is deleted;
then matching branch fields are changed to match the branch of the unique descendent
of the tagged commit, if there is one. When a tag is moved, no branch fields are
changed and a warning is issued.
Attempts to delete a lightweight tag may fail with the message “couldn't determine a
unique successor”. When this happens, the tag is on a commit with multiple children
that have different branch labels. There is a hole in the specification of git
fast-import streams that leaves it uncertain how branch labels can be safely
reassigned in this case; rather than do something risky, reposurgeon throws a
recoverable error.
reset resetname... {move|rename|delete} [arg]
Move, rename, or delete a reset. First argument must match an existing reset name;
second argument must be one of the verbs “move”, “rename”, or “delete”.
For a “move”, a third argument must be a singleton selection set. For a “rename”, the
third argument may be any token that matches a syntactically valid reset name
(but not the name of an existing reset). For a “delete”, no third argument is
required.
For either name, if it does not contain a “/” the prefix “heads/” is prepended. If it
does not begin with “refs/”, “refs/” is prepended.
An argument matches a reset's name if it is either the entire reference
(refs/heads/FOO or refs/tags/FOO for some value of FOO) or the basename (e.g.
FOO), or a suffix of the form heads/FOO or tags/FOO. An unqualified basename is
assumed to refer to a head.
When a reset is renamed, commit branch fields matching the tag are renamed with it to
match. When a reset is deleted, matching branch fields are changed to match the branch
of the unique descendent of the tip commit of the associated branch, if there is one.
When a reset is moved, no branch fields are changed.
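The two prefixing rules for reset names stated above translate directly into code. A
small Python sketch (normalize_reset_name is a hypothetical name):

```python
def normalize_reset_name(name):
    """Expand a reset name to a full reference using the two prefix rules."""
    if "/" not in name:
        name = "heads/" + name  # an unqualified basename refers to a head
    if not name.startswith("refs/"):
        name = "refs/" + name
    return name


from_basename = normalize_reset_name("FOO")
from_suffix = normalize_reset_name("tags/FOO")
already_full = normalize_reset_name("refs/heads/FOO")
```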
debranch source-branch... [target-branch]
Takes one or two arguments which must be the names of source and target branches; if
the second (target) argument is omitted it defaults to refs/heads/master. Any trailing
segment of a branch name is accepted as a synonym for it; thus master is the same as
refs/heads/master. Does not take a selection set.
The history of the source branch is merged into the history of the target branch,
becoming the history of a subdirectory with the name of the source branch. Any resets
of the source branch are removed.
strip [blobs|reduce]
Reduce the selected repository to make it a more tractable test case. Use this when
reporting bugs.
With the modifier 'blobs', replace each blob in the repository with a small,
self-identifying stub, leaving all metadata and DAG topology intact. This is useful
when you are reporting a bug, for reducing large repositories to test cases of
manageable size.
A selection set is effective only with the 'blobs' option, defaulting to all blobs.
The 'reduce' mode always acts on the entire repository.
With the modifier 'reduce', perform a topological reduction that throws out
uninteresting commits. If a commit has all file modifications (no deletions or copies
or renames) and has exactly one ancestor and one descendant, then it may be boring. To
be fully boring, it must also not be referred to by any tag or reset. Interesting
commits are not boring, or have a non-boring parent or non-boring child.
With no modifiers, this command strips blobs.
ignores [rename] [translate] [defaults]
Intelligent handling of ignore-pattern files. This command fails if no repository has
been selected or no preferred write type has been set for the repository. It does not
take a selection set.
If the rename modifier is present, this command attempts to rename all ignore-pattern
files to whatever is appropriate for the preferred type - e.g. .gitignore for git,
.hgignore for hg, etc. This option does not cause any translation of the ignore files
it renames.
If the translate modifier is present, syntax translation of each ignore file is
attempted. At present, the only transformation the code knows is to prepend a 'syntax:
glob' header if the preferred type is hg.
If the defaults modifier is present, the command attempts to prepend these default
patterns to all ignore files. If no ignore file is created by the first commit, it
will be modified to create one containing the defaults. This command will error out on
prefer types that have no default ignore patterns (git and hg, in particular). It will
also error out when it knows the import tool has already set default patterns.
REFERENCE LIFTING
This group of commands is meant for fixing up references in commits that are in the format
of older version control systems. The general workflow is this: first, go over the comment
history and change all old-fashioned commit references into machine-parseable cookies.
Then, automatically turn the machine-parseable cookie into action stamps. The point of
dividing the process this way is that the first part is hard for a machine to get right,
while the second part is prone to errors when a human does it.
A Subversion cookie is a comment substring of the form [[SVN:ddddd]] (example:
[[SVN:2355]]) with the revision read directly via the Subversion exporter, deduced from
git-svn metadata, or matching a $Revision$ header embedded in blob data for the filename.
A CVS cookie is a comment substring of the form [[CVS:filename:revision]] (example:
[[CVS:src/README:1.23]]) with the revision matching a CVS $Id$ or $Revision$ header
embedded in blob data for the filename.
A mark cookie is of the form [[:dddd]] and is simply a reference to the specified mark.
You may want to hand-patch this in when one of the previous forms is inconvenient.
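For readers who want to check comments for well-formed cookies before lifting, the
three cookie shapes described above can be recognized with regular expressions. The
patterns below are an approximation written for this example (for instance, they assume
CVS filenames contain no colon); they are not taken from reposurgeon itself, and
find_cookies is a hypothetical helper.

```python
import re

# Assumed patterns for the three cookie forms described in the text.
SVN_COOKIE = re.compile(r"\[\[SVN:(\d+)\]\]")
CVS_COOKIE = re.compile(r"\[\[CVS:([^:\]]+):([0-9.]+)\]\]")
MARK_COOKIE = re.compile(r"\[\[(:\d+)\]\]")


def find_cookies(comment):
    """Return every cookie in a comment, grouped by kind (SVN, CVS, MARK)."""
    found = []
    found += [("SVN", m.group(1)) for m in SVN_COOKIE.finditer(comment)]
    found += [("CVS", m.group(1), m.group(2)) for m in CVS_COOKIE.finditer(comment)]
    found += [("MARK", m.group(1)) for m in MARK_COOKIE.finditer(comment)]
    return found


cookies = find_cookies(
    "Backs out [[SVN:2355]]; see [[CVS:src/README:1.23]] and [[:42]]."
)
```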
An action stamp is an RFC3339 timestamp, followed by a '!', followed by an author email
address (author rather than committer because that timestamp is not changed when a patch
is replayed on to a branch). It attempts to refer to a commit without being VCS-specific.
Thus, instead of "commit 304a53c2" or "r2355", "2011-10-25T15:11:09Z![email protected]".
The following git aliases allow git to work directly with action stamps. Append them to your
~/.gitconfig; if you already have an [alias] section, leave off the first line.
[alias]
# git stamp <commit-ish> - print a reposurgeon-style action stamp
stamp = show -s --format='%cI!%ce'
# git scommit <stamp> <rev-list-args> - list most recent commit that matches <stamp>.
# Must also specify a branch to search or --all, after these arguments.
scommit = "!f(){ d=${1%%!*}; a=${1##*!}; arg=\"--until=$d -1\"; if [ $a != $1 ]; then arg=\"$arg --committer=$a\"; fi; shift; git rev-list $arg ${1:+\"$@\"}; }; f"
# git scommits <stamp> <rev-list-args> - as above, but list all matching commits.
scommits = "!f(){ d=${1%%!*}; a=${1##*!}; arg=\"--until=$d --after $d\"; if [ $a != $1 ]; then arg=\"$arg --committer=$a\"; fi; shift; git rev-list $arg ${1:+\"$@\"}; }; f"
# git smaster <stamp> - list most recent commit on master that matches <stamp>.
smaster = "!f(){ git scommit \"$1\" master --first-parent; }; f"
smasters = "!f(){ git scommits \"$1\" master --first-parent; }; f"
# git shs <stamp> - show the commits on master that match <stamp>.
shs = "!f(){ stamp=$(git smasters $1); shift; git show ${stamp:?not found} $*; }; f"
# git slog <stamp> <log-args> - start git log at <stamp> on master
slog = "!f(){ stamp=$(git smaster $1); shift; git log ${stamp:?not found} $*; }; f"
# git sco <stamp> - check out most recent commit on master that matches <stamp>.
sco = "!f(){ stamp=$(git smaster $1); shift; git checkout ${stamp:?not found} $*; }; f"
There is a rare case in which an action stamp will not refer uniquely to one commit. It is
theoretically possible that the same author might check in revisions on different branches
within the one-second resolution of the timestamps in a fast-import stream. There is
nothing to be done about this; tools using action stamps need to be aware of the
possibility and throw a warning when it occurs.
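The action stamp format described in this section is simple enough to build and take
apart by hand. The Python sketch below assumes stamps are always expressed in UTC with
a trailing 'Z' and one-second resolution, as in the example given earlier; the helper
names and the address user@example.com are placeholders invented for illustration.

```python
from datetime import datetime, timezone

STAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def make_action_stamp(when, email):
    """Join a UTC RFC3339 timestamp and an email address with '!'."""
    return when.astimezone(timezone.utc).strftime(STAMP_FORMAT) + "!" + email


def parse_action_stamp(stamp):
    """Split an action stamp back into (aware datetime, email)."""
    timestamp, _, email = stamp.partition("!")
    when = datetime.strptime(timestamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
    return when, email


commit_time = datetime(2011, 10, 25, 15, 11, 9, tzinfo=timezone.utc)
stamp = make_action_stamp(commit_time, "user@example.com")
parsed_time, parsed_email = parse_action_stamp(stamp)
```

A tool built on such helpers would still need to handle the ambiguity noted above,
where two commits by the same author share a timestamp.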
In order to support reference lifting, reposurgeon internally builds a legacy-reference
map that associates revision identifiers in older version-control systems with commits.
The contents of this map come from three places: (1) cvs2svn:rev properties if the
repository was read from a Subversion dump stream, (2) $Id$ and $Revision$ headers in
repository files, and (3) the .git/cvs-revisions created by git cvsimport.
The detailed sequence for lifting possible references is this: first, find possible CVS
and Subversion references with the references or =N visibility set; then replace them with
equivalent cookies; then run references lift to turn the cookies into action stamps (using
the information in the legacy-reference map) without having to do the lookup by hand.
references [list|edit|lift] [>outfile]
With the modifier 'list', list commit and tag comments for strings that might be CVS-
or Subversion-style revision identifiers. This will be useful when you want to replace
them with equivalent cookies that can automatically be translated into VCS-independent
action stamps. This reporting command supports >-redirection. It is equivalent to '=N
list'.
With the modifier 'edit', edit the set where revision IDs are found. This is
equivalent to '=N edit'.
With the modifier "lift", attempt to resolve Subversion and CVS cookies in comments
into action stamps using the legacy map. An action stamp is a
timestamp/email/sequence-number combination uniquely identifying the commit associated
with that blob, as described in the section called “TRANSLATION STYLE”.
It is not guaranteed that every such reference will be resolved, or even that any at
all will be. Normally all references in history from a Subversion repository will
resolve, but CVS references are less likely to be resolvable.
MACROS AND EXTENSIONS
Occasionally you will need to issue a large number of complex surgical commands of very
similar form, and it's convenient to be able to package that form so you don't need to do
a lot of error-prone typing. For those occasions, reposurgeon supports a simple form of
macro expansion.
define name body
Define a macro. The first whitespace-separated token is the name; the remainder of the
line is the body, unless it is “{”, which begins a multi-line macro terminated by a
line beginning with “}”.
A later “do” call can invoke this macro.
The command “define” by itself without a name or body produces a macro list.
do name arguments...
Expand and perform a macro. The first whitespace-separated token is the name of the
macro to be called; remaining tokens replace {0}, {1}... in the macro definition (the
conventions used are those of the Python format method). Tokens may contain whitespace
if they are string-quoted; string quotes are stripped. Macros can call macros.
If the macro expansion does not itself begin with a selection set, whatever set was
specified before the "do" keyword is available to the command generated by the
expansion.
undefine name
Undefine the named macro.
Here's an example to illustrate how you might use this. In CVS repositories of projects
that use the GNU ChangeLog convention, a very common pre-conversion artifact is a commit
with the comment "***empty log message***" that modifies only a ChangeLog entry explaining
the commit immediately previous to it. The following
define changelog <{0}> & /empty log message/ squash --pushback
do changelog 2012-08-14T21:51:35Z
do changelog 2012-08-08T22:52:14Z
do changelog 2012-08-07T04:48:26Z
do changelog 2012-08-08T07:19:09Z
do changelog 2012-07-28T18:40:10Z
is equivalent to the more verbose
<2012-08-14T21:51:35Z> & /empty log message/ squash --pushback
<2012-08-08T22:52:14Z> & /empty log message/ squash --pushback
<2012-08-07T04:48:26Z> & /empty log message/ squash --pushback
<2012-08-08T07:19:09Z> & /empty log message/ squash --pushback
<2012-07-28T18:40:10Z> & /empty log message/ squash --pushback
but you are less likely to make difficult-to-notice errors typing the first version.
(Also note how the text regexp acts as a failsafe against the possibility of typing a
wrong date that doesn't refer to a commit with an empty comment. This was a real-world
example from the CVS-to-git conversion of groff.)
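Since macro arguments follow the conventions of the Python format method, the
expansion step can be approximated in a few lines of Python. This is a sketch of the
idea only: expand_do and the definitions dictionary are hypothetical, and tokenizing
with shlex is an assumption about how string-quoted tokens are split.

```python
import shlex


def expand_do(definitions, line):
    """Expand the text following 'do': a macro name plus its arguments."""
    name, *arguments = shlex.split(line)  # quotes are stripped from tokens
    return definitions[name].format(*arguments)


definitions = {"changelog": "<{0}> & /empty log message/ squash --pushback"}
expanded = expand_do(definitions, "changelog 2012-08-14T21:51:35Z")
```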
When even a macro is not enough, you can write and call custom Python extensions.
exec name
Execute custom code from standard input (normally a file via < redirection). Use this
to set up custom extension functions for later eval calls. The code has full access to
all internal data structures. Functions defined are accessible to later eval calls.
This can be called in a script with the extension code in a here-doc.
eval function-name
Evaluate a line of code in the current interpreter context. Typically this will be a
call to a function defined by a previous exec. The variables _repository and
_selection will have the obvious values. Note that _selection will be a list of
integers, not objects.
ARTIFACT REMOVAL
Some commands automate fixing various kinds of artifacts associated with repository
conversions from older systems.
authors [read|write] [<filename] [>filename]
Apply or dump author-map information for the specified selection set, defaulting to
all events.
Lifts from CVS and Subversion may have only usernames local to the repository host in
committer and author IDs. DVCSes want email addresses (net-wide identifiers) and
complete names. To supply the map from one to the other, an authors file is expected
to consist of lines each beginning with a local user ID, followed by a '=' (possibly
surrounded by whitespace) followed by a full name and email address, optionally
followed by a timezone offset field. Thus:
ferd = Ferd J. Foonly <[email protected]> -0500
An authors file may have comment lines beginning with '#'; these are ignored.
When an authors file is applied, email addresses in committer and author metadata for
which the local ID matches between < and @ are replaced according to the mapping (this
handles git-svn lifts). Alternatively, if the local ID is the entire address, this is
also considered a match (this handles what git-cvsimport and cvs2git do).
With the 'read' modifier, or no modifier, apply author mapping data (from standard
input or a <-redirected file). May be useful if you are editing a repo or dump created
by cvs2git or by git-svn invoked without -A.
With the 'write' modifier, write a mapping file that could be interpreted by authors
read, with entries for each unique committer, author, and tagger (to standard output
or a >-redirected mapping file). This may be helpful as a start on building an authors
file, though each part to the right of an equals sign will need editing.
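The authors-file format described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the function name parse_authors and the regular expression are illustrative rather than reposurgeon's own parser, and the address in the sample is a placeholder.

```python
import re

# local-id = Full Name <email> [timezone offset]
AUTHOR_LINE = re.compile(
    r"^\s*([^=\s]+)\s*=\s*(.+?)\s*<([^>]+)>\s*([-+]\d{4})?\s*$"
)


def parse_authors(text):
    """Return {local_id: (full_name, email, tz_offset_or_None)}."""
    mapping = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comment lines are skipped
        match = AUTHOR_LINE.match(line)
        if match is None:
            raise ValueError("malformed authors line: %r" % line)
        local_id, name, email, tz = match.groups()
        mapping[local_id] = (name, email, tz)
    return mapping


sample = "# placeholder address\nferd = Ferd J. Foonly <ferd@example.com> -0500\n"
print(parse_authors(sample))
```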
branchify [path-set]
Specify the list of directories to be treated as potential branches (to become tags if
there are no modifications after the creation copies) when analyzing a Subversion
repo. This list is ignored when the --nobranch read option is used. It defaults to the
'standard layout' set of directories, plus any unrecognized directories in the
repository root.
With no arguments, displays the current branchification set.
An asterisk at the end of a path in the set means 'all immediate subdirectories of
this path, unless they are part of another (longer) path in the branchify set'.
Note that the branchify set is a property of the reposurgeon interpreter, not of any
individual repository, and will persist across Subversion dumpfile reads. This may
lead to unexpected results if you forget to re-set it.
branchify_map [/regex/branch/...]
Specify the list of regular expressions used for mapping the svn branches that are
detected by branchify. If none of the expressions match, the default behaviour applies,
which maps a branch to the name of the last directory, except for trunk and “*” which
are mapped to master and root.
With no arguments the current regex replacement pairs are shown. Passing 'reset' will
clear the mapping.
Each branch name is matched against regex1 and, if it matches, its branch name is
rewritten to branch1. If not, regex2 is tried, and so forth until either a matching
regex is found or there are no regexes left. The regular expressions should be in
Python's[2] format. The branch name can use backreferences (see the sub function in the
Python documentation).
Note that the regular expressions are appended to 'refs/' without either the needed
'heads/' or 'tags/'. This allows for choosing the right kind of branch type.
While the syntax template above uses slashes, any first character will be used as a
delimiter (and you will need to use a different one in the common case that the paths
contain slashes).
Note that the branchify_map set is a property of the reposurgeon interpreter, not of
any individual repository, and will persist across Subversion dumpfile reads. This may
lead to unexpected results if you forget to re-set it.
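The first-match-wins behaviour of branchify_map can be sketched as follows. The helper names parse_pair and map_branch are hypothetical, and the use of re.subn (rather than some other matching call) is an assumption suggested by the reference to Python's sub function; returning None stands for "the default behaviour applies".

```python
import re


def parse_pair(spec):
    """Split a spec such as ':regex:branch:' on its first character."""
    delimiter = spec[0]
    parts = spec.split(delimiter)
    if len(parts) != 4 or parts[0] or parts[3]:
        raise ValueError("expected <d>regex<d>branch<d>, got %r" % spec)
    return parts[1], parts[2]


def map_branch(path, pairs):
    """Try each (regex, replacement) in order; the first match wins."""
    for regex, replacement in pairs:
        rewritten, count = re.subn(regex, replacement, path)
        if count:
            # The rewritten name is appended to 'refs/'.
            return "refs/" + rewritten
    return None  # no regex matched: fall back to the default mapping


pairs = [
    parse_pair(r":^tags/(.*)$:tags/\1:"),
    parse_pair(r":^branches/(.*)$:heads/\1:"),
]
print(map_branch("branches/v1", pairs))
```

A colon is used as the delimiter here because the paths contain slashes.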
EXAMINING TREE STATES
manifest [regular expression] [>outfile]
Takes an optional selection set argument defaulting to all commits, and an optional
Python regular expression. For each commit in the selection set, print the mapping of
all paths in that commit tree to the corresponding blob marks, mirroring what files
would be created in a checkout of the commit. If a regular expression is given, only
print "path -> mark" lines for paths matching it. This command supports > redirection.
checkout directory
Takes a selection set which must resolve to a single commit, and a second argument.
The second argument is interpreted as a directory name. The state of the code tree at
that commit is materialized beneath the directory.
diff [>outfile]
Display the difference between commits. Takes a selection-set argument which must
resolve to exactly two commits. Supports output redirection.
HOUSEKEEPING
These are backed up by the following housekeeping commands, none of which take a selection
set:
help
Get help on the interpreter commands. Optionally follow with whitespace and a command
name; with no argument, lists all commands. '?' also invokes this.
shell
Execute the shell command given in the remainder of the line. '!' also invokes this.
prefer [repotype]
With no arguments, describe capabilities of all supported systems. With an argument
(which must be the name of a supported system) this has two effects:
First, if there are multiple repositories in a directory you do a read on, reposurgeon
will read the preferred one (otherwise it will complain that it can't choose among
them).
Secondly, this will change reposurgeon's preferred type for output. This means that
when you do a write to a directory, it will build a repo of the preferred type rather
than its original type (if it had one).
If no preferred type has been explicitly selected, reading in a repository (but not a
fast-import stream) will implicitly set the preferred type to the type of that
repository.
In older versions of reposurgeon this command changed the type of the selected
repository, if there is one. That behavior interacted badly with attempts to interpret
legacy IDs and has been removed.
sourcetype [repotype]
Report (with no arguments) or select (with one argument) the current repository's
source type. This type is normally set at repository-read time, but may remain unset
if the source was a stream file.
The source type affects the interpretation of legacy IDs (for purposes of the =N
visibility set and the 'references' command) by controlling the regular expressions
used to recognize them. If no preferred output type has been set, it may also change
the output format of stream files made from the repository.
The source type is reliably set whenever a live repository is read, or when a
Subversion stream or Fossil dump is interpreted, but not necessarily by other stream
files. Streams generated by cvs-fast-export(1) using the --reposurgeon option are
detected as CVS. In some other cases, the source system is detected from the presence
of magic $-headers in contents blobs.
INSTRUMENTATION
A few commands have been implemented primarily for debugging and regression-testing
purposes, but may be useful in unusual circumstances.
The output of most of these commands can individually be redirected to a named output
file. Where indicated in the syntax, you can prefix the output filename with “>” and give
it as a following argument.
index [>outfile]
Display four columns of info on objects in the selection set: their number, their
type, the associated mark (or '-' if no mark) and a summary field varying by type. For
a branch or tag it's the reference; for a commit it's the commit branch; for a blob
it's the repository path of the file in the blob.
The default selection set for this command is =CTRU, all objects except blobs.
resolve [label-text...]
Does nothing but resolve a selection-set expression and echo the resulting
event-number set to standard output. The remainder of the line after the command is
used as a label for the output.
Implemented mainly for regression testing, but may be useful for exploring the
selection-set language.
assign [name]
Compute a leading selection set and assign it to a symbolic name. It is an error to
assign to a name that is already assigned, or to any existing branch name. Assignments
may be cleared by sequence mutations (though not ordinary deletions); you will see a
warning when this occurs.
With no selection set and no name, list all assignments.
Use this to optimize out location and selection computations that would otherwise be
performed repeatedly, e.g. in macro calls.
unassign [name]
Unassign a symbolic name. Throws an error if the name is not assigned.
names [>outfile]
List the names of all known branches and tags. Tells you what things are legal within
angle brackets and parentheses.
verbose [n]
'verbose 1' enables the progress meter and messages, 'verbose 0' disables them. Higher
levels of verbosity are available but intended for developers only.
quiet [on | off]
Without an argument, this command requests a report of the quiet boolean; with the
argument 'on' or 'off' it is changed. When quiet is on, time-varying report fields
which would otherwise cause spurious failures in regression testing are suppressed.
print output-text...
Does nothing but ship its argument line to standard output. Useful in regression
tests.
echo [number]
'echo 1' causes each reposurgeon command to be echoed to standard output just before
its output. This can be useful in constructing regression tests that are easily
checked by eyeball.
script filename [arg...]
Takes a filename and optional following arguments. Reads each line from the file and
executes it as a command.
During execution of the script, the script name replaces the string $0 and the
optional following arguments (if any) replace the strings $1, $2 ... $n in the script
text. This is done before tokenization, so the $1 in a string like “foo$1bar” will be
expanded. Additionally, $$ is expanded to the current process ID (which may be useful
for scripts that use tempfiles).
Within scripts (and only within scripts) reposurgeon accepts a slightly extended
syntax: First, a backslash ending a line signals that the command continues on the
next line. Any number of consecutive lines thus escaped are concatenated, without the
ending backslashes, prior to evaluation. Second, a command that takes an input
filename argument can instead take literal following data in the syntax of a shell
here-document. That is: if the filename is replaced by "<<EOF", all following lines in
the script up to a terminating line consisting only of "EOF" will be read, placed in a
temporary file, and that file fed to the command and afterwards deleted. EOF may be
replaced by any string. Backslashes have no special meaning while reading a
here-document.
Scripts may have comments. Any line beginning with a '#' is ignored. If a line has a
trailing portion that begins with one or more whitespace characters followed by '#',
that trailing portion is ignored.
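The argument substitution performed on script text can be sketched in Python. This is an illustration only: the function name substitute is hypothetical, the pid parameter exists purely to make the result reproducible, and leaving an unsupplied $n untouched is an assumption, since the text above does not say what happens in that case.

```python
import os
import re


def substitute(line, script_name, args, pid=None):
    """Replace $0, $1 ... $n and $$ in a script line before tokenization."""
    values = [script_name] + list(args)

    def replace(match):
        key = match.group(1)
        if key == "$":
            return str(os.getpid() if pid is None else pid)
        index = int(key)
        # ASSUMPTION: references beyond the supplied arguments are left alone.
        return values[index] if index < len(values) else match.group(0)

    return re.sub(r"\$(\$|\d+)", replace, line)


print(substitute("write >foo$1bar.$$", "lift.rs", ["X"], pid=42))
```

Because the replacement is purely textual, the $1 embedded in foo$1bar is expanded just as the text describes.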
version [version...]
With no argument, display the program version and the list of VCSes directly
supported. With argument, declare the major version (single digit) or full version
(major.minor) under which the enclosing script was developed. The program will error
out if the major version has changed (which means the surgical language is not
backwards compatible).
It is good practice to start your lift script with a version requirement, especially
if you are going to archive it for later reference.
prompt [format...]
Set the command prompt format to the value of the command line; with an empty command
line, display it. The prompt format is evaluated in Python after each command with the
following dictionary substitutions:
chosen
The name of the selected repository, or None if none is currently selected.
Thus, one useful format might be 'rs[%(chosen)s]%% '.
More format items may be added in the future. The default prompt corresponds to the
format 'reposurgeon%% '. The format line is evaluated with shell quoting of tokens, so
that spaces can be included.
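The prompt format uses Python's dictionary-based % substitution, in which %% stands for a literal percent sign. A minimal sketch (the function name render_prompt is hypothetical):

```python
def render_prompt(fmt, chosen):
    """Evaluate a prompt format against the documented 'chosen' key."""
    return fmt % {"chosen": chosen}


print(render_prompt("rs[%(chosen)s]%% ", "groff"))  # prints "rs[groff]% "
print(render_prompt("reposurgeon%% ", None))        # prints "reposurgeon% "
```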
history
List the commands you have entered this session.
legacy [read|write] [<filename] [>filename]
Apply or list legacy-reference information. Does not take a selection set. The 'read'
variant reads from standard input or a <-redirected filename; the 'write' variant
writes to standard output or a >-redirected filename.
A legacy-reference file maps reference cookies to (committer, commit-date,
sequence-number) triples; these in turn (should) uniquely identify a commit. The format
is two whitespace-separated fields: the cookie followed by an action stamp identifying
the commit.
It should not normally be necessary to use this command. The legacy map is
automatically preserved through repository reads and rebuilds, being stored in the
file legacy-map under the repository subdirectory.
set [option]
Turn on an option flag. With no arguments, list all options.
Most options are described in conjunction with the specific operations that they
modify. One of general interest is “compressblobs”; this enables compression on the
blob files in the internal representation reposurgeon uses for editing repositories.
With this option, reading and writing of repositories is slower, but editing a
repository requires less (sometimes much less) disk space.
clear [option]
Turn off an option flag. With no arguments, list all options.
profile
Enable profiling. Profile statistics are dumped to the path given as argument. Must be
one of the initial command-line arguments, and gathers statistics only on code
executed via '-'.
timing
Display statistics on phase timing in repository analysis. Mainly of interest to
developers trying to speed up the program.
exit
Exit, reporting the time. Included here because, while EOT will also cleanly exit the
interpreter, this command reports elapsed time since start.
WORKING WITH SUBVERSION
reposurgeon can read Subversion dumpfiles or edit a Subversion repository (and you must
point it at a repository, not a checkout directory). The reposurgeon distribution includes
a script named “repotool” that you can use to make and then incrementally update a local
mirror of a remote repository for editing or conversion purposes.
READING SUBVERSION REPOSITORIES
Certain optional modifiers on the read command change its behavior when reading Subversion
repositories:
--nobranch
Suppress branch analysis.
--ignore-properties
Suppress read-time warnings about discarded property settings.
--user-ignores
Don't generate .gitignore files from svn:ignore properties. Instead, just pass through
.gitignore files found in the history.
--use-uuid
If the --use-uuid read option is set, the repository's UUID will be used as the
hostname when faking up email addresses, a la git-svn. Otherwise, addresses will be
generated the way git-cvsimport does it, simply copying the username into the
address field.
These modifiers can go anywhere in any order on the read command line after the read verb.
They must be whitespace-separated.
Here are the rules used for mapping subdirectories in a Subversion repository to branches:
1. At any given time there is a set of eligible paths and path wildcards which declare
potential branches. See the documentation of the branchify command for how to alter this set,
which initially consists of {trunk, tags/*, branches/*, and '*'}.
2. A repository is considered "flat" if it has no directory that matches a path or path
wildcard in the branchify set. All commits in a flat repository are assigned to branch
master, and what would have been branch structure becomes directory structure. In this
case, we're done; all the other rules apply to non-flat repos.
If you give the option --nobranch when reading a Subversion repository, branch
analysis is skipped and the repository is treated as though flat (left as a linear
sequence of commits on refs/heads/master). This may be useful if your repository
configuration is highly unusual and you need to do your own branch surgery. Note that
this option will disable partitioning of mixed commits.
3. If "trunk" is eligible, it always becomes the master branch.
4. If an element of the branchify set ends with *, each immediate subdirectory of it is
considered a potential branch. If '*' is in the branchify set (which is true by
default) all top-level directories other than /trunk, /tags, and /branches are also
considered potential branches.
5. Each potential branch is checked to see if it has commits on it after the initial
creation or copy. If there are such commits, it becomes a branch. If not, it becomes a
tag in order to preserve the commit metadata. (In all cases, the name of the tag or
branch is the basename of the directory.)
6. Files in the top-level directory are assigned to a synthetic branch named 'root'.
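The path-matching part of these rules can be sketched in Python. This is a simplified reading of rules 1, 3, 4 and 6 only, not reposurgeon's analysis code: the function names are hypothetical, "longest matching entry wins" is an assumption, and rule 5 (branch versus tag) is left out because it depends on commit history.

```python
DEFAULT_BRANCHIFY = ("trunk", "tags/*", "branches/*", "*")


def branch_dir(path, branchify=DEFAULT_BRANCHIFY):
    """Return the potential-branch directory containing a file path."""
    parts = path.split("/")
    if len(parts) == 1:
        return "root"  # rule 6: files in the top-level directory
    best = None
    for entry in branchify:
        eparts = entry.split("/")
        if len(parts) <= len(eparts):
            continue  # the file must lie strictly below the branch directory
        if not all(e in ("*", p) for e, p in zip(eparts, parts)):
            continue
        candidate = parts[:len(eparts)]
        prefix = "/".join(candidate) + "/"
        # A directory named by a longer entry (e.g. tags/*) is not a branch itself.
        shadowed = any(o != entry and o.startswith(prefix) for o in branchify)
        if not shadowed and (best is None or len(candidate) > len(best)):
            best = candidate
    return "/".join(best) if best else None


def branch_name(directory):
    """Rule 3: trunk becomes master; otherwise use the basename."""
    return "master" if directory == "trunk" else directory.split("/")[-1]


for p in ("trunk/src/a.c", "tags/1.0/a.c", "contrib/x.c", "README"):
    print(p, "->", branch_name(branch_dir(p)))
```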
Each commit that only creates or deletes directories (in particular, copy commits for tags
and branches, and commits that only change properties) will be transformed into a tag
named after the branch, containing the date/author/comment metadata from the commit. While
this produces a desirable result for tags, non-tag branches (including trunk) will also
get root tags this way. This apparent misfeature has been accepted so that reposurgeon
will never destroy human-generated metadata that might have value; it is left up to the
user to manually remove unwanted tags.
Subversion branch deletions are turned into deletealls, clearing the fileset of the
import-stream branch. When a branch finishes with a deleteall at its tip, the deleteall is
transformed into a tag. This rule cleans up after aborted branch renames.
Occasionally (and usually by mistake) a branchy Subversion repository will contain
revisions that touch multiple branches. These are handled by partitioning them into
multiple import-stream commits, one on each affected branch. The Legacy-ID of such a split
commit will have a pseudo-decimal part - for example, if Subversion revision 2317 touches
three branches, the three generated commits will have IDs 2317.1, 2317.2, and 2317.3.
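The partitioning of a mixed-branch revision can be sketched as follows. The function name and the data layout (a list of (branch, path) pairs) are hypothetical, and numbering the parts in order of first appearance is an assumption.

```python
from collections import OrderedDict


def split_mixed_commit(revision, fileops):
    """Group file operations by branch and assign pseudo-decimal legacy IDs.

    fileops is a list of (branch, path) pairs.
    Returns a list of (legacy_id, branch, paths) tuples.
    """
    groups = OrderedDict()
    for branch, path in fileops:
        groups.setdefault(branch, []).append(path)
    if len(groups) == 1:
        # An ordinary commit keeps its plain revision number.
        return [(str(revision), b, paths) for b, paths in groups.items()]
    return [
        ("%d.%d" % (revision, n), b, paths)
        for n, (b, paths) in enumerate(groups.items(), 1)
    ]


ops = [("trunk", "a.c"), ("branches/x", "a.c"), ("trunk", "b.c"), ("tags/1.0", "c.c")]
for part in split_mixed_commit(2317, ops):
    print(part)
```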
The svn:executable and svn:special properties are translated into permission settings in
the input stream; svn:executable becomes 100755 and svn:special becomes 120000 (indicating
a symlink; the blob contents will be the path to which the symlink should resolve).
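As a small sketch of this translation, the function below picks a stream mode from a file's property names. The function name is hypothetical; 100644 is the usual import-stream mode for an ordinary file, and letting svn:special take precedence when both properties are set is an assumption.

```python
def file_mode(properties):
    """Map Subversion property names to an import-stream file mode."""
    if "svn:special" in properties:
        return "120000"  # symlink; blob content is the link target
    if "svn:executable" in properties:
        return "100755"
    return "100644"      # ordinary non-executable file


print(file_mode({"svn:executable": "*"}))
```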
Any cvs2svn:rev properties generated by cvs2svn are incorporated into the internal map
used for reference-lifting, then discarded.
Normally, per-directory svn:ignore properties become .gitignore files. Actual .gitignore
files in a Subversion directory are presumed to have been created by git-svn users
separately from native Subversion ignore properties and discarded with a warning. It is up
to the user to merge the content of such files into the target repository by hand. But
this behavior is inverted by the --user-ignores option; if that is on, .gitignore files
are passed through and Subversion svn:ignore properties are discarded.
(Regardless of the setting of the --user-ignores option, .cvsignore files found in
Subversion repositories always become .gitignores in the translation. The assumption is
that these date from before a CVS-to-SVN lift and should be preserved to affect behavior
when browsing that section of the repository.)
svn:mergeinfo properties are interpreted. Any svn:mergeinfo property on a revision A with
a merge source range ending in revision B produces a merge link such that B becomes a
parent of A.
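A rough sketch of extracting range-end revisions from an svn:mergeinfo value follows. It assumes the usual property layout of one "path:ranges" entry per line, with comma-separated revisions or ranges and an optional trailing "*"; the function name is hypothetical, and reposurgeon's own interpretation may well be more selective about which ranges yield merge links.

```python
def merge_sources(mergeinfo):
    """Return (source_path, last_revision) for each range in the property."""
    sources = []
    for line in mergeinfo.splitlines():
        if not line.strip():
            continue
        path, _, ranges = line.rpartition(":")
        for item in ranges.split(","):
            item = item.strip().rstrip("*")  # "*" marks a non-inheritable range
            last = item.split("-")[-1]       # "10-20" -> "20", "5" -> "5"
            sources.append((path, int(last)))
    return sources


print(merge_sources("/branches/foo:10-20\n/trunk:5,7-9*"))
```

Each revision returned would be a candidate B in the rule above.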
All other Subversion properties are discarded. (This may change in a future release.) The
property for which this is most likely to cause semantic problems is svn:eol-style.
However, since property-change-only commits get turned into annotated tags, the translated
tags will retain information about setting changes.
The sub-second resolution on Subversion commit dates is discarded; Git wants integer
timestamps only.
Because fast-import format cannot represent an empty directory, empty directories in
Subversion repositories will be lost in translation.
Normally, Subversion local usernames are mapped in the style of git-cvsimport; thus user
"foo" becomes "foo <foo>", which is sufficient to pacify git and other systems that
require email addresses. With the --use-uuid read option, usernames are mapped in the
git-svn style, with the repository's UUID used as a fake domain in the email address. Both
forms can be remapped to real addresses using the authors read command.
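The two mapping styles can be sketched in a few lines of Python. The function name fake_identity is hypothetical and the UUID in the example is a made-up placeholder.

```python
def fake_identity(username, uuid=None):
    """Build a 'name <address>' identity from a Subversion username."""
    if uuid is None:
        # Default style: the username is copied into the address field.
        return "%s <%s>" % (username, username)
    # git-svn style: the repository UUID serves as a fake domain.
    return "%s <%s@%s>" % (username, username, uuid)


print(fake_identity("foo"))
print(fake_identity("foo", "00000000-0000-0000-0000-000000000000"))
```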
Reading a Subversion stream enables writing of the legacy map as 'legacy' passthroughs
when the repo is written to a stream file.
reposurgeon tries hard to silently do the right thing, but there are Subversion edge cases
in which it emits warnings because a human may need to intervene and perform fixups by
hand. Here are the less obvious messages it may emit:
user-generated .gitignore
This message means reposurgeon has found a .gitignore file in the Subversion
repository it is analyzing. This probably happened because somebody was using git-svn
as a live gateway, and created ignores which may or may not be congruent with those in
the generated .gitignore files that the Subversion ignore properties will be
translated into. You'll need to make a policy decision about which set of ignores to
use in the conversion, and possibly set the --user-ignores option on read to pass
through user-created .gitignore files; in that case this warning will not be emitted.
can't connect nonempty branch XXXX to origin
This is a serious error. reposurgeon has been unable to find a link from a specified
branch to the trunk (master) branch. The commit graph will not be fully connected and
will need manual repair.
permission information may be lost
A Subversion node change on a file sets or clears properties, but no ancestor can be
found for this file. Executable or symlink permission may be set wrongly on later
revisions of this file. Subversion user-defined properties may also be scrambled or
lost. Usually this error can be ignored.
properties set
reposurgeon has detected a setting of a user-defined property, or of the Subversion
property svn:externals. These properties cannot be expressed in an import stream;
the user is notified in case this is a showstopper for the conversion or some
corrective action is required, but normally this error can be ignored. This warning is
suppressed by the --ignore-properties option.
branch links detected by file ops only
Branch links are normally deduced by examining Subversion directory copy operations. A
common user error (making a branch with a non-Subversion directory copy and then doing
an svn add on the contents) can defeat this. While reposurgeon should detect and cope
with most such copies correctly, you should examine the commit graph to check that the
branch is rooted at the correct place.
could not tagify root commit
The earliest commit in your Subversion repository has file operations, rather than
being a pure directory creation. This probably means your Subversion dump file is
malformed, or you may have attempted to lift from an incremental dump. Proceed with
caution.
deleting parentless tip delete
This message may be triggered by a Subversion branch move followed by a re-creation
under the source name. Check near the indicated revision to make sure the renamed
branch is connected to master.
mid-branch deleteall
A deleteall operation has been found in the middle of a branch history. This usually
indicates that a Subversion tag or branch was created by mistake, and someone later
tried to undo the error by deleting the tag/branch directory before recreating it with
a copy operation. Examine the topology near the deleteall closely; it may need
hand-hacking. It is fairly likely that both (a) the reposurgeon translation will be
different from what other translators (such as git-svn) produce, and (b) it will not
be immediately obvious which is right.
couldn't find a branch root for the copy
Branch analysis failed, probably due to a set of file copies that reposurgeon thought
it should interpret as a botched branch creation but couldn't deduce a history for.
Use the --nobranch option.
inconsistently empty from set
This message means reposurgeon has failed an internal sanity check; the
directory structure implied by its internally-built filemaps is not consistent with
what's in the parsed Subversion nodes. This should never happen; if you see it, report
a bug in reposurgeon.
WRITING SUBVERSION REPOSITORIES
reposurgeon has support for writing Subversion repositories. Due to mismatches between the
ontology of Subversion and that of git import streams, this support has some significant
limitations and bugs.
In summary, Subversion repository histories do not round-trip through reposurgeon editing.
File content changes are preserved but some metadata is unavoidably lost. Furthermore,
writing out a DVCS history in Subversion also loses significant portions of its metadata.
Details follow.
Writing a Subversion repository or dump stream discards author information, the
committer's name, and the hostname part of the commit address; only the commit timestamp
and the local part of the committer's email address are preserved, the latter becoming the
Subversion author field. However, reading a Subversion repository and writing it out again
will preserve the author fields.
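The reduction of a committer identity to a Subversion author can be sketched with the standard library. The function name svn_author is hypothetical and the address is a placeholder.

```python
from email.utils import parseaddr


def svn_author(committer):
    """Keep only the local part of the committer's email address."""
    _name, address = parseaddr(committer)
    return address.split("@")[0]


print(svn_author("Ferd J. Foonly <ferd@example.com>"))  # prints "ferd"
```

The full name and the hostname part are dropped, which is why a DVCS history written out this way loses that metadata.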
Import-stream timestamps have 1-second granularity. The sub-second parts of Subversion
commit timestamps will be lost on their way through reposurgeon.
Empty directories aren't represented in import streams. Consequently, reading and writing
Subversion repositories preserves file content, but not empty directories. It is also not
guaranteed that, after editing a Subversion repository, the sequence of directory
creations and deletions relative to other operations will be identical; the only guarantee
is that enclosing directories will be created before any files in them are.
When reading a Subversion repository, reposurgeon discards the special directory-copy
nodes associated with branch creations. These can't be recreated if and when the
repository is written back out to Subversion; rather, each branch copy node from the
original translates into a branch creation plus the first set of file modifications on the
branch.
When reading a Subversion repository, reposurgeon also automatically breaks apart
mixed-branch commits. These are not re-united if the repository is written back out.
When writing to a Subversion repository, all lightweight tags become Subversion tag copies
with empty log comments, named for the tag basename. The committer name and timestamp are
copied from the commit the tag points to. The distinction between heads and tags is lost.
Because of the preceding two points, it is not guaranteed that even revision numbers will
be stable when a Subversion repository is read in and then written out!
Subversion repositories are always written with a standard (trunk/tags/branches) layout.
Thus, a repository with a nonstandard shape that has been analyzed by reposurgeon won't be
written out with the same shape.
When writing a Subversion repository, branch merges are translated into svn:mergeinfo
properties in the simplest possible way - as an svn:mergeinfo property of the translated
merge commit listing the merge source revisions.
Subversion has a concept of "flows"; that is, named segments of history corresponding to
files or directories that are created when the path is added, cloned when the path is
copied, and deleted when the path is deleted. This information is not preserved in import
streams or the internal representation that reposurgeon uses. Thus, after editing, the
flow boundaries of a Subversion history may be arbitrarily changed.
IGNORE PATTERNS
reposurgeon recognizes how supported VCSes represent file ignores (CVS .cvsignore files
lurking untranslated in older Subversion repositories, Subversion ignore properties,
.gitignore/.hgignore/.bzrignore files in other systems) and moves ignore declarations among
these containers on repo input and output. This will be sufficient if the ignore patterns
are exact filenames.
Translation may not, however, be perfect when the ignore patterns are Unix glob patterns
or regular expressions. This compatibility table describes which patterns will translate;
“plain” indicates a plain filename with no glob or regexp syntax or negation.
RCS has no ignore files or patterns and is therefore not included in the table.
┌─────────────┬───────────────┬──────────────┬───────────────────┬───────────────────┬─────────────────────┬──────────────┬────────────┐
│ │ from CVS │ from svn │ from git │ from hg │ from bzr │ from darcs │ from SRC │
├─────────────┼───────────────┼──────────────┼───────────────────┼───────────────────┼─────────────────────┼──────────────┼────────────┤
│ to │ all │ all │ all │ all │ all │ plain │ all │
│ CVS │ │ │ except │ │ except │ │ │
├─────────────┼───────────────┼──────────────┼───────────────────┼───────────────────┼─────────────────────┼──────────────┼────────────┤
│ to │ !.PP │ all │ all except │ all │ all except │ plain │ all │
│ svn │ │ │ !-prefixed │ │ RE:- and │ │ │
├─────────────┼───────────────┼──────────────┼───────────────────┼───────────────────┼─────────────────────┼──────────────┼────────────┤
│ to │ all │ all │ all │ all │ all except │ plain │ all │
│ git │ │ │ │ except │ RE:-prefixed │ │ │
├─────────────┼───────────────┼──────────────┼───────────────────┼───────────────────┼─────────────────────┼──────────────┼────────────┤
│ to │ all │ all │ all except │ all │ all except │ plain │ all │
│ hg │ except │ │ !-prefixed │ │ RE:- and │ │ │
├─────────────┼───────────────┼──────────────┼───────────────────┼───────────────────┼─────────────────────┼──────────────┼────────────┤
│ to │ all │ all │ all │ all │ all │ plain │ all │
│ bzr │ │ │ │ │ │ │ │
├─────────────┼───────────────┼──────────────┼───────────────────┼───────────────────┼─────────────────────┼──────────────┼────────────┤
│ to │ plain │ plain │ plain │ plain │ plain │ all │ all │
│ darcs │ │ │ │ │ │ │ │
├─────────────┼───────────────┼──────────────┼───────────────────┼───────────────────┼─────────────────────┼──────────────┼────────────┤
│ to │ all │ all │ all except │ all │ all except │ plain │ all │
│ SRC │ except │ │ !-prefixed │ │ RE:- and │ │ │
└─────────────┴───────────────┴──────────────┴───────────────────┴───────────────────┴─────────────────────┴──────────────┴────────────┘
The hg rows and columns of the table describe compatibility to hg's glob syntax rather
than its default regular-expression syntax. When writing to an hg repository from any
other kind, reposurgeon prepends to the output .hgignore a "syntax: glob" line.
TRANSLATION STYLE
After converting a CVS or SVN repository, check for and remove $-cookies in the head
revision(s) of the files. The full Subversion set is $Date:, $Revision:, $Author:,
$HeadURL:, and $Id:. CVS uses $Author:, $Date:, $Header:, $Id:, $Log:, $Revision:, also
(rarely) $Locker:, $Name:, $RCSfile:, $Source:, and $State:.
When you need to specify a commit, use the action-stamp format that references lift
generates when it can resolve an SVN or CVS reference in a comment. It is best that you
not vary from this format, even in trivial ways like omitting the 'Z' or changing the 'T'
or '!' or ':'. Making action stamps uniform and machine-parseable will have good
consequences for future repository-browsing tools.
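A sketch of producing a uniform stamp from a commit time and committer address follows. The layout assumed here (an RFC 3339 UTC timestamp with 'T' and 'Z', then '!', then the email address) is inferred from the separators mentioned above and from the timestamps used in the macro example; the function name and the address are placeholders.

```python
from datetime import datetime, timezone


def action_stamp(when, email):
    """Format an action stamp; `when` must be a timezone-aware datetime."""
    utc = when.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%SZ") + "!" + email


when = datetime(2012, 8, 14, 21, 51, 35, tzinfo=timezone.utc)
print(action_stamp(when, "ferd@example.com"))
```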
Sometimes, in converting a repository, you may need to insert an explanatory comment - for
example, if metadata is garbled or missing and you need to point to that fact. It's
helpful for repository-browsing tools if there is a uniform syntax for this that is highly
unlikely to show up in repository comments. We recommend enclosing translation notes in [[
]]. This has the advantage of being visually similar to the [ ] traditionally used for
editorial comments in text.
It is good practice to include, in the comment for the root commit of the repository, a
note dating and attributing the conversion work and explaining these conventions. Example:
[[This repository was converted from Subversion to git on 2011-10-24 by Eric S. Raymond
<[email protected]>. Here and elsewhere, conversion notes are enclosed in double square
brackets. Junk commits generated by cvs2svn have been removed, commit references have been
mapped into a uniform VCS-independent syntax, and some comments edited into
summary-plus-continuation form.]]
It is also good practice to include a generated tag at the point of conversion. For example:
mailbox_in --create <<EOF
Tag-Name: git-conversion
Marks the spot at which this repository was converted from Subversion to git.
EOF
ADVANCED EXAMPLES
define lastchange {
@max(=B & [/ChangeLog/] & /{0}/B)? list
}
List the last commit that refers to a ChangeLog file containing a specified string. (The
trick here is that ? extends the singleton set consisting of the last eligible ChangeLog
blob to its set of referring commits, and list only notices the commits.)
STREAM SYNTAX EXTENSIONS
The event-stream parser in “reposurgeon” supports some extended syntax. Exporters designed
to work with “reposurgeon” may have a --reposurgeon option that enables emission of
extended syntax; notably, this is true of cvs-fast-export(1). The remainder of this
section describes these syntax extensions. The properties they set are (usually) preserved
and re-output when the stream file is written.
The token “#reposurgeon” at the start of a comment line in a fast-import stream signals
reposurgeon that the remainder is an extension command to be interpreted by “reposurgeon”.
One such extension command is implemented: #sourcetype, which behaves identically to the
reposurgeon sourcetype command. An exporter for a version-control system named “frobozz”
could, for example, say
#reposurgeon sourcetype frobozz
Within a commit, a magic comment of the form “#legacy-id” declares a legacy ID from the
stream file's source version-control system.
Also accepted is the bzr syntax for setting per-commit properties. While parsing commit
syntax, a line beginning with the token “property” must continue with a
whitespace-separated property-name token. If it is then followed by a newline it is taken
to set that boolean-valued property to true. Otherwise it must be followed by a numeric
token specifying a data length, a space, following data (which may contain newlines) and a
terminating newline. For example:
commit refs/heads/master
mark :1
committer Eric S. Raymond <[email protected]> 1289147634 -0500
data 16
Example commit.
property legacy-id 2 r1
M 644 inline README
Unlike other extensions, bzr properties are only preserved on stream output if the
preferred type is bzr, because any importer other than bzr's will choke on them.
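The two forms of the property line described above (boolean, and length-prefixed data) can be told apart with a small parser. This is an illustrative sketch in Python, not reposurgeon's own code; parse_property is a hypothetical name. It works on text for simplicity, whereas lengths in a real stream count bytes.

```python
def parse_property(text):
    """Parse one 'property' line at the start of text.

    Returns (name, value, rest). value is True for the boolean form,
    otherwise the data string of the declared length. rest is whatever
    follows the terminating newline.
    """
    prefix = "property "
    if not text.startswith(prefix):
        raise ValueError("not a property line")
    body = text[len(prefix):]
    # The property-name token runs up to the first space or newline.
    end = 0
    while end < len(body) and body[end] not in " \n":
        end += 1
    name = body[:end]
    if not name:
        raise ValueError("missing property name")
    if end == len(body) or body[end] == "\n":
        # Name followed directly by a newline: boolean property set to true.
        return name, True, body[end + 1:]
    # Otherwise: numeric length, one space, data, terminating newline.
    length_token, sep, after = body[end + 1:].partition(" ")
    if not sep or not length_token.isdigit():
        raise ValueError("expected a numeric length followed by a space")
    length = int(length_token)
    data = after[:length]
    if len(data) != length or after[length:length + 1] != "\n":
        raise ValueError("data shorter than declared or missing newline")
    return name, data, after[length + 1:]
```

Because the data is taken by length rather than by scanning for a newline, values with embedded newlines parse correctly.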
INCOMPATIBLE LANGUAGE CHANGES
In versions before 3.0, the general command syntax put the command verb first, then the
selection set (if any) then modifiers (VSO). It has changed to optional selection set
first, then command verb, then modifiers (SVO). The change made parsing simpler, allowed
abolishing some noise keywords, and recapitulates a successful design pattern in some
other Unix tools - notably sed(1).
In versions before 3.0, path expressions only matched commits, not commits and the
associated blobs as well. The names of the “a” and “c” flags were different.
In reposurgeon versions before 3.0, the delete command had the semantics of squash; also,
the policy flags did not require a “--” prefix. The “--delete” flag was named
“obliterate”.
In reposurgeon versions before 3.0, read and write optionally took file arguments rather
than requiring redirects (and the write command never wrote into directories). This was
changed in order to allow these commands to have modifiers. These modifiers replaced
several global options that no longer exist.
In reposurgeon versions before 3.0, the earliest factor in a unite command always kept its
tag and branch names unaltered. The new rule for resolving name conflicts, giving priority
to the latest factor, produces more natural behavior when uniting two repositories end to
end; the master branch of the second (later) one keeps its name.
In reposurgeon versions before 3.0, the tagify command expected policies as trailing
arguments to alter its behaviour. The new syntax uses similarly named options with leading
dashes, which can appear anywhere after the tagify command.
In versions before 2.9, the syntax of "authors", "legacy", "list", and "mailbox_{in|out}"
was different (and "legacy" was "fossils"). They took plain filename arguments rather than
using redirect < and >.
LIMITATIONS AND GUARANTEES
Guarantee: In DVCses that use commit hashes, editing with reposurgeon never changes the
hash of a commit object unless (a) you edit the commit, or (b) it is a descendant of an
edited commit in a VCS that includes parent hashes in the input of a child object's hash
(git and hg both do this).
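Case (b) follows from how such hashes are chained. The toy model below, in Python, is not git's or hg's real object format; it only assumes that each commit's hash is computed over its own content plus its parent's hash. Editing one commit then changes its hash and every descendant's, but no ancestor's.

```python
import hashlib


def commit_hash(message, parent_hash):
    """Toy commit hash: covers the message and the parent's hash."""
    payload = (parent_hash or "") + "\n" + message
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def chain_hashes(messages):
    """Hash a linear history, oldest commit first."""
    hashes = []
    parent = None
    for message in messages:
        parent = commit_hash(message, parent)
        hashes.append(parent)
    return hashes


before = chain_hashes(["first", "second", "third"])
after = chain_hashes(["first", "second (edited)", "third"])
# The untouched ancestor keeps its hash; the edited commit and its
# descendant do not.
print(before[0] == after[0], before[1] == after[1], before[2] == after[2])
```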
Guarantee: reposurgeon only requires main memory proportional to the size of a
repository's metadata history, not its entire content history. (Exception: the data from
inline content is held in memory.)
Guarantee: In the worst case, reposurgeon makes its own copy of every content blob in the
repository's history and thus uses intermediate disk space approximately equal to the size
of a repository's content history. However, when the repository to be edited is presented
as a stream file, reposurgeon requires no or only very little extra disk space to
represent it; the internal representation of content blobs is a (seek-offset, length) pair
pointing into the stream file.
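The (seek-offset, length) representation can be sketched as follows. This Python fragment is an illustration under simplifying assumptions, not reposurgeon's implementation: the names BlobRef and index_blobs are hypothetical, and only the byte-count form of the data command is handled (not the delimited form).

```python
import os
import tempfile


class BlobRef:
    """A content blob held as a (seek-offset, length) pair into a file."""

    def __init__(self, path, offset, length):
        self.path = path
        self.offset = offset
        self.length = length

    def read(self):
        # Content is fetched from the stream file only on demand.
        with open(self.path, "rb") as stream:
            stream.seek(self.offset)
            return stream.read(self.length)


def index_blobs(path):
    """Record where each blob's data lives without loading the data."""
    refs = []
    in_blob = False
    with open(path, "rb") as stream:
        while True:
            line = stream.readline()
            if not line:
                break
            if line == b"blob\n":
                in_blob = True
            elif line.startswith(b"data "):
                length = int(line[5:])
                if in_blob:
                    refs.append(BlobRef(path, stream.tell(), length))
                # Skip the payload (blob content or commit comment).
                stream.seek(length, os.SEEK_CUR)
                in_blob = False
    return refs


def demo():
    """Index a two-blob stream and return (offset, content) pairs."""
    data = b"blob\nmark :1\ndata 6\nhello\nblob\nmark :2\ndata 3\nabc\n"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
    try:
        return [(ref.offset, ref.read()) for ref in index_blobs(tmp.name)]
    finally:
        os.unlink(tmp.name)
```

Memory use in this scheme grows with the number of blobs, not with their total size, which is the point of the guarantee above.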
Guarantee: reposurgeon never modifies the contents of a repository it reads, nor deletes
any repository. The results of surgery are always expressed in a new repository.
Guarantee: Any line in a fast-import stream that is not a part of a command reposurgeon
parses and understands will be passed through unaltered. At present the set of potential
passthroughs is known to include the progress, options, and checkpoint commands as
well as comments led by #.
Guarantee: All reposurgeon operations either preserve all repository state they are not
explicitly told to modify or warn you when they cannot do so.
Guarantee: reposurgeon handles the bzr commit-properties extension, correctly passing
through property items including those with embedded newlines. (Such properties are also
editable in the mailbox format.)
Limitation: Because reposurgeon relies on other programs to generate and interpret the
fast-import command stream, it is subject to bugs in those programs.
Limitation: bzr suffers from deep confusion over whether its unit of work is a repository
or a floating branch that might have been cloned from a repo or created from scratch, and
might or might not be destined to be merged to a repo one day. Its exporter only works on
branches, but its importer creates repos. Thus, a rebuild operation will produce a
subdirectory structure that differs from what you expect. Look for your content under the
subdirectory 'trunk'.
Limitation: under git, signed tags are imported verbatim. However, any operation that
modifies any commit upstream of the target of the tag will invalidate it.
Limitation: Stock git (at least as of version 1.7.3.2) will choke on property extension
commands. Accordingly, reposurgeon omits them when rebuilding a repo with git type.
Limitation: While the Subversion read-side support is in good shape, the write-side
support is more of a sketch or proof-of-concept than a robust implementation; it only
works on very simple cases and does not round-trip. It may improve in future releases.
Limitation: reposurgeon may misbehave under a filesystem which smashes case in filenames,
or which nominally preserves case but maps names differing only by case to the same
filesystem node (Mac OS X behaves like this by default). Problems will arise if any two
paths in a repo differ by case only. To avoid the problem on a Mac, do all your surgery on
an HFS+ file system formatted with case sensitivity specifically enabled.
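Before doing surgery on such a filesystem, it may be worth checking whether a repository is exposed to this problem. The Python sketch below groups paths that differ only by case; case_collisions is a hypothetical helper, and str.casefold() is used as an approximation of how a case-insensitive filesystem folds names, which may not match any particular filesystem exactly.

```python
from collections import defaultdict


def case_collisions(paths):
    """Return sorted groups of paths that differ only by case."""
    groups = defaultdict(set)
    for path in paths:
        # casefold() approximates case-insensitive name matching.
        groups[path.casefold()].add(path)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

Feed it every path that appears anywhere in the repository's history, not just the paths in the head revision; an empty result means no two paths differ by case only.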
Guarantee: As version-control systems add support for the fast-import format, their
repositories will become editable by reposurgeon.
REQUIREMENTS
reposurgeon relies on importers and exporters associated with the VCSes it supports.
git
Core git supports both export and import.
bzr
Requires bzr plus the bzr-fast-import plugin.
hg
Requires core hg, the hg-fastimport plugin, and the third-party hg-fast-export.py
script.
svn
Stock Subversion commands support export and import.
darcs
Stock darcs commands support export and import.
CVS
Requires cvs-fast-export. Note that the quality of CVS lifts may be poor, with
individual lifts requiring serious hand-hacking. This is due to inherent problems with
CVS's file-oriented model.
RCS
Requires cvs-fast-export (yes, that's not a typo; cvs-fast-export handles RCS
collections as well). The caveat for CVS applies.
CANONICALIZATION RULES
It is expected that reposurgeon will be extended with more deletion policies. Policy
authors may need to know more about how a commit's file operation sequence is reduced to
normal form after operations from deleted commits are prepended to it.
Recall that each commit has a list of file operations, each a M (modify), D (delete), R
(rename), C (copy), or 'deleteall' (delete all files). Only M operations have associated
blobs. Normally there is only one M operation per individual file in a commit's operation
list.
To understand how the reduction process works, it's enough to understand the case where
all the operations in the list are working on the same file. Sublists of operations
referring to different files don't affect each other and reducing them can be thought of
as separate operations. Also, a "deleteall" acts as a D for everything and cancels all
operations before it in the list.
The reduction process walks through the list from the beginning looking for adjacent pairs
of operations it can compose. The following table describes all possible cases and all but
one of the reductions.
┌──────────────────────────┬──────────────────────────────────┐
│ M + D → D │ │
│ │ If a file is modified │
│ │ then deleted, the result │
│ │ is as though it had been │
│ │ deleted. If the M was the │
│ │ only modify for the file, │
│ │ it's removed too. │
├──────────────────────────┼──────────────────────────────────┤
│M a + R a b → R a b + M b │ │
│ │ The purpose of this │
│ │ transformation is to push │
│ │ renames toward the │
│ │ beginning of the list, │
│ │ where they may become │
│ │ adjacent to another R or │
│ │ C they can be composed │
│ │ with. If the M is the │
│ │ only modify operation for │
│ │ this file, the rename is │
│ │ dropped. │
├──────────────────────────┼──────────────────────────────────┤
│ M a + C a b │ │
│ │ No reduction. │
├──────────────────────────┼──────────────────────────────────┤
│ M b + R a b → nothing │ │
│ │ Should be impossible, and │
│ │ may indicate repository │
│ │ corruption. │
├──────────────────────────┼──────────────────────────────────┤
│ M b + C a b → nothing │ │
│ │ The copy undoes the │
│ │ modification. │
├──────────────────────────┼──────────────────────────────────┤
│ D + M → M │ │
│ │ If a file is deleted and │
│ │ modified, the result is │
│ │ as though the deletion │
│ │ had not taken place │
│ │ (because M operations │
│ │ store entire files, not │
│ │ deltas). │
├──────────────────────────┼──────────────────────────────────┤
│ D + {D|R|C} │ │
│ │ These cases should be │
│ │ impossible and would │
│ │ suggest the repository │
│ │ has been corrupted. │
├──────────────────────────┼──────────────────────────────────┤
│ R a b + D a │ │
│ │ Should never happen, and │
│ │ is another case that │
│ │ would suggest repository │
│ │ corruption. │
├──────────────────────────┼──────────────────────────────────┤
│ R a b + D b → D a │ │
│ │ The delete removes the │
│ │ just-renamed file. │
├──────────────────────────┼──────────────────────────────────┤
│ {R|C} + M │ │
│ │ No reduction. │
├──────────────────────────┼──────────────────────────────────┤
│ R a b + R b c → R a c │ │
│ │ The b terms have to match │
│ │ for these operations to │
│ │ have made sense when they │
│ │ lived in separate │
│ │ commits; if they don't, │
│ │ it indicates repository │
│ │ corruption. │
├──────────────────────────┼──────────────────────────────────┤
│ R a b + C b c │ │
│ │ No reduction. │
├──────────────────────────┼──────────────────────────────────┤
│ C a b + D a → R a b │ │
│ │ Copy followed by delete │
│ │ of the source is a │
│ │ rename. │
├──────────────────────────┼──────────────────────────────────┤
│ C a b + D b → nothing │ │
│ │ This delete undoes the │
│ │ copy. │
├──────────────────────────┼──────────────────────────────────┤
│ C a b + R a c │ │
│ │ No reduction. │
├──────────────────────────┼──────────────────────────────────┤
│ C a b + R b c → C a c │ │
│ │ Copy followed by a rename │
│ │ of the target reduces to │
│ │ a single copy. │
├──────────────────────────┼──────────────────────────────────┤
│ C + C │ │
│ │ No reduction. │
└──────────────────────────┴──────────────────────────────────┘
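To make the pairwise walk concrete, here is a partial sketch in Python. It is an illustration of the table, not reposurgeon's code, and it deliberately covers only the rows whose outcome is unambiguous: the "only modify" side conditions and the rows marked impossible are omitted. Operations are modelled as tuples such as ("M", "a") or ("R", "a", "b"); that encoding and the function names are assumptions. The sketch also assumes, as the text above does, that all operations in the list concern the same file or its renamed and copied forms.

```python
def compose(first, second):
    """Return the replacement for an adjacent pair, or None if no rule applies."""
    k1, k2 = first[0], second[0]
    if k1 == "M" and k2 == "D" and first[1] == second[1]:
        return [second]                            # M + D -> D
    if k1 == "D" and k2 == "M" and first[1] == second[1]:
        return [second]                            # D + M -> M
    if k1 == "M" and k2 == "R" and first[1] == second[1]:
        return [second, ("M", second[2])]          # M a + R a b -> R a b + M b
    if k1 == "R" and k2 == "D" and first[2] == second[1]:
        return [("D", first[1])]                   # R a b + D b -> D a
    if k1 == "R" and k2 == "R" and first[2] == second[1]:
        return [("R", first[1], second[2])]        # R a b + R b c -> R a c
    if k1 == "C" and k2 == "D" and first[1] == second[1]:
        return [("R", first[1], first[2])]         # C a b + D a -> R a b
    if k1 == "C" and k2 == "D" and first[2] == second[1]:
        return []                                  # C a b + D b -> nothing
    if k1 == "C" and k2 == "R" and first[2] == second[1]:
        return [("C", first[1], second[2])]        # C a b + R b c -> C a c
    return None


def reduce_ops(ops):
    """Reduce a file-operation list by composing adjacent pairs."""
    ops = list(ops)
    # A deleteall cancels every operation before it in the list.
    for i in range(len(ops) - 1, -1, -1):
        if ops[i][0] == "deleteall":
            ops = ops[i:]
            break
    i = 0
    while i < len(ops) - 1:
        replacement = compose(ops[i], ops[i + 1])
        if replacement is None:
            i += 1
        else:
            ops[i:i + 2] = replacement
            # Step back: the new element may compose with its predecessor.
            i = max(i - 1, 0)
    return ops
```

For example, reduce_ops([("M", "a"), ("R", "a", "b"), ("R", "b", "c")]) first pushes the rename ahead of the modify, then the second rename ahead of the modify, and finally composes the two renames, leaving [("R", "a", "c"), ("M", "c")].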
CRASH RECOVERY
This section will become relevant only if reposurgeon or something underneath it in the
software and hardware stack crashes while in the middle of writing out a repository, in
particular if the target directory of the rebuild is your current directory.
The tool has two conflicting objectives. On the one hand, we never want to risk clobbering
a pre-existing repo. On the other hand, we want to be able to run this tool in a directory
with a repo and modify it in place.
We resolve this dilemma by playing a game of three-directory monte.
1. First, we build the repo in a freshly-created staging directory. If your target
directory is named /path/to/foo, the staging directory will be a peer named
/path/to/foo-stageNNNN, where NNNN is a cookie derived from reposurgeon's process ID.
2. We then make an empty backup directory. This directory will be named /path/to/foo.~N~,
where N is incremented so as not to conflict with any existing backup directories.
reposurgeon never, under any circumstances, ever deletes a backup directory.
So far, all operations are safe; the worst that can happen up to this point if the
process gets interrupted is that the staging and backup directories get left behind.
3. The critical region begins. We first move everything in the target directory to the
backup directory.
4. Then we move everything in the staging directory to the target.
5. We finish off by restoring untracked files in the target directory from the backup
directory. That ends the critical region.
During the critical region, all signals that can be ignored are ignored.
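The five steps can be sketched in Python as follows. This is a simplified illustration, not reposurgeon's code: swap_in, build, and untracked are hypothetical names, the list of untracked files is supplied by the caller, only top-level untracked files are restored, and only SIGINT and SIGTERM are ignored. Removing the emptied staging directory at the end is also an assumption; the text above does not say what happens to it.

```python
import os
import shutil
import signal


def next_backup_dir(target):
    """Pick /path/to/foo.~N~ with the lowest N not already in use."""
    n = 1
    while os.path.exists("%s.~%d~" % (target, n)):
        n += 1
    return "%s.~%d~" % (target, n)


def move_contents(src, dst):
    """Move every entry in directory src into directory dst."""
    for name in os.listdir(src):
        shutil.move(os.path.join(src, name), os.path.join(dst, name))


def swap_in(target, build, untracked=()):
    """Rebuild target via a staging directory; return the backup path."""
    # Step 1: build the new repo in a fresh staging directory.
    staging = "%s-stage%d" % (target, os.getpid())
    os.mkdir(staging)
    build(staging)
    # Step 2: make an empty backup directory (never deleted afterwards).
    backup = next_backup_dir(target)
    os.mkdir(backup)
    # Critical region: ignore the signals we can.
    saved = {}
    for sig in (signal.SIGINT, signal.SIGTERM):
        saved[sig] = signal.signal(sig, signal.SIG_IGN)
    try:
        move_contents(target, backup)       # step 3
        move_contents(staging, target)      # step 4
        for name in untracked:              # step 5
            src = os.path.join(backup, name)
            dst = os.path.join(target, name)
            if os.path.isfile(src) and not os.path.exists(dst):
                shutil.copy2(src, dst)
    finally:
        for sig, handler in saved.items():
            if handler is not None:
                signal.signal(sig, handler)
    os.rmdir(staging)  # assumption: the emptied staging directory is removed
    return backup
```

If the process dies during step 3 or 4, the previous contents of the target are either still in the target or already in the backup directory, so recovery consists of moving them back by hand.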
ERROR RETURNS
Returns 1 on fatal error, 0 otherwise. In batch mode all errors are fatal.