To split files by number of bytes or lines there is the standard split command, but how do you break a file into pieces of variable size that are delimited by a separator? A search turned up the csplit tool, hence this quick overview with a small example. Only a string pattern is shown here; see the documentation for more.
Let’s assume the following input file, named input (note the variable size of the sections):
first section
=============
second section
second section
=============
third section
third section
third section
The desired output is the following, or something close to it (the example is artificial; in real life the file might contain thousands of lines and tens of sections):
first section

second section
second section

third section
third section
third section
Break once
On the first try we pass a regular expression that matches the separator (===...):
$ csplit input /===/
and get two files as the output: xx00 and xx01. What happened is that the input file was split only once, at the first separator: everything above it went to xx00 and everything else, including the separator line itself, went to xx01.
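To try this without touching real data, here is a minimal sketch that recreates the sample file in a scratch directory and performs the single split. The scratch directory and the loop that prints the pieces are illustrative additions, and the one-entry-per-line layout of the sample file is an assumption based on the example above.

```shell
# Work in a throwaway directory so the generated xx* files are easy to remove.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the sample input: three sections separated by lines of "=" characters.
printf '%s\n' \
  'first section' \
  '=============' \
  'second section' 'second section' \
  '=============' \
  'third section' 'third section' 'third section' > input

# Split once, at the first line matching the regular expression.
# By default csplit prints the size in bytes of every piece it writes.
csplit input /===/

# Print each piece under a header line to see where the cut happened.
for f in xx*; do
  echo "--- $f"
  cat "$f"
done
```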
Break as many times as possible
To split at every separator, we need to request pattern repetition by passing {*}:
$ csplit input /===/ {*}
This time three files are created.
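One practical note: {*} contains a glob character, so depending on the shell and on what happens to be in the current directory, it may be safer to quote it. The sketch below assumes GNU csplit (as far as I know, the {*} form is a GNU extension and other implementations may only accept a fixed count such as {5}) and the same sample file as above. The -s option only silences the byte counts.

```shell
# Scratch directory with the sample input (layout assumed from the example).
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' \
  'first section' \
  '=============' \
  'second section' 'second section' \
  '=============' \
  'third section' 'third section' 'third section' > input

# Quote both the pattern and the repeat count so the shell passes them verbatim.
# -s (--quiet) suppresses the per-file byte counts.
csplit -s input '/===/' '{*}'

# List the pieces and count the lines in each one.
ls xx*
wc -l xx*
```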
Better file names
Names like xx{digits} are not very informative, so it’s better to provide a custom output file prefix with the --prefix option (-f is its short version):
$ csplit --prefix=input. input /===/ {*}
And get:
- input.00
  first section
- input.01
  =============
  second section
  second section
- input.02
  =============
  third section
  third section
  third section
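If the numeric suffix should look different as well, GNU csplit also accepts --suffix-format (-b), which takes a printf-style format for the counter. The following is a small sketch under that assumption; the section- prefix and the .txt extension are made-up names chosen for illustration.

```shell
# Scratch directory with the sample input (layout assumed from the example).
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' \
  'first section' \
  '=============' \
  'second section' 'second section' \
  '=============' \
  'third section' 'third section' 'third section' > input

# Combine a custom prefix with a printf-style suffix so that the pieces
# get an extension: section-00.txt, section-01.txt, ...
csplit -s --prefix=section- --suffix-format='%02d.txt' input '/===/' '{*}'

ls section-*
```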
Output files might still require some post-processing to remove leading or trailing separator lines, but this is still much better than splitting long text files manually. One doesn’t even need to write a program or a script to do it; the tool is already there.
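As an illustration of such post-processing, the sketch below strips the separator lines from every piece with sed. It is one possible approach rather than the only one, and the .clean suffix is just a name picked for this example.

```shell
# Scratch directory with the sample input (layout assumed from the example).
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' \
  'first section' \
  '=============' \
  'second section' 'second section' \
  '=============' \
  'third section' 'third section' 'third section' > input

csplit -s --prefix=input. input '/===/' '{*}'

# Write a copy of each piece without the lines that start with "===".
# The originals are kept, so the result can be compared before deleting them.
for f in input.[0-9]*; do
  sed '/^===/d' "$f" > "$f.clean"
done

cat input.01.clean
```

GNU csplit also has a --suppress-matched option that is meant to drop the matching lines during the split itself, which may make the extra step unnecessary; check man csplit to see whether your version supports it.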
There are some more useful options and pattern types, so take a look at this page, which repeats the content of the manual page in HTML format, or just read man csplit.