Windows split text file

How to split a text file into multiple text files

I have a text file called entry.txt that contains the following:

I would like to split it into three text files: entry1.txt , entry2.txt , entry3.txt . Their contents are as follows.

entry1.txt:

entry2.txt:

entry3.txt:

In other words, the [ character indicates a new file should begin.

Is there any way I can accomplish automatic text file splitting? My eventual, actual input entry.txt actually contains 200,001 entries.

Doing the text split in either Windows or Linux would be great. I do not have access to a Mac machine. Thanks!

5 Answers 5

And here’s a nice, simple gawk one-liner :

This will work for any file size, irrespective of the number of lines in each entry, as long as each entry header looks like [ blahblah blah blah ] . Notice the space just after the opening [ and just before the closing ] .

awk and gawk read an input file line by line. As each line is read, its contents are saved in the $0 variable. Here, we are telling awk to match anything within square brackets, and save its match into the array k .

So, every time that regular expression is matched, that is, for every header in your file, k[1] will have the matched region of the line. Namely, «entry1», «entry2» or «entry3» or «entryN». name=k[1] just saves the value of k[1] (the match) into a new variable name .

Finally, we print each line into a file called .txt , ie entry1.txt, entry2.txt . entryN.txt.

This method will be much faster than perl for larger files.

I can’t vouch for this as I have never used windows shell, but I am willing to bet it will be far faster than that also. Gawk/awk are FAST.

How to split large text file in windows?

I have a log file with size of 2.5 GB. Is there any way to split this file into smaller files using windows command prompt?

6 Answers 6

If you have installed Git for Windows, you should have Git Bash installed, since that comes with Git.

Use the split command in Git Bash to split a file:

into files of size 500MB each: split myLargeFile.txt -b 500m

into files with 10000 lines each: split myLargeFile.txt -l 10000

Tips:

If you don’t have Git/Git Bash, download at https://git-scm.com/download

If you lost the shortcut to Git Bash, you can run it using C:\Program Files\Git\git-bash.exe

That’s it!

I always like examples though.

Example:

You can see in this image that the files generated by split are named xaa , xab , xac , etc.

Читайте также:  Как обновить драйвера для windows через биос

These names are made up of a prefix and a suffix, which you can specify. Since I didn’t specify what I want the prefix or suffix to look like, the prefix defaulted to x , and the suffix defaulted to a two-character alphabetical enumeration.

Another Example:

This example demonstrates

  • using a filename prefix of MySlice (instead of the default x ),
  • the -d flag for using numerical suffixes (instead of aa , ab , ac , etc. ),
  • and the option -a 5 to tell it I want the suffixes to be 5 digits long:

Split text file into smaller multiple text file using command line

I have multiple text file with about 100,000 lines and I want to split them into smaller text files of 5000 lines each.

That creates files:

files with no extensions. I just want to call them something like:

or if that is not possible, i just want them to have the «.txt» extension.

9 Answers 9

I know the question has been asked a long time ago, but I am surprised that nobody has given the most straightforward unix answer:

  • -l 5000 : split file into files of 5,000 lines each.
  • -d : numerical suffix. This will make the suffix go from 00 to 99 by default instead of aa to zz.
  • —additional-suffix : lets you specify the suffix, here the extension
  • $FileName : name of the file to be split.
  • file : prefix to add to the resulting files.

As always, check out man split for more details.

For Mac, the default version of split is apparently dumbed down. You can install the GNU version using the following command. (see this question for more GNU utils)

and then you can run the above command by replacing split with gsplit . Check out man gsplit for details.

Here’s an example in C# (cause that’s what I was searching for). I needed to split a 23 GB csv-file with around 175 million lines to be able to look at the files. I split it into files of one million rows each. This code did it in about 5 minutes on my machine:

Here’s a native windows batch that should accomplish the task.

Now I’ll not say that it’ll be fast (less than 2 minutes for each 5Kline output file) or that it will be immune to batch character-sensitivites. Really depends on the characteristics of your target data.

I used a file named q25249516.txt containing 100Klines of data for my testing.

Revised quicker version

Note that I used llimit of 50000 for testing. Will overwrite the early file numbers if llimit *100 is gearter than the number of lines in the file (cure by setting fcount to 1999 and use

2 in file-renaming line.)

You can maybe do something like this with awk

Читайте также:  Adobe cs5 для mac os

Basically, it calculates the name of the output file by taking the record number (NR) and dividing it by 5000, adding 1, taking the integer of that and zero-padding to 2 places.

By default, awk prints the entire input record when you don’t specify anything else. So, print > outfile writes the entire input record to the output file.

As you are running on Windows, you can’t use single quotes because it doesn’t like that. I think you have to put the script in a file and then tell awk to use the file, something like this:

and script.awk will contain the script like this:

Or, it may work if you do this:

Syntax looks like:

where prefix is PREFIXaa, PREFIXab, .

Just use proper one and youre done or just use mv for renameing. I think $ mv * *.txt should work but test it first on smaller scale.

My requirement was a bit different. I often work with Comma Delimited and Tab Delimited ASCII files where a single line is a single record of data. And they’re really big, so I need to split them into manageable parts (whilst preserving the header row).

So, I reverted back to my classic VBScript method and bashed together a small .vbs script that can be run on any Windows computer (it gets automatically executed by the WScript.exe script host engine on Window).

The benefit of this method is that it uses Text Streams, so the underlying data isn’t loaded into memory (or, at least, not all at once). The result is that it’s exceptionally fast and it doesn’t really need much memory to run. The test file I just split using this script on my i7 was about 1 GB in file size, had about 12 million lines of test and made 25 part files (each with about 500k lines each) – the processing took about 2 minutes and it didn’t go over 3 MB memory used at any point.

The caveat here is that it relies on the text file having «lines» (meaning each record is delimited with a CRLF) as the Text Stream object uses the «ReadLine» function to process a single line at a time. But hey, if you’re working with TSV or CSV files, it’s perfect.

Split a text file by lines

With Powershell i’m trying to split a text file into multiple files using the the beginning of each line as a delimiter

Input file ( transfer.txt ):

Output file ( Article.txt ):

Here’s a snippet of my code:

This as a result, creates the same file as the input file. What’s wrong with the code?

2 Answers 2

The below line is the problem. It is outside the If loop and adding the content of each line to the output file. But as I understand, that is not what you want. You want only the content that pass the If condition to be added to the output file. Hence, it needs to be inside the If loop.

Читайте также:  Наушники для windows mixed reality

Although I am not too found of this approach because you would be making as many Disk I/O operations as there are lines that pass the if condition. Not very good for scalability.

You can change your code to something like this to reduce number of Disk I/O to just 1.

As others have already pointed out, you never change the output file to anything different from «Article.txt», and you write all input lines to the defined output file.

If you want to write the lines of the input file to different files depending on the value of the first field I’d recommend naming the output files after that value. And since you’re writing the output with Add-Content I’d also suggest reading the input file via Get-Content for simplicity reasons. Use a StreamReader when performance is an issue (in which case you’ll want to use a StreamWriter too), but not just because.

Split a large text file

I have a large text file grouped with separate headers that I need to split into separate files.

For instance the file has headers like this:

I need to split the large file into text files based on their headers.

So for the example, there would be a 3 file output.

Does anyone know of a windows or max app/script that can do this?

Or maybe give instructions on how to write something like this in a programming language. I don’t know python or java but maybe this is the time to learn. 🙂

2 Answers 2

This is not the simplest answer, hopefully someone will come up with something neater. I put together a little script which will do this that should work on the Mac.

This works using csplit to chop up the file. The fourth line basically says ignore everything before the first header line and then split up the headers after that. lines 2-3 work out how many times csplit has to split up the file.

csplit names its output files xx followed by a 2 digit number. The last 4 lines rename all these files to whatever is in the header line with the — removed.

Here’s a «one liner» 8-]. It’s similar to what Martin has down. This will work on your Mac. Just open up the «Terminal» app and navigate to the directory containing myfile.txt

PS. Make sure there are no files in the directory that are named FILE*. ie, make sure ls FILE* shows nothing.

Оцените статью