Linux: add a BOM to a file

How can I re-add a Unicode byte order marker in Linux?

I have a rather large SQL file which starts with the byte order marker FFFE. I split this file into 100,000-line chunks using the Unicode-aware Linux split tool. But when passing the chunks back to Windows, it does not like any of the parts other than the first one, as only that one has the FFFE byte order marker.

How can I add this two-byte code using echo (or any other bash command)?

7 Answers

Based on the sed solution from Anonymous, sed -i '1s/^/\xef\xbb\xbf/' foo adds the BOM to the UTF-8 encoded file foo. Usefully, it also converts plain ASCII files to UTF-8 with a BOM.

To add BOMs to all the files that start with "foo-", you can use sed. sed has an option to make a backup.

strace-ing this shows that sed creates a temp file with a name starting with "sed". If you know for sure there is no BOM already, you can simplify the command:

Make sure you actually need UTF-16, because UTF-8, for example, uses a different BOM.
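The commands themselves were lost above; a sketch of the idea, assuming GNU sed (its \xHH escapes and the -i backup suffix are GNU extensions) and a UTF-8 BOM (for UTF-16LE the bytes would be FF FE instead). The foo-* sample files are created here only for illustration:

```shell
cd "$(mktemp -d)"                   # work in a scratch directory
printf 'hello\n' > foo-1            # sample file without a BOM
printf '\xef\xbb\xbfhi\n' > foo-2   # sample file that already has one

# Add a UTF-8 BOM to every foo-* file, keeping .bak backups.
# The \(...\)\? makes an existing BOM optional, so foo-2 is left
# unchanged; drop that part if you know for sure there is no BOM.
sed -i.bak '1s/^\(\xef\xbb\xbf\)\?/\xef\xbb\xbf/' foo-*
```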

For a general-purpose solution, one that sets the correct byte-order mark regardless of whether the file is UTF-8, UTF-16, or UTF-32, I would use vim's 'bomb' option:

(-e means run in Ex mode instead of visual mode; -s means don't print status messages; -c means "do this")

Something like this (backup first):
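The command itself is missing above; a hedged sketch of what it presumably looked like, wrapped in a hypothetical helper function. Setting fileencoding=utf-8 explicitly is an assumption here, so that 'bomb' takes effect even on plain ASCII input:

```shell
# Back up, then ask vim to set the BOM flag and write the file.
add_bom_vim() {
    cp "$1" "$1.bak"   # backup first
    vim -e -s -c 'set fileencoding=utf-8 bomb' -c 'wq' "$1" < /dev/null
}
```

Invoked as add_bom_vim file.txt; the file is rewritten starting with EF BB BF.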

Matthew Flaschen's answer is a good one, but it has a couple of flaws.

  • There's no check that the copy succeeded before the original file is truncated. It would be better to make everything contingent on a successful copy, to test for the existence of the temporary file, or to operate on the copy. If you're a belt-and-suspenders kind of person, you'd do a combination, as I've illustrated below.
  • The ls is unnecessary.
  • I'd use a better variable name than "i", perhaps "file".

Of course, you could be very paranoid and check for the existence of the temporary file at the beginning so you don't accidentally overwrite it, and/or use a UUID or a generated file name. One of mktemp, tempfile or uuidgen would do the trick.

Traps might be better than all the separate error handlers I’ve added.

No doubt all this extra caution is overkill for a one-shot script, but these techniques can save you when push comes to shove, especially in a multi-file operation.
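A sketch combining those suggestions, assuming UTF-8 output; the function name add_bom and the temp-file naming are illustrative, not the poster's actual script:

```shell
# Prepend a UTF-8 BOM to each argument, operating on a temp copy and
# replacing the original only if every step succeeded.
add_bom() {
    for file in "$@"; do
        tmp=$(mktemp "$file.XXXXXX") || return 1   # unique name, no clobbering
        if printf '\xef\xbb\xbf' > "$tmp" && cat "$file" >> "$tmp"; then
            mv "$tmp" "$file" || { rm -f "$tmp"; return 1; }
        else
            rm -f "$tmp"; return 1                 # copy failed: original untouched
        fi
    done
}
```

Used as, for example, add_bom foo-* to process a whole batch.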


Adding BOM to UTF-8 files

I’m searching (without success) for a script, which would work as a batch file and allow me to prepend a UTF-8 text file with a BOM if it doesn’t have one.


Neither the language it is written in (Perl, Python, C, Bash) nor the OS it works on matters to me. I have access to a wide range of computers.

I've found a lot of scripts that do the reverse (strip the BOM), which seems kind of silly to me, as many Windows programs will have trouble reading UTF-8 text files if they don't have a BOM.

Did I miss the obvious?

7 Answers

I wrote this addbom.sh using the ‘file’ command and ICU’s ‘uconv’ command.

edit: Added quotes around the mv arguments. Thanks @DirkR and glad this script has been so helpful!
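The script itself isn't reproduced above; a sketch of the same idea, assuming ICU's uconv is installed (its --add-signature option prepends a BOM while transcoding). Checking the first three bytes directly stands in for the original's use of the file command:

```shell
# addbom.sh (sketch): add a UTF-8 BOM unless the file already has one.
addbom() {
    for f in "$@"; do
        # skip files that already start with EF BB BF
        [ "$(head -c 3 "$f" | od -An -tx1 | tr -d ' ')" = "efbbbf" ] && continue
        # --add-signature writes the BOM while transcoding UTF-8 to UTF-8
        uconv -f utf-8 -t utf-8 --add-signature "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```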

The easiest way I found for this is:

I know it uses an external program (cat), but it will do the job easily in bash.

Tested on OS X, but it should work on Linux as well.

NOTE that it assumes the file doesn't already have a BOM (!)
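The commands themselves are missing above; they were presumably along these lines (the file names are placeholders):

```shell
cd "$(mktemp -d)"                      # work in a scratch directory
printf 'hello\n' > source.txt          # sample file without a BOM
printf '\xEF\xBB\xBF' > with_bom.txt   # write the three BOM bytes first
cat source.txt >> with_bom.txt         # then append the original content
mv with_bom.txt source.txt             # replace the original
```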

To add BOMs to all the files that start with "foo-", you can use sed. sed has an option to make a backup.

If you know for sure there is no BOM already, you can simplify the command:

Make sure you need to set UTF-8, because UTF-16, for example, is different (otherwise check "How can I re-add a unicode byte order marker in linux?").

As an improvement on Yaron U.’s solution, you can do it all on a single line:

The "cat -" bit says to concatenate to the front of source.txt whatever is being piped in from the print command. Tested on OS X and Ubuntu.
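The elided one-liner was presumably of this shape (source.txt is a placeholder; the sample input is created here for illustration):

```shell
cd "$(mktemp -d)"                                         # scratch directory
printf 'hello\n' > source.txt                             # sample input
printf '\xEF\xBB\xBF' | cat - source.txt > with_bom.txt   # BOM first, then the file
```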

I find it pretty simple. Assuming the file is always UTF-8 (you're not detecting the encoding; you already know it):

Read the first three bytes. Compare them to the UTF-8 BOM sequence (Wikipedia says it's 0xEF, 0xBB, 0xBF). If they match, write them to the new file and then copy everything else from the original file. If they differ, first write the BOM, then the three bytes you read, and only then copy everything else from the original file.

In C, fopen/fclose/fread/fwrite should be enough.
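The same check-then-copy logic can also be sketched in shell (ensure_bom is an illustrative name; it writes the result to stdout):

```shell
# Emit the file with a UTF-8 BOM, adding one only if the first
# three bytes are not already EF BB BF.
ensure_bom() {
    if [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' ')" != "efbbbf" ]; then
        printf '\xef\xbb\xbf'
    fi
    cat "$1"
}
```

Used as ensure_bom in.sql > out.sql; running it on a file that already has a BOM leaves the content unchanged.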

Источник

How to convert a file from ASCII to UTF-8?

I'm trying to transcode a bunch of files from ASCII to UTF-8.

For that, I tried using iconv:

-f ENCODING the encoding of the input

-t ENCODING the encoding of the output

Still, the file didn't convert to UTF-8. It is a .dat file.

Before posting this, I searched Google and found information like:

ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from «encoding it to UTF-8» would be exactly the same bytes. There’s no difference between them.

Still the above links didn’t help.

Even though the file is ASCII, it will work as UTF-8, since UTF-8 is a superset. But the other party who is going to receive the files from me needs the file encoding to be UTF-8. He just needs the file format to be UTF-8.

Any suggestions, please?

1 Answer

I’m a little confused by the question, because, as you indicated, ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded.
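This is easy to verify; a quick check, assuming GNU iconv and placeholder file names:

```shell
cd "$(mktemp -d)"                               # work in a scratch directory
printf 'plain ASCII text\n' > input.dat         # sample ASCII .dat file
iconv -f ASCII -t UTF-8 -o output.dat input.dat
# ASCII is a subset of UTF-8, so output.dat is byte-for-byte
# identical to input.dat: the "conversion" changes nothing.
cmp input.dat output.dat && echo identical
```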


If you’re sending files containing only ASCII characters to the other party, but the other party is complaining that they’re not ‘UTF-8 Encoded’, then I would guess that they’re referring to the fact that the ASCII file has no byte order mark explicitly indicating the contents are UTF-8.

If that is indeed the case, then you can add a byte order mark using the answer here:

If the other party indicates that he does not need the ‘BOM’ (Byte Order Mark), but is still complaining that the files are not UTF-8, then another possibility is that your initial file is not actually ASCII, but rather contains characters that are encoded using ANSI or ISO-8859-1.

Edited to add the following experiment, after a comment from Ram noting that the other party checks the type using the 'file' command:
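The experiment itself is missing above; it presumably contrasted file's output on otherwise-identical files, something like:

```shell
cd "$(mktemp -d)"                          # work in a scratch directory
printf 'hello\n' > plain.txt               # pure ASCII, no BOM
printf '\xef\xbb\xbfhello\n' > bom.txt     # same content plus a UTF-8 BOM
file plain.txt bom.txt
# file(1) typically reports "ASCII text" for the first and
# "UTF-8 Unicode (with BOM) text" for the second; the exact
# wording varies between versions of file.
```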


Adding UTF-8 BOM to string/Blob

I need to add a UTF-8 byte-order-mark to generated text data on client side. How do I do that?

Using new Blob(['\xEF\xBB\xBF' + content]) yields 'ï»¿my data', of course.

Neither did '\uBBEF\x22BF' work (with '\x22' == '"' being the next character in content).

Is it possible to prepend the UTF-8 BOM in JavaScript to a generated text?

Yes, I really do need the UTF-8 BOM in this case.

4 Answers

See the discussion between @jeff-fischer and @casey for details on UTF-8, UTF-16, and the BOM. What actually makes the above work is that the string \ufeff is always used to represent the BOM, regardless of whether UTF-8 or UTF-16 is used.

See p. 36 of The Unicode Standard 5.0, Chapter 2, for a detailed explanation. A quote from that page:

The endian order entry for UTF-8 in Table 2-4 is marked N/A because UTF-8 code units are 8 bits in size, and the usual machine issues of endian order for larger code units do not apply. The serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.

I had the same issue and this is the solution I came up with:

Using Uint8Array prevents the browser from converting those bytes into a string (tested on Chrome and Firefox).

You should replace text/plain with your desired MIME type.

I'm editing my original answer. The above answer really demands elaboration, as this is a convoluted solution by Node.js.

The short answer is, yes, this code works.

The long answer is, no, FEFF is not the byte order mark for UTF-8. Apparently Node took some sort of shortcut for writing encodings within files. FEFF is the UTF-16 Little Endian byte order mark, as can be seen in the Byte Order Mark Wikipedia article, and it can also be viewed in a binary text editor after the file has been written. I've verified this is the case.


Apparently, Node.js uses \ufeff to signify any number of encodings. It takes the \ufeff marker and converts it into the correct byte order mark based on the third options parameter of writeFile. The third parameter is the encoding string. Node.js takes this encoding string and converts the fixed \ufeff marker into any one of the actual encodings' byte order marks.

UTF-16 Little Endian Example:

So, as you can see, \ufeff is simply a marker standing in for any number of resulting encodings. The actual encoding that makes it into the file depends directly on the encoding option specified. The marker used within the string is really irrelevant to what gets written to the file.
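The Node example itself is missing above, but the substitution it describes is easy to observe with standard command-line tools: the single code point U+FEFF serializes to different bytes depending on the target encoding.

```shell
# U+FEFF serialized as UTF-8: the three bytes EF BB BF
printf '\xef\xbb\xbf' | od -An -tx1
# The same code point converted to UTF-16LE: the two bytes FF FE
printf '\xef\xbb\xbf' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1
```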

I suspect that the reasoning behind this is that they chose not to write byte order marks directly, and the three-byte mark for UTF-8 isn't easily encoded into the JavaScript string to be written to disk. So they used the UTF-16LE BOM as a placeholder mark within the string, which gets substituted at write time.


Java – How to add and remove BOM from UTF-8 file

By mkyong | Last updated: April 14, 2021


This article shows how to add, check, and remove the byte order mark (BOM) in a UTF-8 file. The UTF-8 representation of the BOM is the byte sequence 0xEF, 0xBB, 0xBF (hexadecimal) at the beginning of the file.

Further Reading
Read more about BOM and UTF-8

P.S. The BOM examples below only work for UTF-8 files.

1. Add BOM to a UTF-8 file

To add a BOM to a UTF-8 file, we can directly write the Unicode character \ufeff, or the three bytes 0xEF, 0xBB, 0xBF, at the beginning of the file.

Note
The Unicode character \ufeff, encoded as UTF-8, represents the bytes 0xEF, 0xBB, 0xBF.

1.1 The example below writes a BOM to the UTF-8 file /home/mkyong/file.txt.

1.2 Before Java 8: BufferedWriter and OutputStreamWriter examples of writing a BOM to a UTF-8 file.

1.3 A PrintWriter and OutputStreamWriter example writing a BOM to a UTF-8 file. 0xfeff is the byte order mark (BOM) code point.

1.4 Alternatively, we can write the BOM byte sequence 0xEF, 0xBB, 0xBF directly to the file.

2. Check if a file contains UTF-8 BOM

The example below reads the first 3 bytes from a file and checks whether they are the 0xEF, 0xBB, 0xBF byte sequence.

The import org.apache.commons.codec.binary.Hex comes from the commons-codec library below. Alternatively, we can use one of these methods to convert bytes to hex.

3. Remove BOM from a UTF-8 file

The example below uses a ByteBuffer to remove the BOM from a UTF-8 file.

P.S. Some XML, JSON, and CSV parsers may fail to parse or process a file that contains a BOM; it is common to remove or skip the BOM before parsing.
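The Java code isn't reproduced here; for reference, the same trim can be sketched with shell tools (tail -c +4 copies from the fourth byte onward; the sample file is created for illustration):

```shell
cd "$(mktemp -d)"                            # work in a scratch directory
printf '\xef\xbb\xbfhello\n' > bom.txt       # sample UTF-8 file with a BOM
# If the first three bytes are EF BB BF, rewrite the file without them.
if [ "$(head -c 3 bom.txt | od -An -tx1 | tr -d ' ')" = "efbbbf" ]; then
    tail -c +4 bom.txt > no_bom.txt          # copy from byte 4 onward
else
    cp bom.txt no_bom.txt                    # no BOM: plain copy
fi
```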

4. Copy a file and add BOM

The example below copies a file and adds a BOM to the target file.

