Convert XML to CSV the ugly way using Unix utilities
I recently wanted to export my email address book from Sylpheed email client and import it into my Gmail account. Unfortunately, Gmail wants to import contact lists as CSV files, and there isn’t an export-address-book-to-csv feature in the otherwise wonderful Sylpheed. Worse, the Sylpheed address book is in XML format, and XML-to-CSV conversion isn’t straightforward.
Warning: What follows contains code that may offend real programmers. The code works but it’s pretty simple-minded, and if you’re a real programmer and not a hobbyist hacker like me, you may not want to read any further.
#The address book XML
The first screenshot shows a typical set of five entries from the 174 in my Sylpheed address book (names and addresses changed, of course). What I did was extract each full name and its associated email address from this XML file and separate them with a comma, as shown in the second screenshot. I then added Name,Email as the first line of the file and uploaded it to Gmail.
To understand how I did this, you need to see the address book not as something formatted in a complicated way as XML, but as simple lines of text. And lines of text can be manipulated easily in a terminal using simple command line tools.
I navigated to the directory (
/.sylpheed-2.0 ) containing the Sylpheed address book, which is called addrbook-000001.xml , and entered the following one-line commands, one after the other:
The result was the file gmail.csv on my desktop, looking just like the second screenshot, but listing 174 real names and addresses.
#The first command
Start with the third screenshot to see how this works for the first entry in the address book.
The first command searches with grep for lines containing the string cn=»[something]» in the address book. The -o option means that grep only reports the strings it finds, not the whole lines. The result of the search is sent (piped) to the tr command with the -d option, which in this case deletes the double quotes in the list. The trimmed result is sent to the cut command. Here the c4- option tells cut to report everything in each line beginning with the 4th character, or in other words after the first three characters, cn= . The final list is written to a new text file, names.txt .
#The second command
As shown in the last screenshot, the second command first pulls out the email addresses, but note that grep finds everything up to the last double quote in the line. To remove the remarks=»» strings, I used cut again. I told cut that the line consisted of fields ( f ) which were delimited ( d ) by a blank space, and asked it to report back with the contents of the first field ( f1 ) only. The result was sent to tr , then to cut as for the first command, and emails.txt was written.
#The third command
To create gmail.csv from names.txt and emails.txt , I merged the two files with the paste command. What paste does is put line 1, file 1 and line 1, file 2 into a new line 1 in a new file, and so on, line by line through the paired files. The d «,» option told paste to use a comma to separate each pair of merged lines in the new file.
Note that this won’t work if there’s more than one email address in each entry. In that case the emails list will be longer than than the names list, and paste won’t be able to do the correct matching. In my address book, people with more than one email address have more than one entry, such as Joe Blow — work , Joe Blow — home , Joe Blow — church .
If you search the Web for ways to extract values from XML files, or for ways to manipulate strings with sed and awk , you’ll find that there are much more sophisticated ways than mine to convert XML to CSV. My excuse is that I really like the Unix and GNU text tools! Like the man with a hammer for whom everything looks like a nail, I’m afraid I always reach first for grep , cut , tr , etc. And they work!
License
Verbatim copying and distribution of this entire article are permitted worldwide, without royalty, in any medium, provided this notice is preserved.
Источник
Ubuntu Documentation
Many people working with different kinds of data come across the problem that they need to convert between different data formats. Comma Separated Values (CSV) and the Extensible Markup Language (XML) are the most widely used formats for data, and conversion between these two formats needs often to be accomplished. Especially to XML, because this format is very well supported by modern applications, and is very well suited for furter data manipulation and customization.
XML to CSV conversion is trivial for some simple document structures. It can probably be done best with XSL Transformations (XSLT), and won’t be discussed here. (Well, maybe some examples can help.)
Countless tools and methods for CSV to XML conversion do exist, nevertheless it is very often quite hard to find a method that suits one’s particular needs. Different situations/users/documents require different tools. Requirements such as ease-of-installation, set-up, customization, GUI/CLI, encodings, availability of a scripting language interpreter, may have different impact on each use case. Here we will review a wide array of tools and methods for the conversion of CSV documents into XML.
Online Tools
The Creativyst® CSV to XML Converter let’s you paste your CSV document in one form field, and copy the resulting XML from another field online and for free. Also check out their excellent documentation about the CSV file format.
Spreadsheet Applications
Spreadsheet applications such as Openoffice Calc, KSpread, Gnumeric or MS Excel are often very useful tools for manually entering data into tables. Once created, you can usually save (or export) the data as a CSV document. Maybe your preferred applications allows you to export your data as «simple» XML natively, or with help of a third-party plugin. Modern spreadsheet aplications support file formats that are based on XML, and thus can be processed with standard XML tools.
ODF + XSLT
The OpenDocument Format that is used by default in Openoffice Calc is a ZIP compressed archive, which contains XML data. Thus, the content of your spreadsheet can be processed with XSLT.
- Create your data in your spreadsheet application, or import a CSV document.
- Save the spreadsheet as ‘document.ods’.
- Rename the file to ‘document.zip’.
- Extract the archive, and navigate to the ‘document’ directory.
- Here, we are only interested in the ‘content.xml’ file. Convert it with an XSLT stylesheet and xsltproc:
A basic ‘odf.xsl’ stylesheet might look like:
Please note, that the use of ‘position()’ in the ‘table:table-cell’ template might be problematic. It allows you to get distinct element names for each data column. It should be possible to use a similar mechanism for reading labels from the first row. Or you can get a flat element naming replacing with . Flat elements are a little bit harder to process, but it can be done.
Graphical (GUI) Applications
A specialized tool based on Java should be mentionned here, the CSV2XML Konverter published by OIO for free use. It’s possible to use it as a standalone or as a Java Web Start Application or integrated in own software projects. The tool gives several options, e.g. define the delimiter and define the first line as source for the element names (very useful!). Nevertheless, it’s good for ASCII but it has some problems with Unicode.
Another tool for CSV2XML conversion is the Altova XMLSpy. It’s not free, but when you work a lot with XML, maybe it’s worth to you because it’s like a Swiss Army Knife for XML use. How to convert now? In menu klick convert — import textfile — «convert CSV to XML» (2nd option). Then choose path and define delimiter as encoding (!). Additionally, as like as in the OIO tool, you may choose first line entries as element names. Finally import. The results are perfect. Unfortunately, only a Windows version exists.
Command Line Tools
The Unix shell is extremely powerful. It allows you to combine existing programs with redirections and pipes, creating your own custom tools. If some part is not available, you can write or modify a custom script in your scripting language of choice. For more information, you can read the Command Line Howto in this Wiki.
ffe Flat File Extractor
The Flat File Extractor is a command-line tool that allows you to convert arbitrary text files to XML. It is available as the ‘ffe’ package in Ubuntu Universe, and can be installed as follows:
Once installed, an ‘input.csv’ file can be converted to an ‘output.xml’ file using the ffe tool with the following instructions on the command line:
Where the ‘csv2xml.fferc’ configuration file looks something like the following:
This will read a simple CSV file with four columns, and fields separated by ‘;’. You can edit the ‘field’ and separator values as needed.
If you want to read the field names from the first line of the CSV file, you can use the following alternative structure. Just make sure to add the correct number of fields inside the record:
csv2xml
csv2xml is a command line tool released under th BSD license. It is written in C to process very large files very quickly. There seem to be no packages available, so you must compile it manually.
After downloading the source from sourceforge (for example ‘csv2xml-0.6.tar.gz’), make sure that you have ‘build-essential’ installed, and that the ‘/usr/local/man/man1’ directory exists. Then you can simply make & make install:
Now, you can process a comma-separated ‘input.csv’ file (without spaces), write it to ‘output.xml’. The xml2csv tool will read the column headers from the first line in the input file:
Warning: This tool does not create a well-formed XML document, as it lacks a proper root element! Thus, you will have to add this to get an XML document, probably together with an XML declaration:
The Ubuntu ‘libxml-csv-perl’ package provides all the functionality needed for converting CSV to XML with Perl. At it’s most basic, the following script converts ‘input.csv’ to ‘output.xml’, reading the «headings» labels from the first line of the CSV:
In order to use this script, you must save it as ‘csv2xml.pl’, and set it’s executable flag. Or you can invoke perl explicitly:
Warning: This script only works with ASCII text. All lines containing accented characters or other extended characters are silently ignored. A more robust approach would probably require the use of the Encode module.
A complete command-line tool written in Perl is available at the following place: http://sourceforge.net/forum/forum.php?thread_id=1179362&forum_id=422301.
Python
The following Python script reads the ‘input.csv’ document, and writes the data in XML to the standard output. You can adapt it to your personal requirements, adjusting the formatting parameters directly in the code below.
In order to run it, you must save the previous as ‘csv2xml.py’ and change the permissions to executable. Or you can also invoke it explicitly:
Another script for Python3 allows you to set a lot of input and output formatting options directly with command-line options.
PHP is a widely-used general-purpose scripting language that fits particularly well with XML. It is most commonly used in servers, but you can easily install it with the ‘php5’ metapackage from the Ubuntu repository.
You can use this sample script to convert CSV to XML, or you could also write your own script, using the fgetcsv function.
Ruby is probably not installed by default, but you can easily get it from the ‘ruby’ package. Due to it’s simplicity and elegance, this language is becoming a favorite of many programmers.
In order to execute this script it, you must save the previous as ‘csv2xml.rb’ and change the permissions to executable. Or you can also invoke it explicitly:
While the XSL Transformations template language looks like the natural choice for the conversion from XML to CSV, it is not very well suited for conversions the other way around.
Yet, there are a few solutions available on the web. Andrew Welch has developed the following XSLT 2.0 stylesheet:
Save this in a file called ‘csv2xml.xsl’ and perform the conversion with help of the Saxon XSLT processor. You will probably have to install the ‘libsaxonb-java’ package, which provides the XSLT 2.0 aware version. To convert ‘input.csv’ to ‘output.xml’, use the following command in a terminal:
An XSLT 1.0 solution by Kirk Evans requires you to wrap the whole CSV document inside some dummy XML element tags.
Anyway, XSLT is not wery well suited for this task, these are quite advanced hacks, and you should use these methods at your own risk.
Troubleshooting
A very common cause of problems in all text-processing software, including XML processors are the encoding, special, and non-ASCII characters. If your source document contains any of these, you should check your output document very carefully, because very strange things may happen, and usually without a warning!
Encoding
You have to be very careful about the encoding of your source-document. Unicode UTF-8 is the standard today on all modern operating systems, and that’s what you get usually when you create text documents with Ubuntu. If you are using legacy documents, they could be using other encodings, and you will probably have to tell your software somehow about that. Beware, many applications still don’t handle Unicode, or behave badly/unexpectedly. You might have to change your document’s encoding, maybe using the VIM or Emacs text editors.
Encoding only matters if you are using characters outside of the ASCII range, ie. accented or non-latin characters.
Special Characters
Some characters might cause problems with some applications/tools. The most common causes of conflict are summarized on the following table:
Источник
UNIX and Linux Applications
Xml to csv
Hello,
Does anyone know of a way to convert an .xml file (ONIX) to something more workable, like a .csv (or even .xls) file? Ideally something on the command line would be ideal, but not absolutely necessary. I would be dealing with .xml files of 125 MB+.
I am using XQuartz in El Capitan.
Thanks very much!
You didn’t supply sample input and desired output, so I couldn’t attempt a relevant demonstration.
Thanks for the replies. I have copied .xml code for a single item below. I am trying to extract three items (field indices a001, b203, and j151), so the desired output would be:
Here is a quickly-put-together solution using xml2 as the fundamental operation, followed by steps that could be put together in a simple step:
Hello,
I want to thank you so much for taking the time to do this. After replacing 1-data1 with the xml filename, I receive the following:
I’m going to be the Devil’s Advocate here and suggest something entirely different. If this is a one-time-only conversion you have to do, or if it’s something you won’t have to do on a regular basis, I’d honestly import the XML into a spreadsheet like MS Excel or OpenOffice/LibreOffice Calc, and then look at tidying it up and exporting it out as a CSV from there.
Of course if this is going to be an ongoing thing you anticipate needing to do many times per day forever then some kind of script would be desirable, but if it’s not going to be something you have to spend lots of time doing then you may actually save more time using a spreadsheet than trying to write a script for this.
Источник