Monday, July 12, 2010

Unicode and UTFs

This is a partial reply to an email...


The reason it won't let you edit the file is the file's encoding.

What is a plain text file (the kind that usually has a .txt extension on Windows)?

As you know, the old 256-entry (8-bit) character tables fell short and were a poor solution.


Several extensions appeared to fix this.


Linux, Unix and just about everything else that moves went with UTF-8.

UTF-8 was invented by a couple of guys as part of what a new operating system (called Plan 9) needed. One of them, Ken Thompson, is one of the fathers of Unix, no less. Rob Pike's email quoted at the end of this post tells the story.

The nice thing about UTF-8 is that ASCII characters keep their usual single-byte codes (so plain ASCII text is already valid UTF-8), and it doesn't waste much space.

A UTF-8 document can be opened without trouble in an ASCII editor that knows nothing about UTF-8; only the non-ASCII characters will show up as strange sequences.
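To make the ASCII compatibility concrete, here is a minimal Python sketch. The sample strings are just illustrations, and Latin-1 stands in for "an editor that is not UTF-8 aware".

```python
# ASCII text produces exactly the same bytes in ASCII and in UTF-8.
plain = "plain ASCII text"
assert plain.encode("ascii") == plain.encode("utf-8")

# A word with one non-ASCII letter (n with tilde, U+00F1).
word = "se\u00f1al"
utf8_bytes = word.encode("utf-8")
print(utf8_bytes)  # b'se\xc3\xb1al'

# Only the non-ASCII character turns into bytes >= 0x80;
# the ASCII letters around it are untouched.
non_ascii = [b for b in utf8_bytes if b >= 0x80]
print(non_ascii)  # [195, 177]

# What a single-byte (Latin-1) editor would display: the ASCII part is
# readable and the accented letter becomes a two-character oddity.
seen_by_latin1_editor = utf8_bytes.decode("latin-1")
print(seen_by_latin1_editor)  # seÃ±al
```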


(UTF-16 and UTF-32 also exist.)


Windows went with UTF-16 and misleadingly called it "Unicode".

UTF stands for Unicode Transformation Format.

So what is UTF-16?

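It encodes each character as one or two 16-bit units. The first 65,536 code points (the Basic Multilingual Plane) take two bytes each; everything else takes four bytes, as a "surrogate pair". Two bytes alone would only give you 65,536 characters, which is fewer than Unicode defines (well over 100,000 characters assigned, in a code space of 1,114,112 code points).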
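Strictly speaking, UTF-16 is not fixed-width: characters beyond the first 65,536 code points are stored as two 16-bit units (a surrogate pair). The sketch below shows this; `utf16_units` and `surrogate_pair` are hypothetical helper names made up for this example.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit units needed to encode `ch` in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


print(utf16_units("\u00f1"))       # 1 (U+00F1 fits in one unit)
print(utf16_units("\U0001F600"))   # 2 (U+1F600 needs a surrogate pair)

high, low = surrogate_pair(0x1F600)
print(hex(high), hex(low))         # 0xd83d 0xde00

# Cross-check the hand-computed pair against Python's own encoder.
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert "\U0001F600".encode("utf-16-be") == expected
```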

UTF-16 wastes a lot of space compared with UTF-8 for mostly ASCII text (it takes twice as much), and it is not ASCII-compatible either.


So, to an ASCII program, a UTF-16 document is a binary file.
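A small Python sketch of both points (the size difference and the "binary" look). The sample texts are arbitrary; the CJK line is included only to show that the size comparison depends on the script.

```python
ascii_text = "Hello, world"
utf8 = ascii_text.encode("utf-8")
utf16 = ascii_text.encode("utf-16-le")   # little-endian, no BOM

print(len(utf8), len(utf16))   # 12 24
print(utf16[:6])               # b'H\x00e\x00l\x00'

# Every ASCII letter drags a zero byte along with it, which is why
# byte-oriented ASCII tools treat the file as binary.
nul_count = utf16.count(0)
print(nul_count)               # 12

# For CJK text the balance flips: three bytes per character in UTF-8,
# two in UTF-16.
cjk = "\u65e5\u672c\u8a9e"
print(len(cjk.encode("utf-8")), len(cjk.encode("utf-16-le")))  # 9 6
```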


By the way, the Internet uses UTF-8 almost everywhere (web, email, FTP, etc.).


The thing is, your document arrived in Windows-style UTF-16.

Your Linux will be configured to work with UTF-8 (it can be configured to work with anything else, but that is not easy and not at all advisable).

So Kate opens the document and sees a pile of strange symbols that are not ASCII.

So it shows you what it can, hides the rest and, so that you don't break anything, won't let you modify the file.


Notice that Kate always shows the letters separated by a space. That is the other byte of each UTF-16 unit: for ASCII letters it is a zero byte, which is not a printable character.
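Windows "Unicode" text files normally start with a byte order mark (BOM), so a program can often recognise them before trying to show them as ASCII. A rough sketch of that check follows; `sniff_bom` is a hypothetical helper, and a file without a BOM would simply come back as `None`.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess a Python codec name from a leading BOM, or None if absent."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # UTF-32 must be checked before UTF-16: the UTF-32 LE BOM
    # (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


# A file as Windows would typically save it: BOM + little-endian UTF-16.
sample = codecs.BOM_UTF16_LE + "hola".encode("utf-16-le")

print(sniff_bom(sample))                  # utf-16
print(sample.decode(sniff_bom(sample)))   # hola

# Decoded as a single-byte encoding instead, every letter is followed
# by a NUL character, which an editor may draw as a gap.
print(repr(sample.decode("latin-1")))     # 'ÿþh\x00o\x00l\x00a\x00'
```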



I tried opening it with KWord and with OpenOffice.


KWord asks for the format, you tell it UTF-16 and voilà, all good.

OpenOffice is a bit worse: it asks for the format and you have to tell it "Unicode" for everything to work.


Strictly, it isn't "Unicode" (UTF-8 is also a way of encoding Unicode). I insist that what you really have is UTF-16, but Microsoft, as we all know, calls things whatever it pleases or whatever suits it.


So there you go: you can open it with either of those two programs, or convert it with other tools.
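As an example of the "convert it" route, here is a minimal Python sketch. The function name `utf16_to_utf8` and the file names are made up for illustration, and it assumes the source file starts with a BOM (otherwise Python's "utf-16" codec falls back to the machine's native byte order).

```python
import tempfile
from pathlib import Path


def utf16_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a UTF-16 text file as UTF-8."""
    # Reading in text mode also normalises Windows line endings (CRLF)
    # to "\n"; writing then uses the platform's own line ending.
    text = src.read_text(encoding="utf-16")
    dst.write_text(text, encoding="utf-8")


# Demo in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "from_windows.txt"
    dst = Path(tmp) / "for_linux.txt"
    # Python's "utf-16" encoder writes a BOM first.
    src.write_bytes("a\u00f1o nuevo\r\n".encode("utf-16"))
    utf16_to_utf8(src, dst)
    result = dst.read_bytes()

print(result.decode("utf-8").rstrip("\r\n"))  # año nuevo
```

From a shell, `iconv -f UTF-16 -t UTF-8 input.txt > output.txt` should do the same job.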


http://en.wikipedia.org/wiki/Unicode_Transformation_Format
http://en.wikipedia.org/wiki/Unicode
http://msdn.microsoft.com/en-us/library/dd374081(VS.85).aspx

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt



Subject: UTF-8 history
From: "Rob 'Commander' Pike"
Date: Wed, 30 Apr 2003 22:32:32 -0700 (Thu 06:32 BST)
To: mkuhn (at) acm.org, henry (at) spsystems.net
Cc: ken (at) entrisphere.com

Looking around at some UTF-8 background, I see the same incorrect
story being repeated over and over. The incorrect version is:
1. IBM designed UTF-8.
2. Plan 9 implemented it.
That's not true. UTF-8 was designed, in front of my eyes, on a
placemat in a New Jersey diner one night in September or so 1992.

What happened was this. We had used the original UTF from ISO 10646
to make Plan 9 support 16-bit characters, but we hated it. We were
close to shipping the system when, late one afternoon, I received a
call from some folks, I think at IBM - I remember them being in Austin
- who were in an X/Open committee meeting. They wanted Ken and me to
vet their FSS/UTF design. We understood why they were introducing a
new design, and Ken and I suddenly realized there was an opportunity
to use our experience to design a really good standard and get the
X/Open guys to push it out. We suggested this and the deal was, if we
could do it fast, OK. So we went to dinner, Ken figured out the
bit-packing, and when we came back to the lab after dinner we called
the X/Open guys and explained our scheme. We mailed them an outline
of our spec, and they replied saying that it was better than theirs (I
don't believe I ever actually saw their proposal; I know I don't
remember it) and how fast could we implement it? I think this was a
Wednesday night and we promised a complete running system by Monday,
which I think was when their big vote was.

So that night Ken wrote packing and unpacking code and I started
tearing into the C and graphics libraries. The next day all the code
was done and we started converting the text files on the system
itself. By Friday some time Plan 9 was running, and only running,
what would be called UTF-8. We called X/Open and the rest, as they
say, is slightly rewritten history.

Why didn't we just use their FSS/UTF? As I remember, it was because
in that first phone call I sang out a list of desiderata for any such
encoding, and FSS/UTF was lacking at least one - the ability to
synchronize a byte stream picked up mid-run, with less than one
character being consumed before synchronization. Because that was
lacking, we felt free - and were given freedom - to roll our own.

I think the "IBM designed it, Plan 9 implemented it" story originates
in RFC2279. At the time, we were so happy UTF-8 was catching on we
didn't say anything about the bungled history. Neither of us is at
the Labs any more, but I bet there's an e-mail thread in the archive
there that would support our story and I might be able to get someone
to dig it out.

So, full kudos to the X/Open and IBM folks for making the opportunity
happen and for pushing it forward, but Ken designed it with me
cheering him on, whatever the history books say.

-rob
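
The synchronization property Pike mentions in his email can be seen in how UTF-8 lays out its bytes: continuation bytes always look like 10xxxxxx, so a reader dropped into the middle of a stream only has to skip those to find the start of the next character. A minimal sketch (the `resync` helper is a hypothetical name for this example):

```python
def resync(data: bytes, pos: int) -> int:
    """Index of the first character boundary at or after `pos`."""
    # Continuation bytes match the bit pattern 10xxxxxx.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# "a", the euro sign (three bytes in UTF-8), "b".
data = "a\u20acb".encode("utf-8")
print(data)  # b'a\xe2\x82\xacb'

# Land in the middle of the euro sign: skip the rest of it and carry on.
start = resync(data, 2)
print(start, data[start:].decode("utf-8"))  # 4 b

# Land exactly on a lead byte: nothing needs to be skipped.
print(resync(data, 1), data[1:].decode("utf-8"))  # 1 €b
```

At most three bytes are skipped, which is less than one full character, just as the email says.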
