joseluis_estebanaparicio: Unicode and UTFs

This is a partial reply to an email...

The reason it won't let you edit the file is the file's encoding.

What is a plain text file (on Windows they usually carry the .txt extension)?

As you know, the 256-entry character tables fell short and were a poor solution.

Several extensions appeared to fix that.

Linux, Unix and pretty much everything else out there went with UTF-8.

UTF-8 was invented as part of what a new operating system (called Plan 9) needed, by a couple of guys, Ken Thompson and Rob Pike (Thompson being one of the fathers of Unix, no big deal).

The nice thing about UTF-8 is that plain ASCII text is already valid UTF-8, and it doesn't waste much space.

A UTF-8 document can be opened without trouble in an ASCII editor that knows nothing about UTF-8; odd sequences will show up only where there are non-ASCII characters.
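To make that concrete, here is a small Python sketch. The sample text is made up, and Latin-1 is only an assumption about which 8-bit table a non-UTF-8 editor would use; other editors may show different junk.

```python
# Sample text (an assumption for illustration): one non-ASCII letter, the rest ASCII.
text = "año 2010"

utf8 = text.encode("utf-8")

# The ASCII characters keep their one-byte ASCII values;
# only "ñ" becomes a two-byte sequence (0xC3 0xB1).
print(utf8)  # b'a\xc3\xb1o 2010'

# What an editor that is unaware of UTF-8 might display.
# Latin-1 is assumed here as the editor's 8-bit character table.
as_latin1 = utf8.decode("latin-1")
print(as_latin1)  # aÃ±o 2010
```

Everything that was ASCII is still readable; only the "ñ" turns into two strange symbols.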

(There are also UTF-16 and UTF-32.)

Windows went with UTF-16 and misleadingly called it "Unicode".

UTF stands for Unicode Transformation Format.

So what is UTF-16?

It uses 16-bit units: two bytes for each of the first 65,536 code points and four bytes (a "surrogate pair") for everything else. "Always two bytes per character" describes the older UCS-2, whose 65,536 slots are fewer than the characters Unicode defines (more than 100,000).
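A quick way to check those sizes is to count the bytes each encoding produces for a few characters. This is only a sketch: the sample characters and the helper name `encoded_sizes` are my own choices, and the `-le` codec variants are used so that no byte order mark is added to the count.

```python
# Sample characters (chosen for illustration), written as escapes.
samples = {
    "A": "\u0041",           # ASCII
    "n-tilde": "\u00f1",     # ñ, inside the first 65,536 code points
    "euro": "\u20ac",        # €, also inside the first 65,536 code points
    "emoji": "\U0001F600",   # outside them, so UTF-16 needs a surrogate pair
}

def encoded_sizes(ch):
    """Return the number of bytes `ch` takes in each encoding (hypothetical helper)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for name, ch in samples.items():
    print(name, encoded_sizes(ch))
```

ASCII takes one byte in UTF-8 and two in UTF-16, while the character outside the first 65,536 code points takes four bytes in both.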

For mostly-ASCII text, UTF-16 wastes a ton of space compared with UTF-8 and, on top of that, it is not ASCII-compatible.

So, to an ASCII program, a UTF-16 document is a binary file.

By the way, UTF-8 is by far the most widely used encoding on the Internet (web, mail, FTP, etc.).

The thing is, your document arrived in Windows-style UTF-16.

Your Linux is most likely configured to work with UTF-8 (it can be configured to work with something else, but that is not easy and not at all recommended).

So Kate opens the document and sees a bunch of odd bytes that are not valid text for it.

It shows you what it can, hides the rest and, so that you don't break anything, doesn't let you modify the file.

Notice that Kate always shows the letters separated by a space. That is the other byte of each UTF-16 unit (a zero byte, which is not a printable character).
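The "letters separated by spaces" effect is easy to reproduce. The sketch below assumes little-endian UTF-16 (the usual Windows flavour) and imitates a viewer that draws every non-printable byte as a blank; that is my own simplification, not what Kate does internally.

```python
# Encode a short ASCII word as little-endian UTF-16, without a byte order mark.
data = "Hola".encode("utf-16-le")
print(data)  # b'H\x00o\x00l\x00a\x00'

# Imitate a byte-per-character viewer: printable ASCII is shown as is,
# anything else (here, the zero bytes) is drawn as a blank.
shown = "".join(chr(b) if 32 <= b < 127 else " " for b in data)
print(repr(shown))  # 'H o l a '
```

Every ASCII letter is followed by a zero byte, which is where the apparent spaces come from.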

I tried opening it with KWord and with OpenOffice.

KWord asks for the format, you tell it UTF-16 and voilà, all good.

With OpenOffice it's slightly worse: it asks for the format, you have to tell it "Unicode", and then all good.

It isn't really "Unicode" (UTF-8 is also a way of encoding Unicode). I insist that what it really is is UTF-16, but Microsoft, as we all know, names things whatever it pleases or whatever suits it.

So that's it: you can open it with either of those two programs, or convert it with some other tool.
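As one possible way to do the conversion, here is a minimal Python sketch. The function name `utf16_to_utf8` and the file names in the usage note are hypothetical; it assumes the file starts with a byte order mark (as files saved as "Unicode" on Windows normally do), which Python's `utf-16` codec uses to pick the byte order.

```python
from pathlib import Path

def utf16_to_utf8(src, dst):
    """Read a UTF-16 text file and write the same text as UTF-8 (hypothetical helper)."""
    # The "utf-16" codec consumes the byte order mark and uses it to
    # decide between little-endian and big-endian.
    text = Path(src).read_text(encoding="utf-16")
    # Text mode normalizes line endings, so Windows CRLF pairs become
    # the line ending of the platform running the script.
    Path(dst).write_text(text, encoding="utf-8")
```

Calling `utf16_to_utf8("document.txt", "document-utf8.txt")` would leave the original untouched and produce a copy that a UTF-8 editor should open normally.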

http://en.wikipedia.org/wiki/Unicode_Transformation_Format

http://en.wikipedia.org/wiki/Unicode

http://msdn.microsoft.com/en-us/library/dd374081(VS.85).aspx

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

Subject: UTF-8 history
From: "Rob 'Commander' Pike"
Date: Wed, 30 Apr 2003 22:32:32 -0700 (Thu 06:32 BST)
To: mkuhn (at) acm.org, henry (at) spsystems.net
Cc: ken (at) entrisphere.com

Looking around at some UTF-8 background, I see the same incorrect
story being repeated over and over. The incorrect version is:

1. IBM designed UTF-8.
2. Plan 9 implemented it.

That's not true. UTF-8 was designed, in front of my eyes, on a
placemat in a New Jersey diner one night in September or so 1992.

What happened was this. We had used the original UTF from ISO 10646
to make Plan 9 support 16-bit characters, but we hated it. We were
close to shipping the system when, late one afternoon, I received a
call from some folks, I think at IBM - I remember them being in Austin
- who were in an X/Open committee meeting. They wanted Ken and me to
vet their FSS/UTF design. We understood why they were introducing a
new design, and Ken and I suddenly realized there was an opportunity
to use our experience to design a really good standard and get the
X/Open guys to push it out. We suggested this and the deal was, if we
could do it fast, OK. So we went to dinner, Ken figured out the
bit-packing, and when we came back to the lab after dinner we called
the X/Open guys and explained our scheme. We mailed them an outline
of our spec, and they replied saying that it was better than theirs (I
don't believe I ever actually saw their proposal; I know I don't
remember it) and how fast could we implement it? I think this was a
Wednesday night and we promised a complete running system by Monday,
which I think was when their big vote was.

So that night Ken wrote packing and unpacking code and I started
tearing into the C and graphics libraries. The next day all the code
was done and we started converting the text files on the system
itself. By Friday some time Plan 9 was running, and only running,
what would be called UTF-8. We called X/Open and the rest, as they
say, is slightly rewritten history.

Why didn't we just use their FSS/UTF? As I remember, it was because
in that first phone call I sang out a list of desiderata for any such
encoding, and FSS/UTF was lacking at least one - the ability to
synchronize a byte stream picked up mid-run, with less that one
character being consumed before synchronization. Becuase that was
lacking, we felt free - and were given freedom - to roll our own.

I think the "IBM designed it, Plan 9 implemented it" story originates
in RFC2279. At the time, we were so happy UTF-8 was catching on we
didn't say anything about the bungled history. Neither of us is at
the Labs any more, but I bet there's an e-mail thread in the archive
there that would support our story and I might be able to get someone
to dig it out.

So, full kudos to the X/Open and IBM folks for making the opportunity
happen and for pushing it forward, but Ken designed it with me
cheering him on, whatever the history books say.

-rob

joseluis_estebanaparicio

Monday, July 12, 2010
