IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues


Derek Atkins
For the masses, we had this discussion on IRC earlier today.
I'm copying the logs here for posterity (and at cstim's request).

-derek

<cstim> btw our gnucash data file still begins with <?xml version="1.0"?> i.e. without the encoding="something" attribute.
<cstim> (doesn't it?)
<cstim> This is the root cause of http://bugzilla.gnome.org/show_bug.cgi?id=329202
<jsled> Correct.
<cstim> and we should start as soon as possible to add that tag again, but I'm not yet sure how to set it and what implications this might have as well.
<cstim> I didn't have a problem of 1.8 <-> 2.0 file compatibility, but only because my German umlauts happen to be in latin-1
<warlord> io-gncxml-v2.c:    fprintf(out, "<?xml version=\"1.0\"?>\n");
<warlord> in src/backend/file
<cstim> really?!?
<jsled> la la we suck.
<cstim> royally.
<warlord> in write_v2_header()
<jsled> We should really have 2.0 fix the upgrade path.
<warlord> yes, but 2.0 needs to figure out what locale the old datafile is in..
<warlord> (and prompt the user).. Which might be.. challenging.. based on the code path and lack of callback.
<jsled> Hmm.  There should be some way to determine the system-default character encoding.
<jsled> If we open a data-file without @encoding, then we use that, and convert to utf-8 on subsequent writes.
<jsled> If we see @encoding, then we're good to go.
<cstim> although I'm confused as to why my latin-1 characters (the non-ascii ones) were read correctly, because http://www.xmlsoft.org/encoding.html claims:
<cstim> "If there is no encoding declaration, then the input has to be in either UTF-8 or UTF-16, if it is not then at some point when processing the input, the converter/checker of UTF-8 form will raise an encoding error"
<cstim> (and non-ascii latin1 is neither UTF-8 nor UTF-16)
<warlord> it could be that we're just not "properly" using libxml..
<warlord> I mean, we're certainly writing out the data ourselves...
<cstim> jsled: the system-default encoding can be obtained by g_get_charset(), http://developer.gnome.org/doc/API/2.0/glib/glib-Character-Set-Conversion.html#g-get-charset
<cstim> And of course the original @encoding needs to be stored somewhere.
<jsled> why?
<cstim> compatibility to 1.8
<jsled> If we convert on the way in to utf-8, and we only ever write utf-8...
<jsled> Oh.  I don't think I care.
<cstim> no, you cannot "don't care", *yet*.
<warlord> if we specify utf-8 then 1.8 should "do the right thing", no?
<cstim> we even still don't write the xml namespaces because of compatibility to 1.8.0
<jsled> Yes, we do.
<warlord> cstim: actually in 1.9/svn we do write the namespace.
<cstim> we do? so we only have compatibility to >= 1.8.5
<warlord> cstim: yea, but 1.8.5 was released in late 2003, so I think that's okay to only have compatibility with the last 2.5-3 years backwards.
<cstim> obviously nobody really read http://wiki.gnucash.org/wiki/Release_Schedule because I raised that question there for some time now.
<warlord> 2003-09-11  Chris Lyttle  <[hidden email]>
<warlord>         * configure.in: Release 1.8.6
<warlord>         * NEWS: Release 1.8.6
<warlord> Oh, I read it and thought "but we do that already"...  But didn't think to comment.  Sorry.  I was reading it for the schedule parts, not that.
<warlord> What would g_get_charset() do in that koir(sp?) environment?  Would it say "utf-8?" or "koir"?
<cstim> gnucash-1.8 uses libxml1, doesn't it?
<warlord> I think 1.8 can build against either xml1 or xml2
<cstim> warlord: I guess it would say "koi8-r"
<warlord> [warlord@cliodev build]$ ldd /opt/gnucash-1.8/lib/gnucash/libgncmod-backend-file.so | grep libxml
<warlord>         libxml.so.1 => /usr/lib/libxml.so.1 (0xb7eb4000)
<warlord> Sorry, I was wrong.  It builds against libxml1
<warlord> my xml fu is pretty low..  So, how do we tell the xml parser to convert the data for us?
<cstim> it's all on that xmlsoft.org page.
<warlord> rather, how do we tell libxml that it's really not utf-8?
<cstim> I *think* if @encoding exists then the xml parser will automatically switch to that.
<cstim> xmlSwitchEncoding() ?
* cstim just discovered that xmllint --encode UTF-8 test1.xml > utf8.xml would work as well
<warlord> http://mail.gnome.org/archives/xml/2001-July/msg00160.html
<warlord> The problem is that xmlParseDocument() will call xmlSwitchEncoding().
<warlord> And according to Dan Veillard, the encoding specified in the xml document is canonical:  http://mail.gnome.org/archives/xml/2001-July/msg00161.html
<warlord> So we would need to 'sed' the document to give it a locale.
<jsled> except... I thought we had evidence that that's not happening.
<cstim> http://www.xmlsoft.org/html/libxml-parserInternals.html#xmlSwitchEncoding isn't particularly verbose, too
<jsled> This is going to be great fun! :)
<cstim> I wonder whether these statements from 2001 are still valid.
<warlord> I dont know.. I think we need to look at the code history and see what the code says and when it was changed.
<warlord> Of course we're 5 years behind now...
<warlord> I'm afraid we might need to tell people to manually modify their data files...
<warlord> we should also make sure that g2 will automatically output utf8 even in a koi8-r locale.
<warlord> (and we should set the encoding)
<jsled> I don't think we need to have people manually modify their data files.
<jsled> Either libxml2 DOES convert to utf-8, in which case we're golden.
<cstim> re output: well, first we need to verify that 1.8 will correctly recognize an encoding="UTF-8" attribute
<jsled> Or it DOESN'T, and we determine the system-default charset and convert ourselves.
<cstim> jsled: convert in which step?
<jsled> 2.0 only  ever saves in utf-8, and sets @encoding.
<jsled> Well, easiest to let libxml do it by forcing the xmlSwitchEncoding call, I'd guess.
<cstim> internally, libxml2 has *only* utf-8
<warlord> jsled: except gnc-2.0 doesn't set the xml encoding.
<warlord> I'm fixing that now.
<cstim> warlord: how?
<jsled> except what?
<cstim> warlord: This area needs some discussion input from non-ascii people or otherwise we won't get it right again.
<warlord> when gnucash (svn) creates a data file, it does not specify the XML encoding.
<jsled> oh, sure.  That's a bug, you're apparently fixing.
<jsled> But there aren't any documents of relevance that have been saved with 2.0.
<warlord> correct, I just fixed that.  So now whenever we create an xml document, I specify encoding="utf-8"
<warlord> We still, however, have the upgrade problem.
<jsled> yeah.
<warlord> (and I dont know what happens if the user is running in a non-utf8 locale)
<jsled> um.
<warlord> will their g2 output still be utf8?
<warlord> we might just printf() the data directly..
<jsled> wtf?
<cstim> what?!??
<warlord> Oh, never mind, we do build a DOM Tree on save.
<warlord> However we output the xml header ourself.
<jsled> hmm. both, I guess; there's certainly a bunch of fprintfs for the framing.
<cstim> regarding encodings in general, did you read http://www.joelonsoftware.com/articles/Unicode.html ? I'd strongly recommend it.
<jsled> yeah ... most of the structures are xmlElemDump'ed
<warlord> Yes, I've read it.
<jsled> But the framing is fprintf'ed.
* jsled sighs
--- hampton|away is now known as hampton|slow
<warlord> that shouldn't be an issue..
<jsled> True.
<jsled> cstim: So, as per that discussion with kostik, it seems 1.8 saves in the system-default encoding, always; would you agree?
<cstim> jsled: yes
<jsled> Great.  So 2.0 just needs to branch on the presence of @encoding.
<warlord> Assuming libxml has the SetEncoding override, I wonder if we could just use the libxml encoding detection ala that 2001 email thread to detect and override our own.
<cstim> warlord: you mean instead of using the system's g_get_charset()?
<warlord> e.g., do something like:  enc = xmlDetectCharEncoding(start, 4);
<warlord>     if (enc != XML_CHAR_ENCODING_NONE) {
<warlord>         xmlSwitchEncoding(ctxt, locale-encoding);
<jsled> What's the other branch?
<jsled> We don't care if the encoding's present.
<warlord> jsled: correct.
<cstim> warlord: you mean xmlDetectCharEncoding instead of the system's g_get_charset?
<warlord> cstim: no
<warlord> I'm not up on the libxml API -- I'm trying to find the docs now.  What I mean is:
<warlord> if (xml has encoding specified) {
<warlord>    use xml-specified encoding;
<warlord> } else { /* xml does not specify encoding */
<warlord>   use "locale" encoding;
<warlord>   warn user to check their data;
<warlord> }
<jsled> Hmm.  Or, `else { warn user; return iconv(from=locale-encoding, to=utf-8); }`...
<cstim> jsled: no
<jsled> no?
<hampton|slow> instead of 'use "locale" encoding' I would say that we should request an encoding from the user and default the response to the locale encoding.
<cstim> jsled: iff we can get libxml2 to accept that the file is in a different encoding, then it will do the conversion itself
<jsled> cstim: ah, yah. true.
<warlord> cstim: it's POSSIBLE that xmlSetEncoding() will tell libxml "use this encoding dammit"
<cstim> hampton|slow: in principle yes, but in practice the user will get this question each time when 1.8 has used the file in the meantime
<warlord> but I'd need to see the source to verify.
<warlord> cstim: I think that's okay -- I can't imagine that many users will switch back and forth between 1.8 and 2.0 -- also -- we don't know if a koi8-r user on 1.8 can read a utf-8 encoded XML document with "koi8-r - translated" characters.
<hampton|slow> How many people are going to switch back and forth from 1.8 to g2? Besides, we could always cache the answer.
<cstim> the docs of libxml2 suck. Not that I'd be up for writing better docs, though.
* cstim switches back and forth between 1.8 and 2.0
<cstim> warlord: what was that with the koi8-r user? I hope we get feedback from Kostik about that question, because IFF libxml1 honors the encoding="utf-8" then it will also read it correctly.
<warlord> There's xmlSwitchEncoding() and xmlSwitchInputEncoding()
<warlord> That's a very big IFF
<warlord> But worse -- let's say that libxml1 honors it -- what's the charset of the data that gnucash will read out of the DOM tree?  will libxml1 convert the utf-8 to koi8-r?
<warlord> gnucash 1.8 will expect it in koi8-r..
<cstim> I'm not sure at all.
<warlord> me either.
<warlord> *grr*  
* warlord would like to strangle the original gnc xml authors for not dealing with this..
<cstim> I just know that my non-ascii latin1 file is being used back and forth in 1.8 and 1.9, where 1.8 has LANG=de_DE (i.e. *not* utf8) but 1.9 has LANG=de_DE.utf-8
<warlord> Then again, we didn't deal with it in 1.8, either.
<cstim> and for whatever reason it works fine so far
<warlord> that's because libxml2 can detect iso-latin1 and 'gets it right'
<warlord> it doesn't "get it right" for koi8-r
<warlord> this is discussed in that thread from 2001 that I posted earlier.
<warlord> let me find the actual message which is pretty good about it.
<cstim> actually I'm not too sure about the actual encoding in my file. I'll need to check, but I don't have time before the weekend.
<warlord> first, here's a code snippet that might be interesting to us:
<warlord> http://mail.gnome.org/archives/xml/2001-July/msg00164.html
<warlord> here's the good description of the problem:
<warlord> http://mail.gnome.org/archives/xml/2001-July/msg00168.html
* warlord will be back shortly.
<cstim> Because either the file is in latin1, which means that 1.9/libxml2 miraculously writes latin1 instead of utf8 and 1.8/libxml1 uses this without conversion, or the file is in utf8, which means 1.9 uses it without conversion and 1.8/libxml1 miraculously converts the utf8 to the system locale latin1.
<jsled> hmm.  are you using non-utf8 latin1 characters?
<cstim> I use äüöß
<cstim> German umlauts, in HTML &auml; &uuml; &ouml; &szlig;
<cstim> the encoding of those differs between latin1 and utf-8, yes.
* jsled nods
* warlord is back
<warlord> cstim: can you compare the encodings used in a file generated by 1.8 and one generated by svn?
<cstim> I can, but not before next week.
<cstim> Can someone copy the interesting part of our discussion somewhere? Maybe a wiki page?
* cstim needs to un
<cstim> s/un/run/
--- cstim is now known as cstim_away
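
For reference, the branch discussed in the log (use the declared encoding when the XML declaration has one, otherwise assume a legacy encoding and convert to UTF-8) can be sketched in plain C. This is only a minimal illustration under stated assumptions, not GnuCash code: has_encoding_decl() and latin1_to_utf8() are hypothetical helper names, the declaration check is a plain substring search rather than a real XML parse, and ISO-8859-1 merely stands in for whatever locale encoding g_get_charset() would report. A koi8-r file would need iconv or libxml's own conversion instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if the XML declaration at the start of
 * `buf` mentions an encoding, 0 otherwise.  Only the declaration itself
 * (everything before the first "?>") is inspected. */
int has_encoding_decl(const char *buf)
{
    assert(buf != NULL);
    if (strncmp(buf, "<?xml", 5) != 0)
        return 0;                       /* no XML declaration at all */
    const char *end = strstr(buf, "?>");
    if (end == NULL)
        return 0;                       /* declaration never closes */
    const char *enc = strstr(buf, "encoding");
    return enc != NULL && enc < end;
}

/* Hypothetical helper: convert ISO-8859-1 bytes to UTF-8.
 * `out` must have room for 2 * len + 1 bytes.  Returns the number of
 * bytes written, not counting the terminating NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[n++] = c;               /* ASCII is identical in both */
        } else {
            /* Latin-1 0x80..0xFF maps to U+0080..U+00FF: two UTF-8 bytes */
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

A loader following this idea would call has_encoding_decl() on the first bytes of the file and only run the conversion when it returns 0. For the latin1 byte 0xE4 (ä) the converter emits the two bytes 0xC3 0xA4, which is why the same umlaut is stored differently in the two encodings.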
--
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       [hidden email]                        PGP key available

_______________________________________________
gnucash-devel mailing list
[hidden email]
https://lists.gnucash.org/mailman/listinfo/gnucash-devel

Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Chris Shoemaker

  I have some comments/proposals that are related to the 1.8<->2.0
compatibility issue and slightly related to the encoding issues.  So,
I'll just jump into this thread.

  GnuCash 1.8 is pretty forward-incompatible, in the sense that it's
not easy to keep the format backward-compatible.  Aside from
encodings, it just bombs on unrecognized xml elements.  :(

  To solve things for the future, I propose that we let 2.0 at least
_try_ to continue reading a file that contains unknown elements.  That
way, if we're smart about how we extend the format, it will be
backward compatible.

  Unfortunately, that doesn't help for 1.8.  For that, I propose a
"save-as" or "Export" option that will be _sure_ to save a file that
can be opened with 1.8.  For 1.8 compatibility that file would have to
drop new features (e.g. no budgets), and if there turns out to be an
encoding incompatibility, it can also handle that.

  Note: 2.0 datafiles are mostly 1.8-compatible, but this option would
have to _always_ work.

-chris

Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Derek Atkins
Chris Shoemaker <[hidden email]> writes:

>   I have some comments/proposals that are related to the 1.8<->2.0
> compatibility issue and slightly related to the encoding issues.  So,
> I'll just jump into this thread.
>
>   GnuCash 1.8 is pretty forward-incompatible, in the sense that it's
> not easy to keep the format backward-compatible.  Aside from
> encodings, it just bombs on unrecognized xml elements.  :(

So was 1.6..  New-in-1.8 features would cause your data file to become
unreadable in 1.6.

>   To solve things for the future, I propose that we let 2.0 at least
> _try_ to continue reading a file that contains unknown elements.  That
> way, if we're smart about how we extend the format, it will be
> backward compatible.

You can read them, but then what?  You would need to /remember/ them
somehow so you could put them back into the data file..  Otherwise
you just corrupt the data going back.

Also, what happens when an unrecognized tag really has some core
meaning about the object.  For example, let's say that there weren't
a void flag in transactions, but in the next version we implement
transaction voiding -- if you go back in time you lose the semantic
meaning of the flag, or WORSE, destroy that flag.

It's HARD to do this..  Unless we kept the data in the DOM tree and
modified it in-place it would be REALLY HARD to not lose data when we
go back and then forward again.

>   Unfortunately, that doesn't help for 1.8.  For that, I propose a
> "save-as" or "Export" option that will be _sure_ to save a file that
> can be opened with 1.8.  For 1.8 compatibility that file would have to
> drop new features (e.g. no budgets), and if there turns out to be an
> encoding incompatibility, it can also handle that.

Nah, I don't think it's an issue.  We didn't get a significant amount
of crying from 1.6->1.8 about data files..  I don't think we'll get
much on 1.8->2.0 either.

>   Note: 2.0 datafiles are mostly 1.8-compatible, but this option would
> have to _always_ work.

I just don't think this is a good idea for the above reasoning.

> -chris

-derek

--
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       [hidden email]                        PGP key available

Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Chris Shoemaker
On Fri, Feb 03, 2006 at 11:28:43AM -0500, Derek Atkins wrote:

> Chris Shoemaker <[hidden email]> writes:
>
> >   I have some comments/proposals that are related to the 1.8<->2.0
> > compatibility issue and slightly related to the encoding issues.  So,
> > I'll just jump into this thread.
> >
> >   GnuCash 1.8 is pretty forward-incompatible, in the sense that it's
> > not easy to keep the format backward-compatible.  Aside from
> > encodings, it just bombs on unrecognized xml elements.  :(
>
> So was 1.6..  New-in-1.8 features would cause your data file to become
> unreadable in 1.6.
>
> >   To solve things for the future, I propose that we let 2.0 at least
> > _try_ to continue reading a file that contains unknown elements.  That
> > way, if we're smart about how we extend the format, it will be
> > backward compatible.
>
> You can read them, but then what?  You would need to /remember/ them
> somehow so you could put them back into the data file..  Otherwise
> you just corrupt the data going back.

There's no way I'd expect the old version to _preserve_ new data-types
added by newer versions.  But IMO, _ignoring_ new data-types and
remaining usable is better than just bombing.

> Also, what happens when an unrecognized tag really has some core
> meaning about the object.  For example, let's say that there weren't
> a void flag in transactions, but in the next version we implement
> transaction voiding -- if you go back in time you lose the semantic
> meaning of the flag, or WORSE, destroy that flag.

Sure, if you make non-backward compatible changes to the file format,
you need to reflect that in the format version.

> It's HARD to do this..  Unless we kept the data in the DOM tree and
> modified it in-place it would be REALLY HARD to not lose data when we
> go back and then forward again.

Of course.  That's WAY too hard.  But that's not what I was
suggesting.  Currently, it's _impossible_ to change the format
_at_all_ without breaking old versions.  I'm just suggesting we should
at least make it _possible_ to make backward compatible changes.

Budgets is a perfect example.  There's _no_ good reason why a
pre-budgets version of gnucash shouldn't be able to open a data-file
from a version of gnucash that supports budgets.  Of course it can't
preserve budget data, but it should at least work with the data it
knows about.

If we _ever_ want to make any backward-compatible changes, we should
fix this.

> >   Unfortunately, that doesn't help for 1.8.  For that, I propose a
> > "save-as" or "Export" option that will be _sure_ to save a file that
> > can be opened with 1.8.  For 1.8 compatibility that file would have to
> > drop new features (e.g. no budgets), and if there turns out to be an
> > encoding incompatibility, it can also handle that.
>
> Nah, I don't think it's an issue.  We didn't get a significant amount
> of crying from 1.6->1.8 about data files..  I don't think we'll get
> much on 1.8->2.0 either.

I tend to be pretty conservative about these things.  As a user, if I
know that upgrading the program means that I lose the ability to load
the data file in my older version, it _is_ an issue I think twice
about.  I have certainly avoided upgrading certain financial
applications for _exactly_ this reason.

I'm not saying we _always_ have to be backward compatible.  We should
be free to consciously break compatibility, but I think we should also
have the option of making backward-compatible changes.

>
> >   Note: 2.0 datafiles are mostly 1.8-compatible, but this option would
> > have to _always_ work.
>
> I just don't think this is a good idea for the above reasoning.

Well, the 1.8->2.0 is a bit of a separate issue, since the fix is
completely different.  I'm still not quite clear on whether the
encoding issue is solvable.  But personally, I'd feel a bit
irresponsible if we broke backward-compatibility for no other reason
than introducing budgets.  It just feels... rude.

-chris

Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Derek Atkins
Quoting Chris Shoemaker <[hidden email]>:

>> You can read them, but then what?  You would need to /remember/ them
>> somehow so you could put them back into the data file..  Otherwise
>> you just corrupt the data going back.
>
> There's no way I'd expect the old version to _preserve_ new data-types
> added by newer versions.  But IMO, _ignoring_ new data-types and
> remaining usable is better than just bombing.

I completely disagree.  As soon as you save off the data file you've
now lost data.  I do agree that it shouldn't "bomb", but I do not believe
that it should read the data.  IMNSHO It should say "you need a newer
version of GnuCash to read this file" and just refuse to open it.
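
A minimal sketch of that refuse-to-open rule, assuming the loader can collect the element names in a file before it builds any objects. The element list and first_unknown_element() are hypothetical illustrations, not the actual GnuCash parser tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Element names this (hypothetical) reader version understands. */
static const char *const known_elements[] = {
    "gnc:commodity",
    "gnc:account",
    "gnc:transaction",
};

static int is_known_element(const char *name)
{
    size_t n = sizeof known_elements / sizeof known_elements[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, known_elements[i]) == 0)
            return 1;
    }
    return 0;
}

/* Return NULL when every element name is recognized.  Otherwise return
 * the first unrecognized name, so the caller can refuse to load the file
 * instead of silently dropping that data on the next save. */
const char *first_unknown_element(const char *const *names, size_t count)
{
    assert(names != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (!is_known_element(names[i]))
            return names[i];
    }
    return NULL;
}
```

A caller would abort the load with the "you need a newer version" message whenever the result is non-NULL. A file that uses no new features passes the check and stays readable by the older version.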

> Sure, if you make non-backward compatible changes to the file format,
> you need to reflect that in the format version.

ANY unrecognized option is a non-backwards compatible change to the file
format..  And losing that data is even worse.  Imagine if 1.6 did this.
Now a user of 1.8 who has all their business and SX data in the file
loads it into 1.6 which promptly ignores the business and SX data but
doesn't tell the user.  User now saves file and LOSES all their business
and SX data.  User now complains to us.

> Of course.  That's WAY too hard.  But that's not what I was
> suggesting.  Currently, it's _impossible_ to change the format
> _at_all_ without breaking old versions.  I'm just suggesting we should
> at least make it _possible_ to make backward compatible changes.

That's not completely true.  This is what the KVP frames are for.
If you put new data into the KVP frame then previous versions can
read it just fine..  But that makes it much harder on, say, DB schemas.

> Budgets is a perfect example.  There's _no_ good reason why a
> pre-budgets version of gnucash shouldn't be able to open a data-file
> from a version of gnucash that supports budgets.  Of course it can't
> preserve budget data, but it should at least work with the data it
> knows about.

I disagree.  See above.  If you DON'T have budget data, or any
new-in-new-version features in your data file, then yes, it should be
backwards compatible.
But as soon as you get a new-in-new-version feature the file should NOT
be readable by previous versions due to the data loss issue.

Solve the data loss issue and I'll change my opinion, but not before.

> If we _ever_ want to make any backward-compatible changes, we should
> fix this.

I don't see it as a problem.

> I tend to be pretty conservative about these things.  As a user, if I
> know that upgrading the program means that I lose the ability to load
> the data file in my older version, it _is_ an issue I think twice
> about.  I have certainly avoided upgrading certain financial
> applications for _exactly_ this reason.
>
> I'm not saying we _always_ have to be backward compatible.  We should
> be free to consciously break compatibility, but I think we should also
> have the option of making backward-compatible changes.

It's a trade off.  I agree that upgrading /by itself/ should not render
a data file unreadable...  And that /IS/ the case.   But I'm also worried
about data loss by a casual user..  To me, unintentional data loss is a
BIGGER concern than data incompatibility.

> Well, the 1.8->2.0 is a bit of a separate issue, since the fix is
> completely different.  I'm still not quite clear on whether the
> encoding issue is solvable.  But personally, I'd feel a bit
> irresponsible if we broke backward-compatibility for no other reason
> than introducing budgets.  It just feels... rude.

We broke it for Business and SX in 1.8.  It was broken from 1.4->1.6
for other reasons that I don't recall...  But it's broken ONLY IF
YOU USE THE NEW FEATURE..  Just running the program doesn't break
compatibility..  (xml encoding issues aside).

Please stop thinking like a developer.  Think like a user, a dumb
user, a dumb user who knows NOTHING about computers or programming
or the issues of when data loss can occur.  They care more about data
integrity than compatibility.  Go ahead, ask on -user.  Ask this question
and see what response you get:

   Would you rather be able to always read a data file created in a new
   version of gnucash using an older version of gnucash, where saving that
   data file will lose data when you re-open it in the new version, or would
   you rather the older version of gnucash fail to load the data when it
   sees data it doesn't understand in order to prevent accidental data loss?

> -chris

-derek
--
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       [hidden email]                        PGP key available


Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Dan Widyono
> I completely disagree.  As soon as you save off the data file you've
> now lost data.  I do agree that it shouldn't "bomb", but I do not believe
> that it should read the data.  IMNSHO It should say "you need a newer
> version of GnuCash to read this file" and just refuse to open it.

IMHO (as a user) I agree with Derek (quoted above) on this, and apparently so
does the user interface group of OO.org (perhaps modeled after the usability
studies of Microsoft).  If you know you're going back to GC1.8 after using
GC2.0, you would Save As... in GC2.0 with a file format of "GnuCash 1.8".
Then it would warn you that "You will lose some information (such as
budgeting, etc.), would you like to proceed?"

That way the older versions continue to read their own file format *only*,
while the newer versions understand both file formats and give you the
ability to save in the older format with a warning.

Cheers,
Dan W.


Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Chris Shoemaker
On Fri, Feb 03, 2006 at 12:27:59PM -0500, Derek Atkins wrote:

> Quoting Chris Shoemaker <[hidden email]>:
>
> >>You can read them, but then what?  You would need to /remember/ them
> >>somehow so you could put them back into the data file..  Otherwise
> >>you just corrupt the data going back.
> >
> >There's no way I'd expect the old version to _preserve_ new data-types
> >added by newer versions.  But IMO, _ignoring_ new data-types and
> >remaining usable is better than just bombing.
>
> I completely disagree.  As soon as you save off the data file you've
> now lost data.  I do agree that it shouldn't "bomb", but I do not believe
> that it should read the data.  IMNSHO It should say "you need a newer
> version of GnuCash to read this file" and just refuse to open it.

I think it should warn: "This file contains some data that was
introduced by a newer version of GnuCash, but which may safely be
ignored by this version of GnuCash, without affecting the integrity of
the remaining data.  However, that ignored data WILL BE ERASED from
any file that you save with this version of GnuCash.  Do you want to
continue?"  [ Note: I call this the "first warning" later on. ]

Yes, this means distinguishing between changes that really are
backward-compatible, like "this is a new feature", from those which
aren't, like "this new field means the transaction is actually VOID."
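
One way to picture that distinction is a two-part format version, where one counter is bumped only for changes an old reader must not ignore and the other for purely additive features.  This is a hypothetical sketch: the struct, its field names and decide_load() are assumptions made up for illustration and are not taken from the GnuCash sources.

```c
#include <assert.h>

/* Possible outcomes when a reader meets a data file. */
enum load_decision {
    LOAD_OK,              /* nothing newer than the reader understands */
    LOAD_WARN_DATA_LOSS,  /* additive features present: warn, then load */
    LOAD_REFUSE           /* semantics changed: do not open the file */
};

/* Hypothetical two-part version stamp stored in the data file. */
struct format_version {
    int breaking_rev;     /* bumped for changes like a new VOID flag */
    int additive_rev;     /* bumped for new, ignorable features */
};

enum load_decision decide_load(struct format_version file,
                               struct format_version reader)
{
    assert(file.breaking_rev >= 0 && file.additive_rev >= 0);
    if (file.breaking_rev > reader.breaking_rev)
        return LOAD_REFUSE;
    if (file.breaking_rev == reader.breaking_rev &&
        file.additive_rev > reader.additive_rev)
        return LOAD_WARN_DATA_LOSS;
    return LOAD_OK;
}
```

Under this scheme a feature like budgets would only bump additive_rev, so an older reader gets the warning and the choice, while a change to what existing transactions mean would bump breaking_rev and still be refused.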

> >Sure, if you make non-backward compatible changes to the file format,
> >you need to reflect that in the format version.
>
> ANY unrecognized option is a non-backwards compatible change to the file
> format..  

This is currently true, but if we purposely design a change that is
semantically backward-compatible, like budgets are, why should we
_guarantee_ that the change is syntactically backward-INcompatible?

> And losing that data is even worse.  

The user should be allowed to make an informed choice between losing
certain incompatible data and accessing their data with their older
software.  I make this decision every time I save an OpenOffice
document, and I _appreciate_ the freedom to do so.

> Imagine if 1.6 did this.
> Now a user of 1.8 who has all their business and SX data in the file
> loads it into 1.6 which promptly ignores the business and SX data but
> doesn't tell the user.  User now saves file and LOSES all their business
> and SX data.  User now complains to us.

I agree there's a problem with this scenario, but I think the problem
is abuse of expectations, NOT losing the business and SX data.  This
cuts both ways: User upgrades GnuCash, tries it out for some time,
checks out the new features, decides that the one _dropped_
feature is a must-have, tries to go back to their old program only to
find out their only option is to lose all their newly entered
transactions.  Again there's data-loss, but the data-loss was not the
real problem.  The real problem was that the new version didn't import
his data with a warning that said, "WARNING: if you use any feature in
this version that isn't in an older version, your ENTIRE data file
will be UNREADABLE by that old version."

The way I see it, we _need_ one warning or the other, just to be fair
to the users.  I don't know which one we want for 1.8->2.0, but I know
that, in general, I want the _option_ of using the "first warning".

> >Of course.  That's WAY too hard.  But that's not what I was
> >suggesting.  Currently, it's _impossible_ to change the format
> >_at_all_ without breaking old versions.  I'm just suggesting we should
> >at least make it _possible_ to make backward compatible changes.
>
> That's not completely true.  This is what the KVP frames are for.
> If you put new data into the KVP frame then previous versions can
> read it just fine..  But that makes it much harder on, say, DB schemas.

Ok, that's true.  But I don't think we need to constrain
backward-compatible format changes to that syntax.

> >Budgets is a perfect example.  There's _no_ good reason why a
> >pre-budgets version of gnucash shouldn't be able to open a data-file
> >from a version of gnucash that supports budgets.  Of course it can't
> >preserve budget data, but it should at least work with the data it
> >knows about.
>
> I disagree.  See above.  If you DON'T have budget data, or any
> new-in-new-version features in your data file, then yes, it should be
> backwards compatible.
> But as soon as you get a new-in-new-version feature the file should NOT
> be readable by previous versions due to the data loss issue.
>
> Solve the data loss issue and I'll change my opinion, but not before.

IMO, "solving" the data-loss issue means giving the user a choice:
"Yes, you may use this file with your old program, OR you may keep your
budget in this file, but not both.  Which do you want?"

<snip>

> >Well, the 1.8->2.0 is a bit of a separate issue, since the fix is
> >completely different.  I'm still not quite clear on whether the
> >encoding issue is solvable.  But personally, I'd feel a bit
> >irresponsible if we broke backward-compatibility for no other reason
> >than introducing budgets.  It just feels... rude.
>
> We broke it for Business and SX in 1.8.  It was broken from 1.4->1.6
> for other reasons that I don't recall...  But it's broken ONLY IF
> YOU USE THE NEW FEATURE..  Just running the program doesn't break
> compatibility..  (xml encoding issues aside).
>
> Please stop thinking like a developer.  Think like a user, a dumb
> user, a dumb user who knows NOTHING about computers or programming
> or the issues of when data loss can occur.  They care more about data
> integrity than compatibility.  

I really think I am thinking like a user.  But I don't think they're
so dumb as to be unable to choose between retaining the data related
to new features and using their old, familiar program to access their
file.  Apparently, lots of other document-format designers agree,
since this is a pretty standard option.

> Go ahead, ask on -user.  Ask this question
> and see what response you get:
>
>   Would you rather be able to always read a data file created in a new
>   version of gnucash using an older version of gnucash, where saving that
>   data file will lose data when you re-open it in the new version,

um, this doesn't represent my opinion.  Perhaps a better question would be:

  Do you have any desire to open files with a version of GnuCash older
  than the version that saved the file?  If so, would you prefer that:

  a) You must explicitly save your file from the new version as a
     backward-compatible datafile, expunging any data related to new
     features, OR

  b) You are warned by the old version of GnuCash that data related to
     new features will be lost when you save with the old version, OR

  c) Sorry, if you've used any new feature, you just can't go back.

Neither a) nor b) is that hard to do (disregarding potential encoding
issues).  And, I actually believe that some users of old versions would
avoid upgrading if the only option is c).

To be quite honest, the _only_ reason _I_ would upgrade is that I know
I can hand-edit the xml to retain backward compatibility.  It's
_enough_ of a perceived risk to be keeping other people's books in
non-commercial accounting software that I don't need any _additional_
risk due to a one-way, no-turning-back upgrade path.

-chris
_______________________________________________
gnucash-devel mailing list
[hidden email]
https://lists.gnucash.org/mailman/listinfo/gnucash-devel

Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Dan Widyono
> The user should be allowed to make an informed choice between losing
> certain incompatible data and accessing their data with their older
> software.

I propose this freedom be granted in the new version of GC (2.0) upon Saving
As... (using GC1.8 file format) as opposed to adding logic in 1.8.

Other than that, I agree with giving the user an informed choice.

Dan W.

Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Derek Atkins
Quoting Dan Widyono <[hidden email]>:

>> The user should be allowed to make an informed choice between losing
>> certain incompatible data and accessing their data with their older
>> software.
>
> I propose this freedom be granted in the new version of GC (2.0) upon Saving
> As... (using GC1.8 file format) as opposed to adding logic in 1.8.

Well, changing 1.8 is not an option..

> Other than that, I agree with giving the user an informed choice.

While in principle I agree with this, users (in general) are dumb and have
no idea how to do that.  Many times they have no idea what the choice means,
and other times they will just use the default.  However we can certainly
do something similar to what OOo does, let the user choose an output
format, and pop up a dialog if we think that output format will cause
data loss.

The downside of this approach is that it takes much more code, because you
need multiple format output generators, a way to specify the output format
during output generation, and some way to determine a priori if the chosen
output format cannot encode the actual data.
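
A rough sketch of that a-priori check, under loud assumptions: the
feature bits, the format enum and the function name features_lost() are
all hypothetical and do not exist in the GnuCash tree.  The table only
encodes what this thread states: SX and Business arrived in 1.8, budgets
in 2.0.

```c
#include <assert.h>

/* Hypothetical feature bits; the real data model has no such enum. */
enum gc_feature {
    GC_FEAT_SX       = 1u << 0,  /* scheduled transactions */
    GC_FEAT_BUSINESS = 1u << 1,  /* business features */
    GC_FEAT_BUDGET   = 1u << 2   /* budgets */
};

/* Hypothetical output formats a "Save As" dialog could offer. */
enum gc_format { GC_FORMAT_1_6, GC_FORMAT_1_8, GC_FORMAT_2_0 };

/* Features each format can encode, indexed by enum gc_format. */
static const unsigned format_features[] = {
    0u,
    GC_FEAT_SX | GC_FEAT_BUSINESS,
    GC_FEAT_SX | GC_FEAT_BUSINESS | GC_FEAT_BUDGET
};

/* Return the bitmask of features present in the book that the chosen
 * format cannot encode; zero means the save loses nothing. */
unsigned
features_lost(unsigned used, enum gc_format target)
{
    assert(target <= GC_FORMAT_2_0);
    return used & ~format_features[target];
}
```

The warning dialog would then only pop up when features_lost() returns
non-zero, and the returned bits tell it which features to name.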

I'll also point out that once we move to SQLite this issue becomes much
more important, because the schema version is part of the upgrade, so
if we want sqlite compatibility across versions we really need to handle
this "save as <version>" correctly.

> Dan W.

-derek

--
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       [hidden email]                        PGP key available


Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Chris Shoemaker
In reply to this post by Dan Widyono
On Fri, Feb 03, 2006 at 03:10:45PM -0500, Dan Widyono wrote:
> > The user should be allowed to make an informed choice between losing
> > certain incompatible data and accessing their data with their older
> > software.
>
> I propose this freedom be granted in the new version of GC (2.0) upon Saving
> As... (using GC1.8 file format) as opposed to adding logic in 1.8.

Well, for 1.8, there's not a lot of choice.  It's not
forward-friendly, never has been, and it's too late to change that.
The only choice is to either allow a backward-compatible "SaveAs.."
or not.  The default is obviously to not, since allowing it actually
requires some coding.  More than one dev has said they didn't think
this was very important, especially as it was never an option before.
FWIW, in my own priorities this is about a 7.5, which puts it above a
lot of other things but below at least four other things.  If it gets
coded, it gets coded.  I don't think Derek would mind this feature too
much, since it's not _unintentional_ data-loss, but quite deliberate.

Now, going _forward_, we can still choose either "forward-friendly",
"SaveAs...", or even both.  This decision is quite independent of what
to do about 1.8->2.0 and it probably needs to be made in the context
of a rethink about our data-file versioning method anyway.

-chris

> Other than that, I agree with giving the user an informed choice.
>
> Dan W.

Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Dan Widyono
> Now, going _forward_, we can still choose either "forward-friendly",
> "SaveAs...", or even both.  This decision is quite independent of what
> to do about 1.8->2.0 and it probably needs to be made in the context
> of a rethink about our data-file versioning method anyway.

I suppose my point is that I _always_ feel this way, using "Save As...".

Dan W.

locale issues with data format when upgrading 1.8 -> 2.0

Derek Atkins
In reply to this post by Chris Shoemaker
Chris Shoemaker <[hidden email]> writes:

> Now, going _forward_, we can still choose either "forward-friendly",
> "SaveAs...", or even both.  This decision is quite independent of what
> to do about 1.8->2.0 and it probably needs to be made in the context
> of a rethink about our data-file versioning method anyway.

Can we get back to the matter at hand, which is how to deal with the
xml encoding issues in the upgrade from 1.8 to 2.0?  Let's relegate
the larger issue of data format compatibility across versions to
another thread, but unfortunately you've already hijacked this one.

I think it's a major issue that someone in an ascii-like but
non-latin1 locale will get garbage during the default upgrade path.
libxml doesn't really provide a way to do proper detection, and 1.8
doesn't include an encoding in the data file..  Unfortunately the XML
spec says that the lack of an encoding declaration means the data is in
utf-8 (or utf-16), but that's not the case in 1.8 -- the data is in
whatever encoding the user's locale was using.

So, how do we solve this?

Let's ignore the backwards compatibility issues for the moment --
let's worry only about forward compatibility...

-derek

--
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       [hidden email]                        PGP key available

Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Conrad Canterford
In reply to this post by Chris Shoemaker

On Fri, Feb 03, 2006 at 12:27:59PM -0500, Derek Atkins wrote:
> We broke it for Business and SX in 1.8.  It was broken from 1.4->1.6
> for other reasons that I don't recall...  But it's broken ONLY IF
> YOU USE THE NEW FEATURE..  Just running the program doesn't break
> compatibility..  (xml encoding issues aside).

I'm pretty sure the 1.7 versions used to put up a dialog on start saying
in effect "If you use any of the new features (SX, Business) you will
not be able to use this datafile in 1.6. Please copy your datafile and
use a copy in this new version". I think the assumption was (and
probably a valid one) that once you were using 1.8 you weren't going
back to 1.6.

(1.4 -> 1.6 was broken because we changed from the old binary format to
XML - a completely non-backwards-compatible change).

For what it's worth, I think the OO solution of a "Save As" with warning
dialog (which in my experience /always/ comes up) is the (longer term)
way to go. Not that I recommend copying Microsoft at all, but this is
the way they do things too, and it is an approach users are familiar
with. I'd also not be fussed if it didn't make it into 2.0, since I'm
very familiar with the gnucash way of doing things.

I do think it should be prominent somewhere (on the release advices at
least) that datafiles saved with new features cannot be loaded into 1.8
AT ALL. I think a start-up dialog probably isn't a bad idea either since
(no offence Wilddev) I suspect many people don't really read the release
notices.

Conrad.


Re: locale issues with data format when upgrading 1.8 -> 2.0

Josh Sled
In reply to this post by Derek Atkins
On Fri, 2006-02-03 at 16:24 -0500, Derek Atkins wrote:
> I think it's a major issue that someone in an ascii-like but
> non-latin1 locale will get garbage during the default upgrade path.
> libxml doesn't really provide a way to do proper detection, and 1.8
> doesn't include an encoding in the data file..  Unfortunately the XML
> spec says that the lack of an encoding parameter means the data is in
> utf-8, but that's not the case in 1.8 -- the data is in whatever
> locale the user was using.
>
> So, how do we solve this?

We can look for the presence of the "encoding" attribute on the
<?xml ...?> header.

If present, then libxml will do the appropriate encoding conversion.

If not, then we believe the file was written by 1.8.   As such, we
should set libxml to believe that the encoding is the system-default as
determined from
http://gtk.org/api/2.6/glib/glib-Character-Set-Conversion.html#g-get-charset .
It may require a re-parse of the file to get encoding-conversion done;
I'm not sure when it's performed by libxml.

This file [[[

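#include <libxml/parser.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
  if (argc < 2)
    return 1;
  xmlDocPtr xml = xmlReadFile(argv[1], NULL, 0);
  if (xml == NULL)      /* parse failed, e.g. on an encoding error */
    return 1;
  /* encoding is NULL when the <?xml ...?> header has no encoding attribute */
  printf("encoding: [%s]\n", xml->encoding ? (const char *)xml->encoding : "(null)");
  xmlFreeDoc(xml);
  return 0;
}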

]]] compiled with [[[
gcc `xml2-config --cflags` -o xml-test xml-test.c `xml2-config --libs`
]]] shows that (xmlDocPtr)->encoding contains what we want to know: it's
set when <?xml [...] encoding="whatever"?> is set and NULL otherwise.
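
For the NULL case, the conversion step itself might look roughly like
the sketch below.  Assumptions: it uses POSIX iconv() rather than any
GLib or libxml routine, the helper name to_utf8() is made up for
illustration, and the charset name is passed in by the caller (in
practice it would come from something like g_get_charset(), as above).
It also skips the extra flush call that stateful encodings need.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Convert len bytes of `in`, assumed to be in `from_charset`, to a
 * newly allocated NUL-terminated UTF-8 string.  Returns NULL if the
 * charset is unknown or the input is not valid in that charset.
 * The caller frees the result. */
char *
to_utf8(const char *in, size_t len, const char *from_charset)
{
    assert(in != NULL && from_charset != NULL);

    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return NULL;

    /* Four output bytes per input byte is generous for the 8-bit
     * charsets discussed here (an assumption of this sketch). */
    size_t out_size = len * 4 + 1;
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in;   /* iconv() wants char **, input is not modified */
    char *dst = out;
    size_t in_left = len;
    size_t out_left = out_size - 1;

    size_t rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *dst = '\0';
    return out;
}
```

With this, to_utf8("\xe4", 1, "ISO-8859-1") yields the two-byte UTF-8
form of a-umlaut.  A wrong guess at the source charset still "succeeds"
for most 8-bit charsets, which is exactly why the guess matters.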

--
...jsled
http://asynchronous.org/ - `a=jsled; b=asynchronous.org; echo ${a}@${b}`

Re: locale issues with data format when upgrading 1.8 -> 2.0

Derek Atkins
Quoting Josh Sled <[hidden email]>:

> On Fri, 2006-02-03 at 16:24 -0500, Derek Atkins wrote:
>> I think it's a major issue that someone in an ascii-like but
>> non-latin1 locale will get garbage during the default upgrade path.
>> libxml doesn't really provide a way to do proper detection, and 1.8
>> doesn't include an encoding in the data file..  Unfortunately the XML
>> spec says that the lack of an encoding parameter means the data is in
>> utf-8, but that's not the case in 1.8 -- the data is in whatever
>> locale the user was using.
>>
>> So, how do we solve this?
>
> We can look for the presence of the "encoding" attribute on the
> <?xml ...?> header.
>
> If present, then libxml will do the appropriate encoding conversion.

I'm not worried about the case where the encoding exists.  Yes, libxml will
do the right thing.  The problem is the case without the encoding, but
where the data isn't utf-8.

> If not, then we believe the file was written by 1.8.   As such, we
> should set libxml to believe that the encoding is the system-default as
> determined from
> http://gtk.org/api/2.6/glib/glib-Character-Set-Conversion.html#g-get-charset .
> It may require a re-parse of the file to get encoding-conversion done;
> I'm not sure when it's performed by libxml.
>
> This file [[[
>
> #include <libxml/parser.h>
> #include <stdio.h>
>
> int
> main(int argc, char **argv)
> {
>  xmlDocPtr xml = xmlReadFile(argv[1], NULL, 0);
>  printf("encoding: [%s]\n", xml->encoding);
> }
>
> ]]] compiled with [[[
> gcc `xml2-config --cflags --libs` -o xml-test xml-test.c
> ]]] shows that (xmlDocPtr)->encoding contains what we want to know: it's
> set when <?xml [...] encoding="whatever"?> is set and NULL otherwise.

See http://mail.gnome.org/archives/xml/2001-July/msg00165.html for why
this is somewhat problematic.  "might be due to a confusion between locale
and encoding"...

Personally, I kinda like the approach in
http://mail.gnome.org/archives/xml/2001-July/msg00164.html

However I wonder if we want to bring user input into the fray?  Should
we ask the user to choose a charset, or somehow notify the user to check
the data?  And if they check it and the conversion was wrong, what
do we do then?

Also, we should really make sure that if a user is running g2 in a
non-utf8 locale that the data output really /IS/ utf8.  There's lots
of places where we're trusting libxml2 to do what we want, but have
we really verified and tested that it's actually doing what we want?
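
One way to test that, independent of libxml2, is to run the written file
through a strict UTF-8 validator.  The following is only a sketch; the
function name is_valid_utf8() is made up for illustration and nothing
like it is claimed to exist in the tree.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
 * Rejects overlong forms, surrogates and values above U+10FFFF. */
int
is_valid_utf8(const unsigned char *buf, size_t len)
{
    /* smallest code point each sequence length may encode */
    static const unsigned long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char c = buf[i];
        size_t n;            /* number of continuation bytes */
        unsigned long cp;    /* decoded code point */

        if (c < 0x80) { i++; continue; }                  /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
        else return 0;       /* stray continuation or invalid lead byte */

        if (len - i <= n)
            return 0;        /* sequence truncated at end of buffer */
        for (size_t k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;    /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min_cp[n] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += n + 1;
    }
    return 1;
}
```

As a check on our output this is conclusive.  As a detector for old 1.8
files it is only a heuristic: pure-ASCII data always passes, and a
legacy-encoded file can in principle pass by accident.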

Any KOI8-R users willing to help us test?

-derek

--
       Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
       Member, MIT Student Information Processing Board  (SIPB)
       URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
       [hidden email]                        PGP key available


Re: IRC discussion on i18n, xml/utf8, and 1.8->2.0 data migration issues

Chris Lyttle
In reply to this post by Conrad Canterford
On Sat, 2006-02-04 at 09:02 +1100, Conrad Canterford wrote:

> For what it's worth, I think the OO solution of a "Save As" with warning
> dialog (which in my experience /always/ comes up) is the (longer term)
> way to go. Not that I recommend copying Microsoft at all, but this is
> the way they do things too, and it is an approach users are familiar
> with. I'd also not be fussed if it didn't make it into 2.0, since I'm
> very familiar with the gnucash way of doing things.
>
Actually if you're comparing with M$ I can comment here. I used M$ Money
for a long time before migrating to GnuCash and several times they
upgraded with a new version to a non-backwards compatible new file and
left the old one as money.old-ver# (forget the exact name). Money
handled upgrades very differently to the likes of word, etc as it used a
jet db and not a binary file format.

> I do think it should be prominent somewhere (on the release advices at
> least) that datafiles saved with new features cannot be loaded into 1.8
> AT ALL. I think a start-up dialog probably isn't a bad idea either since
> (no offence Wilddev) I suspect many people don't really read the release
> notices.
>
Er, none taken, though I'm not sure what I should be offended about ;-)
We can easily add a note about the incompatibility to the release notices.
To be honest, I think the startup dialog should do no more than this: if
you open a 1.8 file, display 'you won't be able to open this file from
now on in 1.8' and then save a copy in the old format. _If_ it's possible
and easy to do that.

Chris
--
RedHat Certified Engineer #807302549405490.
Checkpoint Certified Security Expert 2000 & NG
--------------------------------------------
        |^|
        | |   |^|
        | |^| | |  Life out here is raw
        | | |^| |  But we will never stop
        | |_|_| |  We will never quit
        | / __> |  cause we are Metallica
        |/ /    |
        \       /
         |     |
--------------------------------------------
