[GNC-dev] Normalizing live data, a suggestion for discussion

GnuCash - Dev mailing list
Situation: someone reports a problem with gnc, and at triage it becomes
clear that some data is going to be required to identify or solve the
problem. Normal question: can you give us a file?

Problem: for any number of reasons ranging from plain old personal
privacy through to people that live in supposed liberal societies
avoiding tax and people in supposed conservative societies avoiding
persecution, sending live data isn't always appropriate.  The USA has
become very weird about this and most of our development people are in
the USA so hopefully they'll understand the politics of privacy, eventually.

Suggestion: we try to make providing a file easier for people.

My suggestion is we ask people to save a *copy* of their data in SQLite
and they then run a script across that copy that munges and obfuscates

1. account names [1]

2. numbers [2]

[1] people following this will probably be aware that gnc doesn't know
about account names much beyond broad classes, in spite of providing lots
of names and not accommodating other accounting concepts such as the
fact there is a level one up [3].  My point here is that account names
are important to people but not to gnc, so why not just randomize them?
Obvious way: copy the actual account name (the guid) to the user-visible
one.  This is a one-way change unless someone has unusual settings on
their SQLite file; if someone has those settings it seems reasonable to
presume they also know how to turn them off and save the file again.
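
A minimal sketch of the account-name step, assuming the SQLite copy has an `accounts` table with `guid` and `name` columns (worth checking against a real file first). The function names and the in-memory demo book are made up for illustration; this is not an agreed script.

```python
import sqlite3


def anonymize_account_names(conn: sqlite3.Connection) -> int:
    """Overwrite every user-visible account name with that account's guid.

    Assumes an `accounts` table with `guid` and `name` columns.
    Returns the number of rows changed.
    """
    cur = conn.execute("UPDATE accounts SET name = guid")
    conn.commit()
    return cur.rowcount


def demo_accounts() -> list:
    """Run the rename against a tiny in-memory stand-in for a book copy."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (guid TEXT PRIMARY KEY, name TEXT, account_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?)",
        [("a1", "Offshore savings", "BANK"), ("a2", "Groceries", "EXPENSE")],
    )
    anonymize_account_names(conn)
    return conn.execute(
        "SELECT guid, name, account_type FROM accounts ORDER BY guid"
    ).fetchall()
```

The account type and the account tree are left alone, so only the label people care about is replaced. It should only ever be run on the copy.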

[2] as long as the transaction stream balances, the actual numbers don't
matter (there will be occasions where the numbers are important, but
these tend to be number extremes related to commodities rather than
anyone using gnc to do a Mr Putin vs Mr Trump sports bet).  In most
cases multiplying any matching numbers by the same semi-random factor
should produce a good file for examination, so long as it is done
consistently [4]
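
A rough sketch of the "same factor everywhere" idea, assuming a `splits` table that stores amounts as integer numerator/denominator pairs in `value_num`/`value_denom` and `quantity_num`/`quantity_denom`. The function names and demo data are hypothetical, and the balance check is only there to show that a shared integer factor cannot unbalance a transaction.

```python
import sqlite3
from fractions import Fraction


def scale_split_amounts(conn: sqlite3.Connection, factor: int) -> None:
    """Multiply every split's value and quantity numerators by one factor."""
    if factor < 2:
        raise ValueError("factor must be an integer of at least 2")
    conn.execute(
        "UPDATE splits SET value_num = value_num * ?, quantity_num = quantity_num * ?",
        (factor, factor),
    )
    conn.commit()


def unbalanced_transactions(conn: sqlite3.Connection) -> list:
    """Return the tx_guids whose split values do not sum to exactly zero."""
    totals = {}
    rows = conn.execute("SELECT tx_guid, value_num, value_denom FROM splits")
    for tx_guid, num, denom in rows:
        totals[tx_guid] = totals.get(tx_guid, Fraction(0)) + Fraction(num, denom)
    return sorted(tx for tx, total in totals.items() if total != 0)


def demo_book() -> sqlite3.Connection:
    """Build a tiny in-memory book with one balanced two-split transaction."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE splits (guid TEXT, tx_guid TEXT, value_num INTEGER,"
        " value_denom INTEGER, quantity_num INTEGER, quantity_denom INTEGER)"
    )
    conn.executemany(
        "INSERT INTO splits VALUES (?, ?, ?, ?, ?, ?)",
        [("s1", "t1", 1250, 100, 1250, 100), ("s2", "t1", -1250, 100, -1250, 100)],
    )
    return conn
```

In real use the factor would be picked once per file (for example with `random.randint`) and not written down anywhere. Scaling value and quantity together leaves the implied price of each split unchanged, and a shared factor keeps the ratios between amounts intact by design.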

[3] that is a long argument I am interested in conceptually rather than
personally, it doesn't affect me as a UK person but makes me think
Internationally.

[4] I don't think a reductive discussion of true vs near true random [5]
is appropriate.  The significant point is that the person viewing the data
won't be able to work out the original number without significant effort,
and in most cases simply won't be able to work it out at all: we're
talking about computing assets I doubt anyone here has access to.  *And* I
believe the gnc people are actually motivated by solving problems, belief
in the project and ordinary stuff like that, so they won't even be looking.

[5] Random is fun if only because there are so many ways of doing it.

Questions: why SQLite rather than XML?  Because if a person runs an
agreed script across their file we can be sure of an outcome.  Editing
an XML file informally is scary; it immediately raises questions about
consistency of data.  Other SQL formats are not widely used, so my
proposal is we go for the LCD (lowest common denominator) where we can
achieve normalization.

Normalization will have to be balanced: privacy vs contribution to the
project.

I definitely want contributions from other people that work well with
SQL; let's think about this together, people.  I have written some
scripts that confuse *my* data, and I know that Geert is still waiting
for me to send him a file.

Geert is a good person, I just don't want to show him very personal
stuff in my file.

I have a plan for making showing a file easier, is anyone interested?

This is the *start* of a conversation, I welcome thoughts.

_______________________________________________
gnucash-devel mailing list
[hidden email]
https://lists.gnucash.org/mailman/listinfo/gnucash-devel

Re: [GNC-dev] Normalizing live data, a suggestion for discussion

Stephen M. Butler
On 2/1/19 5:36 AM, Wm via gnucash-devel wrote:

> Situation: someone reports a problem with gnc, at triage it becomes
> clear some data is going to be required to identify or solve the
> problem. Normal question?  Can you give us a file.
>
> Problem: for any number of reasons ranging from plain old personal
> privacy through to people that live in supposed liberal societies
> avoiding tax and people in supposed conservative societies avoiding
> persecution, sending live data isn't always appropriate.  The USA has
> become very weird about this and most of our development people are in
> the USA so hopefully they'll understand the politics of privacy,
> eventually.
>
> Suggestion: we try to make providing a file easier for people.
>
> My suggestion is we ask people to save a *copy* of their data in
> SQLite and they then run a script across that copy that munges and
> obfuscates
>
> 1. account names [1]
>
> 2. numbers [2]
>
> [1] people following this will probably be aware that gnc doesn't know
> about account names much beyond broad classes in spite of providing
> lots of names and not accommodating other accounting concepts such as
> the fact there is a level one up [3]  My point here is that account
> names are important to people but not gnc so why not just randomize
> them? Obvious way? copy the actual account name (the guid) to the user
> visible one.  this is a one way change unless someone has unusual
> settings on their SQLite file, if someone has those settings it seems
> reasonable to presume they also know how to turn them off and save the
> file again.
>
> [2] as long as the transaction stream balances the actual numbers
> don't matter (their will be occasions where the numbers are important
> but these tend to be number extremes related to commodities rather
> than anyone using gnc to do a Mr Putin vs Mr Trump sports bet).  In
> most cases multiplying any matching numbers by the same semi-random
> should produce a good file for examination so long as it is done
> consistently [4]
>
> [3] that is a long argument I am interested in conceptually rather
> than personally, it doesn't affect me as a UK person but makes me
> think Internationally.
>
> [4] I don't think a reductive discussion of true vs near true random
> [5] is appropriate, the significant point is the person viewing the
> data won't be able to work out the original number without significant
> effort and in most cases simply won't be able to work it out at all,
> we're talking computing assets I doubt anyone here has access to in
> order to get back *and* I believe the gnc people are actually
> motivated by solving problems, belief in the project and ordinary
> stuff like that so they won't even be looking.
>
> [5] Random is fun if only because there are so many ways of doing it.
>
> Questions: why SQLite rather than XML?  Because if a person runs an
> agreed script across their file we can be sure of an outcome.  Editing
> an XML file informally is scary, it immediately raises questions about
> consistency of data. Other SQL formats are not widely used, my
> proposal is we go for LCD where we can achieve normalization.
>
> Normalization will have to be balanced: privacy vs contribution to the
> project.
>
> I definitely want contribution from other people that work well with
> SQL, let's think about this together, people, I have written some
> scripts that confuse *my* data and I know that Geert is still waiting
> for me to send him a file.
>
> Geert is a good person, I just don't want to show him very personal
> stuff in my file.
>
> I have a plan for making showing a file easier, is anyone interested?
>
> This is the *start* of a conversation, I welcome thoughts. 


It might be better to have a standardized test file that folks could
download and run their scenario against.

However, there are situations where the only solution is to look at the
original file.  In that case some obfuscation would be helpful.  I would
think that memos and descriptions would also need to be randomized.
After a careful read, I realized you did intend to randomize the
transaction amounts (which would have to be done carefully to ensure the
DR/CR remained balanced).  Otherwise, one could at least learn the total
Assets/Liabilities/Income/Expense values for the submitter.  That may be
sensitive information.  I know that I've shared some information where my
later reflection was "did I really give them that!"

Now, to the XML vs SQLite argument.  Whatever script is applied to one
could easily have a counterpart that would apply to the other.  You
wouldn't have to manually (informally) edit the XML.  A known script
should provide a known outcome.  I suspect that many folks are using an
XML back-end and would rather not fiddle with a database back-end.  I'm
in that camp even though I'm a trained Oracle DBA and spent a couple
decades using that back-end professionally.
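
To illustrate the "counterpart script" point, here is a sketch of the same account-name step done on XML instead of SQLite. It assumes an uncompressed XML file in which each account carries `act:name` and `act:id` children under the namespace shown below; the namespace URI, the function name and the sample snippet are assumptions for illustration, not a tested tool. Files saved with compression would need decompressing first.

```python
import xml.etree.ElementTree as ET

# Namespace assumed from uncompressed GnuCash XML files; verify against yours.
ACT_NS = "http://www.gnucash.org/XML/act"


def anonymize_xml_account_names(xml_text: str) -> str:
    """Replace each act:name with the text of its sibling act:id."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        name = element.find(f"{{{ACT_NS}}}name")
        ident = element.find(f"{{{ACT_NS}}}id")
        if name is not None and ident is not None:
            name.text = ident.text
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed stand-in for a real book file.
SAMPLE = (
    '<book xmlns:act="http://www.gnucash.org/XML/act">'
    "<account><act:name>Groceries</act:name>"
    '<act:id type="guid">a2</act:id></account>'
    "</book>"
)
```

Calling `anonymize_xml_account_names(SAMPLE)` returns the document with the account name replaced by its guid, which is the same outcome as the SQL route as long as everyone runs the same script.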

I think the first step is having a standard test file that a user could
apply to their favorite back-end, run their scenario, check the
results.  If the problem is verified, then we have pretty good evidence
the problem is in the application.  If the problem doesn't show up, then
it indicates the problem may be in the data.  That would require a "data
forensic expert" (aka developer or some assistant) to look deeper into
the user's data file.  In that case a good obfuscation tool would come
in handy.

--Steve

--
Stephen M Butler, PMP, PSM
[hidden email]
[hidden email]
253-350-0166
-------------------------------------------
GnuPG Fingerprint:  8A25 9726 D439 758D D846 E5D4 282A 5477 0385 81D8



Re: [GNC-dev] Normalizing live data, a suggestion for discussion

GnuCash - Dev mailing list
On 01/02/2019 19:17, Stephen M. Butler wrote:

Ummm, Stephen M. Butler I don't think you were my intended audience.

Let me put you down gently.

> It might be better to have a standardized test file that folks could
> download, and run their scenario against.

Nope, we can do that already, I was addressing other realistic situations.

> However, there are situations that arise where the only solution is to
> look at the original file.  In that case some obfuscation would be
> helpful.  I would think that memos and descriptions would also need to
> be randomized.

My suggestion is they are zapped, no personal stuff at all

> After a careful read, I realized you did intend to
> randomize the transaction amounts (which would have to be careful to
> ensure the DR/CR remained balanced.

I'm one of the more intelligent people here, the tx will remain balanced.

> Otherwise, one could at least get
> the total Assets/Liabilities/Income/Expense values known for the
> submitter.  That may be sensitive information.  I know that I've shared
> some information that later reflection was "did I really give them that!"

Ummmmm

> Now, to the XML vs SQLite argument.  Whatever script is applied to one
> could easily have a counterpart that would apply to the other.  You
> wouldn't have to manually (informally) edit the XML.  A known script
> should provide a known outcome.

Not true in reverse if someone throws in some numbers no other person
knows about.  Think about diminishing returns.

I can't correct this fucked up quote below, must be a Mexican border
issue, sigh.  Looks like a Trump voter, fucked quotient in general.

 >I suspect that many folks are using an
> XML back-end and would rather not fiddle with a database back-end.

We know that; we ask for a specific db when we need to test stuff.

I've given up correcting the quoting, sorry, folks.

  I'm
> in that camp even though I'm a trained Oracle DBA and spent a couple
> decades using that back-end professionally.

We are unimpressed unless you contribute.

Some of us also think training may have been wasted time if you end up
not knowing much about databases.

> I think the first step is having a standard test file that a use could
> apply to their favorite back-end, run their scenario, check the
> results.

Wrong, please read what I said before.  Grrrr.

I hate it when someone so obviously doesn't read.

> If the problem is verified, then we have pretty good evidence
> the problem is in the application.  If the problem doesn't show up, then
> it indicates the problem may be in the data.  That would require a "data
> forensic expert" (aka developer or some assistant) to look deeper into
> the user's data file.  In that case a good obfuscation tool would come
> in handy.

I'd say something obviously rude around now but Liz would zap me instead
of the fool if past rules are anything to go by :(

I'd like someone with a clue to attempt an answer.

--
Wm



Re: [GNC-dev] Normalizing live data, a suggestion for discussion

GnuCash - Dev mailing list
In reply to this post by GnuCash - Dev mailing list
On 01/02/2019 13:36, Wm via gnucash-devel wrote:

would someone other than idiot Stephen M Butler attempt a reply please

TIA


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

David Cousens
In reply to this post by GnuCash - Dev mailing list
Wm

As well as the account names you might also want to munge data in the
description/memo fields. These can contain identifying information for
customers/vendors. The same possibly goes for any data relating to the
owner of the file which is stored in the file/database. The combination
of the above would probably be considered commercially sensitive
information, and at a personal level which banks/service companies etc.
you deal with might be a problem if it is in the public domain.

David Cousens




-----
David Cousens
--
Sent from: http://gnucash.1415818.n4.nabble.com/GnuCash-Dev-f1435356.html

Re: [GNC-dev] Normalizing live data, a suggestion for discussion

GnuCash - Dev mailing list
On 02/02/2019 00:16, David Cousens wrote:

> As well as the account names you might also want to munge data in the
> description/memo fields. This can contain identifying information for
> customers/vendors.

How about we just zap the stuff in description/memo fields by default?
They're not mathematically significant and rarely cause double entry
problems unless someone introduces unusual UI stuff in which case they
should be able to provide an example.
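
A short sketch of zapping by default, assuming the free text lives in `transactions.description` and `splits.memo` (column names to be checked against a real file). The function names and the demo rows are invented for illustration.

```python
import sqlite3


def zap_free_text(conn: sqlite3.Connection) -> None:
    """Blank transaction descriptions and split memos."""
    conn.execute("UPDATE transactions SET description = ''")
    conn.execute("UPDATE splits SET memo = ''")
    conn.commit()


def demo_free_text() -> tuple:
    """Zap a one-transaction in-memory book and return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (guid TEXT, description TEXT)")
    conn.execute("CREATE TABLE splits (guid TEXT, tx_guid TEXT, memo TEXT)")
    conn.execute("INSERT INTO transactions VALUES ('t1', 'Dr Smith, clinic fee')")
    conn.execute("INSERT INTO splits VALUES ('s1', 't1', 'ref 12345')")
    zap_free_text(conn)
    description = conn.execute("SELECT description FROM transactions").fetchone()[0]
    memo = conn.execute("SELECT memo FROM splits").fetchone()[0]
    return description, memo
```

Empty strings are used rather than NULL so that anything expecting text still gets text; whether gnc cares about that difference is not something this sketch has checked.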

> Also possible any data relating to the owner of the file
> which is stored in the file/database.

Does your file/database have an obvious owner?  Mine doesn't apart from
the name of the file which is the first and obvious thing to change
before you send it off for someone else to look at.

If you mean bits of text in reports they wouldn't be included in an
SQLite file.

If you mean bits of text in outbound documents I think we've already
zapped them.

Have I missed your point?

Always possible, don't be put off by my rough and tumble impression of
the idiot Trump, I do actually care.

> The combination of the above would
> probably be considered commercially sensitive information and at a personal
> level what banks/service companies etc you deal with might be a possible
> problem if it is in the public domain.

Ummm, that isn't really our problem, David.  If you subscribe to the
"I'm an American and the government supports me" foolishness I'm
wondering why the fuck any of you voted for the imbecile in charge at
the moment!

Any banking account details have already been removed.

Next?

--
Wm






Re: [GNC-dev] Normalizing live data, a suggestion for discussion

Colin Law
Can all users save files as sqlite?  Does that need anything extra
installed on the OS side that may not be there?  Also what about
different builds of GC, do they all have sqlite?

Colin

Re: [GNC-dev] Normalizing live data, a suggestion for discussion

Geert Janssens-4
In reply to this post by GnuCash - Dev mailing list
On Saturday 2 February 2019 10:19:02 CET, Wm via gnucash-devel wrote:

> On 02/02/2019 00:16, David Cousens wrote:
> > As well as the account names you might also want to munge data in the
> > description/memo fields. This can contain identifying information for
> > customers/vendors.
>
> How about we just zap the stuff in description/memo fields by default?
> They're not mathematically significant and rarely cause double entry
> problems unless someone introduces unusual UI stuff in which case they
> should be able to provide an example.
>
> > Also possible any data relating to the owner of the file
> > which is stored in the file/database.
>
> Does your file/database have an obvious owner?  Mine doesn't apart from
> the name of the file which is the first and obvious thing to change
> before you send it off for someone else to look at.
>
> If you mean bits of text in reports they wouldn't be included in an
> SQLite file.
>
> If you mean bits of text in outbound documents I think we've already
> zapped them.
>
> Have I missed your point?
>

Yes, if you use business features, you may have entered business identifying
data in File->Properties. I think that's what David is referring to.
Similarly there may be customer and vendor data (names, addresses) in the book
that should equally be obfuscated. Just random data is fine.

Continuing on that vein, if you have bills and invoices, aside from
randomizing the transaction's split amounts and values you'll also have to do
the same for invoice entries. And to make the book useful for detecting
business data bugs this should happen in such a way that invoice tax and
discount amounts remain consistent after multiplying with random numbers *and*
that the invoice totals continue to match the business transactions amounts in
AR/AP accounts.

And to make that one level more complicated, after that the payment
transactions *also* have to continue to match the new randomized invoice
amount (if the invoice was paid in full).

It doesn't end there, payments can be split over multiple invoices, so again
when one randomizes invoice amounts care must be taken to adjust the payments
in proportion to the invoice amount change or fully paid invoices suddenly can
become partially paid or overpaid.

While this is probably all possible I believe the resulting script will be so
complex that it will become a source of bugs in itself which would divert
developer time to debugging and maintaining this script rather than working on
the effectively reported bug for which a sample data file was asked in the
first place...

For a book with only transactions and no business data at all, it sounded
like a useful tool.

Oh and we haven't mentioned SXs and budgets yet...

As for Colin's question: on Windows and MacOS sqlite is supported out of the
box. On linux it may require the additional installation of a libdbi driver.
Most distros I know have packages for this driver but they may not be
installed by default.

Geert



Re: [GNC-dev] Normalizing live data, a suggestion for discussion

David Carlson-4
On Sat, Feb 2, 2019, 9:25 AM Geert Janssens <[hidden email]> wrote:

> On Saturday 2 February 2019 10:19:02 CET, Wm via gnucash-devel wrote:
> > On 02/02/2019 00:16, David Cousens wrote:
> > > As well as the account names you might also want to munge data in the
> > > description/memo fields. This can contain identifying information for
> > > customers/vendors.
> >
> > How about we just zap the stuff in description/memo fields by default?
> > They're not mathematically significant and rarely cause double entry
> > problems unless someone introduces unusual UI stuff in which case they
> > should be able to provide an example.
> >
> > > Also possible any data relating to the owner of the file
> > > which is stored in the file/database.
> >
> > Does your file/database have an obvious owner?  Mine doesn't apart from
> > the name of the file which is the first and obvious thing to change
> > before you send it off for someone else to look at.
> >
> > If you mean bits of text in reports they wouldn't be included in an
> > SQLite file.
> >
> > If you mean bits of text in outbound documents I think we've already
> > zapped them.
> >
> > Have I missed your point?
> >
>
> Yes, if you use business features, you may have entered business
> identifying
> data in File->Properties. I think that's what David is referring to.
> Similarly there may be customer and vendor data (names addresses) in the
> book
> that should equally be obfuscated. Just random data is fine.
>
> Continuing on that vein, if you have bills and invoices, aside from
> randomizing the transaction's split amounts and values you'll also have to
> do
> the same for invoice entries. And to make the book useful for detecting
> business data bugs this should happen in such a way that invoice tax and
> discount amounts remain consistent after multiplying with random numbers
> *and*
> that the invoice totals continue to match the business transactions
> amounts in
> AR/AP accounts.
>
> And to make that one level more complicated, after that the payment
> transactions *also* have to continue to match the new randomized invoice
> amount (if the invoice was paid in full).
>
> It doesn't end there, payments can be split over multiple invoices, so
> again
> when one randomizes invoice amounts care must be taken to adjust the
> payments
> in proportion to the invoice amount change or fully paid invoices suddenly
> can
> become partially paid or overpaid.
>
> While this is probably all possible I believe the resulting script will be
> so
> complex that it will become a source of bugs in itself which would divert
> developer time to debugging and maintaining this script rather than
> working on
> the effectively reported bug for which a sample data file was asked in the
> first place...
>
> Up until a book with only transactions, no business data at all it sounded
> like a useful tool.
>
> Oh and we haven't mentioned SXs and budgets yet...
>
> As for Colin's question: on Windows and MacOS sqlite is supported out of
> the
> box. On linux it may require the additional installation of a libdbi
> driver.
> Most distros I know have packages for this driver but they may not be
> installed by default.
>
> Geert


Wouldn't it be simpler to create a library of template files designed to
exercise various features, from which a user could pick one to illustrate
his concern?

This would bypass the need to figure out how to sanitize every possible user
file.

If the user wants, he could still build his own example file as some users
do now.

David Carlson


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

Geert Janssens-4
On Saturday 2 February 2019 16:40:34 CET, David Carlson wrote:
> Wouldn't it be simpler to create a library of template files designed to
> exercise various features that a user could find one to illustrate his
> concern?
>
> This would bypass the need to figure out how to sanitize every possible user
> file.
>
> If the user wants, he could still build his own example file as some users
> do now.

Both approaches have benefits and drawbacks.

The number of possible ways something can go wrong in gnucash is near
infinite. Sometimes the problems appear purely due to the amount of data,
sometimes they come from migration issues (migration from older gnucash
versions, ...). It would be equally hard to come up with a set of template
files that would cover all of those.
From that point of view the idea of being able to look at the user's own data
file is attractive, as that is known to illustrate the problem.

But I don't know how feasible it is to effectively obfuscate that data without
resorting to a complex script that may introduce its own set of bugs or
inadvertently also obfuscate the actual issue. The latter is quickly tested,
the former is a time waster.

Geert



Re: [GNC-dev] Normalizing live data, a suggestion for discussion

GnuCash - Dev mailing list
In reply to this post by Colin Law
On 02/02/2019 09:59, Colin Law wrote:
> Can all users save files as sqlite?  Does that need anything extra
> installed on the OS side that may not be there?  Also what about
> different builds of GC, do they all have sqlite?

I'm fairly sure all of the official builds can save SQLite.  If someone
is rolling their own on a platform without the sqlite libraries then I
think it would be unusual for them not to also have access to gnc on one
of the production platforms, the whole idea being that the data should
be easily transferable.

Even if someone didn't have SQLite my suggestion isn't taking something
away from them.  If someone can't save an SQLite file and run a
script, the existing options are still there.

--
Wm





Re: [GNC-dev] Normalizing live data, a suggestion for discussion

GnuCash - Dev mailing list
In reply to this post by David Carlson-4
On 02/02/2019 15:40, David Carlson wrote:
> Wouldn't it be simpler to create a library of template files designed to
> exercise various features that a user could find one to illustrate his
> concern?

To some extent this is already done in the build process.  Life always
throws up something unexpected.  Further, users are by definition lazy
and want the devs to look at *their* data rather than being expected to
trawl through a set of files containing data not relevant to their real
life situation in the hope that one of them shows the fault that, by
definition, shouldn't have existed in the first place.  See the circular
bit?

>
> This would bypass the need to figure out how to sanitize every possible user
> file.

Sanitizing isn't that hard and we don't actually need perfection, just
sufficient so that people are confident that the devs aren't snooping on
them.

> If the user wants, he could still build his own example file as some users
> do now.

The problem is that some people build files that don't work for
everyone; it does say "normalizing" in the Subject line, none of this is
ever going to be compulsory.

--
Wm


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

GnuCash - Dev mailing list
In reply to this post by Geert Janssens-4
On 02/02/2019 16:11, Geert Janssens wrote:
> But I don't know how feasible it is to effectively obfuscate that data without
> resorting to a complex script

The script will be seen by others that do understand sql before anyone
innocent gets to use it, promise.

If the script is well documented (I don't see the point of obfuscated
sql when we are doing something like this as time is not the major
issue, getting the problem fixed is) then people that can read will use it.

Further, most of the actual gnc code is so fucking obfuscated it is
acknowledged only a handful of people can read it, so do you really want
to raise the issue of obfuscation, Geert?

Seriously, people that don't know how code works are already trusting
their financial data to code they have no clue about.  Why is my
suggestion going to increase or decrease trust or increase or decrease
complexity?

Grrrrr.

> that may introduce its own set of bugs

My script cannot introduce a bug, we are normalizing data <-- read that
again, please.

> or
> inadvertently also obfuscate the actual issue.

That is a possibility.  I consider this a positive, not a negative, from a
triage POV.  The user says: "oops, my problem doesn't exist after I ran
the normalizing script" <-- is this good or bad?  If the script is well
documented the user can edit it and run it again, possibly solving the
problem themselves.

> The latter is quickly tested,
> the former is a time waster.

This is a very good point and I repeat, this is not suggested as
compulsory, this is intended to make things easier not harder for people
that do want to report things that may be specific to them without
exposing irrelevant details they may consider private or personal.

--
Wm


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

GnuCash - Dev mailing list
In reply to this post by Geert Janssens-4
On 02/02/2019 15:24, Geert Janssens wrote:

> Yes, if you use business features, you may have entered business identifying
> data in File->Properties. I think that's what David is referring to.

I agree, the third party should not be identified.

> Similarly there may be customer and vendor data (names addresses) in the book
> that should equally be obfuscated. Just random data is fine.

Yes.

Geert, at the moment I am putting guid in place of random, do you think
that is a wrong way to approach this?

Actually, the nearer we get to complete random the less useful the file
becomes.  Actual random data is harder than most people think and pretty
much defeats the purpose if you think about it.

> Continuing on that vein, if you have bills and invoices, aside from
> randomizing the transaction's split amounts and values you'll also have to do
> the same for invoice entries.

I don't think that is true in most situations and even if what you say
is true, I don't see it as a good argument against *attempting* a
normalized book for most people.

> And to make the book useful for detecting
> business data bugs this should happen in such a way that invoice tax and
> discount amounts remain consistent after multiplying with random numbers *and*
> that the invoice totals continue to match the business transactions amounts in
> AR/AP accounts.

There will be situations that involve the person doing the triage
needing to see actual transactions, I have already commented on that.

> And to make that one level more complicated, after that the payment
> transactions *also* have to continue to match the new randomized invoice
> amount (if the invoice was paid in full).

Ummmm, I don't think that is true.  If the munged numbers match (and
they will, that is what the script will do) the transaction stream will
be OK.
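To illustrate why a balanced transaction stays balanced, here is a small sketch in Python. It assumes one multiplier per transaction applied to every split in it; the account ids and amounts are made up, and a transaction spanning several commodities would also need its quantities scaled consistently, which is not shown.

```python
from fractions import Fraction
import random

def munge_transaction(splits, rng):
    """Scale every split value in one transaction by the same factor."""
    factor = rng.randint(2, 9)  # one factor for the whole transaction
    return [(account, value * factor) for account, value in splits]

# Hypothetical transaction: (account id, value) pairs that sum to zero.
txn = [("acc-1", Fraction(-12050, 100)),
       ("acc-2", Fraction(10000, 100)),
       ("acc-3", Fraction(2050, 100))]

rng = random.Random(42)
munged = munge_transaction(txn, rng)

original_sum = sum(value for _, value in txn)
munged_sum = sum(value for _, value in munged)
```

Since the splits sum to zero, any common factor leaves the sum at zero. This only covers balance within a single transaction; it says nothing about amounts that have to agree across transactions.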

It is possible I have missed your point, Geert, but I think it is
looking like I understand the contents of the gnc files better than you :(

> It doesn't end there, payments can be split over multiple invoices, so again
> when one randomizes invoice amounts care must be taken to adjust the payments
> in proportion to the invoice amount change or fully paid invoices suddenly can
> become partially paid or overpaid.

Not true.

Geert, I don't want to say this but I believe you are actually wrong,
for once.

> While this is probably all possible I believe the resulting script will be so
> complex that it will become a source of bugs in itself which would divert
> developer time to debugging and maintaining this script rather than working on
> the effectively reported bug for which a sample data file was asked in the
> first place...

Hmmmm, I accept your point and disagree.

> Up until a book with only transactions, no business data at all it sounded
> like a useful tool.

Be a brave man, Geert, most people don't use the business functions :)

> Oh and we haven't mentioned SXs and budgets yet...

Unless they are material to the file being investigated I suggest we
just delete all SXs and budget stuff.

> As for Colin's question: on Windows and MacOS sqlite is supported out of the
> box. On linux it may require the additional installation of a libdbi driver.
> Most distros I know have packages for this driver but they may not be
> installed by default.

It would be an odd distro that excluded SQLite, it is a requisite for a
lot of other stuff like browsers.  Thinking aloud: maybe a server only
install might not have it or someone stupid enough to put their data on
Amazon might not have it available.  The question then becomes, why was
the person so stupid?

As far as I am concerned this conversation is ongoing, if only because
Geert says he still needs a file from me to replicate a basic problem
that I don't think needs any data from me at all.

--
Wm



Re: [GNC-dev] Normalizing live data, a suggestion for discussion

David Cousens
Wm,

> On 02/02/2019 15:24, Geert Janssens wrote:
>> It doesn't end there, payments can be split over multiple invoices, so again
>> when one randomizes invoice amounts care must be taken to adjust the payments
>> in proportion to the invoice amount change or fully paid invoices suddenly can
>> become partially paid or overpaid.
>
> Not true.
>
> Geert, I don't want to say this but I believe you are actually wrong,
> for once.

In what way is what Geert says here not true?

Payments can be split over multiple invoices.
A single invoice could also have several payments associated with it.

These sorts of situations arise frequently in small businesses where you may
need to micromanage your cash flow.

If, in the randomisation process, you do not apply the same random factor to
all the invoices covered by that payment, then what he says is exactly what
will happen. This means your script will have to detect all of the invoices
related to a payment. OK, it can be dealt with, but again the script
complexity is increased considerably to do so.
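One way the detection step could look, sketched in Python: treat payments and invoices as nodes, treat each payment-to-invoice link as an edge, and give every connected group a single factor via a small union-find. The ids and the `links` list are hypothetical; extracting those links from a real book is the hard part and is not shown.

```python
import random

def group_factors(links, rng):
    """Give every invoice and payment in a linked group the same factor.

    `links` is a list of (payment_id, invoice_id) pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the groups of each linked payment and invoice.
    for payment, invoice in links:
        parent[find(payment)] = find(invoice)

    # Draw one factor per group and hand it to every member.
    factor_of_root = {}
    factors = {}
    for node in list(parent):
        root = find(node)
        if root not in factor_of_root:
            factor_of_root[root] = rng.randint(2, 9)
        factors[node] = factor_of_root[root]
    return factors

# pay-1 covers two invoices; inv-B is also part-paid by pay-2.
links = [("pay-1", "inv-A"), ("pay-1", "inv-B"),
         ("pay-2", "inv-B"),
         ("pay-3", "inv-C")]
factors = group_factors(links, random.Random(0))
```

Here pay-1, pay-2, inv-A and inv-B end up sharing one factor, so a fully paid invoice stays fully paid, while pay-3 and inv-C form a separate group.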

>Most people don't use the business functions

I don't since I retired a few years ago, but I did for 8 years prior to
retiring (and I used MYOB for the 10 years prior to that before escaping). I
am certainly not alone. You could have a proviso that the script won't work
for files using the business functions but that then detracts considerably
from its usefulness as a general diagnostic tool.


SQLite itself and its availability on Linux is not really an issue. Most
distros have it in their software repositories. What may be more of an issue
is that a lot of people who don't use the database backends because they
don't want the additional hassles of learning to use and maintain databases
may be reluctant to install it. It's not that it is all that difficult if
you're familiar with it, but if you are not, it is an additional hurdle
and learning curve. I'm retired. Taking an extra half day to learn something
new doesn't worry me as long as it happens before my time is up. But if I were
running a busy life and/or a business as I used to, I would be more
reluctant. Again not a show stopper, only a limitation on general
applicability.

David Cousens




-----
David Cousens
--
Sent from: http://gnucash.1415818.n4.nabble.com/GnuCash-Dev-f1435356.html

Re: [GNC-dev] Normalizing live data, a suggestion for discussion

Frank H. Ellenberger-3
In reply to this post by GnuCash - Dev mailing list
Hello Wm

Am 01.02.19 um 14:36 schrieb Wm via gnucash-devel:
>
> My suggestion is we ask people to save a *copy* of their data in SQLite
> and they then run a script across that copy that munges and obfuscates
>

Did you see https://wiki.gnucash.org/wiki/ObfuscateScript ?

It is targeting xml files and was uploaded in 2010. So it might be
slightly bit rotten.

Regards
Frank


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

David Cousens
In reply to this post by Stephen M. Butler
Steve,

As Geert pointed out, whole-program testing is very difficult and rapidly
reaches the point where the complexity of the tests is equal to or greater
than that of the program. This is really what gave rise to unit testing, where
you test individual components that each perform a specific function.

One area in which an example file rather than a test file might be useful
is in developing the documentation. The guide, from the sections on Accounts
and Transactions through to Personal Finances, in essence constructs a simple
file while working through the tutorial. Here, though, it is the process of
constructing the data in the file that is useful. A completed example file is
not of great use.

It is also likely that most problems requiring this depth of investigation
will not show up in a test file unless you can execute a series of entries in
a scripted manner, i.e. interact with the GUI from a script, and this is not
possible with GnuCash at the moment AFAIK. The problem is usually somewhere in
the process of getting to the results in the file, and what is in the file is
merely a symptom of the problem.

David



-----
David Cousens
--
Sent from: http://gnucash.1415818.n4.nabble.com/GnuCash-Dev-f1435356.html

Re: [GNC-dev] Normalizing live data, a suggestion for discussion

David Carlson-4
OK, I want to try https://wiki.gnucash.org/wiki/ObfuscateScript but I am
not a computer programmer.  I have no clue how to use it.  Can someone help
me?

David C


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

John Ralls-2


> On Feb 2, 2019, at 8:10 PM, David Carlson <[hidden email]> wrote:
>
> OK, I want to try https://wiki.gnucash.org/wiki/ObfuscateScript but I am
> not a computer programmer.  I have no clue how to use it.  Can someone help
> me?

Run it from a command line using perl, assuming here that you have Strawberry installed on C:

  c:\strawberry\perl\bin\perl.exe ObfuscateScript path/to/myfile.gnucash

Note that it rewrites the file in place, so make a copy and run it on that. The file needs to be uncompressed.
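For anyone unsure whether their file is compressed, a small Python sketch that writes an uncompressed copy and leaves the original alone. It relies on compressed GnuCash XML files being gzip streams, which start with the bytes 0x1f 0x8b; the file names are made up, and the temporary directory only stands in for wherever your data lives.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream

def uncompressed_copy(src, dst):
    """Write an uncompressed copy of `src` to `dst`; never modifies `src`.

    Returns True if `src` was gzip-compressed, False if it was copied as-is.
    """
    src, dst = Path(src), Path(dst)
    with open(src, "rb") as f:
        compressed = f.read(2) == GZIP_MAGIC
    if compressed:
        with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    else:
        shutil.copyfile(src, dst)
    return compressed

# Demonstration with a stand-in file (hypothetical names and content).
workdir = Path(tempfile.mkdtemp())
original = workdir / "myfile.gnucash"
payload = b"<?xml version='1.0'?><gnc-v2></gnc-v2>"
with gzip.open(original, "wb") as f:
    f.write(payload)

copy_path = workdir / "myfile-copy.gnucash"
was_compressed = uncompressed_copy(original, copy_path)
copied_bytes = copy_path.read_bytes()
```

Run the obfuscation on the copy only, so the original is untouched whatever the script does.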

Regards,
John Ralls


Re: [GNC-dev] Normalizing/obfuscating live data

Christian Stimming-4
Am Sonntag, 3. Februar 2019, 17:03:06 CET schrieb John Ralls:
> > On Feb 2, 2019, at 8:10 PM, David Carlson <[hidden email]>
> > wrote:
> >
> > OK, I want to try https://wiki.gnucash.org/wiki/ObfuscateScript but I am
> > not a computer programmer.  I have no clue how to use it.  Can someone
> > help me?

Thanks for the pointer. I've copied this script into our git at
  ./util/obfuscate.pl
The script manages to process my 50MB file in approx. 30 seconds. The account
and txn texts are all nicely obfuscated.

The script also contains a random obfuscation for the amounts, which will
simply cause lots of transactions to the equity account upon loading, but this
could be enabled as well.

In a real data file there are still more places with text that need to be
modified, e.g. the scheduled transaction templates, bayes import matching, and
such. Also, the dates are left unmodified which may or may not be a problem.

Anyone please feel free to check with this script and add more obfuscation
steps. I would like to achieve a state where the script will obfuscate my
personal data file enough so that I feel I can make it available as a test
file.

Usage of the script: Save your normal file as an uncompressed XML file.
Then,

   ./obfuscate.pl  inputfile.gnucash > outputfile.gnucash

(Contrary to the comments, the output is just written to stdout, not in-place
into the file.)

Thanks for the idea here!

Regards,
Christian

 

> Run it from a command line using perl, assuming here that you have
> Strawberry installed on C:
>
>   c:\strawberry\perl\bin\perl.exe ObfuscateScript path/to/myfile.gnucash
>
> Note that it rewrites the file in place, so make a copy and run it on that.
> The file needs to be uncompressed.
>
> Regards,
> John Ralls
>
> _______________________________________________
> gnucash-devel mailing list
> [hidden email]
> https://lists.gnucash.org/mailman/listinfo/gnucash-devel



