
Future of PowerBASIC

Started by Sutthisak Phongthanapanic, September 01, 2013, 12:10:38 PM



Steve Hutchesson

I basically agree with that. While 32 bit is in its twilight there is still a lot of life left in it, and the current versions of PB are both very useful tools. Now it's not that I am biased, but in the 32 bit area MASM is a truly wonderful tool once you get used to its many bad manners. It has never been softened into a friendly consumer toy, it pelts unintelligible error messages at you, and it has a macro engine with very few peers, at the price of being only ever vaguely intelligible, buggy and under-documented. It was clearly designed for the backroom boys and girls at Microsoft, where you had to know its quirks, but it does force you to write technically correct code.

64 bit is the future, but it's not coming all that fast, and for compilers that has a lot to do with Win64's incredibly messy stack design. Where 32 bit PE files had STDCALL, C and any flavour of FASTCALL you wanted with a 4 byte aligned stack, Win64 has a 16 byte aligned stack where you first have to allocate stack space under the current location, then call API functions passing the first four arguments in specified registers and any others on the stack, all while maintaining 16 byte stack alignment. Where the 32 bit PE format was designed by the old VAX guys and was a clean, clear, full 32 bit design, Win64 is a mess, something like what the old 16 bit NE format was for Win3.??, except that it's a 32/64 bit hybrid.
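
As a minimal sketch of what that bookkeeping looks like from the caller's side (this example and its comments are mine, not part of the original post; the register and shadow-space details are the documented Microsoft x64 convention):

#include <windows.h>

int main(void)
{
    // Win32 STDCALL: the compiler pushes all four arguments on a 4 byte
    // aligned stack and MessageBoxA pops them on return.
    //
    // Win64: the same source passes hwnd in RCX, the text pointer in RDX,
    // the caption pointer in R8 and the flags in R9D, the caller must still
    // reserve 32 bytes of "shadow space" below RSP for those four registers,
    // and RSP has to be 16 byte aligned at the CALL instruction.
    MessageBoxA(NULL, "Same source, two very different call sequences",
                "Win32 vs Win64", MB_OK);
    return 0;
}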

We won't see real 64 bit performance until we have full long mode and hardware with terabytes of memory. 32 gig of RAM in Win64 is akin to what 4 meg was in Win16. Whoopee !  ;D

Charles Pegge


I totally agree, the 64 bit calling conventions are horrid, and break the minimal conformity between Linux systems and Microsoft, which we had with CDECL.

It would be far better, in the long term, to adopt a 64 bit CDECL as the universal standard for libraries, and leave the kernel developers to do their own thing. I hope this happens with the next generation of hardware (memristors?).

Patrice Terrier

Steve--

Quote: "We won't see real 64 bit performance until we have full long mode and hardware with terabytes of memory"

Then what about the speed difference between a 64-bit application running on a 64-bit OS and a 32-bit application running on the same 64-bit OS?

...
Patrice Terrier
GDImage (advanced graphic addon)
http://www.zapsolution.com

Steve Hutchesson

It's not a simple question to answer, Patrice, as there are both hardware and software considerations. Almost exclusively, anything to do with multimedia will do better with 64 bit registers and twice as many of them, but it comes at the price of other stuff getting slower. I have 3 quad core boxes handy: the i7 running Win7 64, the 3 gig Core2 quad that was my last dev box with XP SP3, and a NAS box with XP running a 2.5 gig Q6600 quad.

I regularly benchmark algos, and while SSE is clearly faster on the i7, some algos are faster on the much slower Q6600. Later hardware is giving more silicon to SSE and less to the integer instructions that use the 8 GP registers.

Charles Pegge

I think the 64 bit calling convention was for the benefit of kernel developers, not application developers, and this is where the speed advantage is gained.  Most higher-level functions will not benefit from passing parameters in volatile registers.

Patrice Terrier

From my own benchmark, the same SDK application using only the core flat API runs 20% faster in 64-bit than in 32-bit on a 64-bit OS.

Being able to produce either 32-bit or 64-bit from the same source code is very useful for checking this.
Also, I found that several API problems have been fixed in 64-bit that are still there in WOW64.
The C++ compiler itself is also more friendly: no more mangled names in 64-bit (no need for a .def file).
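
As an illustration of the decoration point (my example, not Patrice's code; the function is hypothetical): a 32-bit extern "C" __stdcall export comes out of MSVC decorated, something like _AddValues@8 at the object level, and a .def file is the usual way to export the clean name, while the 64-bit compiler has a single calling convention and exports the plain undecorated name.

// Hypothetical DLL export used only to show the difference in decoration.
// 32-bit MSVC: the __stdcall symbol is decorated (roughly _AddValues@8), so a
//              .def file is the usual way to export a clean "AddValues".
// 64-bit MSVC: only one calling convention exists and the extern "C" name
//              is exported as plain "AddValues" with no decoration.
extern "C" __declspec(dllexport) int __stdcall AddValues(int a, int b)
{
    return a + b;
}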

...

Patrice Terrier
GDImage (advanced graphic addon)
http://www.zapsolution.com

Steve Hutchesson

Hi Patrice,

There will certainly be types of code that show the advantages of 64 bit, and most probably the type of advanced work you do will benefit the most from it, but many others will not. I have attached 2 versions of a Microsoft tool called ZOOMIN which is available in the Win2000 SDK. While I cannot post the source code due to licensing conditions, I have built both a 32 and 64 bit version using almost identical Microsoft code, and about the only difference is that the 64 bit version is twice the size for no performance gain.

Win7 64 makes a mess of the selection rectangle display but both versions work. I have attached the two versions to compare.

Patrice Terrier

Steve--

Without knowing the purpose of that code, and without seeing how it is written, it is hard to say anything about your ZOOMIN post.

For my own code I see a difference in size, mainly because I am using only UNICODE in 64-bit while I am still using ANSI in 32-bit.

For example, the size of my OpenGL pix3D project is 90 KB in C++ 64-bit, and 31 KB in PowerBASIC 32-bit.
But this is because of the use of the compiler flag Multithread (/MT) that embeds the runtime library inside the EXE. Without embedding the runtime inside the EXE, or with a tiny.c CRT, the size would be the same.
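
For reference (my summary, not part of Patrice's post; the file name is a placeholder), these are the standard MSVC runtime-library switches being referred to:

rem Statically link the CRT into the EXE (larger binary, no CRT DLL needed)
cl /MT pix3d.cpp

rem Link against the CRT DLL instead (smaller binary, the runtime DLL must be present)
cl /MD pix3d.cpp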

...

Patrice Terrier
GDImage (advanced graphic addon)
http://www.zapsolution.com

Frederick J. Harris

Are you saying, Patrice, that your x64/x32 comparisons use wide characters in the x64 versions and narrow in the x32 versions? If so, I'm thinking that could easily account for a 20% speed difference.

I have done some comparisons myself, particularly involving string buffer manipulations, and my wide character runs are invariably slower than my ANSI runs, presumably because the buffers and memory allocations are twice as big. So in my limited tests I'm presuming the timing differences are due not to x64/x86 differences but to ANSI versus wide.

Patrice Terrier

I am saying that I am using UNICODE with 64-bit C++, and ANSI with PowerBASIC 32-bit.

In C++, whether 32-bit or 64-bit, I am always using UNICODE.

And for me the C++ 64-bit version is faster than the C++ 32-bit version.

...

Patrice Terrier
GDImage (advanced graphic addon)
http://www.zapsolution.com

Steve Hutchesson

I don't think there is a debate here; some things will be faster but some slower. The 2 versions of ZOOMIN are both UNICODE, one built with VC2003 in 32 bit, the other with VC2010 in 64 bit.

Now while it certainly makes sense for the advanced work Patrice is doing, I mainly write tools these days, and you pay every price in terms of size and performance in 64 bit when writing tools, especially those that have to work in non-SSE data sizes. Most complex algos do not get faster in 64 bit but often get slower. I have written both 64 bit and 128 bit code in Win32 using later SSE instructions, and in areas where streaming fits the task they produce some very high speed results, but many tasks cannot be done with streaming instructions.
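
As a rough illustration of the kind of 128 bit streaming code being described (my sketch, not Steve's code), here is a buffer fill done 16 bytes at a time with SSE2 intrinsics next to the plain byte loop; the SSE version flies when the data streams nicely, and simply does not apply to branchy, irregular work:

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>

// Fill a buffer with a byte value, 16 bytes per iteration, using SSE2.
void fill_sse2(unsigned char* dst, unsigned char value, size_t count)
{
    __m128i v = _mm_set1_epi8(static_cast<char>(value));
    size_t i = 0;
    for (; i + 16 <= count; i += 16)
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), v);  // one 128 bit store
    for (; i < count; ++i)
        dst[i] = value;                                            // scalar tail
}

// The equivalent byte-at-a-time loop that runs through the integer registers.
void fill_bytes(unsigned char* dst, unsigned char value, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = value;
}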

My beef is not with 64 bit, it's with the implementation of Win64. What I hope for as the hardware gets better is the same type of shift we saw from the hybrid 16/32 bit Win95 OEM to Win2000, which was close to full 32 bit. Shifting from the hybrid 32/64 bit of current 64 bit Windows to a full long mode 64 bit will see some big performance gains, but only if the tools get a lot better, and I am not going to hold my breath waiting.  ;D

Frederick J. Harris

Here are some results from a nice little test I just ran from some work I was doing a couple of years ago, when the issue came up of PowerBASIC's speed in comparison to C and C++. The interesting thing at the time was that some MSVC compilations were killing PowerBASIC in the same tests - by about a factor of 10! When Paul Dixon disassembled the VC code he discovered something very interesting: the compiler was examining the algorithm, determining it wasn't efficient, and re-writing it! In other words - optimization! The asm code generated by the compiler wasn't anything like the PowerBASIC code, which just translated the source 'as is' into machine instructions. So to make the comparison useful, John Gleason suggested something more complicated than those little ditties all the compiler writers know about and hone their code against, which results in 'tainted' speed results. Anyway, here's John Gleason's algorithm - slightly modified by me ...


// Exercise
// =======================================
// 1) Create a 2MB string of dashes;
// 2) Change every 7th dash to a "P";
// 3) Replace every "P" with a "PU" (hehehe);
// 4) Replace every dash with an "8";
// 5) Put in a CrLf every 90 characters;
// 6) Output last 4K to Message Box.


I'll shortly post one of my many C++ examples that implement this, but here are my results, averaged over 10 runs each...


x86 32 bit code
===================================================
32 bit ansi string buffers, i.e., 2,000,000 chars and 2,000,000 bytes      18.6 ticks
32 bit wide string buffers, i.e., 2,000,000 wchars and 4,000,000 bytes     31.5 ticks

x64 64 bit code
===================================================
64 bit ansi string buffers, i.e., 2,000,000 chars and 2,000,000 bytes      28.1 ticks
64 bit wide string buffers, i.e., 2,000,000 wchars and 4,000,000 bytes     45.2 ticks



Narrow  x86
===========

31
15
15
16
16
31
16
15
16
15
===
186  186/10 = 18.6 ticks


Wide    x86
===========

16
32
31
16
47
32
32
31
47
31
===
315  315/10 = 31.5 ticks


narrow  x64
===========

31
15
47
16
16
31
47
31
16
31
===
281  281/10 = 28.1 ticks


wide  x64
=========
47
47
47
46
32
46
47
46
47
47
===
452  452/10 = 45.2 ticks


As can be seen above, ANSI is faster than wide character, and 32 bit is faster than 64 bit. The slowest is Unicode under native 64 bit, and the fastest is ANSI in 32 bit mode. I used the MinGW GCC x86/x64 compiler for the x64 compilations, and an older MinGW GCC 32 bit compiler for the 32 bit compiles. Here is the source. To compile for Unicode just uncomment the defines at the top...


//#ifndef UNICODE
//#define  UNICODE      //strCls34U.cpp
//#endif
//#ifndef _UNICODE
//#define  _UNICODE
//#endif
#include <Windows.h>  //for MessageBox(), GetTickCount() and GlobalAlloc()
#include <tchar.h>
#include <string.h>   //for strncpy(), strcpy(), strcat(), etc.
#include <cstdio>     //for sprintf()

enum                                              // Exercise
{                                                 // =======================================
NUMBER         = 2000001,                        // 1) Create a 2MB string of dashes;
LINE_LENGTH    = 90,                             // 2) Change every 7th dash to a "P";
NUM_PS         = NUMBER/7+1,                     // 3) Replace every "P" with a "PU" (hehehe);
PU_EXT_LENGTH  = NUMBER+NUM_PS,                  // 4) Replace every dash with an "8";
NUM_FULL_LINES = PU_EXT_LENGTH/LINE_LENGTH,      // 5) Put in a CrLf every 90 characters;
MAX_MEM        = PU_EXT_LENGTH+NUM_FULL_LINES*2  // 6) Output last 4K to Message Box.
};

int __stdcall WinMain(HINSTANCE hInstance, HINSTANCE hPrevIns, LPSTR lpszArg, int nCmdShow)
{
TCHAR szMsg[64],szTmp[16];             //for message box
int i=0,iCtr=0,j;                      //iterators/counters
TCHAR* s1=NULL;                        //pointers to null terminated
TCHAR* s2=NULL;                        //character array bufers

DWORD tick=GetTickCount();
s1=(TCHAR*)GlobalAlloc(GPTR,MAX_MEM*sizeof(TCHAR));  //Allocate two buffers big enough to hold the original NUMBER of chars
s2=(TCHAR*)GlobalAlloc(GPTR,MAX_MEM*sizeof(TCHAR));  //plus substitution of PUs for Ps and CrLfs after each LINE_LENGTH chunk.

for(i=0; i<NUMBER; i++)                // 1) Create a 2MB string of dashes
     s1[i]=_T('-');

for(i=0; i<NUMBER; i++, iCtr++)        // 2) Change every 7th dash to a "P"
{
     if(iCtr==7)
     {
        s1[i]=_T('P');
        iCtr=0;
     }
}

iCtr=0;                                // 3) Substitute 'PUs' for 'Ps'
for(i=0; i<NUMBER; i++)
{
     if(_tcsncmp(s1+i,_T("P"),1)==0)
     {
        _tcscpy(s2+iCtr,_T("PU"));
        iCtr+=2;
     }
     else
     {
        s2[iCtr]=s1[i];
        iCtr++;
     }
}

for(i=0; i<PU_EXT_LENGTH; i++)         // 4) Replace every '-' with an 8;
{
     if(s2[i]==_T('-'))
        s2[i]=56;   //56 is '8'
}

i=0, j=0, iCtr=0;                      // 5)Put in a CrLf every 90 characters
while(i<PU_EXT_LENGTH)
{
    s1[j]=s2[i];
    i++, j++, iCtr++;
    if(iCtr==LINE_LENGTH)
    {
       s1[j]=13, j++;
       s1[j]=10, j++;
       iCtr=0;
    }
}
s1[j]=0, s2[0]=0;
_tcsncpy(s2,&s1[j]-4001,4000);         // 6) Output last (right most) 4 K to
s2[4000]=0;                            //    MessageBox().
tick=GetTickCount()-tick;
_tcscpy(szMsg,_T("Here's Your String John In "));   //Let me clue you in on something.
_stprintf(szTmp,_T("%u"),(unsigned)tick);           //You'll get real tired of this
_tcscat(szMsg,szTmp);                               //sprintf(), strcpy(), strcat()
_tcscat(szMsg,_T(" ticks!"));                       //stuff real fast.  It'll wear you
MessageBox(0,s2,szMsg,MB_OK);                       //right into the ground!
GlobalFree(s1), GlobalFree(s2);

return 0;
}


I might add that a 2,000,000 byte string is kind of tight for using low resolution GetTickCount().  For real fast machines you might want to make the string 10 MB or whatever.
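
For what it's worth (my sketch, not part of the original post), QueryPerformanceCounter gives far finer resolution than GetTickCount() if anyone wants to re-run the test without inflating the string:

#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);   // counts per second
    QueryPerformanceCounter(&start);

    // ... the work being timed goes here ...

    QueryPerformanceCounter(&stop);
    double ms = (stop.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
    printf("%.3f ms\n", ms);
    return 0;
}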



Frederick J. Harris

By the way Hutch, I downloaded your MASM package the other day and installed it.  Nice!!!  I do hope I can find the time to get back into asm.  I did a lot of it many years ago but that DOS stuff is ancient history.  I'd really like to try to translate that C code above into masm and see how it runs.  I just need to pry myself away from some other stuff I'm working on that likely could wait!

Patrice Terrier

Since one of the programming languages I am using has been translated to Mandarin, I have no other choice than to use UNICODE. And also because it is mandatory for use with GDIPLUS.
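
As a side note (my example, not Patrice's code), the GDI+ text APIs take only WCHAR strings, so an ANSI caller ends up converting everything anyway; a typical helper looks something like this:

#include <windows.h>
#include <string>

// Convert an ANSI string to the wide form that GDI+ (and most modern Win32
// APIs) expect. CP_ACP assumes the system ANSI code page.
std::wstring ToWide(const char* ansi)
{
    int len = MultiByteToWideChar(CP_ACP, 0, ansi, -1, nullptr, 0); // size incl. null
    if (len <= 0) return std::wstring();
    std::wstring wide(static_cast<size_t>(len), L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi, -1, &wide[0], len);
    wide.resize(static_cast<size_t>(len) - 1);   // drop the embedded terminator
    return wide;
}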

Now that I am able to offer both 32-bit and 64-bit solutions, I no longer say NO to my users, and they can select the version that fulfils their requirements.

The lack of a 64-bit version of PowerBASIC, plus the fact that it is frozen in time, is the reason why I added C++ to my tool box.

Pragmatism is my motto.

...

 
Patrice Terrier
GDImage (advanced graphic addon)
http://www.zapsolution.com

Steve Hutchesson

Currently playing with my VC 64 bit setup; I absolutely refuse to use that terrible IDE and all of the claptrap that goes with it. I have all of the VC2010 libraries and the matching SDK libraries, include files for both, and the basic templates up and going. My C is very rusty, having been writing MASM for too long, but it comes back pretty quickly so it's no big deal. The typical Microsoft installation was the usual mess: the binaries worked except for cvtres.exe, which was broken, so I had to go hunting for it. I found a reference on the net to it being in a deep subdirectory of Windows. I could use Pelle's linker but was trying to get the full Microsoft version up and going.

I have the base window up at 8.5k with an icon, menu, "amd64" manifest and version control block. If I remove the MSVCRT support it jumps to about 90k. The real win, apart from being able to hammer out some utilities in the future, is the ASM output, which at last gives me a decent look at what 64 bit ASM looks like from a compiler.
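
For anyone wanting the same view (my note, not from the original post; the file name is a placeholder), the listing comes from the standard /FA family of cl switches, with /FAs interleaving the source and the generated assembly:

rem Compile only, optimise for speed, and write window.asm next to the object file
cl /c /O2 /FAs window.cpp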

It's a shame Bob passed away before he could finish the 64 bit version; most knew that he was working on it, but as usual he kept it close to his chest. I guess no-one elects the time they pass away, and with the disarray that followed it appears that it was not expected. I can live with the current 32 bit versions as they do a lot of things well. Most of it was ignored by the "Mickey Mouse Club", but I know Bob did not half kill himself getting the extra capacity up and going just for it to be ignored.