Recent blog entries for lethal

11 Mar 2007 (updated 11 Mar 2007 at 12:41 UTC)

Notes on SH interrupt/exception dispatch path

Some people are seemingly confused about the semantic changes that have happened for the unified exception dispatch path in the SH-3/4 code, so it's probably worth reiterating what changed, and why you don't want to touch the interrupt exception tables.

Most of the exceptions (especially general exceptions and interrupt exceptions) are immediately bounced through handle_exception once the exception code has been appropriately stashed in r2, with the return path sitting in r3. Traditionally this has included the EXPEVT value in the general exception case, and INTEVT for the interrupt exceptions, which could then be used for calculating the offset into a flat jump-call table (exception_handling_table). This worked well for general exceptions, but rather less so for interrupt exceptions. In the IRQ case we ended up with many CPU subtypes with very sparse IRQ maps that would only be interested in selectively enabling do_IRQ() dispatch for a handful of vectors. While this worked fine for a very small number of CPU subtypes, it very quickly got out of hand and turned into a giant ifdef fiasco that was highly prone to off-by-one vector enabling and other ugly things.

In order to get rid of all of these accursed tables, the exception code read-in and the handle_exception dispatch needed a bit of rework. The regular case is that general exceptions exist first in the vector table, with the interrupt exceptions following afterwards. There are some minor corner cases where there is overlap, but those vectors can be overloaded by the CPUs that need special care.

Enter the interrupt exception marker. With the marker scheme, a simple marker is placed in r2 to signify a do_IRQ() fast-path while deferring the INTEVT read. This is then looked at when figuring out whether to take the r2 value as a jump-call table offset or whether to dispatch directly. There are some additional notes regarding this in the tail end of handle_exception for anyone that's too concerned.
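For anyone who would rather see the shape of this in C than in assembly, here is a rough model of the decision made at the tail end of the dispatch. It is only a sketch: IRQ_MARKER, the stub handlers, the tiny table, and dispatch() itself are made-up names for illustration, and the real code does all of this in entry.S with registers rather than function arguments. The one assumption carried over from the hardware is that event codes are spaced 0x20 apart.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker: any value that can never be a valid event code. */
#define IRQ_MARKER   (-1L)
#define NR_VECTORS   8          /* tiny table, for illustration only */

typedef void (*handler_t)(void);

static int last_dispatch;       /* records which path ran, for inspection */

static void do_irq_stub(void)      { last_dispatch = 1; }
static void general_exc_stub(void) { last_dispatch = 2; }

static handler_t exception_table[NR_VECTORS];

/*
 * 'code' stands in for whatever the entry code left in r2. The marker
 * takes the IRQ fast path; anything else is treated as an exception
 * code and used to index the jump table.
 */
static void dispatch(long code)
{
	if (code == IRQ_MARKER) {
		do_irq_stub();          /* INTEVT would only be read in here */
		return;
	}

	size_t vec = (size_t)code >> 5; /* event codes are 0x20 apart */
	assert(vec < NR_VECTORS && exception_table[vec] != NULL);
	exception_table[vec]();
}
```

The point of the marker is visible in the first branch: the IRQ path never touches the table, so no per-CPU table entries are needed for interrupt vectors at all.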

The only pitfall with this scheme is that the vector tables have to be padded out to a fixed length in order to allow setting specific exception handlers that happen to reside far out in the table (as is the case with some of the FPU exceptions). Two new routines have been added for this purpose, set_exception_table_evt() and set_exception_table_vec(). The use of both of these is fairly obvious to anyone looking at the vector tables, so there's no point in reiterating it here.
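As a rough illustration of why the padding matters, the following sketch models a fixed-length table with two setters, one taking a vector index and one taking an event code. The names set_table_vec() and set_table_evt() are deliberately not the kernel's, NR_VECTORS is an arbitrary length picked for the example, and the evt >> 5 conversion assumes event codes spaced 0x20 apart. Consult traps.c for what the real routines do.

```c
#include <assert.h>
#include <stddef.h>

#define NR_VECTORS 128  /* hypothetical fixed length the table is padded to */

typedef void (*handler_t)(void);

static void unhandled(void)      { }
static void fpu_error_stub(void) { }

static handler_t exception_table[NR_VECTORS];

/* Pad every slot with a default handler so far-out slots always exist. */
static void pad_table(void)
{
	for (size_t i = 0; i < NR_VECTORS; i++)
		exception_table[i] = unhandled;
}

/* Install a handler by vector index, returning the previous one. */
static handler_t set_table_vec(unsigned int vec, handler_t handler)
{
	assert(vec < NR_VECTORS);
	handler_t old = exception_table[vec];
	exception_table[vec] = handler;
	return old;
}

/* Install a handler by event code (assumed to be a multiple of 0x20). */
static handler_t set_table_evt(unsigned int evt, handler_t handler)
{
	return set_table_vec(evt >> 5, handler);
}
```

Without the padding, a handler sitting at a high event code (0x800 is used in the check below purely as an example) would index past the end of a table that only covers the low exception codes.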

In practice this hasn't worked out too badly:

4 files changed, 40 insertions(+), 721 deletions(-)

For additional reading, consult arch/sh/kernel/traps.c, arch/sh/kernel/cpu/sh3/entry.S and arch/sh/kernel/cpu/sh[34]/ex.S.

I finally got around to starting to test the 64KB page size on SH-4 and SH-4A, when I ran into some rather annoying behaviour. We currently use THREAD_SIZE in a couple of places, namely where we switch from the kernel to user stack, and for fetching the current thread info on nommu. This used to be open-coded for 8k stacks, but got a bit of an overhaul when 4k stacks got introduced. Now we effectively have something like:

mov #(THREAD_SIZE >> 8), reg
shll8 reg

as we're constrained by the ability to do a large immediate load, and simply having the pre-processor shift the constant and then shift it back via shll8 is still far faster than a memory lookup. This worked fine for 4k and 8k pages, but we manage to overrun the immediate size by 1 bit using 64k pages.

Unfortunately we're somewhat constrained by the instruction set. While shll8 and friends exist pretty much across the board, the variants that shift by a loaded immediate are restricted to later CPUs, which is rather unfortunate, as PAGE_SHIFT would save a lot of trouble here. We also have a 20-bit capable immediate load, but that's likewise constrained to later CPUs.

The only portable solution where we can still save the memory access is to just shift it down another 2 bits and then pack in an shll2 to get back to the full size, so we just end up with:

mov #(THREAD_SIZE >> 10), reg
shll8 reg
shll2 reg

but this is not a very appealing solution, and it wastes a cycle for something that's effectively a corner case. A dynamic shift would still cost us the cycle, but would at least provide some future proofing. On the other hand, the likelihood of someone adopting a system page size larger than what we can address as an immediate when shifted down 10 bits is quite low. We can still expand on this model with one more shll2 if it does become a problem, though the most we can shift down THREAD_SIZE is 12 bits, which happens to equate to PAGE_SHIFT for 4k pages. After this we're effectively screwed.
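The arithmetic is easy to check in plain C. The sketch below assumes the value being loaded goes through mov #imm, which takes an 8-bit sign-extended immediate (so -128 to 127), and that the shifts back up total the same number of bits the pre-processor shifted down. fits_imm8() and loadable() are hypothetical helper names, not anything in the tree.

```c
#include <assert.h>
#include <stdbool.h>

/* "mov #imm, Rn" takes an 8-bit immediate that is sign-extended. */
static bool fits_imm8(long value)
{
	return value >= -128 && value <= 127;
}

/*
 * Can 'size' be materialized as mov #(size >> shift) followed by left
 * shifts totalling 'shift' bits (shll8, or shll8 + shll2, and so on)?
 */
static bool loadable(unsigned long size, unsigned int shift)
{
	assert(shift < 32);

	/* Low bits must be zero, or shifting back up would lose them. */
	if (size & ((1UL << shift) - 1))
		return false;

	return fits_imm8((long)(size >> shift));
}
```

With a shift of 8, 4 KiB and 8 KiB give 16 and 32, which fit, while 64 KiB gives 256, which does not. With a shift of 10, 64 KiB gives 64 and fits, but 128 KiB gives 128 and is already over the limit, which is where the extra shll2 would come in.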

And it looks like I need to revisit the PTRS_PER_PGD math again too. Grumble.

770 Notes

pycage, while MPU-side decoding is the easiest way to go, DSP-side will still be beneficial (albeit somewhat more complicated). Whether the benefits are worth the effort is another matter. The tools that you need to roll your own codecs are available, and you can do this mostly in C without having to resort to too much tms320c55x assembly. The biggest issue is likely familiarizing yourself with the DSP kernel, the socket node interfaces, and so forth. Most of this is documented pretty well at the dspgateway page.

For the adventurous, there's still an unused mailbox line between MPU and DSP on 1710 in the current implementation that could probably be round-robin'ed pretty easily. We also presently don't make use of hardware page table walking, which makes the exmap interface a bit clunky (essentially wiring TLB entries by hand, but at least they're pre-faulted).

It would also be interesting to see how the FP-driven codecs compare to the integer-based one under EABI with a soft-float toolchain. ogg123 might even be usable out of the box with soft-float (though at likely higher than the CPU utilization numbers that have been quoted). On another note, it's also pretty easy to figure out DSP load average through the sysfs interface, so it may be worthwhile to profile some of that, especially if the DSP ends up getting more heavily loaded.

Haven't posted here in a while. Work is keeping me busy. As is getting the kernel running on SH-2A on the MS7206SE01 board.

On the sh front, things have been progressing nicely with the new clock and timer frameworks. The timer stuff is still in need of being extended to more transparently deal with multiple timer channels, but this can wait until the timesource driver stuff on l-k sorts itself out. No use redoing the timer stuff twice..

On another note, the cpufreq driver still needs to be reworked for the clock framework as well. This will still take a bit of doing, but in the end it should leave us with a single driver capable of dynamic scaling on every CPU subtype that hooks in to the clock framework (this will go on the TODO list for now).

With sh64, things have also been pretty quiet. Ran into some fairly consistent slab corruption that seems to have only popped up in recent kernels, suppose it's time to dig out the redzoning for non-BYTES_PER_WORD minaligned architectures patch and get slab debugging working again. Unfortunately the UW SCSI drives I was using that managed to trigger this on my Cayman both ended up killing themselves. Let's see how far we get with onboard IDE.. judging by the schematics, at least PIO was wired right, and should mostly work (DMA on the other hand..). Some of the GPIO configuration in the SuperIO is probably still off (since much of that was borrowed from microdev), so it seems there will be more than one thing to debug..

And just to show how often I actually log in to this thing, I seem to have had this following paragraph started, which was amusingly retained (from some time in 2003):

--
More uClibc hacking today and the last couple of days. Started working on the shared loader backend for sh64, which is now at the point where most of the work is done, but now there's just a lot of debugging and testing left. At least some good has come out of it so far, it turned out that the R_SH_IMM_MEDLOW16 relocation was broken in multiple ways in glibc, so I ended up fixing that while writing up the relocation handling code for uClibc. Regardless, the uClibc stuff is in pretty good shape now, so the next logical step is to start tinkering with buildroot and friends, though that will still have to wait till after some more debugging time.
--

The ironic thing is that years later, the sh64 ldso stuff needs to be fixed again due to some ABI changes, though I have so far been successfully putting it off. ldso is vindictive ;-)

Disclaimer: As nothing really interesting has been happening lately, be forewarned that this entry will be somewhat dry and generally boring, even if for some reason you _are_ interested in the state of Linux/SH-2 support.

Lots of SH-2 hacking lately, quite exhausting, though still quite fun. The VBR semantics are completely different in relation to the SH-3/4, so this buys entry.S a much needed overhaul. Unfortunately this also required some changes in semantics, at least on the SH-2 side for the general-purpose exception handling code -- though this is all quite hacky already, especially given the number of different registers and register names, etc.

Another minor nuisance, gcc sanely labels things like saving off ssr as an SH-3 and up instruction, but binutils subsequently defaults to accepting virtually anything as valid. binutils CVS now seems to properly support a processor family flag that clearly defines this, so that should be dealt with relatively well once I get finished hacking that.. this will be an interesting contrast to gcc flags by ABI level, so hopefully that will all work out cleanly. Between that and the latest -fno-zero-initialized-in-bss mess with 3.3, I definitely hope we won't need more stupid gcc/binutils version specific checks for the kernel build, as these are already starting to add up..

Additionally, the fixed references to arch/foo/kernel/vmlinux.lds.S in the top-level kernel Makefile are truly annoying. This now forces anyone who wants to use multiple ld scripts to either make a wrapper script with ifdef abuse, or do gross symlinking hacks at build time. This is certainly a disappointing step back in comparison to the 2.4 behavior..

Back to the SH-2 issue, it should be a lot easier to identify what still needs to be done (other than things like the system call interface, which still needs cleanup for things like TRA referencing, INTEVT/EXPEVT stuff I just finished) once the aforementioned binutils issues are out of the way. It's quite bothersome to identify problem spots when the assembler will knowingly accept accesses to things like different register banks and ssr/spc, etc. even when these don't actually exist on the SH-2.. though I'm sure there will be quite a few. At least now with the exception vector, early SCI console, XIP, etc. out of the way, we should be set to actually start debugging on live silicon.. Now the only other trick is getting the page_alloc2 stuff updated and merged, and getting the overly pesky inode and dentry cache hash tables reduced in size -- there's not a whole lot of room when you've only got 512KiB of RAM to work with..

Also got some 7760 IPR patches sitting in my home directory, this is pretty much the last remaining portion of the 7760 backend that needs to go in (I did the exception vector / sh-sci / etc. stuff previously). So this is definitely good news, even though it reminds me that I still need to get the 7040/7044/7045/etc. stuff figured out and written..

Lastly, also got some uClibc hacking done. Some relatively uneventful sh64 syscall updates to satisfy current busybox, etc. Just finished off the pthreads work, so now we should be good to go for static pthreads.. that still leaves the ldso work, but that can wait for another day (particularly as it's rather mind numbing). After that, we should be able to start doing sh64 builds under buildroot, should be fun.

Well, decided to give gnome 2.4 a try. This proved to be rather entertaining, as the last time I attempted to build gnome from source was many years ago, and that required much hacking just to get the thing to pretend to build. Gave garnome a try, and that seemed to work pretty well, though several packages needed some persistent prodding before they wanted to work on my rh7.3 workstation. Now just have to wait and see how painful this will be under osx. Also, in regards to all of the recent gnome-blog traffic in the recentlog, I was surprised to have it die randomly after hardly any text entry. I'll stick with mozilla and safari for now.

Got around to starting on some DocBook stuff for sh, which also proved to be interesting. Most of it is behaving quite well, except I seem to be getting duplicated description entries from referenced source. I've not seen this before, and don't see anything obvious looking at the parser. This needs more investigation. Oddly enough, this seems to only occur on certain source files, and is completely isolated to the description, as all other fields are parsed correctly. Most irritating.

Minor other work on the sh tree. Added in compatibility hooks for the old ISA DMA API to wrap to the SH DMA API, which did a pretty decent job of outlining a lot of the limitations with the old API on this particular hardware. However, for anyone wanting less-than-exciting single-address DMA transfers without hacking things for the new API, this seems to work just fine. We also now do proper cpu flag reporting as well as some cache reporting in cpuinfo, though nothing particularly exciting.

Spent a bit of time working on AICA / SPU related things on the DC today. Started out writing a module for the g2 dmac to do spu dma directly from the aica channel, though this still needs much debugging. So far we don't seem to be able to keep consistent data in registers (i.e., write out a p2seg addr and get back the same address in an entirely different segment). Other things always read 0, which in itself isn't a problem, but the lack of the completion interrupt firing certainly creates some issues. The joys of undocumented hardware.. back to the wince dump.

On a similar note, we can use a channel on the sh dmac itself for writing out the buffer, which is what is happening for testing now. Unfortunately since we only have 4 channels on the 7750, this isn't an option. At least now I can look at optimizing some of the completion / signalling code in the subsystem so we don't use quite as many cycles (polling for residue sucks). However, the good news is, even when we're constantly polling for residue, cpu usage is still down from the old manual copy / wait on fifo method. Once the remaining performance things are ironed out, there should be no problems dealing with any high-bitrate samples thrown at it. This will be even more fun once zx80user gets his alsa driver finished and supporting all of the aica channels, instead of just the two channels supported by the oss driver.

Merged / cleaned / rewrote random parts of SnapGear's SH-DSP patch from the uClinux tree a few days ago, which turned out to be quite fun. As a result, rewrote most of the cpu init code, which should now be much easier to work with (especially for adding probe hooks / setting cpu flags). So far this is proving to be quite clean.

Also got most of the 8139too hacks cleaned up and merged in both 2.4 and 2.6, which handily knocks off another thing from the HEAD TODO list, and nukes yet more common code cruft from CVS.

More random tree maintenance. Cleaned up random bits of the SMP code, and made the SMP kernel compile again. Though it will likely be a while before I'll get around to testing this. Being the sole user of the SMP code however, makes this perfectly reasonable.

Quite a bit of this stuff also needs documenting, which I can safely say is one of my least favourite activities (which Documentation/sh seems to reflect). Perhaps it's time to take another look at the DocBook stuff, as it would be quite nice to organize most of this into a general sort of sh architecture guide, instead of just random text files. This also incidentally happens to be another point on the TODO list. Now I just need the motivation to write documentation instead of hacking on code. The general tediousness of debugging SPU DMA might spur this on quicker than anticipated.

Falling behind on posting again, so here's a quick recap on what I've been hacking on recently.

Made quite a lot of progress on everything DMAC related the last few days. We're now much closer to something resembling a real subsystem, though there's still a few minor things to work out (including polling threads per DMA engine for doing large unblocked transfers). As a test, I also wrote a quick and dirty clear_page() and copy_page() using a dual-address mode configured channel on the SH DMAC, which ended up working quite well, despite some icache oddities in the clear_page() case which still need further debugging.

With the birth of the new SH-specific dma subsystem, I was also prompted to move the PCI stuff around again, and now we have a nice new shiny arch/sh/drivers/ where pci and dma stuff live. In the future, I suppose I'll move the sh cpufreq stuff here as well, as it really has no place in the arch/sh/kernel/ hierarchy .. though that's something for another day.

Anyways, now that I've got dual-address mode DMA as well as cascading to the PVR2 DMA in the Dreamcast, I suppose it's time to try to figure out how to tie this into pvr2fb in some sane fashion. For one, I don't seem to be able to use the user address of the write buffer as a source address for the DMA transaction, so something needs to be done here so we don't have to have the copy_from_user() overhead before we can start up the DMA transfer. Not entirely sure how DRM/DRI deals with this sort of stuff, though I suppose that's the next logical point to start looking at.

This would be much less of a headache in uClinux.. ;-)

Falling somewhat behind in posting frequency, so I suppose now's a good time to make yet another entry. Nothing overly eventful lately, managed to finish off most of the remaining issues with the store queue API I've been working on, which was nice. Unfortunately the only way I could get the cleanup and flushing for userspace mappings implemented in a clean fashion entailed adding back in the unmap and sync ops to the vm_operations_struct. These seem to have been removed in 2.4-test time, mostly because no one was using them. Hopefully it won't be too much of a fight to try and get these merged back into 2.6 proper .. otherwise it'll just be another thing stopping sh from working out of the box on vanilla 2.6.

On another note, it appears that mrbrown's ps2 "exploit" has been slashdotted. This wouldn't be much of a problem, except for the fact that that exploit happens to be posted on the same machine I use as a mailserver and for my IRC sessions. Lag suddenly has new meanings. This particular exploit is quite exciting though, I'm almost tempted to take one of my ps2s and write a native pong clone that doesn't happen to be RTE or reload1 encumbered .. though I shouldn't get carried away. Unfortunately for some (particularly certain petty individuals with some inferiority issues to work out), this is instantly viewed as a method for furthering rampant piracy. Regardless, this certainly seems like good news for the ps2dev community.

Did a number of non-kernel things today -- which in itself doesn't seem to happen very often. Ported uClibc CVS HEAD to sh64 this morning, based off of some ancient patches from SuperH. Nothing too exciting port-wise, static linking is the only thing that really works at the moment, as I still have to sit down and hack on the ldso and libpthread interfaces, but I'll leave that for another day. A static hello world stripped comes in at a hefty 41k, which certainly has glibc beat. Patches off to andersee, and should hopefully be merged soon. So far so good.

Also spent some time hacking on a simple RDF/RSS parser using libxml2. This is a result of not being able to find a suitable tool that did what I wanted. Originally I was going to hack this into mutt directly, but it seems much more logical to roll this into some sort of fetchmail-like tool for doing the initial fetching/parsing/sorting. Then I can dump this stuff straight into mbox format and pick it up through mutt that way (since I can also add the mailboxes directly and get notification on updates, while keeping the fetching tool running in the background). I suppose another alternative would be to hack this into fetchmail directly, but the tools are different enough that this probably isn't an overly useful approach (not to mention the blatant disregard for sane headers).
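The "dump it straight into mbox" step is simple enough to sketch without the libxml2 side. What follows is only an illustration of the idea, not the tool described above: struct feed_item and format_mbox_entry() are hypothetical names, the parsed feed entry is assumed to already be available as plain strings, and body lines beginning with "From " are not escaped, which a real tool would need to handle.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical container for one already-parsed feed entry. */
struct feed_item {
	const char *feed;   /* used as the envelope sender and From: */
	const char *title;  /* becomes Subject: */
	const char *body;   /* message body, no "From " escaping done here */
};

/*
 * Write one mbox entry into buf: a "From sender date" separator line,
 * minimal headers, a blank line, the body, and a trailing blank line.
 * Returns what snprintf() returns, so truncation can be detected.
 */
static int format_mbox_entry(char *buf, size_t len,
			     const struct feed_item *it, time_t when)
{
	char stamp[32];
	struct tm *tm = gmtime(&when);

	assert(tm != NULL);
	assert(strchr(it->title, '\n') == NULL);  /* headers stay one line */

	/* Same layout asctime() uses, minus the trailing newline. */
	strftime(stamp, sizeof(stamp), "%a %b %e %H:%M:%S %Y", tm);

	return snprintf(buf, len,
			"From %s %s\nFrom: %s\nSubject: %s\n\n%s\n\n",
			it->feed, stamp, it->feed, it->title, it->body);
}
```

Appending the result to a file that mutt already knows as a mailbox is then all the fetching tool has to do on each run.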
