From sacadmin Mon Mar 1 12:57:58 2004
Date: Mon, 1 Mar 2004 15:52:48 -0500 (EST)
From: James Carlson
To: psarc@sac.sfbay.sun.com
cc: Kanoj.Sarcar@Sun.COM
Subject: 2004/181 GLD MDT Support For InfiniBand
Content-Length: 5933

I am sponsoring this fast-track request for Kanoj Sarcar.  The timer is
set to 03/08/2004.

BACKGROUND

"TCP Multi-Data Transmit" (PSARC 2002/276) defined the multidata
interfaces (henceforth referred to as MDT), which allow IP/TCP to send
down multiple network packets in a single STREAMS message, leading to
IP stack throughput and CPU utilization improvements.  Currently,
Cassini is the only network driver that uses these interfaces.

"GLD Hardware Checksum" (PSARC 2003/399) added support to GLD for
handling DL_CAPABILITY_[REQ/ACK] messages, themselves defined in
"DL_{CAPABILITY,CONTROL}_REQ" (PSARC 2001/070).  This is the mechanism
used by IP to determine whether a DLPI driver supports MDT.

PROBLEM

GLD version 2 (originally defined in PSARC 1997/382) has not yet been
enhanced to handle multidata messages.  "IP over Infiniband" (PSARC
2001/289) is a high-bandwidth medium that uses GLD and will realize
noticeable improvements if GLD supports MDT.

SOLUTION

A 'patch/micro' release binding is asserted for the following
Contracted Project Private interfaces, with both provider and consumer
delivered via the ON consolidation.  The contract is provided in the
case directory as contract-01.txt.

In , three new driver entry points are defined:

    gldm_mdt_pre   (part of the gld_mac_info(9S) structure)
    gldm_mdt_send  (part of the gld_mac_info(9S) structure)
    gldm_mdt_post  (part of the gld_mac_info(9S) structure)

These are exported by the GLD and imported by drivers.

(Note: binary compatibility with older GLD drivers is maintained
despite the additional fields in the gld_mac_info(9S) structure because
GLD is responsible for allocation and freeing of such structures.
Drivers merely fill in the structure members that they are aware of.)
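
The compatibility argument in the note above can be illustrated with a
small user-space sketch.  Everything here is a hypothetical stand-in:
mock_mac_info_t is not the real gld_mac_info(9S) layout, and the sketch
assumes that the framework hands out structures with every member
cleared, so that members an older driver never touches read as NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for gld_mac_info(9S).  The real structure has
 * many more members and a different layout.
 */
typedef struct mock_mac_info {
	/* A member that older drivers already know about. */
	int	(*mm_send)(struct mock_mac_info *, void *);
	/* New, optional members (assumed here to default to NULL). */
	int	(*mm_mdt_pre)(struct mock_mac_info *, void *, void **);
	void	(*mm_mdt_send)(struct mock_mac_info *, void *, void *);
	void	(*mm_mdt_post)(struct mock_mac_info *, void *, void *);
} mock_mac_info_t;

/* Framework-owned allocation: every member starts out cleared. */
static mock_mac_info_t *
mock_mac_alloc(void)
{
	mock_mac_info_t *m = malloc(sizeof (*m));

	if (m != NULL)
		*m = (mock_mac_info_t){ 0 };
	return (m);
}

/* Framework-owned release, so drivers never depend on the size. */
static void
mock_mac_free(mock_mac_info_t *m)
{
	free(m);
}

/* Placeholder transmit routine for the older driver. */
static int
old_send(mock_mac_info_t *m, void *mp)
{
	(void) m;
	(void) mp;
	return (0);
}

/* A driver built before the new members existed sets only what it knows. */
static void
old_driver_fill(mock_mac_info_t *m)
{
	m->mm_send = old_send;
}

/* The framework treats unset (NULL) entry points as "no MDT support". */
static int
mock_has_mdt(const mock_mac_info_t *m)
{
	return (m->mm_mdt_pre != NULL && m->mm_mdt_send != NULL &&
	    m->mm_mdt_post != NULL);
}
```

Because the driver never allocates or sizes the structure itself, a
driver that calls only old_driver_fill() keeps working, and
mock_has_mdt() reports 0 for it.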

DETAILS

A driver wishing to make use of these interfaces must fill in the above
three entry points prior to invoking gld_register(9F).  GLD is free to
return DDI_FAILURE from gld_register(9F) if it cannot support MDT for
the underlying medium.  gld_register(9F) will fail if these entry
points (i.e., pointer values) are set and the medium specified is
anything other than InfiniBand.  The owner of the GLD interface does
not believe that this is necessarily the right direction to take for
future work, so it is being limited to just this one
performance-related case.

The prototypes for these MAC driver entry points are:

    int _mdt_pre(gld_mac_info_t *macinfo, mblk_t *mp, void **cookie);
    void _mdt_send(gld_mac_info_t *macinfo, void *cookie, pdescinfo_t *dl_pkt_info);
    void _mdt_post(gld_mac_info_t *macinfo, mblk_t *mp, void *cookie);

Each MDT message that comes to GLD comprises multiple IP packets
contained in two contiguous regions, one for the headers and one for
the payload, and is represented by a single STREAMS mblk.  The IP layer
can hand a set of such MDT messages to GLD, represented by a chain of
mblks linked via the mblk's b_cont field.  GLD will walk the chain and
hand each individual MDT message to the MAC driver's gldm_mdt_*()
functions.  This allows the driver to freemsg(9F) each MDT message on
successful transmission even though they were chained by the IP stack.

GLD will first invoke the gldm_mdt_pre() entry point, which tells the
driver that a sequence of packets is being transmitted.  At this point,
using interfaces described in MDT, the driver can determine the number
of packets in the message (possibly to allocate the required number of
hardware transmit descriptors), the location of the header and payload
areas (possibly to register the areas for DMA), the destination
address, etc., and must allocate all required resources.
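
The chain walk described above (a set of MDT messages linked through
b_cont, handed over one at a time) might look roughly like the
following user-space sketch.  mock_mblk_t, mock_walk_chain() and
mock_record() are hypothetical names, and unlinking each message before
the hand-off is an assumption made here so that each message could be
freed on its own; the case text does not spell out how GLD does this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for mblk_t; only the chaining field is modelled. */
typedef struct mock_mblk {
	struct mock_mblk *b_cont;	/* next MDT message in the chain */
	int id;				/* illustrative payload */
} mock_mblk_t;

typedef void (*mock_handler_t)(mock_mblk_t *mp, void *arg);

/*
 * Walk a b_cont chain and hand over one message at a time.  Each message
 * is unlinked first (an assumption of this sketch), so the handler sees a
 * standalone message.  Returns the number of messages handed over.
 */
static int
mock_walk_chain(mock_mblk_t *mp, mock_handler_t handle, void *arg)
{
	int n = 0;

	while (mp != NULL) {
		mock_mblk_t *next = mp->b_cont;

		mp->b_cont = NULL;
		handle(mp, arg);
		mp = next;
		n++;
	}
	return (n);
}

/* Example handler: checks the message is standalone and sums the ids. */
static void
mock_record(mock_mblk_t *mp, void *arg)
{
	int *sum = arg;

	assert(mp->b_cont == NULL);
	*sum += mp->id;
}
```

With a chain of three messages, mock_walk_chain() calls the handler
three times, and each call sees a message whose b_cont is NULL.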

If required, the driver's gldm_mdt_pre() can allocate (using
kmem_alloc(9F) or otherwise) a data structure to maintain state for
this MDT message, and pass a handle to this data back to GLD via the
"cookie" argument.

In the second stage, GLD will invoke the gldm_mdt_send() entry point
for each of the packets in the MDT message, passing in the cookie so
that the MAC driver can track whatever information it needs.  The
"dl_pkt_info" argument can be parsed using MDT interfaces, and
indicates the payload and header (including link layer) location for
the specific packet, which the driver should use to program its
transmit descriptor.

Finally, in the last stage, GLD will invoke the gldm_mdt_post() entry
point, at which point the driver can free resources (such as the
allocated cookie).  Additionally, the driver can issue the PIO to kick
the hardware to transmit all the packets (or alternatively, the driver
can issue a PIO per packet, possibly in the gldm_mdt_send() entry
point).

The return value of the gldm_mdt_pre() entry point tells GLD how the
MAC driver intends to handle the set of packets described by the MDT
message.  The return value can be one of:

    a. -1: the MAC driver is indicating that there is something wrong
       with the MDT message and refuses to transmit it.  In this case,
       the driver should not do a freemsg(9F) on the input "mp",
       leaving that for GLD (similar to GLD_FAILURE out of
       gldm_send(9E)).  GLD will invoke neither gldm_mdt_send() nor
       gldm_mdt_post() for this mblk.

    b. The number of packets (in the MDT message) that the driver can
       transmit currently.  In this case, GLD will invoke
       gldm_mdt_send() an appropriate number of times to transmit the
       first few (or all) packets in the MDT message, followed by
       gldm_mdt_post() (in the special case of 0, no gldm_mdt_send()
       or gldm_mdt_post() will be invoked).
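
The three stages and the gldm_mdt_pre() return values can be sketched
as a small user-space simulation.  This is only an illustration under
stated assumptions: mock_drv_t, the mock_mdt_*() functions and
mock_mdt_tx() are hypothetical names, the packet count is passed in as
a plain integer rather than parsed from an mblk with the MDT
interfaces, and "resources" are reduced to a counter of free transmit
descriptors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical driver state; not a real GLD or driver structure. */
typedef struct mock_drv {
	int free_descs;		/* transmit descriptors currently free */
	int programmed;		/* descriptors filled in by the send stage */
	int kicks;		/* times the post stage kicked the hardware */
} mock_drv_t;

/* Stage 1: -1 rejects the message, else how many packets fit right now. */
static int
mock_mdt_pre(mock_drv_t *d, int npkts, void **cookie)
{
	if (npkts <= 0)
		return (-1);
	*cookie = d;		/* per-message state handed back to the caller */
	return (npkts < d->free_descs ? npkts : d->free_descs);
}

/* Stage 2: called once per packet to program one descriptor. */
static void
mock_mdt_send(mock_drv_t *d, void *cookie, int pkt_index)
{
	(void) cookie;
	(void) pkt_index;
	d->free_descs--;
	d->programmed++;
}

/* Stage 3: release per-message state and kick the hardware once. */
static void
mock_mdt_post(mock_drv_t *d, void *cookie)
{
	(void) cookie;
	d->kicks++;
}

/*
 * Framework side of the exchange.  Returns -1 if the driver rejected the
 * message, otherwise the number of packets still waiting to be sent.
 */
static int
mock_mdt_tx(mock_drv_t *d, int npkts)
{
	void *cookie = NULL;
	int i, n = mock_mdt_pre(d, npkts, &cookie);

	if (n == -1)
		return (-1);	/* caller keeps ownership of the message */
	if (n == 0)
		return (npkts);	/* special case: no send, no post */
	for (i = 0; i < n; i++)
		mock_mdt_send(d, cookie, i);
	mock_mdt_post(d, cookie);
	return (npkts - n);
}
```

For example, with three free descriptors and a five-packet message,
mock_mdt_tx() programs three descriptors, kicks the hardware once, and
reports two packets left over for a later retry.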

If the driver indicates it cannot process all packets, it must
subsequently invoke gld_sched(9F) when resources are freed up, as an
indication that GLD should retry gldm_mdt_pre() for the remainder of
the packets.

From sacadmin Wed Mar 3 10:53:44 2004
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Date: Wed, 3 Mar 2004 13:48:36 -0500
From: James Carlson
To: psarc@sac.sfbay.sun.com
cc: Kanoj.Sarcar@Sun.COM
Subject: 2004/181 GLD MDT Support For InfiniBand
Content-Length: 345

This fast-track request was approved during ARC business at today's
PSARC meeting.  The contract signatures will follow.

--
James Carlson, IP Systems Group
Sun Microsystems / 1 Network Drive         71.234W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.497N   Fax +1 781 442 1677

From sacadmin Wed Mar 3 11:01:37 2004
Date: Wed, 03 Mar 2004 10:56:29 -0800
From: Darrin Johnson
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: psarc@sac.sfbay.sun.com
CC: john.a.wright@sun.com, Kanoj.Sarcar@sun.com
Subject: Re: 2004/181 GLD MDT Support For InfiniBand
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Length: 306

I agree to the terms of the contract.

Darrin

James Carlson wrote:

> The contract for these interfaces needs your signature.  The contract
> is here:
>
>   /shared/sac/PSARC/2004/181/contract-01.txt
>
> To sign, just reply to this message and indicate that you agree to the
> terms of the contract.
>

From sacadmin Wed Mar 3 14:50:29 2004
Date: Wed, 03 Mar 2004 14:45:02 -0800
From: John A Wright
Subject: Re: 2004/181 GLD MDT Support For InfiniBand
To: Darrin Johnson
Cc: psarc@sac.sfbay.sun.com, Kanoj Sarcar
MIME-version: 1.0
Content-type: text/plain
Content-transfer-encoding: 7bit
Content-Length: 802

I agree to the terms of the contract as well.

Thanks,
John Wright

On Wed, 2004-03-03 at 10:56, Darrin Johnson wrote:
> I agree to the terms of the contract.
>
> Darrin
>
> James Carlson wrote:
>
> > The contract for these interfaces needs your signature.  The contract
> > is here:
> >
> >   /shared/sac/PSARC/2004/181/contract-01.txt
> >
> > To sign, just reply to this message and indicate that you agree to the
> > terms of the contract.
> >
--
+------------------------------------------------------------------+
| John A Wright         John.A.Wright@sun.com         650.786.4800 |
| Engineering Manager   HEES Software West                  x84800 |
| Sun Microsystems      14 Network Circle, Menlo Park CA 94025     |
+------------------------------------------------------------------+